[referencer] referencer 1.1-pre

Aurélien Naldi aurelien.naldi at gmail.com
Mon Jan 14 09:15:26 EST 2008


On lun, 2008-01-14 at 06:09 -0500, jcspray at icculus.org wrote:
> Quoting Aurélien Naldi <aurelien.naldi at gmail.com>:
> >> There's a newline in the middle of the DOI in the paper.  The DOI regex is
> >> picking up another DOI later on which has the 'junk' on it.  Regexing out
> >> DOIs is always going to be a bit hit and miss.
> >
> > Yes, this is (and will remain) a tricky thing, but the first DOI appears
> > nicely in the output of pdftotext. This is not a regression anyway. How
> > does referencer extract the text ? Is it using some other external tool
> > or doing the job by itself ?
> 
> libpoppler
> 
> > About the doi detection, one thing freaks me out (even if I have not
> > seen it happen yet): a pdf could contain the doi of some other document
> > as a way to quote it. Did referencer already pick the wrong doi in such
> > case for someone ?
> 
> Of course.

How annoying...
I have one more problem with doi guessing: quite a lot of the papers I
have here carry some "metadata" in a small column on the left of the
first page, including:

"This article's doi:
<the doi is here>"

Referencer does not catch it, as it only expects spaces between "doi:"
and the doi itself. It does work if I modify the regex to allow newlines
as well (replacing ":? *" with ":?[ \n]*" as a quick test). Can you
include this, or is there some other problem I did not think of?
For now I'm happy: the doi detection works for 95% of the papers I have
here (except the 50+ that do not contain a doi :/ )
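
To make the change concrete, here is a minimal Python sketch of the two
separator variants. Only the fragments ":? *" and ":?[ \n]*" come from the
message above; the rest of the pattern (DOI_BODY), the helper find_doi and
the sample text are assumptions made up for illustration, and Referencer's
real regex may well differ.

```python
import re

# Hypothetical DOI body: "10.", a registrant code, "/", then non-space
# characters. Illustrative assumption, not Referencer's actual pattern.
DOI_BODY = r"(10\.\d{4,9}/\S+)"

# Separator as described in the message: only spaces after "doi:".
SPACES_ONLY = re.compile(r"doi:? *" + DOI_BODY, re.IGNORECASE)

# Quick-test variant from the message: spaces or newlines after "doi:".
SPACES_OR_NEWLINES = re.compile(r"doi:?[ \n]*" + DOI_BODY, re.IGNORECASE)


def find_doi(pattern, text):
    """Return the first DOI matched by pattern in text, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Made-up sample text imitating the sidebar layout; the DOI is a placeholder.
sidebar = "This article's doi:\n10.1234/example.5678\nReceived: ..."
inline = "doi: 10.1234/example.5678"

print(find_doi(SPACES_ONLY, sidebar))         # no match: newline blocks it
print(find_doi(SPACES_OR_NEWLINES, sidebar))  # matches across the newline
print(find_doi(SPACES_ONLY, inline))          # the inline form still matches
```

If the extracted text can also contain tabs or carriage returns, "\s*"
would cover those too, at the cost of being a bit more permissive.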

PS: current svn gives me a crash at startup, which was not there a few
hours ago.

-- 
Aurelien Naldi



