[referencer] getting metadata from pubmed and some comments

John Spray jcspray at icculus.org
Tue Nov 6 17:10:04 EST 2007


On Tue, 2007-11-06 at 21:35 +0100, Aurélien Naldi wrote:
> I think I can help with this...
> Does referencer have (or plan to have) a plugin system or something to
> add metadata fetchers easily?

The interface for metadata-fetching code isn't well defined at present
but it shouldn't be too difficult to do, and I would appreciate
assistance with it.  Here are some quick thoughts on the possibilities.

In the current situation, there are the following methods:

void BibData::guessDoi (Glib::ustring const &raw_)
void BibData::guessArxiv (Glib::ustring const &raw_)
void BibData::getCrossRef ()
void BibData::getArxiv ()

The guess methods scan the raw text with regexes for strings that look
like identifiers.  The get methods use the Transfer class to fetch the
necessary URLs and then use BibData::parseCrossRefXML and
BibUtils::parseBibUtils respectively to populate the BibData with the
downloaded metadata.

The guess functions are called from Document::readPDF when a PDF is
first added: this loads the raw text of a pdf and processes it
page-by-page (to avoid loading the whole thing when the identifier is on
the first page).
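
As a rough illustration of the page-by-page guessing described above
(not the actual referencer code), here is a minimal sketch.  It assumes
std::regex rather than whatever regex library referencer links against;
the DOI pattern, findDoiLike, guessDoiByPage and the loadPage callback
are all hypothetical names invented for the example.

```cpp
#include <functional>
#include <regex>
#include <string>

// Hypothetical DOI-like pattern: "10.", a 4-9 digit registrant code, "/",
// then everything up to the next whitespace. Real DOIs are looser than
// this, and trailing punctuation would be swept into the match.
static const std::regex kDoiLike(R"(10\.[0-9]{4,9}/\S+)");

// Return the first DOI-like substring in `text`, or "" if there is none.
std::string findDoiLike(const std::string &text) {
    std::smatch match;
    if (std::regex_search(text, match, kDoiLike)) {
        return match.str();
    }
    return std::string();
}

// Ask for one page at a time and stop at the first hit, so later pages
// are never loaded when the identifier is near the front.
// `loadPage` is an assumed callback that returns the raw text of page i.
std::string guessDoiByPage(int pageCount,
                           const std::function<std::string(int)> &loadPage) {
    for (int i = 0; i < pageCount; ++i) {
        std::string doi = findDoiLike(loadPage(i));
        if (!doi.empty()) {
            return doi;
        }
    }
    return std::string();
}
```

Passing a callback instead of a vector of pages is what keeps the early
exit worthwhile: text extraction only happens for the pages actually
visited.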

The get functions are called in Document::getMetaData, and the document
determines which kinds of metadata it can get in
Document::canGetMetaData.

Both of the existing mechanisms (arxiv and crossref) could be
implemented as descendants of a MetadataFetcher abstract class with
guess() and get() methods.  However, for efficiency the guess() methods
should probably be combined into a global guessing function which uses
regexes provided by the MetadataFetcher implementations.
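
To make that concrete, here is one possible shape for it.  This is only
a sketch under assumptions: instead of a per-fetcher guess(), each
fetcher exposes a pattern() that a shared guessAll() function applies.
Metadata, name(), pattern(), guessAll and both regexes are hypothetical,
and the get() bodies are placeholders where the real code would go
through Transfer and the existing parsers.

```cpp
#include <map>
#include <memory>
#include <regex>
#include <string>
#include <vector>

// Stand-in for the fields a fetcher would fill in; not referencer's BibData.
struct Metadata {
    std::string source;
};

class MetadataFetcher {
public:
    virtual ~MetadataFetcher() {}
    virtual std::string name() const = 0;
    // Regex the shared scanner searches for on this fetcher's behalf.
    // Convention assumed here: capture group 1 holds the identifier.
    virtual const std::regex &pattern() const = 0;
    // Placeholder for the download-and-parse step.
    virtual bool get(const std::string &id, Metadata &out) const = 0;
};

class CrossRefFetcher : public MetadataFetcher {
public:
    CrossRefFetcher() : pattern_(R"((10\.[0-9]{4,9}/\S+))") {}
    std::string name() const override { return "crossref"; }
    const std::regex &pattern() const override { return pattern_; }
    bool get(const std::string &id, Metadata &out) const override {
        out.source = name() + ":" + id;  // no network access in this sketch
        return true;
    }
private:
    std::regex pattern_;
};

class ArxivFetcher : public MetadataFetcher {
public:
    ArxivFetcher() : pattern_(R"(arXiv:([0-9]{4}\.[0-9]{4,5}))") {}
    std::string name() const override { return "arxiv"; }
    const std::regex &pattern() const override { return pattern_; }
    bool get(const std::string &id, Metadata &out) const override {
        out.source = name() + ":" + id;  // no network access in this sketch
        return true;
    }
private:
    std::regex pattern_;
};

// Global guessing function: the text is loaded once and every fetcher's
// regex is tried against it. Returns fetcher name -> identifier found.
std::map<std::string, std::string> guessAll(
    const std::string &raw,
    const std::vector<std::unique_ptr<MetadataFetcher>> &fetchers) {
    std::map<std::string, std::string> found;
    for (const auto &fetcher : fetchers) {
        std::smatch match;
        if (std::regex_search(raw, match, fetcher->pattern())) {
            found[fetcher->name()] = match[1].str();
        }
    }
    return found;
}
```

The saving in this version comes from extracting the text once per page
rather than once per fetcher; each regex is still searched separately.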

To manage N MetadataFetchers we would need at least

      * Priority information: which are our favourite fetchers?  Would
        probably put things like pubmed and arxiv at the top and use
        general DOI stuff like CrossRef as a backup.  Preferably provide
        UI for setting this.
      * Enable/disable information, and associated UI so that plugins
        which are broken for a given user or just spurious and wasting
        time can be disabled.
      * Hooks for the fetchers to provide preferences UI.
      * Some mechanism for the fetchers to share identifiers: more than
        one is going to understand DOIs.  So perhaps this leads to an
        IdentifierScanner class (for the guess methods) and a separate
        MetadataFetcher class (for the get methods).  We would then need
        a general way of specifying which fetchers can deal with which
        identifiers.
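
The bookkeeping in the list above could be as small as a table of
entries plus one lookup.  The following is a sketch only: FetcherEntry,
its fields, fetchersFor, and the identifier kind strings ("doi",
"arxiv") are all made up for illustration, and the preferences UI hooks
are left out.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// One row per fetcher: the priority and enabled flag would be driven by
// UI, and identifierKinds says which scanner results the fetcher can use.
struct FetcherEntry {
    std::string name;
    int priority;                           // lower value is tried first
    bool enabled;
    std::set<std::string> identifierKinds;  // e.g. "doi", "arxiv"
};

// Names of the enabled fetchers that understand `kind`, best first.
std::vector<std::string> fetchersFor(const std::string &kind,
                                     std::vector<FetcherEntry> entries) {
    // Drop fetchers that are disabled or cannot handle this identifier.
    entries.erase(
        std::remove_if(entries.begin(), entries.end(),
                       [&](const FetcherEntry &e) {
                           return !e.enabled ||
                                  e.identifierKinds.count(kind) == 0;
                       }),
        entries.end());
    // Stable sort keeps registration order among equal priorities.
    std::stable_sort(entries.begin(), entries.end(),
                     [](const FetcherEntry &a, const FetcherEntry &b) {
                         return a.priority < b.priority;
                     });
    std::vector<std::string> names;
    for (const auto &e : entries) {
        names.push_back(e.name);
    }
    return names;
}
```

With this split, an identifier scanner only has to report a kind and a
value, and Document::getMetaData could walk the returned names in order
until one fetcher succeeds.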

Once the interface is well defined, a plugin system becomes possible.
But I think the first step is definitely to refine the interface within
the existing monolithic C++.

If you start hacking on this then feel free to grab me on google talk
for questions (jcspray attt gmail.com).

> I'm not sure about the "initial" field, can't it be deduced from the
> "given name" one ?

I guess one usually either has the initials or the first names (not
both), so it's not much of an issue.

> One annoying thing with having separated fields, is about copy/pasting
> the whole list of authors. Maybe keeping a large field can be convenient
> for this use case ?

Or having in general an entry for adding authors (type "John Spray"
without changing fields) which then appear in a list automatically
parsed, such that the user can tweak them to his liking.  The challenge
is doing it in a way that takes a minimal amount of space on the screen.
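
The parsing half of that is cheap to prototype.  Here is a naive
sketch, assuming only two input shapes ("Given Family" and "Family,
Given"); Author, trim and parseAuthor are hypothetical names, and
multi-word surnames such as "van Gogh" would be split wrongly, which is
exactly where the user-tweakable list comes in.

```cpp
#include <string>

struct Author {
    std::string given;
    std::string family;
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string &s) {
    const char *ws = " \t\r\n";
    std::string::size_type begin = s.find_first_not_of(ws);
    if (begin == std::string::npos) {
        return std::string();
    }
    std::string::size_type end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}

// Split a typed name into given and family parts.
Author parseAuthor(const std::string &typed) {
    std::string s = trim(typed);
    Author a;
    std::string::size_type comma = s.find(',');
    if (comma != std::string::npos) {
        // "Family, Given"
        a.family = trim(s.substr(0, comma));
        a.given = trim(s.substr(comma + 1));
        return a;
    }
    std::string::size_type space = s.find_last_of(" \t");
    if (space == std::string::npos) {
        // A single word is treated as the family name.
        a.family = s;
        return a;
    }
    // "Given Family": the last word is taken as the family name.
    a.given = trim(s.substr(0, space));
    a.family = s.substr(space + 1);
    return a;
}
```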

Cheers,
John



