[referencer] getting metadata from pubmed and some comments

Wed Nov 7 03:39:28 EST 2007

On Tuesday 6 November 2007 at 22:10 +0000, John Spray wrote:
> The interface for metadata-fetching code isn't well defined at present
> but it shouldn't be too difficult to do, and I would appreciate
> assistance with it.  Here are some quick thoughts on the possibilities.
> 
> In the current situation, there are the following methods:
> 
> void BibData::guessDoi (Glib::ustring const &raw_)
> void BibData::guessArxiv (Glib::ustring const &raw_)
> void BibData::getCrossRef ()
> void BibData::getArxiv ()
> 
> The guess methods scan the raw text for regexes that look like
> identifiers.  The get methods use the Transfer class to get the
> necessary URLs and then use BibData::parseCrossRefXML and
> BibUtils::parseBibUtils respectively to populate the BibData
> with the downloaded metadata.
> 
> The guess functions are called from Document::readPDF when a PDF is
> first added: this loads the raw text of a pdf and processes it
> page-by-page (to avoid loading the whole thing when the identifier is on
> the first page).
> 
> The get functions are called in Document::getMetaData, and the document
> determines which kinds of metadata it can get in
> Document::canGetMetaData.
> 
> Both of the existing mechanisms (arxiv and crossref) could be
> implemented as descendants of a MetadataFetcher abstract class with
> guess() and get() methods.  However, for efficiency the guess() methods
> should probably be combined into a global guessing function which uses
> regexes provided by the MetadataFetcher implementations.
> 
> To manage N MetadataFetchers we would need at least
> 
>       * Priority information: which are our favourite fetchers?  Would
>         probably put things like pubmed and arxiv at the top and use
>         general DOI stuff like crossref as a backup.  Preferably provide
>         UI for setting this.
>       * Enable/disable information, and associated UI so that plugins
>         which are broken for a given user or just spurious and wasting
>         time can be disabled.

These two things can easily go together in the pref UI: a list with
"up" and "down" buttons and a checkbox on each row. Similar UIs exist in
other GNOME applications.
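
A minimal sketch of the data behind such a list, assuming the row order
is the priority order and the checkbox is a plain boolean. FetcherEntry,
moveUp, moveDown and activeFetchers are hypothetical names, not existing
referencer code:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row of the preference list: one fetcher, one checkbox.
struct FetcherEntry {
    std::string name;
    bool enabled;
};

// The vector order is the priority order: index 0 is tried first.
typedef std::vector<FetcherEntry> FetcherList;

// "Up" button: swap the row with the one above it. Returns false if
// the row is already first or the index is out of range.
bool moveUp(FetcherList &list, std::size_t row) {
    if (row == 0 || row >= list.size()) return false;
    std::swap(list[row], list[row - 1]);
    return true;
}

// "Down" button: swap the row with the one below it. Returns false if
// the row is already last or the index is out of range.
bool moveDown(FetcherList &list, std::size_t row) {
    if (row + 1 >= list.size()) return false;
    std::swap(list[row], list[row + 1]);
    return true;
}

// Names of the checked fetchers, in the order they should be tried.
std::vector<std::string> activeFetchers(const FetcherList &list) {
    std::vector<std::string> names;
    for (const FetcherEntry &entry : list)
        if (entry.enabled) names.push_back(entry.name);
    return names;
}
```

With this layout, both the priority and the enable/disable state can be
saved as a single ordered list.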

>       * Hooks for the fetchers to provide preferences UI.

Yes, having it in the UI would be nice, but I have no idea what this
should look like (a separate dialog, or shown inside the main pref
dialog? would each fetcher have to define its own UI, or could it be
somehow generic?). In the meantime, it may be easier to start with
gconf-only prefs. A "metadata" subdirectory with keys prefixed by the
name of the metadata fetcher looks sane to me.
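
To make the naming concrete, here is a small sketch of how such keys
could be built. The root directory "/apps/referencer" and the "_"
separator are assumptions, and metadataKey is a hypothetical helper; no
real gconf call is shown:

```cpp
#include <string>

// Build the key for one setting of one fetcher. Only the "metadata"
// subdirectory and the fetcher-name prefix come from the proposal;
// the default root and the "_" separator are assumptions.
std::string metadataKey(const std::string &fetcher,
                        const std::string &setting,
                        const std::string &root = "/apps/referencer") {
    return root + "/metadata/" + fetcher + "_" + setting;
}
```

For example, metadataKey("pubmed", "enabled") would give
"/apps/referencer/metadata/pubmed_enabled".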

>       * Some mechanism for the fetchers to share identifiers: more than
>         one is going to understand DOIs.  So perhaps this leads to a
>         IdentifierScanner class (for the guess methods) and a separate
>         MetadataFetcher class (for the get methods).  Then need a
>         general way of specifying which fetchers can deal with which
>         identifiers.

You mentioned a Document::canGetMetaData function. Maybe it should be
the other way around: always call the enabled fetchers on the document,
and let each fetcher decide whether it can do something or not,
depending on what was found by the "getWhatever" functions?
Also, I am all for the combined ID guesser, as many (?) fetchers may
reuse the same stuff. The ID-guessing regexes could be shared through
another object giving the regex for a given type of ID. It should also
give the web link for a given value of the ID. Then all fetchers just
need to say which IDs they need, and the common ID guesser can use only
the regexes required by the active fetchers. I'm not sure whether a
fetcher should be able to rely on several IDs.
Oh, this could even be used to know which IDs a fetcher can use, and so
to call only the relevant fetchers (voiding my previous statement).
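
A rough sketch of this idea, under several assumptions: std::string and
std::regex stand in for Glib::ustring and whatever regex code referencer
actually uses, and IdentifierType, FetcherInfo, neededIds, guessIds,
webLink and fetchersFor are all hypothetical names. Each pattern is
assumed to capture the ID value in group 1:

```cpp
#include <map>
#include <regex>
#include <set>
#include <string>
#include <vector>

// Shared description of one kind of ID (hypothetical).
struct IdentifierType {
    std::string name;       // e.g. "doi"
    std::string pattern;    // ECMAScript regex, group 1 is the ID value
    std::string urlPrefix;  // web link is urlPrefix + value
};

// What a fetcher declares about itself (hypothetical).
struct FetcherInfo {
    std::string name;
    bool enabled;
    std::vector<std::string> wantedIds;
};

// IDs required by at least one enabled fetcher.
std::set<std::string> neededIds(const std::vector<FetcherInfo> &fetchers) {
    std::set<std::string> needed;
    for (const FetcherInfo &f : fetchers)
        if (f.enabled)
            needed.insert(f.wantedIds.begin(), f.wantedIds.end());
    return needed;
}

// Combined guesser: scan the raw text once per needed ID type only.
std::map<std::string, std::string> guessIds(
        const std::string &raw,
        const std::vector<IdentifierType> &types,
        const std::set<std::string> &needed) {
    std::map<std::string, std::string> found;
    for (const IdentifierType &t : types) {
        if (needed.count(t.name) == 0) continue;
        std::smatch m;
        if (std::regex_search(raw, m, std::regex(t.pattern)) && m.size() > 1)
            found[t.name] = m[1].str();
    }
    return found;
}

// Web link for a given value of an ID.
std::string webLink(const IdentifierType &t, const std::string &value) {
    return t.urlPrefix + value;
}

// Enabled fetchers that want at least one of the IDs that were found.
std::vector<std::string> fetchersFor(
        const std::vector<FetcherInfo> &fetchers,
        const std::map<std::string, std::string> &found) {
    std::vector<std::string> names;
    for (const FetcherInfo &f : fetchers) {
        if (!f.enabled) continue;
        for (const std::string &id : f.wantedIds)
            if (found.count(id) != 0) { names.push_back(f.name); break; }
    }
    return names;
}
```

A "doi" type could, for instance, use a pattern like
(10\.[0-9]{4,}/\S+) with the prefix http://dx.doi.org/, but that pattern
is only an illustration and not the regex referencer uses today. A
fetcher relying on several IDs simply lists several names in wantedIds.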

Related to this, I can only see a "doi" field. Where is the arxiv ID
stored? It may be nice to have a set of IDs associated with the
document, each with a name and a value.
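
Something like the following could hold them. This is only a sketch:
DocumentIds is a hypothetical class, and std::string stands in for
Glib::ustring:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical per-document store: ID name ("doi", "arxiv", ...) to value.
class DocumentIds {
public:
    // Add an ID, or overwrite the value if the name is already present.
    void set(const std::string &name, const std::string &value) {
        ids_[name] = value;
    }

    bool has(const std::string &name) const {
        return ids_.count(name) != 0;
    }

    // Returns an empty string when the document has no such ID.
    std::string get(const std::string &name) const {
        std::map<std::string, std::string>::const_iterator it = ids_.find(name);
        return it == ids_.end() ? std::string() : it->second;
    }

    std::size_t size() const { return ids_.size(); }

private:
    std::map<std::string, std::string> ids_;
};
```

The current "doi" field would then become just one entry among others,
and a new kind of ID would not need a new member in the document.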

> 
> Once the interface is well defined, a plugin system becomes possible.
> But I think the first step is definitely to refine the interface within
> the existing monolithic C++.
> 
> If you start hacking on this then feel free to grab me on google talk
> for questions (jcspray attt gmail.com).
> 
> > I'm not sure about the "initial" field, can't it be deduced from the
> > "given name" one ?
> 
> I guess one usually either has the initials or the first names (not
> both), so it's not much of an issue.
> 
> > One annoying thing with having separated fields, is about copy/pasting
> > the whole list of authors. Maybe keeping a large field can be convenient
> > for this use case ?
> 
> Or having in general an entry for adding authors (type "John Spray"
> without changing fields) which then appear in a list automatically
> parsed, such that the user can tweak them to his liking.  The challenge
> is doing it in a way that takes a minimal amount of space on the screen.

If you want to gain some space, then showing only one large field is
better, but it has to be larger: some papers have MANY authors, and
editing them in a small text field is painful.
Showing the large field by default with a "detail" expander to reveal
the list of small ones may be a good compromise (the "real" data being
stored in a clean list and the large field being just an automatic
aggregation of it).
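
A minimal sketch of that round trip, with many assumptions: authors are
separated by " and " (as in BibTeX), the last word of each name is taken
as the family name, and the text is plain ASCII. Author, parseOne,
parseAuthors, aggregate and initials are hypothetical names:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One entry of the clean list (the "real" data).
struct Author {
    std::string given;
    std::string family;
};

// Large field, built automatically from the list.
std::string aggregate(const std::vector<Author> &authors) {
    std::string out;
    for (std::size_t i = 0; i < authors.size(); ++i) {
        if (i != 0) out += " and ";
        if (!authors[i].given.empty()) out += authors[i].given + " ";
        out += authors[i].family;
    }
    return out;
}

// Naive split of one name: the last word is the family name.
Author parseOne(const std::string &text) {
    const std::size_t b = text.find_first_not_of(" \t");
    if (b == std::string::npos) return {"", ""};
    const std::size_t e = text.find_last_not_of(" \t");
    const std::string s = text.substr(b, e - b + 1);
    const std::size_t sp = s.find_last_of(' ');
    if (sp == std::string::npos) return {"", s};
    return {s.substr(0, sp), s.substr(sp + 1)};
}

// Pasted large field back into the list; empty pieces are skipped.
std::vector<Author> parseAuthors(const std::string &field) {
    const std::string sep = " and ";
    std::vector<Author> out;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = field.find(sep, start);
        const std::size_t len =
            pos == std::string::npos ? std::string::npos : pos - start;
        const Author a = parseOne(field.substr(start, len));
        if (!a.family.empty()) out.push_back(a);
        if (pos == std::string::npos) break;
        start = pos + sep.size();
    }
    return out;
}

// Initials deduced from the given name: first letter of each word.
std::string initials(const std::string &given) {
    std::string out;
    bool atStart = true;
    for (char c : given) {
        if (c == ' ' || c == '-') {
            atStart = true;
        } else if (atStart) {
            out += c;
            out += '.';
            atStart = false;
        }
    }
    return out;
}
```

The user can then correct the parsed entries in the "detail" list when
the naive split gets a name wrong, and the large field is regenerated
from the list.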

Some more things that I do not like:
* putting the proxy stuff in the pref dialog is probably not a good
thing; why not just a button to launch the corresponding GNOME capplet?
* I'm not fond of having the "property dialog" as a dialog. I would
prefer a panel inside the main window for this (maybe because I am used
to JabRef?)

Best regards
-- 
Aurelien