[referencer] Looking up a DOI (was: Re: [referencer] Re: Plugin for fetching data from Isi-WebOfScience)

Tue Jul 8 09:35:44 EDT 2008

Quoting Michael Banck <mbanck at gmx.net>:
>> Note that the triple needs to resolve to something machine-parseable,
>> rather than human-readable HTML.  Many journals have a "download
>> citation" link or so associated with article pages which would be useful
>> for this.  XML preferred to bibtex since it tends to have fewer
>> idiosyncrasies.
>
> While this would be certainly welcome, I think initially it would be
> less work to just extract the DOI from the article's HTML page, and do a
> metadata search on it using the available plugins (crossref/pubmed/
> arxiv).

Yes, but my point about it being machine-parseable stands: regexing
the DOI out of a webpage is not necessarily trivial, especially in
pages including lists of citations and their DOIs.  But yes,
downloading the metadata from elsewhere once a DOI is found is
perfectly acceptable.

> This reduces the problem to constructing the unique URL of the article,
> and extracting the DOI from the HTML page, something users without any
> python or XML-parsing knowledge could do for their journals.
>
> Constructing a unique URL is not always possible (e.g. Science Direct
> seems to use md5sum hashes for each article as URL), but seems to work
> for a lot of cases.  For example, for the American Institute of Physics
> (AIP) journals, the URL is as follows:
>
> http://link.aip.org/link/?$JOURN/$VOLUME/$PAGE_OR_ARTICLE_ID/1
>
> where $JOURN is a 6-char/digit ID of the journal (e.g. JAPIAU for  J.
> Appl. Phys. or JCPSA6 for J. Chem. Phys.).

Alright, that's 2/50.  Here's a sketch of how I see that information:

-> User selects a journal
-> Journal maps to a lookup function, in this case AIP
-> User selects remaining fields required by lookup function, in this  
case volume and page.
-> Lookup function is invoked with journal key, volume and page,  
translates this to a URI, downloads it, and applies its regex to it to  
extract a DOI.
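
A rough Python sketch of the last step, to make the flow concrete.  The
function names (build_uri, extract_doi, lookup_doi) are made up for
illustration and are not part of any existing plugin API; the placeholder
syntax (%0 for the journal key, %1, %2, ... for the user-supplied fields)
is assumed from the URI template used in this thread.

```python
import re
import urllib.request
from urllib.parse import quote


def build_uri(template, journal_key, field_values):
    """Fill %0 with the journal key and %1, %2, ... with the user's fields."""
    values = [journal_key] + [str(v) for v in field_values]

    def substitute(match):
        # %N picks the N-th value; quote it so it is safe inside a URI.
        return quote(values[int(match.group(1))], safe="")

    return re.sub(r"%(\d+)", substitute, template)


def extract_doi(page_text, pattern):
    """Return the first capture group of `pattern` in the page, or None."""
    match = re.search(pattern, page_text, re.MULTILINE)
    return match.group(1) if match else None


def lookup_doi(template, journal_key, field_values, pattern):
    """Build the URI, download the page, and apply the regex to it."""
    uri = build_uri(template, journal_key, field_values)
    with urllib.request.urlopen(uri) as response:
        page_text = response.read().decode("utf-8", errors="replace")
    return extract_doi(page_text, pattern)
```

For example, build_uri("http://link.aip.org/link/?%0/%1/%2/1", "JAPIAU",
["100", "123456"]) fills in the key, volume and page.  Keeping the download
separate from the URI construction and the regex means the latter two can
be checked without network access.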

Here's the set of information I think is needed.  Anything missing?

<journal>
<name>J. Appl. Phys.</name>
<alias>Journal of Applied Physics</alias>
<key>JAPIAU</key>
<lookup>AIP</lookup>
</journal>

<journal>
<name>J. Chem. Phys.</name>
<alias>Journal of Chemical Physics</alias>
<key>JCPSA6</key>
<lookup>AIP</lookup>
</journal>

<journal_lookup>
<name>AIP</name>
<!-- %0 is always the journal key -->
<uri>http://link.aip.org/link/?%0/%1/%2/1</uri>
<fields>
<field name="Volume" id="1"/>
<field name="Page" id="2"/>
</fields>
<regex>(DOI.*)$</regex>
</journal_lookup>
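
To check that this is enough information, here is a hedged sketch of how
such a file could be read with Python's xml.etree.ElementTree.  It assumes
the entries are wrapped in a single root element (called <journals> here,
which is my invention) and that each field id matches the %N placeholder
it fills in the URI.  load_config and the returned dictionaries are
illustrative names only.

```python
import xml.etree.ElementTree as ET

# Sample data: one journal and its lookup, wrapped in a hypothetical root.
SAMPLE = """<journals>
  <journal>
    <name>J. Appl. Phys.</name>
    <alias>Journal of Applied Physics</alias>
    <key>JAPIAU</key>
    <lookup>AIP</lookup>
  </journal>
  <journal_lookup>
    <name>AIP</name>
    <uri>http://link.aip.org/link/?%0/%1/%2/1</uri>
    <fields>
      <field name="Volume" id="1"/>
      <field name="Page" id="2"/>
    </fields>
    <regex>(DOI.*)$</regex>
  </journal_lookup>
</journals>"""


def load_config(xml_text):
    """Return (journals, lookups) dictionaries parsed from the XML text."""
    root = ET.fromstring(xml_text)

    journals = {}
    for journal in root.iter("journal"):
        entry = (journal.findtext("key"), journal.findtext("lookup"))
        # Index each journal under its name and under every alias.
        journals[journal.findtext("name")] = entry
        for alias in journal.findall("alias"):
            journals[alias.text] = entry

    lookups = {}
    for lookup in root.iter("journal_lookup"):
        # Order the fields by id so they line up with %1, %2, ...
        fields = sorted(lookup.iter("field"), key=lambda f: int(f.get("id")))
        lookups[lookup.findtext("name")] = {
            "uri": lookup.findtext("uri"),
            "fields": [f.get("name") for f in fields],
            "regex": lookup.findtext("regex"),
        }
    return journals, lookups
```

With this, journals["Journal of Applied Physics"] gives the key and the
lookup name, and lookups["AIP"]["fields"] tells the UI which fields to ask
the user for.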