Idea: using google scholar to fix crossref.org author issue

Michael Banck mbanck at gmx.net
Mon Jul 14 10:46:53 EDT 2008


Hi,

the issue that crossref.org only returns the first author of a paper is
a serious shortcoming as the usual style of citation in many journals
includes listing all authors unless there are more than a few.

Using the PubMed plugin to look up metadata instead gives good
results, but PubMed does not cover a lot of journals in the
physics/chemistry field.

Yesterday, I tried the following:

 1. Go to http://scholar.google.com
 3. Go to "Advanced Scholar Search".
 4. Put the DOI into the "with *all* of the words" field
 4a. Alternatively, put the volume and page as separate numbers into the
     "with *all* of the words" field.
 5. Put the first author's last name in the "Return articles written by"
    field.
 6. Put the full journal title into the "Return articles published in"
    field.
 7. Put the publication year in both of the "Return articles published
    between" fields
 8. Click on "Search Scholar"; if you get one link back, the authors are
    listed after "Key authors: " below the entry.
 9. Enjoy your full author list

I don't think Google Scholar has an API to make this easier, but it
should be possible to construct a search URL for it including author,
year, journal and DOI and/or volume/page information, check the
resulting HTML whether there is one hit, and extract the author
information from it.
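
The check-and-extract step could be sketched in Python roughly as
follows. This is only an illustration: the function name, the " - "
separator between names, and the idea that one "Key authors: " label
means one hit are all assumptions, not something verified against the
real result page.

```python
import re


def key_authors(page_text, separator=" - "):
    """Return the author list if the page has exactly one hit, else None.

    Assumption: each hit carries one "Key authors: " label and the names
    after it are separated by `separator` (a guess, adjust as needed).
    """
    # Capture everything after the label up to the next tag or line end.
    hits = re.findall(r"Key authors:\s*([^\n<]+)", page_text)
    if len(hits) != 1:
        return None  # zero hits or an ambiguous search
    return [name.strip() for name in hits[0].split(separator) if name.strip()]
```

Returning None for zero or several hits leaves it to the caller to
refine the search rather than guess which entry is the right one.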

A Google Scholar lookup URL would look like this:
http://scholar.google.com/scholar?q=$VOLUME+$PAGE+author%3A$AUTHOR&as_publication=$JOURNAL+NAME&as_ylo=$YEAR&as_yhi=$YEAR&btnG=Search
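
A minimal sketch of filling in that template with Python's standard
library, so that spaces and the colon in "author:" are escaped
correctly. The function name and the example values are made up for
illustration; the parameter names are the ones from the URL above.

```python
from urllib.parse import urlencode


def scholar_search_url(author, journal, year, volume, page):
    """Build a search URL following the template above."""
    params = [
        # Volume and page as plain words, plus the author: operator.
        ("q", f"{volume} {page} author:{author}"),
        ("as_publication", journal),
        ("as_ylo", str(year)),  # same year as lower and upper bound
        ("as_yhi", str(year)),
        ("btnG", "Search"),
    ]
    # urlencode turns spaces into "+" and ":" into "%3A".
    return "http://scholar.google.com/scholar?" + urlencode(params)


# Made-up citation data, only to show the shape of the result.
print(scholar_search_url("Banck", "Chemical Physics", 2008, 12, 345))
```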

However, Google Scholar doesn't seem to allow wget or curl to access it;
I am not sure about Python's httplib.  If you set the user-agent to
something else, it works though.
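
For reference, setting a different user-agent from Python could look
like the sketch below. It uses urllib.request from current Python
rather than httplib; the helper name and the "Mozilla/5.0" string are
placeholders I picked, and I have not checked which user-agent values
Google Scholar actually accepts.

```python
import urllib.request


def scholar_request(url, user_agent="Mozilla/5.0"):
    """Prepare a request with a custom User-Agent (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = scholar_request("http://scholar.google.com/scholar?q=test")
# urllib.request.urlopen(req) would perform the actual lookup.
```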

One more thing: if you click on "Scholar Preferences", select "Show
links to import citations into BibTeX" at the bottom, and then click
"Save Preferences", this adds an "Import into BibTeX" link to each
search result, which basically has all the information you'd get from
PubMed as well AFAICT (title, full author list, journal, volume, issue,
page, year, publisher).  However, that needs a cookie in your web
browser, so it is probably not feasible to do from wget/curl or httplib.

It would be nice if one could enhance crossref.org metadata with the
above transparently from referencer.  I am not sure referencer allows
for "improve metadata" plugins though.  The Web of Science plugin seems
to be of this quality (adding full abstracts as well), as could be a
theoretical aps.org plugin I mentioned some days ago.
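
What such an "improve metadata" step might do can be sketched as a
plain merge of two records. This is not referencer's plugin interface;
the function name and the dictionary keys ("authors", "abstract", ...)
are hypothetical and only illustrate the idea.

```python
def enhance_metadata(primary, extra):
    """Fill gaps in `primary` (e.g. crossref) from `extra` (e.g. Scholar)."""
    merged = dict(primary)
    for key, value in extra.items():
        # Only fill fields that are missing or empty; never overwrite.
        if not merged.get(key):
            merged[key] = value
    # Exception: prefer the longer author list, since the primary
    # source may return just the first author.
    if len(extra.get("authors", [])) > len(merged.get("authors", [])):
        merged["authors"] = list(extra["authors"])
    return merged
```

Keeping the primary record's values wherever they exist means the
second source can only add information, not replace it.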

The fact that Google Scholar search might work with the usual reference
information alone ((first) author, journal, volume, page, year) would
also make it possible to use it as a "Add reference from citation data"
entry in case a DOI is not readily available.  One additional problem
here is that Google Scholar seems to have problems with the usual
abbreviations of journal titles in article references, e.g. it needs
"Chemical Physics" as journal, the usual abbreviation "Chem Phys" only
finds some [CITATION] links, not the article itself.
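
One way around the abbreviation problem would be a small lookup table
that expands abbreviations before the search. The sketch below holds
only the one example from this mail; the table and function names are
made up, and a real list would have to be compiled separately.

```python
# Abbreviation -> full title, keyed in lower case without dots.
JOURNAL_TITLES = {
    "chem phys": "Chemical Physics",
}


def full_journal_title(name):
    """Expand a known abbreviation; return unknown names unchanged."""
    # Normalise "Chem. Phys." and "Chem Phys" to the same key.
    key = " ".join(name.replace(".", " ").lower().split())
    return JOURNAL_TITLES.get(key, name)
```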

In case somebody figures out how to access the BibTeX data from Google
Scholar result links through httplib et al., a possible workflow could
be:

 1. Select "Add reference from citation data" in referencer
 2. Select a journal from a precompiled list 
 3. Enter first author, volume, page, year in respective fields
 4. Referencer does a Google Scholar search, gets article title, full
    author list and issue number of the journal
 5. Referencer does a journal lookup via volume, issue, page, gets the
    DOI
 6. crossref/pubmed lookup is performed to possibly get any missing
    data.

Point 2. would be needed as Google needs quite precise names for the
journal.

Point 5. would be the theoretical journal-specific lookup plugin we
discussed some days ago.

Even if you have the DOI for the citation, selecting the journal and
entering author, volume, page and year might be easier than the cryptic
DOI numbers.

This is all just some brainstorming I did earlier, I wanted to write it
down for others to comment before I forget it.


cheers,

Michael


