ISI Web of Knowledge and referencer

Tue Mar 4 08:45:59 EST 2008

Hi all,

This is my first post, but I hope it won't be the last.

I have found referencer pretty interesting. Its main feature (in my opinion)
is its ability to read a PDF file and fetch metadata automatically from
CrossRef.
However, as a scientist, I use the best-known scientific database
(ISI Web of Knowledge) every day.

I have written a little script (in a couple of days, so it's very quick and
dirty) that takes a referencer database file and uses some of its fields to
obtain further information about each paper from ISI Web.

It doesn't work perfectly and needs some extra tuning. But I was
wondering whether something similar could be incorporated into referencer: I mean,
pick a field of a referencer record and complete the other fields from the
information obtained from ISI Web. If so, I think it would be very valuable.

Here are the two files needed to run my script (you also need a
subscription to ISI Web, but most universities have an institutional
subscription).

The first file is called refine_reflist.sh and it contains the following:
#################
#!/bin/bash
if [ $# -ne 2 ]; then
        echo "Usage: $0 old.reflist new.reflist (new.reflist is overwritten if it exists)"
        exit 1
fi
rm -f .temp_query_* .temp_answer_*
# The main functionality is contained in the following awk script,
# but I'm pretty sure that it would be simpler in Python (which I don't know
# very well)
awk -F ">" '
BEGIN {
# Prints the initial xml fields of the referencer file
        doit="false";
        print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        print "<library>"
        print "<manage_target braces=\"false\" utf8=\"false\"></manage_target>"
        print "<taglist>"
        print "</taglist>"
        print "<doclist>"

}
/<doc>/ {
print "<doc>"
        getline;
        safe=$0;
        gsub("<",">");
        while ($2!="/doc") { # while you are in this record...
                #print "1:",$1,"2:",$2,"3:",$3,"4:",$4;
                #if($2=="bib_doi" && $3!="") {  doit="true"; print doit,NF;}
                if($2=="bib_title") {title=$3; gsub(" ","%20",title); gsub(/[(")]/,"",title)  } #extract the paper title
                if($2=="bib_year") {year=$3; }
                if($2=="bib_authors" && $3=="") {doit="true";}
                else print safe;
                getline;
                safe=$0;
                gsub("<",">");
        }
        if(doit=="true") { # create a query file for ISI (I learned how this was done with a sniffer)
                temp_file=sprintf(".temp_query_%s_%s",title,year);
                # literal percent signs in the format string are doubled (%%) so that printf does not read them as conversions
                printf "GET /esti/cgi?databaseID=WOS&SID=V2gB%%40mljF1oPdjhlcF2&rspType=endnote&method=searchRetrieve&firstRec=1&numRecs=2&query=TI%%3D(%s)%%20and%%20PY%%3D(%s) HTTP/1.1\n",title,year > temp_file;
                print "User-Agent: Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.8 (like Gecko)">> temp_file;
                print "Accept: text/html, image/jpeg, image/png, text/*, image/*, */*">> temp_file;
                print "Accept-Encoding: x-gzip, x-deflate, gzip, deflate">> temp_file;
                print "Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5">> temp_file;
                print "Accept-Language: en">> temp_file;
                print "Host: estipub.isiknowledge.com">> temp_file;
                print "Connection: Keep-Alive">> temp_file;
                print "">> temp_file;
                print "QUIT" >> temp_file;

                doit="false";
                fflush(temp_file);
# The following uses netcat to perform a query on ISI Web. tr strips the
# carriage returns from the HTTP response.
                comando=sprintf("cat %s |nc estipub.isiknowledge.com 80 |tail -n +10 - |head -n -11 |tr -d \"\\r\" >> .temp_answer_%s_%s",temp_file,title,year);
                system(comando);
# The following command parses the ISI Web answer and adds the fields it found to
# the referencer record
                comando2=sprintf("awk -f extract_field.awk .temp_answer_%s_%s",title,year);
                system(comando2);
        }
print "</doc>"
}
END{
        print "</doclist>"
        print "</library>"

}
' "$1" > "$2"
rm -f .temp_query_* .temp_answer_*
###############

To run it, simply type:
./refine_reflist.sh old.reflist new.reflist
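
Since I suspect this would be simpler in Python, here is a rough sketch of how the request path written by the awk block might be assembled there. It is only an illustration: the path and parameter names are copied from the script above, build_query_path is a hypothetical helper, and the session ID in the usage line is a placeholder. urllib.parse.quote percent-encodes every reserved character in the title, whereas the awk version only replaces spaces and drops parentheses and quotes.

```python
from urllib.parse import quote

# Path and fixed parameters copied from refine_reflist.sh. This sketch only
# builds the string; it does not contact the server.
BASE_PATH = "/esti/cgi"
FIXED_PARAMS = "rspType=endnote&method=searchRetrieve&firstRec=1&numRecs=2"


def build_query_path(title, year, sid):
    """Return the request path for a title + year search (hypothetical helper)."""
    # Same query shape as the awk script: TI=(title) and PY=(year).
    # The title and year are fully percent-encoded, so parentheses or
    # ampersands inside a title cannot be mistaken for query syntax.
    query = "TI%3D({})%20and%20PY%3D({})".format(
        quote(title, safe=""), quote(str(year), safe="")
    )
    return "{}?databaseID=WOS&SID={}&{}&query={}".format(
        BASE_PATH, quote(sid, safe=""), FIXED_PARAMS, query
    )


if __name__ == "__main__":
    # "PLACEHOLDER" stands in for a real session ID.
    print(build_query_path("Spin waves & (magnons)", 2007, "PLACEHOLDER"))
```

The returned string is what would follow GET on the first line of the query file.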

The other file you need is called extract_field.awk:
################
BEGIN{
        FS=">";
        IGNORECASE=1;
        authcount=0;
        keycount=0;
}
/AuCollectiveName/{
        gsub("<",">");
        author[++authcount]=$3;
}
/<keyword>/{
        gsub("<",">");
        keywords[++keycount]=$3;
}
/article_no/ && /doi/ {
        gsub("<",">");
        doi=$3;
}
END{
        printf("<bib_authors>");
        for(i=1;i<authcount;i++) printf "%s and ",author[i];
        printf "%s</bib_authors>\n",author[authcount];
        printf("<bib_extra key=\"Keywords\">");
        for(i=1;i<keycount;i++) printf "%s,",keywords[i];
        printf "%s</bib_extra>\n",keywords[keycount];
        printf("<bib_extra key=\"Doi\"> %s </bib_extra>\n",doi);
}

###############
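
The same extraction could be sketched in Python along the following lines. This is not a tested replacement: it mirrors the line patterns of extract_field.awk (case-insensitive matches on AuCollectiveName, <keyword>, and article_no together with doi), the function names are hypothetical, and the sample lines in the usage part are invented for illustration, since I have not checked them against a real ISI Web answer. One deliberate difference is that the values are XML-escaped before being written into the record.

```python
import re
from xml.sax.saxutils import escape

# Text of the first element on a line, i.e. what the awk file reads as $3
# after turning every "<" into ">" and splitting on ">".
FIRST_ELEMENT_TEXT = re.compile(r"<[^>]*>([^<]*)")


def first_text(line):
    """Return the stripped text of the first element on the line, or ''."""
    match = FIRST_ELEMENT_TEXT.search(line)
    return match.group(1).strip() if match else ""


def extract_fields(lines):
    """Collect authors, keywords and DOI with the same patterns as the awk file."""
    authors, keywords, doi = [], [], ""
    for line in lines:
        lower = line.lower()  # the awk file sets IGNORECASE=1
        if "aucollectivename" in lower:
            authors.append(first_text(line))
        if "<keyword>" in lower:
            keywords.append(first_text(line))
        if "article_no" in lower and "doi" in lower:
            doi = first_text(line)
    return {"authors": authors, "keywords": keywords, "doi": doi}


def render_fields(fields):
    """Format the fields as the referencer elements printed by the END block."""
    return "\n".join([
        "<bib_authors>{}</bib_authors>".format(
            escape(" and ".join(fields["authors"]))),
        '<bib_extra key="Keywords">{}</bib_extra>'.format(
            escape(",".join(fields["keywords"]))),
        '<bib_extra key="Doi">{}</bib_extra>'.format(escape(fields["doi"])),
    ])


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "<AuCollectiveName>Smith, J</AuCollectiveName>",
        "<keyword>magnons</keyword>",
        "<article_no>DOI 10.1000/example</article_no>",
    ]
    print(render_fields(extract_fields(sample)))
```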

I would like to contribute to the referencer project, and I hope that this
idea could be incorporated (and of course improved) for future versions.

Another source of inspiration could be a Mac program called "Papers". I've
seen it in action, and it does everything I would like referencer to do.

Best regards,

Mario