How to Build Your Own Citation Index
First-hand experience with WoS, Scopus, and CSA reference data
Philipp Mayr, Frank Sawitzky, Andreas Strotmann (GESIS – Leibniz Institute for the Social Sciences, Cologne)
Background I
● Author co-citation and collaboration network mining and visualization
  – e.g. Bubela, Strotmann, et al. (2010), Cell Stem Cell: researchers' commercialization activity tends to reduce their collaboration breadth
Background II: Citation Index for the Social Sciences
● GESIS' Sowiport portal
  – 18 databases, including 6 CSA databases, all in the social sciences
  – CSA comes with cited references for some documents
  – SSOAR: extract references from OA full texts and index them in Sowiport
  – Crawl Google Scholar for citations to "our" documents
Two Models of Citation Graphs
● Bipartite (classic IR) model: citing and cited partitions
  – Citing nodes: full bibliographic records
  – Cited nodes: "keys", e.g. first author name & initials + year of publication + journal key + volume + number + page
● Uniform model: interconnected documents
  – All nodes: bibliographic records
  – Citing nodes: full records; cited nodes: mostly simplified records
  – "Matched" cited nodes have full records
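The "key" for a cited node can be sketched as a simple normalized concatenation of the fields listed above. This is a minimal illustration, not the deck's actual implementation; the field normalization choices are assumptions.

```python
def match_key(first_author, initials, year, journal_key, volume, number, page):
    """Build a simplified cited-node key from first author surname + initials,
    publication year, journal key, volume, number, and first page."""
    surname = first_author.strip().lower()
    inits = initials.replace(".", "").replace(" ", "").lower()
    return "|".join([surname, inits, str(year), journal_key.lower(),
                     str(volume), str(number), str(page)])

key = match_key("Strotmann", "A.", 2010, "cell stem cell", 7, 1, 13)
# -> 'strotmann|a|2010|cell stem cell|7|1|13'
```

Because the same simplification is applied to every cited reference, two citations of the same paper collapse into one cited node even when the full records differ in formatting.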
Citation Matching
● Goal: a citation network with unique nodes for documents
● Subtasks:
  – Match cited references to each other
  – Match cited references to full records
  – Match full records across databases
Scopus Citations
● Cited reference info contains:
  – Up to 8 author names (family name + initials)
    ● including the last author
    ● frequently "as cited" (not standardized or corrected)
  – Publication year, title, journal name/vol./no./page
    ● frequently "as cited"
  – Reasonably well parsable, but not normalized
Matching Scopus Citations to Scopus Full Records
External matching: the Scopus search engine
● "Algorithm": parse the Scopus reference into subfields, construct complex search queries for the Scopus engine, download the resulting full records, choose the best fit
● High-precision searches: complex queries allowed, many searchable fields
  – Improve recall with successively vaguer queries
● Only a small number of downloads is allowed, so many queries are needed to construct a sizable citation index
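The "successively vaguer queries" step can be sketched as a fallback cascade: try the strictest query that the parsed subfields allow, then drop the fields most likely to be mis-cited. The query field syntax below is illustrative, not the actual Scopus API syntax.

```python
def query_cascade(ref):
    """Yield successively vaguer queries for a parsed cited reference.
    Strictest first; a field set is skipped if the reference lacks a field."""
    field_sets = [
        ["author", "year", "source", "volume", "page"],  # strictest
        ["author", "year", "source"],                    # drop vol./page
        ["author", "year"],                              # vaguest fallback
    ]
    for fields in field_sets:
        if all(ref.get(f) for f in fields):
            yield " AND ".join(f"{f}({ref[f]})" for f in fields)

ref = {"author": "Bubela T.", "year": 2010, "source": "Cell Stem Cell"}
queries = list(query_cascade(ref))  # two queries: vol./page are missing
```

In practice one would stop at the first query that returns a plausible hit, which keeps the number of downloads low while the vaguer fallbacks recover recall.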
Matching Scopus Citations to PubMed Full Records
Cross-DB external match: Scopus/Medline
● "Algorithm": parse the Scopus reference, construct PubMed Batch Citation Matcher queries, download the matched PubMed(!) records
  – Only works for biomedical fields
  – Result is a citation network of PubMed records, not Scopus records
  – Requires matching of the Scopus citing records as well
    ● works in either direction (Scopus ↔ PubMed)
    ● both databases include PubMed IDs
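NCBI's Batch Citation Matcher accepts pipe-delimited query lines; a helper for producing them might look as follows. The field order shown (journal|year|volume|first page|author|key|) follows NLM's documented format, but check the current service documentation before relying on it; the sample values are invented.

```python
def citmatch_line(journal, year, volume, first_page, author, key):
    """Format one pipe-delimited query line for NCBI's Batch Citation
    Matcher: journal|year|volume|first_page|author|your_key|
    The trailing key lets you map returned PMIDs back to your references."""
    fields = [journal, str(year), str(volume), str(first_page), author, key]
    return "|".join(fields) + "|"

line = citmatch_line("Cell Stem Cell", 2010, 7, 13, "Bubela T", "ref-0001")
# -> 'Cell Stem Cell|2010|7|13|Bubela T|ref-0001|'
```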
Matching Web of Science References to WoS Full Records
WoS cited reference info contains:
● First author (last name plus initials)
● Publication year
● Source title code
● Vol./num./page
● More and more frequently, a DOI
● No title included!
Matching WoS Cited References to WoS Records
External matching via WoS web search
● Only small queries supported
  – Many downloads necessary
● Crucial search fields not supported (vol., num.)
  – Therefore, highly ambiguous result lists are to be expected
● Requires translation of the source title from code to full title
● Requires algorithmic filtering of the correct hit from a long result list
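The filtering step can be sketched as scoring each candidate record against the parsed cited reference and accepting only a sufficiently clear winner. The weights and threshold below are illustrative assumptions, not the deck's algorithm.

```python
def score(ref, record):
    """Score a candidate full record against a parsed cited reference.
    Weights are illustrative; fields absent from the reference are skipped."""
    s = 0
    if ref.get("year") == record.get("year"):
        s += 2
    if ref.get("author", "").lower() == record.get("first_author", "").lower():
        s += 3
    if ref.get("volume") and ref["volume"] == record.get("volume"):
        s += 1
    if ref.get("page") and ref["page"] == record.get("first_page"):
        s += 2
    return s

def best_hit(ref, records, threshold=5):
    """Return the best-scoring record, or None if nothing clears the bar."""
    ranked = sorted(records, key=lambda r: score(ref, r), reverse=True)
    return ranked[0] if ranked and score(ref, ranked[0]) >= threshold else None

ref = {"year": 2010, "author": "Strotmann A", "volume": "7", "page": "13"}
candidates = [
    {"year": 2009, "first_author": "Strotmann A"},
    {"year": 2010, "first_author": "Strotmann A",
     "volume": "7", "first_page": "13"},
]
hit = best_hit(ref, candidates)  # the second candidate wins
```

The threshold trades precision against recall: raising it discards more ambiguous hits, which matters when, as noted below, well over half of the raw hits can be false positives.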
Matching WoS References to WoS Records
Internal matching
● Kompetenzzentrum Bibliometrie has a full local copy of the WoS data
● Experiment: which "match key" best supports this?
  – Dinkel (2011), ISSI
  – Results in error estimates for references
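Evaluating a candidate match key amounts to checking how many references resolve to exactly one local record (a match) versus several (ambiguity, i.e. false-positive risk). A minimal sketch of such an estimate, with invented field names:

```python
from collections import Counter

def key_quality(refs, records, keyfunc):
    """Estimate a candidate match key's quality: the fraction of references
    hitting exactly one record, and the fraction hitting several."""
    index = Counter(keyfunc(r) for r in records)
    matched = sum(1 for r in refs if index[keyfunc(r)] == 1)
    ambiguous = sum(1 for r in refs if index[keyfunc(r)] > 1)
    n = len(refs)
    return matched / n, ambiguous / n

records = [
    {"au": "meyer", "yr": 2008},
    {"au": "meyer", "yr": 2008},   # key collision -> ambiguous
    {"au": "lang", "yr": 2009},
]
rate, ambiguous = key_quality(records, records,
                              lambda r: (r["au"], r["yr"]))
```

Running this over several candidate key definitions (with and without volume, page, etc.) yields exactly the kind of per-key error estimates the experiment produces.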
Building a Citation Index for the Social Sciences: CSA
● Basis: Cambridge Scientific Abstracts (Social Sciences)
  – To be extended with additional sources of cited-reference info
● Nationwide licensing scheme for Germany, administered at GESIS
● Six CSA/ProQuest databases incorporated into GESIS' "Sowiport" social sciences portal
  – Now including ~8.5 million cited references
● No matchings to full records are provided by ProQuest
● Early experimental results available on the portal
  – Focus on precision, not recall
CSA References in GESIS' Sowiport Database
● Each full record contains "references" and "cited-by" information
  – Some with actionable links to full records
● Combines the WoS/Scopus and Google Scholar approaches to citation index construction
CSA Reference Information
● Fields: citing ID, reference ID, authors, title, year, publisher, source title/num./vol./p., ISSN
  – The format changes over time, though
● Mostly parsed automatically, as fields are frequently mis-assigned
● Example (book):
  200601317Voice UK No More Abuse.2000 Derby: Voice UK
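Book references like the example above run the fields together, so parsing them is necessarily heuristic. A sketch of one such heuristic (the pattern and its assumptions, e.g. a four-digit year directly before "Place: Publisher", are illustrative and will not survive every format change):

```python
import re

# Heuristic parse of a run-together CSA book reference:
# leading digits = record ID, then author/title text, then a
# four-digit year, then "Place: Publisher".
PATTERN = re.compile(
    r"^(?P<id>\d+?)"                 # record ID: leading digits
    r"(?P<body>\D.*?)"               # author/title: starts with a non-digit
    r"(?P<year>(?:19|20)\d{2})\s+"   # four-digit publication year
    r"(?P<place>[^:]+):\s*"          # place of publication
    r"(?P<publisher>.+)$"            # publisher
)

m = PATTERN.match("200601317Voice UK No More Abuse.2000 Derby: Voice UK")
# m.group("id") == "200601317", m.group("year") == "2000",
# m.group("place") == "Derby", m.group("publisher") == "Voice UK"
```

Note that even here the author ("Voice UK") and title ("No More Abuse") remain fused in the body group; separating them reliably is exactly the hard part.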
Citation Matching in CSA
"Algorithm": internal matching
● Parse references; construct search queries (Solr):
  – exact title and year, or
  – fuzzy title and year and ISSN
● Choose the first match
● Favors precision over recall
  – Fuzzy match only for journal literature, for example
  – However, matching runs across multiple CSA databases
● Research to be continued!
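The two-step query construction above can be sketched as follows. The Solr field names are illustrative assumptions; the `~` fuzzy operator is standard Solr/Lucene syntax.

```python
def solr_queries(ref):
    """Build the query pair sketched above: an exact title+year phrase
    query, plus a fuzzy title query constrained by year and ISSN
    (the fuzzy fallback only when an ISSN is present, i.e. journal
    literature)."""
    title = ref["title"].replace('"', "")
    exact = f'title:"{title}" AND year:{ref["year"]}'
    fuzzy = None
    if ref.get("issn"):
        fuzzy_terms = " ".join(f"{word}~" for word in title.split())
        fuzzy = (f"title:({fuzzy_terms}) AND year:{ref['year']} "
                 f"AND issn:{ref['issn']}")
    return exact, fuzzy

exact, fuzzy = solr_queries(
    {"title": "No More Abuse", "year": 2000, "issn": "1234-5678"})
```

Restricting the fuzzy fallback to ISSN-bearing references is one way to keep precision high: without the ISSN constraint, fuzzy title terms would match far too broadly.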
Experiments – Datasets Caveat
● Scopus/PubMed and WoS experiments run on the stem cell research field (biomedical area)
  – < 100k citing docs, ~1 million references
  – > 95% of references are to journal articles
● CSA experiment run on social sciences databases
  – ~1 million full records, ~10 million references
  – Only recent records contain references
  – Many(!!) references to non-journal literature
Some Rough Numbers
● Scopus ↔ PubMed full record matching
  – > 95% match rate
● Scopus references → Scopus/PubMed full records
  – ~90% match rate "exact" + ~5% fuzzy matches
  – ~1% false positives needed to be filtered out
● WoS references → WoS full records
  – ~90% match rate
  – >> 50% false positives needed to be filtered out
● CSA references → CSA full records
  – ~30% match rate
  – ~1% false positives
Discussion
CSA matching is much(!) harder
● Social science publication culture
  – Books, chapters, and articles
    ● published in roughly equal numbers; books are cited most
  – Multilingual publishing
    ● English is not the only language
    ● documents may be cited in translation
  – Broad referencing behaviour
    ● large proportion of references to non-source items
● Biomedical publication culture
  – >> 90% of references are to international journal articles
  – Near-complete coverage in the WoS/Scopus/PubMed databases
Discussion
=> A first-try high-precision match rate of ~30% is an excellent result
● Close to the expected rate of references to journal articles
● Plenty of research opportunities to improve matching of non-journal literature references to source records
  – e.g. to GESIS' own SOLIS / SOFIS / SSOAR databases
  – e.g. by crawling Google Scholar for reference links
● You are invited to try your hand at this, too!
  – See: GESIS Application Laboratory
Outlook – What we should be doing
Towards a distributed semantic citation index
● Based on digital full-text collections (cooperate with publishers)
● Reference extraction (with contexts)
● Reference matching
  – Enables referential semantics
● Open reference-semantics information exchange
  – Enables sentiment analysis (important in the social sciences)
  – "A paper indexed in our collection cites a paper indexed in yours"
● Semi-automatic / computer-aided
  – Algorithms + professional indexers (authority files) + crowdsourcing
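One way such a "paper indexed in our collection cites a paper indexed in yours" exchange could look is a simple structured record carrying the two document handles, the citation context, and the derived semantics. Everything here, including collection names, identifiers, and field layout, is a hypothetical sketch, not a proposed standard.

```python
import json

# Hypothetical exchange record for a distributed citation index:
# one collection asserts that a paper it indexes cites a paper
# indexed elsewhere, with citation context and a sentiment label.
link = {
    "citing": {"collection": "sowiport", "id": "doc-00123456"},
    "cited": {"collection": "partner-repo", "doi": "10.1000/example.1"},
    "context": "... as shown in [12], survey response rates decline ...",
    "semantics": {"function": "supports", "sentiment": "positive"},
}
wire_format = json.dumps(link, indent=2)  # what would cross the wire
```

Keeping the context and semantics alongside the bare link is what would let partner collections run their own referential-semantics or sentiment analyses without re-extracting references.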