How to Build Your Own Citation Index
First-hand experience with WoS, Scopus, and CSA reference data
Philipp Mayr, Frank Sawitzky, Andreas Strotmann (GESIS – Leibniz Institute for the Social Sciences, Cologne)
Background I
● Author co-citation and collaboration network mining and visualization
  – e.g. Bubela, Strotmann, et al. (2010), Cell Stem Cell: researchers' commercialization activity tends to reduce their collaboration breadth
Background II: Citation Index for the Social Sciences
● GESIS' Sowiport portal
  – 18 databases, including 6 CSA databases, all in the social sciences
  – CSA comes with cited references for some documents
  – SSOAR: extract references from OA full texts and index them in Sowiport
  – Crawl Google Scholar for citations to "our" documents
Two Models of Citation Graphs
● Bipartite (classic IR) model: citing and cited partitions
  – Citing nodes: full bibliographic records
  – Cited nodes: "keys", e.g. first author name & initials + year of publication + journal key + volume + number + page
● Uniform model: interconnected documents
  – All nodes: bibliographic records
  – Citing nodes: full records; cited nodes: mostly simplified records
  – "Matched" cited nodes have full records
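The "key" for a cited node can be sketched as a simple normalized concatenation of the fields listed above. This is a minimal illustration, not the deck's actual implementation; the field normalization choices are assumptions.

```python
def match_key(first_author, initials, year, journal_key, volume, number, page):
    """Build a simplified cited-node key from first author surname + initials,
    publication year, journal key, volume, number, and first page."""
    surname = first_author.strip().lower()
    inits = initials.replace(".", "").replace(" ", "").lower()
    return "|".join([surname, inits, str(year), journal_key.lower(),
                     str(volume), str(number), str(page)])

key = match_key("Strotmann", "A.", 2010, "cell stem cell", 7, 1, 13)
# -> 'strotmann|a|2010|cell stem cell|7|1|13'
```

Because the same simplification is applied to every cited reference, two citations of the same paper collapse into one cited node even when the full records differ in formatting.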
Citation Matching
● Goal: a citation network with unique nodes for documents
● Subtasks:
  – Match cited references to each other
  – Match cited references to full records
  – Match full records across databases
Scopus Citations
● Cited reference info contains:
  – Up to 8 author names (family name + initials)
    ● including the last author
    ● frequently "as cited" (not standardized or corrected)
  – Publication year, title, journal name/vol./no./page
    ● frequently "as cited"
  – Reasonably well parsable, but not normalized
Matching Scopus Citations to Scopus Full Records
External matching: the Scopus search engine
● "Algorithm": parse the Scopus reference into subfields, construct complex search queries for the Scopus engine, download the resulting full records, choose the best fit
● High-precision searches: complex queries allowed, many searchable fields
  – Improve recall with successively vaguer queries
● Only a small number of downloads is allowed, so many queries are needed to construct a sizable citation index
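The "successively vaguer queries" step can be sketched as a fallback cascade: try the strictest query that the parsed subfields allow, then drop the fields most likely to be mis-cited. The query field syntax below is illustrative, not the actual Scopus API syntax.

```python
def query_cascade(ref):
    """Yield successively vaguer queries for a parsed cited reference.
    Strictest first; a field set is skipped if the reference lacks a field."""
    field_sets = [
        ["author", "year", "source", "volume", "page"],  # strictest
        ["author", "year", "source"],                    # drop vol./page
        ["author", "year"],                              # vaguest fallback
    ]
    for fields in field_sets:
        if all(ref.get(f) for f in fields):
            yield " AND ".join(f"{f}({ref[f]})" for f in fields)

ref = {"author": "Bubela T.", "year": 2010, "source": "Cell Stem Cell"}
queries = list(query_cascade(ref))  # two queries: vol./page are missing
```

In practice one would stop at the first query that returns a plausible hit, which keeps the number of downloads low while the vaguer fallbacks recover recall.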
Matching Scopus Citations to PubMed Full Records
Cross-DB external match: Scopus/Medline
● "Algorithm": parse the Scopus reference, construct PubMed Batch Citation Matcher queries, download the matched PubMed(!) records
  – Only works for biomedical fields
  – Result is a citation network of PubMed records, not Scopus records
  – Requires matching of the Scopus citing records as well
    ● works in either direction (Scopus ↔ PubMed)
    ● both databases include PubMed IDs
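NCBI's Batch Citation Matcher accepts pipe-delimited query lines; a helper for producing them might look as follows. The field order shown (journal|year|volume|first page|author|key|) follows NLM's documented format, but check the current service documentation before relying on it; the sample values are invented.

```python
def citmatch_line(journal, year, volume, first_page, author, key):
    """Format one pipe-delimited query line for NCBI's Batch Citation
    Matcher: journal|year|volume|first_page|author|your_key|
    The trailing key lets you map returned PMIDs back to your references."""
    fields = [journal, str(year), str(volume), str(first_page), author, key]
    return "|".join(fields) + "|"

line = citmatch_line("Cell Stem Cell", 2010, 7, 13, "Bubela T", "ref-0001")
# -> 'Cell Stem Cell|2010|7|13|Bubela T|ref-0001|'
```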
Matching Web of Science References to WoS Full Records
WoS cited reference info contains:
● First author (last name plus initials)
● Publication year
● Source title code
● Vol./num./page
● More and more frequently, a DOI
● No title included!
Matching WoS Cited References to WoS Records
External matching via WoS web search
● Only small queries supported
  – Many downloads necessary
● Crucial search fields not supported (vol., num.)
  – Therefore, highly ambiguous result lists are to be expected
● Requires translation of the source title from code to full title
● Requires algorithmic filtering of the correct hit from a long result list
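The filtering step can be sketched as scoring each candidate record against the parsed cited reference and accepting only a sufficiently clear winner. The weights and threshold below are illustrative assumptions, not the deck's algorithm.

```python
def score(ref, record):
    """Score a candidate full record against a parsed cited reference.
    Weights are illustrative; fields absent from the reference are skipped."""
    s = 0
    if ref.get("year") == record.get("year"):
        s += 2
    if ref.get("author", "").lower() == record.get("first_author", "").lower():
        s += 3
    if ref.get("volume") and ref["volume"] == record.get("volume"):
        s += 1
    if ref.get("page") and ref["page"] == record.get("first_page"):
        s += 2
    return s

def best_hit(ref, records, threshold=5):
    """Return the best-scoring record, or None if nothing clears the bar."""
    ranked = sorted(records, key=lambda r: score(ref, r), reverse=True)
    return ranked[0] if ranked and score(ref, ranked[0]) >= threshold else None

ref = {"year": 2010, "author": "Strotmann A", "volume": "7", "page": "13"}
candidates = [
    {"year": 2009, "first_author": "Strotmann A"},
    {"year": 2010, "first_author": "Strotmann A",
     "volume": "7", "first_page": "13"},
]
hit = best_hit(ref, candidates)  # the second candidate wins
```

The threshold trades precision against recall: raising it discards more ambiguous hits, which matters when, as noted below, well over half of the raw hits can be false positives.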
Matching WoS References to WoS Records
Internal matching
● Kompetenzzentrum Bibliometrie has a full local copy of the WoS data
● Experiment: which "match key" best supports this?
  – Dinkel (2011), ISSI
  – Results in error estimates for references
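Evaluating a candidate match key amounts to checking how many references resolve to exactly one local record (a match) versus several (ambiguity, i.e. false-positive risk). A minimal sketch of such an estimate, with invented field names:

```python
from collections import Counter

def key_quality(refs, records, keyfunc):
    """Estimate a candidate match key's quality: the fraction of references
    hitting exactly one record, and the fraction hitting several."""
    index = Counter(keyfunc(r) for r in records)
    matched = sum(1 for r in refs if index[keyfunc(r)] == 1)
    ambiguous = sum(1 for r in refs if index[keyfunc(r)] > 1)
    n = len(refs)
    return matched / n, ambiguous / n

records = [
    {"au": "meyer", "yr": 2008},
    {"au": "meyer", "yr": 2008},   # key collision -> ambiguous
    {"au": "lang", "yr": 2009},
]
rate, ambiguous = key_quality(records, records,
                              lambda r: (r["au"], r["yr"]))
```

Running this over several candidate key definitions (with and without volume, page, etc.) yields exactly the kind of per-key error estimates the experiment produces.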
Building a Citation Index for the Social Sciences: CSA
● Basis: Cambridge Scientific Abstracts (Social Sciences)
  – To be extended with additional sources of cited-reference info
● Nationwide licensing scheme for Germany, administered at GESIS
● Six CSA/ProQuest databases incorporated into GESIS' "Sowiport" social sciences portal
  – Now including ~8.5 million cited references
● No matchings to full records are provided by ProQuest
● Early experimental results available on the portal
  – Focus on precision, not recall
CSA References in GESIS' Sowiport Database
● Each full record contains "references" and "cited-by" information
  – Some with actionable links to full records
● Combines the WoS/Scopus and Google Scholar approaches to citation index construction
CSA Reference Information
● Fields: citing ID, reference ID, authors, title, year, publisher, source title/num./vol./p., ISSN
  – The format changes over time, though
● Mostly parsed automatically, as fields are frequently mis-assigned
● Example (book):
  200601317Voice UK No More Abuse.2000 Derby: Voice UK
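Book references like the example above run the fields together, so parsing them is necessarily heuristic. A sketch of one such heuristic (the pattern and its assumptions, e.g. a four-digit year directly before "Place: Publisher", are illustrative and will not survive every format change):

```python
import re

# Heuristic parse of a run-together CSA book reference:
# leading digits = record ID, then author/title text, then a
# four-digit year, then "Place: Publisher".
PATTERN = re.compile(
    r"^(?P<id>\d+?)"                 # record ID: leading digits
    r"(?P<body>\D.*?)"               # author/title: starts with a non-digit
    r"(?P<year>(?:19|20)\d{2})\s+"   # four-digit publication year
    r"(?P<place>[^:]+):\s*"          # place of publication
    r"(?P<publisher>.+)$"            # publisher
)

m = PATTERN.match("200601317Voice UK No More Abuse.2000 Derby: Voice UK")
# m.group("id") == "200601317", m.group("year") == "2000",
# m.group("place") == "Derby", m.group("publisher") == "Voice UK"
```

Note that even here the author ("Voice UK") and title ("No More Abuse") remain fused in the body group; separating them reliably is exactly the hard part.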
Citation Matching in CSA
"Algorithm": internal matching
● Parse references; construct search queries (Solr):
  – exact title and year, or
  – fuzzy title and year and ISSN
● Choose the first match
● Favors precision over recall
  – Fuzzy match only for journal literature, for example
  – However, matching runs across multiple CSA databases
● Research to be continued!
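The two-step query construction above can be sketched as follows. The Solr field names are illustrative assumptions; the `~` fuzzy operator is standard Solr/Lucene syntax.

```python
def solr_queries(ref):
    """Build the query pair sketched above: an exact title+year phrase
    query, plus a fuzzy title query constrained by year and ISSN
    (the fuzzy fallback only when an ISSN is present, i.e. journal
    literature)."""
    title = ref["title"].replace('"', "")
    exact = f'title:"{title}" AND year:{ref["year"]}'
    fuzzy = None
    if ref.get("issn"):
        fuzzy_terms = " ".join(f"{word}~" for word in title.split())
        fuzzy = (f"title:({fuzzy_terms}) AND year:{ref['year']} "
                 f"AND issn:{ref['issn']}")
    return exact, fuzzy

exact, fuzzy = solr_queries(
    {"title": "No More Abuse", "year": 2000, "issn": "1234-5678"})
```

Restricting the fuzzy fallback to ISSN-bearing references is one way to keep precision high: without the ISSN constraint, fuzzy title terms would match far too broadly.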
Experiments – Datasets Caveat
● Scopus/PubMed and WoS experiments run on the stem cell research field (biomedical area)
  – < 100k citing docs, ~1 million references
  – > 95% of references are to journal articles
● CSA experiment run on social sciences databases
  – ~1 million full records, ~10 million references
  – Only recent records contain references
  – Many(!!) references to non-journal literature
Some Rough Numbers
● Scopus ↔ PubMed full record matching
  – > 95% match rate
● Scopus references → Scopus/PubMed full records
  – ~90% match rate "exact" + ~5% fuzzy matches
  – ~1% false positives needed to be filtered out
● WoS references → WoS full records
  – ~90% match rate
  – >> 50% false positives needed to be filtered out
● CSA references → CSA full records
  – ~30% match rate
  – ~1% false positives
Discussion
CSA matching is much(!) harder
● Social science publication culture
  – Books, chapters, and articles
    ● published in roughly equal numbers; books are cited most
  – Multilingual publishing
    ● English is not the only language
    ● documents may be cited in translation
  – Broad referencing behaviour
    ● large proportion of references to non-source items
● Biomedical publication culture
  – >> 90% of references are to international journal articles
  – Near-complete coverage in the WoS/Scopus/PubMed databases
Discussion
=> A first-try high-precision match rate of ~30% is an excellent result
● Close to the expected rate of references to journal articles
● Plenty of research opportunities to improve matching of non-journal literature references to source records
  – e.g. to GESIS' own SOLIS / SOFIS / SSOAR databases
  – e.g. by crawling Google Scholar for reference links
● You are invited to try your hand at this, too!
  – See: GESIS Application Laboratory
Outlook – What we should be doing
Towards a distributed semantic citation index
● Based on digital full-text collections (cooperate with publishers)
● Reference extraction (with contexts)
● Reference matching
  – Enables referential semantics
● Open reference-semantics information exchange
  – Enables sentiment analysis (important in the social sciences)
  – "A paper indexed in our collection cites a paper indexed in yours"
● Semi-automatic / computer-aided
  – Algorithms + professional indexers (authority files) + crowdsourcing
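One way such a "paper indexed in our collection cites a paper indexed in yours" exchange could look is a simple structured record carrying the two document handles, the citation context, and the derived semantics. Everything here, including collection names, identifiers, and field layout, is a hypothetical sketch, not a proposed standard.

```python
import json

# Hypothetical exchange record for a distributed citation index:
# one collection asserts that a paper it indexes cites a paper
# indexed elsewhere, with citation context and a sentiment label.
link = {
    "citing": {"collection": "sowiport", "id": "doc-00123456"},
    "cited": {"collection": "partner-repo", "doi": "10.1000/example.1"},
    "context": "... as shown in [12], survey response rates decline ...",
    "semantics": {"function": "supports", "sentiment": "positive"},
}
wire_format = json.dumps(link, indent=2)  # what would cross the wire
```

Keeping the context and semantics alongside the bare link is what would let partner collections run their own referential-semantics or sentiment analyses without re-extracting references.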