
An Extension to JabRef for Extraction and Processing of Scholarly Articles

Sehrish Amjad, Hamid Mukhtar, Muddassir Malik
National University of Sciences and Technology, Islamabad (NUST), Pakistan
{11msitsamjad, hamid.mukhtar, muddassir.malik}@seecs.edu.pk

Abstract—Effective management of bibliographic information and appropriate citation of relevant work are integral parts of every scholarly article. To ease literature management, several reference managers have been developed, each offering different features. Almost all of them support extraction of metadata from articles, but very few support extraction of full text, search within article subsections, or parsing of citations. For researchers, managing references is one of the most complicated aspects of their work. This paper therefore introduces a scholarly article organizer as an extension of the existing open source reference manager JabRef. The proposed extension manages, organizes and processes academic papers. Its distinguishing features are content reading, metadata extraction, citation parsing, fetching of relevant BibTeX entries, and linking of In-Cite and Out-Cite information. Scholarly objects represent research output in the form of metadata, parsed contents and citations. Accompanying them with BibTeX entries is important because BibTeX can be handled easily by end users and can be read automatically by many repositories and other tools. The key motivation behind the proposed add-on is to enhance the functionality of JabRef by reducing the complexity involved in handling academic literature.

Keywords—BibTeX; Citation Parsing; Information Extraction; Reference Management

I. INTRODUCTION

Conducting research systematically and arriving at the desired goal requires skills that can only be acquired by adopting a methodology. Almost all beginning researchers and students face difficulties in managing their research, and even many mid-career and some experienced researchers find the management of research-related activities challenging. Problems commonly encountered by students and novice researchers include:

• Poor organization of research material, which includes searching, storing and writing of articles
• Reference management and difficulty in selecting a tool for managing research-related material

One of the most important problems is that research material is not organized systematically. Much of a scholar's valuable time is spent searching for books, newspapers, reports and articles in the area under study rather than extracting relevant information from them. Many researchers find effective management of literature tedious and time-consuming, particularly when it is done manually. Efficient management of scholarly content, which mainly comprises articles, their citations and their bibliographic information, is therefore a fundamental part of research. Extracting references and bibliographic information and keeping track of their proper use is one of the most complex aspects of research for novice and experienced researchers alike.

To ease literature management, various reference managers (RMs) have been developed, such as Citavi [7], Docear [2], EndNote [4], JabRef [1], Mendeley [3], Papers [8], RefWorks [6] and Zotero [5]. All of these RMs offer a common set of functions and help researchers with three fundamental steps: exploring, saving and writing papers. RMs assist researchers in finding relevant scholarly literature; allow them to save articles, bibliographic information and references in a dedicated database for future use and retrieval; and enable them to include citations and a bibliography in a selected citation style while preparing a document. With the help of RMs, scholars keep track of the scientific literature they read and edit the scientific papers they write. Reference management applications have been commercially available for a long time, but freely accessible solutions now offer comparable features and are becoming increasingly important [10].
Proprietary software often cannot match open source software in popularity and adoption. Because of the high cost, limited development support, and the security and customization issues associated with proprietary software, many researchers prefer open source software. Recognizing these problems, and aiming to provide better support for managing literature, this paper presents an approach based on enhancing existing open source software. The core motivation is not only to fill a functionality gap but also to treat the enhancement of open source software as an essential task in its own right. This task can be accomplished by providing bug-free, customized development support to the community of open source developers.

PDF is the most common and most preferred format for disseminating and storing the contents of a scholarly article. Recognizing its importance in literature management, many RMs are integrating PDF organizers, metadata fetchers, title extractors and PDF viewers for scholarly papers. Among open source RMs, almost all support direct import from bibliographic databases, but only a few support the insertion of full-text papers and articles, because the complete text of academic literature is usually costly or hard to find. The motivation behind the proposed improvement is to bring older RMs level with newly developed ones in terms of functionality. This paper contributes by proposing and implementing an extension to JabRef that helps researchers by meeting their changing requirements.

Section II describes the related work. Section III gives a detailed description of the proposed extension to JabRef. Section IV highlights the implementation aspects with screenshots, and Section V presents the conclusion and future work.

II. RELATED WORK

The researcher's landscape has changed tremendously since the development of the first RM in 1980. A growing number of RMs has accompanied the move from the PC revolution in 1980 to the World Wide Web and, most recently, Web 2.0, and RMs have kept pace with these developments. To maintain this standard, RMs are tailored to users with different expectations and needs. The newer RMs not only provide the features expected of a traditional reference organizer but also incorporate social, collaborative and full-text article management features.

A. Reference Managers

Several RMs exist, each with particular strengths and limitations. For this paper, seven RMs were analyzed, tested and reviewed in order to provide an overview of their functionality. A brief description of each is given below.

Citavi [7] was developed in 2006 by Swiss Academic Software. Its core features are reference management, task planning and knowledge organization. Its online search capability enables users to search catalogs and databases from within the program.

Docear [2] is an academic literature suite developed in 2009 by Otto-von-Guericke University Magdeburg and the University of California, Berkeley. It is free of charge and open source, and is based on JabRef and Freeplane. Docear works not only as an RM but also as a digital library, a PDF and file manager, and a mind mapping and note taking tool. Docear uses Docear's PDF Inspector [16] to extract document metadata based on stylistic analysis, but this functionality is limited to extracting titles from articles. It works seamlessly with many existing tools such as Microsoft Word, Mendeley and Foxit Reader.

EndNote [4] is a commercial RM developed by Thomson Reuters in 1988. It has been one of the best-known RMs for Windows and Mac OS for more than two decades. It supports importing collections of references from bibliographic databases, online resources and full-text PDFs, as well as exporting references to BibTeX. EndNote provides plug-ins for Microsoft Word and OpenOffice.

JabRef [1] was developed in 2003 by the JabRef developers. It is an open source bibliography manager that is well recognized in the community of LaTeX users. It runs on Java and is therefore compatible with Mac, Linux and Windows. The native format of JabRef is BibTeX, because it is one of the most widely used and best established file formats for storing bibliographic data and is the standard LaTeX bibliography format. Exporting information in a standardized format is necessary because it enables end users to keep a backup of their bibliography separately from the RM, to switch from one RM to another, or to use multiple RMs in parallel [10]. With JabRef, references and citations can be formatted directly in LaTeX, which gives access to the extensive collection of citation styles required by publishers. JabRef can be used in combination with LaTeX, MS Office and OpenOffice.org through suitable plug-ins. JabRef is the only RM capable of embedding and reading BibTeX metadata using the Extensible Metadata Platform (XMP) standard as a wrapper that stores BibTeX metadata [11]. JabRef is mature and has been used for years by scholars all over the world. Because of its popularity and extensive use, some newly developed RMs and research visualization tools, such as Docear and Action Science Explorer [9], incorporate it as a component instead of developing an RM from scratch.

Mendeley [3] was developed in 2008 by a London-based startup. Its significance lies in its collaborative and networking features and in the way it facilitates the management of PDF files. Mendeley uses support vector machines (SVMs) to extract metadata from articles. It provides desktop and web versions with synchronized citation information, offering access from multiple computers as well as collaboration and sharing with other users.

Papers [8] is a commercial RM developed in 2007 by Springer. Its main strengths, which distinguish it from other RMs, are its management of PDF manuscripts, including extraction of metadata, and its refined user interface; its collaboration features are less developed than those of several other products. Papers uses the Citation Style Language for citing references and provides a word processor plug-in.

Zotero [5] is another popular RM, developed in 2006 by George Mason University's Center for History and New Media (CHNM). It is a free, open source plug-in developed for the Firefox browser. It allows users to save citation information to their library and to the Zotero server without navigating away from the web page.

B. Information Extraction Tools

A number of tools, publications, web services and data sets are wholly or partly concerned with extracting information from scholarly articles; they are listed here as a point of reference for anyone interested in exploring the topic. Most of these tools focus on extracting metadata and citations from articles. The most commonly used publicly available extraction tools are SVM Header Parse [14], Grobid [15], ParsCit [20], Docear's PDF Inspector [16], Mendeley [3], PDFMeat [17], SciPlore Xtract [18] and HMM Metadata Extractor [19]. SVM Header Parse [14] is a metadata extractor based on SVMs and is part of the CiteSeerX [13] package. Grobid [15] and ParsCit [20] extract metadata and citations using conditional random fields (CRFs). PDFMeat [17] fetches appropriate terms from an article and then queries Google Scholar [12] to retrieve metadata. SciPlore Xtract [18] reads article information based on a stylistic analysis of XML. HMM Metadata Extractor [19] is a citation parsing tool based on hidden Markov models.

C. Comparison and Analysis of Reference Managers

Table I summarizes the feature comparison of the reference managers described above.

TABLE I. ANALYSIS TABLE OF REFERENCE MANAGERS

Features                                    Citavi  EndNote  Docear  JabRef  Mendeley  Papers  Zotero
Database Search                             D       U        D       D       U         D       U
Data Import                                 D       U        D       D       D         D       D
Export from Databases                       D       D        D       D       D         D       D
Capturing of Metadata from Web Pages        D       U        D       U       D         D       D
Full Text Search                            D       U        D       U       D         D       D
Support User Created Doc Type               U       U        D       D       U         U       U
Support User Generated Fields               D       U        D       D       U         U       U
Generation of Indices                       D       D        D       D       D         D       D
Completion of Metadata                      D       U        D       D       D         D       D
Support Inter-linking of Refs               D       D        U       D       U         U       D
Support Linking/Integration of Docs         D       U        D       U       D         D       D
Editing of PDF Docs                         U       D        D       U       D         D       U
Duplicate Checking                          D       U        D       D       D         D       D
Global Changes                              D       D        D       D       D         D       U
Creation and Allocation of Folders/Groups   D       D        D       D       D         D       D
Support Full or Customize View of Refs      D       D        D       D       D         D       D
Searching and Sorting                       D       D        D       D       D         D       D
Sharing of Refs                             U       U        D       U       D         D       D
Collaboration                               D       D        D       U       D         D       D
Social Networking                           U       U        U       U       D         U       D
Support Generation and Export of Refs       D       D        D       D       D         D       D
Support Citation Styles                     D       D        D       D       D         D       D
Word Processor Integration                  D       D        D       D       D         D       D

a. D = Yes, U = No

D. Analysis Conclusion

The analysis table shows that although JabRef supports direct search and download from major databases such as PubMed, IEEE Xplore and the ACM Digital Library, this facility is limited to BibTeX entries; it does not support viewing or downloading full-text articles from external resources. Another drawback is that JabRef lets users link an article to each citation entry, but only for articles that are already available on the hard drive. JabRef is a widely accepted RM in the research community. Given its importance in the management of scholarly content and the existence of a range of information extraction and processing tools, there is a need to overcome these disadvantages and to make the best possible use of the extraction tools mentioned above. This paper presents the development of an add-on that can be easily incorporated into JabRef. The add-on supports downloading not only BibTeX entries but also their full-text articles from external resources; parsing these articles and linking them with the relevant bibliographic entry; automatic extraction of the title, metadata and abstract; and fetching In-Cite and Out-Cite information relevant to the downloaded BibTeX entry, as required by the user.

Fig. 1 Comparison of current and proposed approach

III. PROPOSED EXTENSION TO JABREF

A. Software Architecture

The software architecture of the proposed add-on is shown in Fig. 2. It covers only the import component of JabRef but all components of the proposed add-on, and presents an overview of the interaction and data flow between the two. The main medium of communication between JabRef and the proposed extension is the database. JabRef shares the URL of an article with the add-on, which comprises an article downloader, viewer, parser, extractor and citation fetcher. After performing the necessary steps (parsing, metadata extraction, content extraction and citation parsing), the add-on stores all data in the database, where JabRef can access it and present the results to end users.
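To make this data flow concrete, the following is a minimal sketch rather than the add-on's actual code. It assumes an in-memory `ConcurrentHashMap` as a stand-in for the shared database, a placeholder `process` step in place of real parsing, and a hypothetical key scheme (`mainKey:out:index`) that ties each stored record to the main entry. The thread pool reflects the multithreaded citation processing described in the workflow section; the class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedStoreSketch {

    // Hypothetical placeholder for the real per-citation work (parsing, metadata lookup).
    public static String process(String rawCitation) {
        return rawCitation.trim();
    }

    // Processes every citation on a fixed-size thread pool and collects the results
    // in a thread-safe map that stands in for the database shared with JabRef.
    public static Map<String, String> processAll(String mainKey, List<String> citations, int threads) {
        Map<String, String> store = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < citations.size(); i++) {
            final int index = i;
            // The key links each Out-Cite record back to the main entry.
            pool.execute(() -> store.put(mainKey + ":out:" + index, process(citations.get(index))));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("citation tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for citation tasks", e);
        }
        return store;
    }

    public static void main(String[] args) {
        Map<String, String> store =
                processAll("smith1980", Arrays.asList(" first citation ", "second citation"), 2);
        System.out.println(store.size() + " records stored");
    }
}
```

Writing results only to a shared store keeps the two sides decoupled: the reader of the store does not need to know how many worker threads produced the records or in which order they finished.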

TABLE II. CITATION PARSING PERFORMED BY PARSCIT

Input citation:
R. G. Smith. “The contract net protocol: High-level communication and control in a distributed problem solver”. IEEE Transactions on Computers, C-29(12):1104–1114, Dec 1980.

XML output:
R G Smith 29 12 1980 1104-1114 The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers

BibTeX output:
@Article{d1e5,
  author="Smith, R. G.",
  title="The contract net protocol: High-level communication and control in a distributed problem solver.",
  journal="IEEE Transactions on Computers,"
}
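The BibTeX output in Table II carries only the author, title and journal, although the input citation also gives the volume, issue, pages and year. The sketch below illustrates one way such a partial entry could be completed from a second source and rendered again. It is an assumption-laden illustration: the class `BibtexCompletion` and its methods are hypothetical, the "extra" fields are simply taken from the XML output in Table II, and no real external service is called.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BibtexCompletion {

    // Copies fields from `extra` into a copy of `partial`, but only where
    // `partial` has no value yet, so already parsed fields are never overwritten.
    public static Map<String, String> complete(Map<String, String> partial, Map<String, String> extra) {
        Map<String, String> merged = new LinkedHashMap<>(partial);
        for (Map.Entry<String, String> e : extra.entrySet()) {
            merged.putIfAbsent(e.getKey(), e.getValue());
        }
        return merged;
    }

    // Renders a single @Article entry; fields appear in insertion order.
    public static String toBibtex(String key, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("@Article{" + key + ",\n");
        int i = 0;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("  ").append(e.getKey()).append(" = {").append(e.getValue()).append("}");
            sb.append(++i < fields.size() ? ",\n" : "\n");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // Fields present in the BibTeX output of Table II.
        Map<String, String> partial = new LinkedHashMap<>();
        partial.put("author", "Smith, R. G.");
        partial.put("title", "The contract net protocol: High-level communication and control in a distributed problem solver");
        partial.put("journal", "IEEE Transactions on Computers");

        // Fields that the XML output of Table II contains but the BibTeX output lacks.
        Map<String, String> extra = new LinkedHashMap<>();
        extra.put("volume", "29");
        extra.put("number", "12");
        extra.put("pages", "1104-1114");
        extra.put("year", "1980");

        System.out.println(toBibtex("d1e5", complete(partial, extra)));
    }
}
```

Using `putIfAbsent` means that a value already obtained from the citation parser always wins over a value from the second source; a real implementation might instead prefer whichever source is considered more reliable.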

Fig. 2 Architecture of Proposed Extension of JabRef

B. Workflow of Proposed Extension of JabRef

The first step in the workflow of the proposed add-on is obtaining the scholarly article, which at that point may exist only on an external repository and needs to be downloaded and stored on the hard drive of the user's computer. Once the article is downloaded, the add-on parses the file using the developed technique and automatically extracts the basic parts of its content, such as the abstract, title, introduction, citations and the Digital Object Identifier (DOI) on the first page of the manuscript, and retrieves metadata from external repositories. The ParsCit [20] citation parser parses every citation in the downloaded article and returns not only the parsed citations but also their corresponding BibTeX entries. An example of the results returned by ParsCit after citation parsing is given in Table II.
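As an illustration of the DOI extraction step, the sketch below searches the text of a first page with a regular expression. The paper does not give the expression it uses, so the pattern here is an assumption (the prefix `10.`, a registrant code of four to nine digits, a slash, and a suffix without whitespace), and the class name `DoiExtractor` and the sample DOI are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoiExtractor {

    // Assumed pattern: "10." + 4 to 9 digits + "/" + a suffix containing no whitespace.
    private static final Pattern DOI = Pattern.compile("\\b10\\.\\d{4,9}/\\S+");

    // Returns the first DOI-like token in the text, if there is one.
    public static Optional<String> firstDoi(String pageText) {
        Matcher m = DOI.matcher(pageText);
        if (!m.find()) {
            return Optional.empty();
        }
        // Strip punctuation that often directly follows a DOI in running text.
        return Optional.of(m.group().replaceAll("[.,;)\\]]+$", ""));
    }

    public static void main(String[] args) {
        // The DOI below is made up for the example.
        String firstPage = "Digital Object Identifier 10.1234/example.5678.";
        System.out.println(firstDoi(firstPage).orElse("no DOI found"));
    }
}
```

Returning an `Optional` makes the "no DOI on the first page" case explicit, which matters because many older manuscripts carry no identifier at all.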

In most cases, the citation information that ParsCit provides in BibTeX form is not enough to fully describe the article or journal; Table II shows an example of this problem. To complete the necessary details of a BibTeX entry, the proposed add-on looks up further details of that entry from external services. Parsing the many citations in an article takes a long time in a single-threaded environment, so a multithreaded approach is adopted to improve efficiency and reduce delay. After gathering all the information for the Out-Cite entries, the add-on fetches the In-Cite entries from external services using the cited-by details of the main BibTeX entry. Once all In-Cite and Out-Cite information has been downloaded, linking is performed to give an overview of the relationship between the main article and other relevant articles. For this purpose, a unique key identifier is incorporated in every downloaded BibTeX entry. When linking has completed successfully, all the information is stored in the database, from where JabRef can access it and display it to the user in an understandable form. Regular expressions and string matching are used to extract content from the parsed scholarly article and to search for the relevant BibTeX entry URL in the content fetched automatically from external resources. A detailed description of the workflow is presented in Fig. 3.

Fig. 3 Workflow of Proposed Extension of JabRef

IV. IMPLEMENTATION

Because JabRef is developed in Java, the proposed add-on is also developed in Java to avoid compatibility issues. Different Java libraries are used for parsing, viewing and storing information in the database. The icepdf-viewer library is used for viewing the downloaded article. After an article has been downloaded successfully, the Apache Tika Java library is used to extract its metadata and to store its entire content in a plain text file. This text file is then passed on the command line to ParsCit, a Perl-based content extractor and citation parser, which extracts the different parts of the article. Fig. 4 provides screenshots that give a general idea of the difference in functionality between the existing system and the proposed extension.

Fig. 4 Screenshots of JabRef Article Extractor and Processor. Panels 4a, 4c and 4e show the output provided by JabRef before incorporating the proposed extension; panels 4b, 4d and 4f show the output after incorporating it.

Fig. 4a: Existing JabRef Fetcher Preview Dialogue displaying links for all entries fetched from external resources
Fig. 4b: JabRef Fetcher Preview Dialogue modified by the add-on, displaying links along with the article (PDF) for every entry fetched from external resources
Fig. 4c: Existing JabRef Fetcher Preview Dialogue displaying selection of an entry for downloading
Fig. 4d: JabRef Fetcher Preview Dialogue modified by the add-on, displaying the full-text article and selection of an entry for downloading
Fig. 4e: Existing JabRef Import Inspection Dialogue displaying BibTeX key generation after downloading
Fig. 4f: JabRef Import Inspection Dialogue modified by the add-on, displaying BibTeX key generation after downloading (the PDF icons indicate that the article was downloaded and parsed successfully)

V. CONCLUSION AND FUTURE WORK

This paper proposes article parsing, metadata extraction, and content and citation writing facilities which, together with existing tools and standards, make it possible to construct an efficient and well-organized workflow for the management of scholarly content. The goal of the developed add-on is to bring the viewing, downloading and handling of scholarly articles closer to end users, and to recognize that the content, metadata and bibliographic information of articles are central elements of academic objects that can be reused automatically in digital libraries and repositories of scholarly literature.
Future work includes refining, testing and extending other components of JabRef, such as collaboration and social networking. JabRef has no native way to access its bibliographic database from another computer, which can be frustrating for users who do not always work on the same machine. It is also recommended that the community of RM developers incorporate PDF management, metadata extraction, bibliographic information parsing, and fetching and linking of In-Cite and Out-Cite information into their products.

ACKNOWLEDGEMENT

The authors would like to thank the JabRef team and Cody Dunne, Research Scientist at IBM Watson Lab, for discussion and guidance on this work.

REFERENCES

[1] JabRef Development Team, “JabRef,” 2014. [Online]. Available: http://jabref.sourceforge.net
[2] J. Beel, B. Gipp, S. Langer, and M. Genzmehr, “Docear: An Academic Literature Suite for Searching, Organizing and Creating Academic Literature,” in Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, ACM, 2011, pp. 465–466.
[3] Mendeley Ltd, “Mendeley,” http://www.mendeley.com/, July, 2014.
[4] Thomson Reuters, “EndNote,” http://www.endnote.com/, July, 2014.
[5] Center for History and New Media, “Zotero,” http://www.zotero.org/, July, 2014.
[6] RefWorks COS, “RefWorks,” http://www.refworks.com/, July, 2014.
[7] Swiss Academic Software, “Citavi,” http://wwww.citavi.com, July, 2014.
[8] A. Griekspoor and T. Groothuis, “Papers,” http://wwww.papersapp.com, July, 2014.
[9] C. Dunne, B. Shneiderman, R. Gove, J. Klavans, and B. Dorr, “Rapid understanding of scientific paper collections: integrating statistics, text analysis, and visualization,” University of Maryland, Human-Computer Interaction Lab Tech Report HCIL-2011, 2011.
[10] M. Fenner, K. Scheliga, and S. Bartling, “Reference Management,” in Opening Science, S. Bartling and S. Friesike, Eds. London: Springer International Publishing, 2014, pp. 125–137.
[11] S. Polyakov and W. E. Moen, “Expanding Metadata Reuse with an Islandora Metadata Extraction Utility,” in Proc. Int. Conf. Open Repositories, 2013.
[12] Google, “Google Scholar,” http://scholar.google.com/, July, 2014.
[13] C. L. Giles, K. D. Bollacker, and S. Lawrence, “CiteSeer: an automatic citation indexing system,” in Proc. ACM Conf. Digital Libraries, 1998, pp. 89–98.
[14] SVM Header Parse, “SVM Header Parse,” http://sourceforge.net/projects/citeseerx/, July, 2014.
[15] Grobid, “Grobid,” https://github.com/kermitt2/grobid, July, 2014.
[16] J. Beel, S. Langer, M. Genzmehr, and C. Müller, “Docears PDF Inspector: Title Extraction from PDF files,” in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’13), 2013, pp. 443–444.
[17] PDFMeat, “PDFMeat,” http://code.google.com/p/pdfmeat/, July, 2014.
[18] SciPlore Xtract, “SciPlore Xtract,” http://sciplore.org/, July, 2014.
[19] HMM Metadata Extractor, “HMM Metadata Extractor,” http://gales.cdlib.org/~egh/hmm-citation-extractor/, July, 2014.
[20] ParsCit, “ParsCit,” http://aye.comp.nus.edu.sg/parsCit/, July, 2014.
