2010 International Conference on Education and Management Technology (ICEMT 2010)
GATE Framework Based Metadata Extraction from Scientific Papers

Tin Huynh, Kiem Hoang
Department of Computer Science, University of Information Technology - Vietnam National University, HCM City
6 Quarter, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
Email: {tinhn, [email protected]}
Abstract— In this paper we propose a method to automatically extract metadata (title, authors, affiliation, email, references, etc.) from scientific papers by combining the layout information of the papers with rules defined using the JAPE Grammar of GATE (http://gate.ac.uk/). After the metadata has been extracted automatically from the digital documents, users can review and correct it before it is exported to XML files. A tool that extracts metadata from digital documents is very useful for building collections and for organizing and searching documents in digital libraries. The extraction method is tested on computer science paper collections selected from international journals and proceedings downloaded from digital libraries such as ACM, IEEE, Springer and CiteSeer.

Keywords: information extraction, metadata, automation.

I. INTRODUCTION
Nowadays, most electronic materials in digital libraries are available and searchable on the internet. After just a few years of activity, digital libraries hold large collections of books, theses, journals, papers, etc. of various categories, formats and topics. Recently, many universities, organizations and research groups have therefore invested in projects for designing digital libraries, finding standards and methods, and building software systems and tools that support organizing and managing the document collections in digital libraries efficiently [1][2][9]. According to [3], the data exchange standard on the internet approved by the national standards organization of the United States to replace the older, no longer suitable standards is ANSI/NISO Z39.85-2001. This standard describes 15 data fields and is also known as the Dublin Core Metadata (http://dublincore.org/). These fields are very useful and widely used to accompany digitized documents exchanged over the internet. Extracting and creating metadata for electronic documents helps to arrange documents in a systematic way and supports users in searching them easily. Creating metadata manually is a time-consuming task: according to [10], it would take one person 60 years to create metadata for 1 million documents.
The goal of our research is to find methods and build tools that identify the metadata elements of electronic documents. We then use the extracted metadata to find relationships between the documents in a collection. For example, based on the extracted metadata of a specific paper we can recognize the other papers in which that paper is referenced. This information can be used to assign a score to each document, and this score can be used to rank documents during search. Moreover, the extracted metadata can be used to enrich a domain ontology. To implement this idea, we are building an ontology of computer science papers (CSPOnt); this is the core concept of our research, and the extracted metadata is used to enrich that ontology. In this paper we therefore present an approach that extracts metadata from scientific articles based on layout information and on rules built from patterns using the JAPE Grammar and the ANNIE plug-in of GATE, and we develop a software tool that extracts the metadata automatically.
In section 2 we briefly survey current related research on metadata extraction from electronic documents and its applications in digital libraries. Section 3 presents our approach, the system architecture for metadata extraction, and the rules defined with the JAPE Grammar and the ANNIE plug-in of GATE. The experimental results and their evaluation are presented in section 4. The final section concludes and discusses our current research and future work.
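For concreteness, the 15 elements of the Dublin Core Metadata Element Set mentioned above can be enumerated directly in code; the following is a small illustrative sketch (the record values are hypothetical examples, not part of the paper):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DublinCore {
    // The 15 elements of the Dublin Core Metadata Element Set.
    static final List<String> ELEMENTS = List.of(
            "title", "creator", "subject", "description", "publisher",
            "contributor", "date", "type", "format", "identifier",
            "source", "language", "relation", "coverage", "rights");

    // Build a metadata record, keeping only recognized Dublin Core elements.
    static Map<String, String> record(Map<String, String> fields) {
        Map<String, String> r = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (ELEMENTS.contains(e.getKey())) {
                r.put(e.getKey(), e.getValue());
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Map<String, String> r = record(Map.of(
                "title", "GATE Framework Based Metadata Extraction",
                "creator", "Tin Huynh; Kiem Hoang",
                "pagecount", "4")); // not a Dublin Core element: dropped
        System.out.println(ELEMENTS.size());
        System.out.println(r.containsKey("pagecount"));
    }
}
```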
II. RELATED WORK
Automatic metadata extraction is an information extraction task. Automatic metadata extraction methodologies can be classified into two main categories: machine learning methods [4][5][7][11] and methods based on rules combined with dictionaries and ontologies [8][10][12]. According to [5], machine learning approaches to information extraction include symbolic learning, inductive logic programming, grammar induction, Support Vector Machines (SVM), Hidden Markov Models (HMMs), and statistical methods. Machine learning methods for metadata extraction have achieved impressive results. In [5], the authors suggested using SVM for automatic metadata extraction. Their extraction process consists of two phases: first, an SVM classifies each line of the paper's heading into one or more of 15 classes associated with the Dublin Core Metadata Standard; then metadata is extracted from the classified lines by finding the best metadata boundaries within each line, using rules and formatting cues such as punctuation and capital letters combined with dictionaries. They compared their method with other machine learning approaches in terms of precision, recall, accuracy and F-measure, and their experiments showed that SVM outperforms the other machine learning
978-1-4244-8618-2/10/$26.00 © 2010 IEEE
methods. In [7], the authors also performed automatic metadata extraction, using Conditional Random Fields (CRF), and their approach gave results comparable to the SVM of [5]. In [11], the authors built a package named PDF2gsdl that extracts the title and authors from PDF papers; this package can be combined with the Greenstone software (http://www.greenstone.org/) to create metadata automatically for documents and collections in digital libraries [1][2]. The approach of [11] is based on machine learning: a neural classifier is trained using layout information, position, font size and neighboring areas as features. The accuracy is about 93% for the title field and 70% for the author field, in an experiment on 45 papers from one conference proceeding. In [5][7] the precision ranges from 86% to 99%, the recall from 45% to 100%, and the accuracy from 96% to 100% (depending on the metadata field).
Although these machine learning methods for metadata extraction from research papers achieve rather impressive results, generating the labeled training data they require is rather expensive. Therefore, there is also recent research on metadata extraction based on rules, patterns and ontologies [6][8][10][12]. In [6], the authors suggested a method to extract the logical structure (title, headings, authors, page headers, footnotes, definitions, sections, theorems, etc.) from articles in mathematics, and then proposed a mathematical knowledge browser that helps people read mathematical documents more easily. Their meta-information and logical structure extraction algorithm has two steps: first, the areas of a page (page numbers, running headers, captions of tables and figures, footnotes and headings) are segmented using spacing, style differences and keywords; then the appropriate tags (meta-information) are assigned to the segmented areas based on layout, position and style information. They tested it on 29 mathematical papers and the accuracy is about 93%. In [8], the authors proposed a method to enrich an Artist ontology by extracting related information, such as date of birth, place of birth, affiliation, marriage date, and the artist's history, from search results on the internet. To do so, they used GATE annotations to recognize location, person and date-time components, combined with the Artequakt ontology (Concept-Relation-Concept) [8] to recognize the relationships between the different entities (location, person, date-time) in a sentence. In [10], the authors proposed a two-step metadata extraction process for heterogeneous collections: first, a new document is classified and assigned to a group of documents with a similar layout; second, a template containing a set of rules designed to extract the metadata is associated with each class of similar-layout documents.
III. OUR APPROACH
Our approach in this paper is a variant of pattern- and rule-based methods. We use the Apache PDFBox library included in GATE to process the layout and style of PDF documents. In our approach, metadata is extracted based on font styles (size, bold, italic, etc.) and position, combined with keywords, with patterns defined using the JAPE Grammar, and with the existing NE (named entity) dictionaries of GATE.

A. The system architecture

[Figure 1: The pattern-based metadata extraction system architecture. The pipeline runs from downloading PDF papers, building a corpus of PDFs, and annotating font styles, through metadata extraction driven by the metadata patterns and the existing dictionaries and rules of the ANNIE plug-in, to user interaction, per-paper metadata exported as XML, and enrichment of the CSPOnt ontology.]

In Figure 1 above, the scientific papers are first downloaded from the internet; we then review and categorize them based on their layout information in order to define the patterns and rules for metadata extraction (title, authors, abstract, affiliation, email, references). With the GATE APIs our system can read many different digital document formats, such as *.doc, *.pdf, *.html, *.xml and *.rtf, but for now we only consider PDF papers. In the next step we build the corpus of downloaded PDF papers and annotate the font styles of these PDFs, i.e., we convert the PDF files into text files in the corpus and keep their original font styles by using the GATE APIs. We use the ANNIE plug-in of GATE (http://gate.ac.uk/ie/annie.html) and inherit the existing dictionaries, ontologies and rules of ANNIE, which help us recognize named entities (Location, Person, Organization, Date-Time).
GATE is distributed with an information extraction system called ANNIE (A Nearly-New Information Extraction system). ANNIE includes components that support information extraction, and it provides the JAPE grammar for defining new rules; i.e., we can define our metadata extraction patterns on top of the existing dictionaries and rules of ANNIE by using the JAPE Grammar. Our rules are defined by combining font style, position and the metadata extraction JAPE patterns. The extracted metadata can be corrected and validated by users before it is exported to XML files. The metadata can then be used to organize e-documents in digital libraries or to enrich a domain ontology, which is what we will do in the next step.
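As an aside, the intent of a JAPE line-classification pattern such as the email-line rule in section III-B can be approximated outside GATE with a plain string check; the following is a hypothetical Java sketch of that idea, not the actual JAPE implementation (the sample address is invented):

```java
import java.util.regex.Pattern;

public class EmailLineClassifier {
    // Rough stand-in for GATE's Address annotation with kind == "email":
    // something shaped like local-part@domain.tld.
    private static final Pattern EMAIL = Pattern.compile("\\S+@\\S+\\.\\S+");

    // Mirrors the two alternatives of the JAPE rule: a bare '@' token,
    // or a recognized email address, anywhere in the line.
    static boolean isEmailLine(String line) {
        return line.contains("@") || EMAIL.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(isEmailLine("Email: {alice, bob}@example.edu"));
        System.out.println(isEmailLine("Department of Computer Science"));
    }
}
```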
B. Metadata extraction algorithm

+ Title (Size, Position): the title of the paper is the annotation with the largest font size whose position is in the heading of the paper (on the first page, and before the "Abstract" keyword if that keyword exists).

+ Abstract (Keyword, Position): the "Abstract" keyword should appear on the first page, with one or more control characters (e.g. new-line characters) before it. The character "." may or may not follow the keyword, and the next tokens are zero or more space tokens or control characters (see Listing 1).

LISTING 1: JAPE RULE TO IDENTIFY THE ABSTRACT KEYWORD

Phase: extractAbstractWord
Input: Token SpaceToken
Rule: AbstractWord
Priority: x
(
  ({SpaceToken.kind=="control"})+
  ({Token.string=="Abstract"} | {Token.string=="ABSTRACT"})
  ({Token.string=="."})?
  ({SpaceToken.kind=="space"} | {SpaceToken.kind=="control"})*
):abstract_word
-->
:abstract_word.AbstractWord = {rule = "AbstractWord"}

+ Authors, Affiliation, Emails: the extraction process includes two steps.

• Step 1: classify lines into different classes (affiliation_lines, email_lines, author_lines) by combining our defined patterns with the existing dictionaries and rules of GATE that recognize Organization, Location, Person, Email and Address. Listing 2 below is an example pattern that identifies an email line: a line that contains the '@' symbol, or an email address identified by the existing rules of GATE (Address.kind == "email"), with zero or more tokens and space tokens before and after the '@' symbol. Other patterns to identify affiliation_lines and author_lines are produced similarly.

• Step 2: similar to the rules in Listing 1 and Listing 2, we define specific patterns to extract the authors, affiliations and emails from the candidate lines identified in Step 1 (affiliation_lines, email_lines, author_lines).

LISTING 2: JAPE RULE TO IDENTIFY AN EMAIL LINE

Phase: emailLine
Input: Token SpaceToken Address
Rule: EmailLine
Priority: x
(
  (
    {Token}
    ({SpaceToken.kind=="space"})?
  )*
  (
    {Token.string=="@"} |
    {Address.kind=="email"}
  )
  ({SpaceToken.kind=="space"})?
  (
    {Token}
    ({SpaceToken.kind=="space"})?
  )*
  ({SpaceToken.kind=="control"})
):emailLine
-->
:emailLine.EmailLine = {rule = "EmailLine"}

+ References (Keyword, Position, Neighbor-Area): the extraction process also includes two steps.

• Step 1: define a JAPE rule to identify the "References" keyword. It should be either in all capital letters or with only the first letter capitalized, with one or more control characters before and after it.

• Step 2: define a JAPE rule to identify the beginning and the ending position of each reference within the reference section.
A single reference is matched by the pattern (control character, [one or more tokens] or (numbers) or {Person}, control character), where a control character is a line break in this case.

C. Metadata extraction software tool
We developed a tool that downloads papers from the internet or from digital libraries. The downloaded papers are used to create a corpus via the GATE APIs. At present we focus only on papers in PDF format. After the corpus is
created, the tool annotates the font styles of the papers in the corpus. At the same time, the system loads the predefined metadata patterns to annotate the parts related to metadata, such as the abstract keyword, the reference keyword, author lines, affiliation lines and email lines, and then extracts the metadata. The tool provides a graphical user interface, so users can view and correct the extracted metadata before exporting it to an XML file that can be used by other research and applications. The tool is written in Java, so it runs on any platform, such as Windows or Linux.

IV. EXPERIMENTAL EVALUATION
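The evaluation in this section uses precision, recall and F-measure; the following is a self-contained sketch of these standard computations (illustrative only, not the authors' evaluation code; the counts in main are hypothetical):

```java
public class Metrics {
    // Precision: fraction of extracted items that are correct.
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    // Recall: fraction of true items that were extracted.
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // F-measure: harmonic mean of precision and recall.
    static double fMeasure(double p, double r) {
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical counts for one metadata field.
        double p = precision(95, 5);
        double r = recall(95, 10);
        System.out.printf("P=%.4f R=%.4f F=%.4f%n", p, r, fMeasure(p, r));
    }
}
```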
For our experiments we downloaded computer science papers from the CiteSeer digital library (http://citeseer.ist.psu.edu/) and ran the metadata extraction with our software tool on 200 papers. We used precision (P), recall (R) and F-measure (F) to evaluate the performance of our approach and tool. Following [5], we define these measures as:

P = tp / (tp + fp)    (1)
R = tp / (tp + fn)    (2)
F = 2 × P × R / (P + R)    (3)

(tp: true positives; fp: false positives; fn: false negatives)

TABLE 1: THE EXPERIMENTAL RESULTS FOR METADATA EXTRACTION

Metadata     Precision (%)  Recall (%)  F-Measure (%)
Title        100.00         100.00      100.00
Authors      92.72          89.47       91.07
Affiliation  95.83          92.00       93.87
Email        100.00         100.00      100.00
Abstract     96.55          93.33       94.92
References   97.44          88.05       92.51

The experimental results in Table 1 show that our approach achieves impressive results, which are rather high and comparable with other methods.

V. CONCLUSION AND FUTURE WORK
In this paper we have presented an approach for automatic metadata extraction from scientific PDF papers based on rules and patterns. We built on previous research and existing tools related to information extraction, and we implemented a tool that extracts metadata from scientific papers automatically. Based on the experiments in section 4, our approach achieves impressive results (see Table 1), which are rather high and comparable with other methods. The approach is simple to implement, and its performance can be improved by defining better patterns and rules. The drawback is that rules and patterns must be designed carefully: we surveyed many paper templates, which is a time-consuming task that requires domain knowledge. In the near future we will combine machine learning with this pattern-based method to improve accuracy, and we will extract other metadata based on the Dublin Core Metadata Standard. We will also define patterns and rules to extract metadata inside the reference sections of papers and the relationships among the extracted metadata. This will help users to know whether a paper is cited by others, and to check whether a reference in a given paper actually exists on the internet. It will also help to build and enrich an ontology of computer science papers (CSPOnt) that can be used for other research and applications, such as semantic search, document retrieval and question answering systems in digital libraries.

REFERENCES
[1] D. Bainbridge, J. Thompson, and I. Witten, "Assembling and enriching digital library collections," in Proc. Joint Conference on Digital Libraries, pages 323–334, 2003.
[2] D. Bainbridge, K. J. Don, G. R. Buchanan, I. H. Witten, S. Jones, M. Jones, and M. I. Barr, "Dynamic digital library construction and configuration," in Proc. European Conference on Digital Libraries, pages 1–16, 2004.
[3] http://www.nlv.gov.vn/nlv/index.php/en/2008060697/DUBLINCORE/XML-Metadata-va-Dublin-Core-Metadata.html
[4] K. Seymore, A. McCallum, and R. Rosenfeld, "Learning hidden Markov model structure for information extraction," in AAAI Workshop on Machine Learning for Information Extraction, 1999.
[5] H. Han, C. L. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. A. Fox, "Automatic document metadata extraction using support vector machines," in Proc. 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 37–48, IEEE Computer Society Press, Washington, DC, 2003.
[6] K. Nakagawa, A. Nomura, and M. Suzuki, "Extraction of logical structure from articles in mathematics," MKM, LNCS 3119, pages 276–289, Springer Berlin Heidelberg, 2004.
[7] F. Peng and A. McCallum, "Accurate information extraction from research papers using conditional random fields," Information Processing and Management, pages 963–979, 2006.
[8] H. Alani, S. Kim, D. E. Millard, M. J. Weal, P. H. Lewis, W. Hall, and N. R. Shadbolt, "Automatic extraction of knowledge from web documents," in 2nd International Semantic Web Conference Workshop on Human Language Technology for the Semantic Web and Web Services, October 20–23, Sanibel Island, Florida, USA, 2003.
[9] J. Greenberg, K. Spurgin, and A. Crystal, "Final report for the Automatic Metadata Generation Applications (AMeGA) project," UNC School of Information and Library Science, http://ils.unc.edu/mrc/amega/, 2005. Last visited 30/04/2010.
[10] P. Flynn, L. Zhou, K. Maly, S. Zeil, and M. Zubair, "Automated template-based metadata extraction architecture," ICADL 2007, LNCS 4822, pages 327–336, Springer-Verlag Berlin Heidelberg, 2007.
[11] S. Marinai, "Metadata extraction from PDF papers for digital library ingest," in Proc. 10th International Conference on Document Analysis and Recognition (ICDAR), IEEE, pages 251–255, 2009.
[12] B. A. Ojokoh, O. S. Adewale, and S. O. Falaki, "Automated document metadata extraction," Journal of Information Science, pages 563–570, 2009.