DERI – DIGITAL ENTERPRISE RESEARCH INSTITUTE

A DOCUMENT ENGINEERING APPROACH TO AUTOMATIC EXTRACTION OF SHALLOW METADATA FROM SCIENTIFIC PUBLICATIONS

Tudor Groza 1

Siegfried Handschuh 2    Ioana Hulpus 3

DERI TECHNICAL REPORT 2009-06-01, JUNE 2009

DERI Galway IDA Business Park Lower Dangan Galway, Ireland http://www.deri.ie/


Abstract. Semantic metadata can be considered one of the foundational blocks of the Semantic Web and the Semantic Desktop. This report describes a solution for automatic metadata extraction from scientific publications published as PDF documents. The proposed algorithms follow a low-level document engineering approach, combining mining and analysis of the publications' text based on its formatting style and font information. We evaluate the algorithms and compare their performance to that of similar approaches. In addition, we present a sample application that represents a use-case for the metadata extraction algorithms.

Keywords: semantic metadata, document engineering, font mining.

1 DERI (Digital Enterprise Research Institute), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland. E-mail: [email protected].
2 DERI (Digital Enterprise Research Institute), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland. E-mail: [email protected].
3 Cyntelix Corp., Galway University Road, Galway, Ireland. E-mail: [email protected].

Acknowledgements: The work presented in this report has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2). The authors would like to thank Alexander Schutz for his support and fruitful discussions.

Copyright © 2009 by the authors


Contents

1 Introduction
2 Metadata study
   2.1 Metadata usefulness
   2.2 Metadata formats and storage
   2.3 Incentives for metadata creation
   2.4 Conclusions
3 Metadata extraction process
   3.1 First page processing phase
   3.2 Full content processing phase
      3.2.1 Sections extraction
      3.2.2 References extraction
4 Evaluation
5 Applications
   5.1 Metadata Extraction Service
   5.2 Personal Scientific Publication Assistant
6 Related Work
7 Conclusion

1 Introduction

The aim of the Semantic Web and the Semantic Desktop is to integrate and link personal and social information by using ontologies and machine-understandable data (or semantic metadata). Thus, semantic metadata can be considered one of the foundational blocks of the Semantic Web and Desktop. Our interest is focused in particular on semantic metadata in the area of scientific publications 1. Analyzing its current status in terms of creation (generation) and use, we observe that although its role is well understood and agreed upon, it is still not well enough supported.

With the emergence of the Linked Open Data (LOD) 2 initiative, an increasing number of data sets were published as linked (meta)data. This led to an important boost in terms of data being openly published, and inherently created. Consequently, a good (side) effect was the community's growth, based on the attention received from other communities with slightly overlapping directions. In the case of scientific publications, the metadata published in external repositories (part of the LOD cloud or not) leaves room for improvement. Efforts like the one started by Möller et al. [11] represent pioneering examples. The Semantic Web Dog Food Server they initiated stores metadata extracted from publications from the International and European Semantic Web conferences (and later other Semantic Web events). Still, in order to really make a difference, such efforts should be adopted at a larger scale.

In terms of the method of creation, the most common is a combined manual and automatic process. Usually, the authors are required to provide the necessary details about their publications via the conference submission systems. The data produced here is then processed (more or less) automatically and transformed into semantic metadata. The first step of this process is extremely important, as it provides the authors' incentive for the creation of the metadata, i.e. they are required to do it.
On the other side, considering the metadata embedded in the publications 3, in order to get a (close to) real picture of its status, we performed a small study. We gathered a corpus of around 1400 publications (more precisely 1367) from conferences like the European and International Semantic Web Conferences (ESWC and ISWC), the World Wide Web Conference (WWW), the International Conference on Knowledge Capture (K-Cap), the European Conference on Knowledge Acquisition (EKAW) and others, from different years. Our main reason for choosing these conferences, as opposed to others, was that one would expect to find publications containing metadata in their case (considering that the Semantic Web is present at least as a main track, if not as the actual focus). The documents forming our corpus were collected directly from the conferences' web sites, and not from arbitrary web locations (e.g. personal web-sites). This aspect is especially important, because in most cases the camera-ready versions were post-processed by the proceedings' publisher, and thus are consistent and uniform in terms of encoding and metadata content. The results of the analysis were:

• 48.06% of the publications have no metadata at all,
• 31.67% have some form of metadata, usually the title or one of the authors, but in most cases it is not usable, while only
• 20.26% have clean metadata containing the title and all the authors.

In conclusion, more than three quarters of the corpus had no metadata or unusable metadata.

1 By metadata we mean the shallow metadata of the publication, represented by title, author, abstract, keywords or references.
2 http://linkeddata.org/
3 Here, we consider only the publications published as PDF documents, as the PDF file format supports embedded metadata using XMP and a series of simple Dublin Core terms like dc:title or dc:creator.


These two facts, i.e. the slow creation / adoption of metadata repositories and the lack of metadata embedded in the publications, partly show the missing support for the creation and use of metadata. At the same time, they also represent our motivation for the work presented in this report. Firstly, as our goal is to improve the process of metadata creation, we take a closer look at some metadata-related opinions, collected by means of a study that we have performed. The results of the study will also shape some of the non-functional requirements of the use-case applications that we will propose. Consequently (and as our main contribution), we propose a set of algorithms for the automatic extraction of shallow metadata from scientific publications (published as PDF documents). The algorithms follow a low-level document engineering approach, combining mining and analysis of the publications' text based on its formatting style and font information. The remainder of this report is structured as follows: We start in Sect. 2 by presenting the survey we have performed and its results. In Sect. 3 we detail the algorithms for extracting the authors, sections and references from a publication. Sect. 4 shows the algorithms' evaluation results, while in Sect. 5 we present a series of applications in which our work can be seamlessly integrated. Before concluding with Sect. 7, in Sect. 6 we discuss existing related approaches.

2 Metadata study

As already mentioned, we performed this study with the goal of collecting metadata-related opinions and habits, with a particular focus on four aspects:

• usefulness of metadata for scientific publications
• known metadata formats
• storage of metadata
• incentives for creating metadata

We do not claim that the study has high statistical significance, nor that it represents a full-blown social study; it serves a more informal purpose. Firstly, it supports our idea that researchers agree with the important role that metadata plays; secondly, it shows that our engineering decisions are indeed useful within the scope of improving the status of metadata creation. The study was set up in the form of an online survey. The survey contained ten short questions and was open for everyone to fill in 4. We received a total of over 50 entries. In the following, we present the results and conclusions extracted from the survey.

2.1 Metadata usefulness

Figure 1 A and B show the opinions regarding the usefulness of both shallow and deep metadata 5. As expected, in the case of shallow metadata, the usefulness decreases from operations that can cope with surface semantics (e.g. general search – 85.4%) to operations that need deep semantics (e.g. understanding the content of a publication – 18.8%). On the other hand, in the case of deep metadata, the usefulness flows in the opposite direction, i.e. from 85.4% on understanding a publication to 45.8% on general search. In addition, it is interesting to observe that the percentage of researchers considering metadata not useful is very close to zero.

4 The complete survey, including the results, can be found at http://smile.deri.ie/metadata-usefulness
5 We defined deep metadata as the metadata hidden in the semantics of the text, e.g. claims, arguments, positions, etc.

Figure 1: [A] - Shallow metadata usefulness; [B] - Deep metadata usefulness

Figure 2: [A] - Metadata formats; [B] - Metadata storage

2.2 Metadata formats and storage

In terms of metadata formats, without providing a fixed list to choose from, DublinCore 6 was nominated by 40% of researchers, followed by MARC21 and SALT (Semantically Annotated LaTeX) [3] with 16%, and SWRC [13] with 10%. Here, we observe a tendency to know the simple formats, as opposed to the more complex ones. An interesting result was received in regard to metadata storage. Contrary to our expectations, 62.5% opted for metadata embedded in the publications, and only 12.5% for metadata stored externally. Part of the remaining 25% that fit into the 'not sure' category were actually in favor of a hybrid approach, i.e. both embedded and externally stored. Among the advantages of having the metadata embedded in the publications were mentioned:

• the mobility of metadata and the fact that it remains close to the actual content of the publication,
• the possibility of including contextual elements in it, or
• its coherence over time.

6 DublinCore Metadata Initiative – http://dublincore.org/


Figure 3: Metadata Creation Incentives: [A] - Manual Creation; [B] - Hypothetical semi-automatic; [C] - Hypothetical writing environment change

Some of its disadvantages would be the lack of an immediate benefit for the author, or the need for schema alignment, aspects on which the other option scores better. Having the metadata stored externally, and probably published on the web, makes it easier to link, analyze and retrieve, with the remark that this process works efficiently if and only if it has stable URIs.

2.3 Incentives for metadata creation

This last aspect concentrates on the authors' incentive for creating metadata for their publications. As shown in part A of Fig. 3, 60.4% do not ordinarily create metadata for their publications, while only 16.7% do. The remaining 22.9% are those who create metadata when it is imposed by certain situations. Although these results look rather negative, the same researchers would have an incentive to create metadata if provided with a semi-automatic extraction tool (cf. part B of Fig. 3 – 62.2%), and just 2.7% of them would refuse to do it. The vast majority presented by the chart (including the 35.1% of researchers who would do it conditionally) comes with a big caveat: the total amount of time they would allocate for learning the extraction tool varies from only five minutes to a maximum of 15 minutes. Surprisingly, by analyzing the last part of the figure (part C) we learn that the authors would make even more drastic changes to their usual writing habits if, besides the automatic metadata extraction, they would have immediate benefit and feedback. This relates to changing their writing environment (for scientific publications), where 42.6% of researchers gave positive feedback and 40.4% a conditionally positive one.

2.4 Conclusions

The results presented above help us draw three small conclusions, some of them already implicitly known:

• If present (or created), metadata is useful.
• When creating metadata, authors will generally opt for simple and straightforward schemas; as for the way of storing it, a hybrid approach (both embedded and external) is desirable.
• Even if their own incentive for creating metadata is missing, authors will have the patience (and allocate their own time) to try tools that perform metadata extraction. In addition, they are willing to change their usual habits if these tools provide immediate benefit and feedback.


In the remainder of this report, we present our contribution toward supporting especially this last conclusion, by proposing a series of algorithms for automatic shallow metadata extraction. At the same time, we leave the developer / user free to choose both a particular metadata schema type and whether to embed the metadata in a given publication or export it for use in an external (centralized) environment.

3 Metadata extraction process

The first step in achieving our goal of performing automatic extraction of shallow metadata from scientific publications (published as PDF documents) was to design an extraction process comprising a set of algorithms, each dealing individually with one component of the shallow metadata. We followed a low-level document engineering approach, combining mining and analysis of the text based on the publication's formatting style, font encoding and size. We have two extraction phases, each comprising several algorithms:

• First page processing, including: Title extraction, Abstract extraction and Authors extraction
• Full content processing, including Sections and References extraction.

The extraction process starts with a series of pre-processing steps. We will describe them, and also take this opportunity to introduce the terminology used in the algorithms. Firstly, we use PDFBox 7 to perform raw extraction of the internal PDF objects, called positions. These are pieces of text with attached page coordinates and font information. Usually these pieces consist of at most three characters, which are grouped together due to the internal alignment of the PDF glyphs. Additional details on the internal PDF structure and glyphs can be found in [4]. Depending on the extraction phase, we need to run one of the two following pre-processing steps, which in general consist of merging the positions into TextChunks according to the phase's specific needs:

• for the first page processing phase, we merge consecutive positions that have the same font information and are on the same line (i.e. the Y coordinate is the same), while
• for the full content processing phase, we merge all consecutive positions placed on the same line, even if they have different font information, and the resulting TextChunks are assigned the font information of the first position in the line. If the following lines have the same font, they are appended to the previous TextChunk.
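The two merging strategies can be sketched in a few lines; this is a minimal illustration with a simple `Position` record of our own (PDFBox's actual API differs), not the report's implementation:

```python
from dataclasses import dataclass

@dataclass
class Position:
    text: str
    x: float
    y: float
    font: str   # a font identifier, e.g. name plus size

def merge_first_page(positions):
    # First page phase: merge consecutive positions sharing
    # both the font and the Y coordinate (i.e. the same line).
    chunks = []
    for p in positions:
        if chunks and chunks[-1].font == p.font and chunks[-1].y == p.y:
            last = chunks[-1]
            chunks[-1] = Position(last.text + p.text, last.x, p.y, p.font)
        else:
            chunks.append(Position(p.text, p.x, p.y, p.font))
    return chunks

def merge_full_content(positions):
    # Full content phase: merge everything on the same line regardless
    # of font (the chunk keeps the font of the first position in line);
    # a following line with the same font is appended to the chunk.
    chunks = []
    for p in positions:
        if chunks and chunks[-1].y == p.y:
            last = chunks[-1]
            chunks[-1] = Position(last.text + p.text, last.x, p.y, last.font)
        elif chunks and chunks[-1].font == p.font:
            last = chunks[-1]
            chunks[-1] = Position(last.text + " " + p.text, last.x, p.y, p.font)
        else:
            chunks.append(Position(p.text, p.x, p.y, p.font))
    return chunks
```

On the same input, the first-page merge keeps a font change as a chunk boundary, while the full-content merge absorbs it because the positions share a line.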

3.1 First page processing phase

This phase comprises three steps: Title, Abstract and Authors extraction. We will detail only the authors' extraction, as it represents the most interesting and complex step among the three. Algorithm 1 details the exact extraction procedure. In order to give a clear picture of how it works, Fig. 4 depicts an example of the algorithm's main steps, applied to a publication that has the authors structured in several columns. Part A of the figure shows the way in which the authors' columns containing the names and affiliations are linearized, based on the Y coordinate. The arrows in the figure show the exact linearization order. Given this linearization, part B of the figure depicts how the names are extracted, based on the variations on the Y axis. The variations on the X axis can be represented in a similar manner.

7 http://www.pdfbox.org/
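The linearization in part A of Fig. 4 essentially amounts to ordering the chunks top-to-bottom, then left-to-right. A sketch, assuming hypothetical (y, x, text) triples and a Y axis that grows downward from the top of the page:

```python
def linearize(chunks):
    # Order author-block chunks top-to-bottom, then left-to-right,
    # as suggested by the arrows in part A of Fig. 4.
    # Assumption: y increases toward the bottom of the page.
    return sorted(chunks, key=lambda c: (c[0], c[1]))
```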


Figure 4: [A] - Authors processing step 1; [B] - Authors processing step 2

Algorithm 1 Authors extraction algorithm.
1: positions ← select positions between title and abstract
2: rows ← findRows
3: columns ← checkIfColumnsExist()
4: if columns then
5:   firstLine ← head(rows)
6:   for each line in rows do
7:     if line.font() == firstLine.font() then
8:       authorsChunk.add(line)
9:     end if
10:    if (line.X - prevLine.X > 80) and (line.Y - prevLine.Y > 15) then
11:      authors.add(split(authorsChunk, 'and', ','))
12:      authorsChunk ← []
13:      Goto Line 5
14:    end if
15:  end for
16: else
17:  firstLine ← head(rows)
18:  authorsChunk.add(firstLine)
19:  for each line in rows except firstLine do
20:    if line.font() == firstLine.font() then
21:      authorsChunk.add(line)
22:    end if
23:  end for
24:  authors ← split(authorsChunk, 'and', ',')
25: end if

3.2 Full content processing phase

3.2.1 Sections extraction

We consider a given document as a tuple D = (T, Φ, L, C, F, O), where each of the elements is defined as follows:

• T - the set of all text chunks in the document;
• Φ - the set of all fonts in the document;
• L : T → Int - a function mapping each text chunk to its length in words;
• C : T → (Int, Int) - a function mapping each text chunk to its coordinates on the page;
• F : T → Φ - a function mapping each text chunk to its font, with size(Φ) = m;
• O : T → Int - a function mapping each text chunk to its order number inside the document.

The first step is to create the histogram H : Φ → Int, mapping each font in the document to its number of characters. Computing H allows us to find the font of the content f_c as:

    f_c = H^{-1}( max_{f_i ∈ Φ} H(f_i) )

Next, we make two important assumptions: (i) the section titles have a different font than the font of the content, and (ii) the section titles match the pattern P, where P = (1[0-9]? | [2-9])(\. [1-9]?)* [text]. This leads to the definition of a section as a pair (t, idx), where t ∈ T and idx represents the index of the section, e.g. 1, 1.1, 1.2, etc. The set of section candidates can be determined as:

    S = {(t, idx) | t ∈ T ∧ F(t) ≠ f_c ∧ L(t) < 20 ∧ idx matches P}

With S comprising the section candidates, the set of font candidates is narrowed to:

    Φ_S = {f | f ∈ Φ ∧ ∃(t, idx) ∈ S ∧ F(t) = f}

In order to create the tree of sections, we need to define a rule for filtering each section's children. Thus, for a section c defined as a pair (t_c, idx_c) to be a child of section s = (t_s, idx_s), its index must match the pattern idx_s.[d]{2}?, i.e. the parent's index followed by a dot and at most two digits. The set of children of section s is denoted C(s). From here, we can create a preliminary version of the tree as follows:

    Tree = {(s, c) | s ∈ S ∧ c = C(s)}

This preliminary version has a high chance of containing duplicates in idx, therefore we need to filter them by applying the following steps:

• We group all the candidates in accumulators, where an accumulator is defined as a triple (f, S_f, w), with f ∈ Φ_S, S_f the set of candidates c = (t, idx) with F(t) = f, and w the weight of the accumulator. The initial weight is w_0 = 1, and Acc denotes the set of all accumulators.
• We define A : Φ_S → Acc as a function mapping a section (via its font) to its accumulator.
• We define prev_{S_i} as the node located before the section S_i in a depth-first traversal of the tree.
• We define next_{S_i} as the node located after the section S_i in a depth-first traversal of the tree.
• We compute the final tree FT as the pair (root, sections) by applying Algorithm 2.
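As an illustration, the font histogram and the candidate filter can be expressed in a few lines of Python. The regular expression below is a slightly simplified transcription of P (sub-indices limited to single digits), and the (text, font) pair representation of a chunk is our own assumption:

```python
import re
from collections import Counter

# Simplified transcription of pattern P: an index such as
# "1", "2.3" or "10.1" followed by the title text.
SECTION_RE = re.compile(r'^(1[0-9]?|[2-9])(\.[1-9])*\s+\S')

def content_font(chunks):
    # Histogram H: font -> number of characters; the content
    # font f_c is the font with the most characters.
    h = Counter()
    for text, font in chunks:
        h[font] += len(text)
    return h.most_common(1)[0][0]

def section_candidates(chunks):
    # S: chunks whose font differs from f_c, are shorter than
    # 20 words, and whose text matches pattern P.
    fc = content_font(chunks)
    return [(text, text.split()[0])
            for text, font in chunks
            if font != fc
            and len(text.split()) < 20
            and SECTION_RE.match(text)]
```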


Algorithm 2 Tree computation algorithm.
1: FT ← Tree
2: for each section s_i(t_i, idx_i) in sections do
3:   if ∃ s_j(t_j, idx_j) ∈ sections such that idx_i = idx_j ∧ j > i then
4:     if (O(prev_i) < O(t_i) < O(next_j)) ∧ ¬(O(prev_i) < O(t_j) < O(next_j)) then
5:       FT.remove(s_j)
6:       A(s_j).weight ← w_cur − 1/size(A(s_j))
7:     else
8:       if A(s_j).weight < A(s_i).weight then
9:         FT.remove(s_j)
10:      else
11:        if A(s_j).weight > A(s_i).weight then
12:          FT.remove(s_i)
13:        else
14:          if F(t_j) < F(t_i) then
15:            FT.remove(s_j)
16:          else
17:            FT.remove(s_i)
18:          end if
19:        end if
20:      end if
21:    end if
22:  end if
23: end for


Algorithm 3 References extraction algorithm.
1: for each textChunk in TextChunks do
2:   if abs(textChunk.X - prevTextChunk.X) > 200 and abs(textChunk.Y - prevTextChunk.Y) > 100 then
3:     columns.add(currentColumn)
4:     currentColumn ← new column
5:     currentColumn.add(textChunk)
6:   else
7:     currentColumn.add(textChunk)
8:   end if
9: end for
10: for each column in columns do
11:   removeHeadersAndFooters
12:   minX ← minimum of line.X
13:   for each line in column do
14:     if abs(line.X - minX) < 5 then
15:       references.add(currentReference)
16:       currentReference ← new reference
17:       currentReference.add(line)
18:     else
19:       currentReference.add(line)
20:     end if
21:   end for
22: end for
23: findReferencePattern


Figure 5: Example of points of interest for the references extraction algorithm

3.2.2 References extraction

Algorithm 3 details the references extraction process. To get a better picture of the structuring of the references and of the points of interest for the actual extraction process, we provide an example in Fig. 5. Points A and B are important for establishing the existence of the two columns. Of the remaining points, xmin1 and x1 are used to extract each reference in the first column, and xmin2 and x2 are used to extract each reference in the second column.

4 Evaluation

We evaluated the algorithms presented in the previous section by testing them on a representative part of the corpus introduced in Sect. 1. We selected 1203 of the 1367 publications, formatted with the ACM or Springer LNCS styles, these being the two most common formatting styles used for publishing scientific articles. Before presenting the results, some remarks need to be made. Firstly, as mentioned in Sect. 1, the documents forming the corpus are consistent and uniform in terms of encoding and metadata content, individually for each conference. This ensures a relative uniformity among the documents having the same formatting style and processed by the same publisher. Secondly, as we described in [4], performing extraction from PDF documents is a cumbersome task, as it depends on a series of factors like encoding, encryption, etc. Therefore, the evaluation results presented here consider only the documents for which the actual extraction was valid (i.e. the PDF parser was able to read the document). Last, but not least, the evaluation results show the algorithms working on a best effort basis (no additional information is provided about the publications). Nevertheless, our next step will be to provide the means for optimizing the algorithms for particular styles and formats. For the actual computation of the performance measures, i.e. accuracy, precision and recall, we used contingency tables similar to Table 1, which shows the contingency table for titles. We then defined each of the measures as follows:

    Precision = A / (A + B)        Recall = A / (A + C)


Table 1: Contingency table for performance measures

                  Is title   Is NOT title
Extracted            A            B
NOT Extracted        C            D

Table 2: Performance measures of the metadata extraction algorithms

              Accuracy   Precision   Recall   F-measure
Title           0.95       0.96       0.98      0.96
Authors         0.90       0.92       0.96      0.93
Abstract        0.96       0.99       0.96      0.97
Sections        0.92       0.97       0.93      0.94
References      0.91       0.96       0.93      0.94

    Accuracy = (A + D) / (A + B + C + D)

    F-measure = (2 × Precision × Recall) / (Precision + Recall)
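For a single field, all four measures follow directly from the cells of a contingency table laid out as in Table 1; a small sketch:

```python
def measures(a, b, c, d):
    # a = extracted and correct, b = extracted but wrong,
    # c = missed, d = correctly not extracted (cf. Table 1).
    precision = a / (a + b)
    recall = a / (a + c)
    accuracy = (a + d) / (a + b + c + d)
    f = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f
```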

The results are summarized in Table 2. Fig. 6 depicts a graphical visualization of the algorithms' accuracy. Overall (part A of Fig. 6), the title and abstract extraction algorithms performed best, with accuracies of 95% and 96% respectively. We observed that in most of the cases in which these two algorithms failed to produce a result, the PDF parser managed to read the document but failed to actually parse it. Thus, if we were to eliminate this set of documents, the accuracy would probably increase by an additional 2%. The authors extraction has an accuracy of only 90%. This is mostly due to a lack of adherence to the formatting style, or the presence of symbols close to the authors' names mapping each author to a particular affiliation. The results of the remaining two algorithms were somewhat surprising for us. The sections extraction algorithm performed extremely well, with a 92% accuracy, meaning that in those cases it managed to extract the complete tree of sections from the paper. On the other hand, the references extraction algorithm did not perform as well as we expected, and had an accuracy of only 91%.

Figure 6: Evaluation results


Part B of Fig. 6 depicts a comparison of the results split between the two chosen formatting styles. It is interesting to observe that in the case of the LNCS style, the algorithms have a quasi-uniform performance. This is because LNCS does not allow many variations from the original scheme. At the same time, in the case of the ACM style, the results vary because the style admits substantial modifications. For example, the authors extraction has quite a low accuracy. The reason is that, even though the style specifies that the authors should be aligned in columns, with a maximum of four authors, we found plenty of cases in which neither of these two rules was respected. The overall findings of the evaluation are that: (i) in general, the accuracy of all algorithms would increase if the authors adhered strictly to the publishers' formatting guidelines (this is particularly important for the authors extraction), and (ii) providing customized versions of the algorithms for particular styles would boost the accuracy, even if the previous issue persists.

5 Applications

Although one could envision many applications making direct use of the algorithms for extracting metadata from scientific publications, we propose, in the following, two simple use-cases:

• a web-based application, both directly usable by machines and human-friendly, with the goal of performing metadata extraction and exporting the result in various formats, and
• a desktop application that, besides metadata extraction, can be used for learning information about publications based on linked metadata. This application is especially suited for early-stage researchers who need an easy way to find relevant existing work in a particular field.

5.1 Metadata Extraction Service

We implemented the web-based application as an extensible REST service 8. The main functionality of the service is to export the metadata extracted from a given file (based on its URL) in one of the existing formats 9. The file can be specified via the service's parameters, as described on the service's page. The service's architecture allows us to add, in a transparent way, new modules that perform the export in formats other than the ones currently supported. In order to make the application human-friendly as well, not only machine-usable, the same service, if used in a web browser, will return a formatted and readable version of the metadata. An interesting fact to note is that we do not duplicate the metadata information, but delegate its transformation into the readable version to the user's browser, by attaching an XSL sheet to the resulting metadata. Thus, the HTML is created on demand, at the user's request.
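The delegation trick — serving the raw metadata with an attached XSL sheet so the browser renders it on demand — amounts to injecting an xml-stylesheet processing instruction into the exported document. A sketch, with a hypothetical stylesheet name:

```python
def attach_stylesheet(xml, xsl_href):
    # Insert an xml-stylesheet processing instruction right after the
    # XML declaration; a browser receiving this document applies the
    # referenced XSL sheet and displays the transformed HTML.
    pi = f'<?xml-stylesheet type="text/xsl" href="{xsl_href}"?>'
    decl, _, rest = xml.partition("?>")
    return decl + "?>\n" + pi + rest

doc = '<?xml version="1.0"?>\n<metadata><title>Example</title></metadata>'
styled = attach_stylesheet(doc, "metadata.xsl")
```

The metadata itself is never duplicated on the server; the HTML view exists only as the browser-side transformation.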

5.2 Personal Scientific Publication Assistant

In contrast to the web application described above, the desktop application is dedicated entirely to a human user's needs. As mentioned, we designed it to be especially suited for early-stage researchers who are researching the state of the art of a particular field. The application's main goal is to enrich the information space around a publication by using the extracted metadata to query known web repositories of scientific publications, like DBLP or the ACM Portal (see Fig. 7). The information expansion is done in multiple directions, based on the title of the publication, its authors and its references. We believe this approach will help students (and not only students) to learn the most relevant authors and publications in a certain area. Overall, the application provides the functionality offered by the web application plus additional features, such as embedding metadata in the given files. Seen from the Semantic Desktop perspective, this can help enrich the personal information space, and eventually find new links between the different information elements present on the desktop. As we again followed an open architecture approach, the application is highly modularized. Each type of metadata export or linking functionality is exposed by a single dedicated module. Thus, providing more export formats or linking information from different repositories amounts to implementing and adding new modules to the application.

8 http://140.203.154.177:8080/metadata-extraction/
9 Currently the only existing format is SALT, but we plan to add others in the near future.

Figure 7: Personal Scientific Publication Assistant screenshot

6 Related Work

Several methods have been used for the automatic extraction of shallow metadata, such as regular expressions, rule-based parsers or machine learning. Regular expressions and rule-based systems [6] have the advantage that they do not require any training and are straightforward to implement. Nevertheless, their dependence on the application domain and the need for an expert to define the rules or regular expressions give these methods limited use. On the other hand, machine learning methods are robust and adaptable and, theoretically, can be used on any document set. Their main disadvantage is the rather expensive price to be paid for labeled training data. Machine learning techniques for information extraction include symbolic learning, inductive logic programming, Support Vector Machines, Hidden Markov models and statistical methods. Hidden Markov models (HMMs) are the most widely used generative learning method for representing and extracting information from sequential data. However, HMMs are based on the assumption that the features of the model they represent are independent of each other. Thus, HMMs have difficulty exploiting the regularities of a semi-structured real system. Maximum entropy based Markov models [10] and conditional random fields [9] have been introduced to deal with the problem of non-independent features.


DERI TR 2009-06-01

Table 3: Comparison between different approaches to metadata extraction

Approach         | Method                                     | Document type    | Format style        | Title | Authors | Abstract | Sections | References
Shek et al. [12] | Visual/spatial + knowledge and rule based  | PostScript       | Indifferent         | 0.92  | 0.87    | −        | 0.76     | −
Hu et al. [7]    | Machine learning                           | Word, PowerPoint | Indifferent         | 0.96* | −       | −        | −        | −
Han et al. [5]   | SVM                                        | Indifferent      | Indifferent         | 0.98  | 0.99    | 0.97     | −        | −
Han et al. [5]   | HMM                                        | Indifferent      | Indifferent         | 0.98  | 0.93    | 0.98     | −        | −
Groza et al.     | Visual/spatial heuristics                  | PDF              | Mainly ACM and LNCS | 0.95  | 0.90    | 0.96     | 0.92     | 0.91

Support Vector Machines (SVMs) for metadata extraction were studied and applied in particular by Han et al. [5]. Their work builds on Chieu and Ng [1], who suggested that the information extraction task can also be addressed as a classification problem, while their main inspiration came from the work on handling high-dimensional feature spaces in classification problems by Dumais et al. [2] and Joachims [8].

While most of the above-mentioned methods consider only the content, i.e. the text itself, other approaches also consider the content's environment, i.e. the document format in which the text is found. Successful work has been reported on automatic metadata extraction from HTML documents using natural language processing methods [14] or machine learning [7], and from PostScript documents using a rule-based approach [12].

Table 3 summarizes a comparison between the different approaches mentioned above and ours. As can be observed, our approach is close in performance to the HMM and machine learning approaches, but scores lower than SVM. At the same time, it performs better than the other visual/spatial approach, that of Shek and Yang [12]. Some remarks are worth noting here: the comparison between the two visual/spatial approaches and the machine learning / SVM ones might not be considered completely appropriate, mainly because the latter are learning methods which can easily cope with general formats, while the former are 'static' methods. Nevertheless, the learning methods impose a high cost due to their need for accurate training data, while the static methods require no training. Regarding our own approach, we would like to emphasize that the algorithms work on a best-effort basis; tuning them with particular information for particular styles would improve their overall performance.
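The idea of casting metadata extraction as classification, in the spirit of Han et al. [5], can be sketched as follows. Each line is mapped to a feature vector of layout properties and a linear classifier decides whether it is a title line; a simple perceptron stands in for the SVM here, and the features and training data are invented for illustration.

```python
# Illustration of line classification for metadata extraction:
# a linear classifier over layout features decides, per line,
# whether it belongs to the title (+1) or not (-1).

def perceptron_train(samples, labels, epochs=20, lr=0.1):
    """Train a perceptron; a stand-in for the SVM used by Han et al."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is +1 (title) or -1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # mistake-driven update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented features: [relative font size, on first page, vertical position].
train_x = [[1.8, 1, 0.95], [1.0, 1, 0.80], [1.0, 0, 0.40], [1.6, 1, 0.90]]
train_y = [1, -1, -1, 1]
w, b = perceptron_train(train_x, train_y)
```

After training, a line with a large relative font near the top of the first page, e.g. `[1.7, 1, 0.92]`, is classified as a title line, while a body line such as `[1.0, 0, 0.3]` is not.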

7 Conclusion

In this report we described a solution for the automatic extraction of metadata from scientific publications published as PDF documents. We started by performing a small study in order to learn about metadata-related opinions and habits; its results helped us shape some of the non-functional requirements of our proposed solution. For the actual metadata extraction, we followed a low-level document engineering approach, combining mining and analysis of the publications' text based on its formatting style and font information. The algorithms we designed were then used in two sample applications.

For the future, we consider the following improvements and additions: (i) customize the algorithms for particular styles and formats in order to boost their accuracy, (ii) add new export modules to the metadata extraction service, and (iii) add new linking modules to the personal scientific publication assistant, and include statistical analysis of publications and authors.


References

[1] H. L. Chieu and H. T. Ng. A maximum entropy approach to information extraction from semi-structured and free text. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI 2002), pages 786–791, 2002.

[2] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148–155, 1998.

[3] Tudor Groza, Siegfried Handschuh, Knud Möller, and Stefan Decker. SALT - Semantically Annotated LaTeX for Scientific Publications. In ESWC 2007, Innsbruck, Austria, 2007.

[4] Tudor Groza, Knud Möller, Siegfried Handschuh, Diana Trif, and Stefan Decker. SALT: Weaving the claim web. In ISWC 2007, Busan, Korea, 2007.

[5] Hui Han, C. Lee Giles, Eren Manavoglu, Hongyuan Zha, Zhenyue Zhang, and Edward A. Fox. Automatic document metadata extraction using support vector machines. In JCDL '03: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 37–48, 2003.

[6] Hui Han, Eren Manavoglu, Hongyuan Zha, Kostas Tsioutsiouliklis, C. Lee Giles, and Xiangmin Zhang. Rule-based word clustering for document metadata extraction. In SAC '05: Proceedings of the 2005 ACM Symposium on Applied Computing, pages 1049–1053, 2005.

[7] Yunhua Hu, Hang Li, Yunbo Cao, Dmitriy Meyerzon, and Qinghua Zheng. Automatic extraction of titles from general documents using machine learning. In JCDL '05: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 145–154, 2005.

[8] T. Joachims. A statistical learning model of text classification with Support Vector Machines. In Proceedings of the 24th ACM International Conference on Research and Development in Information Retrieval, pages 128–136, 2001.

[9] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289, 2001.

[10] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the 17th International Conference on Machine Learning, pages 591–598, 2000.

[11] Knud Möller, Tom Heath, Siegfried Handschuh, and John Domingue. Recipes for Semantic Web Dog Food – The ESWC and ISWC Metadata Projects. In Proceedings of the 6th International Semantic Web Conference (ISWC 2007), Busan, Korea, 2007.

[12] Eddie C. Shek and Jihoon Yang. Knowledge-based metadata extraction from PostScript files. In Proceedings of the 5th ACM Conference on Digital Libraries, pages 77–84, 2000.

[13] Y. Sure, S. Bloehdorn, P. Haase, J. Hartmann, and D. Oberle. The SWRC ontology - Semantic Web for Research Communities. In Proceedings of the 12th Portuguese Conference on Artificial Intelligence (EPIA 2005), Covilha, Portugal, December 2005.

[14] Ozgur Yilmazel, Christina M. Finneran, and Elizabeth D. Liddy. MetaExtract: An NLP system to automatically assign metadata. In JCDL '04: Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 241–242, 2004.