
Proceedings of I-KNOW ’09 and I-SEMANTICS ’09 2-4 September 2009, Graz, Austria

Applying Ontological Framework for Finding Links into the Future from Web

Muhammad Tanvir Afzal
(Institute for Information Systems and Computer Media, Graz University of Technology, Graz, Austria
Email: [email protected])

Abstract: The tremendous growth of the Web has made it challenging to find context-specific information. After finding specific content on the Web, users must explicitly query search engines for related information. In many cases, general search engines return millions of generic hits that do not satisfy the user’s intent. To cope with this scenario, the author has previously implemented a feature called ‘Links into the Future’ within a digital journal. When a user is viewing a research paper, Links into the Future provides the user with the most relevant papers that were made available after the publication date of the focused paper. Users do not need to search for related papers; instead, the related papers are pushed into the user’s local context. The author has also proposed an ontological representation for this concept. The current work describes the complete system architecture and shows how this ontological framework is applied to extend the notion to finding Links into the Future among Web documents. It also shows how the proposed ontological framework is instantiated and how each individual ontology plays its role in finding the most related Links into the Future from Web documents.

Keywords: Links into the future, Ontologies, Ontological framework, Information extraction.

Categories: H.3.1, H.3.3, H.3.7

1 Introduction

The Web has become a major platform for sharing different kinds of resources. However, as the Web has evolved, it has become very difficult to find task-related, context-specific information. This is mainly because information on the Web is unstructured. Furthermore, the user’s local context and task at hand are not considered during resource discovery [Heath et al 2005]. The Semantic Web, on the other hand, strives to structure resources in a formal way so that they can be interpreted and processed intelligently by machines. The information supply paradigm proposed by [Broder 2006] is about presenting users with the most relevant information by looking at the user’s local context. Inspired by this paradigm, the idea of Links into the Future is proposed and implemented in this paper. This feature provides users with the most relevant information that has been made available on the Web after the publication date of the focused content. It can be realized in different contexts. For example, when a user is reading a particular news article, Links into the Future provides the user with other relevant news articles that were published afterwards. The discovered articles may express positive or negative sentiments about the news article in focus. When a user is reading a book, Links into the Future provides links to related books or to extended versions of the focused book. The scope of this paper, however, is to apply this feature to finding related papers on the Web.

The idea of Links into the Future was originally proposed by [Maurer 2001]. An ontological representation of this idea was presented in [Afzal et al 2007a]. [Afzal et al 2007b] showed how this feature can be implemented within a digital journal. General rules for finding future related papers on the Web are detailed in [Afzal 2009]. The current work is a follow-up publication describing the complete system architecture for implementing this idea for Web documents. The feature has been implemented in the Journal of Universal Computer Science (J.UCS) [J.UCS 2009]. A future link from paper “a” to paper “b”, Future_Link (a, b), exists if paper “b” is written by the same author/s as paper “a” and the topics of both papers are similar, as shown in equation (1).

\[ \big(\mathrm{Authors}(b) \in \mathrm{Authors}(a)\big) \wedge \big(\mathrm{Topics}(b) \in \mathrm{Topics}(a)\big) \Rightarrow \mathrm{Future\_Link}(a, b) \qquad (1) \]
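Rule (1) can be read as a simple predicate over paper metadata. The following Python sketch is illustrative only: the `Paper` record, its field names, and the placeholder values are hypothetical, and interpreting both relations in (1) as "shares at least one element" is an assumption rather than the implementation used in J.UCS. The date check reflects the requirement stated earlier that a future link points to content published after the focused paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical metadata record; field names are illustrative."""
    title: str
    authors: frozenset
    topics: frozenset
    published: date


def future_link(a: Paper, b: Paper) -> bool:
    """Return True if b qualifies as a Link into the Future for a."""
    shares_author = bool(a.authors & b.authors)  # at least one common author
    shares_topic = bool(a.topics & b.topics)     # at least one common topic
    is_later = b.published > a.published         # b appeared after a
    return shares_author and shares_topic and is_later


# Placeholder records, not real bibliographic data.
focused = Paper("Focused paper", frozenset({"Author A"}),
                frozenset({"H.3.7"}), date(2000, 1, 1))
later = Paper("Later paper", frozenset({"Author A", "Author B"}),
              frozenset({"H.3.3", "H.3.7"}), date(2005, 1, 1))

print(future_link(focused, later))  # True
print(future_link(later, focused))  # False: the link only points forward in time
```

Using sets for authors and topics keeps the rule symmetric in everything except time, which is what makes the link directional.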

The remainder of this paper is organized as follows: Section 2 describes the underlying system architecture. Section 3 explains how the ontological framework works, followed by a description of the knowledge extraction module in Section 4. Experimental results are discussed in Section 5.

2 System Architecture

This section presents the architecture of the proposed system. Ontology-based knowledge extraction from Web documents was the focus of the Artequakt project [Alani et al 2003], in which Alani et al. proposed ontology-based knowledge extraction from text documents in the artist domain. Extracting domain-specific information from Web documents is not a trivial task. The proposed system is shown in Figure 1. We are interested in finding related future papers from different sources such as the Web, CiteSeer, and DBLP. However, the focus of this paper is limited to finding related papers on the Web. The system has three main modules: knowledge extraction, the ontology framework, and visualization to the user. These modules are described in the following sections.

3 Working of Ontological Framework to Represent Future Links

This section describes how the ontological framework defined in [Afzal et al 2007a] and shown in Figure 1 is applied to finding related documents on the Web.

3.1 Author’s Publication

This ontology conceptualizes authors’ papers stored at different sources, including J.UCS and the Web. Initially, this ontology was populated with J.UCS authors (more than 2,100) and papers (more than 1,400) published by J.UCS. Related papers newly discovered on the Web are linked into this ontology.

Figure 1: System Architecture

3.2 Paper’s Metadata

When a paper is published by J.UCS, a detailed metadata file (including the paper’s title, list of authors, ACM categories, etc.) is generated. The paper’s metadata ontology was populated from these metadata files [Afzal et al 2007a].

3.3 Author’s Onamasticon (Lexicon of Author’s Names)

Author disambiguation is an important task in finding related papers. The same author can be referred to by different name variations, for example “Full name”, “Initial. last name”, and “Last name, initial.” The Author’s Onamasticon ontology represents these name variations and was populated with the described set of variations for all authors. This ontology alone is not sufficient for author disambiguation. For example, searching for papers written by “H. Maurer” may yield false positives such as papers written by “Henry Maurer”. This problem is addressed by the author’s specialization ontology discussed in Section 3.4, along with the general rules described in Section 4.
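The name variations and the resulting ambiguity can be illustrated with a few lines of Python. This is a minimal sketch under simplifying assumptions: the function name `name_variants` is hypothetical, only the first and last tokens of a name are used, and middle names, titles, and diacritics are ignored.

```python
def name_variants(full_name):
    """Generate the three variation patterns named in the text."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    initial = first[0]
    return [
        full_name,              # "Full name"
        f"{initial}. {last}",   # "Initial. last name"
        f"{last}, {initial}.",  # "Last name, initial."
    ]


print(name_variants("Hermann Maurer"))
# ['Hermann Maurer', 'H. Maurer', 'Maurer, H.']

# Two different people collide on every abbreviated form, which is why
# the name lexicon alone cannot disambiguate authors.
shared = set(name_variants("Hermann Maurer")) & set(name_variants("Henry Maurer"))
print(sorted(shared))
# ['H. Maurer', 'Maurer, H.']
```

Only the full-name variant separates the two authors here, which is consistent with the need for additional evidence such as the author’s specialization.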


3.4 Author’s Specialization

When a paper is published by J.UCS, it is annotated with ACM categories by its authors. These categories are used to describe an author’s specialization. This ontology was populated from the metadata files of J.UCS papers.

3.5 Future Links

This ontology conceptualizes all candidate Links into the Future for all J.UCS papers. The same paper may be acquired from different sources, which increases its importance and is used to rank it accordingly.

3.6 Community (Co-Authors)

This ontology conceptualizes the co-authors of a paper. If a paper is written by three authors, then all papers written by any of these three authors in the same area and published afterwards are considered Links into the Future.

3.7 Author’s Future Papers

This ontology extends the concept of the “author’s publication” ontology: every paper is further linked with the future papers found on the Web. Newly discovered links from the Web are incorporated dynamically. More than 500 Links into the Future were found for 250 unique papers within J.UCS.

4 Knowledge Extraction

Finding specific information and relationships between documents in text/XML documents stored on the Web is a great challenge [Alani et al 2003]. Finding documents that are Links into the Future for a paper, however, is a different task. We are interested in finding documents on the Web that were published by J.UCS authors. Knowledge extraction uses the ontology framework to extract related information from the Web. We have subdivided the knowledge extraction module into three submodules.

4.1 Document Retrieval and Pre-processing

This module is responsible for retrieving documents from the Web and filtering them. The SOAP APIs of Google [Google API 2009], Yahoo [Yahoo API 2009], and Microsoft Live [MSN API 2009] were used to query the Web. The query is designed to search for documents in particular formats (PDF, PS, and DOC) and is formulated from the “author’s publication”, “Onamasticon”, and “community” ontologies. A typical query looks like {abstract references “Hermann Maurer” “H Maurer” “Maurer H” filetype: PDF}. This query formulation helped to reduce undesired information. Duplicate records are subsequently removed. Documents in PS and DOC formats are first converted to PDF using MiKTeX [MiKTeX 2009] and OpenOffice [Open Office 2009], respectively. PDFBox [PDFBox 2009], a Java library, is then used to convert the PDF files to plain text for further analysis.
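The query formulation and duplicate removal described above can be sketched as follows. This is not the system’s code: the helper names `build_query` and `deduplicate` are hypothetical, the `filetype:` operator is written without a space on the assumption that this is what the search engines expect, and treating two results as duplicates when their URLs match exactly is an assumption, since the text does not say how duplicates are detected.

```python
def build_query(name_variants, file_type="PDF"):
    """Assemble a query string shaped like the example in the text."""
    quoted = " ".join(f'"{variant}"' for variant in name_variants)
    return f"abstract references {quoted} filetype:{file_type}"


def deduplicate(urls):
    """Drop repeated result URLs while keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = url.strip()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


query = build_query(["Hermann Maurer", "H Maurer", "Maurer H"])
print(query)
# abstract references "Hermann Maurer" "H Maurer" "Maurer H" filetype:PDF

# Placeholder URLs standing in for hits returned by different engines.
hits = ["http://example.org/a.pdf", "http://example.org/b.pdf", "http://example.org/a.pdf"]
print(deduplicate(hits))
# ['http://example.org/a.pdf', 'http://example.org/b.pdf']
```

The words "abstract" and "references" in the query bias the results toward documents that contain both sections, which is what narrows the result set before any local filtering takes place.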

4.2 Noise Filtering

The dataset extracted in Section 4.1 may contain documents other than research papers, such as CVs, business cards, and authors’ publication lists. Two simple rules were devised to retain only the papers:
(1) The title of the paper, followed by the author name and abstract, must appear on the same page (not necessarily the first page). The authors’ full names are subsequently searched for to disambiguate author names.
(2) The word “reference” or “references” must appear, followed by a proper sequence starting with any of the patterns “[author]”, “[1]”, “1”, and ().
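A rough approximation of these two rules is shown below. It is a sketch under stated assumptions, not the classifier used in the system: the function names are hypothetical, title detection is not attempted (rule 1 is reduced to "an author name followed by the word abstract on the same page"), and the "()" pattern is interpreted as a parenthesized label such as "(1)".

```python
import re

# Rule 2 (approximation): the word "reference(s)" directly followed by a
# first entry that starts with "[...]", "(...)" or a bare "1".
REFERENCE_START = re.compile(
    r"\breferences?\b\s*(?:\[[^\]]+\]|\([^)]+\)|1\b)",
    re.IGNORECASE,
)


def has_header_page(pages, author_names):
    """Rule 1 (simplified): some page names an author and, later on the
    same page, contains the word "abstract"."""
    for page in pages:
        lower = page.lower()
        for name in author_names:
            pos = lower.find(name.lower())
            if pos != -1 and "abstract" in lower[pos:]:
                return True
    return False


def looks_like_paper(pages, author_names):
    """Apply both rules to the plain-text pages of one document."""
    text = "\n".join(pages)
    return has_header_page(pages, author_names) and bool(REFERENCE_START.search(text))


# Invented sample texts for illustration.
paper_pages = [
    "A Sample Title\nHermann Maurer\nAbstract: This paper discusses ...",
    "References\n[1] First entry ...",
]
cv_pages = ["Curriculum Vitae\nHermann Maurer\nReferences available on request"]

print(looks_like_paper(paper_pages, ["Hermann Maurer"]))  # True
print(looks_like_paper(cv_pages, ["Hermann Maurer"]))     # False
```

Because both rules must hold, a CV that merely mentions references, or a publication list without an abstract, is rejected.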

4.3 Information Component Extraction

This module uses the ontological framework (as described in Section 3) to extract the relevant information. The list of authors is extracted and the “author’s publication” ontology is populated. The papers discovered on the Web are not annotated with ACM categories, which would have helped in determining their relatedness to the J.UCS papers (source papers). Relatedness is therefore calculated from similarity scores. To calculate the similarity score, we first removed each paper’s header and references in order to focus only on the paper’s content. The Yahoo term extractor [YahooTermExtractor 2009] was then used to extract key terms. Similarity is calculated by taking the dot product of the key terms extracted from the source and candidate papers. All papers meeting the defined threshold are considered Links into the Future for the source paper. The newly discovered Links into the Future are updated in all relevant ontologies described in Section 3.
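The similarity step can be sketched as a dot product over term-count vectors. The text does not specify how terms are weighted or what the threshold is, so the use of raw term counts, the threshold value, the function names, and the sample key terms below are all assumptions made for illustration; in the described system the key terms come from the Yahoo term extractor.

```python
from collections import Counter


def dot_product(terms_a, terms_b):
    """Dot product of two term-count vectors built from key-term lists."""
    va, vb = Counter(terms_a), Counter(terms_b)
    return sum(va[term] * vb[term] for term in va.keys() & vb.keys())


def select_future_links(source_terms, candidates, threshold):
    """Return ids of candidates whose similarity meets the threshold.

    `candidates` maps a candidate id to its list of key terms.
    """
    return [
        cid for cid, terms in candidates.items()
        if dot_product(source_terms, terms) >= threshold
    ]


# Hard-coded placeholder terms instead of term-extractor output.
source = ["ontology", "digital library", "metadata"]
candidates = {
    "c1": ["ontology", "metadata", "web search"],
    "c2": ["graph theory", "complexity"],
}

print(dot_product(source, candidates["c1"]))                  # 2
print(select_future_links(source, candidates, threshold=2))   # ['c1']
```

With unit counts the dot product reduces to the number of shared key terms, so the threshold directly expresses how many terms two papers must have in common.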

5 Experimental Results

We manually checked the overall performance of the system for randomly selected authors, as shown in Table 1. The columns of Table 1 show the results of the filtering processes described in Section 4. We observed a substantial reduction of noise in every phase. For example, if a user submits a query such as {“Hermann Maurer”} to a search engine to find the papers written by this author, 168,000 results are returned. These results include video files, XML documents, and other content besides research papers. The formulated query defined in Section 4.1, however, retrieved 112 documents. These documents contained duplicate entries; after removing duplicates, we were left with 75 unique results. These 75 documents may include research papers as well as other related documents, as discussed in Section 4.2. The paper classification rules filtered the research papers out of these documents, leaving 12 research papers. The same process was carried out using Yahoo and MSN searches. The union of all searches resulted in 23 unique papers. These 23 papers were candidates for the similarity algorithm defined in Section 4.3, which identified the 17 actual future links for the focused paper. The system has thus shown that it can reduce the large volume of generic hits returned by a general search engine (168,000 in this example) to a few highly relevant future related papers.

| Author | Focused paper in J.UCS | Formulated query (Google / Yahoo) | After duplicate removal (Google / Yahoo) | Classified papers (Google / Yahoo) | Unique papers | Actual future links |
|---|---|---|---|---|---|---|
| Maurer H. | Digital Libraries as Learning and Teaching Support, Vol. 1, Issue 11 | 112 / 495 | 75 / 86 | 12 / 19 | 23 | 17 |
| Abraham A. | A Novel Scheme for Secured Data Transfer Over Computer Networks, Vol. 11, Issue 1 | 148 / 263 | 62 / 87 | 13 / 41 | 33 | 22 |
| Bulitko V. | On Completeness of Pseudosimple Sets, Vol. 1, Issue 2 | 21 / 45 | 21 / 28 | 7 / 13 | 17 | 3 |
| Shum S. B. | Negotiating the Construction and Reconstruction of Organisational Memories, Vol. 3, Issue 8 | 103 / 546 | 81 / 104 | 11 / 26 | 28 | 21 |
| Abecker A. | Corporate Memories for Knowledge Management in Industrial Practice: Prospects and Challenges, Vol. 3, Issue 8 | 69 / 335 | 59 / 65 | 9 / 14 | 17 | 15 |

Table 1. Links into the Future results for selected authors
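As a worked reading of Table 1, the short Python snippet below computes, for each author, the share of unique candidate papers that were retained as actual future links by the similarity step. The numbers are copied from the unique-paper and actual-future-link counts in Table 1; the variable names and the choice of this ratio as a summary are illustrative and not part of the original evaluation.

```python
# (unique papers, actual future links) per author, taken from Table 1.
table1 = {
    "Maurer H.": (23, 17),
    "Abraham A.": (33, 22),
    "Bulitko V.": (17, 3),
    "Shum S. B.": (28, 21),
    "Abecker A.": (17, 15),
}

lines = [
    f"{author}: {actual}/{unique} = {actual / unique:.0%}"
    for author, (unique, actual) in table1.items()
]
for line in lines:
    print(line)
# Maurer H.: 17/23 = 74%
# Abraham A.: 22/33 = 67%
# Bulitko V.: 3/17 = 18%
# Shum S. B.: 21/28 = 75%
# Abecker A.: 15/17 = 88%

total_unique = sum(unique for unique, _ in table1.values())
total_actual = sum(actual for _, actual in table1.values())
print(f"Overall: {total_actual}/{total_unique} = {total_actual / total_unique:.0%}")
# Overall: 78/118 = 66%
```

The ratio varies considerably between authors, from 18% to 88%, which shows how much work the similarity filter does after the classification rules have already removed non-paper documents.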

6 Conclusion

The working of the ontological framework and the role of each individual ontology in finding Links into the Future on the Web were explained in detail. The results show that the system was able to find highly relevant Links into the Future on the Web. The discovered Links into the Future are pushed into the user’s local context, which reduces the human effort needed to locate related papers on the Web. This feature has been available in J.UCS (http://www.jucs.org) since 2007. We have received a number of emails from J.UCS users who find this system extremely useful.

References

[Afzal 2009] Afzal, M.T.: “Discovering Links into the Future on the Web”; Proc. WEBIST’09, Lisboa, Portugal (2009), 123-129.

[Afzal et al 2007a] Afzal, M.T., Abulaish, M.: “Ontological Representation for Links into the Future”; Proc. ICCIT’07, Gyeongju, Korea (2007), 1832-1837.

[Afzal et al 2007b] Afzal, M.T., Kulathuramaiyer, N., Maurer, H.: “Creating Links into the Future”; J.UCS (Journal of Universal Computer Science), vol. 13, issue 9 (2007), 1234-1245.

[Alani et al 2003] Alani, H., Kim, S., Millard, D., Weal, M., Hall, W., Lewis, P., Shadbolt, N.: “Automatic Ontology-Based Knowledge Extraction from Web Documents”; IEEE Intelligent Systems, 18, 1 (2003), 14-21.

[Broder 2006] Broder, A.: “The Future of Web Search: From Information Retrieval to Information Supply”; book chapter in Lecture Notes in Computer Science, 4032 (2006), 362.

[Google API 2009] Google Search API, http://code.google.com/apis/soapsearch/reference.html

[Heath et al 2005] Heath, T., Dzbor, M., Motta, E.: “Supporting User Tasks and Context: Challenges for Semantic Web Research”; Proc. of the Workshop on End-user Aspects of the Semantic Web (UserSWeb), European Semantic Web Conference (ESWC2005), Heraklion, Crete (2005).

[J.UCS 2009] Journal of Universal Computer Science, http://www.jucs.org

[Maurer 2001] Maurer, H.: “Beyond Digital Libraries, Global Digital Library Development in the New Millennium”; Proc. NIT’01, Beijing (2001), 165-173.

[MiKTeX 2009] MiKTeX, a PS to PDF converter, http://www.miktex.org/

[MSN API 2009] MSN Live API, http://msdn2.microsoft.com/en-us/library/bb266180.aspx

[Noy et al 2001] Noy, N., McGuinness, D.L.: “Ontology Development 101: A Guide to Creating Your First Ontology”; Stanford University (2001), http://www.ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness-abstract.html

[Open Office 2009] Open Office, a DOC to PDF converter, http://www.openoffice.org/

[PDFBox 2009] PDFBox, a PDF to text converter, http://www.pdfbox.org/

[Yahoo API 2009] Yahoo Search API, http://www.programmableweb.com/api/yahoo-search

[YahooTermExtractor 2009] Yahoo Term Extractor service, http://developer.yahoo.com/search/content/V1/termExtraction.html
