A Ranking Algorithm integrating Vector Space Model with Semantic ...

1 downloads 0 Views 570KB Size Report
Sep 3, 2012 - They read the documents, index the data present in them, and cluster them .... The value of co calculated as: ..... T. Dillon, Cha. 8. Advances in.
A Ranking Algorithm integrating Vector Space Model with Semantic Metadata Saurabh Wadwekar

Debajyoti Mukhopadhyay

WiDiCoReL Research Lab Maharashtra Institute of Technology

WiDiCoReL Research Lab Maharashtra Institute of Technology

S. No. 124, Paud Road, Kothrud

S. No. 124, Paud Road, Kothrud

Pune 411038, India +91-99601-79733

Pune 411038, India +91-77091-52655

[email protected]

[email protected] performed according to similarities in meaning of the content. Initially document vectors are constructed using the standard vector space model. Vector space model was originally proposed by Gerard Salton, A Wong and C.S. Yang [2].It is a simple as well as an efficient model to index data. In vector space model, the documents are represented as vectors. The relevance weight of the keywords present in them is calculated using the tf-idf (termfrequency inverse document frequency) method. The weight of any keyword with respect to a document is given by:

ABSTRACT Semantic Web is defined as a structured collection of information, which specifies the rules of inferences required to derive a specific piece of information from it. Data in semantic web is organized in a structured manner, and the relationships clearly defined amongst them. The potency of semantic web lies in is its capacity to comprehend the meaning of the given data, and connect it with other available pieces of information, to produce a new set of tangible knowledge. The algorithm developed in the paper takes advantage of this property of semantic web, and generates clusters of data based on the semantic metadata present in them. A system has been proposed to analyze the observations of algorithm implementation. The system is studied alongside an existing semantic web search engine, to assess the quality of its search results.

W (t,d)= t-f * log(n/t(n)) Here, t-f is the term frequency, n is total number of documents and t (n) is number of documents in which the term appears. The vector space model is effective in offering search results on keyword basis, but fails to take into account the semantic nature of the data. This often leads to ‘false positives’ or ‘false negatives’. This limitation can be handled by creation of ontological domain vectors, which is explained in the subsequent section. The ontological domain vector comprises of semantic information related to a single document. Clusters of documents can be prepared by determining vectors with least distances. In the IR process developed here, user is offered a choice to drill down on his preferred search result. This drill down evaluates the closest documents (semantically) to the selected result, and offers next set of results accordingly. One advantage of such drill down is user can narrow down his search to his topic of interest without specifying the topic of interest beforehand.

Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Search process, Retrieval Models

General Terms Algorithms, Performance, Design, Experimentation, Theory, Verification.

Keywords Clustering, Semantic web, mining, vector space, information retrieval, document vector, ontological domains

1. INTRODUCTION

This paper initially outlines research advances in the domain of ontological based clustering and ranking algorithms. It then proceeds to describe the design rationale behind the new system and its specifications. Finally, sIRch, the developed system is tested against user queries and its performance is evaluated alongside an existing semantic web search engine. The paper concludes by noting the advantages of the approach used and stating its scope of improvement.

Conventional retrieval models place greater emphasis on the keywords present in the document than the semantic meaning of entire data [13]. They read the documents, index the data present in them, and cluster them according to the keywords present in them. Standard clustering algorithms include hierarchical clustering, centroid-based clustering and density-based clustering. In most cases, clustering calculates the proximities between the documents by taking into account the common keywords in them. This often leads to ambiguity since the keywords translate to different meanings corresponding to the context in which they are used. In the algorithm proposed in this paper, clustering is

2. LITERATURE REVIEW Researchers have carried out substantial amount of work in proving that clustering is greatly improved by taking ontology into consideration. Hotho et al [3] described the improvements in text based clustering method by mapping the terms in the document to a set of concepts. William De Luca et al [4] used the WordNet ontology to classify the documents into different ‘Sense Folders’, proving that ontological classification provided better results. The method provides disambiguation information of a query with respect to documents of a result set. Aussenac-Gilles [5]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CUBE 2012, 3-5 September, 2012, Pune, India. Copyright 2012 ACM 978-1-4503-1185-4/12/09…$10.00.

623

also developed an algorithm which performed search within a context. The search results were from a hierarchical set of concepts extracted from ontology. Controlled vocabulary was used here, and documents were clustered according to the information space defined by the users. This type of clustering would be effective for specialized needs, as the information space needs to be defined before clustering. Zhanget al[6]showed that ‘re-weighting’ the words in a vector to find semantically similar core words yields better results on PubMed document sets. Exploiting the semantic metadata has always shown to generate better clusters, with documents of similar meaning grouped together. A need was felt by us to develop an algorithm which would work on any data set and offer search results through semantic analysis. Also, the search results should be accurate and fast enough for practical purpose. The work carried out by us is primarily focused on creation on ontological vectors for each document, which represent the concepts and social contexts extracted from the document. Clustering is as a result of calculations upon these vectors. A system has been proposed which implements the developed algorithm to generate efficient search results. The principles of the system can be applied to any information retrieval systems without modifications.

Table 1. Semantic information extracted from a document containing data about ‘router’

3. PROBLEM DESCRIPTION A system is to be designed which clusters data based on semantic analysis and ranks them accordingly. The algorithm developed should work with any kind of data set and provide swift search results. Primary aim in developing the system is to offer semantically relevant results. In this paper, an effort is made to develop a system which would offer high quality semantically relevant search results. The developed algorithm is implemented on top of vector space model.

Term

Domain

Technology Internet

Topic

Computing

Topic

Network architecture

Topic

Computer networking

Topic

Open Shortest Path First

Topic

Routing table

Topic

Server appliance

Topic

Router

Topic

Networking hardware

Topic

Internet protocols

The domain ‘Technology Internet’ summarizes the entire meaning of the document under its domain. Different topics which are covered in the document can be condensed into terms such as ‘Networking hardware’, ‘Internet Protocol’. The vector of any document could be generically specified as:

4. PROPOSED SYSTEM 4.1 Requirements

1 = (a. 1 + b. 2 +…. + n.

A system which would offer semantically relevant results by clustering the documents on the basis of ontological information present in them. The response time of the system should be small enough for practical use and it should not restrict itself to result set of any particular domain. Foremost step in satisfying this requirement is creation of ontological domain vectors of each document.

=

) + (p. 1 + q. 2 + … + z. 3)

+

Where 1, 2, are ontological domains and a, b, c are the weights associated with them. 1, 2, are the topics, and p, q, t their respective weights. and are generic vectors representing the ontological domain and topic of the document. For example, according to the semantic information extracted, the ontological domain vector for a file containing content about ‘Cisco’ would be –

4.1.1 Generic domain vector equation Ontological domain vector constitutes the main part of the clustering process in our algorithm. For each document, a vector describing its metadata is constructed. Semantic metadata extraction from a piece of data yields the following information: 1. 2.

Type

1 = (a. 1 + b. 2) + (p. 1 + q. 2 + r. 3 + s. 4 + u. 5 + v. 6 ) Where 1 is domain ‘Technology Internet’, 2 is domain ‘Business Finance’, 1 is topic ‘Computer Networking’ and so on.

Ontological Domain Information Social Contexts/Topics present in the document

The proximity or nearness between any two documents can be calculated by taking the dot product of two documents’ vectors:

The ontological domain information highlights the broad domains in which the document falls. Social contexts or topics are the numerous subjects which are covered in the document.

1. 2 Two different types of vector constitute any single document’s vector (domain and topic); the dot product would also yield two types of numeric values:

A Web-service (OpenCalais) [7] is used to extract the semantic metadata from a document. Document ontology domain vector is constructed on the basis of this data.

a. + b.

For example, a document containing an article of networking device ‘router’ contains the following semantic metadata, as shown in Table 1:

An ontological domain would be a highly generalized term which would have many topics under it. Since the ontological domain is superset of the topics in the above vectors, the in the result is given precedence over the . A dot product with high value of the

624

which could be easilyy read by the Jaava APIs. For carrying c this task, a software writtenn in Ruby calledd WP2TXT [10]is used. IDE used foor development is i NetBeans 7.0.1, and the versioon of JDK is 1.6.Inddexing steps in crreating the vectoor space model are a –

ddomain coefficieent denotes a greeater proximity than t a dot produuct w lower domaain coefficient an with nd higher topic coefficient. c

4 4.1.2 Vector weight calculation The value of weeights associateed with any dom T main/topic can be b c calculated usingg different meth hods. One methhod of calculatinng w weights is describbed below. For example, e = (a.

+ b.

) + (p.

+…. + n.

+ q.

+ … + z. )

T value of coefficient ‘p’ can be The b calculated as:: p=∑ (K Keywords which h can grouped unnder

read_file(f) – Read file f.

b.

Separate_artiicles(f) – Diviide the file acccording to different articcles present in itt.

c.

∑ Create_foorward_index[a(ii)], i=0 to n – For each article, createe a forward inddex file containinng the term information.

d.

∑ Create_reeverse_index[a(i)], i=0 to n - mergethe indices into the common index pool – rresulting in reverse indices.

e.

Calculate_weeight() - calculaate tf-idf weightts of all the words. The weight calculatiion can also bee performed separately at the end, and onnly once, since the weights change whenn any new file is added. Deterrmining the weights afterr all the files haave been indexedd appears to be a more praactical option.

)

This weight asssignment wou T uld have to be b based on an a e evolutionary algoorithm [8], since the keyword cattegorization undder a appropriate topic an t is a subjective process. A keyword, for f e example ‘bill’ caan be grouped un nder ‘person’ or ‘price’ dependinng u upon its relative usage u in the doccument. The value of cooefficient ‘a’ of T o ontological domain d c calculated as: a= ∑ (T Topics of

a.

can be b

)

Each of the topiics (or social co E ontexts) which are a extracted froom thhe piece of innformation can n be grouped under a broadder o ontological dom main. The weig ght of any dom main found in a d document can sim mply be calculaated by adding thhe number of thhat s specific domain’s topics presentt in the documennt. One more way too determine the domain weightss can be:

The inddexing procedurre in the propossed system is accompanied by thee domain clusteering of data. After one article is read compleetely from the fille, it is sent to thhe web service (O OpenCalais) for dom main and topic extraction. Thee web service creates rich semanttic metadata for the submitted content using methods m like natural language proceessing and macchine learning. This T step is succeedded by vector crreation – storing the received infformation in the forrm of a vector. Clustering is done d by calculatting the dot productt of two vectorrs. The documeents with high dot product values can be said to be similar and thus clustered together. An l like this: averagee running instancce of back-end looks

a= ∑ (T Topics coefficien nts)/ Number off Topics nsidered are thee only ones which Inn above formulaa, the topics con f fall in the domain, and the to opic coefficientss are the weighhts c calculated in thee previous section. In the sim mulation described b below, the value of coefficients is i set as 1 by deffault. The weight of toppic coefficients can be derived by T b interpreting thhe s significance of the t topic in con ntext to the parrticular documennt. T This can be dettermined while semantic metaddata extraction is p performed. Topiic segmentation, in natural lannguage processinng c be used in deeriving topic coeefficients. can

F Figure 1. A runn ning instance of indexing algorrithm

4 Design Rationale 4.2 R A data set is firsst selected on wh hich the clusteriing and ranking is [9] p performed. Wikiipedia database dump d has beenn selected for thhis p purpose due to diverse d nature off information avaailable on it. Back e of simulatioon is developed using Java, andd the front end is end d designed using JSP J Servlets. RE EST APIs are useed in Java to sennd thhe document daata to the web service. s The sem mantic data in thhe f form of vectors is stored in th he Oracle 11g database. d Servleets o offers the powerrful feature of rendering r Java functionality f onnto thhe front end. Detailed descriptiion of the system is explained in thhe next section.

The above depiction shhows the workinng of developed algorithm a in a condensed manner. The T article on ‘Acoustic ‘ metam materials’ is the lastt article in the fille being currentlly read. After alll the articles in a parrticular file are read, r the first piece of data, an article a about ‘IC 2889’ is sent to the web servicce for ontologiical domain classifiication. This stepp ends in creatioon of the vector,, storing the IDs of the domains/toppics returned. Clustering is perfformed after this steep. Dot product of o the newly created vector is taaken with all the otheer vectors storedd in the databasee. The closest documents are then stoored in a separatte table for swifft retrieval duringg the search processs. The line ‘Caalculating Proxim mities of – file319_vector’ denotess the product callculation of the file named as fille319. After proxim mity calculation, document d vectorrs according to vector v space model are created. The T weight is not calculatedd in every c once in i the end. individdual file’s operatiion; rather it is calculated

4 Theorettical founda 4.3 ation of sIRcch ssIRch system is designed to ind dex the documents and providee a s search engine to effectively querry it. The back end of the systeem iss developed in Java, J with Oraclee 11g as the dataabase. The sampple d document dataseet used for indeexing is the Wikipedia W databaase d dump. The databbase dump consissts of a single XML X file of arounnd 2 27GB, which rennders it unsuitab ble for loading it i into memory by b thhe Java DOM M parser. SAX X parser API offers enhanced fu functionality whhile dealing with h large XML fiiles, but the hugge s size of the file also a made this method m unreliable. As a result, thhe s single XML file was subsequenttly divided into smaller text filees,

The alggorithm for the back b end can be summarized as:

625

a.

Read thee directory in whhich the files aree present

b.

Read thee file in which thhe data is presennt

c.

If new article a found, sennd the article to web w service

d.

Crreate fileno_vecctor consisting of domains annd toopics found in the document

e.

Calculate the pro oximities of thee new article annd v clusters thhe store it in file_prroximity. This vector filles according to their semantic metadata. m

f.

Calculate the docu ument vectors acccording to vecttor a tf-idf weights to the wordds sppace model and assign

g.

Iff more data preseent in the directoory, go to step b

o document caan be part of moore than one Figure 3 depicts that one cluster according to thhe nature of conntent present inn it. If three broad clusters c are created according too domains, then a document about ‘IBM’ will fall into 2 major clusters c – Techhnology and Industrry.

4.4 sIRch s – Proototype impllementation n sIRch is i an IR (informaation retrieval) system s which is designed to observee the results of the t algorithm prresented in the paper. p sIRch is a webb portal developped in JSP (Java Server Pages), Servlets S and HTML L. The system is deployed on Appache Tomcat 7..0.14 server. The basic interface of the t system lookss like (Figure 4):

The file_vectorr mentioned in step d stoores the domaain T innformation. It consists c of an ID field, and a ‘type’ field for f iddentifying whethher the entry is a domain or a toopic. A typical file fi v vector looks like (Figure 2) –

Figure 2. A file vector Figgure 4.sIRch in nterface

Some identifier details, S d which arre stored in the above file vectoor, a as follow are 4 – Games, Toopic 441

5. EX XPERIMEN NTAL RES SULTS

5 - Tile based board games, To 578 opic

For obbserving the results of ontological domainn clustering implem mented in sIRch,, query term ‘Teexas’ is chosen. The results for the particular queryy are shown in Fiigure 5:

1 – Sports, Dom 13 main 7 – Business Finnance, Domain ffile_vector captuures the multi-d dimensional sem mantic nature off a p piece of data innto a single vecctor. This type of representatioon g greatly helps in capturing c the natture of entire coontent into a singgle c compact equationn. Centroid based clustering C c is useed to create cluusters. Documennts h having overlappping domains or o topics are cllustered togetheer. C Centroid is decided by the most m common domains d or topiics p present in the connsidered data sett.

Figure 5. 5 Search resultts for 'Texas' Figure 3.. Possiible cluster forrmation by onttological domaain F c clustering

626

All the search results are assocciated to Texas in some or othher A m manner. For obseerving how the algorithm a works out, we select thhe ‘N Nearest Docum ment’ link of ‘George McGoovern presidentiial c campaign, 1972.’ The results of selecting s the cloosest document are a a shown (Figuree 6): as

6. RE ESULT OB BERVATIO ONALONGS SIDE AN EXISTING E SEMANTIC C WEB SEA ARCH ENG GINE In this section, searchh results from the t newly desiggned system sIRch are analyzed allongside an existing semantic web search f comparison provides an engine.. The web searcch engine used for option of retrieving siimilar results baased on the conntent of the given URL. The queery selected foor performing the test is q yields thhe following ressults in the ‘consteellations’. The query sIRch (Figure ( 7):

F Figure 6.Docum ments closest to ‘George McGovvern presidentiial gn, 1972.’ campaig All the retrieved results are very A y similar to the original o documeent o George McGoovern. For examp of ple, the topmost search result is of G Glen Casada, whho was also a po olitician of samee era. The fact thhat thhe document is the topmost seaarch results sugggests that the tw wo d documents havee semantically common domaain/topics. Closser innspection tells us u about the com mmon topics (Tabble 2):

Figure 7. Seearch results forr 'constellationss' All the search results are a either constelllations in themsselves or are related to star systemss in some manneer. The search result r ‘Argo Navis’ is selected for drilling down on o nearest docuuments. The results of nearest documents of Arrgo Navis are (Figure 8):

Table 2. Commoon domains/top T pics between ‘G George McGoverrn p presidential cam mpaign, 1972’ and ‘Glen Casad da’ Type

Terrm

Domaain

Politics

Topiic

United States

Both the domainns fall under thee ontological doomain of ‘Politiccs’ B a and the commoon topic betweeen the two is ‘United Statess’. A Algorithm for arrriving at search h results throughh the sIRch can be b s summarized as: a.

Read thhe query.

b.

Removve the stop words

c.

Stem thhe query

d.

in orderr Returnndocuments of [ w(k1)+ +w(k2)+…+w(kn n)], where k1, k k2 are thhe stemmeed keywords in n the query, and a w(k1) is thhe weightt of the keywo ord with respecct to a particullar docum ment.

e.

Offer suggestions of nearest docum ments along wiith every search s result.

Figurre 8. Nearest Documents of Arrgo Navis consteellationas per sIRch A currrent operating semantic s web search engine which w offers results based on semaantic analysis iss selected for observations.

627

The query entereed in the search engine was – ‘cconstellations likke T a argo navis’-

c also be caarried out for connstructing vectoors in shared Work could ontologgies [12].

8. ACKNOWLE EDGEMEN NTS We woould like to thankk the OpenCalaiis Web Service for f allowing us to exxtract rich semaantic metadata frrom a documentt. We would also likke to thank the Wikimedia dum mp service for providing p us with sample s data onn which the proposed system m could be experim mented.

9. RE EFERENCES [1] Tiim Berners-Lee and James Henndler; Scientificc publishing onn the Semantic web- Nature (22001) Volume: 410, Issue: 68832, Publisher: History H Today Lttd., Pages: 1023-1024 [2] G.. Salton, A. Woong, C.S. Yang ; A vector space model for auutomatic indexinng -Communications of the AC CM Volume 188 Issue 11, Nov. 1975 ACMNew w York, NY, USA A [3] Anndreas Hotho, Alexander Maaedche and Steeffen Staab: Onntology-based Text T Documennt Clustering - Intelligent Innformation Proceessing and Web Mining, Proceedings of the Innternational IIS: IIPWM'03 Conference held in Zaakopane (2003) , p. 451-452

om a web search h engine againsst Figure 9. Results obtained fro Argo Navis’ con nstellation similar site search for ‘A This is a screensshot (Figure 9) showing the firrst page of search T results. It can bee observed that the search engine is not able to s semantically undderstand the meeaning of the quuery.The query is u unambiguous abbout the fact th hat the ‘argo navis’ n refers to a c constellation annd the user intends i to seaarch for simillar c constellations. B the results generated But g are noot accurate. Apaart fr from the result of o ‘Carina’, non ne of the results are constellations s similar to ‘argo navis’. Whereass in the proposeed system’s resuult, a almost all the ‘neearest documentts’ are close to thhe actual intended results. Nearly all a the results aree related to constellation, and thhe i explanatory in nature. This type t of relation is tyype of content is p possible when the t data is clusstered accordingg to the semanttic m metadata. Precisiion in the resultss of sIRch can bee calculated as:

[4] Errnesto William De Luca annd Andreas Nüürnberger Im mproving Ontoloogy-Based Sensse Folder Classsification of Doocument Collecctions with Clusstering Methodss : Proc. of 2nnd Int. Workshoop on Adaptive Multimedia M Retrrieval AMR 20004, part of ECA AI 2004, (2004) [5] Naathalie Aussenaac-Gilles & Josiane Mothe - Onntologies as Baackground Know wledge to Explore Document Collections: RIIAO, page 129-1142. CID, (2004)) [6] Xiiaodan Zhang, Liping Jing, Xiaohua X Hu, Michael M Ng, Xiiaohua Zhou- A Comparative Study of Ontology Based Teerm Similarity Measures M on PubbMed Documennt Clustering :D DASFAA, volum me 4443 of Leecture Notes inn Computer Sccience, page 1155-126. Springer, (2007)

Precision = | {R P Relevant Documeents}∩ {Retrievved Documents}| / |{{Retrieved Docuuments}| = 6/7 (thee document titled ‘apparent maggnitude’ can be termed as an irrelevant resultt) ~ 86%section. D Difference in datta sets of two alg gorithms may bee a possible reasoon f variance in reesult sets. for

[7] Oppen Calais http:///www.opencalaais.com. [8] Evvolutionary Algorithms htttp://en.wikipediaa.org/wiki/Evoluutionary_algorithhms.

-

[9] W Wikipedia Database dumps htttp://dumps.wikim media.org/backuup-index.html

-

[10] WP2TXT W - http:///wp2txt.rubyforgge.org/.

7 CONCLU 7. USION

[11] Deebajyoti Mukhhopadhyay, Anirban A Kunduu, Sukanta Siinha; Introducingg Dynamic Rannking on Web-ppages based onn Multiple Ontollogy supported Domains; D Lectuure Notes in Coomputer Sciennce Series, Sppringer-Verlag, Germany; Feebruary 15-17, 2010; LNCS 5966 pp.104-109

The algorithm described in th T he paper proviides an alternaate a approach to clusster documents according a to thee meaning of theeir c content. The prroposed system demonstrates how h the existinng s search techniquees can be impro oved by incorpoorating ontologiccal d domain clusterinng techniques in n them. Analyzzing the nature of d would play a pivotal role in data n developing seaarch techniques in S Semantic web. The T proposed system s shows how h the semanttic m metadata can bee exploited to cluster c data andd ultimately offfer relevant search results. The algorithm a can be improved by b d designing more innovative i wayss to assign weigghts to ontologiccal d domain coefficieents. Extraction of semantic dataa is a crucial steep, a refinement in and i its methods would w consequenntly lead to bettter c clustering algoritthms. The rankiing algorithm caan also be refined b adding suppoort for multiple ontology suppoorted domains [111]. by (i.e. apart from the t one which is being used forr clustering in thhe [ p proposed system m, i.e., ontology y developed byy OpenCalais [3] )

[12] Peepijn R. S. Vissser and Valenntina A. M. Taamma - An Exxperience withh Ontology Cllustering for Information Inntegration: Intelliigent Informatioon Integration, voolume 23 of CE EUR Workshop Proceedings, (1999) [13] W Wu, C., Potdar, V., Chang, E., E 2008. Latennt Semantic Annalysis – The Dynamics off Semantic Web W Service Diiscovery. In: T. Dillon, Chaang, E., R. Meeersman, K. Syycara eds. 20088. Advances in Web Semantics I. LNCS. Beerlin, Heidelbergg: Springer. pp. 346-373. 978-3--540-897835.

628

Suggest Documents