Int. J. Intelligent Information and Database Systems, Vol. 6, No. 5, 2012
Fast algorithm for assessing semantic similarity of texts

Andrzej Siemiński

Institute for Informatics, Technical University of Wrocław,
Wybrzeże Wyspiańskiego 27, 53-370 Wrocław, Poland
E-mail:
[email protected]

Abstract: The paper presents and evaluates an efficient algorithm for measuring the semantic similarity of texts. Calculating the level of semantic similarity of texts is a very difficult task, and the methods proposed so far suffer from high computational complexity. This substantially limits their application area. The proposed algorithm tries to reduce the problem by merging a computationally efficient statistical approach to text analysis with a semantic component. The semantic properties of text words are extracted from the WordNet lexical database. The approach was tested using WordNets for two languages: English and Polish. The basic properties of this approach are also studied. The paper concludes with an analysis of the performance of the proposed method on a sample database and suggests some possible application areas.

Keywords: text similarity measures; synsets; NLP; WordNet; two-layer retrieval; user dividend.

Reference to this paper should be made as follows: Siemiński, A. (2012) ‘Fast algorithm for assessing semantic similarity of texts’, Int. J. Intelligent Information and Database Systems, Vol. 6, No. 5, pp.495–512.

Biographical notes: Andrzej Siemiński is an Assistant Professor in the Department of Information Systems, Institute for Informatics, Technical University of Wrocław. His research interest covers various low- and high-level methods for decreasing browser latency. They include internet traffic analysis, smart buffers and mining the content of local buffers. Recently, his research interests shifted to the subject of using natural language analysis to describe user interest profiles and to boost the performance of focused crawlers. He received his PhD in Computer Science from the Institute of Cybernetics, Technical University of Wrocław.
1 Introduction
The majority of natural language processing systems that are used in practice rely on purely statistical methods. The processing can be roughly divided into two stages. The first one consists in finding the similarity of individual words and the second calculates the similarity of texts. The first stage is not complex. The words are identical if they are spelled in the same way or if they can be transformed into an identical form by a simple stemming algorithm. The second stage usually involves the well-proven tf-idf algorithm.
The statistical methods are computationally efficient but are incapable of handling such basic language features as word synonymy. In order to go beyond the limitations of classical systems, text semantics has to be taken into account. In the semantic approach, the processing is much more complex. Even the first stage includes complex operations such as word tagging, disambiguation or the computation of word similarity based on meaning. The second stage is even more intricate. All this results in the large computational complexity of the semantic approaches developed so far, which certainly limits their application area.

The main aim of the paper is to propose an extension of the statistical methods with a semantic component. The algorithm augments the input text with word semantic features in a relatively simple way. In the next step, the standard, well-developed statistical methods are used. As a result, the processing should be much faster than in the classical semantic methods. The application areas for such a method are numerous. In the paper we propose to use it for two-layered text retrieval. The first layer is a generic retrieval system and the second layer is responsible for re-ranking the results obtained by the first layer and for assisting the user in their evaluation.

The paper is organised as follows. Section 2 introduces the concept of two-layered retrieval. Its prerequisite is an efficient algorithm for measuring the semantic similarity of texts. The algorithm proposed in the paper uses the WordNet database and therefore Section 3 describes the way in which a WordNet database can be used to assess the semantic similarity of single words. Two WordNets, for English and Polish, were used. Section 4 elaborates on the subject of measuring the similarity of texts. It covers both statistical and semantic approaches and ends with the presentation and evaluation of the proposed SynPath algorithm. Section 5 describes the way in which the test data were prepared and processed and analyses the obtained search results. In Section 6, new research areas are envisaged.
2 Two-layered retrieval
Generic internet search engines such as Google are unsurpassed in scanning billions of web pages and are easily capable of selecting thousands of pages that are relevant to a user's query. The ranking of the retrieved pages is therefore absolutely crucial. A generic search engine could and should do it in an objective manner. The most notable achievement in that field, PageRank (Brin and Page, 1998; Del Corso et al., 2005), does it in a very ingenious way. The ranking algorithm does not rely on the usually insignificant differences in statistical similarity but instead exploits a valuable source of information: the link structure. It harvests the work of thousands of web masters from all over the world. They analyse the quality of available web pages and provide links to those pages which they find most useful or important. The work is done in a decentralised manner, the page evaluation is performed by human experts and, notably, it is done for free.

The objectivity of the ranking is, however, a mixed blessing. On the one hand, it gives preference to pages widely regarded as most useful but, on the other hand, the ranking is not adapted to the needs of a particular user. To assist a user in the final page selection, text snippets containing words from the query are displayed. This is certainly not satisfactory. Far more useful would be the presentation of self-contained text fragments, such as sentences or paragraphs, that are most pertinent to the user's interests.
In a word, such engines are best suited to satisfying the needs of a casual user who seeks an answer to an ad hoc defined, precise question. For such informational questions (Broder et al., 2007), the pages below the first or second position on the retrieval list are simply as good as non-existent. Scientists, students or journalists who need to extend their knowledge of a subject they are more or less familiar with cannot be satisfied with such a way of operation. They need a whole set of answers and they have to evaluate their usefulness quickly. Such users browse the search results and could benefit from the two-layered retrieval presented in the paper.

The first layer is a classical web search. The search may use the interface of the web search engine itself or, preferably, a web service offered by the engine; e.g., the Google company has published the Google web toolkit. Such tools enable developers to build and optimise browser-based applications. The ranking of pages reflects generic page properties: the trustfulness of the host server or text similarity to the user query. Albeit using sophisticated algorithms, it follows the ‘one size fits all’ principle.

The aim of the second layer is to personalise the first-layer output and make it more readable for the user. The processing is done on a local workstation, and therefore both the knowledge of user preferences and the available processing power can easily surpass what a generic search engine can offer to a single user. The layer can fully exploit the so-called user dividend (Siemiński, 2004) – a set of features that can be utilised on a local workstation but are not available to a generic search engine. The scope of actions performed at the second layer includes:
• filtering out unnecessary pages
• prefetching retrieved pages
• displaying meaningful text fragments having the highest similarity to the user information needs
• employing advanced lexical, syntactic or semantic text processing
• modifying the user interest profile
• processing of the obtained results.
The filtering out of pages applies to pages that were already retrieved and evaluated by the user and to pages originating from inactive or disdained servers. The remaining pages have a chance of containing information items that are new to the user. Prefetching means downloading pages before they are presented to the user. It has two advantages: it improves the browsing response time (a page is shown almost immediately) and, far more importantly, the pages are available for further processing as described below. Generic search engines display only a text snippet containing one of the query terms. The second layer could display a sentence, a sequence of nearby sentences or even a whole paragraph that is most similar to the query. The possibility to jump from one such text fragment to another could also prove useful. Having such detailed information, a user is better equipped to assess the usefulness of a page. The second layer can exploit the free computational power of the user's workstation and apply it to a relatively small amount of data. Local processing of texts makes possible a wide variety of activities, starting with substituting words by their base forms, syntactic
tagging or exploiting semantic similarity. The text processing could be much more detailed than in the case of a generic search engine, which has to serve the requests of many users.

Initially, a user interest profile consists of his/her query. Users prefer to formulate their queries in a simple manner and are generally unwilling to formulate sophisticated ones. This is a widely known phenomenon. Many factors contribute to it, among them:
• the lack of familiarity with the advanced search options of search engines
• the rigid nature of the options, which are unable to catch the subtle nuances of user information needs
• the unwillingness to type a long query, or the unacceptably long time it takes.
The second layer is capable of inferring user preferences by analysing the user's behaviour and of refining the query in an iterative mode. An exceptionally long time spent on reading a paragraph, or specific user actions such as downloading a page, saving it on a local system or printing it out, clearly indicate a profound interest in the page, and therefore its content could be used to augment the query in a manner similar to that proposed in Cox (1994, 1995). Such an incremental query formulation is somewhat analogous to the drill-down approach to query formulation known from the database world.
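To make the division of labour concrete, the sketch below outlines a possible second-layer loop in Python. It is only an illustration of the architecture described above, not the implementation used in the paper; the helpers web_search, fetch_page, local_similarity and the user_profile object are hypothetical names supplied by the caller.

```python
# Minimal sketch of a second-layer re-ranker (all helper names are hypothetical).
# The first layer is any generic web search API; the second layer runs locally.

def second_layer(query, user_profile, web_search, fetch_page, local_similarity):
    """Re-rank first-layer results using a locally maintained user profile."""
    results = web_search(query)                      # first layer: generic ranking

    # Filter out pages already seen or coming from servers the user disdains.
    fresh = [r for r in results
             if r.url not in user_profile.visited
             and r.host not in user_profile.blocked_hosts]

    # Prefetch pages so that full texts are available for local processing.
    pages = [(r, fetch_page(r.url)) for r in fresh]

    # Score every paragraph against the profile (the query plus accepted
    # fragments) and keep, for each page, its best-matching fragment.
    rescored = []
    for r, text in pages:
        best = max(((local_similarity(p, user_profile.interest_text), p)
                    for p in text.split("\n\n")), default=(0.0, ""))
        rescored.append((best[0], r, best[1]))

    # Personalised ranking: highest local similarity first.
    rescored.sort(key=lambda item: item[0], reverse=True)
    return rescored
```

The key design point is that the re-ranking and fragment selection happen entirely on the user's workstation, so the first layer is never modified.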
3 WordNet
A WordNet is a comprehensive dictionary and lexical database for a language. In computer science, it can be interpreted and used as a lexical ontology. It includes nouns, verbs, adjectives and adverbs. Its elementary building block is a synset – a set of words or collocations (word sequences) that are all mutually synonymous. In some, but not all, WordNets the meaning of a synset is further clarified by a gloss. Synsets are connected to other synsets via a number of semantic relations; e.g., the noun synsets are connected by five relations: hypernym, hyponym, coordinate term, holonym and meronym. In the study, the WordNets for English and Polish were used. It should be noted that the two languages have a different nature. The inflexion of English is not well developed and its sentences have a relatively rigid structure. In Polish, the sentence structure is far looser and the word meaning is largely conveyed by the rich inflection.
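For readers unfamiliar with this structure, the short sketch below inspects synsets and the hypernym relation through the NLTK interface to the Princeton WordNet. This toolkit is used here only for illustration; it is not the software employed later in the experiments.

```python
# Sketch: inspecting synsets and the hypernym relation with NLTK's WordNet
# interface (assumes the 'wordnet' corpus has been downloaded).
from nltk.corpus import wordnet as wn

# A single word form usually belongs to several synsets (senses).
for synset in wn.synsets("bank", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

# The hypernym chain of the first sense of 'dog' up to the root synset.
dog = wn.synsets("dog", pos=wn.NOUN)[0]
for path in dog.hypernym_paths():
    print(" -> ".join(s.name() for s in path))
```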
3.1 Princeton WordNet

The Princeton WordNet (PWN) is commonly used as a reference for other wordnets and for other wordnet-related activities. It was created and is being maintained at the Cognitive Science Laboratory of Princeton University (Miller et al., 1990). It is the first WordNet ever created. The PWN began as a psychological experiment that aimed at explaining how lexical meaning is stored in the mind and at shedding light on the acquisition of lexical meaning by children. The PWN has a number of attractive features:
• comprehensive coverage of the English language
• fine granularity of the synsets
• the database and software tools have been released under a BSD-style license and can be downloaded and used freely.
WordNet does not include information about etymology, pronunciation or the forms of irregular verbs. It contains only limited information about word usage. Some researchers complain that the glosses are too short.
3.2 plWordNet

The success of the PWN has inspired similar work in other countries. In 1996, the EuroWordNet (EWN) project was initiated (Vossen, 2002). The EWN project was aimed at developing WordNets for a number of European languages. All of them were mutually aligned via a mediating mapping into an inter-lingual index. The work on plWordNet (the WordNet for Polish) did not start until 2005. The long-term goal was to develop a valuable lexical resource for Polish and therefore the emphasis was on its trustworthiness. Such an approach requires an extensive involvement of human experts, which is time consuming and costly (Piasecki et al., 2009). As a result, the coverage of the language is not as comprehensive as it is for English. The plWordNet shares its main concepts with the PWN but there are some differences. The plWordNet does not contain glosses. Unlike the PWN, the plWordNet does not have a single top node for the hyponym hierarchy of nouns. The WordNets differ also in the way their databases are implemented. The plWordNet uses an XML database. This enables the developers to easily modify and restructure it. The price to be paid is a relatively poor performance: it is considerably slower than the PWN, which uses a proprietary database.
4 Text similarity
The proposed method augments the classical tf-idf algorithm that is described below.
4.1 Statistical methods

The vector space model proposed originally by Salton et al. (1975) is probably the most successful model used in text retrieval. In the approach, the texts are represented as vectors, with each dimension corresponding to a separate term, so for the text $t_j$:

$$t_j = (w_{1j}, w_{2j}, \ldots, w_{nj})$$
The word term is misleading. A single term could represent many derived forms. In traditional indexing, individual words are often augmented by longer phrases such as collocations. The algorithm proposed here goes even further: the ‘words’ are artificial entities – synset identifiers taken from a WordNet.
The values $w_{xj}$ are known as weights. The weights are ≥ 0 and they represent the usefulness of the term x for the description of text j. In general, the formula for the calculation of $w_{xj}$ should satisfy two requirements:

• the more often a term appears in a text, the better it represents its content
• the more texts contain a term, the less useful it is for text selection.
Both requirements are met by the tf-idf weighting scheme:

$$w_{t,d} = tf_{t,d} \cdot \log \frac{|D|}{|\{d_x \in D \mid t \in d_x\}|} \qquad (1)$$

where

• $tf_{t,d}$ is the frequency of term t in text d
• $|D|$ is the number of all texts
• $|\{d_x \in D \mid t \in d_x\}|$ is the number of texts that contain the term t.

As a consequence, frequent terms appearing in every text have a weight equal to 0 no matter how often they appear in a text. The highest weight values are for terms appearing frequently in a few texts. The first component of the product in formula (1) gives preference to long texts. Texts can have drastically different lengths, as in the case of a short query and a long document. Therefore, the text similarity measure should not use the raw weight values but rather their relative distribution, and so the cosine measure is used:
$$sim(t_j, t_k) = \frac{\sum_{i=1}^{N} w_{ij} \cdot w_{ik}}{\sqrt{\sum_{i=1}^{N} w_{ij}^{2}} \cdot \sqrt{\sum_{i=1}^{N} w_{ik}^{2}}} \qquad (2)$$
Over the years, the tf-idf weighting schema with the cosine similarity measure has proved to be highly useful. It is based on a simple theoretical model, makes it possible to rank texts according to their similarity and corresponds well to the human notion of similarity. A number of advanced, high-performance software products have evolved, notable among them the Apache Lucene search engine (Gospodnetic et al., 2009). The main disadvantage of the method is that it is purely statistical: it treats terms as separate entities and does not take into account the semantic relations that exist between them.
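A compact sketch of formulas (1) and (2) is given below. It is a plain Python illustration of the weighting and the cosine measure, not the Lucene-based machinery used in the experiments; the sample documents are invented.

```python
# Sketch of the tf-idf weighting (formula 1) and cosine similarity (formula 2).
import math
from collections import Counter

def tf_idf_vectors(texts):
    """texts: list of token lists; returns one weight dictionary per text."""
    n_docs = len(texts)
    doc_freq = Counter(term for text in texts for term in set(text))
    vectors = []
    for text in texts:
        tf = Counter(text)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = math.sqrt(sum(w * w for w in v1.values())) * \
           math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

# Invented toy documents; a term present in every document gets weight 0.
docs = [["financial", "crisis", "bank"], ["debt", "crisis"], ["football", "match"]]
v = tf_idf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))   # non-zero vs. zero similarity
```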
4.2 Word similarity

The non-semantic word similarity measures have only a limited influence on retrieval efficiency. Even in non-flectional languages a word can take more than one form. In computer science, regular expressions are used to define a set of similar, equivalent strings. Regular expressions are popular in the database world and they are used, e.g., by the LIKE clause of the SQL language. They are not satisfactory for text retrieval due to their poor performance: they do not support indexing and therefore require sequential text processing. Moreover, for a flectional language, it is not even possible to write a regular expression that covers all equivalent forms of a word. Another alternative is
the Levenstein (1966) distance. It is a perfect tool for spelling correction or for providing suggestions to replace misspelled words in a user query. Applying it to the search process, however, hampers retrieval performance even more than regular expressions do. The semantic similarity of words has been the subject of many research projects in the area of artificial intelligence for some 40 years. The traditional approaches rely heavily on statistical data about the co-occurrence of words. This was caused by two factors: the abundance of textual data and the availability of statistical methods and tools. The creation of WordNet enabled researchers to develop methods that exploit the relations between synsets or that combine both approaches.

The similarity is defined only for nouns and, from the wealth of relations that are available, only the hyponym relation is taken into account. The same word form often belongs to many synsets. Finding the proper synset for a given context is the task of word sense disambiguation, which is a complex problem on its own. The tagger used for the Polish language provided a fairly effective word disambiguation algorithm. In the case of the English language, the most common meaning was used. This is a simplified approach but it was also taken by other researchers (Mihalcea and Moldovan, 1998). In what follows, only the hyponym relation is used. This simplifies the WordNet structure to a tree, as shown in Figure 1. In order to make it more readable, the synset identifiers are replaced by the words making up the synset.

Figure 1 The hyponym structure of the PWN
Let

• len(s1, s2) denote the length of the path between two synsets, that is, 1 + the minimal number of nodes separating the synsets s1 and s2
• ca(s1, s2) denote the most specific common abstraction (MSCA) that subsumes s1 and s2
• hypo(w) denote the number of hyponyms of word w.
Over the years, researchers have proposed many similarity measures; a handful of them is presented below. The first four measures, Simlen, Simlc (Leacock and Chodorow), Simwp (Wu and Palmer) and Simsvh (Seco, Veale and Hayes), are strictly structural.

$$Sim_{len}(s1, s2) = \frac{1}{len(s1, s2)}$$

$$Sim_{lc}(s1, s2) = -\log \left( \frac{len(s1, s2)}{2 \cdot D} \right)$$

where D is the maximal depth of the WordNet taxonomy; for the PWN it is usually set to 16.

$$Sim_{wp}(s1, s2) = \frac{2 \cdot len\left( ca(s1, s2) \right)}{len(s1) + len(s2)}$$

The Simsvh measure is more complex. It requires finding the synsets both below and above a given synset in the hierarchy:

$$Sim_{svh}(w1, w2) = 1 - \frac{ic(w1) + ic(w2) - 2 \cdot Sim_{res}(w1, w2)}{2}$$

where

$$ic(w) = 1 - \frac{\log \left( hypo(w) + 1 \right)}{\log \left( max_{wn} \right)}$$

and

$$Sim_{res}(w1, w2) = \max_{c \in S(w1, w2)} \left( ic(c) \right)$$

where S(w1, w2) is the set of all nodes common to w1 and w2.

The next two measures, proposed by Lin (Simlin) and by Jiang and Conrath (Simjr), combine the structural component with a statistical one. Let p(w) denote the probability of occurrence of the word w.

$$Sim_{lin}(s1, s2) = \frac{\log \left( p\left( ca(s1, s2) \right) \right)}{\log \left( p(s1) \cdot p(s2) \right)}$$

$$Sim_{jr}(w1, w2) = \log \left( \frac{p\left( ca(w1, w2) \right)^{2}}{p(w1) \cdot p(w2)} \right)$$
The variety of measures is impressive. However, all of them require finding the path between two nodes, that is, searching the hyponyms and hypernyms of a synset. In all cases, several accesses to the database that represents the WordNet structure are necessary. This could pose a serious problem for WordNet implementations, especially for those which use an intrinsically slow XML database.
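For illustration, the structural measures can be explored with NLTK's WordNet interface. The built-in path_similarity, lch_similarity and wup_similarity are close counterparts of Simlen, Simlc and Simwp, though they may differ in small details such as how path length is counted; the sketch below is therefore only an approximation of the formulas above, not the implementation used in the paper.

```python
# Sketch: structural similarity of two noun synsets via NLTK's WordNet
# interface. The built-in measures are close counterparts of Sim_len,
# Sim_lc and Sim_wp (details such as path-length counting may differ).
from nltk.corpus import wordnet as wn

car, bus = wn.synset("car.n.01"), wn.synset("bus.n.01")

print("path (~Sim_len):", car.path_similarity(bus))
print("lch  (~Sim_lc): ", car.lch_similarity(bus))
print("wup  (~Sim_wp): ", car.wup_similarity(bus))

# The most specific common abstraction (MSCA) that subsumes both synsets:
print("MSCA:", car.lowest_common_hypernyms(bus))
```

Each of these calls has to traverse the hypernym hierarchy, which is exactly the database access cost discussed above.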
4.3 Text similarity

The next step is calculating the text similarity. The first method uses only word similarity. Let match(t1, t2) denote the sum of similarities of all matching words from the texts t1 and t2 and |t| denote the length of text t. The similarity is calculated according to the formula:

$$Sim_{base}(t1, t2) = \frac{2 \cdot match(t1, t2)}{|t1| + |t2|}$$
The second method, Simcms, proposed in Mihalcea et al. (2006), assigns similarity in a manner more akin to human judgment, but its complexity is even greater:

$$Sim_{cms}(t1, t2) = 0.5 \cdot \left( \frac{\sum_{w \in \{t1\}} mSim(w, t2) \cdot idf(w)}{\sum_{w \in \{t1\}} idf(w)} + \frac{\sum_{w \in \{t2\}} mSim(w, t1) \cdot idf(w)}{\sum_{w \in \{t2\}} idf(w)} \right)$$
where mSim(w, t) is the maximum similarity of word w to any of the words from text t. A slightly less complex version of the formula was presented in Siemiński (2009). That version did not include the idf factor but it still required the calculation of the mSim function. The calculation of word similarity is in itself time consuming. The performance problem is even more acute for the calculation of text similarity, as it requires the comparison of m * n term pairs, where m and n represent the numbers of words in the compared texts. In an attempt to reduce the problem, in Siemiński (2009) the similarity of the most frequent synset pairs was cached in an additional database. To reduce the number of accesses to the WordNet database even further, yet another database was created; it contained, for each synset, the sequence of all synset identifiers connecting the synset with the root synset. Such performance measures make it possible to assess the similarity of relatively short texts, e.g., two sentences or link texts, in real time. Calculating the similarity of longer texts, which is necessary for the two-layered retrieval, is not feasible in this way and calls for another solution.
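A direct and deliberately unoptimised sketch of Simcms is shown below. The word-level similarity and the idf values are passed in as functions (they would come, e.g., from a WordNet measure and corpus statistics), so the quadratic number of word-pair comparisons that motivates the SPI algorithm is plainly visible.

```python
# Sketch of the Sim_cms text similarity measure; word_sim and idf are supplied
# by the caller (e.g. a WordNet-based word similarity and corpus statistics).
def m_sim(word, text, word_sim):
    """Maximum similarity of 'word' to any word of 'text'."""
    return max((word_sim(word, w) for w in text), default=0.0)

def sim_cms(t1, t2, word_sim, idf):
    def directed(a, b):
        num = sum(m_sim(w, b, word_sim) * idf(w) for w in a)
        den = sum(idf(w) for w in a)
        return num / den if den else 0.0
    # Note the m * n word-pair comparisons hidden in the two directed sums.
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```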
4.4 Query modification

Query modification offers a possibility to include semantic data in the retrieval process without the burden of the extensive processing required by the methods discussed above. In the simplest case it takes the form of query expansion. The process superficially looks like a reverse of the process traditionally performed by librarians. A librarian uses a thesaurus to replace ascriptors (proper but not preferred terms) by their respective descriptors. In the query expansion process, the terms in a query are replaced by their synonyms. Although the processes work in opposite directions, that is expansion vs.
contraction, the end result is the same: the retrieval system is capable of finding a text no matter which semantically equivalent term is used. Such an approach has, however, two drawbacks. Firstly, the term replacement requires control over query processing, which excludes publicly available web search engines such as Google, as we do not control the process. Secondly, it does not solve the problem of capturing the semantic similarity of terms belonging to different synsets.

The second problem was partially eliminated by Mihalcea as early as 2002. The wildcard denoted by # acts in a manner similar to the lexical wildcard, but at the semantic level, enabling the retrieval of subsumed concepts. During indexing, the terms are replaced by DD codes. A DD code is a sequence of synset identifiers that are on the path from an indexed term to the root of the synset hierarchy. This is somewhat similar to the solution proposed in this paper but, in apparent contrast, it requires IRSLO – a dedicated search engine developed to handle the operator. The solution proposed here works with highly optimised, public-domain search engines such as Lucene.

In still another approach, the WordNet synsets are annotated with subject field codes (SFC) (Magnini and Cavaglià, 2000). They are sets of words relevant for a specific domain and are considered to be highly useful for many NLP tasks such as information retrieval (IR), word sense disambiguation (WSD) and text classification. An SFC may include synsets of different syntactic categories: for instance, MEDICINE groups together senses from nouns, such as doctor#1 and hospital#1, and from verbs such as operate#7. It is also possible that an SFC contains senses from different WordNet sub-hierarchies. This is certainly a promising approach but it was verified only for the text classification task and not for retrieval.
4.5 Synset path indexing algorithm

The synset path indexing (SPI) algorithm addresses the performance problem in another manner. The SPI replaces each term t by the identifier of a synset to which it belongs and by the identifiers of all its hypernym synsets. Figure 2 shows an exemplary WordNet structure. Any term that belongs to the s1211 synset is replaced during indexing by a sequence of synset identifiers: s1211, s121, s12 and s1. Note that after this process the different levels of similarity between s1211, s1212 and s131 are clearly apparent. Contrary to the query expansion approach, the SPI captures the semantic similarity between terms belonging to different synsets. Moreover, the number of common elements in the added sets is proportional to their semantic similarity; e.g., s1211 and s1212 have three common elements whereas s1211 and s111 have just one. The assigning of synpath identifiers is relatively simple. The indexing is performed only once, while integrating a text into the database. From that moment on, the well-established algorithms and products for statistical text processing can be used. This is a notable difference from the solution proposed in Mihalcea (2002). Usually, the word similarity methods rely on hyponym/hypernym relations and therefore they are used only for nouns. This limitation does not apply to the SPI algorithm. It turned out during the experiment that adding the synset identifiers for adjectives and adverbs substantially increased the retrieval precision.
Figure 2 Exemplary WordNet structure
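A minimal sketch of the SPI indexing step is shown below, again using the NLTK interface to the Princeton WordNet as a stand-in for the tools used in the experiments. Each noun is expanded into the identifier of its (most common) synset and the identifiers of all its hypernyms; the resulting token stream can then be fed to any statistical tf-idf engine such as Lucene.

```python
# Sketch of SPI indexing: every indexed noun is expanded into the identifiers
# of its synset and of all hypernym synsets up to the root. The most common
# sense is taken, as in the English experiments reported in the paper.
from nltk.corpus import wordnet as wn

def synpath_tokens(nouns):
    tokens = []
    for noun in nouns:
        senses = wn.synsets(noun, pos=wn.NOUN)
        if not senses:
            tokens.append(noun)            # keep out-of-vocabulary words as-is
            continue
        synset = senses[0]                 # most common sense
        path = synset.hypernym_paths()[0]  # one root-to-synset chain chosen
        tokens.extend(s.name() for s in reversed(path))
    return tokens

# Related nouns now share part of their index tokens, which the ordinary
# tf-idf cosine measure turns into a non-zero semantic similarity.
print(synpath_tokens(["crisis"]))
print(synpath_tokens(["depression"]))
```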
The proposed indexing schema introduces many common synset identifiers. It could adversely affect the precision of retrieval. Therefore, in what follows we discuss the impact of the SPI on text similarity as measured by the tf-idf algorithm. We compare the tf-idf measure computed for three indexing schemas:

• IdxTerm: each individual term is indexed (indexing of basic term forms)
• IdxSynSet: each term is replaced by the identifier of its synset (query expansion)
• IdxSynPath: each term is replaced by a sequence of synset identifiers produced by the SPI algorithm.
The real-life WordNet structure and the term frequency-of-occurrence pattern are very complex. Therefore, the theoretical analysis makes several simplifying assumptions:

• The WordNet has a regular structure and at each level (excluding the last level of the so-called leaf synsets) a synset has exactly b branches.
• Texts consist of only one term and the term appears only once in the text. This is not a serious restriction, as the value of the similarity is just the sum of the values for individual terms calculated separately. Moreover, many queries contain just one term.
• Each leaf synset appears only once in the text database.
• A term belongs to only one leaf synset.
• The base of the logarithm used in equation (1) is equal to b.
Let

• lmax denote the level of the leaf synsets
• l(s) denote the level of synset s
• synSet(t) denote the synset of term t.
Taking into account the above assumptions:

$$|D| = b^{\,l_{max} - 1}$$

and the weight of a term t that belongs to the synset s and appears in text d is equal to:

$$w_{s,d} = 1 \cdot \log_b \frac{|D|}{b^{\,l_{max} - l(s)}} = \log_b b^{\,l(s) - 1} = l(s) - 1$$
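As a small sanity check of the derivation, take the illustrative values b = 2 and lmax = 4 (these numbers are hypothetical and serve only as an example). A synset at level 3 then subsumes $2^{4-3} = 2$ leaf synsets, so it occurs in 2 of the texts:

$$|D| = b^{\,l_{max}-1} = 2^{3} = 8, \qquad w_{s,d} = \log_{2}\frac{8}{2^{\,4-3}} = \log_{2}4 = 2 = l(s) - 1 \quad \text{for } l(s) = 3.$$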
So the weight of the root synset has the value of 0 and the deeper a synset is located in the WordNet structure, the higher its weight. The synsets located at the top of the hierarchy do not harm precision much, as their weights are small. The top-level synset identifiers are so common that they do not influence the search precision or recall. They are like stop words in traditional indexing. The first text retrieval systems used medium-sized lists of stop words, which later grew in size, but nowadays many systems do not use them at all (Manning et al., 2008).

Let us suppose that the query consists of only one term q and that simY denotes the value of the cosine similarity for the indexing schema Y. The value of simIdxTerm(t, q) is > 0 only if q occurs in t; in all other cases, simIdxTerm(t, q) = 0. If more than one term belongs to the synset, then simIdxTerm(t, q) > simIdxSynSet(t, q). The value of simIdxSynSet(t, q) is > 0 only if t and q belong to the same synset; as in the previous case, in all other cases simIdxSynSet(t, q) = 0. The value of simIdxSynPath(t, q), where subSume(t, q) denotes the most specific synset subsuming the synsets of t and q, is equal to:

$$sim_{IdxSynPath}(t, q) = \frac{l\left( subSume(t, q) \right) - 1}{2} \cdot l\left( subSume(t, q) \right)$$
The IdxTerm schema offers the maximal precision at the expense of recall. The IdxSynSet schema lowers the similarity value and slightly increases the recall. The WordNets attempt to describe the semantics of a language in a very precise manner. Therefore, the synsets are fine grained and they often contain a main term and several rarely encountered synonyms. One can thus presume that the difference between the values of IdxTerm and IdxSynSet should not be great. On the other hand, the IdxSynPath indexing schema offers a far greater increase in recall. Its influence depends on the depth of the used WordNet database.
5 Experiment
5.1 Input data processing

The test data was collected from three English and three Polish RSS channels over the same period of five days. The idea was to collect data on the same subjects originating from different news providers and to gather texts describing the same events in a variety of styles. In previous studies (Siemiński, 2009), the data used in experiments was extracted from WWW news services. The reasons for choosing RSS channels instead of newspaper services were twofold. The first one is that RSS data is more easily parsed than newspaper data, as it contains raw text without the navigational or commercial components that are difficult to eliminate (Kovacevic et al., 2002). The second is the quality of the data
itself. Newspaper data often suffers from exaggerated expressiveness. Often the aim of a newspaper text is to catch the attention of a reader and not to convey information.

In the experiment, individual sentences were retrieved. The test set consisted of 4,890 English sentences with 142,116 words and 11,760 Polish sentences with 184,524 words. In both cases, the sentences have considerable length, longer than sentences in casual language. The Polish sentences were on average 16 words long whereas the English sentences were almost twice as long. The difference between the languages can to a great extent be attributed to the articles that are so common in the English language; there are no articles in Polish.

The processing of the language data consisted of several steps. During the first one, the data was converted to a common XML format, which included normalising the code page, necessary for the Polish channels. Additionally, the abbreviations were expanded. Each sentence was then converted into several forms with the following code names:
• raw: the original text
• base: words are replaced by their base forms
• nouns: only nouns in their base form
• synsetid: the synset identifiers of all disambiguated nouns and adjectives
• synpath: the synset identifiers of all adjectives and the identifiers of all nouns and their hypernyms.
The stop list, that is, a list of common words eliminated from the text, was not used. The attitude towards stop lists in information retrieval has undergone several changes (Manning et al., 2008). In the past, relatively short lists were replaced by lists consisting of several hundred items. Nowadays, the trend is to reject their usage altogether. The term weighting schema diminishes the importance of ‘buzz’ words in a more effective manner.

The transformations necessary to produce the forms mentioned above were naturally language dependent. For English, the identification of base term forms and sentence parsing was done with the tagger developed at Stanford University (Stanford.WWW) and the access to WordNet with the software developed at Princeton University (Princeton.WWW). In all cases, the most common synset was used. For the Polish language, the set of equivalent tools included Morfeusz, TaKIPI and plWordNet. The software to access the plWordNet was developed at the Technical University of Wrocław (PWR.WWW). It should be noted that the parser for Polish did disambiguate the word meanings. The pre-processing of Polish texts was much slower than that of English. This did not hamper the test, as it was done only once, at the initial stage of data indexing. For both languages, the searching routines were adopted from the library of search Java classes developed at Indiana University (Indiana.WWW).
5.2 Queries

In what follows, we present an analysis of the obtained search results. The analysis covers a set of only four queries, two for each language. The set is clearly not numerous but it consists of typical queries, a short and a long one for each language. Hopefully, the depth of the analysis compensates for the limited scope of the test.
The first, ‘short’ query was the phrase ‘financial crisis’ (‘kryzys finansowy’ in Polish). The ‘long’ query consisted of the three answers found by a human expert to be most relevant to the user's information needs. Such a formulation of the long query is in line with the two-layered retrieval. In each case, the answer set consisted of the top 15 sentences having the highest similarity level, provided its value exceeded 0.2. The similarity value was calculated using the cosine measure with the tf-idf weighting factor. The test covered all of the indexing schemas described in Section 5.1. As usual, it was assumed that all relevant answers are to be found in the set containing the texts retrieved by any of the indexing schemas. The set was evaluated by a human expert and the relevant sentences were extracted. The answers for each indexing schema were ranked according to the decreasing value of the tf-idf cosine similarity measure. All of the charts in the following sections depict the cumulative number of relevant sentences.
5.2.1 Short queries

For the English language, the number of relevant sentences was 17. All indexing schemas behaved in roughly the same manner, with the number of properly selected sentences ranging from 6 to 8. The recall was thus rather poor.

Figure 3 Cumulative number of relevant sentences for a short English query (see online version for colours)
The best overall performance was recorded for base form word indexing. The WordNet-based indexing schemas were not much worse. The worst performer was nouns indexing. The base form schema is ideally suited for a user who is interested in obtaining an answer to a specific question and does not care what particular text will be presented to him/her. No wonder it is used by popular search engines.

The test results for the Polish language are depicted in Figure 4. The recall is much better: the number of relevant sentences was 12 and the best indexing schema was capable of retrieving ten of them. The raw schema was the clear loser. Due to the rich inflexion, the raw schema is definitely the worst choice. Elimination of the flexional forms in the base schema resulted in high values for both recall and precision. The performance of the other variants of indexing is similar. As in the case of English, the base form schema is best suited to satisfying the needs of a user.
Figure 4 Cumulative number of relevant sentences for a short Polish query (see online version for colours)
5.2.2 Long queries

The performance of the various indexing schemas was drastically different when we moved from short to long queries. The long queries were a concatenation of three sentences regarded by a human expert as the best answers to the short question.

Figure 5 Cumulative number of relevant sentences for a long English query (see online version for colours)
The queries are far more specific, hence the total number of relevant texts was reduced to just ten. The raw and base schemas were able to identify only three texts. These were the texts that made up the query, so these schemas were not capable of providing the user with new, relevant texts. At the other end of the performance spectrum was the synPath schema. It retrieved nine out of ten relevant texts. Among the retrieved texts were sentences like:
• while an international monetary fund bailout would be one possible solution to the Greek debt problem European pride and politics stand in the way
• the markets drove up Greece's borrowing costs again an indication that the country's debt crisis was far from over.
The precision of the search is also acceptable. Notice the very poor performance of the base form schema: the clear winner for short queries is the undisputed loser for long queries. Similar results were obtained for Polish, see Figure 6. In that case, the total number of relevant sentences was eight.

Figure 6 Cumulative number of relevant sentences for a long Polish query (see online version for colours)
Using the raw and base schemas results in a high precision but unacceptably low recall. Once again, the synPath schema was a clear winner. Not only was it capable of finding nearly all of the relevant sentences (seven out of eight), but nearly all of them were also located at the top of its answer list. The synSet and nouns schemas behaved similarly. The poorest performance is also in this case recorded for the base form indexing schema.
6 Conclusions
Traditional methods of text indexing provide an excellent tool for dealing with specific, short, ad hoc formulated queries. The answer to such a query could be any text that is both relevant and comes from a reliable website. Such queries are mostly directed to web search engines and no wonder the engines have been optimised for handling them. For a research worker, that way of operation is not sufficient. The queries must be specified in a far more precise way and the search system should take into account previous user experience and expertise and assist him/her in evaluating the search results. The proposed two-layered retrieval system attempts to fulfil such needs. Its prerequisite is the availability of an effective text similarity measure. The proposed IdxSynPath indexing schema offers a fast and precise method for evaluating the semantic similarity of texts. The schema
augments the traditional tf-idf similarity measure with a semantic, WordNet-based component.

The tests were conducted on two different languages. The first one was English, an uninflected language with an exhaustive, fine-grained and multilevel WordNet. The second was Polish, which has a very rich inflexion and a less comprehensive WordNet database. The theoretical analysis shows that the loss in precision resulting from using the SynPath indexing schema should not be serious, while the recall increases substantially. The analysis was confirmed by an experiment which, albeit far from being extensive, also showed the usefulness of the fine granularity of a WordNet. For both languages, the usefulness of the SynPath algorithm for short queries was not very visible. The algorithm was, however, the winner in the case of long queries, and such queries are used in the two-layer retrieval discussed in the paper.

The English WordNet has the highest language coverage, but in recent years substantial work on creating WordNet databases for inflected languages has been done. Therefore, one of the prospective further research areas is testing more languages. The impact of using word disambiguating techniques is also to be evaluated. During the test, a rather sophisticated disambiguating method was used for Polish whereas for English the disambiguation method was very simple. The achieved results did not reflect the difference and that phenomenon requires further study.
References

Brin, S. and Page, L. (1998) ‘The anatomy of a large-scale hypertextual web search engine’, Proceedings of the Seventh International World Wide Web Conference, pp.107–117, Elsevier Science Publishers B.V.

Broder, A., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V. and Zhang, T. (2007) ‘Robust classification of rare queries using web knowledge’, SIGIR ‘07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Cox, K. (1994) ‘A unified approach to indexing and retrieval of information’, SIGDOC ‘94: Proceedings of the 12th Annual International Conference on Systems Documentation, Banff, Alberta, pp.176–181.

Cox, K. (1995) ‘Searching through browsing’, PhD thesis, University of Canberra, Australia.

Del Corso, G.M., Gullí, A. and Romani, F. (2005) ‘Fast PageRank computation via a sparse linear system’, Internet Mathematics, Vol. 2, No. 3, pp.251–273.

Gospodnetic, O., Hatcher, E. and McCandless, M. (2009) Lucene in Action, 2nd ed., Manning Publications, Shelter Island, NY, USA.

Indiana.WWW, available at http://www.informatics.indiana.edu/fil/is/JavaCrawlers/ (accessed on 5 May 2010).

Kovacevic, M., Diligenti, M., Gori, M. and Milutinovic, V. (2002) ‘Recognition of common areas in a web page using visual information: a possible application in a page classification’, Proceedings of 2002 IEEE International Conference on Data Mining (ICDM'02), pp.250–257.

Levenstein, B. (1966) ‘Binary codes capable of correcting deletions, insertions, and reversals’, Soviet Physics Doklady, Vol. 10, pp.707–710.

Magnini, B. and Cavaglià, G. (2000) ‘Integrating subject field codes into WordNet’, International Conference on Language Resources and Evaluation LREC2000.

Manning, C., Raghavan, P. and Schütze, H. (2008) An Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK.
Mihalcea, R. (2002) ‘The semantic wildcard’, Proceedings of the LREC 2002 Workshop on Using Semantics for Information Retrieval and Filtering: State of the Art and Future Research, May, Las Palmas, Spain.

Mihalcea, R. and Moldovan, D. (1998) ‘Word sense disambiguation based on semantic density’, Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, Canada.

Mihalcea, R., Corley, C. and Strapparava, C. (2006) ‘Corpus-based and knowledge-based measures of text semantic similarity’, available at http://www.cse.unt.edu/~rada/papers/mihalcea.aaai06.pdf (accessed on 5 May 2010).

Miller, G.A., Beckwith, R., Fellbaum, C.D., Gross, D. and Miller, K. (1990) ‘WordNet: an online lexical database’, Int. J. Lexicograph, Vol. 3, No. 4, pp.235–244.

Piasecki, M., Szpakowicz, St. and Broda, B. (2009) A WordNet from the Ground Up, Oficyna Wydawnicza Politechniki Wrocławskiej, Wrocław.

Princeton.WWW, available at http://wordnet.princeton.edu/ (accessed on 5 May 2010).

PWR.WWW, available at http://plwordnet.pwr.wroc.pl/wordnet/ (accessed on 5 May 2010).

Salton, G., Wong, A. and Yang, C.S. (1975) ‘A vector space model for automatic indexing’, Communications of the ACM, Vol. 18, No. 11, pp.613–620.

Siemiński, A. (2004) ‘The potentials of client oriented prefetching’, in Intelligent Technologies for Inconsistent Knowledge Processing, pp.221–238, Advanced Knowledge International, Magill, Adelaide, Australia.

Siemiński, A. (2009) ‘Using WordNet to measure the similarity of link texts’, Lecture Notes in Computer Science, Lecture Notes in Artificial Intelligence, Vol. 5796, pp.720–731.

Stanford.WWW, available at http://nlp.stanford.edu/software/tagger.shtml (accessed on 5 May 2010).

Vossen, P. (2002) ‘EuroWordNet general document version 3’, Technical report, University of Amsterdam.