Categorizing and Ranking Search Engine’s Results by Semantic Similarity Tianyong Hao, Zhi Lu, Shitong Wang, Tiansong Zou, Shenhua GU, Liu Wenyin Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Hong Kong, China

{tianyong, luzhi2}@student.cityu.edu.hk, [email protected], [email protected], [email protected], [email protected]

ABSTRACT


An automatic method for categorizing and ranking a search engine's results by semantic similarity is proposed in this paper. We first extract nouns and verbs from the snippets returned by the search engine, using Named Entity Recognition and part-of-speech tagging. A semantic similarity algorithm based on WordNet is proposed to calculate the similarity of each snippet to each of the pre-defined categories. A balanced ranking method that combines this similarity with Google's rank and the timeliness of the pages is proposed to rank these snippets. Preliminary experiments with 500 labeled questions from TREC03 show that 72.7% are correctly categorized.

We propose an automatic categorizing and ranking method in this paper to classify the results from a search engine (Google is used, and will be referred to, throughout this paper) into different categories according to their semantic similarity. The method takes the results that Google returns for the corresponding search keywords as input and extracts snippets from them. Each snippet is then classified into categories, defined in advance as different domains, by calculating the similarity between its meaningful words and the topics. WordNet [2] is employed as the lexical resource for similarity calculation. Another algorithm is proposed to rank the snippets by their similarities combined with both their cached time and their original rankings from the search engine. We implemented these methods in our system. Experiment results with 500 queries from the TREC03 test set [3], labeled manually by UIUC [4], show that on average 72.7% of the search results are correctly categorized. Therefore, we believe the method is capable of providing a better user experience than current search engines.

Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – information filtering, relevance feedback.

General Terms Algorithms, Design, Experimentation

The rest of this paper is organized as follows: Section 2 introduces related work. Section 3 presents the proposed method of categorizing and ranking snippets by semantic similarity in detail, including its implementation. Experiments and evaluations are shown in Section 4. Section 5 summarizes this paper and discusses future work.

Keywords Semantic similarity, categorizing, ranking, search engine

1. INTRODUCTION
Research on search engines has a long history in philosophy, psychology, and artificial intelligence, and various perspectives have been suggested by both academia and industry. Users can obtain many kinds of information associated with their keywords from current commercial search engines, such as Google [1]. However, the results of these search engines lack semantic relationships with each other, since they may be ordered by importance only. Furthermore, these results contain much redundant information, and it is always time-consuming for users to find the most relevant information they want. Hence, we propose an automatic categorizing and ranking system based on the calculation of the semantic similarity of these results to improve the user experience of search engines, especially in finding the most relevant information.

2. RELATED WORKS
Text similarity and text categorization have long been used in natural language processing applications and related areas [5]. For text similarity, the typical approach to calculating the similarity between two text segments is to use a simple lexical matching method and produce a similarity score based on the number of lexical units that occur in both input segments [6]. The major heuristic component of the Rocchio algorithm is TF-IDF, a word weighting scheme [7]. Improvements to these simple methods involve stemming, stop-word removal, part-of-speech tagging, longest-subsequence matching, and various weighting and normalization factors [8]. While successful to a certain degree, these lexical similarity methods cannot always identify semantic similarity. More recently, semantic similarity, either knowledge-based [9, 10] or corpus-based [11], has become more and more important. Lin & Hovy used semantic similarity for automatic evaluation of text summarization [12]. Query expansion is widely adopted to obtain the approximations in latent semantic analysis methods [13], as performed in information retrieval that measures the similarity of

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.


texts by exploiting second-order word relations automatically acquired from large text collections.
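As an illustration of the lexical-overlap baseline described above (not the method proposed in this paper), a minimal similarity score can be sketched as follows; the regex tokenizer and the normalization by the smaller token set are simplifying assumptions:

```python
import re

def lexical_similarity(text1: str, text2: str) -> float:
    """Lexical-matching baseline: score based on the number of lexical
    units that occur in both input segments, normalized by the size of
    the smaller token set."""
    tokens1 = set(re.findall(r"[a-z]+", text1.lower()))
    tokens2 = set(re.findall(r"[a-z]+", text2.lower()))
    if not tokens1 or not tokens2:
        return 0.0
    return len(tokens1 & tokens2) / min(len(tokens1), len(tokens2))
```

Such a score treats "dog" and "canine" as completely dissimilar, which is exactly the limitation that motivates the knowledge-based measures used in this paper.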

For text categorization and retrieval, in the earliest applications of text similarity, Salton & Lesk used the vector model in information retrieval, where the document most relevant to an input query is determined by ranking the documents in a collection in decreasing order of their similarity to the given query [14]. Rocchio gave his widely applied learning algorithms for text categorization [15]. Yang [16] proposed a new approach to automatic categorization and retrieval of natural language texts. He used a training set of texts with expert-assigned categories to construct a network which approximately reflects the conditional probabilities of the categories to which a given text could belong.

Compared with corpus-based methods, our system calculates the semantic similarity of the search result snippets that a query retrieves from a search engine. Meanwhile, we suggest a method for measuring the semantic similarity of texts by exploiting word depth, word distance, and information content in WordNet for classification. Moreover, we describe combined measures of word semantic similarity and show how they can be used to categorize the snippets retrieved from Google.

3. CATEGORIZING AND RANKING SNIPPETS BY SEMANTIC SIMILARITY
In this section, we propose a systematic method that automatically classifies the results Google returns for given keywords into corresponding categories by comparing their semantic similarity. Google results are used as the input to our system. Named Entity Recognition is used to recognize the people and location names, defined in our dictionaries, that appear in the result snippets. Stemming and POS tagging are employed to identify and label the meaningful nouns and verbs.

A WordNet-based semantic similarity algorithm is proposed to calculate the similarity between these extracted words and the categories, which are defined as sub-domains. A combined similarity for each snippet is calculated from the similarities of its constituent words. This combined similarity value is used as a factor for categorizing snippets into the corresponding categories. In each category, the snippets are ordered in descending sequence according to the similarity and other factors. Fig. 1 shows the overall structure of our system.

3.1. Pre-processing of Information Retrieved from the Search Engine
Since what the system processes is a set of words associated with their snippets, the first phase is to retrieve the snippets from the Google results. A simple simulated browser is designed to retrieve the result pages from Google. From each page, the system extracts one snippet in free text by eliminating the HTML tags. A snippet is defined as: Snippet = paragraph(Title) + paragraph(Abstract). Some words, such as locations, people's names, and dates, cannot be recognized by most current dictionaries, including WordNet. Named Entity Recognition (NER) is therefore used to identify certain atomic elements of information in text, including person names, company/organization names, locations, dates and times, percentages, and monetary amounts [17]. We mainly focus on recognizing people and location names in result snippets based on our entity dictionary.

In order to recognize such words correctly, an NE dictionary is set up through a training process, in which many structural patterns focusing on the recognition of locations and people's names were built to learn named entities. The learned dictionary, which contains a total of 6981 named entities in our experiment, is then used to identify named entities in all snippets by keyword searching.

Since stop words are common words that are meaningless for our categorization, we remove them by comparing the words in each snippet with those in a stop list [18]. Stemming and POS tagging are employed to disambiguate the forms and the appropriate meanings of the words in each snippet. The system utilizes TreeTagger [19], a language-independent part-of-speech tagger, for these tasks. The final outcome of this phase is a group of "processed snippets", each consisting of a set of extracted words.

A topic list is also built manually for categorization. We obtain a total of 55 topics that exactly match words existing in WordNet, based on the following criteria: (1) every selected topic can be mapped into WordNet; (2) the meaning of each topic should be at the same level of the taxonomy; (3) the selected topics should be in different domains.
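The pre-processing phase described above can be sketched as follows. The regex-based tag stripper and the tiny stop list are illustrative stand-ins; the actual system uses a published stop list [18] and TreeTagger [19] for POS tagging:

```python
import re

# Illustrative stand-in; the real system compares against a full
# published stop list [18].
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

def strip_html(page: str) -> str:
    """Eliminate HTML tags, keeping only the free text."""
    return re.sub(r"<[^>]+>", " ", page)

def process_snippet(title: str, abstract: str) -> list[str]:
    """Snippet = paragraph(Title) + paragraph(Abstract); the output is a
    'processed snippet': the remaining lowercase words after stop-word
    removal (stemming and POS filtering omitted in this sketch)."""
    text = strip_html(title) + " " + strip_html(abstract)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```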

Fig. 1 A flowchart of categorizing and ranking snippets

3.2. Calculating Semantic Similarity between Snippets and Topics
In order to categorize the snippets into the most relevant topics, we propose an algorithm to calculate the similarity between the snippets and the topics. The similarity between one word, the atomic unit of a snippet, and one topic is calculated first. In contrast to traditional semantic similarity calculation methods that consider only word distance [20], our algorithm has two factors. The first factor is based on the path between concepts in WordNet. The second is based on the information content of concepts, which comes from an entropy measurement.

In traditional semantic similarity calculation methods, the calculation is based entirely on concept distance, which refers to the number of nodes between two words, and the similarity is simply defined as the reciprocal of the distance. Although this can exemplify the similarity of two words, it cannot reflect the trend of how similarity changes with distance. Hence, we utilize a logarithm to reflect this trend. In addition, the depth of the two words on the semantic path is employed, because it reflects the categorizing weight in WordNet: the deeper a word lies, the less it weighs. The corresponding equation for terms T1 and T2 is as follows:

Sim_path(T1, T2) = −(Depth(T1) + Depth(T2)) / (2 × Maxdepth) × (log(1 / Distance(T1, T2)) + 1)   (1)

where Depth(T) is the number of levels of the term counting from the root node "Entity"; Maxdepth is the maximum number of levels in the WordNet taxonomy; and Distance(T1, T2) is the number of nodes between the two terms, including the terms themselves, via their closest common parent node in WordNet.

The method of information content is also introduced, to measure the information density of the two concepts in WordNet. For a concept C, its information content can be calculated as Equation (2), in which P(C) is the probability of encountering an instance of concept C in a large corpus:

IC(C) = −P(C) × log_e P(C)   (2)

The similarity between two words is also affected by the information content of the two concepts: in the WordNet taxonomy, the similarity of two concepts should be smaller when the information content of their least common subsumer (LCS) node is smaller. For example, in Fig. 2 the word "Carnivore" is the LCS; suppose we change the LCS to "Entity". "Entity" has many more instances than "Carnivore", so the information content of "Entity" is smaller, and therefore the similarity between "Dog" and "Cat" would be smaller. The similarity based on information content is defined as Equation (3):

Sim_IC(T1, T2) = 2 × IC(LCS) / (IC(C1) + IC(C2))   (3)

The final similarity between two words in our scenario is a combination of the above two similarity values, which come from different points of view. Based on Sim_path(T1, T2) and Sim_IC(T1, T2), a coefficient α is added for accuracy improvement, and Sim_final(T1, T2) is calculated as Equation (4):

Sim_final(T1, T2) = α × Sim_path(T1, T2) + (1 − α) × Sim_IC(T1, T2)   (4)

After obtaining the similarity value of each word to each topic, we can calculate the similarity between a snippet and a topic, which is the key criterion for automatic categorization. Since one snippet consists of a set of words, this similarity is the average of the word-to-topic similarities. The average similarity of a snippet to a topic is defined as Avg_Sim(Snippet, Topic) in Equation (5), where n is the number of terms in the snippet:

Avg_Sim(Snippet, Topic) = (1/n) × Σ_{i=1..n} Sim_final(Ti, Topic)   (5)

3.3. Balanced Method of Ranking Snippets
One of the original aims of this system is to help users find the most relevant information more easily than they can with a current search engine. Hence, it is necessary to return the categorized snippets ranked by their comprehensive features instead of only by the order from the search engine. We propose a balanced ranking method that considers the similarity combined with the search engine's order and the impact of time.

Google ranks websites by the famous PageRank algorithm [21]; the ranking order of the websites associated with keywords relies on the back links to those websites. Since Google's ranking reflects the importance of the snippets to some extent, we take this factor into our ranking method. The most significant factor, however, is the semantic similarity of the snippet to the current topic. The impact of time is also considered, since most users want to see new information. We define the impact of time as e^(−Time_dif), where Time_dif is the difference between the cached time and the current time, defined as Time_current − Time_cached. This factor reflects the time importance of a snippet according to its cached date: the later a snippet was retrieved by the search engine, the more important it can be. Equation (6) illustrates the combination of these factors:

Order_final = (1 − W) × Order_Google + W / (δ × Avg_Sim(Snippet, Topic) + (1 − δ) × e^(−Time_dif))   (6)

In Equation (6), Order_Google is the original order of the snippet in the Google results, and Avg_Sim(Snippet, Topic) is the average similarity of the snippet to the corresponding topic. W and δ are two weights for balancing the contributions of the different factors to the ranking.

3.4. Implementation
In order to implement our algorithm, two free tools are employed in our system. One is the Prolog version of WordNet 3.0 [2], which serves as the term relationship database. The other is the WordNet API named Javatools [22], developed by Fabian M. Suchanek, which is used to retrieve the relationships of words in WordNet. Based on these tools, we implemented the algorithm and framework of our system in Java. The system retrieves and processes the snippets returned by Google; for each snippet, it calculates the similarity to each topic, takes the largest value, and classifies the snippet into the corresponding category according to our similarity method.
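The similarity computation of Section 3.2 can be sketched as follows. Depths, path distances, and information-content values would normally come from WordNet and corpus counts; here they are passed in as plain numbers, so the arguments in the usage are purely hypothetical:

```python
import math

MAXDEPTH = 16  # assumed maximum depth of the WordNet noun taxonomy

def sim_path(depth1: int, depth2: int, distance: int) -> float:
    """Path-based similarity, Equation (1)."""
    return -(depth1 + depth2) / (2 * MAXDEPTH) * (math.log(1 / distance) + 1)

def sim_ic(ic_lcs: float, ic1: float, ic2: float) -> float:
    """Information-content similarity, Equation (3)."""
    return 2 * ic_lcs / (ic1 + ic2)

def sim_final(path_sim: float, ic_sim: float, alpha: float = 0.5) -> float:
    """Combined word-topic similarity, Equation (4); alpha balances the
    two factors."""
    return alpha * path_sim + (1 - alpha) * ic_sim

def avg_sim(word_topic_sims: list[float]) -> float:
    """Average snippet-topic similarity, Equation (5)."""
    return sum(word_topic_sims) / len(word_topic_sims)
```

A snippet is then assigned to the topic with the largest `avg_sim` value, as described in Section 3.4.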

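The balanced ranking of Section 3.3 can likewise be sketched; the default weights and the time difference measured in days are illustrative assumptions, not values from the paper:

```python
import math

def impact_of_time(time_dif: float) -> float:
    """Impact of time, defined as e^(-Time_dif), where Time_dif is
    Time_current - Time_cached."""
    return math.exp(-time_dif)

def order_final(order_google: int, avg_similarity: float,
                time_dif: float, w: float = 0.5, delta: float = 0.5) -> float:
    """Balanced rank score, Equation (6). A smaller score means a
    higher-ranked snippet, since a larger similarity shrinks the W/(...)
    term."""
    balance = delta * avg_similarity + (1 - delta) * impact_of_time(time_dif)
    return (1 - w) * order_google + w / balance
```

Snippets in a category would then be sorted in ascending order of this score.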

Fig. 2 A user interface of categorization and ranking

Furthermore, we built a user interface using JSP on Tomcat, which lets users search keywords just as with any other search engine. The coefficient W can be set by users for result ranking (the default value is 1). The categorization result is shown in the left panel, including each category name, the number of snippets it contains, and their average similarity. The right panel shows the ranked snippets in descending order according to our balanced ranking method. The semantic similarity of each snippet to the corresponding topic is shown after the snippet content, and its original order in the Google results is displayed so that the performance of our balanced ranking method can be compared. Fig. 2 shows the user interface when categorizing and ranking the results of searching for "Dog".

4. EXPERIMENTS AND EVALUATION
In order to evaluate the performance of our system in categorizing snippets, we designed two comprehensive experiments, covering both manual and automatic evaluation.

In the manual experiment, we formed a three-person group and chose about 50 common words for testing. These words, including "Sound, Jason, location, Chinese, java, Benz, Microsoft, dog, keyboard, Shenzhen, sword, Nike, city, Hong Kong" and so on, were processed by our system one by one. Each member scored a categorizing result as correct if the pre-defined correct category appeared in the top 3 categories of the result. The experiment shows that the average probability of the relevant topic appearing in the first 3 categories reaches 84% (in total, 42 categorizing results were correct).

Furthermore, automatic evaluation was employed to test our system and remove the human factor. We used the standard test data from TREC 03 [3], which has ground-truth categorizations labeled manually by UIUC; the definition of their topic list can be found with the data set [23]. Since our topic list differs somewhat from theirs, we built a mapping table from our 55 topics (categories) to their categories in order to evaluate whether the results classified by our system are right.

In this experiment, we first randomly selected 300 labeled records with queries from the 500 labeled TREC 03 data and divided them into three groups. Each group of 100 queries was processed by our system automatically, and all results were recorded into text files. We define a threshold M, the number of snippets required to accept a final category, and conducted three experiments with M>0, M>1, and M>2 (in our experiments, the accuracy does not fluctuate noticeably for M>3). For the automatic evaluation of the results, we define another threshold N, the number of top positions checked for the correct category. Three values of N (N=1, N=2, N=3) were used for each group: if the labeled category appears in the top N categorization results, the system marks the categorization as "CORRECT". The number of "CORRECT" categorizations in each group is shown in Table 1. The average accuracy of categorization when N=3 is (81.3% + 70.3% + 66.3%)/3, which means the probability of categorizing correctly within the top 3 results is 72.7%.

Table 1. Experiment results of the three groups with different thresholds

            Group1   Group2   Group3   Average
M>0; N=1      73       67       63      67.7%
M>0; N=2      78       76       65      73.0%
M>0; N=3      78       87       79      81.3%
M>1; N=1      65       61       53      59.7%
M>1; N=2      68       68       67      67.7%
M>1; N=3      70       72       69      70.3%
M>2; N=1      58       53       59      56.7%
M>2; N=2      65       64       64      64.3%
M>2; N=3      64       69       66      66.3%

Fig. 3 illustrates the number of categorizations marked as "CORRECT" for the different values of M and N. From this diagram we can see that the average accuracy is high and that, for a fixed M, the accuracy fluctuates most between N=1 and N=2.

Fig. 3 Different groups are generated with different thresholds

5. SUMMARY AND FUTURE WORK
Automatic categorizing and ranking is a feasible way to help users find the most relevant information in a search engine's results with high efficiency. In this paper, we propose an automatic method to categorize and rank snippets based on the calculation of semantic similarity. The method combines the depth of a term in the taxonomy, the distance between nodes, and information content. A semantic similarity ranking method combined with Google's order and the impact of time is also proposed to rank the snippets within each category. Based on the standard labeled TREC03 test data, we have done two different


types of experiments. The results show that 72.7% of the snippets can be correctly classified within the top three categorization results. Though our system works well and achieves satisfying results, there are some shortcomings. First, non-English texts are not supported, for lack of a concept dictionary or ontology in non-English languages. Second, the efficiency of the similarity calculation is not satisfying with the Prolog version of WordNet. In the future, we will try the MySQL version of WordNet to improve efficiency, and improve our algorithms to make the system genuinely helpful to users.

ACKNOWLEDGEMENT
The work described in this paper was fully supported by a grant from City University of Hong Kong (Project No. 7002137) and the China Semantic Grid Research Plan (National Grand Fundamental Research 973 Program, Project No. 2003CB317002).

REFERENCES
[1] Google. http://www.google.com, 2007.
[2] WordNet. http://wordnet.princeton.edu/, 2007.
[3] 500 labeled TREC 03 question set. http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/TREC_10.label, 2007.
[4] UIUC. http://www.cs.uiuc.edu/, 2007.
[5] Mihalcea, R., Corley, C., and Strapparava, C. Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of AAAI'06, 2006.
[6] Lapata, M., and Barzilay, R. Automatic evaluation of text coherence: models and representations. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 2005.
[7] Joachims, T. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, 1997.
[8] Salton, G., and Buckley, C. Term weighting approaches in automatic text retrieval. In Readings in Information Retrieval. Morgan Kaufmann, San Francisco, CA, 1997.
[9] Wu, Z., and Palmer, M. Verb semantics and lexical selection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1994.
[10] Leacock, C., and Chodorow, M. Combining local context and WordNet sense similarity for word sense identification. In WordNet: An Electronic Lexical Database. The MIT Press, 1998.
[11] Turney, P. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning, 2001.
[12] Lin, C., and Hovy, E. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference, 2003.
[13] Landauer, T. K., Foltz, P., and Laham, D. Introduction to latent semantic analysis. Discourse Processes 25, 1998.
[14] Salton, G., and Lesk, M. Computer evaluation of indexing and text processing. Prentice Hall, Englewood Cliffs, New Jersey, 1971.
[15] Rocchio, J. Relevance feedback in information retrieval. Prentice Hall, Englewood Cliffs, New Jersey, 1997.
[16] Yang, Y. Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'94, 1994.
[17] Hao, T.Y., Hu, D.W., Liu, W.Y., and Zeng, Q.T. Semantic patterns for user-interactive question answering. Concurrency and Computation: Practice and Experience, vol. 20, 1-17, 2007.
[18] Stop word list. http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words, 2007.
[19] TreeTagger. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/, 2007.
[20] Semantic similarity based on WordNet. http://www.codeproject.com/KB/string/semanticsimilaritywordnet.aspx, 2007.
[21] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: bringing order to the Web. Stanford Digital Libraries Working Paper, 1998.
[22] Javatools. http://www.mpi-inf.mpg.de/~suchanek/downloads/javatools/, 2007.
[23] Definition of topic list. http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/definition.html, 2007.
