The current issue and full text archive of this journal is available at www.emeraldinsight.com/1468-4527.htm

Building a web-snippet clustering system based on a mixed clustering method Lin-Chih Chen Department of Information Management, National Dong Hwa University, Hualien, Taiwan

Refereed article received 1 July 2010. Approved for publication 1 December 2010.

Abstract
Purpose – Web-snippet clustering has recently attracted a lot of attention as a means of providing users with a succinct overview of relevant results compared with traditional search results. This paper seeks to research the building of a web-snippet clustering system based on a mixed clustering method.
Design/methodology/approach – This paper proposes a mixed clustering method to organise all returned snippets into a hierarchical tree. The method accomplishes two main tasks: one is to construct the cluster labels and the other is to build a hierarchical tree.
Findings – Five measures were used to measure the quality of clustering results. Based on the results of the experiments, it was concluded that the performance of the system is better than that of current commercial and academic systems.
Originality/value – A high-performance system is presented, based on the mixed clustering method. A divisive hierarchical clustering algorithm is also developed to organise all returned snippets into a hierarchical tree.
Keywords Web-snippet clustering, Precision, Recall, F-measure, Normalised Google distance, Subtopic reach time, Search results, Search engines, Information searches
Paper type Research paper

Introduction
Search engines provide an interface to accept queries and use indexing techniques to generate a list of URLs to the webpages containing the search query. The goal of search engines is to help users meet their information needs with minimal effort. What makes this goal challenging is that most users tend to input very short queries. According to the literature (Spink et al., 2001) the average length of a search query is 2.3 words. With such short queries it is a difficult task to assess users' search needs, especially for ambiguous queries. The next generation of search engines will address this problem by focusing on users' search needs rather than the search queries themselves, and by offering various post-search tools to help users deal with large sets of somewhat imprecise results. Such tools include query suggestions or refinements (e.g. Google AdWords and Yahoo Search Marketing), mapping of search results against a predetermined taxonomy (e.g. Open Directory Project and Yahoo Directory), and web-snippet clustering (e.g. Clusty and Lingo3G). All these tools are based in full or in part on the analysis of the result set.

The author would like to thank the anonymous reviewers of the paper for their constructive comments, which helped him to improve the paper in several ways. This work was supported, in part, by the National Science Council, Taiwan under Grant NSC 99-2221-E-259-023.

Online Information Review, Vol. 35 No. 4, 2011, pp. 611-635. © Emerald Group Publishing Limited 1468-4527.


Web-snippet clustering was introduced in a primitive form by Northernlight and was then made widely popular by Clusty. This post-search tool clusters the search results returned by a metasearch engine into a hierarchical tree that is labelled with variable-length sentences. The labels should capture the topics of the search results contained in their associated clusters. The hierarchical tree provides a complementary view of the search results, and users can customise this view by simply navigating the tree. This navigational approach is especially useful for informative, polysemous, and poor queries (Broder, 2002). There are three main challenges with this post-search tool:
(1) generating good descriptive labels for clusters;
(2) clustering the search results into a hierarchical tree; and
(3) performing the clustering on-the-fly.
Traditional data mining approaches are not concerned with the quality of labels, but they are often very good at grouping documents (do Prado and Ferneda, 2007). Unfortunately, regardless of how good the document grouping is, users are not likely to use a clustering system if the labels are poor. Moreover, the search results are presented in a hierarchical tree that can help users evaluate their search needs. Finally, processing time is also a major issue for this post-search tool because users expect fast response times.
In this paper we adopt a mixed clustering method to implement a high-performance web-snippet clustering system called WSC. The output of our system consists of both the regular search results and the clustering results. For the regular search results we use a well-known information retrieval metric – mean reciprocal rank – to rearrange the search results of major search engines, such as Google, Yahoo! and Bing, into our regular search results.
For the clustering results we first adopt a two-round label construction technique, which involves a suffix tree clustering method and a two-pass hash mechanism, to generate all meaningful labels; then we develop a divisive hierarchical clustering algorithm to organise the labels into a hierarchical tree.
The main contributions of this paper are twofold. First, we present a detailed system designed to achieve superior performance over current commercial and academic systems. Our preliminary system is available at http://cayley.sytes.net/wsc (see Figure 1), and offers a web interface similar to Clusty, which is the most well-known clustering system. Second, the main advantages of our divisive hierarchical clustering algorithm are:
• the labels are organised into a hierarchical tree rather than a flat list;
• a child cluster can be assigned to multiple parent clusters; and
• most operations of our divisive hierarchical clustering algorithm require only bit-wise operations, so the hierarchical tree is built relatively quickly.
The rest of this paper is organised as follows. The following section discusses related work on current commercial and academic systems. The next section describes the details of our WSC system. The subsequent section discusses the results of the experimental analysis, and the final section comprises conclusions and some directions for further research.


Figure 1. The clustering results of Clusty and WSC in response to the query “mobile phone”

Previous related work
In this section we review state-of-the-art solutions in current commercial and academic systems. The main strength of the commercial systems is that they are constantly maintained by specialists; thus most of these systems still exist. Conversely, the main weakness of the academic systems is that they are not continually maintained by specialists; thus the majority of such systems do not survive. The main strength of the academic systems is that they use a public (shared) algorithm to cluster the collected snippets, so researchers can easily build on or improve the systems. On the contrary, the main weakness of the commercial systems is that they use a private (not shared) algorithm to cluster the collected snippets, so researchers have no way to build on or improve them.

Commercial systems
Web-snippet clustering has been widely and successfully employed in many commercial metasearch engines, such as Clusty, iBoogie, Lingo3G, WebClust, Grokker and Kartoo, but not much is known about the methods they employ. Table I lists several commercial metasearch engines that provide web-snippet clustering. According to the form of their clustering results, these engines are broadly divided into textual and graphical categories. In the textual category the clustering results are shown in a hierarchical tree, e.g. Clusty, iBoogie, Lingo3G, and WebClust. As the metaphor of a hierarchical tree is used for storing and retrieving files, menu items, bookmarks, and so on, most users are


familiar with it and hence no training is required. Furthermore, the hierarchical tree is a very efficient way to identify a user's search needs. However, a textual indented representation is probably neither the most compact nor the most visually pleasing tree layout. In a graphical representation the relationships in distance and kind between clusters can be rendered by rich spatial properties such as dimension, colour, shape, and adjacency. All returned snippets are clustered into a series of interactive maps by a proprietary graph algorithm. Three well-known systems, Grokker, Kartoo, and Lingo3G, have appeared on the internet; however, only Lingo3G is currently available online.

Academic systems
The academic literature offers various methods to solve the problem of web-snippet clustering. In the simplest type the labels are shown as a simple bag of words and are organised into a flat list. In a more sophisticated type the labels are shown as variable-length sentences and are organised into a hierarchical tree. According to the form and structure of the labels, the methods can be classified into four types:
(1) single words and flat;
(2) sentences and flat;
(3) single words and hierarchical; and
(4) sentences and hierarchical.
Table II lists several studies that deal with the problem of web-snippet clustering. Retriever (Joshi and Jiang, 2002) uses a fuzzy clustering algorithm to organise the search results into a few clusters. In this algorithm documents can belong to more than one cluster, and a set of membership levels is associated with each document. These indicate the strength of the association between that document and a particular cluster. Fuzzy clustering is the process of assigning these membership levels and then using them to assign documents to one or more clusters. WebCat (Giannotti et al., 2003) applies a K-means clustering algorithm to organise the clusters into a flat list.
The K-means clustering algorithm partitions a set of documents into multiple non-overlapping clusters in which each document belongs to only the closest cluster. Its computational complexity is linear in the number of documents. EigenCluster (Cheng et al., 2006) applies a divide and merge algorithm, which involves a spectral clustering (efficient with sparse term-document matrices) and a dynamic programming

Table I. List of commercial systems

Output form | Metasearch engine | URL
Textual | Clusty | http://clusty.com
Textual | iBoogie | www.iboogie.com/
Textual | Lingo3G | http://search.carrot-search.com/
Textual | WebClust | www.webclust.com/
Graphical | Grokker (dead) | www.grokker.com/
Graphical | Kartoo (dead) | www.kartoo.com
Graphical | Lingo3G | http://search.carrot-search.com

(for merging nodes of the tree resulting from the divide stage), to cluster the snippets. Even though EigenCluster is a relatively new system, it uses very simple cluster labels. Grouper (Zamir and Etzioni, 1999) was one of the early systems belonging to the second type. Although it uses sentences as the names of labels, such sentences are drawn as contiguous portions of snippets by a suffix tree clustering algorithm. The suffix tree clustering algorithm consists of three steps:
(1) Document cleaning.
(2) Identifying base clusters.
(3) Combining base clusters into clusters.

Table II. List of academic systems

Type | System name and source | Core method(s) used in the system architecture | Online | URL
Single words and flat | Retriever (Joshi and Jiang, 2002) | Fuzzy | No | –
Single words and flat | WebCat (Giannotti et al., 2003) | K-means | Yes | http://ercolino.isti.cnr.it/webcat (dead)
Single words and flat | EigenCluster (Cheng et al., 2006) | Divide and merge | Yes | http://arc2.cc.gatech.edu/ (dead)
Sentences and flat | Grouper (Zamir and Etzioni, 1999) | Suffix tree clustering | No | –
Sentences and flat | Carrot2-STC (Weiss and Stefanowski, 2003) | Suffix tree clustering | Yes | http://search.carrot2.org/stable/search?algorithm=stc
Sentences and flat | Carrot2-Lingo (Osinski and Weiss, 2005) | Singular value decomposition | Yes | http://search.carrot2.org/stable/search?algorithm=lingo
Single words and hierarchical | FIHC (Fung et al., 2003) | Frequent itemset-based hierarchical clustering | Yes | www.cs.sfu.edu.ca/~ddm/dmsoft/Clustering/fihc_index.html (dead)
Single words and hierarchical | CREDO (Carpineto and Romano, 2004) | Concept lattice | Yes | http://credo.fub.it/
Sentences and hierarchical | Highlight (Wu et al., 2003) | Concept terms and probability of co-occurrence analysis | Yes | http://highlight.njit.edu/
Sentences and hierarchical | WhatsOnTheWeb (Giacomo et al., 2007) | Topology-driven | Yes | http://whatsonweb.diei.unipg.it
Sentences and hierarchical | SnakeT (Ferragina and Gulli, 2008) | Gapped sentences converge and knowledge bases | Yes | http://snaket.di.unipi.it/
Sentences and hierarchical | WSC (2010) | Two-round label construction and divisive hierarchical clustering | Yes | http://cayley.sytes.net/wsc/


Carrot2-STC (Weiss and Stefanowski, 2003) was an open source implementation of Grouper. Carrot2-Lingo (Osinski and Weiss, 2005) uses a Singular Value Decomposition (SVD) technique on the term-document matrix to find multiple-word labels. It starts by identifying key phrases and represents them in the same vector space as the documents. The vectors are then transformed using SVD, and clusters are identified using the notion of document similarity from the vector space model, labelling the clusters with the terms closest to the centre of the document vectors in the cluster. This approach does not scale very well because the SVD technique is very time consuming, especially when high-dimensional data sets are considered.
FIHC (Fung et al., 2003) applies a frequent itemset-based hierarchical clustering approach to construct the labels and organise them into a hierarchical tree. A frequent itemset is a set of words that occur together in some minimum fraction of snippets in a cluster. FIHC assumes that there are some frequent itemsets for each cluster in the snippet set, but that different clusters share few frequent itemsets.
CREDO (Carpineto and Romano, 2004) uses formal concept analysis on single words to build a lattice of clusters that is later presented to users as a navigation tree. It works in two phases. In the first phase only the titles of the input snippets are taken into account, to generate the most general labels. In the second phase the concept lattice-based method is recursively applied to lower levels using a broader input of both titles and snippets.
Highlight (Wu et al., 2003) first uses concept terms, a series of noun phrases that appear in snippets, to generate the labels. Then it analyses the relationships between higher- and lower-level terms by a probability of co-occurrence analysis technique (Wu et al., 2002) to organise the labels into a hierarchical tree.
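The SVD step that Carrot2-Lingo applies to the term-document matrix can be illustrated with a tiny sketch. The matrix below is a made-up toy example, and this shows only the decomposition itself (via NumPy), not Carrot2's actual phrase-matching code.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents);
# the values are illustrative, not drawn from any real snippet set.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

# SVD factors the matrix into U * diag(s) * Vt, with singular values
# sorted in descending order; the leading columns of U approximate the
# dominant "topics" against which candidate labels can be matched.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
```

Because the singular values come back sorted, truncating to the top few columns of U gives the low-rank topic basis that this family of methods relies on.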
WhatsOnTheWeb (Giacomo et al., 2007) uses a topology-driven approach to present the clustering results on a snippet graph whose vertices are the sentences in snippets, and whose edges denote the relationships between sentences. The snippet graph of clusters is computed in two steps. First, a snippet graph is constructed using the number and the relevance of sentences shared between the search results. Second, the clusters and their relationships are derived from the snippet graph by finding the vertices that are most strongly connected. SnakeT (Ferragina and Gulli, 2008) first extracts the gapped sentences (related but not necessarily contiguous words) from snippets as labels. It then adopts two special knowledge bases (the dmoz hierarchy and "anchor texts") to rank the labels. Finally, it builds a parent cluster if two child clusters converge on a substring among their labels.
In this paper we propose a web-snippet clustering system belonging to the fourth type. We first adopt a two-round label construction technique, which involves a suffix tree clustering method and a two-pass hash mechanism, to generate all meaningful labels; then we develop a divisive hierarchical clustering algorithm to organise the labels into a hierarchical tree. In the next section we describe the details of our proposed system.

The anatomy of WSC
In this section we describe the architecture of our proposed system, as shown in Figure 2. Our system involves the following four main procedures: metasearch ranking, label construction for the first round, label construction for the second round, and building a hierarchical tree structure. In the metasearch ranking procedure we use a metasearch technique to integrate different search results from different search engines into the regular search results of our system. The regular search results are not only the output of


Figure 2. The system architecture of our proposed system

this procedure but also the input of the next procedure. In the next procedure, label construction for the first round, we first apply a series of snippet cleaning techniques, such as Porter stemming, stop words, sentence boundaries, and non-word tokens, to convert the snippets (output by the metasearch ranking procedure) into a series of meaningful sentence tokens. Then we construct a suffix tree and trace it to generate the base clusters for the first round. To construct labels for the second round we adopt a two-pass hash mechanism to generate the base clusters for the second round. To build a hierarchical tree structure we develop a divisive hierarchical algorithm to organise the base clusters for the second round into a hierarchical tree. Brief descriptions of the above-mentioned four procedures are given in the following four subsections. In this paper we provide a series of examples to illustrate how to cluster the snippets in order to help readers easily understand our mixed clustering method.


Metasearch ranking
In this procedure we first develop a web crawler, a search program that sends the search query to several search engines simultaneously, to fetch many relevant webpages from several search engines. Currently we select three search engines (Google, Yahoo!, and Bing) as the sources for this procedure. We then utilise the Perl Compatible Regular Expressions (PCRE) library (Hazel, 2009), a powerful program library, to parse the webpages, i.e. to identify the title, URL, and fragments. At the end of these two tasks we have the input of our system. The following list is an example of a search listing returned from Google in response to the search query "mobile phone":
• Title: Mobile phone – Wikipedia, the free encyclopedia;
• URL: http://en.wikipedia.org/wiki/Mobile_phone; and
• Fragment: A mobile phone or mobile (also called cellphone and handphone, as well as cell phone, wireless phone, cellular phone, cell, cellular telephone, . . . History – Handsets – Related systems – Other Uses.
Then we use a well-known information retrieval metric, called Mean Reciprocal Rank (MRR) (Baeza-Yates and Ribeiro-Neto, 1999), to integrate the different search results returned from different search engines into the regular search results of our system, as shown in the following equation:

MRR = \frac{1}{|e|} \sum_{i=1}^{|e|} \frac{1}{rank_i}    (1)

where |e| is the number of search engines used in our system and rank_i is the ranking order of a search listing returned from search engine i. The reciprocal rank is the reciprocal of the rank rank_i at which the relevant search listing appears, and MRR is the average of the reciprocal ranks over the set of search engines. According to this equation, a search listing has a larger weight if it either wins more votes from different search engines or is ranked high in the listings of at least one search engine.
Here is an example to illustrate the details of this procedure. Table III shows the top two search listings from our system, and the rank distribution of these two search listings among three different search engines. Based on Equation 1 the MRR values for these two search listings are 0.66667 and 0.44444, respectively. Figure 3 shows a snapshot of the regular search results in response to the search query "mobile phone".

Label construction for the first round
The first step of our method is to extract all meaningful base clusters. We are interested in base clusters that are long and intelligible sentences rather than single

Table III. Top two search listings for our system

Snippet ID | URL | Rank distribution | MRR
S1 | http://en.wikipedia.org/wiki/Mobile_phone | Google (1), Yahoo (Null), Bing (1) | 0.66667
S2 | http://tmobile.com | Google (Null), Yahoo (1), Bing (3) | 0.44444
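The MRR values in Table III can be checked in a few lines. This is an illustrative sketch of Equation (1), not the authors' code; `mrr` is a hypothetical helper, and a rank of `None` stands for the "Null" entries in the table.

```python
def mrr(ranks, num_engines=3):
    """Equation (1): average of the reciprocal ranks over |e| engines.
    ranks holds the listing's rank in each engine, None if absent."""
    return sum(1.0 / r for r in ranks if r is not None) / num_engines

s1 = mrr([1, None, 1])   # S1: Google (1), Yahoo (Null), Bing (1)
s2 = mrr([None, 1, 3])   # S2: Google (Null), Yahoo (1), Bing (3)
```

Rounded to five decimal places these give 0.66667 and 0.44444, matching the table.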


Figure 3. The snapshot of the regular search results in response to the query “mobile phone”

words. Sentences are more useful for identifying the snippet topics. In this procedure we use the following three steps to extract the base clusters for the first round:
(1) Cleaning snippets.
(2) Constructing a suffix tree.
(3) Tracing the suffix tree.
In the first step, snippet cleaning, the string of text representing each snippet is transformed using the Porter stemming algorithm (Porter and Boulton, 2007), a suffix-stripping approach to stemming, to reduce inflected words to their stems. We also use the 421 stop words suggested in the literature (Fox, 1989) to prevent our system from having to deal with unnecessary words. Sentence boundaries, stop words, and non-word tokens (such as numbers, HTML tags, and most punctuation characters) are marked. At the end of this step all cleaned snippets form a series of meaningful sentence tokens. A meaningful sentence token is one in which no stop words or non-word tokens appear.
In the second step, in order to construct a suffix tree, we use the suffix tree data structure (Gusfield, 1997) for a set of strings and Ukkonen's linear-time suffix tree construction algorithm (Ukkonen, 1995) to generate a suffix tree for all meaningful sentence tokens. The suffix tree of a collection of meaningful sentence tokens is a compact tree containing all the suffixes of all the strings in the collection. The suffix tree for a set of strings "X Y Z" is a tree whose edges are labelled with strings such that each suffix of "X Y Z" corresponds to exactly one path from the tree's root to a leaf. Each node of the suffix tree represents a group of snippets and the label of each node represents the common phrase shared by those snippets. For instance, given the three snippets "mouse eat cheese", "cat eat mouse", and "cat eat cheese too", the snippet cleaning step generates the three meaningful sentence tokens "mous eat chees", "cat eat mous", and "cat eat chees". Next, for each

OIR 35,4

620

meaningful sentence token, we obtain all suffix lists as shown in Table IV. For example, the suffix lists for "mous eat chees" are "mous eat chees", "eat chees", and "chees", respectively.

Table IV. The suffix lists for three meaningful sentence tokens

Snippet ID | Meaningful sentence token | Suffix list
S1 | mous eat chees | mous eat chees; eat chees; chees
S2 | cat eat mous | cat eat mous; eat mous; mous
S3 | cat eat chees | cat eat chees; eat chees; chees

Then we adopt Ukkonen's online suffix tree construction algorithm (Ukkonen, 1995) for building the suffix tree because it can be easily applied to multiple strings (Goto et al., 2007). Figure 4 shows the suffix tree for the three meaningful sentence tokens.

Figure 4. The suffix tree for three meaningful sentence tokens

To generate the base clusters for the first round we use the third step to trace the suffix tree. In this step we first calculate, for each edge, the intersection ratio with respect to the total number of snippets in the child node and the parent node. For example, the intersection ratio of the edge "mous" is 0.66 (2/3), where the total number of snippets in the child node is 2 (S1, S2) and the total number of snippets in the parent node is 3 (S1, S2, S3). Figure 5 shows the suffix tree with the intersection ratios. We then define that the concatenation of strings along the path from the root node to a child node v is a base cluster for the first round if and only if the total number of snippets in v is larger than 1 and the intersection ratios for all edges on this path are all larger than a threshold value. Later we will determine the best threshold for this


Figure 5. The suffix tree with the intersection ratio

procedure. In this example we assume that the threshold value is equal to 0.6; thus the base clusters for the first round are “mous”, “cat eat”, “eat chees”, and “chees”.
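The first-round steps above can be sketched in a few lines. This is an illustrative simplification, assuming the paper's toy snippets: it enumerates plain suffix lists (rather than building Ukkonen's linear-time suffix tree) and checks the intersection-ratio rule for the single edge worked through in the text.

```python
def suffix_lists(sentence):
    """All word-level suffixes of a cleaned sentence token."""
    words = sentence.split()
    return [" ".join(words[i:]) for i in range(len(words))]

# The three cleaned snippet tokens from the example.
tokens = ["mous eat chees", "cat eat mous", "cat eat chees"]

# The edge "mous" joins a parent node covering {S1, S2, S3} to a child
# node covering {S1, S2}; its intersection ratio is therefore 2/3.
parent, child = {"S1", "S2", "S3"}, {"S1", "S2"}
ratio = len(child & parent) / len(parent)

# The path qualifies as a base cluster only if the child holds more than
# one snippet and every edge ratio exceeds the threshold (0.6 here).
keeps_edge = len(child) > 1 and ratio > 0.6
```

With the 0.6 threshold this edge survives, consistent with "mous" appearing among the first-round base clusters.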

Label construction for the second round
In this procedure we use a two-pass hash mechanism to generate the base clusters for the second round. In the first-pass hash we combine the snippets of different base clusters into one base cluster if the base clusters contain the same words but in different permutations. That is, if we have two base clusters, "X Y Z" and "Y X Z", then we combine all snippets of these two base clusters into "X Y Z". For example, assume that the clusters listed in Table V are the final base clusters for the first round. In the first-pass hash we first sort the words of each snippet description in alphabetical order to form the entry key of a hash table. We then combine different snippet lists into one snippet list if the entry keys of different base clusters are the same. Table VI shows the hash table with the entry keys. Since the base clusters C3 ("eat cottag chees") and C6 ("cottag chees eat") have the same entry key ("chees cottag eat") we combine the

Table V. An example of the final base clusters for the first round

Base cluster | Snippet description | Snippet list
C1 | mous | [S1, S2]
C2 | cat eat | [S2, S3]
C3 | eat cottag chees | [S1, S4, S6]
C4 | eat chees | [S1, S3]
C5 | chees | [S1, S3]
C6 | cottag chees eat | [S2, S5]


snippet lists of C3 and C6 into C3, and the snippet list of C3 becomes [S1, S2, S4, S5, S6]. Table VII shows the final base clusters after the first-pass hash.
In the second-pass hash we use the snippet list as the entry key of another hash table. A cluster is removed if its entry key is the same as that of another cluster and its snippet description is a subset of the other's. We keep the cluster with the longer snippet description because a longer snippet description is richer than a shorter one. Table VIII shows the second hash table with the entry keys. Since the base clusters C4 ("eat chees") and C5 ("chees") have the same entry key ("[S1, S3]") we remove C5 because it has the shorter snippet description. Table IX shows the final base clusters after the second-pass hash.

Table VI. Hash table with the entry key for the first-pass hash

Entry key | Base cluster | Snippet description | Snippet list
mous | C1 | mous | [S1, S2]
cat eat | C2 | cat eat | [S2, S3]
chees cottag eat | C3, C6 | eat cottag chees, cottag chees eat | [S1, S4, S6], [S2, S5]
chees eat | C4 | eat chees | [S1, S3]
chees | C5 | chees | [S1, S3]

Table VII. The final base clusters for the first-pass hash

Base cluster | Snippet description | Snippet list
C1 | mous | [S1, S2]
C2 | cat eat | [S2, S3]
C3 | eat cottag chees | [S1, S2, S4, S5, S6]
C4 | eat chees | [S1, S3]
C5 | chees | [S1, S3]

Table VIII. Hash table with the entry key for the second-pass hash

Entry key | Base cluster | Snippet description | Snippet list
[S1, S2] | C1 | mous | [S1, S2]
[S2, S3] | C2 | cat eat | [S2, S3]
[S1, S2, S4, S5, S6] | C3 | eat cottag chees | [S1, S2, S4, S5, S6]
[S1, S3] | C4, C5 | eat chees, chees | [S1, S3]

Table IX. The final base clusters for the second round

Base cluster | Snippet description | Snippet list
C1 | mous | [S1, S2]
C2 | cat eat | [S2, S3]
C3 | eat cottag chees | [S1, S2, S4, S5, S6]
C4 | eat chees | [S1, S3]
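The two-pass hash can be sketched directly from the worked example (Tables V-IX). This is a hedged illustration of the mechanism as described, not the authors' implementation; the cluster data is the paper's toy example.

```python
# First-round base clusters: id -> (snippet description, snippet list).
clusters = {
    "C1": ("mous", {"S1", "S2"}),
    "C2": ("cat eat", {"S2", "S3"}),
    "C3": ("eat cottag chees", {"S1", "S4", "S6"}),
    "C4": ("eat chees", {"S1", "S3"}),
    "C5": ("chees", {"S1", "S3"}),
    "C6": ("cottag chees eat", {"S2", "S5"}),
}

# First pass: key on the alphabetically sorted words, so permutations of
# the same words (C3 and C6) merge their snippet lists.
by_words = {}
for cid, (desc, snips) in clusters.items():
    key = " ".join(sorted(desc.split()))
    if key in by_words:
        by_words[key] = (by_words[key][0], by_words[key][1] | snips)
    else:
        by_words[key] = (desc, set(snips))

# Second pass: key on the snippet list; when two clusters share a list
# (C4 and C5), keep the one with the longer (richer) description.
by_snips = {}
for desc, snips in by_words.values():
    key = frozenset(snips)
    if key not in by_snips or len(desc) > len(by_snips[key]):
        by_snips[key] = desc
```

Running this leaves four clusters, matching Table IX: "mous", "cat eat", "eat cottag chees" (now covering [S1, S2, S4, S5, S6]), and "eat chees".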

Building a hierarchical tree structure
In this procedure we develop a divisive hierarchical clustering algorithm to organise the base clusters for the second round into a hierarchical tree. Our divisive hierarchical clustering algorithm involves three steps:
(1) Sorting the clusters.
(2) Transforming them into binary codes.
(3) Finding the child clusters.


The main task of the first step is sorting all base clusters in descending order by the number of snippets they contain. For example in Table X we assume that the base cluster BC1 has the following snippet list [S1, S3, S5, S9, S10, S12, S15] and the number of snippets for BC1 is seven. Other base clusters are also shown in Table X. In this step we sort all base clusters in descending order by the number of snippets they contain as shown in Table XI. The main task of the second step is to transform the snippet list of the base cluster into the binary code. We define that a bit i in the binary code should be encoded as 1 if the base cluster contains the snippet Si. For example the cluster BC5 (see Table XII)

Table X. An example of the base clusters for our divisive hierarchical clustering

Base cluster | Snippet list | No. of snippets
BC1 | [S1, S3, S5, S9, S10, S12, S15] | 7
BC2 | [S1, S2] | 2
BC3 | [S1, S3, S5] | 3
BC4 | [S1, S5, S9, S15] | 4
BC5 | [S1, S3] | 2
BC6 | [S1, S2, S3, S5, S9, S15] | 6

Table XI. All sorted base clusters based on the number of snippets

Base cluster | Snippet list | No. of snippets
BC1 | [S1, S3, S5, S9, S10, S12, S15] | 7
BC6 | [S1, S2, S3, S5, S9, S15] | 6
BC4 | [S1, S5, S9, S15] | 4
BC3 | [S1, S3, S5] | 3
BC2 | [S1, S2] | 2
BC5 | [S1, S3] | 2

Table XII. The results of binary code for all sorted base clusters

Base cluster | Snippet list | Binary code
BC1 | [S1, S3, S5, S9, S10, S12, S15] | 101010001101001
BC6 | [S1, S2, S3, S5, S9, S15] | 111010001000001
BC4 | [S1, S5, S9, S15] | 100010001000001
BC3 | [S1, S3, S5] | 101010000000000
BC2 | [S1, S2] | 110000000000000
BC5 | [S1, S3] | 101000000000000


Figure 6. All iterations of the third step

contains the snippets S1 and S3; that is, the first (the most significant bit) and third bits are encoded as 1. Thus, the binary code for BC5 is 101000000000000. Table XII also shows the results of binary code for other sorted base clusters. The core step of our divisive hierarchical clustering algorithm is the third step: finding all child clusters for each cluster. We define that a parent cluster BCj has one child cluster BCk if the snippet list of BCk is a subset of BCj and the snippet list of BCk is not a subset of its sibling clusters. For example in Figure 6 BC1 has two child clusters, BC4 and BC3, because the snippet lists of BC4 ([S1, S5, S9, S15]) and BC3 ([S1, S3, S5]) are both a subset of BC1 ([S1, S3, S5, S9, S10, S12, S15]), and the snippet list of BC3 is not a subset of its sibling cluster BC4. Although the snippet list of BC5 ([S1, S3]) is a subset of BC1, the snippet list of BC5 is a subset of its sibling cluster BC3. Thus BC5 cannot be a child cluster of BC1. The subset operation can be easily done by a bitwise logical AND operation. We say that BCk is a subset of BCj if “(the binary code of BCk) AND (the binary code of BCj)” is equal to the binary code of BCk. For example BC4 is a subset of BC1 because “the binary code of BC4 (100010001000001) AND the binary code of BC1 (101010001101001)” is equal to the binary code of BC4.
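The bitwise subset test and the "other topics" code described in the text can be reproduced directly with the 15-bit values from Table XII. This is a toy check of the stated operations, not the authors' implementation.

```python
# Binary codes from Table XII (most significant bit = S1).
bc1 = 0b101010001101001   # [S1, S3, S5, S9, S10, S12, S15]
bc4 = 0b100010001000001   # [S1, S5, S9, S15]
bc3 = 0b101010000000000   # [S1, S3, S5]
bc5 = 0b101000000000000   # [S1, S3]

def is_subset(child, parent):
    """BCk is a subset of BCj iff (code_k AND code_j) == code_k."""
    return (child & parent) == child

# "Other topics" for parent BC1 with child clusters BC4 and BC3:
# parent XOR (child_1 OR child_2).
other = bc1 ^ (bc4 | bc3)
```

Here `is_subset(bc4, bc1)` holds, `bc5` is a subset of its sibling `bc3` (so BC5 cannot be a child of BC1), and `other` comes out as 000000000101000, i.e. the snippets {S10, S12}, matching the worked example.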

In each iteration shown in Figure 6 the parent cluster holds not only all its child clusters but also a special child cluster ("other topics"). The binary code of "other topics" is computed by the following bitwise operations: "(the binary code of the parent cluster) XOR (the binary code of the kth child cluster OR the binary code of the mth child cluster)", where k and m range over the child clusters. For example, in iteration 1 of Figure 6 the parent cluster BC1 has three child clusters (BC4, BC3, "other topics"). The binary code of "other topics" is computed as "101010001101001 XOR (100010001000001 OR 101010000000000) = 000000000101000". Figure 6 also shows all other iterations of the third step. According to the final results of the third step we then convert the binary codes into snippet lists as shown in Figure 7. Note that the snippet list of "other topics" of BC6 is a null set because its binary code is 000000000000000.
Finally, we use a voting technique to present the actual word rather than the stem word to users. That is, if two actual words "ABCx" and "ABCy" with the same stem word "ABC" appear in one cluster and "ABCx" has won the most votes (snippets) in that cluster, then "ABCx" is presented to users. For example, the two actual words "cottage" and "cottages", with the same stem word "cottag", appear in one cluster and "cottage" has won the most snippets in that cluster; thus "cottage" is presented to users. Our clustering results are shown on the right-hand side of Figure 1.
The main advantages of our divisive hierarchical clustering algorithm are threefold. First, the base clusters are organised into a hierarchical tree rather than a flat list. Second, our algorithm allows not only a snippet that can be assigned to multiple clusters, but also a child cluster that can be assigned to multiple parent clusters.
Third, all operations of the third step are only required to do the bitwise operations; thus a hierarchical tree is built relatively quickly. Experiments analysis In this section we describe two experiments conducted to evaluate the performance of our proposed system. In the first experiment we wanted to determine the best threshold for the second procedure (label construction for the first round). In the second experiment we evaluated the performance of our proposed system against other existing online systems, including commercial systems (Clusty, iBoogie, Lingo3G, and WebClust) and academic systems (Carrot2-STC, Carrot2-Lingo, CREDO, Highlight, WhatsOnTheWeb, and SnakeT). All systems have been described earlier in this paper. First experiment (best threshold for the second procedure) In this experiment we tried to find the best threshold for the second procedure. For the test data set we selected 1,000 random queries which people were searching on the internet. The 1,000 random search queries were listed in Dogpile (2008). In this experiment we used a distance metric, called Normalised Google Distance (NGD) (Cilibrasi and Vita´nyi, 2007), to automatically evaluate the similarity information between terms from the large corpus of data, i.e. the Google search engine. NGD gives computers the ability to quantify the meaning of terms. It is defined as follows: NGDðx; yÞ ¼

NGD(x, y) = [max{log f(x), log f(y)} − log f(x, y)] / [log M − min{log f(x), log f(y)}]    (2)


Figure 7. A hierarchical tree with the snippet list

where M is the total number of webpages indexed by Google; f(x) and f(y) are the numbers of webpages returned for the search terms x and y, respectively; and f(x, y) is the number of webpages on which both x and y occur. NGD is a measure of semantic interrelatedness derived from the number of webpages indexed by the Google search engine for a given set of terms. It is based on the counts and the intersection of the search results associated with each search term. Search terms with the same or similar meanings in a natural language sense tend to be close in units of Google distance, while search terms with dissimilar meanings tend to be further apart. Intuitively NGD is a measure of the symmetric conditional probability of co-occurrence of the terms x and y: given a webpage containing one of the terms x or y, NGD measures the probability of that webpage also containing the other term. It provides a relative measure of how semantically distant two search terms x and y are, which makes it very suitable for comparing different systems. The main problem with the above equation is how to determine the number of webpages crawled by Google, that is, the value of the parameter M. This question is answered on the official Google blog: according to Google's team, the total number of unique webpages indexed by Google has reached one trillion (Alpert and Hajaj, 2008). We then use the following Mean(NGD@K) equation to measure the average distance between x and its first K top-level labels, where K ∈ {3, 5, 7, 10}:

Mean(NGD@K) = (1/K) Σ_{y=1}^{K} NGD(x, y)    (3)
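Equations (2) and (3) can be sketched directly from the page counts. A minimal sketch in Python, assuming base-10 logarithms and M = 10^12 (the one trillion indexed pages reported by Alpert and Hajaj, 2008); the function and variable names are ours:

```python
from math import log10

M = 1e12  # total number of webpages indexed by Google (one trillion)

def ngd(fx: float, fy: float, fxy: float) -> float:
    # Equation (2): Normalised Google Distance between terms x and y,
    # given the page counts f(x), f(y), and f(x, y)
    num = max(log10(fx), log10(fy)) - log10(fxy)
    den = log10(M) - min(log10(fx), log10(fy))
    return num / den

def mean_ngd(fx: float, pairs: list) -> float:
    # Equation (3): Mean(NGD@K), where pairs holds one (f(y), f(x, y))
    # tuple for each of the first K top-level labels
    return sum(ngd(fx, fy, fxy) for fy, fxy in pairs) / len(pairs)

# First row of Table XIII: x = "mobile phone", y = "Nokia"
ngd(324_000_000, 270_000_000, 52_300_000)  # ≈ 0.22195
```

Feeding the counts from Table XIII into this sketch reproduces the NGD column of that table.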

For example, when x is "mobile phone", the first ten top-level labels returned from our system and their corresponding NGD values are listed in Table XIII. By using Equation (3) the average Google distance between "mobile phone" and its first ten top-level labels is 0.18083. Interested readers can calculate the score of Mean(NGD@K) in our simulation system (http://cayley.sytes.net/experiment_ngd/).

Table XIII. How to calculate Mean(NGD@10) when x is "mobile phone" (f(x) = 324,000,000)

Label y          f(y)            f(x, y)       NGD(x, y)
Nokia            270,000,000     52,300,000    0.22195
Compare          323,000,000     107,000,000   0.13784
Shop             1,100,000,000   205,000,000   0.20910
Sony Ericsson    137,000,000     112,000,000   0.11941
Motorola         119,000,000     47,800,000    0.21178
Prices           460,000,000     291,000,000   0.05699
LG               278,000,000     36,600,000    0.26633
Cellular phone   171,000,000     201,000,000   0.05504
News             2,310,000,000   231,000,000   0.28658
Virgin mobile    42,500,000      28,000,000    0.24325
Mean(NGD@10)                                   0.18083

Figure 8 and Figure 9 show the results of Mean(NGD@K) for the first experiment where K ∈ {3, 5, 7, 10}. The dot in each subfigure is the average value over 50 queries. For comparison purposes we then average all Mean(NGD@3) values for each threshold, as shown in Figure 10; the average Mean(NGD@3) value for T = 0.1 is 0.474. In Figure 10 we extend this analysis to Mean(NGD@5), Mean(NGD@7), and Mean(NGD@10).

Figure 8. Mean(NGD@K) for the first experiment where K ∈ {3, 5, 7, 10}

Figure 9. Mean(NGD@K) for the first experiment where K ∈ {3, 5, 7, 10}

Figure 10. Average Mean(NGD@K) for the first experiment

According to the results shown in Figure 10 we find that NGD increases when the threshold value is larger than 0.6 (T > 0.6). This is because a large threshold results in clusters with shorter snippet descriptions, and users may have difficulty fully understanding such short clusters. NGD also increases when the threshold value is less than 0.6 (T < 0.6): although a small threshold results in clusters with longer snippet descriptions, it also produces a higher number of clusters, so that users easily lose focus on the most important clusters. Thus in this paper we experimentally observed that the best threshold is 0.6.

Second experiment (comparison of different systems)
In this experiment we applied our analysis to two sets of test data. The first set is based on the 1,000 random search queries described above. For the first test set we also use Mean(NGD@10) as our distance measure. The second column of Table XIV shows the results of Mean(NGD@10) for the different systems. The average Mean(NGD@10) value

Table XIV. Average results for the different systems

System i         Mean(NGD@10)   Precision_i   Recall_i   F-measure_i   SRT_i
Clusty           0.229          0.773         0.145      0.244         13.57
iBoogie          0.251          0.756         0.137      0.232         14.38
Lingo3G          0.271          0.730         0.133      0.225         13.84
WebClust         0.273          0.684         0.126      0.213         15.72
Carrot2-STC      0.345          0.491         0.114      0.185         16.74
Carrot2-Lingo    0.328          0.497         0.117      0.189         15.53
CREDO            0.366          0.451         0.108      0.174         17.54
Highlight        0.496          0.131         0.075      0.095         25.34
WhatsOnTheWeb    0.404          0.409         0.096      0.156         20.61
SnakeT           0.304          0.548         0.121      0.198         16.84
WSC              0.191          0.814         0.153      0.258         12.58

for Clusty is 0.229. According to the results shown in this table our system has the lowest NGD score, which means that the labels generated by our system have the highest semantic interrelatedness.

In the second set we selected the 18 most searched queries in 2009 on Google (2010) and Yahoo (2009), which belong to many different topics: ("american idol", "britney spears", "facebook", "farrah fawcett", "glee", "hi5", "hulu", "kim kardashian", "lady gaga", "megan fox", "michael jackson", "naruto", "nascar", "natasha richardson", "paranormal activity", "runescape", "twitter", "wwe"). We asked 20 undergraduate students and seven graduate students from our department to participate. We used the 18 most searched queries as our study sample since lazy users do not like browsing a lot of information. For the second test set we used three well-known measures from information retrieval to evaluate the performance of the different systems: precision, recall, and F-measure. Precision, recall, and F-measure for system i at the first ten top-level labels are defined in Equations (4), (5), and (6), respectively (Baeza-Yates and Ribeiro-Neto, 1999; Chen, 2010; Wan, 2009):

Precision_i = |M_i| / 10    (4)

Recall_i = |M_i| / |
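Assuming Equation (6) is the standard balanced F-measure (the harmonic mean of precision and recall), the F-measure column of Table XIV can be reproduced from its precision and recall columns; a quick sketch:

```python
def f_measure(precision: float, recall: float) -> float:
    # Balanced F-measure: harmonic mean of precision and recall
    # (assumed form of Equation (6))
    return 2 * precision * recall / (precision + recall)

# Clusty row of Table XIV: P = 0.773, R = 0.145
f_measure(0.773, 0.145)  # ≈ 0.244
```

The same computation applied to the WSC row (P = 0.814, R = 0.153) yields 0.258, matching the table.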
