Clustering of web search results based on the cuckoo search algorithm and Balanced Bayesian Information Criterion

Carlos Cobos a,b,*, Henry Muñoz-Collazos a, Richar Urbano-Muñoz a, Martha Mendoza a,b, Elizabeth León c, Enrique Herrera-Viedma d,e

a Information Technology Research Group (GTI) members, Universidad del Cauca, Sector Tulcán Office 422 FIET, Popayán, Colombia
b Full time professor, Computer Science Department, Electronic and Telecommunications Engineering Faculty, Universidad del Cauca, Colombia
c Full time professor, Systems and Industrial Engineering Department, Engineering Faculty, Universidad Nacional de Colombia, Colombia
d Full time professor, Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
e Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia

* Corresponding author: Carlos Cobos; Tel: 57-2-8209800 #2119; Fax: 57-2-8209810; e-mail: [email protected]

Abstract

The clustering of web search results - or web document clustering - has become a very interesting research area among academic and scientific communities involved in information retrieval. Web search result clustering systems, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review, while reducing the time spent reviewing them. Several algorithms for clustering web results already exist, but the results show that there is still room for improvement. This paper introduces a new description-centric algorithm for the clustering of web results, called WDC-CSK, which is based on the cuckoo search meta-heuristic algorithm, the k-means algorithm, the Balanced Bayesian information criterion, split and merge methods on clusters, and a frequent phrases approach for cluster labeling. The cuckoo search meta-heuristic provides a combined global and local search strategy in the solution space. Split and merge methods replace the original Lévy flights operation and try to improve existing solutions (nests), so they can be considered as local search methods. WDC-CSK includes an abandon operation that provides diversity and prevents the population nests from converging too quickly. The Balanced Bayesian information criterion is used as a fitness function and allows the number of clusters to be defined automatically. WDC-CSK was tested with four data sets (DMOZ-50, AMBIENT, MORESQUE and ODP-239) over 447 queries. The algorithm was also compared against other established web document clustering algorithms, including Suffix Tree Clustering (STC), Lingo, and Bisecting k-means. The results show a considerable improvement upon the other algorithms as measured by recall, F-measure, fall-out, accuracy and SSLk.

© 2014 Elsevier Ltd. All rights reserved.

Keywords: Cuckoo search algorithm, clustering of web results, web document clustering, balanced Bayesian information criterion, k-means.

1 Introduction

In recent years, web result clustering has become a very interesting research area among academic and scientific communities involved in information retrieval (IR) and web search [13]. This is because it is most likely that results relevant to the user are close to each other in the document space, thus tending to fall into a relatively small number of clusters [47] and thereby achieving significant reductions in search time. In IR, these web result clustering systems are called web clustering engines and the main exponents in the field are Carrot2 (www.carrot2.org), SnakeT (http://snaket.di.unipi.it), Yippy (http://yippy.com, originally known as Vivisimo and later as Clusty), iBoogie (www.iboogie.com), and KeySRC (http://keysrc.fub.it) [12]. Such systems usually consist of four main components: search result acquisition, preprocessing of input, cluster construction and labeling, and visualization of resulting clusters [13] (see Fig 1).

Fig 1. The components of a web clustering engine (adapted from [13])

The search result acquisition component begins with a query defined by the user. Based on this query, a document search is conducted in diverse data sources - in this case in such traditional web search engines as Google, Yahoo! and Bing. In general, web clustering engines work as meta-search engines and collect between 50 and 200 results from traditional search engines. These results contain as a minimum a URL, a snippet and a title [13].

Preprocessing of search results comes next. This component converts each of the search results (as snippets) into a sequence of words, phrases, strings or general attributes or characteristics, which are then used by the clustering algorithm. A number of tasks are performed on the search results, including removal of special characters and accents, conversion of strings to lowercase, stop word removal, stemming of words, and control of the terms or concepts allowed by a vocabulary [13].

Once preprocessing is finished, cluster construction and labeling commences, making use of three types of algorithm [13]: data-centric, description-aware and description-centric. Each of these builds clusters of documents and assigns a label to the groups. Finally, in the visualization step, the system displays the results to the user in hierarchically organized folders. Each folder seeks to have a label or title that represents well the documents it contains and that is easily identified by the user. As such, the user simply scans the folders that are actually related to their specific needs. The presentation folder tree has been adopted by various systems such as Carrot2, Yippy, SnakeT, and KeySRC, because the folder metaphor is already familiar to computer users. Other systems such as Grokker and KartOO use a different display scheme based on graphs [13].

To obtain good results in web document clustering the algorithms must meet the following specific requirements [13, 54]: 1) automatically define the number of clusters that are going to be created; 2) generate relevant clusters for the user and assign the documents to appropriate clusters; 3) define labels or names for the clusters that are easily understood by users; 4) handle overlapping clusters (documents can belong to more than one cluster); 5) reduce the high dimensionality of document collections; 6) keep the processing time low, i.e. less than or equal to 2 seconds; and 7) handle the noise that is frequently found in documents.

Another important aspect when studying or proposing an algorithm to perform web document clustering is the document representation model. The most widely used models are [38]: the Vector space model [6, 31], Latent Semantic Indexing (LSI) [6, 57], the Ontology-based model [41, 67], N-grams [54], the Phrase-based model [54], and the Frequent Word (Term) Sets model [41, 75]. In the Vector Space Model (VSM), the documents are represented as bags of words. The document collection is represented by a matrix of D terms by n documents, commonly called the Term by Document Matrix (TDM). In the TDM, each document is represented by a vector of normalized term frequency weighted by the inverse document frequency of each term, in what is known as the TF-IDF value. In VSM, the cosine similarity is used for measuring the degree of similarity between two documents or between a document and the user query. In VSM, as in most of the representation models, a process of stop word removal and stemming [6] should be applied before representing the document.
Stop word removal refers to the removal of very common words (such as articles and prepositions) and can yield a reduction of over 40% in TDM matrix dimensionality, while stemming refers to the reduction of words to their canonical stem or root form. A reduction to the terms or concepts allowed by a vocabulary can also be executed [13].

General concepts of web result clustering can be used in other information retrieval areas, e.g. as a visualization tool in digital library search systems [65] and recommender systems [21, 58]. Some web crawler tasks [2] can be modified on web search engines in order to improve the quality and performance of web clustering engines. Also, like other web search systems, web clustering engines should manage the user profile [23] so as to improve the quality of clustering results.

The two predominant problems with existing web clustering engines are inconsistencies in cluster content and inconsistencies in cluster description [13]. The first problem refers to the fact that the content of a cluster does not always correspond to the label. Also, navigation through the cluster hierarchies does not necessarily lead to more specific results. The second problem refers to the need for more expressive descriptions of clusters (i.e. the cluster labels are confusing).

These two problems are the main motivation of the present work, in which a new algorithm that obtains better results for web result clustering is put forward. The proposed algorithm is based on: 1) the cuckoo search (CS) meta-heuristic, 2) the k-means algorithm, 3) the balanced Bayesian information criterion (BBIC), 4) split and merge methods on clusters, and 5) a frequent phrases approach for cluster labeling. To the best of our knowledge, this research is the first to synergistically integrate these components. Experimental results show improvements upon the best state-of-the-art algorithms in the research field, including Lingo, STC, Bisecting k-means, Lingo 3G, OPTIMSRC, and KeySRC.

The proposed algorithm is based on the CS meta-heuristic, bearing in mind the following points: 1) the state of the art has shown that meta-heuristics offer a feasible way of solving the web document clustering problem and the results are extremely promising; 2) it has also been seen that CS is one of the best meta-heuristics for solving continuous optimization problems, outperforming the Genetic Algorithm, Particle Swarm Optimization, and Differential Evolution meta-heuristics; 3) in the context of data clustering, CS again outperforms the Genetic Algorithm and Particle Swarm Optimization meta-heuristics [30, 62]; 4) the no free lunch theorem claims that different heuristics show different results for different problems [73]. It becomes necessary, therefore, to evaluate the behavior of each one. It is also necessary to adapt the meta-heuristic to the specific requirements of each problem in order to obtain the best results.

The state of the art in data clustering reports a vast quantity of research with different approaches (partitional, hierarchical, incremental, model-based, heuristics, among others) and with applications in different sectors (image processing, agriculture, marketing, etc.) [1, 37]. Previous research using an approach similar to that proposed in this work (e.g. [51, 63, 64, 77]) does not meet the pre-defined requirements for the clustering of web results, including: not automatically defining the number of clusters, not defining cluster labels that are easily understood by users (a key factor for success in web clustering engines [7, 13]), not being designed to deliver appropriate results in a short execution time, or not being designed to work with high-dimensional data (common in text collections). In this sense, although the approach is similar to previous research, the proposed algorithm is unique and is specifically designed to meet the requirements specific to the clustering of web results.

Further, different cluster validity indices for developing data clustering algorithms are reported in the literature. These indices allow two or more solutions to be compared to decide which one is the best for solving a particular clustering problem [27, 33, 69, 77]. Following a comparison of the main cluster validity indices of the state of the art in the specific field of clustering of web results (or Web Document Clustering, WDC), it was observed that, based on the k-means algorithm and cosine similarity, the Bayesian information criterion (BIC) reported the best results, but that these results could be improved [16, 19]. To achieve this, a genetic programming and reverse engineering approach was used, thereby obtaining the BBIC criterion [20].
In this research, this new index (BBIC) is used and is experimentally shown to be more effective than BIC in the specific context of the research, while results are also superior to those of other state-of-the-art algorithms in WDC (as is shown in Section 4). Thus, the algorithm proposed, despite using an approach similar to previous research (meta-heuristic, k-means, cluster validity index, and split-and-merge methods [50]), offers a viable and successful combination in the specific field of WDC with the new fitness function (BBIC). Also, the proposal adapts a meta-heuristic (cuckoo search) - for which no report is known in the specific field of WDC - with a representation based on real-valued vectors that, in conjunction with the split and merge operations in the proposal, yields better clustering solutions in a short execution time (in the first iterations it quickly finds better solutions, as shown in Section 4), this latter being a key factor for web clustering engine success.

The remainder of the paper is organized as follows. Section 2 presents some related work and a summary of the k-means clustering algorithm. The new algorithm is described in detail in Section 3. Section 4 shows the experimental results. Finally, some concluding remarks and suggestions for future work are presented.

2 Related work

2.1 Clustering of web results

As aforementioned, there are three types of web document clustering algorithms [13]: data-centric, description-aware and description-centric. A brief review of these is presented here.

Data-centric algorithms are the algorithms traditionally used for data clustering (partitional, hierarchical, fuzzy, density-based, etc.) [9, 13, 31, 36, 37, 68]. They seek the best solution in data clustering, but are not so strong on the presentation of the labels or in the explanation of the groups obtained. They address the problem of web document clustering as merely another data clustering problem. The algorithms most commonly used for clustering of web results have been the hierarchical and the partitional ones [31]. The hierarchical algorithms generate a dendrogram or a tree of groups. This tree starts from a similarity measure, among which are: single link, complete link and average link. In relation to web document clustering, the hierarchical algorithm that brings the best results in accuracy is called UPGMA (Unweighted Pair-Group Method using Arithmetic averages) [36]. UPGMA is an algorithm devised in 1990 [41] based on the vector space model and using an average link based on the
clusters' cosine distance divided by the size of the two clusters that are being evaluated. UPGMA has the disadvantage of having a time complexity of O(n^3) and of being static in the process of assigning documents to clusters.

In partitional clustering, the algorithms perform an initial division of the data into clusters and then move the objects from one cluster to another based on the optimization of a predefined criterion or objective function [37]. The most representative algorithms that use this technique are: k-means, k-medoids, and Expectation Maximization. The k-means algorithm is the most popular because it is easy to implement and its time complexity is O(n), where n is the number of patterns or records, but it has serious disadvantages: it is sensitive to outliers, it is sensitive to the selection of the initial centroids, it requires a prior definition of the number of clusters, and the obtained clusters are only hyperspherical in shape [54]. In 2000, the Bisecting k-means [31, 41] algorithm was devised. This algorithm combines the strengths of the hierarchical and partitional methods, reporting better results than the UPGMA and k-means algorithms where accuracy and efficiency are concerned.

In partitional clustering from an evolutionary approach, in 2007 three hybridization methods between the Harmony Search (HS) [45] and k-means algorithms were compared. These were: the Sequential hybridization method, the Interleaved hybridization method and the hybridization of k-means as a step of HS. As a general result, the last method was the best choice of the three. Later, in 2008 [29, 45, 46], based on Markov chain theory, the researchers demonstrated that the last algorithm converges to the global optimum. This proposal is a data-centric algorithm [13] because, among other features, it does not define the number of clusters automatically and does not show appropriate cluster labels.

In 2009, a link-based algorithm was proposed [22]. This algorithm uses the web hyperlink structure to find dense units and also improves the joining process for creating hierarchical clusters of web documents. This proposal has the advantages of creating clusters in various shapes (with high accuracy) and removing noisy data. For the joining process, it uses a specific measure that provides the possibility of dynamically determining the cluster boundaries. Experimental results show higher clustering quality over other density-based clustering algorithms, but the test data sets and compared algorithms are not the traditional (state-of-the-art) ones of the research area. Also, the authors do not pay attention to the cluster labeling process.

A new learning algorithm based on k-means and neural networks was also proposed in 2009 [35]. This proposal uses Principal Component Analysis (PCA) to reduce the dimensionality of the document matrix (feature selection), SVD to find the measure of similarity, and a multilayer neural network to reduce the time of the document clustering process. The algorithm was tested with different kinds of web pages and the results were promising. The performance of the algorithm was proved to be satisfactory and the system can be used to cluster and classify downloaded web pages and other electronic text documents, but the test data sets and compared algorithms are not those traditional to the web document clustering research area. Also, this proposal does not pay attention to the cluster labeling process and requires a training process, which is not realistically feasible for the clustering of web results.
Finally, in 2009 a new algorithm called ArteCM was proposed [15]. This algorithm uses an incremental approach for clustering documents, offers the ability to grow the number of clusters adaptively, and employs domain-tailored similarity measures. In this proposal, an explicit centroid definition is avoided and substituted by a similarity-based concept of centroid. ArteCM was compared with two variants of k-means and with SOM, with satisfactory results in speed and clustering quality. The proposed solution includes the requirement of a domain-specific specialized similarity measure and two parameters (the minimum and maximum accepted similarity), which can be a limitation because an effective and efficient definition of these components can be tricky.

In fuzzy clustering, FTCA [48] uses a fuzzy transduction-based clustering algorithm (2010). FTCA results are promising, but they are not compared over recognized data sets, and neither do they use appropriate metrics, which are necessary to correctly compare the results of the algorithm with those of other state-of-the-art algorithms. The RElational Document clustering (RED-clustering) algorithm was also proposed in 2010 [25]. This algorithm takes into account both the content information and the hyperlink structure of a web page collection. The algorithm finds embedded patterns of the web document collection, and converges to a solution that includes different kinds of information - semantic visual coherence, content features and several relations with different "degrees" of importance between documents. The experimental results show that RED-clustering outperforms k-means and Expectation Maximization in terms of effectiveness, purity and agreement between classes and partitions, but it does not use traditional benchmark data sets from the research area and neither does it make comparisons with other state-of-the-art algorithms.

In 2011, an algorithm that performs spectral bisecting and merge operations on web documents, called METIS, was put forward [39]. Bisecting and merge operations are optimized to work with skewed distributions of cluster sizes. Results show a performance of approximately 56% and 36% in comparison with spectral bisection and k-means respectively in terms of F-measure, but in this proposal the number of clusters must be defined in advance and the data sets used for testing are not those traditional to the research area. Also in that year, another method based on multiclass spectral clustering for grouping of documents - including web pages in English and Chinese - was proposed [34]. The algorithm starts from a traditional term-by-document matrix (TF-IDF) but uses different preprocessing algorithms based on the language of the web page. To construct the similarity matrix it uses the cosine similarity measure. In general, results are promising.

Description-aware algorithms give greater weight to one specific feature of the clustering process than to the rest. For example, they make their priority the quality of the labeling of groups and as such achieve results that are more easily interpreted by the user. Their quality drops, however, in the cluster creation process. An example of this type of algorithm is Suffix Tree Clustering (STC) [41, 54], proposed in 1998, which incrementally creates labels easily understood by users, based on common phrases that appear in the documents. STC uses a phrase-based model for document representation.

Description-centric algorithms [7, 13, 28, 41, 49, 57, 82] are designed specifically for web document clustering, seeking a balance between the quality of clusters and the description (labeling) of them. In 2001, the SHOC (Semantic, Hierarchical, Online Clustering) algorithm was introduced [82]. SHOC improves STC and is based on LSI and frequent phrases. Next, in 2003, the Lingo algorithm [55-57] was devised. This algorithm is used by the Carrot2 web clustering engine and is based on complete phrases and LSI with Singular Value Decomposition (SVD). Lingo is an improvement on SHOC and STC and, unlike most algorithms, first tries to discover descriptive names for the clusters and only then organizes the documents into appropriate clusters. NMF (2003) is another example of these algorithms. It is based on the non-negative matrix factorization of the term-document matrix of the given document corpus [72]. This algorithm surpasses LSI and the spectral clustering methods in document clustering accuracy but does not deal with cluster labels. In 2004, the web document clustering problem was treated as a supervised salient phrase ranking problem using several techniques, among them linear regression, logistic regression and support vector machines. The results are promising but the training stage is unrealistic in a true application scenario [81]. In 2007, the Dynamic SVD clustering (DSC) [49] algorithm was made available. This algorithm uses SVD and a minimum spanning tree (MST), and outperforms Lingo. Another approach was proposed by the Pairwise Constraints guided Non-negative Matrix Factorization (PCNMF) algorithm [84] (2007). This algorithm transforms the document clustering problem from an unsupervised problem to a semi-supervised problem, using must-link and cannot-link relations between documents. Finally, in 2008, the CFWS (Clustering based on Frequent Word Sequences) and the CFWMS (Clustering based on Frequent Word Meaning Sequences) [41] algorithms were proposed. These algorithms represent text documents as frequent word sequences and frequent concept sequences (based on WordNet), respectively.

Proposals using frequent word sets for document representation in clustering of web results include the FTC (Frequent Term-Based Text Clustering) and HFTC (Hierarchical Frequent Term-Based Text Clustering) algorithms (2002) [7]. These algorithms use combinations of frequent words (association rules approach) shared in the documents to measure their proximity in the text clustering process. Then, in 2003, FIHC (Frequent Itemset-based Hierarchical Clustering) was introduced [28], which measures the cohesion of a cluster using frequent word sets, so that the documents in the same cluster share more frequent word sets than those in other groups. These algorithms provide accuracy similar to that reported for Bisecting k-means, with the advantage that they assign descriptive labels to clusters.
In 2009, a method based on granular computing (WDCGrc) was presented [83]. This algorithm transforms the term-by-document matrix (TF-IDF) into a document-by-binary-granules matrix and then, using an association rules algorithm, obtains frequent word sets between documents. These frequent word sets are pruned and finally used to create clusters. WDCGrc takes the number of the same words shared by documents as a similarity measure. Finally, the paper shows that WDCGrc is practical and feasible, with good quality of clustering, but it does not use standard benchmark data sets nor make comparisons with other state-of-the-art algorithms.

Full-Subtopic Retrieval with Keyphrase-Based Search Results Clustering (KeySRC) [10] was also proposed in 2009. This comprises an algorithm based on key phrases. The key phrases are extracted from a generalized suffix tree built from the search results. Documents are then clustered based on a hierarchical agglomerative clustering algorithm. Also in this proposal, a novel measure for evaluating full-subtopic retrieval performance is presented. The measure is called Subtopic Search Length under k document sufficiency (SSLk) and is currently one of the state-of-the-art measures to evaluate the performance of web clustering engines. KeySRC outperforms the STC and Lingo algorithms over the AMBIENT data set using the proposed measure.

A novel approach based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI) [52], was presented in 2010. The authors show how web directories, semantic information retrieval (SIR) systems and search results clustering systems (the most popular approach) are used to solve the query ambiguity problem. They show how SIR systems perform indexing and searching of concepts rather than terms based on different strategies, for instance, ontologies or dictionaries like WordNet. SIR systems have reported high precision on uncommon terms but still have problems when searching names instead of concepts. The key idea of this proposal was to automatically induce senses for the target query using a graph-based algorithm focused on the notion of cycles (triangles and squares) in the co-occurrence graph of the query. Then, web search results are clustered based on their semantic similarity to the induced word senses. Experiments show better results than the STC, Lingo and KeySRC algorithms over the AMBIENT and MORESQUE data sets.

In 2010, a study of the search results clustering problem as a meta-heuristic search was performed [14]. It showed that a stochastic discrete optimization algorithm could provide fast approximations to the optimal solution for the search results clustering (SRC) problem. The proposed algorithm was called OPTIMSRC (OPTImal Meta Search Results Clustering) and outperforms the results shown by KeySRC, Lingo, Lingo3G and the original order of results reported by the Yahoo! search
engine. For the labeling process, they use labels generated by other algorithms (i.e. STC or Lingo) and match the generated clusters with the most appropriate labels.

In 2010 and 2011, three new algorithms based on heuristics, partitional clustering and different strategies for labeling were put forward. The first algorithm, called IGBHSK [16], was based on global-best harmony search, k-means and frequent term sets; the second, called WDC-NMA [19], was based on memetic algorithms with niching techniques and frequent phrases; and the third, called HHWDC [18], was designed from a hyper-heuristic approach and allows defining the best algorithm for web document clustering based on several low-level heuristics and replacement strategies. These approaches outperform the results obtained with STC and Lingo, evaluate two different document representation models (term-document matrix and frequent term-document matrix) and use the Bayesian Information Criterion for evaluating the quality of solutions.

In 2012, a new algorithm called Topical was put forward [61]. It models the problem of clustering of web results as the problem of labeling clustering nodes of a graph of topics. Topics are Wikipedia pages identified by a topic annotator, and edges of the graph denote the relatedness of these topics. The new graph is based on the annotation by Tagme, which replaces the traditional bag-of-words paradigm. This constructs a good labeled clustering in terms of diversification and coverage of the snippet topics, coherence of cluster content, meaningfulness of the cluster labels, and a small number of balanced clusters. Finally, a large user study conducted on Amazon Mechanical Turk, which was aimed at ascertaining the quality of the cluster labels produced by this approach against the Clusty and Lingo3G systems, shows that the algorithm outperforms the other approaches, improving the SSLk measure by about 20% on average for different values of k.

2.2 The cuckoo search algorithm

The Cuckoo Search (CS) algorithm is based on the obligate brood parasitic behavior of some cuckoo species in combination with the Lévy flight behavior of some birds and fruit flies [79]. CS provides a new way of intensification (search for better solutions in the neighborhood of the current solution) and diversification (make sure the algorithm can explore the search space efficiently) [78]. Simplifying the breeding behavior of the cuckoo, a set of three idealized rules can be established [76]: 1) each cuckoo lays one egg at a time, and deposits its egg in a randomly chosen nest; 2) the best nests with high-quality eggs will be carried over to the next generations; 3) the number of available host nests is fixed, and the egg laid by a cuckoo is discovered by the host bird with a probability pa ∈ [0, 1] (pa, percentage of abandonments). In this case, the host bird can either get rid of the egg, or simply abandon the nest and build a completely new nest. Based on these three rules, the basic steps of CS can be summarized in Fig 2.

Line 04 of the Cuckoo Search algorithm uses Lévy flights. In nature, most animals search for food (or, in the case of the cuckoo, a host nest) in the manner of a random or quasi-random walk (because the next step is always based on the current location and the probability of moving to the next location). This can be modeled with a Lévy distribution (a continuous probability distribution for a non-negative random variable) [78] known as Lévy flights [5].
Studies have shown that the flight behavior of many animals and insects follows the typical characteristics of Lévy flights. For example, a study conducted by Reynolds and Frye [76] on fruit flies shows that the fruit fly Drosophila melanogaster explores its landscape in quite an odd manner, featuring a series of straight flight paths with sudden 90º turns, thus leading to a Lévy-flight-style intermittent scale free search pattern [80]. Fig 3 shows an example of Lévy flight in two dimensions. In the CS algorithm the line 04 “Get a cuckoo randomly by Lévy flights” is carried out using equation (1).

X_i^{(t+1)} = X_i^{(t)} + \alpha \oplus \text{Lévy}(\lambda)    (1)

Where α > 0 represents a step size, which should be related to the scales of the problem the algorithm is trying to solve. In most cases, α can be set to the value of 1, and

\text{Lévy}(\lambda) = t^{-\lambda}

Where λ ∈ (0, 3] is the exponent of the distribution, and the step length is randomly generated using a Lévy distribution.

Equation (1) is in essence a stochastic equation for a random walk based on a Markov chain, where the next location (status) depends on two parameters: the current location (the first term in the equation) and the probability of transition (the second term). The ⊕ symbol means entrywise multiplication. These entrywise multiplications are similar to those seen in the PSO algorithm, but random walks via Lévy flights are much more efficient for exploring the search space [70].

Recent proposals relating to the CS algorithm in data clustering include the following. In 2011 [30], a preliminary work was presented. This research uses a real representation of solutions (centroids), Euclidean distance, the Davies-Bouldin index as fitness function, local improvement of solutions based on SSE over iterations and, finally, an adaptation of Lévy flights in order to create new solutions (eggs). The results are promising in data clustering but they do not pay attention either to cluster labeling or to the maximum processing time available for finishing the task. Neither do they
(in order to increase the effectiveness of the algorithm) manage the dead unit problem likely to be generated with Lévy flights.

In 2013 [62], a comparative study of the Genetic Algorithm, Particle Swarm Optimization, and Cuckoo Search over several clustering problems was conducted. CS shows better results in average classification error percentage and in execution time. In this work, a pre-defined number of clusters should be used in order to execute all algorithms; therefore, this proposal is unfeasible for web document clustering. Further, it does not pay attention to cluster labels, noise, and other success factors of the specific research field.

01  Objective function f(x); x = (x1, x2, …, xd)^T
02  Generate initial population of n host nests xi (i = 1, 2, …, n)
03  while (t < MaxGeneration) or (stop criterion)
04      Get a cuckoo randomly by Lévy flights
05      Evaluate its quality/fitness Fi
06      Choose a nest among n (say, j) randomly
07      if (Fi > Fj)
08          Replace j by the new solution
09      end
10      A fraction (pa) of worst nests are abandoned and new ones are built
11      Keep the best solutions (or nests with quality solutions)
12      Rank the solutions and find the current best
13  End while
14  Post process results and visualization

Fig 2. Pseudo-code of Cuckoo Search Algorithm via Lévy Flights (Taken from [76])
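To make the pseudo-code in Fig 2 concrete, the following is a minimal, self-contained Python sketch of the basic Cuckoo Search loop for a generic continuous objective (a simple sphere function is used purely as a placeholder). It is an illustration of the canonical algorithm, not of WDC-CSK: the Lévy step is drawn with Mantegna's algorithm (a common choice), the step is scaled by the distance to the current best nest (a frequent implementation detail not shown in Fig 2), the comparison in lines 07-09 is inverted because the sketch minimizes, and the parameter values (15 nests, pa = 0.25, alpha = 0.01) are illustrative defaults.

import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5):
    # Mantegna's algorithm for Levy-distributed step lengths
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2) /
               (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = np.random.normal(0.0, sigma_u, dim)
    v = np.random.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(f, dim, n_nests=15, pa=0.25, alpha=0.01, max_gen=500):
    nests = np.random.uniform(-5, 5, (n_nests, dim))                  # line 02: initial population
    fitness = np.array([f(x) for x in nests])
    best = nests[fitness.argmin()].copy()
    for _ in range(max_gen):                                          # line 03: main loop
        i = np.random.randint(n_nests)
        new = nests[i] + alpha * levy_step(dim) * (nests[i] - best)   # line 04: new cuckoo via Levy flight
        f_new = f(new)                                                # line 05: evaluate its fitness
        j = np.random.randint(n_nests)                                # line 06: pick a random nest j
        if f_new < fitness[j]:                                        # lines 07-09 (minimization)
            nests[j], fitness[j] = new, f_new
        n_abandon = max(1, int(pa * n_nests))                         # line 10: abandon the worst nests
        worst = fitness.argsort()[-n_abandon:]
        nests[worst] = np.random.uniform(-5, 5, (n_abandon, dim))
        fitness[worst] = [f(x) for x in nests[worst]]
        best = nests[fitness.argmin()].copy()                         # lines 11-12: keep/rank the best
    return best, float(f(best))

if __name__ == "__main__":
    sphere = lambda x: float(np.sum(x ** 2))                          # placeholder objective
    x_best, f_best = cuckoo_search(sphere, dim=5)
    print(x_best, f_best)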

Fig 3. An example of a Lévy flight in two dimensions

2.3 The k-means algorithm

The k-means algorithm is the simplest and most commonly used algorithm for clustering, employing a Sum of Squared Error (SSE) criterion based on (2).

SSE = \sum_{j=1}^{k} \sum_{i=1}^{n} P_{i,j} \left\| x_i - c_j \right\|^{2}    (2)

Where n is the total number of records (documents), k is the number of clusters, and P_{i,j} equals 1 when the document x_i belongs to the cluster c_j, or 0 otherwise.
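As a small illustration of (2), the following Python sketch computes the SSE of a hard clustering from a document matrix, a set of centroids and an assignment vector; the variable names and the toy data are hypothetical.

import numpy as np

def sse(X, centroids, assignment):
    # Sum of Squared Error of a hard clustering (Eq. 2)
    # X          : (n, D) document-term matrix
    # centroids  : (k, D) cluster centers
    # assignment : length-n array, assignment[i] = index of the cluster of x_i
    diffs = X - centroids[assignment]          # x_i - c_j for the assigned cluster only (P_ij = 1)
    return float(np.sum(diffs * diffs))

# Tiny example: 4 documents in 3-D, 2 clusters
X = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.8, 0.2]])
centroids = np.array([[0.95, 0.05, 0.0], [0.0, 0.9, 0.1]])
print(sse(X, centroids, np.array([0, 0, 1, 1])))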

This algorithm is popular because it finds a local minimum (or maximum) in a search space, it is easy to implement, and its time complexity is O(n*k*L), where L is the average number of iterations required by the k-means algorithm to converge; therefore "this algorithm can be considered linear in the dataset size" [40, 47, 59, 74]. Unfortunately, the quality of the result is dependent on the initial points and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen [8, 32, 37, 44]. The k-means inputs are: the number of clusters (k value) and a set (table, array or collection) containing n objects (or registers) in a D-dimensional feature space, formally defined by X = {x1, x2, …, xn} (in our case, xi is a row vector, for implementation reasons). The k-means output is a set containing k centers. The steps in the procedure of k-means can be summarized as shown in Fig 4.

01  Select an Initial Partition (k centers)
02  Repeat
03      Data Assignment: Re-compute Membership
04      Relocation of "means": Update Centers
05  Until (Stop Criterion)
    Return Solution

Fig 4. The k-means algorithm

In line 01, there are several approaches for selecting the k initial centers [60]; for example, Forgy [26] suggested selecting k instances randomly from the data set, and McQueen suggested selecting the first k points in the data set as the preliminary seeds and then using an incremental strategy to update and select the real k centers of the initial solution [60]. In line 02, it is necessary to recompute membership according to the current solution. Several similarity or distance measurements can be used. In this paper, we used the cosine similarity, formally defined as (3).

Sim_{\cos}(i, q) = \frac{\sum_{t=1}^{D} W_{t,i} \times W_{t,q}}{\sqrt{\sum_{t=1}^{D} W_{t,i}^{2}} \; \sqrt{\sum_{t=1}^{D} W_{t,q}^{2}}}    (3)

Where D is the total number of attributes (features or terms in the document collection), W_{t,i} is the TF-IDF value of a term t in the current document i (see equation (7)), and W_{t,q} is the TF-IDF value of a term t in the query q. The query q can be replaced by another document, in which case the cosine similarity measures the similarity between two documents.
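Putting Fig 4 and equation (3) together, the following is a minimal Python sketch of k-means over TF-IDF-like row vectors, using Forgy initialization for line 01 and cosine similarity for the assignment step; it is an illustrative re-implementation under those assumptions, not the code used in WDC-CSK.

import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two row vectors (Eq. 3)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def kmeans_cosine(X, k, max_cycles=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()   # line 01: Forgy initialization
    labels = np.full(len(X), -1)
    for cycle in range(max_cycles):                                 # lines 02-05 of Fig 4
        # line 03: data assignment - each document goes to its most similar center
        new_labels = np.array([max(range(k), key=lambda j: cosine_sim(x, centers[j])) for x in X])
        if np.array_equal(new_labels, labels):
            break                                                   # converged before max_cycles
        labels = new_labels
        # line 04: relocation of "means" - each center becomes the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

# Usage on a tiny TF-IDF-like matrix (5 documents, 4 terms)
X = np.array([[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.7, 0.3], [0.0, 0.1, 0.8, 0.2], [0.1, 0.0, 0.6, 0.4]])
centers, labels = kmeans_cosine(X, k=2)
print(labels)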

In the literature of partitional clustering, various criteria have been used to compare two or more solutions and decide which is better [36, 71]. The most popular criteria are based on the within-cluster and between-cluster scatter matrices. In this research, two criteria were used to find the number of clusters automatically: the Bayesian Information Criterion [21], expressed by (4), and the Balanced Bayesian Information Criterion [20], expressed by (5).

BIC = n \cdot \ln\left(\frac{SSE}{n}\right) + k \cdot \ln(n)    (4)

BBIC = n \cdot \ln\left(\frac{SSE}{n \cdot ADBC}\right) + k \cdot \ln(n)    (5)

Where n is the total number of documents, k is the number of clusters, SSE is the sum of squared error expressed by formula (2), and ADBC is the average distance between all centroids in the clustering solution, expressed by formula (6).

ADBC = \frac{2}{k(k-1)} \sum_{j=1}^{k-1} \sum_{l=j+1}^{k} \left(1 - Sim_{\cos}(c_l, c_j)\right)    (6)
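A minimal Python sketch of criteria (4)-(6) follows; the cosine-similarity helper is repeated so the fragment is self-contained, and the function and argument names are illustrative.

import numpy as np
from math import log

def cosine_sim(a, b):
    # Cosine similarity between two row vectors (Eq. 3)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def adbc(centroids):
    # Average distance between all pairs of centroids (Eq. 6); requires k >= 2
    k = len(centroids)
    total = sum(1.0 - cosine_sim(centroids[l], centroids[j])
                for j in range(k - 1) for l in range(j + 1, k))
    return 2.0 * total / (k * (k - 1))

def bic(sse_value, n, k):
    # Bayesian Information Criterion (Eq. 4)
    return n * log(sse_value / n) + k * log(n)

def bbic(sse_value, n, k, centroids):
    # Balanced Bayesian Information Criterion (Eq. 5)
    return n * log(sse_value / (n * adbc(centroids))) + k * log(n)

# Example: two orthogonal centroids, a small SSE, 20 documents
centroids = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(bic(2.5, n=20, k=2), bbic(2.5, n=20, k=2, centroids=centroids))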

3 The new algorithm: WDC-CSK

The new algorithm, called Web Document Clustering based on the Cuckoo Search Algorithm (WDC-CSK), is a description-centric algorithm [13] for the clustering of web results, which was inspired by the new meta-heuristic algorithm, Cuckoo Search (CS) [76]. The algorithm combines a global/local search strategy over the whole solution space (from this point of view it is a memetic algorithm [53]). The k-means algorithm was used as a local strategy for improving CS global solutions. The Lévy flights are replaced by two operations or methods, split and merge, which are used to promote diversity in the population and prevent the population from converging too quickly to locally optimal solutions. Finally, either the Balanced Bayesian Information Criterion (BBIC) or the Bayesian Information Criterion (BIC) can be used as a fitness function and helps the algorithm to find the number of clusters automatically. BBIC is recommended, however. Fig 5 shows the main steps executed by the WDC-CSK algorithm. In the following, detailed explanations of the most important steps of WDC-CSK are presented.

01: Initialize algorithm parameters. In this research, the optimization problem lies in minimizing the BBIC or BIC criterion, called the fitness function. WDC-CSK needs the following parameters: the Maximum Number of Islands (MNI) - an integer between 1 and 5; Population Size (PS) - an integer between 5 and 10; Objective Function (OF) - an enumeration value between BBIC and BIC; the Probability of an Abandoned nest (PA) - a real value between 0.1 and 0.2; Term Frequency Threshold (TFT) - an integer value greater than 2 for the labeling process; Maximum Number of Cycles required by the k-means algorithm to converge (MNCK) - an integer greater than or equal to 1; and finally, the Maximum Number of Nests (MNN) or the Maximum Execution Time (MET) in milliseconds, as the stopping criterion of the algorithm.

As can be seen in the literature, parallelism is necessary not only to reduce the resolution time, but also to improve the quality of the provided solutions [3, 4, 11, 43], since in many cases the search progresses differently when using a parallel meta-heuristic [43]. The Maximum Number of Islands (MNI) parameter and lines 03 and 14 allow WDC-CSK to execute a search process in parallel using threads that do not share information, known as islands. The MNI parameter defines the number of threads (islands) that WDC-CSK executes in parallel and separately. Lines 04 to 13 are executed as a unit within each island (execution thread). When all threads have finished their work, the algorithm selects the best nest found over all islands (line 15).

02: Document preprocessing. In this stage, Lucene (http://lucene.apache.org) is used for document pre-processing, which includes: tokenization, lowercase filtering, stop word removal, Porter's stemming algorithm and the building of the Term-by-Document Matrix (TDM). Dimensions (columns) with a range equal to zero (0) are also removed. The TDM matrix is the most widely-used structure for document representation in IR, and is based on the vector space model [6, 31]. In this model, the documents are represented as bags of words; the document collection is represented by a matrix of D terms by n documents. Each document is represented by a vector of normalized term frequency weighted by the inverse document frequency for that term, in what is known as the TF-IDF value (expressed by equation (7)), and the cosine similarity (see equation (3)) is used for measuring the degree of similarity between two documents or between a document and a cluster centroid.

w_{t,i} = \frac{freq_{t,i}}{\max(freq_i)} \times \log\left(\frac{n}{n_t}\right)    (7)

Where freq_{t,i} is the observed frequency of the term t in document i, max(freq_i) is the maximum observed frequency in document i, n is the total number of documents in the collection, and n_t is the number of documents in which term t is present.
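The weighting in (7) can be sketched as follows, building TF-IDF weights from a matrix of raw term counts; the guards against empty documents and unseen terms are implementation assumptions, not part of the formula.

import numpy as np

def tfidf(counts):
    # TF-IDF weights per Eq. (7); counts is an (n_docs, n_terms) matrix of raw frequencies freq_{t,i}
    n_docs = counts.shape[0]
    max_freq = counts.max(axis=1, keepdims=True)          # max(freq_i) per document
    max_freq[max_freq == 0] = 1                           # guard for empty documents (assumption)
    tf = counts / max_freq                                # freq_{t,i} / max(freq_i)
    n_t = (counts > 0).sum(axis=0)                        # number of documents containing term t
    idf = np.log(n_docs / np.maximum(n_t, 1))             # log(n / n_t), guarding n_t = 0 (assumption)
    return tf * idf

# Example: 3 snippets, 4 terms
counts = np.array([[3, 0, 1, 0], [0, 2, 2, 0], [1, 1, 0, 4]])
print(np.round(tfidf(counts), 3))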

01  Initialize algorithm parameters
02  Document preprocessing
03  Execute in parallel a specific number (MNI) of islands
04      Initialize population of nests; randomly create a set of nests (population of nests) for the current island
05      Execute k-means (local optimizer) for each nest in the population of the current island
06      Calculate fitness values (BBIC or BIC) according to (4) or (5) for all nests in the population of the current island
07      Repeat
08          Create a new nest using the abandon, split or merge operation (method), based on a randomly selected nest (current nest) from the current island
09          Execute k-means (local optimizer) for the newly generated nest
10          Calculate the fitness value (BBIC or BIC) according to (4) or (5) for the newly generated nest
11          Store the best solution; if the newly generated nest is better than another randomly selected nest, this last nest is replaced in the population by the newly generated nest
12      Until the stopping condition is satisfied (the MNN parameter is reached or the MET parameter is reached)
13      Select the best nest in the population of nests of the current island
14  End of parallel execution
15  Select the best nest from all islands
16  Assign labels to clusters in the best nest based on the frequent phrases in each cluster

Fig 5. Summary of the WDC-CSK algorithm
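As a structural sketch of the island-level parallelism in Fig 5 (lines 03, 14 and 15), the Python fragment below runs independent islands in separate threads and keeps the best nest over all of them. The per-island search is only a placeholder (random nests scored by the total within-cluster squared distance standing in for BBIC); the real lines 04-13, with k-means refinement and the abandon/split/merge operations, are described in the following steps.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def run_island(island_id, tdm, params):
    # Placeholder for lines 04-13 of Fig 5: evolve nests on one island and return its best
    # nest as (fitness, centroids). Each nest here is a random choice of documents used as
    # centroids, scored by a stand-in fitness rather than the actual BBIC procedure.
    rng = np.random.default_rng(island_id)
    best = (np.inf, None)
    for _ in range(params["MNN"]):                           # lines 07-12: generate MNN nests
        k = int(rng.integers(2, params["k_max"] + 1))
        centroids = tdm[rng.choice(len(tdm), size=k, replace=False)]
        dists = ((tdm[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        fitness = float(dists.min(axis=1).sum())             # placeholder for BBIC (Eq. 5)
        if fitness < best[0]:
            best = (fitness, centroids)                      # line 13: best nest of this island
    return best

def wdc_csk_skeleton(tdm, params):
    with ThreadPoolExecutor(max_workers=params["MNI"]) as pool:          # line 03: islands in parallel
        results = list(pool.map(lambda i: run_island(i, tdm, params), range(params["MNI"])))
    return min(results, key=lambda r: r[0])                  # line 15: best nest over all islands

tdm = np.random.rand(40, 10)                                 # toy term-by-document matrix
print(wdc_csk_skeleton(tdm, {"MNI": 3, "MNN": 50, "k_max": 7})[0])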

04: Initialize population of nests. The WDC-CSK algorithm works with nests, which are used to represent solutions. Each nest has a different number of clusters, a list of centroids, and the objective function value, based on BBIC or BIC, which depends on the location and number of centroids in each nest. The cluster centers in the nest consist of D x ki real numbers, where ki is the number of clusters and D is the total number of terms (words in the vocabulary). For example, in three-dimensional data, the nest < [0.5|0.1|0.8], [0.2|0.5|0.3], [0.4|0.2|0.8], [0.1|0.7|0.7], 0.819 > encodes the centers of four (k value) clusters with a fitness value of 0.819. Initially, each centroid corresponds to a different document randomly selected from the TDM matrix (the Forgy strategy in the k-means algorithm). The initial number of clusters ki, the k value, is randomly calculated from 2 to kmax (inclusive), where k is a natural number and kmax is the upper limit of the number of clusters and is taken to be Trunc(√n + 1)
(where n is the total number of documents in the TDM matrix; users can define a minimum number of documents in order to execute the clustering algorithm, for example eight, equivalent to the total results presented on the first page of a Google web search), which is a rule of thumb used by many researchers in the clustering literature.

05: Execute k-means. Lines 02 to 05 of Fig 4 are executed based on the centroids registered in each nest. Line 01 is not necessary because the nest has previously selected the centroids. Lines 02 to 05 are repeated MNCK times. This parameter controls the exploitation level of the algorithm. If this parameter is 1, the algorithm does not improve solutions; it only assigns the documents to the appropriate centroids and recalculates the centroids. If this value is greater than 1, the algorithm improves solutions in a local environment. However, lines 02 to 05 can finish before reaching the parameter value should k-means converge in fewer cycles. Lines 05 and 09 are equal in the WDC-CSK algorithm.

08: Create a new nest. To create a new nest (solution) the algorithm performs an abandon, merge, or split operation (a simplified sketch of these three operations is given after the labeling description below). With a specific probability (defined by the PA parameter) the algorithm creates a new nest with randomly selected centroids from the TDM matrix. This operation corresponds to an abandon and is inspired by the situation in which a cuckoo egg is discovered by the host bird. In this case, a totally new nest is created to complete the population of cuckoo nests in the current island. This operation provides diversity and prevents the population nests from converging too quickly. With a specific probability ((1-PA)*0.5) the split or merge operation is executed. These operations replace the Lévy flights instruction in the original cuckoo search algorithm. For both operations, a nest is initially selected at random from the current population. This nest is copied into a new nest, called the base nest. In the merge operation, the two most similar centroids (measured by cosine similarity) in the base nest are selected and joined. In the split operation, the most disperse cluster is selected and divided into two clusters. The most disperse cluster is selected based on the SSE (Sum of Squared Error) value reported for each cluster, associated with each centroid in the base nest. To divide the cluster, the most different document in the selected cluster is selected and a new cluster is created with this document as its centroid.

13: Select the best nest. In this step the algorithm finds and selects the best solution in the population of nests of the current island. The best nest is the nest with the lowest fitness value (minimizing BBIC or BIC). This solution is then returned as the best clustering solution (centroids and fitness) of the current island.

16: Assign labels to clusters. The algorithm uses a Frequent PHrases (FPH) approach for labeling each cluster. This step corresponds to step 2, called "Frequent Phrase Extraction", in Lingo [57] (with some modifications). In WDC-CSK this method is used for each generated cluster in the best solution. Labeling of each cluster works as follows: a. Conversion of the representation: all documents in the current cluster are selected, one by one, and converted from a character-based to a word-based representation. b. Document concatenation: in the current cluster the documents are concatenated and a new document is created with the inverted version of the concatenation. c.
Complete phrase discovery: in the current cluster the right- and left-complete phrases are discovered, alphabetically sorted by the method and combined into a set of complete phrases. This process is based on the following definition: "S is a complete substring of T when S occurs in k distinct positions p1, p2, …, pk in T, and the (pi-1)-th character in T is different from the (pj-1)-th character for at least one pair (i, j), 1 ≤ i < j ≤ k".
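As a rough illustration of the nest representation (step 04) and the three operations of step 08, the Python sketch below stores a nest as an array of centroid row vectors and follows the textual description above: abandon builds a fresh nest from randomly chosen documents, merge joins the two most similar centroids (here averaged, an assumption), and split divides the cluster with the largest SSE around its least similar document. It is a simplification with hypothetical names, not the authors' implementation, and it assumes every cluster has at least one assigned document.

import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two row vectors (Eq. 3)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def abandon(tdm, rng, k_max):
    # Abandon: build a completely new nest from randomly chosen documents used as centroids
    k = int(rng.integers(2, k_max + 1))
    return tdm[rng.choice(len(tdm), size=k, replace=False)].copy()

def merge(centroids):
    # Merge: join the two most similar centroids (cosine similarity); here they are averaged
    k = len(centroids)
    pairs = [(cosine_sim(centroids[i], centroids[j]), i, j)
             for i in range(k - 1) for j in range(i + 1, k)]
    _, i, j = max(pairs)
    merged = (centroids[i] + centroids[j]) / 2.0
    rest = [c for idx, c in enumerate(centroids) if idx not in (i, j)]
    return np.array(rest + [merged])

def split(centroids, tdm, labels):
    # Split: divide the most disperse cluster (largest per-cluster SSE) by turning its
    # least similar document into the centroid of a new cluster
    sse_per_cluster = [float(np.sum((tdm[labels == j] - c) ** 2)) for j, c in enumerate(centroids)]
    j = int(np.argmax(sse_per_cluster))
    members = tdm[labels == j]                      # assumed non-empty
    sims = [cosine_sim(x, centroids[j]) for x in members]
    return np.vstack([centroids, members[int(np.argmin(sims))]])

# Usage on a toy TDM: build a nest, assign documents by cosine similarity, then merge and split it
rng = np.random.default_rng(0)
tdm = rng.random((20, 6))
nest = abandon(tdm, rng, k_max=5)
labels = np.array([int(np.argmax([cosine_sim(x, c) for c in nest])) for x in tdm])
print(len(nest), len(merge(nest)), len(split(nest, tdm, labels)))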
