Query expansion using an immune-inspired biclustering algorithm ...

Nat Comput (2010) 9:579–602 DOI 10.1007/s11047-009-9127-y

Query expansion using an immune-inspired biclustering algorithm

Pablo A. D. de Castro · Fabrício O. de França · Hamilton M. Ferreira · Guilherme Palermo Coelho · Fernando J. Von Zuben

Published online: 21 April 2009
© Springer Science+Business Media B.V. 2009

Abstract Query expansion is a technique utilized to improve the performance of information retrieval systems by automatically adding related terms to the initial query. These additional terms can be obtained from documents stored in a database. Usually, this task is performed by clustering the documents and then extracting representative terms from the clusters. Afterwards, a new search is performed in the whole database using the expanded set of terms. Recently, the authors have proposed an immune-inspired algorithm, namely BIC-aiNet, to perform biclustering of texts. Biclustering differs from standard clustering algorithms in the sense that the former can detect partial similarities in the attributes. The preliminary results indicated that our proposal is able to group similar texts effectively and the generated biclusters consistently presented relevant words to represent a category of texts. Motivated by this promising scenario, this paper better formalizes the proposal and investigates the usefulness of the whole methodology on larger datasets. The BIC-aiNet was applied to a set of documents aiming at identifying the set of relevant terms associated with each bicluster, giving rise to a query expansion tool. The obtained results were compared with those produced by two alternative proposals in the literature, and they indicate that these techniques tend to generate complementary results, as a consequence of the use of distinct similarity metrics.

P. A. D. de Castro (corresponding author) · F. O. de França · H. M. Ferreira · G. P. Coelho · F. J. Von Zuben
Laboratory of Bioinformatics and Bio-inspired Computing (LBiC), Department of Computer Engineering and Industrial Automation (DCA), School of Electrical and Computer Engineering (FEEC), University of Campinas (Unicamp), P.O. Box 6101, Campinas, SP 13083-970, Brazil
e-mail: [email protected]
F. O. de França, e-mail: [email protected]
H. M. Ferreira, e-mail: [email protected]
G. P. Coelho, e-mail: [email protected]
F. J. Von Zuben, e-mail: [email protected]


Keywords: Biclustering · Artificial immune systems · Information retrieval · Query expansion

1 Introduction

With the popularization of the web and the collaboration of users in producing digital content, there has been a significant increase in the amount of documents in electronic format. Therefore, several information retrieval tools have been proposed to help users find relevant documents automatically, since manual inspection is becoming more and more prohibitive.

In most collections, the same concept may be referred to using different words. This issue has an impact on the recall of most information retrieval systems. In order to improve the performance of the retrieval system, a technique called query expansion is usually employed, which consists of reformulating the initial query submitted by the user by adding related terms. For instance, suppose a user inputs a query with the term "informatics". This query can be automatically expanded by adding terms such as "computer", "software", "hardware", "security", "book" and "courses".

These additional terms can be obtained from documents stored in the database. In this sense, information retrieval systems utilize clustering to put together similar documents and assume that, if a document from a cluster is relevant to the search, then other documents from the same cluster are also relevant. Therefore, the system scans these related documents looking for representative terms. Afterwards, a new search is performed in the whole database.

Standard clustering algorithms such as k-means, Self-Organizing Maps, and Hierarchical Clustering present a well-known limitation when dealing with large and heterogeneous datasets: since they group texts based on global similarities in their attributes, partial matching cannot be detected. For instance, if two or more texts share only a subset of similar attributes, the standard clustering algorithms fail to identify this specificity in a proper manner.
Besides, they assign a text to only one category, even when the text is involved in more than one category, each one associated with distinct subsets of attributes. A recent proposal to avoid these drawbacks is the so-called biclustering technique, which performs clustering of rows and columns simultaneously, allowing the extraction of additional information from the dataset (Cheng and Church 2000). It is worth mentioning that biclustering is not equivalent to two sequential and independent clusterings, as will be described in Sect. 2.

Since the generation of biclusters can be viewed as a multimodal combinatorial optimization process, in a previous work the authors have proposed a novel heuristic-based methodology for accomplishing this task, named BIC-aiNet (Artificial Immune Network for Biclustering). The BIC-aiNet algorithm is based on the Artificial Immune Systems (AIS) paradigm (de Castro and Timmis 2002) and its effectiveness has been attested in different and challenging clustering scenarios, such as collaborative filtering (Castro et al. 2007a, b), gene expression data (Coelho et al. 2008), and text mining (Castro et al. 2007c).

Regarding the text mining problem, the experiments have indicated that our methodology is able to group similar texts in a very effective way. Besides, another interesting feature observed is the capability of the biclusters to extract useful information from groups of texts, such as the subset of words responsible for putting those texts together in the same group. Motivated by these results, the objective of this paper is to employ the developed biclustering tool to group texts according to their profiles of term frequency so that


previously unknown categories can be identified, together with their relevant terms, and such terms can be suggested to the user, so that he/she can select those capable of improving a given query. In other words, the goal of this paper is to apply a biclustering algorithm to a set of documents, identify the set of relevant terms associated with each bicluster, and use those terms to perform interactive query expansion.

Experiments with two datasets were carried out and the results were compared with those produced by two alternative algorithms. Since each proposal employs different similarity metrics and distinct procedures, they generated alternative sets of terms for the same query, but with some intersection among them. Although distinct, all the terms were related to the initial query and they seem to be complementary. Therefore, these results indicate that a method to combine the outcomes of these different approaches could lead to significant improvements in the query expansion problem.

The paper is organized as follows. In Sect. 2 we provide a brief background on query expansion, biclustering and artificial immune systems. Section 3 describes in detail the proposed approach to generate biclusters. In Sect. 4 we present the application of the algorithm to general biclustering problems. Section 5 describes and analyzes the experimental results on query expansion. Finally, in Sect. 6 we conclude the paper and provide directions for further steps of the research.

2 Background

The aim of this section is to introduce some basic concepts of query expansion, biclustering and, finally, artificial immune systems, with their remarkable characteristics.

2.1 Query expansion

Over recent years, we have witnessed an impressive growth in the availability of information in electronic format, mostly in the form of text, due to the popularization of the web and the increasing number and size of digital and corporate libraries. The overwhelming amount of text is hard to assimilate by a user, who faces an information overload problem. In this context, automated information retrieval systems were developed to help users filter this huge amount of documents, keeping only the relevant ones. Users submit queries, i.e., formal statements of their information needs, to the system, and the system matches these queries to documents stored in a database.

A fundamental problem in information retrieval is word mismatch due to ambiguity and contextual issues. Users often utilize in their queries different terms to describe the same concept, and these terms may not match the terms used to index the majority of documents stored in the database. This problem can be alleviated if the user inputs a query with several terms, since there is a higher chance that some important words co-occurring in the query also co-occur in relevant documents. However, in practical applications, the queries are very short. According to Croft et al. (1995), the average length of a query submitted to web applications is two words.

A common approach to solve this problem and improve the results of an information retrieval system is to enrich the initial query by adding more terms, in an attempt to refine the query. For instance, consider a user that inputs a query with the term "car". The system can incorporate more terms into the query, such as "auto", "vehicle", "race", "rent" and "tire". This technique is called query expansion.
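As an illustration of this cluster-based strategy, the sketch below suggests expansion terms drawn from document clusters that contain at least one document matching the query. The toy corpus is hypothetical, and ranking candidate terms by raw frequency is an illustrative assumption, not the method of any specific system discussed here:

```python
from collections import Counter

def expand_query(query_terms, doc_clusters, top_n=3):
    """Suggest expansion terms: for each cluster containing a document that
    matches the query, rank the cluster's most frequent non-query terms."""
    suggestions = Counter()
    for cluster in doc_clusters:
        # A cluster is considered relevant if any of its documents
        # contains at least one query term
        if any(t in doc for doc in cluster for t in query_terms):
            for doc in cluster:
                suggestions.update(w for w in doc if w not in query_terms)
    return [term for term, _ in suggestions.most_common(top_n)]

# Toy corpus: each document is a bag of words, grouped into two clusters
clusters = [
    [["car", "auto", "tire"], ["car", "vehicle", "rent"]],
    [["cake", "recipe", "oven"]],
]
print(expand_query(["car"], clusters))  # ['auto', 'tire', 'vehicle']
```

In an interactive setting, the returned list would be shown to the user, who selects which terms to append to the query.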


The query expansion can be automatic or interactive. In automatic query expansion, the information retrieval system itself adds the additional terms, whereas in the interactive mode the system presents some terms and leaves to the user the task of choosing the most suitable ones to be inserted into the query.

The central question in query expansion is how to generate additional terms to compose the new query. This task can be accomplished by extracting, for every term in the initial query, synonyms or related terms from other documents or thesauri. An advantage of this approach is that the terms are extracted from the collection of documents, thus automatically exploring some sort of contextual information.

2.2 Biclustering

In data mining, biclustering refers to the process of simultaneously finding clusters on the rows and columns of a matrix (Cheng and Church 2000). This matrix may represent different kinds of numerical data, such as objects and their attributes (comprising the rows and columns of the matrix, respectively). The biclustering approach covers a wide scope of different applications, and the main motivation is to find data points that are correlated under only a subset of attributes, which cannot be identified by usual clustering methods. Some examples of biclustering applications are dimensionality reduction (Agrawal et al. 1998), biological data analysis (Agrawal et al. 1998; Tang et al. 2001; Mitra and Banka 2006; Mitra et al. 2006; Divina and Aguilar-Ruiz 2007; Giráldez et al. 2007; Maulik et al. 2008), information retrieval and text mining (Castro et al. 2007c; Dhillon 2001; Feldman and Sanger 2006), electoral data analysis (Hartigan 1972) and collaborative filtering (Castro et al. 2007a, b; Symeonidis et al. 2007).

Figure 1b and c shows two examples of biclusters extracted from an original matrix (a). The bicluster (b) was created with rows {1,2} and columns {2,4}, and the bicluster (c) was created with rows {2,3,5} and columns {1,3,5}.
In these illustrative examples, the metric adopted is based on the degree of similarity among rows and columns. Several approaches were proposed in the literature to extract biclusters from a dataset (Sheng et al. 2003; Tang et al. 2001). These approaches can be divided into several categories (Madeira and Oliveira 2004; Tanay et al. 2005), according to: (i) how they build the set of biclusters; (ii) the structure adopted for the biclusters; and (iii) the way each technique evaluates the quality of a bicluster.

The way the biclusters are built refers to the number of biclusters discovered per run. Some algorithms find only one bicluster at each run, while others are capable of simultaneously finding several biclusters. Besides, there are nondeterministic and deterministic algorithms. Nondeterministic algorithms may find different solutions for the same problem at each execution, while deterministic ones always produce the same solution.

The biclusters returned by the algorithms may have different structures: (i) exclusive columns and/or rows, which consists of biclusters that cannot overlap in either columns or

Fig. 1 Two biclusters (b and c) extracted from the original matrix (a)


rows of the matrix; (ii) arbitrarily positioned and possibly overlapping biclusters; and (iii) overlapping biclusters with hierarchical structure.

The classification based on the quality measure of a biclustering algorithm is related to the concept of similarity between the elements of the matrix. For instance, some algorithms search for constant-value biclusters (Fig. 2a), some for constant columns or rows (Fig. 2b, c), and others for coherence in the values of the elements (Fig. 2d). In Fig. 2d, the second row is equal to the first one plus 1 and the third row equals the first minus 1. A more common situation, however, is the presence of some degree of deviation from the desired profile (see Fig. 1c, for example), which constitutes a residue of the bicluster with respect to the coherence metric.

To calculate the coherence among the elements of a bicluster, the mean squared residue introduced by Cheng and Church (2000) is often used. This metric consists in the calculation of the additive coherence inside a bicluster, by assuming that each row (or column) of the bicluster presents a profile identical (or very similar) to the one exhibited by the other rows (or columns), except for a constant bias. Thus, each element of a coherent bicluster could be calculated as:

a_{ij} = \mu + \alpha_i + \beta_j,    (1)

where \mu is the base value of the bicluster, \alpha_i is the additive adjustment of row i \in I, \beta_j is the additive adjustment of column j \in J, I is the set of rows and J is the set of columns in the bicluster. With the calculation of the mean value of the whole bicluster (a_{IJ}), the mean value of row i (a_{iJ}) and the mean value of column j (a_{Ij}), we get:

a_{iJ} = \mu + \alpha_i + \bar{\beta},    (2)

a_{Ij} = \mu + \bar{\alpha} + \beta_j,    (3)

a_{IJ} = \mu + \bar{\alpha} + \bar{\beta},    (4)

where \bar{\alpha} is the mean value of \alpha_i (i \in I) and \bar{\beta} is the mean value of \beta_j (j \in J). Isolating \alpha_i, \beta_j and \mu in Eqs. 2-4 above, we obtain:

\alpha_i = a_{iJ} - \mu - \bar{\beta},    (5)

\beta_j = a_{Ij} - \mu - \bar{\alpha},    (6)

\mu = a_{IJ} - \bar{\alpha} - \bar{\beta}.    (7)

Substituting Eq. 7 in Eqs. 5 and 6, we have:

\alpha_i = a_{iJ} - a_{IJ} + \bar{\alpha},    (8)

\beta_j = a_{Ij} - a_{IJ} + \bar{\beta}.    (9)

And finally, substituting Eqs. 7-9 in Eq. 1 leads to:

Fig. 2 Example of different quality measures: a constant bicluster, b constant rows, c constant columns, d coherent values (rows or columns should exhibit high correlation)


Fig. 3 Example of a coherent bicluster (rows exhibit high correlation)

a_{ij} = a_{Ij} + a_{iJ} - a_{IJ}.    (10)

Let us take as an example the bicluster depicted in Fig. 3. We can see that the second row is equal to the first one plus 10 and the third row is equal to the first plus 5. Notice that each entry of matrix A in Fig. 3 can be easily calculated with Eq. 10. Therefore, finding a coherent bicluster is basically the same as finding a bicluster that minimizes the error between the estimated value and the real value of each element of the matrix. So the mean squared residue becomes:

R = \frac{1}{|I| \cdot |J|} \sum_{i \in I} \sum_{j \in J} (a_{ij} - a_{Ij} - a_{iJ} + a_{IJ})^2,    (11)

where |I| and |J| are the total number of rows and columns of the bicluster, respectively.

Another important aspect of a bicluster is its volume, generally denoted by |I| \times |J|. A more suitable name could be the "area" of the bicluster, but the usual notation in the literature will be adopted here. In order to be functional and to allow a deeper analysis of the data, it is usually required that a bicluster presents a large volume (large number of rows and columns).

Apart from individual characteristics of the biclusters, like residue and volume, which should be optimized during their construction, several other aspects should also be considered. Some of these aspects are the coverage of the dataset, which corresponds to the amount of elements of the data matrix that is actually covered by the generated biclusters; the overlap degree among the biclusters, which is the intersection of the sets of elements in each one of them; and, for data matrices of high sparseness, the degree of occupancy of the biclusters (amount of non-null values present).

Therefore, as can be seen, the biclustering problem is not restricted to the optimization of one or two individual aspects of each solution. Instead, the algorithms proposed to deal with this problem should provide mechanisms to simultaneously control several features of the set of biclusters according to the demands of each application.
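The mean squared residue of Eq. 11 can be computed directly from its definition. The sketch below uses plain Python and an illustrative additively coherent matrix in the spirit of Fig. 3 (the actual values of Fig. 3 are not reproduced here) to check that perfect coherence yields zero residue:

```python
def mean_squared_residue(B):
    """Mean squared residue R of a bicluster B (Eq. 11)."""
    n, m = len(B), len(B[0])
    row_means = [sum(row) / m for row in B]                             # a_iJ
    col_means = [sum(B[i][j] for i in range(n)) / n for j in range(m)]  # a_Ij
    total_mean = sum(row_means) / n                                     # a_IJ
    return sum(
        (B[i][j] - row_means[i] - col_means[j] + total_mean) ** 2
        for i in range(n) for j in range(m)
    ) / (n * m)

# An additively coherent bicluster: row 2 = row 1 + 10, row 3 = row 1 + 5,
# so every entry satisfies Eq. 10 and the residue is exactly zero.
coherent = [[1, 2, 3],
            [11, 12, 13],
            [6, 7, 8]]
print(mean_squared_residue(coherent))  # 0.0
```

Any deviation from the additive model increases R, which is why R can be read as a kind of variance of the bicluster around its coherent profile.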
In this work, the algorithm adopted is nondeterministic and produces several coherent biclusters at each run, with some degree of overlapping. Besides, in the text mining experiments performed in Sect. 5, the objective function (see Sect. 3) will be defined to consider only the volume and the occupancy of the bicluster, mainly due to the fact that text mining datasets are usually highly sparse and it is hard to find dense biclusters with a constrained residue value. The coherence metric described before will still be used here, but solely in the post-analysis of the generated biclusters.

2.3 Artificial immune systems

Artificial immune systems (AISs) can be considered a new computational paradigm inspired by the immunological system of vertebrates and designed to solve a wide range of problems, such as optimization, clustering, pattern classification and computational security (de Castro and Timmis 2002). Focusing on optimization, immune-inspired algorithms generally differ from other bio-inspired approaches in the following aspects: (i) the


population size is dynamically adjusted along the execution cycle of the algorithm, in response to particularities of the problem, i.e., the population size can grow or shrink at each generation; (ii) they are capable of inserting and maintaining diversity in the population; and (iii) their efficient mechanisms of exploration/exploitation of the search space promote the discovery and preservation of many local optima, if they exist.

The natural immune system of vertebrates, which is the inspiration for AISs, is one of the most important components of superior living organisms. Its main goal is to keep the organism healthy, and such a task is performed through a permanent cycle of recognition and combat against infectious foreign elements, known as pathogens. During this recognition cycle, the molecular patterns expressed in those invading pathogens (or simply antigens) are responsible for triggering the immune response when properly recognized by the immune cells.

Several immune-inspired algorithms have been proposed in the literature, and they differ mainly in the immunological metaphors adopted as inspiration. However, many of them, including the immune-inspired algorithm for biclustering that will be employed in this work, are mainly based on the reproduction of the Clonal Selection and Affinity Maturation principles (Ada and Nossal 1987), together with the Immune Network Theory (Jerne 1974). Roughly speaking, the clonal selection theory states that, when a pathogen invades the organism, some lymphocytes (particularly antibody-producing cells, like B-cells) that are capable of recognizing this antigen start proliferating through a cloning process. During this proliferation, the generated clones suffer mutation with rates inversely proportional to their affinity with the antigens: the higher the affinity, the smaller the mutation rate, and vice versa.
Although the clonal selection theory explains how the immune system responds to antigens, it does not explain well how the immune system reacts to itself. A well-known proposal to explain the autonomous dynamics of the immune system is the immune network theory, proposed by Jerne (1974). This theory considers that antibodies are not only capable of recognizing antigens, but are also capable of recognizing each other. Due to the absence of practical evidence to support this theory, it was discredited in Biology, though it remains computationally appealing. When an immune cell is recognized by another immune cell, the recognizing cell is stimulated and the recognized cell is suppressed. The meanings of stimulation and suppression in AIS depend on how the theory is interpreted in the computational domain.

When computationally implemented, the clonal selection principle and the immune network theory are fundamental to adjust the size of the population and to maintain diversity, thus allowing a better exploration of the search space and, consequently, the location of multiple local optima (de Castro and Von Zuben 2002).

Most of the standard immune-inspired algorithms based on clonal selection and immune network theory follow the steps given by Algorithm 1. Generally, a population of antibodies that represent candidate solutions to the problem is initially generated at random. While the stopping condition is not met, the antibodies are evaluated and each one of them gives origin to a certain number of clones that suffer a mutation process inversely proportional to their fitness (affinity with the antigen). Similar clones and clones with fitness worse than a predefined threshold are eliminated from the population (suppressed), and new randomly generated antibodies are inserted to increase diversity.
The general structure presented in Algorithm 1 corresponds to the one adopted in the BIC-aiNet algorithm, employed in this work to perform text mining and described in the following section.


Begin
  Initialize the population;
  While stopping condition is not met do
    Evaluate the population;
    Clone the antibodies;
    Mutate the clones;
    Suppress the clones with fitness lower than a threshold;
    Eliminate similar antibodies;
    Insert new antibodies randomly;
  End while
End

Algorithm 1 General pseudo-code of an AIS based on clonal selection and immune network theories
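A minimal executable rendering of Algorithm 1 might look as follows. Every numeric setting (population size, number of clones, suppression threshold, number of new random antibodies) and the toy fitness landscape are illustrative choices, not parameters reported in the paper:

```python
import random

def clonal_selection_search(fitness, random_solution, mutate,
                            pop_size=10, n_clones=3, fitness_floor=-100.0,
                            generations=50, seed=0):
    """Skeleton of Algorithm 1, maximizing `fitness`."""
    rng = random.Random(seed)
    pop = [random_solution(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Clone and mutate: each antibody spawns n_clones mutated copies
        clones = [mutate(ab, rng) for ab in pop for _ in range(n_clones)]
        pool = pop + clones
        # Suppress individuals whose fitness falls below a threshold ...
        pool = [ab for ab in pool if fitness(ab) >= fitness_floor]
        # ... and eliminate exact duplicates (a crude similarity criterion)
        pool = list(dict.fromkeys(pool))
        # Insert new random antibodies to maintain diversity
        pool += [random_solution(rng) for _ in range(2)]
        pop = sorted(pool, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

# Toy usage: locate the maximum of f(x) = -(x - 3)^2
best = clonal_selection_search(
    fitness=lambda x: -(x - 3) ** 2,
    random_solution=lambda rng: rng.uniform(-10, 10),
    mutate=lambda x, rng: x + rng.gauss(0, 0.5),
)
print(best)  # should land near 3.0
```

In BIC-aiNet the candidate solutions are biclusters rather than scalars, and suppression is based on bicluster overlap, as detailed in Sect. 3.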

3 The BIC-aiNet algorithm

As the biclustering task can be viewed as a multimodal combinatorial optimization problem, in this work we have considered an immune-inspired algorithm to accomplish this task, denoted BIC-aiNet (Artificial Immune Network for Biclustering), which is an extension of the aiNet algorithm proposed by de Castro and Von Zuben (2001). In the next subsections, the main aspects of BIC-aiNet will be described in detail.

3.1 Representation of solutions (coding)

Given a data matrix with n rows and m columns, the structure chosen to represent a bicluster (an individual in the population) for this data matrix consists of two ordered vectors: one representing the rows and the other the columns (with lengths N ≤ n and M ≤ m, respectively). Each element of these vectors is an integer that represents the index of a row or column that takes part in the bicluster. This representation tends to be more efficient than the binary representation (two vectors of lengths n and m, where 1 represents the presence of a given row/column and 0 its absence), as the biclusters are usually much smaller than the original data matrix, especially for large values of n and/or m. Each individual in the population represents a single bicluster and may have distinct values for the number N of rows and M of columns. Figure 4 shows an example of the coding process adopted in BIC-aiNet.
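The index-vector coding can be sketched as follows; the data matrix and the chosen indices below are illustrative, in the spirit of Fig. 4 (they do not reproduce the figure's actual values):

```python
def decode_bicluster(matrix, rows, cols):
    """Extract the submatrix selected by an individual's two ordered
    integer index vectors (the coding of Sect. 3.1)."""
    return [[matrix[i][j] for j in cols] for i in rows]

data = [[7, 1, 9, 4],
        [3, 1, 8, 4],
        [5, 2, 6, 0]]
# An individual with N = 2 rows and M = 3 columns
rows, cols = [0, 1], [1, 2, 3]
print(decode_bicluster(data, rows, cols))  # [[1, 9, 4], [1, 8, 4]]
```

Compared to the binary mask alternative, the two index vectors store only N + M integers instead of n + m bits, which pays off when biclusters are much smaller than the data matrix.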

Fig. 4 Example of coding: a the original data matrix; b an individual of the population, with N = 2 and M = 3; c the corresponding bicluster


3.2 Fitness function

The fitness function adopted in BIC-aiNet to evaluate a given bicluster tries to minimize its mean-squared residue R and to maximize the number of rows and columns in the bicluster (thus maximizing the volume) in a single objective, and is given as follows:

f(v_r, v_c) = \frac{R}{\lambda^2} + \frac{w_r}{N} + \frac{w_c}{M},    (12)

where v_r and v_c are the ordered vectors containing the indices of rows and columns, N and M are the number of rows and columns in the bicluster, respectively, R is the mean-squared residue of the bicluster (calculated as in Eq. 11), \lambda is a residue threshold (the maximum desired value for the residue), w_r is a weight indicating the relative importance of the number of rows, and w_c indicates the relative importance of the number of columns. Parameters w_r and w_c allow the control of the desired "shape" of the bicluster, i.e., whether biclusters with a larger number N of rows, a larger number M of columns, or similar values for N and M are preferred.

With this fitness function we have a compromise between two conflicting objectives: the minimization of the mean-squared residue (which can be interpreted as some kind of variance of the elements in the bicluster) and the maximization of the volume. As stated before, for a bicluster to be meaningful, it should contain a reasonable number of elements, so that some knowledge can be extracted (high volume). Also, it is important to maintain the maximum degree of coherence among its elements.

3.3 The algorithm

The BIC-aiNet algorithm begins with the generation of a random population of biclusters consisting of just one row and one column, that is, just one element of the data matrix. This single element can be considered a "seed" that will promote the growth of a local bicluster around it. After the initialization, the algorithm enters its main loop, which is repeated until a predefined number of iterations is completed. In this main loop, firstly the population is cloned and mutated (as will be described later) and then, if the best mutated clone obtained has a better fitness than its progenitor, the progenitor will be replaced by this clone in the next generation. Every sup_it iterations the algorithm performs a suppression of similar biclusters and then inserts new cells.
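As a concrete sketch of the fitness in Eq. 12, to be minimized: the residue value below is illustrative, while the weights w_r = 2 and w_c = 3 follow the Yeast/Human settings reported in Sect. 4:

```python
def fitness(rows, cols, residue, lam=100.0, wr=2.0, wc=3.0):
    """Single-objective fitness of Eq. 12 (smaller is better):
    R / lambda^2 + wr / N + wc / M."""
    N, M = len(rows), len(cols)
    return residue / lam ** 2 + wr / N + wc / M

# Two biclusters with the same residue: the higher-volume one
# receives the better (lower) fitness value.
small = fitness([0, 1], [0, 1, 2], residue=50.0)
large = fitness(list(range(10)), list(range(15)), residue=50.0)
print(small > large)  # True
```

The 1/N and 1/M terms shrink as the bicluster grows, so volume maximization and residue minimization are traded off inside a single scalar objective.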
This procedure causes a fluctuation in the population size and helps the algorithm to maintain diversity and to work with just the most meaningful biclusters. The main steps of BIC-aiNet are given by Algorithm 2.

3.4 Mutation

The mutation operation consists of simple random insertion/removal procedures applied to the bicluster. Given a bicluster, the algorithm first chooses between the operations of insertion and removal with equal probability. After that, another choice is made, also with equal probability, to define whether the operation will be performed on a row or on a column. For the insertion operation, one element that does not belong to the bicluster is chosen and inserted into the row/column list in ascending order. If the removal operation is chosen, the algorithm randomly picks the element that should be removed from the bicluster.
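The insertion/removal mutation just described can be sketched as follows; the guard that keeps at least one row and one column is an implementation assumption not spelled out in the paper:

```python
import random

def mutate(rows, cols, n_rows, n_cols, rng=random):
    """One mutation step (Sect. 3.4): pick insertion or removal with equal
    probability, then row or column, also with equal probability."""
    rows, cols = list(rows), list(cols)
    target, size = (rows, n_rows) if rng.random() < 0.5 else (cols, n_cols)
    if rng.random() < 0.5:
        # Insertion: pick an index not yet in the bicluster; keep the list sorted
        candidates = [i for i in range(size) if i not in target]
        if candidates:
            target.append(rng.choice(candidates))
            target.sort()
    elif len(target) > 1:
        # Removal: drop a random index (keeping at least one element
        # is an assumption of this sketch)
        target.remove(rng.choice(target))
    return rows, cols

rng = random.Random(42)
print(mutate([0, 2], [1, 3], n_rows=5, n_cols=4, rng=rng))
```

Each call changes exactly one of the two index vectors by one element, so a bicluster grows or shrinks gradually around its seed.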


Algorithm 2 Pseudo-code of the BIC-aiNet algorithm

The mutation operator is summarized in Algorithm 3.

3.5 Suppression

The suppression operation is a straightforward procedure, and it is given by Algorithm 4. For each pair of biclusters in the population, the intersection of the sets of rows and columns that constitute each one is generated and the number of common elements is counted. If this number is greater than a certain threshold ε, defined by the user, the bicluster with the worst fitness is removed.

Algorithm 3 Mutation operator of BIC-aiNet


Algorithm 4 Suppression mechanism of BIC-aiNet

After every suppression step, the BIC-aiNet algorithm inserts new randomly generated cells (individuals), trying to create biclusters with elements (rows or columns) that do not yet belong to any existing bicluster. To do so, these new individuals are built in the same way as the initial population of the algorithm, except that the single elements inserted into each cell are taken from those not present in any bicluster yet.
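The pairwise suppression of Algorithm 4 can be sketched as below; individuals are (rows, cols) pairs, the toy fitness is illustrative, and, consistently with Eq. 12, a smaller fitness value is better:

```python
def suppress(population, fitness, epsilon):
    """For each pair of biclusters, count the shared rows plus shared
    columns; if the overlap exceeds epsilon, drop the worse individual."""
    survivors = list(population)
    removed = set()
    for a in range(len(survivors)):
        for b in range(a + 1, len(survivors)):
            if a in removed or b in removed:
                continue
            ra, ca = survivors[a]
            rb, cb = survivors[b]
            overlap = len(set(ra) & set(rb)) + len(set(ca) & set(cb))
            if overlap > epsilon:
                # Remove the individual with the worse (larger) fitness
                worse = a if fitness(survivors[a]) > fitness(survivors[b]) else b
                removed.add(worse)
    return [ind for i, ind in enumerate(survivors) if i not in removed]

pop = [([0, 1], [0, 1, 2]),   # overlaps heavily with the next one
       ([0, 1], [0, 1, 3]),
       ([4, 5], [4, 5])]
# Toy fitness: prefer larger biclusters (smaller value is better)
f = lambda ind: -len(ind[0]) * len(ind[1])
print(suppress(pop, f, epsilon=3))
```

Here the first two biclusters share two rows and two columns (overlap 4 > ε = 3), so one of them is suppressed, while the disjoint third bicluster survives untouched.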

4 Evaluating the performance of BIC-aiNet in general biclustering problems

In order to illustrate the capability of the BIC-aiNet algorithm to generate high quality biclusters and to justify its application to text mining, in this section BIC-aiNet was applied to two well-known gene expression datasets and to one collaborative filtering experimental dataset, which presents a high degree of sparsity. Besides, BIC-aiNet was also compared to two well-known algorithms from the literature, namely CC (Cheng and Church 2000) and the multi-objective evolutionary approach of Mitra and Banka (2006).

The first gene expression dataset is the Saccharomyces cerevisiae cell cycle expression data from Cho et al. (1998), which contains 2,884 genes under 17 conditions, and was adapted in this work according to Cheng and Church (2000), with the null values being replaced by random numbers in the range [0,600]. This dataset is popularly known as Yeast. The second dataset is the Human B-Cell Lymphoma expression data, first employed in Alizadeh et al. (2000), which holds 4,026 genes under 96 conditions; the expression levels were treated in the same way as in the Yeast dataset, with missing values being replaced by random numbers in [-750,650]. Both the Yeast and Human datasets can be downloaded at http://arep.med.harvard.edu/biclustering.

The third dataset is called Movielens (available at http://www.grouplens.org/node/73), and it was built with ratings in the range [1,5] given by 943 users to subsets of 1,650 movies, resulting in a highly sparse dataset with 80,000 ratings (approximately 6.4% of occupancy). This dataset was originally developed to be used with collaborative filtering algorithms, where the objective is to extract information from the dataset in order to infer a list of movie recommendations for a given user.
This dataset is challenging to biclustering algorithms due to its lack of information (high sparsity), which requires that good quality biclusters not only present low mean-squared residue and high volume, but also a high rate of occupancy (non-null values).

The CC algorithm (Cheng and Church 2000) was the first biclustering algorithm applied to gene expression problems. It is a constructive heuristic that starts with a single bicluster, representing the whole dataset, and iteratively removes rows and columns of this bicluster until the residue is equal to or less than a predefined threshold δ. After that, a sequence of insertions of rows and columns that are not in the bicluster yet is performed,


until it is not possible to insert any other row or column without increasing the residue to a value above δ. After the first bicluster is built, the rows and columns already present in the bicluster are replaced by random values in the original dataset, and the whole process is restarted until a predefined number of biclusters is obtained.

The algorithm of Mitra and Banka (2006) is a multi-objective optimization algorithm for biclustering, and it is basically an adaptation of the NSGA-II algorithm (Deb et al. 2002) to generate a set of biclusters, with the addition of a local search procedure based on the node insertion and deletion mechanisms proposed by Cheng and Church in the CC technique (Cheng and Church 2000). One important aspect of the algorithm of Mitra and Banka is the way the mean-squared residue is optimized. In their approach, the mean-squared residue is increased as long as it is smaller than the upper bound δ, which stimulates the generation of biclusters with higher volume.

For each experiment presented in this section, the mean values taken over five independent runs of each algorithm were calculated for the average mean-squared residue, volume, coverage and, for Movielens, occupancy of the final set of biclusters. The parameters adopted for each dataset were chosen experimentally for each algorithm, in order to lead to the best possible set of biclusters. Most of the parameters adopted were the same for all three datasets, with only a few of them being problem-specific. BIC-aiNet was run for 2,000 iterations, while for Mitra and Banka's algorithm the number of iterations adopted was equal to the number of biclusters required. Also, for the latter, the offspring size was set to 50 and the probabilities of crossover and mutation were 75% and 3%, respectively. For CC, the only fixed parameter was α (rate of node removal), which held a value of 1.2.
For the Yeast dataset, each algorithm was set to generate 300 biclusters, with δ (the upper bound of the mean-squared residue) set to 300 for CC and Mitra and Banka's algorithm, and to 100 for BIC-aiNet. As mentioned in Sect. 3, the BIC-aiNet algorithm tries to control the maximum mean-squared residue through the fitness equation (see Eq. 12), so the meaning of parameter δ is different in this algorithm; however, δ was set to 100 precisely to lead BIC-aiNet to generate a set of biclusters that respects the same upper bound on the residue, defined as 300. Also, the parameter α of Mitra and Banka's algorithm (which is associated with the rate of node deletion in the local search) was set to 1.8. The row (w_r) and column (w_c) degrees of relevance of the BIC-aiNet algorithm were set to 2 and 3, respectively.

For the Human dataset, the algorithms were adjusted to generate 200 biclusters with a maximum mean-squared residue (δ) of 1,200 (in BIC-aiNet, δ was set to 200), and α was set to 1.2 in Mitra and Banka's technique. As for the Yeast dataset, the row (w_r) and column (w_c) degrees of relevance of BIC-aiNet were set to 2 and 3, respectively. Finally, for the Movielens dataset, all the algorithms were adjusted to generate 100 biclusters with a maximum residue δ of 2 (including BIC-aiNet), and α was set to 1.2 in Mitra and Banka's algorithm. For this problem, the row (w_r) and column (w_c) degrees of relevance of BIC-aiNet were set to 3 and 2, respectively.

The results obtained in these experiments are presented in Table 1. As can be seen, for the Yeast dataset BIC-aiNet respected the residue constraint (which should stay below δ = 300) and provided the second largest average volume. Considering the coverage and the overlap among biclusters, it is possible to see that BIC-aiNet provided the smallest coverage for this problem, and an overlap degree similar to the one obtained by Mitra and Banka's proposal.


Table 1 Results of CC, Mitra and Banka's algorithm and BIC-aiNet for the Yeast, Human and Movielens datasets

Dataset     Criteria        CC                 Mitra and Banka     BIC-aiNet
Yeast       Residue         144.96 ± 2.9       288.62 ± 3.35       232.04 ± 25.30
            Volume          126.10 ± 0.2       4800.40 ± 76.40     3659.00 ± 1214.80
            Coverage (%)    76.75 ± 0.12       83.26 ± 0.83        68.20 ± 11.28
            Overlap (%)     10.67 ± 0.01       20.76 ± 0.72        20.67 ± 7.58
Human       Residue         915.50 ± 6.27      1052.90 ± 2.33      1156.70 ± 2.62
            Volume          610.00 ± 3.61      7559.00 ± 83.41     11606.00 ± 38.39
            Coverage (%)    31.57 ± 0.19       17.08 ± 0.20        23.08 ± 0.11
            Overlap (%)     0.00 ± 0.00        34.57 ± 0.27        32.00 ± 0.21
Movielens   Residue         0.51 ± 0.37        1.18 ± 0.38         1.33 ± 0.03
            Volume          6.72 ± 8.29        73.42 ± 10.73       216.43 ± 2.05
            Coverage (%)    0.04 ± 0.06        0.28 ± 0.07         0.62 ± 0.01
            Overlap (%)     2.97 ± 1.03        4.63 ± 2.10         2.80 ± 0.14
            Occupancy (%)   100.00 ± 0.00      96.54 ± 5.69        92.31 ± 0.22

For the Human and Movielens problems, it is possible to see that BIC-aiNet provided the largest biclusters (highest volume), while still respecting the maximum residue constraint (δ = 1,200 and δ = 2 for Human and Movielens, respectively). Focusing on Movielens, we can note that BIC-aiNet also achieved the largest coverage in this sparse dataset, which is highly desirable in such problems, where the information available is limited. This is a clear indication that the algorithm is well suited to text mining problems such as those studied in this work, since these problems are mapped into highly sparse data matrices.
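The volume, occupancy and coverage criteria reported in Table 1 can be computed as in this illustrative sketch (a helper of our own, treating zeros as missing entries, as in the sparse Movielens matrix):

```python
import numpy as np

def bicluster_stats(A, biclusters):
    """Per-bicluster volume and occupancy, plus overall coverage, for a
    set of biclusters given as (rows, cols) index-list pairs over the
    data matrix A. Zeros in A are treated as missing entries."""
    covered = np.zeros(A.shape, dtype=bool)
    stats = []
    for rows, cols in biclusters:
        sub = A[np.ix_(rows, cols)]
        volume = sub.size                           # |rows| * |cols|
        occupancy = np.count_nonzero(sub) / volume  # non-null fraction
        stats.append({"volume": volume, "occupancy": occupancy})
        covered[np.ix_(rows, cols)] = True
    coverage = covered.sum() / covered.size         # fraction of A covered
    return stats, coverage
```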

5 Computational experiments

Having evaluated and compared BIC-aiNet with other biclustering algorithms, we move on to the text mining experiments proposed in this paper: bicluster-based interactive query expansion, which will be explained in Sect. 5.2. The qualitative performance of BIC-aiNet applied to text clustering was already evaluated in a previous work (Castro et al. 2007c), with promising results. Besides grouping similar texts in an effective way, the biclusters indicated the words responsible for grouping those texts together in the same bicluster. Based on these results, especially the obtained relationships among words, the application to query expansion follows naturally, since each bicluster can be interpreted as a list of related words to be suggested. For example, if a given user searches for documents that contain the word "immune", some obvious suggestions could be "cells", "lymphocytes", "antibody", "cloning", "antigen" and several other known related words, each one leading to a new subset of documents. Of course, the query expansion may go beyond that and also suggest less obvious words that can further improve the accuracy of the search. Notice in this example that, depending on which word the user chooses, the search may take a whole new direction (going, for instance, from biology to computer algorithms).


Therefore, the objective of this experiment is twofold: (i) to investigate the coherence of the suggestions given by BIC-aiNet, when compared to a basic query expansion algorithm called Term-Term Similarity (TTS) (Manning et al. 2008) and to the simple suggestion of the terms with the highest Cosine similarity; and (ii) to analyze the relevance of the set of terms suggested by each technique, and to verify the intrinsic characteristics of each set of recommendations.

In order to provide a better understanding of the experiments, the remainder of this section is divided into four subsections. First, a new fitness function for BIC-aiNet is introduced, which allows the discovery of biclusters more useful to this particular problem. Next, we describe how the biclusters are used to perform query expansion, how the TTS algorithm works and how the cosine-based suggestion of terms is made. In the sequence, the datasets used in the experiments are described, together with the experimental framework adopted, and, finally, the obtained results are presented and analyzed.

5.1 A more elaborate fitness for BIC-aiNet

One advantage of the BIC-aiNet algorithm over other biclustering approaches is that it can easily be adapted to different situations, usually by simply changing the fitness function to better express the desired properties of the biclusters. In text mining and, specifically, in the query expansion problem, we have to deal with highly sparse datasets, so the production of dense sub-matrices (non-sparse biclusters) is highly desirable in order to extract meaningful relations among words. In this situation, the residue of the bicluster is not as important as usual, since the simple identification of a common subset of words belonging to a subset of documents already implies that those words are related to some extent, even if they appear with different frequencies.
But, of course, the residue may still provide some hints, at least to estimate the degree of relationship among the terms in the same bicluster. Therefore, the first step is to replace the residue term of the BIC-aiNet fitness by a term proportional to the degree of occupancy of the bicluster, which evaluates the proportion of elements presenting non-null values inside the bicluster. Obviously, in sparse data matrices, the higher the occupancy the harder it is to obtain high-volume biclusters, so the fitness function must still include two other terms that stimulate the increase of the number of terms and documents (rows and columns) of each bicluster.

Regarding the occupancy term, we want the fitness to vary exponentially from 1 to close to 0 as the occupancy goes from 0 to 1 (0–100%), with the steepest decay between 0 and 0.5. The function adopted to do so is given by Eq. 13:

f_occ(occ) = e^(-occ/0.3),    (13)

where occ is the occupancy rate of the bicluster. Regarding the other two terms, we want the fitness to vary exponentially from 1 to close to 0 as the number of terms and documents (rows and columns) of the bicluster increases. Empirically, in this situation, a good balance for the fitness function was reached by letting the steepest decay occur between values 0 and 6, thus leading to biclusters of volume around 36 (6 × 6). The functions adopted to do so are given by Eqs. 14 and 15:

f_terms(#terms) = e^(-#terms),    (14)

f_documents(#documents) = e^(-#documents),    (15)
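Under the reading above (minus signs in the exponents, fitness to be minimized), Eqs. 13–15 and their combination in Eq. 16 can be sketched as follows; this is an illustrative implementation, not the authors' code:

```python
import math

def f_occ(occ):
    # Eq. 13: close to 1 for an empty bicluster, decaying toward 0
    # as the occupancy rate approaches 1 (100%).
    return math.exp(-occ / 0.3)

def f_terms(n_terms):
    # Eq. 14: decays as the number of terms (rows) grows.
    return math.exp(-n_terms)

def f_documents(n_documents):
    # Eq. 15: decays as the number of documents (columns) grows.
    return math.exp(-n_documents)

def fitness(occ, n_terms, n_documents):
    # Eq. 16: the sum of the three terms; minimizing it favors dense
    # (high-occupancy) biclusters with more terms and documents.
    return f_occ(occ) + f_terms(n_terms) + f_documents(n_documents)
```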

Finally, the fitness function adopted for the BIC-aiNet algorithm in the text mining experiments is given by Eq. 16:

f(occ, #terms, #documents) = f_occ(occ) + f_terms(#terms) + f_documents(#documents)    (16)

5.2 Expanding the queries

After the biclusters are generated, they must be explored to perform the interactive query expansion (IQE). The proposed process is very simple: first, the user enters a search term; second, the algorithm searches for this term inside the generated biclusters; and, finally, for each bicluster that contains the term, all the other terms inside it are returned as suggestions. After all suggestions are presented, the user chooses one of them, the query is expanded with the additional term, and the process is repeated: a new list of suggested terms is generated based on the new query (each term individually searched inside the biclusters). The process continues until the user is satisfied with the query results.

In order to assess the effectiveness of BIC-aiNet in query expansion problems, two other approaches were also implemented: (i) a simple query expansion tool based on Term-Term Similarities (TTS) (Manning et al. 2008) that, given a matrix A relating the normalized weighting counts of terms in each document, computes the matrix C = AA^T, where C_{u,v} is the similarity between terms u and v (the higher, the better); given the initial query, the term is then expanded by suggesting the ten most similar terms of the dictionary according to matrix C; and (ii) the use of the Cosine similarity to find the n most similar terms, given their occurrence in the documents in the form of a term-frequency matrix. The latter approach was applied only to the CISI dataset, due to memory constraints.

5.3 Description of the datasets

To create a test bed for the search engine, two datasets were used: the 20-newsgroups dataset (Lang 1995) and the CISI dataset.
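The two baseline expanders described in Sect. 5.2 can be sketched over a small terms-by-documents matrix as follows (an illustrative NumPy implementation; the function names are ours, not the paper's):

```python
import numpy as np

def tts_suggestions(A, terms, query, n=10):
    """Term-Term Similarity baseline: C = A A^T over a terms x documents
    weight matrix A; suggest the n terms most similar to `query`."""
    C = A @ A.T
    q = terms.index(query)
    order = np.argsort(-C[q])            # highest similarity first
    return [terms[i] for i in order if i != q][:n]

def cosine_suggestions(A, terms, query, n=10):
    """Cosine-similarity baseline over the same term-document matrix."""
    norms = np.linalg.norm(A, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero
    unit = A / norms[:, None]
    q = terms.index(query)
    sims = unit @ unit[q]
    order = np.argsort(-sims)
    return [terms[i] for i in order if i != q][:n]
```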
The 20-newsgroups dataset consists of an average of 560 documents for each of 20 subjects, ranging from science and technology to politics, religion and sports, extracted from different newsgroups and totaling 11,314 documents with 115,182 distinct words. The newsgroups, their subjects and the respective numbers of documents are presented in Table 2.

Table 2 Characteristics of the 20-newsgroups dataset

Newsgroup                   Subject         Number of documents
alt.atheism                 Religion         480
comp.graphics               Informatics      584
comp.os.ms-windows.misc     Informatics      591
comp.sys.ibm.pc.hardware    Informatics      590
comp.sys.mac.hardware       Informatics      578
comp.windows.x              Informatics      593
misc.forsale                Miscellaneous    585
rec.autos                   Recreation       594
rec.motorcycles             Recreation       598
rec.sport.baseball          Recreation       597
rec.sport.hockey            Recreation       600
sci.crypt                   Science          595
sci.electronics             Science          591
sci.med                     Science          594
sci.space                   Science          593
soc.religion.christian      Religion         599
talk.politics.guns          Politics         546
talk.politics.mideast       Politics         564
talk.politics.misc          Politics         465
talk.religion.misc          Religion         377
Total                                      11,314

The preprocessing of the text database was performed with the Matlab toolbox Text to Term-Document Matrix Generator (TMG) (Zeimpekis and Gallopoulos 2005), which can be found at http://scgroup.hpclab.ceid.upatras.gr/scgroup/Projects/TMG/. With this toolbox it is possible to automatically perform the removal of stopwords and headers, stemming and normalization. The parameters used for this preprocessing are depicted in Table 3. The parameters "Min length" and "Max length" correspond to the minimum and maximum length of a term to be considered (outside this range, the term is removed from the dataset). There are also frequency restrictions, defined as local frequencies (the frequency of a given word inside a single document) and global frequencies (the frequency of a given word over all documents), and, finally, the local and global term weighting methods, which evaluate the statistical importance of a given word to a document and to the collection of documents as a whole (Jones 1972). After the preprocessing step, the resulting dataset presented 2,323 documents and 84,703 different terms.

Table 3 Parameters used in the TMG toolbox to preprocess the 20-newsgroups dataset

Parameter               Value
Min length              3
Max length              30
Min local frequency     1
Max local frequency     40
Min global frequency    100
Max global frequency    11,500
Local term weighting    Term frequency
Global term weighting   Inverse document frequency

The CISI dataset is composed of 1,460 documents with 5,544 words, extracted using the same toolbox described above. This dataset comprises abstracts of papers on library science and related areas. Since this dataset was downloaded already in TMG toolbox format, the parameters used to generate it are unknown. The dataset is available at http://scgroup6.ceid.upatras.gr:8000/wiki/index.php/Main_Page#Sample_Output.


5.4 Analysis of the query expansion results

In order to find a good and meaningful set of biclusters, the BIC-aiNet algorithm was run for 100,000 generations with a maximum population of 5,000 biclusters. With the new fitness function adopted here (defined in Sect. 5.1), no additional parameters were required. The final set of biclusters generated by BIC-aiNet provides several subsets of words and documents. Given a search word, it is then possible to find the biclusters that contain this word and present to the user the different contexts in which it appears, i.e., the different groups of texts that contain such a word. Table 4 presents some statistics of the generated biclusters, in which we can see the average number of words and documents in each bicluster for each dataset. As these are very sparse datasets, in order to maintain a certain degree of occupancy, the BIC-aiNet algorithm was only capable of finding low-volume biclusters.

5.4.1 Application of the obtained biclusters

Once the biclusters are generated, they can be used in our search engine. The procedure is as follows: given a certain word (query), the algorithm searches for this term in each bicluster with residue below a given threshold (φ) and, for each bicluster found, displays as suggestions the terms of the bicluster whose frequencies of occurrence fall inside a given threshold range (ξ-, ξ+). The residue threshold (φ) adopted for each word searched is presented in the following subsections, while the threshold range (ξ-, ξ+) for the frequencies of the words in the biclusters was set to [0, 600].

In order to illustrate the obtained results, some terms were chosen to perform query expansion using the three techniques described above (two in the case of the 20-newsgroups dataset); the words suggested in common by the algorithms, as well as the exclusive ones recommended by each of them, are then shown.
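This lookup can be sketched as follows, assuming (for illustration only) that each bicluster is stored as a dict holding its residue and a term-to-frequency map:

```python
def suggest_terms(query, biclusters, phi, freq_range=(0, 600)):
    """Bicluster-based query expansion: for every bicluster whose
    residue is below phi and that contains the query term, suggest its
    other terms whose frequency lies inside freq_range."""
    lo, hi = freq_range
    suggestions = set()
    for bic in biclusters:
        if bic["residue"] >= phi or query not in bic["terms"]:
            continue
        for term, freq in bic["terms"].items():
            if term != query and lo <= freq <= hi:
                suggestions.add(term)
    return sorted(suggestions)
```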
Table 4 Statistics of the generated biclusters for (a) the 20-newsgroups and (b) the CISI dataset

(a) 20-newsgroups dataset
Avg. occupancy              87.77%
Avg. volume                 64.12
Avg. number of terms        5.72
Avg. number of documents    12.98
Avg. residue                1.75

(b) CISI dataset
Avg. occupancy              71.11%
Avg. volume                 46.53
Avg. number of terms        4.41
Avg. number of documents    10.67
Avg. residue                5.32

5.4.2 Results for the CISI dataset

For the CISI dataset, the residue threshold φ was set to 1 in every experiment. We first started with more general terms, to see whether the algorithms were able to produce suggestions for a broad variety of subjects. The word chosen to start the experiment was "retrieving", since it is a common word in the library science domain. In Table 5 we can see the exclusive words found by each of the three algorithms, as well as the common words.


Table 5 Results for the word "retrieving" on the CISI dataset

Words exclusive from BIC-aiNet: abridged, algorithm, basic, benefits, biomedical, book, called, characteristics, copying, expounds, forms, formulates, future, hill, light, line, list, literature, main, mechanize, medicus, method, mikhailov, number, organization, performance, principles, processing, recommendations, reproduction, resources, science, search, selection, sheds, sources, specification, subject, synthetic, translating, ways

Words exclusive from Cosine similarity: alerting, amounts, analyzing, aware, bottle, completely, considerably, contained, content, deemed, discussed, engineering, essential, estimating, heavily, helpful, informative, keywords, limitations, listing, lost, mentioned, minimum, miss, nearly, nuclear, patents, possibility, proportion, rely, retrievable, robert, scanning, supplementary, synonyms, syntactical, tags, titles, transfer, transferred, underlying, uninformative, variants

Words exclusive from TTS algorithm: alerting, amount, amounts, analysis, analyzed, analyzing, article, articles, aware, based, bottle, broad, chemical, class, complete, completely, considerably, contained, content, dealing, deemed, discussed, document, documents, engineering, essential, estimating, fields, half, heavily, informative, journal, keywords, methods, proportion, retrievable, services, synonyms, terms, title, transfer

Common words: chemistry, information

First, regarding the common words in this table, "information" quite obviously pairs with the word "retrieving", and "chemistry" appeared due to the existence of several documents related to this subject. One possible explanation for the fact that only two common words were found for this query is that "retrieving" is a very generic term in this database, which allows each technique to explore different subsets of documents in the dataset and obtain subsets of suggestions with almost no intersection. Concerning the exclusive words of each algorithm, we can see that each one returned a diverse set of words mostly related to this subject. The words we should point out for BIC-aiNet are "algorithm", "book", "literature", "method", "processing", "recommendations", "Mikhailov" (an important information science theorist), "synthetic" and so on. For the Cosine measure we have "informative", "keywords", "scanning", "synonyms" and so forth. The TTS algorithm produced "article", "content", "document", "engineering" and "methods".

Next we searched for a slightly more specific term, the word "mathematics", and the results are presented in Table 6. The set of common words brought us, as expected, terms that are closely related to math, like "distribution", "problem" and "terms". Concerning the exclusive words, for BIC-aiNet we must point out "calculate", "complex", "factors", "group", "optimum" and "relationship". For the Cosine measure we have "curve", "deviations", "equations", "exponential" and "theory". For TTS we have "equations", "methods", "theory", "probability" and so on.

Next we tried the word "compounds", with the results described in Table 7. In the set of common words we have "indexing", "information", "problem" and "structures". For BIC-aiNet, the words that stand out are "bonds", "mathematical" and "substructure". The Cosine similarity brings us the words "bond", "fracture" and "thesaural", and the TTS results include "chemical", "relationship" and "structure".

Finally, the results for the word "morphology" are described in Table 8. Here we have some general terms as the common words: "basic", "developed", "general", "means", "methods", "natural", "problems" and "terms". For BIC-aiNet, the most notable words are


Table 6 Results for the word "mathematics" on the CISI dataset

Words exclusive from BIC-aiNet: age, analyses, analyzed, attained, calculate, careful, close, complex, consulted, contents, difference, factors, find, forecasting, forty, frequencies, group, guidance, hold, implies, intended, item, items, journal, kendall, lent, limited, lists, major, necessarily, need, obtain, optimum, parameters, point, position, read, reader, reason, reasonably, reasons, references, regular, relation, relationship, relative, review, satisfactory, similar, sir, twelve, type, unpublished, validity, wider

Words exclusive from Cosine similarity: admittedly, advanced, agreement, ambiguity, ancient, bandwidth, cell, cells, confirmed, curve, curves, deliberately, depends, deviations, discrepancy, discussions, distortion, doctrine, doubling, doubt, equations, exists, exponential, fixed, formulations, goffman, jardine, lines, mast, mathematical, mathematics, meanings, measures, model, models, morse, negligible, parallel, pointed, presumably, probability, profitably, quantifiable, rate, recovery, secure, shannon, shifting, simplification, spread, subjective, substitutes, taube, taxonomy, term, theory, whilst, wishes

Words exclusive from TTS algorithm: abstract, advanced, agreement, analysis, application, automatic, based, basic, basis, book, called, characteristics, classification, collection, communication, cost, curve, data, development, disciplines, document, equations, general, growth, information, libraries, library, literature, mathematics, measures, method, methods, model, models, number, probability, problems, process, properties, provide, rate, recall, relevance, research, results, retrieval, science, scientific, study, subjective, system, systems, taxonomy, term, theory, titles, user

Common words: distribution, problem, terms

Table 7 Results for the word "compounds" on the CISI dataset

Words exclusive from BIC-aiNet: abstracts, applications, approach, austin, based, bonds, british, ciphers, coded, conversion, developed, field, generally, improvement, input, known, machine, mathematical, paper, quantitatively, reports, results, service, substructure, techniques, united, wiswesser

Words exclusive from Cosine similarity: abbreviated, bearing, bond, combined, compound, coordinate, coordinated, correlative, difficulty, employed, enabled, especially, fracture, influenced, influences, invariant, kevin, post, pre, presence, prior, producers, proof, registration, syntactic, thesaural, unacceptable, unique, wipke, wrong

Words exclusive from TTS algorithm: algorithm, cas, chemical, compounds, computer, coordinate, decision, employed, especially, indexes, making, names, post, pre, programs, relationship, retrieval, ring, rules, set, structure, studies, subject, syntactic, system, systems, unique, words

Common words: indexing, information, problem, structures

‘‘chemistry’’, ‘‘Chomsky’’ (a linguistic professor from MIT), ‘‘foundations’’, and ‘‘organic’’. For the Cosine similarity we have ‘‘adjective’’, ‘‘dictionaries’’, ‘‘language’’ and ‘‘synonymous’’. And, finally, the TTS algorithm brings us the words ‘‘adjective’’, ‘‘clusters’’, ‘‘document’’ and ‘‘thesaurus’’. 5.4.3 Results for 20-newsgroups dataset For the 20-newsgroups dataset we started by looking for two common and general words from two different subjects. The first word chosen was ‘‘religion’’ which comprises at least three different newsgroups from the dataset. The results are shown on Table 9 where, now,


Table 8 Results for the word "morphology" on the CISI dataset

Words exclusive from BIC-aiNet: automatically, bound, broad, chemistry, chomsky, concern, conclusion, construct, correct, correspondence, data, deals, deeper, designed, desire, discovery, drawn, established, exact, explore, expose, fail, forms, foundations, gain, inadequacy, intuition, lead, linguistics, logical, mere, method, models, morphology, motivation, narrow, negative, notions, order, organic, part, play, precisely, process, recommendations, respects, rigorous, role, solutions, source, titles, translating, unacceptable

Words exclusive from Cosine similarity: accounted, adjective, analysis, answering, approaches, aspects, attest, beginnings, clustering, clusters, conception, connected, consist, correspondingly, description, desirability, dictionaries, differential, dimensional, dimensions, english, enormity, entail, exhibit, explicit, factor, generality, generalizing, grouped, grouping, ideally, interpret, jeffrey, katzer, kumar, language, maximal, meaning, minimally, priori, questionable, reliably, respondent, scales, semantic, semantics, seventy, shreider, simplifies, simplistic, subgraphs, subsets, subtle, synonymous, syntactic, text, thousands, translation, underlies, unstructured

Words exclusive from TTS algorithm: abstracting, adjective, analysis, answering, aspects, automatic, based, clusters, conception, concepts, description, descriptor, development, document, english, experimental, experiments, fact, factor, field, formal, index, indexing, information, language, large, meaning, measure, model, need, point, present, procedure, processing, question, research, results, retrieval, scales, search, single, structure, study, system, systems, text, theory, thesaurus, translation, user, vocabulary, word, words

Common words: basic, developed, general, means, methods, natural, problems, terms

Table 9 Results for the word "religion" on the 20-newsgroups dataset with φ = 0.06

Words exclusive from BIC-aiNet: accord, assault, basebal, capabl, care, christ, church, cleveland, close, concern, consid, cours, creat, desir, develop, doctrin, establish, ethnic, european, exist, form, foundat, freenet, ground, hard, help, includ, john, later, legal, level, longer, matter, meant, mind, minor, nation, need, note, order, origin, paul, practic, purpos, put, respons, right, saw, school, tradit, univers

Words exclusive from TTS algorithm: act, argument, articl, atheism, atheist, believ, bibl, certainli, christian, claim, compar, discuss, don, drug, especi, fact, faith, feel, find, follow, freedom, gener, god, jewish, misc, moral, muslim, part, present, question, religi, respect, sai, scienc, sex, show, strong, studi, talk, teach, thing, think, thought, todai, true, truth, try, understand, want, war, worship

Common words: accept, american, answer, belief, book, choos, countri, govern, group, histori, human, idea, islam, jesu, jew, kill, law, live, major, mean, natur, non, person, point, post, read, reason, statement

we can find a larger set of common words, all of them very closely related to this subject. Among the exclusive words, we must point out some important ones found by BIC-aiNet, like "christ", "church", "doctrin", "ethnic", "paul", "nation" and "univers". Among the TTS exclusive words we have "atheism", "believ", "bibl", "christian", "god" and "worship".

After that, we searched for the term "hardware", which can be found in at least seven subjects of the dataset; the results are shown in Table 10. The set of words common to both algorithms contains many words related to this subject, but also some not so useful ones. Among the exclusive words, for BIC-aiNet we have the terms "bug", "complex", "featur", "interfac", "input" and "toolkit", while for TTS we have "bit", "chip", "compat", "monitor", "network" and "workstat".


Table 10 Results for the word "hardware" on the 20-newsgroups dataset with φ = 0.003

Words exclusive from BIC-aiNet: adapt, addit, appear, archiv, base, book, bug, built, close, common, complex, confer, copi, creat, current, detail, differ, digit, due, earli, end, environ, exist, fail, featur, find, input, interest, interfac, item, languag, later, learn, librari, like, link, main, manag, meet, non, note, older, open, order, point, present, print, put, question, reason, releas, remov, sai, show, singl, small, smaller, specif, speed, strang, tend, thing, tool, toolkit, try, updat, view, work

Words exclusive from TTS algorithm: appl, bit, board, build, chip, chri, color, comp, compat, compress, comput, control, data, de, devic, disk, displai, do, don, drive, driver, engin, export, gener, govern, ibm, imag, includ, inform, instal, kei, look, lot, mac, machin, macintosh, make, manufactur, modem, monitor, need, net, network, number, platform, port, price, problem, program, read, requir, sampl, sell, server, softwar, sound, standard, sun, sy, system, think, uni, univers, video, window, work, workstat, write

Common words: articl, card, check, com, design, develop, doesn, go, graphic, group, implement, provid, run, support, user, want

Table 11 Results for the word "slot" on the 20-newsgroups dataset with φ = 1

Words exclusive from BIC-aiNet: articl, do, hard, keyboard, mous, pass, port, power, serial, suppli, window, work

Words exclusive from TTS algorithm: adapt, board, card, expans, heard, isa, mac, motherboard, plug, simm, socket, vesa

Common words: bit, bu, built, cach, chip, control, drive, ram, recent, support, video

Finally, the word "slot", which has more meaning in computer topics, was searched, and the results are presented in Table 11. The set of common words brings us the most obvious choices, like "bit", "chip" and "control". The exclusive set of words of each algorithm brings a diverse pool of choices that can lead the user to different sets of documents: for BIC-aiNet we have "keyboard", "mous", "port" and "window", while for TTS we should point out the words "card", "expans", "isa" and "motherboard".

The results show that it is difficult to establish the best algorithm. Although the algorithms present alternative sets of terms, all of them are correlated with the initial term. The difference between the suggested terms was expected, since each algorithm analyzes the words with different distance measures and distinct procedures. Cosine similarity can be seen as a k-NN approach, whereas TTS performs a statistical analysis followed by a clustering of the terms. The BIC-aiNet algorithm first clusters the words according to subsets of terms and then performs a statistical analysis with a different similarity measure. The terms generated by the different techniques seem to be complementary, so a method to combine these different approaches could lead to significant improvements in the query expansion problem.

Regarding the occurrence of some useless words in the sets of suggested terms, this happens because the algorithms are based on statistical properties of each term inside the collection of documents. One way to avoid or alleviate this drawback is to perform a post-analysis process that removes every suggested term that was never chosen by the users, given a time period or a certain number of times it was presented.
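As one hypothetical way to exploit this complementarity (not a method evaluated in the paper), the suggestion sets produced by the different techniques could be merged with a simple voting scheme:

```python
from collections import Counter

def combine_suggestions(suggestion_lists, min_votes=2):
    """Naive combination of suggestion sets from different expanders:
    keep only the terms proposed by at least `min_votes` techniques,
    ranked by how many techniques agree on them."""
    votes = Counter()
    for terms in suggestion_lists:
        votes.update(set(terms))         # one vote per technique
    kept = [t for t, v in votes.items() if v >= min_votes]
    return sorted(kept, key=lambda t: (-votes[t], t))
```

Such a filter would tend to drop the idiosyncratic (and often irrelevant) exclusive words while keeping the terms that several similarity measures agree on.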


6 Concluding remarks In a previous work (Castro et al. 2007c), the authors have proposed an immune inspired algorithm for biclustering, called BIC-aiNet. The biclustering technique is not directly related to the conventional clustering, characterized by the use of the whole dataset, mandatory correlation of all attributes, and the assumption that one item must belong to just one cluster. Instead, it is possible to use just some attributes per cluster, making the whole process not only more efficient but also capable of generating information considering several alternative perspectives, which is possibly useful for classification or decision making in further analysis. The BIC-aiNet was applied to text mining and promising results were obtained. The success of BIC-aiNet is due to the fact that most texts refer to more than one subject, and ‘‘off-topics’’ happen with some frequency in texts generated by forums or newsgroups. Also, some additional knowledge can be extracted from the texts, such as: (i) words that commonly refer to a given subject, (ii) words that may refer to more than one subject, (iii) how to differentiate ambiguous words, (iv) how to classify texts in sub-topics, (v) how to guide a given search by using the biclusters to narrow the choices, and so on. Motivated by these remarkable characteristics, in this work we have carried out a deeper investigation about the usefulness of the application of the biclusters generated by BICaiNet to query expansion problems. Two datasets were adopted in the experiments performed (namely CISI and 20-newsgroups) to which several queries were made. 
The suggestions of related terms obtained by BIC-aiNet were compared to those produced by two proposals from the literature: the query expansion approach based on TTS and the one based on cosine similarity. As the results have shown, all the algorithms were capable of returning unique and relevant sets of suggestions for a given query, as well as some irrelevant or generic words, and it was not possible to state that any of them is superior to the others. However, it is clear from the results that the suggestions generated by the different techniques are complementary, so a method to combine these approaches could lead to significant improvements in query expansion, increasing recall possibly without a noticeable reduction in precision. As a further step in this research, we plan to develop a mechanism to integrate the results of such different query expansion approaches, in order to reduce the number of irrelevant words generated and enlarge the overall set of relevant suggestions for a given query, and also to test the capability of BIC-aiNet to deal with data streams.

Acknowledgements This research was sponsored by UOL (www.uol.com.br), through its UOL Bolsa Pesquisa program. The authors would also like to thank CNPq and CAPES for their additional financial support.

References

Ada GL, Nossal GJV (1987) The clonal selection theory. Sci Am 257:50–57
Agrawal R, Gehrke J, Gunopulus D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM/SIGMOD international conference on management of data, pp 94–105
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503–510. doi:10.1038/35000501
Castro PAD, de França FO, Ferreira HM, Von Zuben FJ (2007a) Applying biclustering to perform collaborative filtering. In: Proceedings of the 7th international conference on intelligent systems design and applications (ISDA), Brazil, pp 421–426
Castro PAD, de França FO, Ferreira HM, Von Zuben FJ (2007b) Evaluating the performance of a biclustering algorithm applied to collaborative filtering—a comparative analysis. In: Proceedings of the 7th international conference on hybrid intelligent systems (HIS), Germany, pp 65–70
Castro PAD, de França FO, Ferreira HM, Von Zuben FJ (2007c) Applying biclustering to text mining: an immune-inspired approach. In: Proceedings of the 6th international conference on artificial immune systems (ICARIS), Brazil, pp 83–94
Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the 8th international conference on intelligent systems for molecular biology, pp 93–103
Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, Lockhart D, Davis R (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2:65–73. doi:10.1016/S1097-2765(00)80114-8
Coelho GP, de França FO, Von Zuben FJ (2008) A multi-objective multipopulation approach for biclustering. In: Proceedings of the 7th international conference on artificial immune systems (ICARIS), vol 5132, pp 71–82
Croft WB, Cook R, Wilder D (1995) Providing government information on the Internet: experiences with THOMAS. In: Proceedings of the 2nd international conference on the theory and practice of digital libraries, pp 19–24
de Castro LN, Timmis J (2002) Artificial immune systems: a new computational intelligence approach. Springer-Verlag, London
de Castro LN, Von Zuben FJ (2001) aiNet: an artificial immune network for data analysis. In: Data mining: a heuristic approach, pp 231–259
de Castro LN, Von Zuben FJ (2002) Learning and optimization using the clonal selection principle. IEEE Trans Evol Comput 6(3):239–251. doi:10.1109/TEVC.2002.1011539
Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197. doi:10.1109/4235.996017
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th international conference on knowledge discovery and data mining, pp 269–274
Divina F, Aguilar-Ruiz JS (2007) A multi-objective approach to discover biclusters in microarray data. In: Proceedings of the genetic and evolutionary computation conference (GECCO'07), London, UK, pp 385–392
Feldman R, Sanger J (2006) The text mining handbook. Cambridge University Press, Cambridge
Giráldez R, Divina F, Pontes B, Aguilar-Ruiz JS (2007) Evolutionary search of biclusters by minimal intrafluctuation. In: Proceedings of the IEEE international fuzzy systems conference (FUZZ-IEEE 2007), London, UK, pp 1–6
Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67(337):123–129. doi:10.2307/2284710
Jerne NK (1974) Towards a network theory of the immune system. Ann Immunol (Inst Pasteur) 125C:373–389
Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21. doi:10.1108/eb026526
Lang K (1995) Newsweeder: learning to filter netnews. In: Proceedings of the twelfth international conference on machine learning, pp 331–339
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1:24–45. doi:10.1109/TCBB.2004.2
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Maulik U, Mukhopadhyay A, Bandyopadhyay S, Zhang MQ, Zhang X (2008) Multiobjective fuzzy biclustering in microarray data: method and a new performance measure. In: Proceedings of the 2008 IEEE congress on evolutionary computation (CEC 2008), Hong Kong, China, pp 1536–1543
Mitra S, Banka H (2006) Multi-objective evolutionary biclustering of gene expression data. Pattern Recognit 39:2464–2477. doi:10.1016/j.patcog.2006.03.003
Mitra S, Banka H, Pal SK (2006) A MOE framework for biclustering of microarray data. In: Proceedings of the 18th international conference on pattern recognition (ICPR'06), Hong Kong, China, pp 1154–1157
Sheng Q, Moreau Y, De Moor B (2003) Biclustering microarray data by Gibbs sampling. Bioinformatics 19(2):196–205. doi:10.1093/bioinformatics/btg1078
Symeonidis P, Nanopoulos A, Papadopoulos A, Manolopoulos Y (2007) Nearest-biclusters collaborative filtering with constant values. In: Advances in web mining and web usage analysis. Lecture notes in computer science, vol 4811. Springer-Verlag, pp 36–55
Tanay A, Sharan R, Shamir R (2005) Biclustering algorithms: a survey. In: Aluru S (ed) Handbook of computational molecular biology. Chapman & Hall/CRC Computer and Information Science Series, Boca Raton, FL
Tang C, Zhang L, Zhang I, Ramanathan M (2001) Interrelated two-way clustering: an unsupervised approach for gene expression data analysis. In: Proceedings of the 2nd IEEE international symposium on bioinformatics and bioengineering, pp 41–48
Zeimpekis D, Gallopoulos E (2005) TMG: a MATLAB toolbox for generating term-document matrices from text collections. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping multidimensional data: recent advances in clustering. Springer, Berlin, pp 187–210
