CHAOS2009: 2nd Chaotic Modeling and Simulation ...

Using a Nonnegative Matrix Factorization (NMF) for Clustering Data
Hussam Dahwa Abdulla, Martin Polovincak, Vaclav Snasel
Department of Computer Science, VSB – Technical University of Ostrava
[email protected], [email protected], [email protected]

Abstract There are many search engines on the web, and when queried they return a long list of search results ranked by relevance to the given query. Web users have to go through the list and examine the titles and (short) snippets sequentially to identify the results they need. In this paper we present how Nonnegative Matrix Factorization (NMF) can be a good solution for search results clustering.

1. Introduction In the last few years the amount of information has been growing exponentially. The ease of producing information and the ease of accessing it have turned into a big problem for information retrieval. Query results contain a lot of data, and it can be hard to choose or find the relevant information among them. The huge amount of data and the inability to recognize its type make it difficult to search for the right information. For users with no prior experience, manually searching for a topic on the web can be difficult and time consuming. The major difficulties are the complexity of the content, the classification of the huge amount of information on the web, and identifying and naming topics and the relationships between them. In this situation, clustering the data gives us a good basis for data analysis. Search result clustering can be used in a wide range of different fields. In this paper we present one of the methods for clustering data to be used in search result clustering. We use Nonnegative Matrix Factorization as a mathematical method to reduce a high number of objects by combining the attributes of these objects [1].

2. Search results clustering In recent years, search result clustering has attracted a substantial amount of research from many fields (e.g., information retrieval, machine learning, human-computer interaction, computational linguistics, data mining, formal concept analysis, graph drawing, etc.). Search result clustering groups search results by topic and thus provides a complementary view to the information returned by document ranking systems. This approach is especially useful when document ranking fails to give us a precise result. The method enables direct access to a subtopic; search result clustering reduces the amount of information, helps filter out irrelevant items, and favours exploration of unknown or dynamic domains. Search result clustering is different from conventional document clustering. When clustering takes place as a post-processing step on the set of results retrieved by an information retrieval system for a query, it may be more efficient, because the input consists of only a few hundred snippets. It can also be more effective, because query-specific text features are used. On the other hand, search result clustering must fulfil a number of more stringent requirements raised by the nature of the application in which it is embedded [2].

3. Nonnegative Matrix Factorization Nonnegative matrix factorization (NMF) differs from other rank reduction methods for vector space models in text mining by using constraints that produce nonnegative basis vectors. These vectors make the concept of a parts-based representation possible. Lee and Seung first introduced the notion of parts-based representations for problems in image analysis and text mining that occupy nonnegative subspaces in a vector-space model [10]. Basis vectors contain no negative entries, which allows only additive combinations of the vectors to reproduce the original. The perception of the whole becomes a combination of its parts represented by these basis vectors. In text mining, the basis vectors represent or identify semantic features. If a document is viewed as a combination of basis vectors, then it can be categorized as belonging to the topic represented by its principal vector. Thus, NMF can be used to organize text collections into partitioned structures or clusters directly derived from the nonnegative factors.

Common approaches to NMF obtain an approximation of V by computing a pair (W, H) that minimizes the Frobenius norm of the difference, ||V − WH||_F. Let V ∈ R^{m×n} be a nonnegative matrix, and let W ∈ R^{m×k} and H ∈ R^{k×n} with 0 < k ≪ min(m, n). W and H are initialized with nonnegative values, and these initial estimates are improved or updated with alternating iterations of the algorithm. In the following subsections some existing NMF techniques are discussed [3][4][5].
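For illustration, the following minimal sketch uses the dimensions of the experiment in Section 5 (100 snippets and 35 distinct words) together with an assumed rank k = 5; it only shows the shapes of V, W and H and the Frobenius objective that the factorization minimizes.

```python
import numpy as np

# Hypothetical illustration: m snippets, n distinct words, rank k << min(m, n)
m, n, k = 100, 35, 5
rng = np.random.default_rng(0)

V = rng.random((m, n))   # nonnegative data matrix V, size m x n
W = rng.random((m, k))   # nonnegative basis matrix W, size m x k
H = rng.random((k, n))   # nonnegative coefficient matrix H, size k x n

# The quantity NMF tries to minimize: the Frobenius norm of V - WH
error = np.linalg.norm(V - W @ H, ord="fro")
print(round(error, 3))
```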

Fig. 1. K-reduced Nonnegative Matrix Factorization (NMF)

3.1 Multiplicative method The NMF method proposed by Lee and Seung is based on multiplicative update rules for W and H. This scheme is referred to as the multiplicative method (MM).

MM Algorithm
(1) Initialize W and H with nonnegative values.
(2) Iterate for each c, j, and i until convergence or for l iterations:
(a) H_cj ← H_cj (W^T V)_cj / ((W^T W H)_cj + ε)
(b) W_ic ← W_ic (V H^T)_ic / ((W H H^T)_ic + ε)

In steps (a) and (b), ε, a small positive parameter equal to 10^-9, is added to avoid division by zero. As can be observed from the MM algorithm, W and H remain nonnegative during the updates. Simultaneous updating of W and H generally yields better results than updating each matrix factor fully before the other. In the algorithm, the optimization is performed on a unit hypersphere, with the columns of W, the basis vectors, normalized at each iteration.
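A minimal NumPy sketch of the MM algorithm above; it assumes a fixed number of iterations rather than a convergence test, and the final normalization of the columns of W (with a compensating rescaling of H) is one common way to keep the basis vectors on the unit hypersphere.

```python
import numpy as np

def nmf_multiplicative(V, k, iterations=200, eps=1e-9, seed=0):
    """Sketch of the multiplicative method (MM).

    V is a nonnegative m x n matrix; returns nonnegative factors
    W (m x k) and H (k x n). eps is the small positive parameter
    (10^-9 in the text) added to avoid division by zero."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))                    # step (1): nonnegative initialization
    H = rng.random((k, n))
    for _ in range(iterations):               # step (2): fixed number of iterations
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update (a)
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update (b)
        # Normalize the columns of W (basis vectors) to unit length and
        # rescale H so that the product WH is unchanged.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, np.newaxis]
    return W, H
```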

3.2 Sparse encoding A new nonnegative sparse encoding scheme, based on the study of neural networks, was introduced in [6]. This method includes an important feature that enforces statistical sparsity of the H matrix. As the sparsity of H increases, the basis vectors become more localized, i.e., the parts-based representation of the data in W becomes more and more enhanced.

3.3 A hybrid method In this approach, the multiplicative method, which is basically a version of the gradient descent optimization scheme, is used at each iterative step to approximate the basis vector matrix W. H is calculated using a constrained least squares (CLS) model as the metric, which serves to penalize the nonsmoothness and nonsparsity of H. As a result of this penalization, the basis vectors or topics in W become more localized, thereby reducing the number of vectors needed to represent each document [7][8][9].
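The H-update of the hybrid method can be sketched as a regularized least squares solve followed by clipping of negative entries; the penalty weight lam below is an assumed parameter, not a value taken from the paper.

```python
import numpy as np

def cls_update_H(V, W, lam=0.1):
    """Sketch of the constrained least squares (CLS) step of the hybrid method:
    for every document j, solve min_h ||V[:, j] - W h||^2 + lam * ||h||^2,
    then zero out negative entries so that H stays nonnegative."""
    k = W.shape[1]
    # Regularized normal equations: (W^T W + lam * I) H = W^T V
    H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ V)
    H[H < 0] = 0.0   # enforce nonnegativity by clipping
    return H
```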

4. Problem formalization and algorithm The distinctive characteristic of the algorithm is that it first identifies meaningful cluster labels and only then assigns search results to these labels to build proper clusters. The algorithm consists of five steps:
1. Pre-process the input snippets, including tokenization, stemming and stop-word marking.
2. Identify words and sequences of words frequently appearing in the input snippets.
3. Use matrix decomposition to induce cluster labels.
4. Assign snippets to each of these labels to form proper clusters.
5. Post-process the clusters, including cluster merging and pruning.
Step 3 is the core of the algorithm, since it relies on the Vector Space Model and a term-document matrix A. Matrix A has n rows and m columns, where n is the number of input snippets and m is the number of distinct words found in the input snippets. Each element A_nm of A numerically represents the relationship between word m and snippet n.
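A minimal sketch of constructing such a matrix for steps 1-3, assuming a simplified tokenizer and a tiny illustrative stop-word list (stemming is omitted).

```python
import re

def binary_snippet_term_matrix(snippets, stop_words=("the", "a", "of", "and", "in")):
    """Build the binary matrix A: row i describes snippet i, column j corresponds
    to distinct word j, and A[i][j] = 1 if the word occurs in the snippet."""
    stop = set(stop_words)
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in stop]
        for s in snippets
    ]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    index = {w: j for j, w in enumerate(vocabulary)}
    A = [[0] * len(vocabulary) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for w in tokens:
            A[i][index[w]] = 1
    return A, vocabulary
```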

The Nonnegative Matrix Factorization may be applied to the binary matrix A created in step 2, where the rows of the matrix are the input snippets (objects) and the columns are the distinct words found in the input snippets (attributes). The input snippets and their attributes are encoded as 0 and 1 (0 – the distinct word does not occur in the input snippet, 1 – the distinct word occurs in the input snippet). The rank-k NMF is known to remove noise by ignoring small differences between row and column vectors of A (they correspond to small singular values, which are dropped by the choice of k). It can therefore be used in our algorithm, because NMF creates equivalence classes of data from the original data by adding some non-primary attributes to the objects. This process leads objects to have similar attributes; based on this similarity, the algorithm combines the objects that have the same attributes and presents them as one object.
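A sketch of this smoothing step, reusing the multiplicative-update function from Section 3.1; rounding the rank-k reconstruction back to 0/1 with an assumed threshold of 0.5 (not specified in the paper) reproduces the kind of change shown later in Table 2.

```python
import numpy as np

def smooth_with_nmf(A, k, threshold=0.5):
    """Approximate the binary snippet-term matrix A by a rank-k NMF and round
    the reconstruction back to 0/1, so that nearly identical snippets end up
    with identical attribute patterns."""
    A = np.asarray(A, dtype=float)
    W, H = nmf_multiplicative(A, k)     # multiplicative-update sketch from Section 3.1
    smoothed = (W @ H >= threshold).astype(int)
    return smoothed, W, H
```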

NMF assign snippets Topic 1

Topic 2

.....

Topic n

Fig. 2. Steps of the algorithm process (web results → pre-processing of the input snippets → identification of frequently occurring words → NMF-based assignment of snippets to Topic 1, Topic 2, ..., Topic n)

The output of this process can be seen as groups of snippets that share similar distinct words. This means that the algorithm can substantially reduce the huge volume of snippets received as a result of searching the web.
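Step 4, the assignment of snippets to topics, can be sketched as picking for each snippet the basis vector with the largest weight in W; this is a minimal illustration, not the full merging and pruning described in step 5.

```python
import numpy as np

def assign_snippets_to_topics(W):
    """Assign each snippet (a row of W) to the topic whose basis vector
    has the largest weight for that snippet."""
    return np.argmax(W, axis=1)

# e.g. labels = assign_snippets_to_topics(W); snippets sharing a label form one cluster.
```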

5. Experiment We applied our experiment to data selected using the Google search engine, consisting of 100 snippets with 35 distinct words. However, to make it easier to explain how the algorithm works, we extracted a small part of these data.

The data extracted from the result of the Google search engine (Table 1) consist of 20 objects (snippets) with 8 attributes (distinct words). Each object has its own particular combination of attributes.

    O0  O1  O2  O3  O4  O5  O6  O7  O8  O9  O10 O11 O12 O13 O14 O15 O16 O17 O18 O19
A0  0   1   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0   1   0   1
A1  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
A2  0   0   0   0   0   0   1   0   1   1   0   1   1   1   1   0   0   0   1   1
A3  0   0   0   0   0   0   1   0   1   0   1   1   1   0   0   1   1   0   1   1
A4  0   1   1   1   1   0   1   1   0   1   1   0   1   1   0   1   0   1   0   0
A5  0   0   0   0   0   0   1   1   0   1   1   1   0   0   1   0   1   1   0   0
A6  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
A7  1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   1

Table 1. Data extracted from the result of the Google search engine (rows: attributes A0-A7; columns: objects O0-O19)

    O0  O1  O2  O3  O4  O5  O6  O7  O8  O9  O10 O11 O12 O13 O14 O15 O16 O17 O18 O19
A0  1   1   1   0   1   1   0   0   0   0   0   0   0   0   0   0   0   1   1   1
A1  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
A2  0   0   0   0   0   0   1   1   1   1   1   1   1   1   1   0   1   0   1   1
A3  0   0   0   0   0   0   1   0   1   0   1   1   1   0   0   1   1   0   1   1
A4  0   1   1   1   1   0   1   1   0   1   1   0   1   1   1   1   1   1   0   0
A5  0   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1
A6  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
A7  1   1   1   0   1   1   0   0   0   0   0   0   0   0   0   0   0   1   1   1

Table 2. Data after using NMF with rank k; the changed attributes are those that differ from Table 1

Comparing Table 2 with Table 1, we can see how the attributes have changed so that objects obtain similar attribute combinations. Moreover, we notice that the change of attributes is more pronounced when an object has more attributes. This process reduces the data representation to 40% of the original data, while retaining the primary attributes of the objects.
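A sketch of the combining step that produces Table 3 from Table 2 and reports the reduction ratio (for the example data, 20 objects collapse to 8 representatives, i.e. 40%).

```python
import numpy as np

def combine_identical_objects(A_smoothed):
    """Merge objects with identical attribute combinations into a single
    representative object and report the reduction ratio."""
    A_smoothed = np.asarray(A_smoothed)
    representatives = np.unique(A_smoothed, axis=0)   # distinct attribute patterns
    ratio = representatives.shape[0] / A_smoothed.shape[0]
    return representatives, ratio
```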

    O0  O1  O2  O3  O4  O5  O6  O7
A0  1   1   0   0   0   0   0   1
A1  0   0   0   0   0   0   0   0
A2  0   0   0   1   1   1   0   1
A3  0   0   0   1   0   1   1   0
A4  0   1   1   1   1   0   1   1
A5  0   1   1   1   1   1   1   1
A6  0   0   0   0   0   0   0   0
A7  1   1   0   0   0   0   0   1

Table 3. Representing data after combining the similar objects

The efficiency of this method shows most clearly when we have huge data, with a large number of attributes, and a good choice of the rank k of the two matrices (W and H) when applying the NMF method. Figure 3 shows how the value of k affects the number of attribute combinations.
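A sketch of how such a curve could be produced, reusing the smoothing function from Section 4; the rank values follow the figure's x-axis.

```python
import numpy as np

def combinations_vs_rank(A, ranks=(5, 10, 15, 20, 25)):
    """For each rank k, smooth the binary matrix with rank-k NMF and count
    how many distinct attribute combinations remain."""
    counts = {}
    for k in ranks:
        smoothed, _, _ = smooth_with_nmf(A, k)            # sketch from Section 4
        counts[k] = np.unique(smoothed, axis=0).shape[0]  # distinct attribute patterns
    return counts
```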

Fig. 3. Relationship between the value of the rank k and the number of attribute combinations (x-axis: rank, 5-25; y-axis: number of combinations, 0-6000)

6. Conclusion and future work Nonnegative Matrix Factorization (NMF) with rank k relies on adding some attributes to the objects in the original data. This addition of attributes provides a minimal set of primary attributes under which more objects share the same combination of attributes. Applying this method of decomposition to the problems of web searching gives us a good solution for search results clustering.

7. References
[1] Stanislaw Osinski, "Improving Quality of Search Results Clustering with Approximate Matrix Factorisations", ECIR 2006.
[2] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Jinwen Ma, "Learning to Cluster Web Search Results", SIGIR 2004.
[3] Vaclav Snasel, Petr Gajdos, Hussam Dahwa Abdulla, Martin Polovincak, "Concept Lattice Reduction by Matrix Decompositions", DCCA 2007.
[4] Vaclav Snasel, Hussam Dahwa Abdulla, Martin Polovincak, "Behavior of the Concept Lattice Reduction to Visualizing Data after Using Matrix Decompositions", IEEE Innovations'07, 2007.
[5] Vaclav Snasel, Martin Polovincak, Hussam Dahwa Abdulla, Zdenek Horak, "On Knowledge Structures Reduction", IEEE CISIM 2008.
[6] P. Hoyer, "Non-negative Sparse Coding", In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, Martigny, Switzerland, 2002.
[7] M. Berry, M. Browne, "Understanding Search Engines: Mathematical Modeling and Text Retrieval", SIAM, 1999.
[8] M. Berry, S. Dumais, T. Letsche, "Computational Methods for Intelligent Information Access", In Proceedings of the 1995 ACM/IEEE Supercomputing Conference, 1995.
[9] R. M. Larsen, "Lanczos Bidiagonalization with Partial Reorthogonalization", Technical report, University of Aarhus, 1998.
[10] D. Lee, H. Seung, "Algorithms for Non-negative Matrix Factorization", Advances in Neural Information Processing Systems, 2001, pp. 556-562.
