Text Representation: from Vector to Tensor*

Ning Liu1, Benyu Zhang2, Jun Yan3, Zheng Chen2, Wenyin Liu4, Fengshan Bai1, Lee-Feng Chien5

1 Department of Mathematical Science, Tsinghua University, Beijing 100084, P.R. China {liun01, fbai}@mails.tsinghua.edu.cn
2 Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, P.R. China {byzhang, zhengc}@microsoft.com
3 School of Mathematical Sciences, Peking University, Beijing 100871, P.R. China [email protected]
4 Department of Computer Science, City University of Hong Kong, P.R. China [email protected]
5 Institute of Information Science, Academia Sinica [email protected]

Abstract

In this paper, we propose a text representation model, the Tensor Space Model (TSM), which models text by high-order tensors from multilinear algebra instead of the traditional vectors. Supported by techniques of multilinear algebra, TSM offers a potent mathematical framework for analyzing multifactor structures. TSM is further supported by operations and tools introduced for it, such as the High-Order Singular Value Decomposition (HOSVD) for dimension reduction and other applications. Experimental results on the 20 Newsgroups dataset show that TSM is consistently better than VSM for text classification.

1. Introduction and Related Work

Information Retrieval (IR) [2] techniques have attracted much attention during the past decades, since people are frustrated by being drowned in huge amounts of data while still being unable to obtain useful information. The Vector Space Model (VSM) [2] is the cornerstone of many information retrieval techniques; it is used to represent text documents and to define the similarity among them. Bag of Words (BOW) [2] is the earliest approach used to represent a document as a bag of words under the VSM. In the BOW representation, a document is encoded as a feature vector, with each element of the vector indicating the presence or absence of a word in the document by TFIDF indexing [5]. However, the major limitation of BOW is that it only retains the frequencies of the words in the document and loses the sequence information.

In the past decade, attempts have been made to incorporate word-order knowledge into the vector space representation. The N-gram statistical language model [3, 4] is a well-known one among them. In the N-gram representation, the entries of the document vector are strings of n consecutive words extracted from the collection. N-grams are effective approximations: they not only keep the word-order information but also address the language-independence problem. However, their high-dimensional feature vectors make many powerful information retrieval technologies, such as Latent Semantic Indexing (LSI) [2] and Principal Component Analysis (PCA) [6], infeasible for large datasets.

During the past few years, IR researchers have proposed a variety of effective representation approaches for text documents based on VSM. However, since the volume of available text data is increasing very fast, more and more researchers ask [1]: "Are further improvements likely to require a broad range of techniques in addition to those of the IR area?" This motivates us to seek a new model for text document representation based on new techniques. The requirements for the new model are to capture the context of words, to be language independent, and to scale to large datasets. In this paper, we propose a novel Tensor Space Model (TSM) for text document representation.
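To make the BOW encoding concrete, here is a minimal sketch (our illustration, not code from the paper) that builds TFIDF-weighted bag-of-words vectors for a tiny corpus; the exact TFIDF weighting scheme of [5] may differ in its smoothing and normalization.

```python
import math
from collections import Counter

def bow_tfidf(corpus):
    """Encode each document as a TFIDF-weighted bag-of-words vector (dict over the vocabulary).

    Minimal sketch only: tf is the raw term count and idf = log(N / df);
    the weighting in [5] may use different smoothing and normalization.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    df = Counter(w for doc in tokenized for w in set(doc))  # document frequency per word
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in vocab})
    return vocab, vectors

vocab, vecs = bow_tfidf(["text representation from vector to tensor",
                         "vector space model for text"])
```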

* This work was done while the first author was an intern at Microsoft Research Asia.


The proposed TSM is based on character-level high-order tensors (the natural generalization of matrices) and offers a potent mathematical framework for analyzing multifactor structure [9]. In contrast to VSM, TSM represents a text document by high-order tensors instead of vectors (1-order tensors) or matrices (2-order tensors). The features along each coordinate are the letters "a" to "z", and all other non-alphabetic symbols, such as punctuation marks, are denoted by "_". Moreover, a dimensionality reduction algorithm is proposed based on a tensor extension of the conventional matrix Singular Value Decomposition (SVD), known as the High-Order SVD (HOSVD) [8]. The HOSVD technique can find underlying and latent structure in documents and makes algorithms such as LSI and PCA easy to implement under TSM. Theoretical analysis and experiments show that HOSVD under TSM can significantly outperform VSM on classification problems with small training data. Another contribution of TSM is that it can incorporate many multilinear algebra techniques to improve the performance of IR.

The rest of this paper is organized as follows. In Section 2, we describe the multilinear model. The HOSVD algorithm, which is used for computing the underlying space, is presented in Section 3. The experimental results on the 20 Newsgroups dataset [7] are given in Section 4. Conclusions and future work are presented in Section 5.

2. Tensor Space Model

"Tensor" is a term from multilinear algebra. It generalizes the concepts of "vector" and "matrix" from linear algebra. Intuitively, a vector data structure is a 1-order tensor and a matrix data structure is a 2-order tensor; a cube-like data structure is a 3-order tensor, and so on. In other words, higher-order tensors are abstract data structures that generalize vectors and matrices. TSM makes use of the tensor structure to describe text documents and of the techniques of multilinear algebra to improve the performance of IR.

To start with, we introduce the following notation. In this paper, scalars are denoted by lower-case letters a, b, ..., vectors by normal-font capital letters A, B, ..., matrices by bold capital letters A, B, ..., and higher-order tensors by calligraphic capital letters $\mathcal{A}, \mathcal{B}, \ldots$. We define the order of a tensor to be N if $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. The entries of $\mathcal{A}$ are denoted by $a_{i_1 \cdots i_n \cdots i_N}$, where $1 \le i_n \le I_n$ for $1 \le n \le N$.

The traditional BOW cannot capture or utilize the valuable word-order information. On the contrary, although the N-gram representation of documents can encode this word-sequence information, the high-dimensional vectors it generates lead to very high storage and computational complexity. This high complexity makes many powerful tools, such as LSI and PCA, fail in the text mining and information retrieval process. Hence, we propose to use higher-order tensors to represent text documents so that both the word-order information and the complexity problem are taken into account. Moreover, we will show that the proposed TSM offers many other advantages over the popular BOW and N-gram models.

The TSM is a model of text document representation. We start from a simple example. Consider the simple document below, which consists of six words:

"Text representation: from vector to tensor"

We use a 3-order tensor $\mathcal{A} = \{a_{i_1 i_2 i_3}\} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ to represent this document and index it by the 26 English letters. All characters other than these 26 letters, such as punctuation marks and spaces, are treated identically and denoted by "_". The character string of this document can be separated into character 3-grams: "tex, ext, xt_, t_r, _re, rep, ...". The 26 letters "a" to "z" and "_" scale each axis of the tensor space, so the document is represented by a 27 × 27 × 27 tensor. The "_" character corresponds to index 0 on each axis, and "a" to "z" correspond to indices 1 to 26. For example, the position corresponding to "tex" is (20, 5, 24), since "t" is the 20th letter of the alphabet, "e" is the 5th, and "x" is the 24th. As another example, "xt_" corresponds to (24, 20, 0). We then use the TFIDF method to weight each position of the tensor, in the same way as for VSM. By doing so, each document is represented by a character-level 3-order tensor, as shown in Figure 1.

If we put a corpus of documents together, we obtain a 4-order tensor in a 27 × 27 × 27 × m space, where m is the number of documents, as illustrated in Figure 2. Figure 2 shows only a 4-order TSM; in our model, the order of the tensor for each document is not limited to 3, so the order of the tensor for a corpus of documents is not limited to 4. Without loss of generality, a corpus with m documents can be represented as a character-level tensor $\mathcal{A} = \{a_{i_1 i_2 i_3 \cdots i_N}\} \in \mathbb{R}^{27 \times 27 \times \cdots \times 27 \times m}$, where each document is represented by an (N-1)-order tensor.
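To make the indexing scheme above concrete, the following sketch (our own illustration, not code from the paper) counts the character 3-grams of a document into a 27 × 27 × 27 array, mapping "_" to index 0 and "a" to "z" to indices 1 to 26; the TFIDF weighting that the paper applies across a corpus is omitted here.

```python
import numpy as np

def char_index(c):
    """Map 'a'-'z' to 1-26 and every other character to 0 ('_')."""
    return ord(c) - ord('a') + 1 if 'a' <= c <= 'z' else 0

def doc_to_tensor(text):
    """Represent a document as a character-level 3-order tensor of 3-gram counts.

    A minimal sketch of the indexing in Section 2; the paper additionally
    applies TFIDF weighting to these positions, which is omitted here.
    """
    text = text.lower()
    tensor = np.zeros((27, 27, 27))
    for a, b, c in zip(text, text[1:], text[2:]):
        tensor[char_index(a), char_index(b), char_index(c)] += 1.0
    return tensor

doc = "Text representation: from vector to tensor"
t = doc_to_tensor(doc)
print(t[20, 5, 24])  # count of the 3-gram "tex" -> position (t, e, x) = (20, 5, 24)
```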



Figure 1. A document is represented as a character level 3-order tensor

Figure 2. A corpus of documents is represented as a 4-order tensor

3. HOSVD Algorithm

VSM represents a group of objects as a "term by object" matrix and uses the SVD technique to decompose the matrix as $D = U_1 \Sigma U_2^T$, which is the necessary technique for PCA and LSI. Similarly, a tensor $\mathcal{D}$ in TSM undergoes the Higher-Order SVD (HOSVD), which is an extension of the matrix SVD. This process is illustrated in Figure 3 for the case N = 3.

Figure 3. Illustration of a Higher-Order SVD (N = 3)

The HOSVD algorithm based on TSM is described as follows:

Step 1. Represent a group of documents as a character-level tensor $\mathcal{D} = \{d_{i_1 i_2 i_3 \cdots i_{N+1}}\} \in \mathbb{R}^{27 \times 27 \times \cdots \times 27 \times m}$, where m is the number of documents.

Step 2. For $n = 1, \ldots, N$, compute the matrix $U_n$ by performing the SVD of the flattened matrix $D_{(n)}$, where $U_n$ is the left matrix in the SVD result.

Step 3. Solve for the core tensor as follows:
$$\mathcal{Z} = \mathcal{D} \times_1 U_1^T \times_2 U_2^T \cdots \times_N U_N^T$$
where the matrix $U_n$ contains the orthogonal vectors spanning the column space of the matrix $D_{(n)}$, and $D_{(n)}$ is the matrix unfolding of the tensor, i.e., the matrix representation of the tensor in which all of its column vectors are ranked sequentially.
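The three steps can be sketched in a few lines of NumPy (our illustration; the paper does not give an implementation, and the mode-n unfolding convention used here is one of several equivalent choices):

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding D_(n): move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_dot(tensor, matrix, mode):
    """Multiply a tensor by a matrix along the given mode (the x_n product)."""
    moved = np.moveaxis(tensor, mode, 0)
    shape = moved.shape
    result = matrix @ moved.reshape(shape[0], -1)
    return np.moveaxis(result.reshape((matrix.shape[0],) + shape[1:]), 0, mode)

def hosvd(tensor, ranks=None):
    """HOSVD sketch: factor matrices U_n from the SVD of each unfolding, plus the core Z.

    `ranks` optionally truncates each U_n to reduce dimensionality (as in Section 4).
    """
    factors = []
    for n in range(tensor.ndim):
        U, _, _ = np.linalg.svd(unfold(tensor, n), full_matrices=False)
        if ranks is not None:
            U = U[:, :ranks[n]]
        factors.append(U)
    core = tensor
    for n, U in enumerate(factors):
        core = mode_dot(core, U.T, n)  # Z = D x_1 U1^T x_2 U2^T ... x_N UN^T
    return core, factors

# Example: decompose a random 27 x 27 x 27 document tensor.
Z, Us = hosvd(np.random.rand(27, 27, 27))
```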

4. Experiments

4.1. Experiments Setup

We conducted experiments to compare the proposed TSM with VSM on the 20 Newsgroups dataset, which has become a popular dataset for experiments in text applications of machine learning techniques. In this paper, we select the five classes of the 20 Newsgroups collection about computer science, which are very closely related to each other.

The widely used performance measurements for text categorization problems are Precision, Recall and Micro F1 [2]. Precision is the ratio of the number of correctly categorized items to the number of all assigned items. Recall is the ratio of the number of correctly categorized items to the number of all testing items. Micro F1 is a common measure in text categorization that combines recall and precision into a single score according to the following formula:

$$\text{Micro F1} = \frac{2 \, P \times R}{P + R}$$

where P is the Precision and R is the Recall [2].
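As a small illustration (ours, not from the paper) of how the evaluation score is computed, the following sketch pools true and false positives over all classes and applies the Micro F1 formula above:

```python
def micro_f1(true_labels, predicted_labels, classes):
    """Micro-averaged F1: pool true/false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for c in classes:
        tp += sum(1 for t, p in zip(true_labels, predicted_labels) if p == c and t == c)
        fp += sum(1 for t, p in zip(true_labels, predicted_labels) if p == c and t != c)
        fn += sum(1 for t, p in zip(true_labels, predicted_labels) if p != c and t == c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(micro_f1(["a", "b", "a", "c"], ["a", "b", "b", "c"], classes={"a", "b", "c"}))  # 0.75
```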

4.2. Experiment Results

Each text document of this dataset is mapped into a 131,072-dimension vector under VSM. We use a 4-order tensor to represent each document under TSM; the original dimension of each tensor is 27^4 = 531,441. (In the figures, x^y denotes x raised to the power y.) We use the HOSVD technique to reduce the tensor to different dimensions (26^4, 20^4 and 10^4). In Figure 4, we report the results of VSM in contrast to TSM at different reduced dimensions. It can be seen that the 4-order tensor with the original dimension (27^4) is better than VSM, and the 4-order tensor whose dimension is reduced to 26^4 by HOSVD is the best.


This proves that HOSVD can find the principal components and filter out the noise. However, although the performance of the 10^4-dimension reduced tensor is much lower than that of VSM and the original 4-order TSM, the result is acceptable, and this low-dimensional representation makes the computation more efficient.

Figure 4. Text categorization with VSM and TSM on 20 Newsgroups (Micro F1 for VSM(131,072), TSM(27^4), TSM(26^4), TSM(20^4), and TSM(10^4))

We do not reduce the dimension of the 20 Newsgroups data by SVD under VSM, since decomposing the huge term-by-document matrix of 20 Newsgroups is hard due to its high time and space complexity. To compare VSM with SVD against TSM with HOSVD, we randomly sampled a subset of 20 Newsgroups at a ratio of about 5%, such that the data dimension is about 8,000 under VSM re-indexing. The subset contains 230 documents in two classes, 165 for training and 65 for testing. By doing so, we can perform the matrix SVD on this sampled data. Figure 5 shows the results.

Figure 5. Text categorization on a subset of 20 Newsgroups (Micro F1 for VSM(8,192), VSM(125), VSM(64), TSM(5^3), TSM(4^3), TSM(12^3), and TSM(27^3))

It can be seen that if we reduce the data under VSM and the data under TSM to the same dimension (125 versus 5^3, 64 versus 4^3, etc.), the reduced TSM data always outperform their VSM counterparts. Moreover, the dimension of the reduced data under VSM cannot be larger than 165, since the number of documents (smaller than the term dimension) determines the rank of this "term by document" matrix, and the rank of this matrix means there are at most 165 singular vectors. In contrast, HOSVD under TSM can reduce the data to any dimension and is not limited by the number of samples. Figure 5 shows that the 12 × 12 × 12 reduced tensors achieve better performance than all the others, while the SVD under VSM cannot reduce the data to such a dimension.
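To illustrate why a reduction such as 12 × 12 × 12 is reachable under TSM even with few samples, here is a hedged sketch (our illustration; the helper name and the sequential, per-mode truncation are our assumptions, not the paper's procedure) that truncates each mode of a single document tensor to its top 12 left singular vectors, independent of the number of documents:

```python
import numpy as np

def truncate_mode(tensor, mode, k):
    """Project a tensor onto the top-k left singular vectors of its mode-n unfolding."""
    unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    projected = U[:, :k].T @ unfolded                 # k x (product of the other dims)
    new_shape = (k,) + tuple(np.delete(tensor.shape, mode))
    return np.moveaxis(projected.reshape(new_shape), 0, mode)

# Reduce a 27 x 27 x 27 document tensor to 12 x 12 x 12, a dimension the
# rank-limited SVD under VSM cannot reach on the 230-document sample.
doc = np.random.rand(27, 27, 27)   # stand-in for a TFIDF-weighted 3-gram tensor
for mode in range(3):
    doc = truncate_mode(doc, mode, 12)
print(doc.shape)                   # (12, 12, 12)
```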

5. Conclusion and Future Work



In this paper, we propose to use the Tensor Space Model to represent text documents. By using TSM and HOSVD, some underlying and latent structure of documents can be found. Theoretical analysis and experimental results show that the proposed TSM keeps the merits of VSM while remedying some of its disadvantages for certain IR problems. The TSM proposed in this paper also suggests many new tasks: designing tensor kernels for better similarity measurement, testing the performance of TSM on non-language datasets, customizing and applying techniques originally developed for the traditional VSM to TSM, and investigating and applying more multilinear algebra theorems to improve the performance of IR under TSM are all on our future work agenda.

6. References

[1] Aslam, J., Belkin, N., Zhai, C., Callan, J., Hiemstra, D., Hofmann, T., Dumais, S., Harper, D.J., et al. Challenges in Information Retrieval and Language Modeling: Report of a Workshop Held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, 2001.

[2] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley, 1999.

[3] Cavnar, W.B. and Trenkle, J.M. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, 161-169.



[4] Croft, W.B. and Lafferty, J. Language Modeling for Information Retrieval. Kluwer Academic, 2003.

[5] Salton, G. and Buckley, C. Term Weighting Approaches in Automatic Text Retrieval. Technical Report TR87-881, Department of Computer Science, Cornell University, 1987.

[6] Jolliffe, I.T. Principal Component Analysis. Springer-Verlag, New York, 1986.

[7] Lang, K. NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), 331-339.

[8] De Lathauwer, L., De Moor, B. and Vandewalle, J. A Multilinear Singular Value Decomposition. SIAM Journal on Matrix Analysis and Applications, 21, 1253-1278.

[9] Wrede, R.C. Introduction to Vector and Tensor Analysis. Wiley, 1963.
