Dimensionality Reduction by Semantic Mapping

Renato Fernandes Corrêa (1,2), Teresa Bernarda Ludermir (2)
(1) Polytechnic School, Pernambuco University, Rua Benfica, 455, Madalena, Recife - PE, Brazil, 50.750-410
(2) Center of Informatics, Federal University of Pernambuco, P.O. Box 7851, Cidade Universitaria, Recife - PE, Brazil, 50.732-970
{rfc, tbl}@cin.ufpe.br

Abstract

In high-dimensional classification and clustering tasks, dimensionality reduction becomes necessary for the computation and the interpretability of the results generated by machine learning algorithms. This paper describes a new feature extraction method called semantic mapping, which uses SOM maps to construct variables in a reduced space, where each variable describes the behavior of a group of semantically related features. The performance of semantic mapping is measured and compared empirically with that of sparse random mapping and PCA in a text categorization case study; semantic mapping proves to be better than random mapping and a good alternative to PCA.

1. Introduction

When the data vectors are high-dimensional, it is computationally infeasible to use data analysis or pattern recognition algorithms that repeatedly compute similarities or distances in the original data space [1]. For example, some neural networks, like most statistical models, cannot operate with tens of thousands of input variables [2]. It is also difficult to interpret the results and to mine knowledge from the models generated by machine learning algorithms on high-dimensional data. In the development of automatic text categorization systems, dimensionality reduction methods are needed because documents are represented by high-dimensional vectors. Text categorization is the process of classifying documents by grouping them into one or more existing categories according to the themes or concepts present in their content. The most common application of text categorization is document indexing for Information Retrieval Systems (IRS). Automatic text categorization systems typically have to process an influx of new documents, sometimes in real time, assigning each one of them to none, one or many categories. The set of categories into which the documents are classified is typically predefined by the designer or maintainer of the system and generally remains unchanged over long periods of time.

The vector space model [3] is generally adopted to represent the documents. In this model, terms become indices and the respective values represent the importance of each term in the document. These terms consist of isolated words or groups of related words. The dimension of the document vectors is equal to the size of the corpus vocabulary, about 50,000 words in a large corpus [4]. A preprocessing stage is introduced to reduce the number of terms: generally, non-informative words are discarded, and algorithms are used to remove word affixes and to reduce the dimensionality. There are many dimensionality reduction methods. An orthogonal distinction may be drawn in terms of the nature of the resulting terms: term selection methods and term extraction methods [5]. Feature selection methods select a subset of the original set of features using a ranking metric. Chi-squared and Information Gain are examples of this kind of method, and they generally outperform other term selection methods in text categorization [6]. Feature extraction methods extract a small set of new features generated by combinations or transformations of the original ones. Latent Semantic Indexing (LSI) [7] and Principal Component Analysis (PCA) [8] are equivalent feature extraction methods, based on the estimation of principal components from the term/document matrix. These methods often outperform feature selection methods in text categorization [5]. LSI has the potential benefit of detecting synonyms as well as words that refer to the same topic; however, its computation is expensive for high-dimensional data.

The objective of this paper is to describe a new feature extraction method called semantic mapping, which is fast, generally applicable, and approximately preserves the mutual similarities between the data vectors. Semantic mapping was derived from the random mapping method [1] and uses self-organizing maps [8] to cluster semantically related terms (i.e., words that refer to the same topic). This paper is organized as follows. Section 2 describes the random mapping method, and Section 3 presents the semantic mapping method. Section 4 describes the methodology and the results of the text categorization experiments used to measure and compare the performance of the dimensionality reduction methods. Section 5 contains the conclusions and future work.

2. Random Mapping

The random mapping method was developed and used in the context of the WEBSOM project [9]. WEBSOM is a method for organizing textual documents onto a two-dimensional map display using SOM [1]. Nearby locations on the display contain similar documents, so the maps provide a meaningful visual background: an overview of the topics present in the collection and a means for browsing and content-addressable searching.

In the random mapping method (RM) [1], the original n-dimensional data vector, denoted by xj, is multiplied by a random matrix R. The mapping yj = R xj results in a d-dimensional vector yj. The matrix R consists of normally distributed random values, and the Euclidean length of each column is normalized to unity. RM is a generally applicable method that approximately preserves the mutual similarities between the data vectors [1]. The computational complexity of forming the random matrix, O(nd), is negligible compared with the computational complexity of estimating the principal components, O(Nn²) + O(n³) [1]. Here n and d are the dimensionalities before and after the random mapping, respectively, and N is the number of data vectors. Both in theory and experimentally, RM has been shown to be efficient, producing results as good as PCA or as the use of the data vectors in the original space when the final dimensionality is larger than 100 features [1][9].

To increase the speed of computation of RM, simplifications in the construction of the matrix R were suggested and evaluated experimentally [9]. One of these simplifications was to construct R as a sparse matrix, in which a fixed number of ones (5, 3 or 2) is placed at random positions in each column (determining in which extracted features each original feature will participate) and the other elements remain zero. This sparse random mapping method (SRM) generated results close to those of the original method. The performance of SRM was directly proportional to the number of ones in each column (the best performance was obtained with 5 ones). SRM was also used successfully in [4].
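For concreteness, a minimal sketch of the sparse construction described above is given below (Python/NumPy; the function name, parameter names and the example dimensions are ours, not taken from [1] or [9]):

```python
import numpy as np

def sparse_random_matrix(n, d, ones_per_column=2, seed=None):
    """Build a d x n sparse projection matrix R: each of the n columns has
    `ones_per_column` entries equal to one at random rows, the rest zero."""
    rng = np.random.default_rng(seed)
    R = np.zeros((d, n))
    for j in range(n):
        rows = rng.choice(d, size=ones_per_column, replace=False)
        R[rows, j] = 1.0
    return R

# Mapping y_j = R x_j for a whole data matrix X (one row per data vector):
# R = sparse_random_matrix(n=2903, d=100, ones_per_column=2, seed=0)
# Y = X @ R.T
```

In the original dense RM, R would instead be filled with normally distributed values, with each column normalized to unit Euclidean length.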

3. Semantic Mapping

The semantic mapping method (SM) was developed in research on better document representations for the text categorization task; however, it can be applied to any database for dimensionality reduction. SM is a specialization of the sparse random mapping method (SRM). SM incorporates the semantics of the features, captured in a data-driven way, into the construction of the new features or dimensions. The method consists of the steps illustrated in Figure 1.

[Figure 1: the semantic mapping pipeline. The n original features f1 ... fn, each described by its values over the N training vectors x1 ... xN, are grouped into d semantic clusters c1 ... cd; the d x n projection matrix M = [mij] is built from this clustering, and the data vectors are mapped by yj = M xj for all j.]

Figure 1. Diagram of the semantic mapping method.

Initially, the features fi are semantically described by their frequencies in the N vectors of a training set, i.e., the vectors are treated as the attributes or dimensions that describe the original features. In text categorization this description has a direct interpretation, because the semantics or meaning of a term can be deduced by analyzing the context in which it is applied, i.e., the set of documents (or document vectors) in which it occurs [10]. The semantic description of the features corresponds to taking the transpose of the document-by-term matrix, in which each line represents a document vector. The features fi are then grouped into semantic clusters by training a self-organizing map (SOM). In SOM maps, similar training vectors are mapped to the same node or to neighboring nodes [11]; since co-occurring terms are represented by similar vectors, clusters of co-occurring terms are formed. In text categorization these clusters typically correspond to topics or subjects treated in the documents and probably contain semantically related terms. The resulting maps are called semantic maps. The number of nodes in the semantic map must be equal to the desired number of extracted features. After the training of the semantic map, each original feature is mapped to a fixed number k of nodes that best represent it, i.e., the k nodes whose model vectors are closest to the vector that describes the original feature.

Let n be the number of original features and d the number of extracted features. The projection matrix M is constructed with d lines and n columns, with mij equal to one if the original feature j was mapped to node i, and zero otherwise. The positions of the ones in the columns of the projection matrix indicate in which extracted features each original feature will participate. While in SRM the positions of the ones in each column of R are determined randomly, in semantic mapping the positions of the ones in each column of M are determined by the semantic clusters to which each original feature was mapped. The set of projection matrices generated by SM is a subset of those generated by SRM and RM; thus SM also approximately preserves the mutual similarities between the data vectors after projection to the reduced dimension. The mapping of the data vectors to the reduced dimension is done by multiplying the projection matrix M by each original vector. After the mapping, the generated vectors may be normalized to unit vectors.

The computational complexity of the SM method is O(nd(iN + k)): the complexity of constructing the semantic map with d units by the SOM algorithm from n vectors (the original features) with N dimensions (the number of document vectors in the training set) over i epochs, plus the complexity of constructing the mapping matrix with k ones in each column. This complexity is smaller than that of PCA, and it remains linear in the number of features of the original space, as in RM. The features extracted by SM are more representative of the content of the documents and better interpretable than those generated by random mapping. Thus, the representation of documents generated by SM is expected to improve the performance of machine learning algorithms relative to the one generated by RM. This better performance is measured empirically in the experiments reported in the next section.
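The sketch below illustrates the construction of M as described above. It uses MiniSom as a convenient stand-in for the batch SOM Toolbox training used in the experiments; the function name, parameters and the choice of MiniSom are our own assumptions, not part of the original method description:

```python
import numpy as np
from minisom import MiniSom  # our choice of SOM implementation; the paper used the SOM Toolbox batch map

def semantic_projection_matrix(X_train, map_rows, map_cols, k=2, random_seed=0):
    """Build the d x n projection matrix M of semantic mapping, with d = map_rows * map_cols.

    X_train: N x n document-term matrix (rows = training documents, columns = terms).
    """
    # Step 1: semantic description of the features = transpose of the document-term matrix,
    # so each original term is described by its occurrences in the N training documents.
    term_vectors = np.asarray(X_train, dtype=float).T      # shape (n, N)
    n, N = term_vectors.shape
    d = map_rows * map_cols                                 # number of extracted features

    # Step 2: semantic clustering of the features on a SOM (the "semantic map").
    som = MiniSom(map_rows, map_cols, N, sigma=2.0, learning_rate=0.5, random_seed=random_seed)
    som.train_batch(term_vectors, 30 * n)                   # roughly 30 passes over the term vectors

    # Step 3: map each term to its k closest nodes and set those rows of M to one.
    codebook = som.get_weights().reshape(d, N)              # one model vector per node
    M = np.zeros((d, n))
    for j in range(n):
        dist = np.linalg.norm(codebook - term_vectors[j], axis=1)
        M[np.argsort(dist)[:k], j] = 1.0
    return M

# Step 4: project and normalize the document vectors, y_j = M x_j:
# Y = X @ M.T
# Y /= np.linalg.norm(Y, axis=1, keepdims=True)
```

Here X_train plays the role of the N x n training document-term matrix, and the commented step 4 corresponds to the projection yj = M xj followed by length normalization.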

4. Experiments

This section presents the adopted methodology and the results of the experiments. The experiments consist of the application of semantic mapping (SM), sparse random mapping (SRM) and principal component analysis (PCA) to a text categorization problem. The performances achieved by SRM and PCA were used as references to evaluate the performance of SM.

4.1. Methodology

The performance of the projection methods was measured, for each dimension (100, 200, 300 and 400), by the mean classification error generated by five SOM maps in the categorization of the projected document vectors of a test set, after being trained with the corresponding projected document vectors of a training set. The classification error of a SOM map is the percentage of documents incorrectly classified when each map unit is labeled according to the category of the training document vectors that dominate the node. Each document is mapped to the map node with the closest model vector in terms of Euclidean distance, and the document vectors of the test set receive the category assigned to the node onto which they are mapped. SOM maps are used here as classifiers because of the low sensitivity of the SOM training algorithm to the distortions of similarity caused by random mapping [1]. These SOM maps are called document maps. The lengths of the projected document vectors were normalized, so that the direction of a vector reflects the contents of the document [1].

The categorized documents belong to the K1 collection [12]. This collection consists of 2340 Web pages classified into one of 20 news categories at Yahoo: Health, Business, Sports, Politics, Technology and 15 subclasses of Entertainment (without subcategory, art, cable, culture, film, industry, media, multimedia, music, online, people, review, stage, television, variety). The document vectors of the collection were built using the vector space model. These vectors were preprocessed by eliminating generic and non-informative terms [12]; the final dimension of the vectors was 2903 terms. After preprocessing, the document vectors were split randomly, per category, into halves for the training and test sets; each set contained 1170 document vectors. The categories were codified and associated with the document vectors as labels.

To measure the performance of SM and SRM as a function of the number of ones in each column, 30 projection matrices were generated for each method and each pair of dimension and number of ones. The number of ones in each column of the projection matrix was 1, 2, 3 or 5. In the case of SM, for each dimension, 30 semantic maps were generated, and from each of these, 4 mapping matrices with 1, 2, 3 or 5 ones in each column were generated. For SRM, in each dimension, 30 projection matrices with only one 1 in each column were first constructed, and from these matrices, other matrices were constructed by successively adding ones at random positions until reaching the necessary number of ones in each column.

The PCA method uses the SVD algorithm to extract the principal components of the correlation matrix of the features in the training set. The correlation matrix was calculated on a boolean feature-by-document matrix. The components are ordered in such a way that the first ones describe most of the variability of the data; thus, the last components can be discarded. Once the components were extracted, 4 projection matrices were built, one for each dimension, taking the first 100, 200, 300 and 400 components respectively.

The projection matrices generated by the three methods were applied to boolean document vectors, in which each position indicates the presence or absence of a given term in the document, thus forming the projected vectors in the reduced dimensions. The motivation for projecting boolean document vectors is to test the methods when the minimum of information is supplied. The projected vectors of the training and test sets were normalized and used, respectively, to construct the document maps and to evaluate their performance.

The algorithm used to train the SOM maps was the batch-map SOM [9], because it is fast and has few adjustable parameters. The SOM maps used to construct the semantic maps and the document maps had a rectangular structure with a hexagonal neighborhood to facilitate visualization. The Gaussian neighborhood function was used. For each topology, the initial neighborhood size was equal to half the number of nodes along the largest dimension of the map plus one. The final neighborhood size was always 1. The number of epochs of training was 10 in the rough phase and 20 in the fine-tuning phase. The number of epochs determines how mild the decrease of the neighborhood size will be, since it decreases linearly with the number of epochs. The dimensions of the document maps were 12x10 units (as suggested in the WEBSOM project [1]), with model vectors of 100, 200, 300 and 400 features. Because there is no prior knowledge about word clustering, the semantic maps had topologies as square as possible: 10x10, 20x10, 20x15 and 20x20, with model vectors of 1170 features. For all SOM map topologies, randomly initialized configurations were obtained using the som_randinit function of the SOM Toolbox.
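To illustrate the evaluation procedure described at the beginning of this subsection (labeling each document-map node with the dominant training category and measuring the percentage of misclassified test documents), a minimal sketch follows; the function names and data layout are assumptions of ours:

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import cdist

def label_nodes(codebook, Y_train, train_labels):
    """Label each document-map node with the dominant category of the training
    vectors mapped onto it (majority vote); None if no vector hits the node."""
    bmu = cdist(Y_train, codebook).argmin(axis=1)   # best-matching node per training document
    labels = {}
    for node in range(len(codebook)):
        cats = train_labels[bmu == node]
        labels[node] = Counter(cats).most_common(1)[0][0] if len(cats) else None
    return labels

def classification_error(codebook, node_labels, Y_test, test_labels):
    """Percentage of test documents whose nearest node carries a different label."""
    bmu = cdist(Y_test, codebook).argmin(axis=1)
    predicted = np.array([node_labels[b] for b in bmu])
    return 100.0 * np.mean(predicted != test_labels)
```

In the experiments above, codebook would hold the 120 model vectors of a 12x10 document map, and Y_train, Y_test the 1170 projected and normalized document vectors of the training and test sets.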

4.2. Results

The first step was to evaluate the number of ones needed in each column of the projection matrices generated by SRM and SM in order to minimize the classification errors on the test set. The pooled-variance (combined variance) t-test [13] was used to compare the performance of each method with different numbers of ones; it was applied to the mean and standard deviation of the classification errors achieved on the test set. In Tables 1 and 2 the P-values are codified in ranges following [14]: ">>" and "<<" denote a strongly significant difference, ">" and "<" a significant difference, and "~" no significant difference. Table 1 reports the t-test results over the number of ones for SM (dimension 200: <, >>, ~; dimension 300: >>; dimension 400: >>).

In Table 2, SRM generates a better representation of the documents in all dimensions when 2 ones are placed in each column, minimizing the classification errors. SRM is less sensitive than SM to the number of ones in the columns of the projection matrix, since in the majority of the comparisons the results were equivalent, a fact already expected given the purely random way in which SRM extracts features.

Table 2. Results of t-test over SRM number of ones.
Dimension   2-1   3-1   3-2   5-1   5-2   5-3
100          <     ~     ~     ~     ~     ~
200          <     ~     ~     ~     ~     ~
300          ~     >     ~
400          ~     ~     >     ~     >>    ~

Figure 2 shows the average classification error on the test set generated by SRM, SM and PCA as a function of the mapping dimension. Two ones were used in each column of the projection matrices of SRM and SM. The bars denote one standard deviation over 150 experiments for SRM and SM (combination of the 30 projection matrices with the 5 document maps) and over 5 experiments for PCA (combination of the projection matrix with the 5 document maps). Figure 2 shows that the performance of SM varies little, with the classification error decreasing as the projection dimension increases. The SRM classification error decreases significantly as the projection dimension increases. The performance of PCA also varies little: the classification error decreases as the projection dimension increases from 100 to 200 and then increases, because the principal components beyond 200 incorporate the variability of the noise. In contrast to SRM, SM preserves practically the same mutual similarity between document vectors across the different dimensions.

[Figure 2: mean classification error (%) of SRM, SM and PCA versus the dimensionality after mapping.]

Figure 2. Classification error as a function of the reduced dimension of the document vectors.

In Figure 2, PCA had the best performance, followed by SM, which generated classification errors smaller than SRM for all dimensions. The difference between the performances of PCA and SM is strongly significant but smaller than 10%; this fact makes SM a good alternative given the high computational cost of PCA. Table 3 shows the best results achieved by each method. The projection dimension 400 minimizes the average classification error for SRM and SM; for PCA, the dimension that minimizes the classification error is 200. All the differences between the performances of the methods are strongly significant.

Table 3. Best results generated by each method.
Method   Dimension   Train mean err (%)   Train std. dev.   Test mean err (%)   Test std. dev.
PCA      200         25.93                0.82              31.91               1.08
SM       400         31.46                0.96              38.46               1.43
SRM      400         43.69                1.72              51.72               2.13

Figure 3 shows the top-right region of a semantic map for the reduced dimension 100. In this map, each word was mapped to its two most representative nodes. For each node, the figure lists the three words whose Euclidean distance to the model vector of the node is smallest, together with the number of words mapped onto the node. In Figure 3, semantically related words are mapped onto the same node or onto neighboring nodes. For example, in the top-right region of the map the words "dow" of "Dow Jones", "nasdaq" of "Nasdaq" and "wall" of "Wall Street" appear in neighboring nodes, representing the stock exchanges; globally, the region around these nodes contains terms related to government (reforms, economy etc.) and commerce. The hardware companies "ibm" and "intel" were mapped onto a node in the middle-right region. In the same map, the top-left region has terms related to movies and television, and the middle-bottom region contains terms related to sports. The semantic map can be viewed as a thesaurus, where nodes represent semantic relations between terms.

Figure 3. Top-right region of a semantic map for reduced dimension 100.
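The significance codes in Tables 1 and 2 and the comparisons in Table 3 come from the pooled-variance t-test applied to the means and standard deviations of the classification errors. As an illustration only, the sketch below applies such a test (via SciPy, our choice of tool) to the PCA and SM test-set figures of Table 3, using the numbers of runs stated for Figure 2:

```python
from scipy.stats import ttest_ind_from_stats

# Test-set errors from Table 3: PCA (dimension 200) over 5 runs, SM (dimension 400) over 150 runs.
t, p = ttest_ind_from_stats(mean1=31.91, std1=1.08, nobs1=5,
                            mean2=38.46, std2=1.43, nobs2=150,
                            equal_var=True)  # pooled (combined) variance
print(f"t = {t:.2f}, P = {p:.3g}")  # a very small P-value indicates a strongly significant difference
```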

5. Conclusions

Theoretically and experimentally, the features extracted by the semantic mapping method (SM) show themselves to be more representative of the content of the documents and better interpretable than those generated by sparse random mapping (SRM). SM proved to be a viable alternative to PCA for the dimensionality reduction of high-dimensional data, due to its performance relatively close to that of PCA and a computational cost that, as in SRM, is linear in the number of features of the original space. The construction of the semantic maps, which is the most computationally expensive step of SM, can be made cheaper by using variations of the SOM training algorithm. One of these variations [15] generates the same results as the original SOM algorithm, but with a complexity that is a function of the number of non-null components in the data vectors rather than of the total number of original features. This is advantageous because document vectors in the original space are highly sparse.

As future work, we intend to test the SM method on term-frequency-based representations of the document vectors, and to modify SM to use weights attributed to each word, instead of simple ones, in each column, with the goal of improving performance.

6. References

[1] S. Kaski, "Dimensionality reduction by random mapping: Fast similarity computation for clustering", Proc. IJCNN'98 Int. Joint Conf. on Neural Networks, vol. 1, 1998, pp. 413-418.
[2] E. Wiener, J. O. Pedersen and A. S. Weigend, "A neural network approach to topic spotting", Proc. Symposium on Document Analysis and Information Retrieval (SDAIR'95), Las Vegas, Nevada, 1995, pp. 317-332.
[3] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[4] E. Bingham, J. Kuusisto and K. Lagus, "ICA and SOM in text document analysis", Proc. SIGIR'02, Tampere, Finland, August 11-15, 2002.
[5] F. Sebastiani, "Machine learning in automated text categorization", ACM Computing Surveys, vol. 34, no. 1, March 2002, pp. 1-47.
[6] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization", Proc. Fourteenth International Conference on Machine Learning (ICML'97), 1997, pp. 412-420.
[7] S. Deerwester, S. T. Dumais, G. W. Furnas and T. K. Landauer, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, vol. 41, 1990, pp. 391-407.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice-Hall, 1999.
[9] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero and A. Saarela, "Self organization of a massive document collection", IEEE Transactions on Neural Networks, vol. 11, no. 3, May 2000.
[10] G. Siolas and F. d'Alché-Buc, "Mixtures of probabilistic PCAs and Fisher kernels for word and document modeling", Artificial Neural Networks - ICANN 2002, Madrid, Spain, August 28-30, 2002, Lecture Notes in Computer Science 2415, Springer, 2002, pp. 769-776.
[11] X. Lin, D. Soergel and G. Marchionini, "A self-organizing semantic map for information retrieval", Proc. Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, IL, 1991, pp. 262-269.
[12] D. Boley, M. Gini, R. Gross, E. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher and J. Moore, "Partitioning-based clustering for web document categorization", Decision Support Systems, vol. 27, 1999, pp. 329-341.
[13] M. R. Spiegel, Schaum's Outline of Theory and Problems of Statistics, McGraw-Hill, 1961.
[14] Y. Yang and X. Liu, "A re-examination of text categorization methods", Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999, pp. 42-49.
[15] D. G. Roussinov and H. Chen, "A scalable self-organizing map algorithm for textual classification: A neural network approach to thesaurus generation", Communication and Cognition in Artificial Intelligence Journal (CC-AI), vol. 15, no. 1-2, 1998, pp. 81-111.