Symbolic Representation of Text Documents Using Multiple Kernel FCM

B.S. Harish, M.B. Revanasiddappa, and S.V. Aruna Kumar

Department of Information Science and Engineering, Sri Jayachamarajendra College of Engineering, Mysuru, India
[email protected], {revan.cr.is,arunkumarsv55}@gmail.com

Abstract. In this paper, we propose a novel method of representing text documents based on clustering of term frequency vectors. In order to cluster the term frequency vectors, we make use of Multiple Kernel Fuzzy C-Means (MKFCM). After clustering, the term frequency vectors of each cluster are used to form an interval-valued representation (symbolic representation) by the use of mean and standard deviation. Further, the interval-valued features are stored in a knowledge base as the representative of the cluster. To corroborate the efficacy of the proposed model, we conducted extensive experimentation on the standard Reuters-21578 and 20 Newsgroup datasets. We compared the classification accuracy achieved by the symbolic classifier with that of the existing Naive Bayes, KNN and SVM classifiers. The experimental results reveal that the classification accuracy achieved by the symbolic classifier is better than that of the other three classifiers.

Keywords: Classification · Text documents · Representation · Symbolic feature · Multiple Kernel FCM

1 Introduction

Digital information on the web is increasing day by day and most of it is in textual form. Text classification is one of the solutions for providing better results in information retrieval (IR) systems. Text classification (categorization) is the process of automatically classifying text documents into predefined categories. Basically, there are two approaches to classify text documents. The first approach is rule based: classification rules are defined manually and documents are classified according to these rules. The second approach is the machine learning approach, in which classification rules or equations are learned automatically from sample labeled documents. This class of approaches has much higher precision than rule based approaches. Therefore, machine learning based approaches are replacing rule based approaches for text classification [1]. A well classified corpus can facilitate document searching, filtering and navigating for both users and information retrieval tools. Text classification is used in a number of interesting fields such as search engines, automatic document indexing for information
retrieval systems, document filtering, spam filtering for emails, text mining, digital libraries and word sense disambiguation [2]. The major challenges of text classification are the high dimensionality of the feature space, the representation of documents, similarity measures between documents, preserving the semantic relationships in a document, and sparsity.

To tackle the above problems, a number of methods have been introduced in the literature. Before applying machine learning techniques to classify text documents, we need to transform the documents, which are typically strings of characters, into a suitable numerical representation. In the literature, many representation schemes such as Bag of Words (BOW), Vector Space Model (VSM), Universal Networking Language (UNL), N-Grams, Latent Semantic Indexing (LSI), Locality Preserving Indexing (LPI) and Regularized Locality Preserving Indexing (RLPI) have been applied for text classification. The Bag of Words (BOW) approach is the most widely adopted representation model in text classification. In this approach, a text is represented as a vector of word weights [3]. However, the BOW representation suffers from two limitations: it tends to break terms into their constituent words, and it treats synonymous words as independent features. The Vector Space Model (VSM) is an algebraic model that represents text documents as vectors of identifiers, such as index terms. A major limitation of VSM is its very high dimensionality, and the resulting vectors are very sparse. Universal Networking Language (UNL) represents a document in the form of a graph, with universal words as nodes and the relations between them as links [4]. This method requires the construction of a graph for every document and hence is unwieldy for applications where large numbers of documents are present. Hotho et al. [5] proposed an ontology representation for documents to keep the semantic relationships between the terms in a document. The ontology model preserves the domain knowledge of a term appearing in a document. However, automatic ontology construction is a difficult task due to the lack of a structured knowledge base. Cavnar [6] used sequences of symbols (bytes, characters or words) called N-Grams, extracted from a long string in a document. In the N-Gram scheme, it is very difficult to decide the number of grams to be considered for effective document representation. Another approach [7] uses multi-word terms as vector components to represent a document, but this method requires sophisticated automatic term extraction algorithms to extract the terms from a document. Latent Semantic Indexing (LSI) is based on Singular Value Decomposition (SVD), which projects the document vectors into a subspace. LSI finds the best subspace approximation to the original document space in the sense of minimizing the global reconstruction error [8]. In other words, LSI seeks to uncover the most representative features rather than the most discriminative features for document representation. Therefore, LSI might not be optimal for discriminating documents with different semantics. To discover the discriminating structure of a document space, He et al. [9] proposed Locality Preserving Indexing (LPI). LPI can have more discriminating power than LSI even though LPI is also unsupervised. An assumption behind LPI is that documents close to each other should have similar representations. The computational cost of LPI is very high because it involves the eigen-decomposition of two
dense matrices. It is almost infeasible to apply LPI to very large datasets. Hence, to reduce the computational complexity of LPI, Cai et al. [10] proposed Regularized Locality Preserving Indexing (RLPI). RLPI is fundamentally based on LPI. Specifically, RLPI decomposes the LPI problem into a graph embedding problem and a regularized least squares problem. This modification avoids the eigen-decomposition of dense matrices and can significantly reduce both the time and memory cost of the computation. However, RLPI fails to preserve the intraclass variations among documents of different classes.

In text classification, clustering techniques have been used as an alternative representation scheme, which automatically groups text documents into a list of meaningful categories. Several clustering approaches have been proposed in the literature to address the high dimensionality problem. Baker and McCallum [11] proposed a distributional clustering method, which clusters words into groups based on the distribution of class labels associated with each word. Bekkerman et al. [12] proposed a word-clustering representation model based on the information bottleneck method, which generates a compact and efficient representation for documents. However, distributional clustering and word-clustering methods are agglomerative in nature, resulting in sub-optimal word clusters and high computational cost. To overcome these drawbacks, Dhillon et al. [13] proposed a new information theoretic divisive algorithm for feature clustering. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text documents.

Fuzzy clustering, also called soft clustering, is used in text classification to improve accuracy and performance. In the literature, many researchers have worked on fuzzy clustering techniques for reducing the high dimensionality of features. A fuzzy set is a class of objects with a continuum of grades of membership [14]. Generally, fuzzy membership functions are defined in terms of the numerical values of an underlying crisp attribute, and membership ranges between 0 and 1. To reduce the high dimensionality of the feature vector, Anilkumarreddy et al. [15] proposed a fuzzy based incremental feature clustering method. Jiang et al. [16] proposed a fuzzy self-constructing feature clustering (FFC) algorithm, an incremental clustering approach to reduce the dimensionality of the features in text classification. In this algorithm, each cluster is characterized by a membership function with a statistical mean and deviation. Features that are similar to each other are grouped into the same cluster, and if a word is not similar to any existing cluster, a new cluster is created. Puri [17] proposed the Fuzzy Similarity Based Concept Mining Model (FSCMM), which mainly reduces feature dimensionality and removes ambiguity at each level to achieve high classifier performance. Carvalho [18] proposed a fuzzy c-means clustering algorithm for symbolic interval data. It aims to provide a fuzzy partition of a set of patterns and a corresponding representative (prototype) for each cluster by optimizing an adequacy criterion based on suitable squared Euclidean distances between vectors of intervals.

In the literature, most of the clustering based classification methods use the conventional term-document matrix representation. Since the value of the term
frequency differs from document to document within a class, preserving these variations has been difficult. To overcome this problem, Guru et al. [19] proposed a symbolic representation model in which text documents are represented by interval-valued symbolic features. Harish et al. [20] extended the work presented in [19] by applying adaptive FCM. In this method, to preserve the intraclass variation, multiple clusters are created for each class using an adaptive FCM algorithm. Unfortunately, FCM is effective only for linear data. Important variants of FCM are kernel FCM (KFCM) and Multiple Kernel FCM (MKFCM) [21], which are widely used for clustering non-linear data. The result of KFCM depends on the selection of the right kernel function and, unfortunately, for many applications selecting the right kernel function is not easy. This problem is overcome by Multiple Kernel FCM (MKFCM) [22]. MKFCM is based on KFCM and uses a composite kernel function. MKFCM gives flexibility in the selection of kernel functions; in addition, it allows combining information from multiple heterogeneous or homogeneous sources in the kernel space. Huang et al. [22] applied multiple kernel fuzzy clustering to text clustering. This method uses four kernel functions (Euclidean distance, cosine similarity, Jaccard coefficient and Pearson correlation coefficient) to calculate the pairwise distance between two documents.

In this paper, we propose a novel method for representing text documents based on clustering of term frequency vectors. In order to cluster the term frequency vectors, we make use of Multiple Kernel FCM (MKFCM). After clustering, the term frequency vectors of each cluster are used to form an interval-valued representation (symbolic representation) by the use of mean and standard deviation. Further, the interval-valued features are stored in the knowledge base as representatives of the clusters. In document classification, the features of a test document are compared with the corresponding interval-valued features stored in the knowledge base, and the class label is assigned based on the degree of belongingness.

The rest of the paper is organized as follows: the proposed method is presented in Sect. 2. Details of the datasets used, the experimental settings and the results are presented in Sect. 3. The paper is concluded along with future work in Sect. 4.

2 Proposed Method

The proposed method has two stages: (i) Multiple Kernel FCM (MKFCM) based representation and (ii) Document Classification.

2.1 Multiple Kernel FCM (MKFCM) Based Representation

In the proposed system, documents are initially represented by a term-document matrix. To reduce the dimensionality of the term-document matrix, we employ the Regularized Locality Preserving Indexing (RLPI) technique. Unfortunately, the RLPI representation still exhibits considerable intraclass variations. Thus, to overcome this problem, we propose a Multiple Kernel FCM (MKFCM) clustering based representation method.
In the proposed method, we capture the intraclass variations through MKFCM clustering and represent each cluster by an interval-valued feature vector. The training documents are clustered using the MKFCM algorithm. Let d_1, d_2, d_3, ..., d_N be a set of N training documents and F_k = {f_{k1}, f_{k2}, ..., f_{km}} be the set of m features of each document. The objective function of the MKFCM algorithm is as follows:

J(w, U, V) = \sum_{i=1}^{N} \sum_{c=1}^{C} u_{ic}^{m} \, \| \Phi(d_i) - \Phi(v_c) \|^{2}    (1)

where u_{ic} is the membership value of the i-th document in the c-th cluster, v_c is the c-th cluster center, \Phi is an implicit nonlinear map, and

\| \Phi(d_i) - \Phi(v_c) \|^{2} = K_L(d_i, d_i) + K_L(v_c, v_c) - 2 K_L(d_i, v_c)    (2)

where K_L(d_i, v_c) is the composite multiple kernel function, which is defined as

K_L(d_i, v_c) = \sum_{l=1}^{L} w_l K_l(d_i, v_c)    (3)

Here w = (w_1, w_2, w_3, ..., w_L) is a vector of kernel weights, subject to w_1 + w_2 + w_3 + \dots + w_L = 1 and w_l \geq 0 \; \forall l. In the proposed method, we use four kernels to calculate the pairwise distance between a document and a cluster center: Euclidean distance, cosine similarity, Jaccard coefficient and Pearson correlation coefficient. The main objective of MKFCM is to find the combination of weights w, memberships U and cluster centers V which minimizes the objective function in Eq. 1. To obtain the membership value u_{ic}, we solve Eq. 1 using a Lagrange multiplier. The membership becomes:

u_{ic} = \frac{1}{\sum_{c'=1}^{C} \left( D_{ic}^{2} / D_{ic'}^{2} \right)^{\frac{1}{m-1}}}    (4)

where D_{ic}^{2} = \| \Phi(d_i) - \Phi(v_c) \|^{2}. The weights w are obtained by solving Eq. 1 using a Lagrange multiplier, which gives:

w_l = \frac{1/\beta_l}{1/\beta_1 + 1/\beta_2 + \dots + 1/\beta_L}    (5)

where the coefficient \beta_l is given by:

\beta_l = \sum_{i=1}^{N} \sum_{c=1}^{C} u_{ic}^{m} \alpha_{icl}    (6)

where the coefficient \alpha_{icl} is given by:

\alpha_{icl} = K_l(d_i, d_i) - 2 \sum_{j=1}^{N} u_{jc} K_l(d_i, d_j) + \sum_{j=1}^{N} \sum_{k=1}^{N} u_{jc} u_{kc} K_l(d_j, d_k)    (7)
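To make the composite kernel concrete, the following Python sketch (an illustration, not the authors' implementation) computes the four measures named above for a pair of term-frequency vectors and combines them as in Eq. 3. The paper does not specify how each measure is turned into a kernel value, so the Gaussian transform of the Euclidean distance and the generalized (real-valued) Jaccard coefficient used here are assumptions:

import numpy as np

def euclidean_kernel(x, y, gamma=1.0):
    # Assumption: the Euclidean distance is turned into a kernel value
    # via a Gaussian (RBF) transform; gamma is a free parameter.
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def cosine_kernel(x, y, eps=1e-12):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

def jaccard_kernel(x, y, eps=1e-12):
    # Generalized (real-valued) Jaccard coefficient for non-negative
    # term-frequency vectors.
    return float(np.sum(np.minimum(x, y)) / (np.sum(np.maximum(x, y)) + eps))

def pearson_kernel(x, y, eps=1e-12):
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc) + eps))

KERNELS = [euclidean_kernel, cosine_kernel, jaccard_kernel, pearson_kernel]

def composite_kernel(x, y, weights):
    # Eq. 3: K_L(x, y) = sum_l w_l * K_l(x, y), with sum_l w_l = 1, w_l >= 0.
    return sum(w * k(x, y) for w, k in zip(weights, KERNELS))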


The training documents are clustered using the MKFCM clustering method. We then capture the intraclass variations of each feature in the form of an interval value [f_{ck}^{-}, f_{ck}^{+}], where f_{ck}^{-} = \mu_{ck} - \sigma_{ck} and f_{ck}^{+} = \mu_{ck} + \sigma_{ck}. Here \mu_{ck} is the mean and \sigma_{ck} is the standard deviation of the k-th feature of the documents present in the c-th cluster. The interval represents the lower and upper limits of the feature values in the document cluster. The reference document for a cluster C_c is then formed by representing each feature in the form of an interval value, i.e.

RF_c = \{ [f_{c1}^{-}, f_{c1}^{+}], [f_{c2}^{-}, f_{c2}^{+}], \dots, [f_{cm}^{-}, f_{cm}^{+}] \}    (8)

These interval-valued features are stored in the knowledge base as the representative of the c-th cluster. Thus, the knowledge base holds one symbolic vector per cluster for each class.
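As a small illustration of Eq. 8, the sketch below (a hypothetical helper, not taken from the paper) derives the interval-valued symbolic vector of one cluster from the feature vectors of the documents assigned to it:

import numpy as np

def cluster_symbolic_vector(cluster_docs):
    # cluster_docs: (n_docs, m) array of reduced feature vectors belonging
    # to one cluster. Returns an (m, 2) array of [lower, upper] interval
    # features, i.e. [mu - sigma, mu + sigma] per feature, as in Eq. 8.
    mu = cluster_docs.mean(axis=0)
    sigma = cluster_docs.std(axis=0)
    return np.stack([mu - sigma, mu + sigma], axis=1)

# Toy example: a cluster of three documents with four features.
docs = np.array([[0.2, 1.0, 0.0, 0.5],
                 [0.4, 0.8, 0.1, 0.6],
                 [0.3, 0.9, 0.0, 0.4]])
RF_c = cluster_symbolic_vector(docs)   # one [f-, f+] row per feature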

Algorithm 1. Proposed Method

Data: RLPI features of N documents with m features each, a set of kernel functions K_l, the number of clusters C, the fuzzification degree m and the convergence criterion \varepsilon
Result: Symbolic feature vectors RF

Initialize the membership matrix U
repeat
    Calculate the normalized membership values as: \hat{u}_{ic} = u_{ic}^{m} / \sum_{i=1}^{N} u_{ic}^{m}
    Calculate the coefficients \alpha_{icl} using Eq. 7
    Calculate the coefficients \beta_l using Eq. 6
    Update the weights w_l using Eq. 5
    Calculate the distances as: D_{ic}^{2} = \sum_{l=1}^{L} \alpha_{icl} w_l^{2}
    Update the membership values using Eq. 4
until \| U(t) - U(t-1) \| < \varepsilon
Calculate \mu_{ck} and \sigma_{ck} of each cluster C_c
Represent each cluster by the symbolic vector RF_c as shown in Eq. 8
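The following Python sketch mirrors the update steps of Algorithm 1 (Eqs. 4-7). It is an illustrative reading of the paper, not the authors' code: it assumes the L document-by-document kernel matrices have already been computed (for instance with the kernel functions sketched above), and numerical safeguards are kept to a minimum.

import numpy as np

def mkfcm(kernel_mats, n_clusters, m=2.0, eps=1e-4, max_iter=100, seed=0):
    # kernel_mats: list of L precomputed (N, N) kernel matrices K_l(d_i, d_j).
    # Returns the membership matrix U (N, C) and the kernel weights w (L,).
    L = len(kernel_mats)
    N = kernel_mats[0].shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((N, n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # memberships of each document sum to 1
    w = np.full(L, 1.0 / L)                      # start with equal kernel weights

    for _ in range(max_iter):
        U_prev = U.copy()
        U_hat = U ** m
        U_hat /= U_hat.sum(axis=0, keepdims=True)    # normalized memberships (Algorithm 1)

        # Eq. 7: alpha_icl = K_l(i,i) - 2 sum_j u_jc K_l(i,j) + sum_j sum_k u_jc u_kc K_l(j,k)
        alpha = np.empty((N, n_clusters, L))
        for l, K in enumerate(kernel_mats):
            diag = np.diag(K)[:, None]                          # K_l(d_i, d_i)
            cross = K @ U_hat                                   # sum_j u_jc K_l(d_i, d_j)
            within = np.einsum('jc,jk,kc->c', U_hat, K, U_hat)  # sum_jk u_jc u_kc K_l(d_j, d_k)
            alpha[:, :, l] = diag - 2.0 * cross + within[None, :]

        beta = np.einsum('ic,icl->l', U ** m, alpha)        # Eq. 6
        w = (1.0 / beta) / np.sum(1.0 / beta)               # Eq. 5

        D2 = np.einsum('icl,l->ic', alpha, w ** 2) + 1e-12  # distance step of Algorithm 1

        # Eq. 4: u_ic = 1 / sum_c' (D_ic^2 / D_ic'^2)^(1/(m-1))
        ratio = (D2[:, :, None] / D2[:, None, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)

        if np.linalg.norm(U - U_prev) < eps:
            break
    return U, w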

2.2 Document Classification

In the proposed system, for document classification we consider a test document described by a set of m crisp feature values. The features of the test document are compared with the corresponding interval-valued features stored in the knowledge base. The number of features of the test document that fall inside the corresponding intervals defines the belongingness. To decide the class label of the test document, we calculate the degree of belongingness B_c as follows:

B_c = \sum_{k=1}^{m} C\left(f_{tk}, [f_{ck}^{-}, f_{ck}^{+}]\right)    (9)

where

C\left(f_{tk}, [f_{ck}^{-}, f_{ck}^{+}]\right) = \begin{cases} 1 & \text{if } f_{tk} \geq f_{ck}^{-} \text{ and } f_{tk} \leq f_{ck}^{+} \\ 0 & \text{otherwise} \end{cases}    (10)

Each feature of the test document that falls into the respective feature interval of the reference cluster contributes a value of 1 towards B_c. We compute the B_c value for the clusters of all classes and assign to the test document the label of the class with the highest B_c.
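A minimal Python sketch of this decision rule (Eqs. 9 and 10), assuming the knowledge base is a list of (class_label, interval_vector) pairs built as in Eq. 8:

import numpy as np

def degree_of_belongingness(test_doc, intervals):
    # test_doc: (m,) crisp feature vector; intervals: (m, 2) array of
    # [lower, upper] bounds for one cluster. Counts the features that
    # fall inside their interval (Eqs. 9 and 10).
    lower, upper = intervals[:, 0], intervals[:, 1]
    return int(np.sum((test_doc >= lower) & (test_doc <= upper)))

def classify(test_doc, knowledge_base):
    # knowledge_base: list of (class_label, intervals) pairs, one per cluster.
    # The test document gets the label of the cluster with the highest B_c.
    best_label, best_B = None, -1
    for label, intervals in knowledge_base:
        B = degree_of_belongingness(test_doc, intervals)
        if B > best_B:
            best_label, best_B = label, B
    return best_label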

3 Experimental Setup

3.1 Dataset

For experimentation we have used the classic Reuters-21578 collection as the benchmark dataset. Originally, Reuters-21578 contains 21578 documents in 135 categories. However, in our experiments, we discarded the documents with multiple category labels and selected the ten largest categories. The ten largest classes of the Reuters-21578 collection, with the numbers of documents in the training and test sets, are as follows: earn (2877 vs 1087), trade (369 vs 119), acquisitions (1650 vs 179), interest (347 vs 131), money-fx (538 vs 179), ship (197 vs 89), grain (433 vs 149), wheat (212 vs 71), crude (389 vs 189), corn (182 vs 56). The second dataset is the standard 20 Newsgroup Large collection. It is one of the standard benchmark datasets used by many text classification research groups. It contains 20000 documents categorized into 20 classes. For our experimentation, we considered the term-document matrix constructed for 20 Newsgroup.

3.2 Experimentation

In this section, we present the results of the experiments conducted to demonstrate the effectiveness of the proposed method on both datasets, viz. Reuters-21578 and 20 Newsgroup. We used four kernels to calculate the pairwise distance between two documents: Euclidean distance (KFC_ed), cosine similarity (KFC_cs), Jaccard coefficient (KFC_jc) and Pearson correlation coefficient (KFC_pcc). We set the fuzzification degree m to 2 and the convergence criterion to 0.0001. We compared the classification accuracy achieved by the symbolic classifier with that of the existing Naive Bayes (NB), KNN and SVM classifiers. During experimentation, we conducted two different sets of experiments. In the first set, we used 50% of the documents of each class of a dataset to create the training set and the remaining 50% of the documents for testing. In the second set, the numbers of training and testing documents are in the ratio 60:40. Both experiments are repeated 5 times by choosing the training samples randomly. As a measure of goodness of the proposed method, we computed the percentage classification accuracy.
The average classification accuracy over the 5 trials is presented in Tables 1 and 2. For both experiments, we randomly selected the training documents to create the symbolic feature vectors for each class. It can be observed from Tables 1 and 2 that the proposed symbolic representation achieves better results. The reason behind this is that, for a given dataset, we do not know in advance which kernel will perform well. When we combine the kernels, they contribute more to the clustering and therefore improve the results.

Table 1. Comparative analysis of the proposed method with other classifiers on the Reuters-21578 dataset

Classifier            Training vs Testing   KFC_ed   KFC_cs   KFC_jc   KFC_pcc   MKFCM
NB                    50:50                 62.85    66.90    64.55    63.15     68.10
                      60:40                 63.10    67.50    65.10    64.40     69.55
KNN                   50:50                 63.55    68.85    64.10    63.50     69.80
                      60:40                 64.10    69.10    64.65    63.80     70.45
SVM                   50:50                 64.10    69.00    63.55    64.60     70.15
                      60:40                 65.55    69.85    65.10    63.55     70.65
Symbolic classifier   50:50                 67.40    68.55    64.55    63.40     69.65
                      60:40                 68.60    68.90    65.10    63.85     71.55

Table 2. Comparative analysis of the proposed method with other classifiers on the 20 Newsgroup dataset

Classifier            Training vs Testing   KFC_ed   KFC_cs   KFC_jc   KFC_pcc   MKFCM
NB                    50:50                 63.50    68.60    67.20    66.15     69.10
                      60:40                 64.25    69.35    67.15    67.00     70.20
KNN                   50:50                 64.50    69.10    63.45    64.25     70.25
                      60:40                 65.75    70.60    64.20    64.95     71.60
SVM                   50:50                 65.35    71.00    64.20    63.55     71.95
                      60:40                 66.10    71.95    64.80    63.90     72.40
Symbolic classifier   50:50                 65.20    72.10    65.35    64.35     73.20
                      60:40                 66.60    73.45    65.60    65.90     76.85
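For concreteness, the evaluation protocol of Sect. 3.2 can be sketched as follows. This is an illustrative outline only: build_kb and classify_fn are hypothetical stand-ins for the training (Sect. 2.1) and classification (Sect. 2.2) steps, and the per-class stratification of the splits is omitted for brevity.

import numpy as np

def evaluate(documents, labels, train_fraction, build_kb, classify_fn, n_trials=5, seed=0):
    # Repeat a random train/test split (50:50 or 60:40) n_trials times and
    # average the percentage classification accuracy, as in Sect. 3.2.
    # build_kb and classify_fn are hypothetical callables supplied by the user.
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_trials):
        idx = rng.permutation(len(documents))
        n_train = int(train_fraction * len(documents))
        train_idx, test_idx = idx[:n_train], idx[n_train:]
        kb = build_kb(documents[train_idx], labels[train_idx])
        predictions = np.array([classify_fn(d, kb) for d in documents[test_idx]])
        accuracies.append(np.mean(predictions == labels[test_idx]))
    return 100.0 * np.mean(accuracies)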

4 Conclusion

In this paper, a new text document representation is presented. A text document is represented by the use of symbolic features. The main contribution of this paper is the introduction of the Multiple Kernel FCM (MKFCM) clustering algorithm to form an interval-valued representation. To corroborate the efficacy of the proposed model, we conducted extensive experimentation on standard text datasets. The experimental results reveal that the symbolic representation using feature clustering techniques achieves better classification accuracy than the existing cluster based symbolic representation approaches. In future work, we intend to introduce symbolic feature selection methods to further reduce the dimensionality of the feature matrix. Our future research will also emphasize enhancing the ability and performance of our model by considering other parameters to capture intraclass variations effectively.

References

1. Nedungadi, P., Harikumar, H., Ramesh, M.: A high performance hybrid algorithm for text classification. In: 2014 Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT), pp. 118–123. IEEE (2014)
2. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34(1), 1–47 (2002)
3. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)
4. Choudhary, B., Bhattacharyya, P.: Text clustering using universal networking language representation. In: Proceedings of the Eleventh International World Wide Web Conference, pp. 1–7 (2002)
5. Hotho, A., Maedche, A., Staab, S.: Ontology-based text document clustering 16, 48–54 (2002)
6. Cavnar, W.: Using an n-gram-based document representation with a vector processing retrieval model, pp. 269–269. NIST Special Publication SP (1995)
7. Milios, E., Zhang, Y., He, B., Dong, L.: Automatic term extraction and document similarity in special text corpora. In: Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics, pp. 275–284. Citeseer (2003)
8. Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. JASIS 41(6), 391–407 (1990)
9. He, X., Cai, D., Liu, H., Ma, W.Y.: Locality preserving indexing for document representation. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 96–103. ACM (2004)
10. Cai, D., He, X., Zhang, W.V., Han, J.: Regularized locality preserving indexing via spectral regression. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 741–750. ACM (2007)
11. Baker, L.D., McCallum, A.K.: Distributional clustering of words for text classification. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 96–103. ACM (1998)
12. Bekkerman, R., El-Yaniv, R., Tishby, N., Winter, Y.: Distributional word clusters vs. words for text categorization. J. Mach. Learn. Res. 3, 1183–1208 (2003)
13. Dhillon, I.S., Mallela, S., Kumar, R.: A divisive information theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265–1287 (2003)
14. Zadeh, L.A.: Similarity relations and fuzzy orderings. Inf. Sci. 3(2), 177–200 (1971)
15. Anilkumarreddy, T., Madhukumar, B., Chandrakumar, K.: Classification of text using fuzzy based incremental feature clustering algorithm. Int. J. Adv. Res. Comput. Eng. Technol. 1(5), 313–318 (2012)
16. Jiang, J.Y., Liou, R.J., Lee, S.J.: A fuzzy self-constructing feature clustering algorithm for text classification. IEEE Trans. Knowl. Data Eng. 23(3), 335–349 (2011)
17. Puri, S.: A fuzzy similarity based concept mining model for text classification. Int. J. Adv. Comput. Sci. Appl. 2(11), 115–121 (2012)
18. Carvalho, F.D.A.: Fuzzy c-means clustering methods for symbolic interval data. Pattern Recogn. Lett. 28(4), 423–437 (2007)
19. Guru, D.S., Harish, B.S., Manjunath, S.: Symbolic representation of text documents. In: Proceedings of the Third Annual ACM Bangalore Conference, pp. 1–8. ACM (2010)
20. Harish, B.S., Prasad, B., Udayasri, B.: Classification of text documents using adaptive fuzzy c-means clustering. In: Thampi, S.M., Abraham, A., Pal, S.K., Rodriguez, J.M.C. (eds.) Recent Advances in Intelligent Informatics. AISC, vol. 235, pp. 205–214. Springer, Heidelberg (2014)
21. Müller, K.R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 12(2), 181–201 (2001)
22. Huang, H.C., Chuang, Y.Y., Chen, C.S.: Multiple kernel fuzzy clustering. IEEE Trans. Fuzzy Syst. 20(1), 120–134 (2012)
