Support Vector Machine for String Vectors

Malrey Lee¹ and Taeho Jo²

¹ School of Electronics & Information Engineering, ChonBuk National University, 664-14, 1Ga, DeokJin-Dong, JeonJu, ChonBuk, 561-756, South Korea
[email protected]
² SITE, University of Ottawa, Room 5010, SITE, 800 King-Edward Ave, Ottawa, Ontario, Canada K1N6N5
Abstract. This paper proposes an alternative version of SVM that uses string vectors as its input data. A string vector is defined as a finite ordered set of words. Almost all machine learning algorithms, including traditional versions of SVM, take numerical vectors as their input, so applying them to classification or clustering requires encoding the raw data into numerical vectors. In text categorization and text clustering, representing documents as numerical vectors leads to two main problems: huge dimensionality and sparse distribution. Although traditional versions of SVM tolerate huge dimensionality, they are not robust to sparse distribution in training and classification. To avoid this problem, this research proposes another version of SVM that uses string vectors as an alternative structured representation to numerical vectors. To apply the proposed version of SVM to text categorization, documents are encoded into string vectors by defining conditions on words as features and positioning the words satisfying those conditions in each string vector. Before the proposed version of SVM can be used for text categorization, a word-by-word similarity matrix, which defines semantic similarities between words, must be built from a given training corpus. Each semantic similarity, given as an element of the matrix, is computed from the collocations of two words within the same documents over the corpus. The inner product of two string vectors, used here as the proposed kernel function, is the average semantic similarity over their elements. To validate the proposed version of SVM, it is compared with a traditional version of SVM using a linear kernel function in text categorization on two test beds.
1 Introduction

SVM (Support Vector Machine) refers to a kernel-based machine learning algorithm in which a vector space is mapped into another vector space where the training examples are linearly separable, and two hyperplanes corresponding to the two classes are defined, with the maximal margin between them, as the boundaries of classification. In SVM, the mapped space is called the feature space, and a function for computing inner products of two examples mapped into the feature space is called a kernel function [4]. The examples involved in defining these hyperplanes are called support vectors [2]. Since SVM is
tolerant to the huge dimensionality of input vectors, it has been applied very widely to text categorization, where documents are represented as high dimensional numerical vectors. In 1998, Joachims first applied SVM to text categorization and showed that SVM performed better than NB [7]. In 1999, Drucker et al. applied SVM to spam email filtering, a real-world task built on text categorization, and showed that SVM was again the better approach [3]. In 2002, Sebastiani surveyed more than ten machine learning based approaches to text categorization and concluded that SVM was the best among them [11]. In 2004, Park and Zhang proposed training SVM with labeled and unlabeled documents by co-training for text categorization [9]. SVM has been applied not only to text categorization but also to other classification problems, including image classification and protein classification [1,2]. This previous literature shows that SVM is a very good approach to text categorization.

However, representing documents as numerical vectors for text categorization leads not only to huge dimensionality but also to sparse distribution. Although SVM is tolerant to the former, it is not robust to the latter, like other traditional machine learning algorithms such as NB, KNN, and traditional neural networks. The reason is that when numerical vectors become very sparse, they lose their discriminative power for classification. To address the two problems, Lodhi et al. proposed a new kernel function, called the string kernel, in 2002 [8]. The string kernel is applicable to texts themselves, without representing documents in a structured format. Its advantages are that it is applicable to documents independently of the natural language in which they are written, and that it implicitly captures semantic relations between two documents. However, it requires defining approximately 20,000 substrings, and computing the value of an inner product between two raw texts with the string kernel takes a very long time. The experiments of Lodhi et al. showed that their kernel failed to improve the performance of SVM in text categorization on the test bed Reuter 21578 [8].

In using SVM for text categorization, this paper proposes that documents be represented as string vectors instead of numerical vectors. A string vector refers to a finite ordered set of words; it is described in section 2.2. In other words, words replace the numerical values in a string vector. String vectors were first used for representing documents when Jo proposed a new neural network, called NTC (Neural Text Categorizer), in 2000 [5]. Recently, we found that string vectors are applicable not only to NTC or NTSO (Neural Text Self Organizer) [6], but also to other traditional machine learning algorithms, such as SVM, NB, and KNN, if they are modified properly. Therefore, this paper proposes a modified version of SVM in which a kernel function for string vectors is employed. Since a kernel function computes an inner product of two mapped vectors, and an inner product implies a similarity between two entities whether they are mapped or not, this paper defines the kernel function for string vectors as their semantic similarity. Before the kernel function can be used, a word-by-word similarity matrix must be built from a particular corpus.
The matrix defines semantic similarities between words as normalized values, and the semantic similarities are computed from the collocations of words within the same documents.
In the experiments of this paper, the proposed version of SVM is compared with its traditional version and NB on two test beds, NewsPage.com and Reuter 21578. The experiments in section 4 show that the proposed version of SVM performs best among the three approaches on both test beds. On the second test bed, the proposed version of SVM is evaluated as the best in macro-averaged F1, though not in micro-averaged F1. This means that the proposed version works very well on sparse categories, for which only a small number of training documents is given.

This paper consists of five sections. Section 2 describes the processes of encoding documents into two structured formats, numerical vectors and string vectors. Section 3 explains the basic concept of SVM and presents kernel functions for numerical vectors and string vectors. Section 4 presents experimental results in which the proposed version of SVM is compared with its traditional version and NB on two test beds. Section 5 discusses the significance of this research and future work to improve it, as the conclusion of this paper.
2 Text Representations

Since texts are unstructured data in natural language and computers cannot process them directly, they must be encoded into structured data. This section describes two ways of doing so: one for the traditional version of SVM and NB, and the other for the proposed version of SVM.

2.1 Numerical Vectors

This subsection describes the traditional way of encoding texts, in which texts are represented as numerical vectors. Mainstream machine learning algorithms such as NB (Naïve Bayes), SVM (Support Vector Machine), MLP (Multi-Layer Perceptron), and the Kohonen Network use numerical vectors as their input, so this encoding is necessary for applying them to text categorization and text clustering. The features given as attributes of the numerical vectors are words from a collection of texts. If all words were used as features, the dimension would exceed 10,000, which is not feasible for text categorization or text clustering. Previous research has suggested several feature selection methods, such as mutual information, chi-square, frequency based methods, and information theory based methods [11].

There are three ways to define feature values in numerical vectors representing texts. The first is to use a binary value for each word, indicating its presence or absence in the text. The second is to use the frequency of each word in each text as its value, so that each element is an integer. The third is to use the weight of each word as the feature value, computed by equation (1):

$$w_{it} = f_{it} \left( \log_2 N - \log_2 d_t + 1 \right) \qquad (1)$$

where $w_{it}$ is the weight of the word $t$ in the text $i$, $f_{it}$ is the frequency of the word $t$ in the text $i$, $N$ is the number of texts in the corpus, and $d_t$ is the number of texts containing the word $t$.
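As a concrete illustration, equation (1) can be computed as in the following minimal Python sketch; the whitespace tokenizer and the function name word_weights are assumptions made for this example, not part of the original method.

```python
import math
from collections import Counter

def word_weights(texts):
    """Compute w_it = f_it * (log2(N) - log2(d_t) + 1) for each word t in each text i."""
    N = len(texts)
    tokenized = [text.lower().split() for text in texts]  # naive tokenizer (assumption)
    doc_freq = Counter()                                  # d_t: number of texts containing word t
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        freq = Counter(tokens)                            # f_it: frequency of word t in text i
        weights.append({t: f * (math.log2(N) - math.log2(doc_freq[t]) + 1)
                        for t, f in freq.items()})
    return weights
```

For example, a word occurring 3 times in one of N = 100 texts and appearing in d_t = 10 texts receives the weight 3 · (log₂100 − log₂10 + 1) ≈ 12.97.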
2.2 String Vectors

This subsection describes string vectors as representations of documents for applying the proposed version of SVM to text categorization. A string vector refers to a finite ordered set of words, where each word may be a mono-gram, bi-gram, or n-gram. A string vector is an ordered set of words of fixed size, independent of the length of the given document. The string vector representing the document $d_i$ is denoted by $d_i^s = [w_{i1}, w_{i2}, \ldots, w_{in}]$, where $n$ is the dimension of the string vector $d_i^s$. An arbitrary example of a four dimensional string vector is [computer, software, hardware, machine].

From the given document, a bag of words is generated by indexing the document, as an intermediate representation for a string vector. Figure 1 illustrates the process of mapping a bag of words into a string vector. The dimension of the string vector is fixed, and properties of words, such as the most frequent word in the document, a random word in the first sentence, the word with the highest weight, or the most frequent word in the first paragraph, are defined as the features of that vector. For simplicity and convenience in implementing the automatic encoding of documents into string vectors, we defined the features as the most frequent word, the second most frequent word, the third most frequent word, the fourth most frequent word, and so on, in the document. In general, the dimension of a string vector is smaller than the size of the bag of words. For each feature, its corresponding word is filled in to build the string vector; a minimal sketch of this encoding follows the figure.
Fig. 1. The process of mapping a bag of words into a string vector
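The sketch below encodes a document into an n-dimensional string vector under the frequency-based feature definition described above; whitespace tokenization, the omission of stop-word removal, and the `<none>` padding token are assumptions made for the example.

```python
from collections import Counter

def to_string_vector(text, n=10):
    """Encode a document as an n-dimensional string vector:
    the k-th element is the k-th most frequent word in the document."""
    tokens = text.lower().split()   # naive tokenizer; stop-word removal omitted (assumption)
    ranked = [w for w, _ in Counter(tokens).most_common(n)]
    # pad with a placeholder if the document has fewer than n distinct words (assumption)
    return ranked + ["<none>"] * (n - len(ranked))
```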
Before performing operations on string vectors, we need to build, from a particular corpus, a similarity matrix: a word-by-word matrix whose entries are the similarities of all possible pairs of words. The idea of building a similarity matrix is based on the research of [2]. However, since the word-by-word matrix defined in that literature does not provide normalized similarity values of words, and the similarity between two
identical words varies depending on their frequencies and distributions over a corpus, this research defines similarities between words differently from the previous research. In the word-by-word matrix of this research, the words are denoted by $w_1, w_2, \ldots, w_N$, where $N$ is the number of unique words in the given corpus, excluding stop words. The matrix is the $N \times N$ matrix

$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1N} \\ s_{21} & s_{22} & \cdots & s_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ s_{N1} & s_{N2} & \cdots & s_{NN} \end{bmatrix}$$

Its elements $s_{ij}$ $(1 \le i, j \le N)$ are defined as the semantic similarity $s_{ij} = sim(w_i, w_j)$ between the two words $w_i$ and $w_j$. Each element $s_{ij}$ is computed from the collocation of the two words $w_i$ and $w_j$ within the same documents, using equation (2):

$$s_{ij} = sim(w_i, w_j) = \frac{\sum_{d_r \in D_i \cap D_j} \left( \phi_r(w_i) + \phi_r(w_j) \right)}{\sum_{d_p \in D_i} \phi_p(w_i) + \sum_{d_q \in D_j} \phi_q(w_j)} \qquad (2)$$

where $D_i$ is the set of documents containing the word $w_i$, $D_j$ is the set of documents containing the word $w_j$, and $\phi_o(\cdot)$ is a function of a word specific to a particular document $d_o$, meaning its occurrence, frequency, or weight (computed by a particular equation) in the document $d_o$.
Once a similarity matrix has been built from a corpus by computing the similarities of all possible pairs of words with equation (2), we can compute the similarity between two string vectors, denoted by $d_i^s = [w_{i1}, w_{i2}, \ldots, w_{in}]$ and $d_j^s = [w_{j1}, w_{j2}, \ldots, w_{jn}]$. The similarity $sim(w_{ik}, w_{jk})$ between two words $w_{ik}$ and $w_{jk}$ is obtained by looking up the entry whose row corresponds to the word $w_{ik}$ and whose column corresponds to the word $w_{jk}$, or the reverse, in the similarity matrix. The similarity between the two string vectors $d_i^s$ and $d_j^s$ is computed by equation (3):

$$sim(d_i, d_j) \approx sim(d_i^s, d_j^s) = \frac{1}{n} \sum_{k=1}^{n} sim(w_{ik}, w_{jk}) \qquad (3)$$
As equation (3) shows, the similarity between two string vectors is the average of the similarities of their elements; a sketch of this computation appears below.
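Continuing the earlier sketches, equation (3) in code, reusing the sim function built above:

```python
def string_vector_similarity(d_a, d_b, sim):
    """Equation (3): the average of element-wise word similarities
    of two equal-length string vectors."""
    assert len(d_a) == len(d_b), "string vectors must share one dimension n"
    return sum(sim(wa, wb) for wa, wb in zip(d_a, d_b)) / len(d_a)
```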
3 Support Vector Machine

This section gives a brief overview of SVM and presents kernel functions for numerical vectors and string vectors. SVM is known to be a good approach to classification problems, in both theory and practice [4], and it has been applied to various classification tasks, including text categorization [2]. Equation (4) models the classification performed by an SVM trained on the training examples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$:

$$f(\mathbf{x}) = sign\left( \sum_{i=1}^{m} \alpha_i \, k(\mathbf{x}, \mathbf{x}_i) + b \right) \qquad (4)$$

where $\alpha_i$ is the Lagrange multiplier corresponding to the training example $\mathbf{x}_i$, $k(\cdot,\cdot)$ is a kernel function which computes the inner product between two mapped examples without mapping them explicitly, and $b$ is the bias of the SVM. The function defined in equation (4) outputs 1 if the value of $\sum_{i=1}^{m} \alpha_i \, k(\mathbf{x}, \mathbf{x}_i) + b$ is greater than zero; otherwise, it outputs -1. The learning of SVM is the process of optimizing the Lagrange multipliers $\alpha_1, \alpha_2, \ldots, \alpha_m$ on the given training examples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$. Support vectors are the training examples whose Lagrange multipliers are non-zero. For optimizing the Lagrange multipliers, the SMO (Sequential Minimal Optimization) algorithm is widely used; detailed descriptions are given in [10] and [2]. A literal sketch of equation (4) follows.
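For illustration, equation (4) as code once the multipliers have been optimized; the paper's formulation, which does not carry explicit class labels inside the sum, is followed literally, and the argument names are assumptions.

```python
def svm_classify(x, train_examples, alphas, bias, kernel):
    """Equation (4): f(x) = sign( sum_i alpha_i * k(x, x_i) + b ).
    Only support vectors (non-zero alpha_i) contribute to the sum."""
    value = sum(a * kernel(x, xi) for a, xi in zip(alphas, train_examples)) + bias
    return 1 if value > 0 else -1
```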
3.1 Kernel Functions for Numerical Vectors

This subsection presents three typical kernel functions for numerical vectors: the linear function, the polynomial function, and the Gaussian function. The linear function is defined in equation (5); it is the inner product of the two vectors in their original space plus an arbitrary constant $c$:

$$k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j + c \qquad (5)$$

The polynomial function is defined in equation (6); it raises the linear function of equation (5) to a power $p$:

$$k(\mathbf{x}_i, \mathbf{x}_j) = \left( (\mathbf{x}_i \cdot \mathbf{x}_j) + c \right)^p \qquad (6)$$
The third kernel function, defined in equation (7), can compute inner products between two vectors mapped into a vector space of even infinite dimensionality:

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\sigma} \right) \qquad (7)$$
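Minimal sketches of the three kernels of equations (5)-(7) over NumPy vectors; the default parameter values are illustrative assumptions.

```python
import numpy as np

def linear_kernel(xi, xj, c=0.0):
    """Equation (5): inner product in the original space plus a constant."""
    return float(np.dot(xi, xj)) + c

def polynomial_kernel(xi, xj, c=1.0, p=2):
    """Equation (6): the linear kernel raised to the power p."""
    return (float(np.dot(xi, xj)) + c) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    """Equation (7): implicitly maps into an infinite dimensional space."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma))
```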
Among these kernel functions, the linear function of equation (5) is the most widely used in text categorization, and the choice makes little difference to the performance of SVM [10].

3.2 Proposed Kernel Function

This subsection describes the proposed kernel function for string vectors. Suppose that two documents are represented as the two string vectors $d_i^s = [w_{i1}, w_{i2}, \ldots, w_{in}]$ and $d_j^s = [w_{j1}, w_{j2}, \ldots, w_{jn}]$. The proposed kernel function is defined in equation (8):
$$k(d_i^s, d_j^s) = sim(d_i^s, d_j^s) \qquad (8)$$
The proposed kernel function is the semantic similarity between two string vectors, as expressed in equation (3). The proposed version of SVM is therefore expressed in equation (9):
$$f(d^s) = sign\left( \sum_{i=1}^{m} \alpha_i \, k(d^s, d_i^s) + b \right) = sign\left( \sum_{i=1}^{m} \alpha_i \, sim(d^s, d_i^s) + b \right) \qquad (9)$$

where $d^s$ is the string vector being classified and $d_1^s, d_2^s, \ldots, d_m^s$ are the training string vectors.
The Lagrange multipliers are optimized with the SMO algorithm in the proposed version as well, just as in the traditional version of SVM; one way to combine the proposed kernel with an SMO-based solver is sketched below.
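The paper does not specify an implementation, but one minimal way to train an SMO-based SVM with the proposed kernel is to precompute the Gram matrix of equation (8) and hand it to an off-the-shelf solver such as scikit-learn's SVC with kernel='precomputed'. This pairing, and all names below, are assumptions for the sketch; string_vector_similarity and sim come from the earlier sketches.

```python
import numpy as np
from sklearn.svm import SVC

def gram_matrix(svecs_a, svecs_b, sim):
    """Gram matrix of the proposed kernel (equation (8)):
    K[i, j] = sim(d_i^s, d_j^s), the average word similarity of equation (3)."""
    return np.array([[string_vector_similarity(a, b, sim) for b in svecs_b]
                     for a in svecs_a])

# Hypothetical usage, assuming train_svecs, train_labels, test_svecs, and sim exist:
#   clf = SVC(kernel="precomputed", C=4.0)   # capacity 4.0, as set in section 4
#   clf.fit(gram_matrix(train_svecs, train_svecs, sim), train_labels)
#   predicted = clf.predict(gram_matrix(test_svecs, train_svecs, sim))
```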
4 Experimental Results

This section presents the results of evaluating NB and the two versions of SVM on the two test beds, NewsPage.com and Reuter 21578. NB and the traditional version of SVM use 500 dimensional numerical vectors representing documents for their learning and classification, while the proposed version of SVM uses 10 dimensional string vectors; the input data of the proposed version is thus far smaller than that of the others. The experiment is partitioned into two sets by test bed: the first and second sets evaluate the three approaches on NewsPage.com and Reuter 21578, respectively.

In this experiment, documents are represented as string vectors for the proposed version of SVM and as numerical vectors for the others. The dimensions of the numerical vectors and the string vectors are set to 500 and 10,
respectively, in both sets of the experiment. For encoding documents into numerical vectors, the 500 most frequent words in the training set of each test bed are selected as features; feature values are binary, indicating the absence or presence of each word in a given document. For encoding documents into string vectors, the 10 most frequent words of each document are selected as the values of its string vector, with the features being the most frequent word, the second most frequent word, the third most frequent word, and so on. SVM has a capacity parameter, the maximum value of its Lagrange multipliers. In this experiment it is set arbitrarily to 4.0, since its value has very little influence on the performance of SVM.

4.1 NewsPage.com

The first set of the experiment evaluates the three approaches on the test bed NewsPage.com. This test bed consists of 1,200 news articles in plain text, built by copying and pasting news articles from the web site www.newspage.com, after which the test bed is named. Table 1 shows the predefined categories, the number of documents in each category, and the partition of the test bed into training set and test set. As shown in table 1, the ratio of training set to test set is 7:3.

Table 1. Training Set and Test Set of NewsPage.com
Category Name   Training Set   Test Set   #Documents
Business        280            120        400
Health          140            60         200
Law             70             30         100
Internet        210            90         300
Sports          140            60         200
Total           840            360        1200
The task of text categorization on this test bed is decomposed into five binary classification problems, one per category. In each binary classification problem, a classifier decides whether an unseen document belongs to its corresponding category or not. Table 2 shows the composition of the training set for each of the predefined categories; 'positive' denotes documents belonging to the corresponding category, while 'negative' denotes documents that do not. For each category's training set, all documents not belonging to the category are allocated to the negative class. For each test set, as many negative documents are allocated as there are positive ones, given in the third column of table 1.
Table 2. The Allocation of Positive and Negative Classes in the Training Set of Each Category

Category Name   Positive   Negative   Total
Business        280        560        840
Health          140        700        840
Law             70         770        840
Internet        210        630        840
Sports          140        700        840
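Both experiment sets are scored with micro-averaged and macro-averaged F1 over these per-category binary problems. As a reference for how the two averages differ, a minimal sketch assuming per-category (tp, fp, fn) counts are available:

```python
def micro_macro_f1(confusions):
    """confusions: list of (tp, fp, fn) tuples, one per category.
    Micro-averaging pools the counts over all categories, so large categories
    dominate; macro-averaging averages per-category F1, weighting every
    category equally, so sparse categories matter as much as large ones."""
    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    tp = sum(c[0] for c in confusions)
    fp = sum(c[1] for c in confusions)
    fn = sum(c[2] for c in confusions)
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in confusions) / len(confusions)
    return micro, macro
```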
Figure 2 presents the results of evaluating NB and the two versions of SVM on the first test bed, NewsPage.com. On the x-axis of the graph in figure 2, the left group shows micro-averaged F1 and the right group macro-averaged F1, and each bar within a group corresponds to one of NB and the two versions of SVM; the y-axis gives the value of micro- or macro-averaged F1. In figure 2, 'SVM-trad' denotes the traditional version of SVM and 'SVM-new' the proposed version. As figure 2 illustrates, the proposed version of SVM has the best performance in text categorization among the three approaches, while the traditional version of SVM performs no better than NB.
[Figure 2: bar chart of micro-averaged (left group) and macro-averaged (right group) F1, with one bar each for SVM-trad, SVM-new, and NB; y-axis from 0 to 0.7]

Fig. 2. Results of evaluating NB and two versions of SVM on NewsPage.com
4.2 Reuter 21578

The second set of the experiment evaluates the three classifiers on the test bed Reuter 21578, a standard test bed in the field of text categorization. In this experiment set, the ten largest categories are selected. Table 3 shows the ten selected categories and the number of training and test documents in each. The partition of this test bed into training set and test set follows the ModApte split, the standard partition of Reuter 21578 for evaluating text classifiers [11].
Table 3. Partition of Training Set and Test Set in Reuter 21578

Category Name   Training Set   Test Set   #Documents
Acq             1452           672        2124
Corn            152            57         209
Crude           328            203        531
Earn            2536           954        3490
Grain           361            162        523
Interest        296            135        431
Money-Fx        553            246        799
Ship            176            87         263
Trade           335            160        495
Wheat           173            76         249
The number of documents per category varies greatly, as shown in table 3.

Figure 3 illustrates the results of evaluating NB and the two versions of SVM on the test bed Reuter 21578. Unlike the previous experiment set, here the traditional version of SVM works slightly better than NB. In this experiment set, the proposed version of SVM is evaluated as the worst approach by micro-averaged F1 but as clearly the best by macro-averaged F1, as shown on the left and right sides of figure 3. This means that the proposed SVM works very well on sparse categories, each of which is very small.
[Figure 3: bar chart of micro-averaged (left group) and macro-averaged (right group) F1, with one bar each for SVM-trad, SVM-new, and NB; y-axis from 0 to 0.7]

Fig. 3. Results of evaluating NB and two versions of SVM on Reuter 21578
These two experiment sets validate the performance of the proposed version of SVM. Note that the proposed version used only ten dimensional string vectors as its input data, while the others used 500 dimensional numerical vectors. In light of this, even where the proposed version is merely comparable to the others rather than better, it is the more recommendable approach.
However, the proposed version requires building a similarity matrix from a particular corpus before it can be applied to text categorization. Although building the similarity matrix takes a long time, the matrix can be reused indefinitely once it is built.
5 Conclusion

The significance of this research is that it proposes a practical version of SVM for text categorization. The proposed version of SVM outperforms its traditional version and NB, despite its far smaller input size. Note that although the traditional version of SVM tolerates the huge dimensionality of numerical vectors, it is not robust to their sparse distribution, like other traditional machine learning algorithms. This research shows that representing documents as string vectors is more practical than representing them as numerical vectors for text categorization tasks.

A weakness of SVM, including the proposed version, is that it applies only to binary classification problems. As equations (4) and (9) show, SVM classifies any entity into one of two classes, positive or negative. To apply it to a multi-class classification problem, we must decompose the problem into several binary classification problems. Machine learning algorithms other than SVM, such as KNN, NB, and back propagation, handle multi-class classification problems without such decomposition; modifying SVM to be applicable to multi-class problems without decomposition is a necessary next step. In addition to the kernel function described in section 3.2, more kernel functions for string vectors could be defined; the current research defined only one. In further research, we plan to define other kernel functions for string vectors and compare them with one another.
Acknowledgements

This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment) (IITA-2005-C1090-0502-0023).
References

1. Cristianini, N., Shawe-Taylor, J.: Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, London (2000)
2. Cristianini, N., Shawe-Taylor, J., Lodhi, H.: Latent Semantic Kernels. Journal of Intelligent Information Systems, 18(2-3) (2002) 127-152
3. Drucker, H., Wu, D., Vapnik, V. N.: Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10(5) (1999) 1048-1054
4. Hearst, M.: Support Vector Machines. IEEE Intelligent Systems, 13(4) (1998) 18-28
5. Jo, T.: Neural Text Categorizer: A New Model of Neural Networks for Text Categorization. The Proceedings of the 7th International Conference on Neural Information Processing, Beijing, China (2000) 280-285
6. Jo, T., Japkowicz, N.: Text Clustering Using NTSO. The Proceedings of the International Joint Conference on Neural Networks, Vancouver, BC (2006) 558-563
7. Joachims, T.: A Statistical Learning Model of Text Classification for Support Vector Machines. The Proceedings of the 24th Annual International ACM SIGIR (1998) 128-136
8. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text Classification with String Kernels. Journal of Machine Learning Research, 2(2) (2002) 419-444
9. Park, S., Zhang, B.: Co-trained Support Vector Machines for Large Scale Unstructured Document Classification Using Unlabeled Data and Syntactic Information. Information Processing and Management, 40(3) (2004) 421-439
10. Platt, J. C.: Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical Report MSR-TR-98-14 (1998)
11. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1) (2002) 1-47