WEIGHTED KERNEL FUNCTIONS FOR SVM LEARNING IN STRING DOMAINS: A DISTANCE FUNCTION VIEWPOINT

BRAM VANSCHOENWINKEL∗, FENG LIU, BERNARD MANDERICK

Computational Modeling Lab, Department of Informatics, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium
EMAIL: {bvschoen,fengliu,bmanderi}@vub.ac.be

Abstract: This paper extends the idea of weighted distance functions to kernels and support vector machines. We focus on applications that rely on sliding a window over a sequence of string data. For this type of problem it is argued that a symbolic, context-based representation of the data should be preferred over a continuous, real-valued format, as this is a much more intuitive setting for working with (weighted) distance functions. It is shown how a weighted string distance can be decomposed and subsequently used in different kernel functions, and how these kernel functions correspond to inner products between real vectors. As a case study, named entity recognition is used, with information gain ratio as the weighting scheme.

Keywords: Support Vector Machines, Kernel Functions, Metrics, Information Gain Ratio, Named Entity Recognition

1. Introduction

In support vector machine (SVM) learning the data is mapped non-linearly from the original input space X to a high-dimensional feature space F and subsequently separated by a maximum-margin hyperplane in that space F. By making use of the kernel trick, the mapping to F can stay implicit and we can avoid working in the high-dimensional space. Moreover, because the mapping to F is non-linear, the decision boundary, which is linear in F, corresponds to a non-linear decision boundary in X.

∗ Author funded by a doctoral grant of the Institute for the Advancement of Scientific and Technological Research in Flanders (IWT).

One of the most important design decisions in SVM learning is the choice of the kernel function K, because the hyperplane is defined completely by inner products between vectors in F, calculated through the kernel function K. Moreover, K takes vectors from the input space X and directly calculates inner products in F without having to represent or even know the exact form of these vectors, hence the implicit mapping and the computational benefit [2]. In the light of the above it is not hard to see that the way in which K is calculated is crucial for the success of the classification process. Notice that an inner product is in fact one of the simplest similarity measures between vectors, as it gives much information about the position of these vectors in relation to each other. In that sense the learning process can benefit a lot from the use of special-purpose similarity or dissimilarity measures in the calculation of K [13, 15, 16]. However, incorporating such knowledge in a kernel function is not trivial, as a kernel function has to satisfy a number of properties that result directly from the definition of the inner product, i.e. the function K has to be positive definite (PD). This paper concentrates on the use of SVMs on string data. Applying SVMs to such data involves a number of issues that need to be addressed, i.e. SVMs are defined on real vectors and not on string data. Generally speaking, two approaches exist: i) the transformation approach, i.e. transform the discrete data to real vectors, and ii) the direct approach, i.e. define kernel functions that work on string data but calculate real inner products in F. The transformation approach has been successfully applied to the classification of large texts and is in that context generally known as the bag-of-words approach [5]. For other applications, where classification is done more at the word level, and where a word that has to be classified (the focus word) is represented by a context of p words in front of and q words after the focus word, the data is transformed to a real format in an order-preserving way. We will refer to this approach as the orthonormal vector approach [14, 4]. Finally, an example of the direct approach is given by a class of

string data classification problems that relies on more complex concepts like substrings, spectra, inexact matching, substitutions etc., but we will not go into detail about this here [9, 6, 7]. This paper focuses on classification problems at the word level as described above, but in contrast to what is commonly done we will not make use of the transformation approach and orthonormal vectors; instead we will make use of the direct approach. It will be argued that it is better to work directly on the contexts instead of on a transformed high-dimensional format, because in this way it is much easier to incorporate special-purpose (dis)similarity measures into the kernel function, as such measures are defined on the string data itself and not on a high-dimensional representation of that data. This approach will be referred to as the context-based approach. Section (2) describes the class of problems we will consider here and how to represent them as contexts, orthonormal vectors and context vectors. Subsequently, Section (3) shows how context-based kernel functions derived from a distance function defined on contexts can be constructed, and how these kernels correspond to the orthonormal vector approach for well-chosen values of the non-zero components of the orthonormal vectors. At the same time it will also be shown that the context-based approach is a more intuitive way to incorporate special-purpose (dis)similarity measures. More specifically, in this paper we describe a more general form of the polynomial information gain kernel introduced in [16]. Next, Section (4) gives some experimental results on a natural language task called language independent named entity recognition and finally, Section (5) gives a conclusion to this work.

2. Definition and Representation of the Problem

This section formalizes the type of string data classification problems considered in this paper. Section (2.1) describes how instances are formed by sliding a window over a sequence of strings and subsequently, Section (2.2) describes how these instances can be represented as real and discrete, symbolic vectors respectively.

2.1. Sliding Windows and Contexts

Consider a collection S of sequences s. Every sequence s consists of an ordered succession of symbols, i.e. s = s_{k_0} . . . s_{k_{|s|−1}} with |s| the length of the sequence and with s_{k_i} ∈ D, a set (henceforth called a dictionary) of symbols indexed according to k_i ∈ {0, . . . , n}, with |D| = n the cardinality of the dictionary D and i = 0, . . . , |s| − 1. Contexts are now formed by sliding a window over the sequences s ∈ S, i.e. for every sequence s a set of instances I(s) containing |s| contexts with a window size r = (p + q + 1) is constructed as follows:

    I(s) = { s[(i − p) : (i + q)] | 0 ≤ i ≤ |s| − 1 }

with p the size of the left context, q the size of the right context and with s[i : j] = s_{k_i} . . . s_{k_j} the subsequence of symbols from index i to j in the sequence s. The total set of contexts is now formed by taking the union of all the I(s), i.e. I(S) = ∪_{s∈S} I(s). Notice that for subsequences with indices i < 0 and j > |s| − 1 the corresponding positions are filled with the special symbol ‘−’, which can be considered as the empty symbol. Also note that this symbol is assigned the index 0 in the dictionary D. In the following we give an example of this in the setting of language independent named entity recognition (LINER).

Example 1. In LINER the purpose is to distinguish between different types of named entities. Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Consider the following sentence:

[B-PER Wolff] , currently a journalist in [B-LOC Argentina] , played with [B-PER Del I-PER Bosque] in the final years of the seventies in [B-ORG Real I-ORG Madrid] .

Here we recognize 5 types of named entities (B-PER, I-PER, B-LOC, B-ORG and I-ORG); all other words are tagged as O (outside named entity). Instances for such classification problems are often composed of the word for which we want to determine the entity class and a context of a number of words before and after the word itself. Furthermore, sometimes additional features, like the position of a word in a sentence, are used, but we will not consider these features here. For simplicity we will use a context of 2 words on the left and the right of the focus word (i.e. a window size of 5). In our setting we now have:

1. The dictionary D is the set of all English words and names of persons, organizations etc.

2. The set of sequences S is a set of sentences, i.e. we consider a sentence as being a sequence with the words as components. Notice that in this example there is only one sequence in S as there is only one sentence.

3. For the above sentence we have the following sequence s = Wolff , currently a journalist in Argentina , played with Del Bosque in the final years of the seventies in Real Madrid . , with |s| = 23.

Next, the set of contexts is defined as:

    I(S) = ∪_{s∈S} I(s) = I(s) = { s[(i − 2) : (i + 2)] | i = 0, . . . , 22 }
         = { s[−2 : 2], s[−1 : 3], s[0 : 4], . . . , s[18 : 22], s[19 : 23], s[20 : 24] }
         = { ( − − Wolff , currently ), ( − Wolff , currently a ), ( Wolff , currently a journalist ), . . . ,
             ( seventies in Real Madrid . ), ( in Real Madrid . − ), ( Real Madrid . − − ) }

Finally, in many cases one would like to use additional features next to the symbols in the sequences themselves. An example of such an additional feature, in the case of LINER, are part-of-speech tags (see Section (4)). The possible values the new features can adopt are added to the dictionary D and indexed accordingly. The new features themselves can be added anywhere in the context as long as they are put on corresponding positions in all contexts. The simplest way is to just add them at the end of the contexts as they are formed by sliding the window over the original sequences. In this sense we talk about annotated contexts, although we do not consider them in this example.
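As an illustration of this construction (a minimal sketch with hypothetical function names, not code from the paper), the contexts of Example 1 can be generated as follows:

```python
def contexts(sequence, p=2, q=2, pad="-"):
    """Build I(s): one context of size r = p + q + 1 per position i,
    padding positions outside the sequence with the empty symbol '-'."""
    n = len(sequence)
    windows = []
    for i in range(n):
        window = tuple(sequence[j] if 0 <= j < n else pad
                       for j in range(i - p, i + q + 1))
        windows.append(window)
    return windows

sentence = ("Wolff , currently a journalist in Argentina , played with Del Bosque "
            "in the final years of the seventies in Real Madrid .").split()
I_s = contexts(sentence)      # |s| = 23 contexts, window size 5
print(I_s[0])                 # ('-', '-', 'Wolff', ',', 'currently')
print(I_s[-1])                # ('Real', 'Madrid', '.', '-', '-')
```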

Next, different ways to represent contexts, in a way suitable for an SVM to work with them, will be described. For this purpose we will show different ways to represent the contexts as vectors x, x′ ∈ X.

2.2. Different Ways to Represent Contexts

We will make an important distinction between reasoning about contexts and computing with contexts, i.e. sometimes a certain representation of a problem is easy to reason about while it is not efficient to do calculations with, or the other way around, i.e. a representation can be difficult to reason about while it is very efficient to do calculations with. A well-known approach to representing contexts as vectors suitable for SVM learning is to encode them with orthonormal vectors. An orthonormal vector is a vector of length n representing a symbol from D. An orthonormal vector has all zero components except for the component corresponding to the index in D of the symbol it represents; in total there are n orthonormal vectors, i.e. one for each symbol in D. Usually the only non-zero component takes the value 1 (although other values are also possible, as we will see in Section (3.1)). Complete vectors are now formed by concatenating all orthonormal vectors corresponding to the symbols in the sequence; the dimensionality of such a complete vector is n ∗ r with only r non-zero components. Consider the following example:

Example 2. For the set S of Example 1 we have a set of 19 tokens, i.e. n = 19, and a window size r = 5. Now assume that the tokens in D with |D| = 19 are indexed according to their order of occurrence in S, i.e. s1 = Wolff, s2 = ‘,’, etc. Now we have the following:

    Wolff = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    ,     = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    ...
    .     = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)

Complete instances are now formed by concatenating the above vectors according to the succession of the different tokens in the order in which they appear in the sequence. Note that the empty symbol ‘−’ is encoded by the all-zero vector, so in total we have n + 1 orthonormal vectors. In general such vectors will be represented in a sparse format, i.e. only the components with a non-zero value are written down, accompanied by the index of that component. The sparse representation allows for much more compact vectors, as the dimensionality of the vectors in this example is n ∗ r = 95 with only a maximum of r components being non-zero. Therefore, calculating with such vectors, represented in a sparse format, is very efficient. Consider the following example:

Example 3. Considering the set of sequences S with the corresponding set of instances I(S) from Example 1 we have a real input space (which is binary in this particular case) X ⊂ R^{n∗r} with vectors x, x′ containing only r non-zero components that can be represented in a sparse format as follows:

    X = { ( 39:1, 59:1, 79:1 ), ( 20:1, 40:1, 60:1, 80:1 ), ( 1:1, 21:1, 41:1, 61:1, 81:1 ), . . . ,
          ( 16:1, 25:1, 55:1, 75:1, 95:1 ), ( 6:1, 36:1, 56:1, 76:1 ), ( 17:1, 37:1, 57:1 ) }

Reasoning about such vectors, e.g. about the similarity between orthonormal vectors, is not easy because it is not directly observable what context they represent. Moreover, the similarity measures one would like to use in the kernel function are defined on contexts and not on orthonormal representations of such contexts. For these reasons we believe that it is better to reason about contexts at the level of the symbols themselves, i.e. in the way they are represented in Example 1. However, calculating with such context vectors is not efficient at all, because the different symbols of the context have to be checked for equality and the complexity of comparing 2 strings is proportional to the length of the longest string in the comparison. Note that one of the most important requirements for a kernel is that it be computationally efficient, as it has to be calculated for all pairs of vectors in the training set. For these reasons we will represent the symbols by their index k_i in the dictionary D.
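Both representations can be sketched as follows (a sketch with hypothetical helper names; the dictionary is indexed by order of occurrence as in Example 2, with the empty symbol at index 0):

```python
def build_dictionary(sequence):
    """Index 0 is reserved for the empty symbol '-'; the other symbols are
    numbered by order of first occurrence, as in Example 2."""
    index = {"-": 0}
    for token in sequence:
        if token not in index:
            index[token] = len(index)
    return index

def one_hot_sparse(context, index, n):
    """Sparse orthonormal encoding (Example 3): component pos*n + k is 1.0,
    with pos the 0-based position and k the dictionary index; empty symbols
    (k = 0) contribute nothing."""
    return {pos * n + index[tok]: 1.0
            for pos, tok in enumerate(context) if index[tok] != 0}

def index_vector(context, index):
    """Index encoding (Example 4): position -> dictionary index, zeros omitted."""
    return {pos + 1: index[tok] for pos, tok in enumerate(context) if index[tok] != 0}

sentence = ("Wolff , currently a journalist in Argentina , played with Del Bosque "
            "in the final years of the seventies in Real Madrid .").split()
D = build_dictionary(sentence)
n = len(D) - 1                                  # 19 non-empty symbols
ctx = ("-", "-", "Wolff", ",", "currently")
print(one_hot_sparse(ctx, D, n))                # {39: 1.0, 59: 1.0, 79: 1.0}
print(index_vector(ctx, D))                     # {3: 1, 4: 2, 5: 3}
```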

In this way we only have to compare integers and not strings. Consider the following example:

Example 4. Once again considering the set of sequences S with the corresponding set of contexts I(S) from Example 1, we have a real space (which contains vectors with only integer components) X ⊂ R^r as follows:

    X = { ( 3:1, 4:2, 5:3 ), ( 2:1, 3:2, 4:3, 5:4 ), ( 1:1, 2:2, 3:3, 4:4, 5:5 ), . . . ,
          ( 1:16, 2:6, 3:17, 4:18, 5:19 ), ( 1:6, 2:17, 3:18, 4:19 ), ( 1:17, 2:18, 3:19 ) }

Note that the empty symbol ‘−’ is assigned the index 0 and that here we also make use of the sparse vector format; therefore it is not represented. For LINER the computational benefit obtained in this way is substantial. Moreover, calculating with context vectors represented in this way becomes as efficient as calculating with orthonormal vectors, while at the same time it stays possible to apply similarity measures defined on the contexts themselves.

3. Context Kernels from a Distance Function Viewpoint

From now on we will be reasoning about vectors x, x′ ∈ X with X ⊆ S^l the input space of the SVM, with S^l the space of all annotated contexts, with r the window size as before and l the total length of the annotated contexts with components x_i, x_i′ ∈ D. Note that for simplicity we will talk about contexts instead of annotated contexts, as this does not matter for what follows. When doing calculations, and when talking about implementation, the contexts will for reasons of efficiency be represented by the indices of the symbols in D, as discussed in the previous section. In this way we obtain a framework to reason about the contexts themselves while at the same time taking advantage of the computational benefit of representing them by their index in D at the level of the actual computations. Next, orthonormal vectors will be denoted by x̃, x̃′ ∈ X̃ with X̃ ⊆ R^{l∗n}, with n the cardinality of D as before.

3.1. A Weighted Distance Between Contexts

Consider the following weighted distance function that works on contexts x and x′ ∈ X ⊆ S^l, with x, x′ and D with |D| = n as before:

    d_w : X × X → R^+ : d_w(x, x′) = Σ_{i=0}^{l−1} w_i d(x_i, x_i′)        (1)

with d(x_i, x_i′) = 0 if x_i = x_i′ ≠ ‘−’, else 1. Notice that in order for d_w to always be positive and make sense, all the w_i should be greater than 0. Next, when doing calculations, we represent the contexts by their indices in D as discussed before. Therefore, consider the following alternative expression for the terms in the sum on the RHS of Equation (1):

    w_i d(x_i, x_i′) = w_i − δ(k_i, l_i)        (2)

with k_i, l_i ∈ {0, . . . , n} the indices of x_i and x_i′ in D and with

    δ : {0, . . . , n} × {0, . . . , n} → {w_i, 0} : δ(k_i, l_i) = w_i if k_i = l_i ≠ 0, else 0        (3)
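As a sketch (names hypothetical), Equations (1)–(3) translate directly into a few lines of Python on index-coded contexts; in practice the weights w_i would come from a weighting scheme such as the information gain ratio of Section (4):

```python
def delta(k, l, w):
    """Equation (3): w if the indices match on a non-empty symbol (index 0 is '-'),
    else 0."""
    return w if (k == l and k != 0) else 0.0

def weighted_distance(x, x_prime, weights):
    """Equation (1), computed via Equation (2) as sum_i (w_i - delta(k_i, l_i))."""
    return sum(w - delta(k, l, w) for k, l, w in zip(x, x_prime, weights))

# Two index-coded contexts of length l = 5; 0 encodes the empty symbol '-'.
x       = [0, 0, 1, 2, 3]         # ( -  -  Wolff  ,  currently )
x_prime = [0, 1, 2, 3, 4]         # ( -  Wolff  ,  currently  a )
w = [0.1, 0.2, 0.4, 0.2, 0.1]     # hypothetical position weights w_i > 0
print(weighted_distance(x, x_prime, w))   # 1.0: no position matches on a non-empty symbol
```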

Notice that by making use of Equations (2) and (3) we avoid doing any string comparisons; moreover, isolating the w_i in (2) and (3) allows us to calculate our weighted inner product with at most l integer comparisons and floating point sums. This will be shown in the next section.

3.2. A Weighted Inner Product between Contexts

We start by defining an inner product based on the distance from Equation (1). Making use of the alternative formulation in (2) and (3) gives rise to the following definition:

    ⟨ · | · ⟩ : X × X → R : ⟨x|x′⟩ = Σ_{i=0}^{l−1} δ(k_i, l_i)        (4)

Next, it will be argued that the above inner product is equivalent to the standard inner product between the corresponding orthonormal vectors. This is stated in the following proposition.

Proposition 1. Let X ⊆ S^l be a discrete space with contexts x and x′, with l the length of the contexts, with components x_i and x_i′ ∈ D and with |D| = n as before. Let X̃ ⊆ R^{n∗l} be the space of n∗l-dimensional vectors x̃ and x̃′, the corresponding orthonormal vector representations of the vectors x and x′, as follows:

    x̃ = ( √w_0 [x̃]_0, . . . , √w_{l−1} [x̃]_{l−1} ),    x̃′ = ( √w_0 [x̃′]_0, . . . , √w_{l−1} [x̃′]_{l−1} ),

with [x̃]_i and [x̃′]_i the orthonormal binary vectors as defined in Example 2, corresponding to the symbols x_i and x_i′ respectively. Then the function ⟨ · | · ⟩ as defined in Equation (4) is positive definite (PD) and ⟨x|x′⟩ = ⟨x̃, x̃′⟩.

Proof. We will prove that ⟨x|x′⟩ = ⟨x̃, x̃′⟩; the positive definiteness of the function ⟨ · | · ⟩ then follows automatically. We start by noting that the vectors x̃ and x̃′ are composed of l orthonormal vectors and every [x̃]_i corresponds with the

symbol x_i from the vector x. Multiplying [x̃]_i with √w_i comes down to using orthonormal vectors where the non-zero component of the orthonormal vector corresponding to the symbol x_i does not take the value 1 but the square root of the corresponding weight w_i from Equation (1). Therefore, in the following we will assume that the non-zero components of the [x̃]_i take the value √w_i instead of always writing out the multiplication itself. For l = 1 and ∀x, x′ ∈ X ⊆ S^1 we have the following:

    ⟨x|x′⟩ = δ(k_0, l_0) = { w_0 if x_0 = x_0′, 0 if x_0 ≠ x_0′ }

For the discrete case this follows directly from the definition of the kernel function and the distance function it is based on, see Equations (2), (3) and (4). For the orthonormal case it is sufficient to note that for the inner products between all orthonormal vectors [x̃]_i, [x̃′]_i it holds that:

    ⟨[x̃]_i, [x̃′]_i⟩ = { w_i if [x̃]_i = [x̃′]_i, 0 if [x̃]_i ≠ [x̃′]_i }        (5)

Next, because l = 1 we need only one orthonormal vector to construct complete instances, i.e. x̃ = [x̃]_0 and x̃′ = [x̃′]_0 and thus:

    ⟨x̃, x̃′⟩ = Σ_{i=0}^{l−1} ⟨[x̃]_i, [x̃′]_i⟩ = ⟨[x̃]_0, [x̃′]_0⟩ = δ(k_0, l_0)        (6)

where the last step is justified by Equation (5) and by the assumption that [x̃]_i is the orthonormal vector corresponding to the token x_i. Next, assume that the proposition holds for l = m. We will prove that it then holds for l = m + 1, and by induction we can conclude that it holds for all l. We start by showing that the calculation of the kernel values for l = m + 1 can be decomposed in terms of l = m:

    ⟨x|x′⟩ = Σ_{i=0}^{m−1} δ(k_i, l_i) + { w_m if x_m = x_m′, 0 if x_m ≠ x_m′ }        (7)

Now it can readily be seen that the proposition holds for l = m + 1, because we know by assumption that for the left part of the RHS of Equation (7) it holds that Σ_{i=0}^{m−1} δ(k_i, l_i) = Σ_{i=0}^{m−1} ⟨[x̃]_i, [x̃′]_i⟩, and for the right part of the RHS, making use of Equations (5) and (6), x_m = x_m′ and x_m ≠ x_m′ imply ⟨[x̃]_m, [x̃′]_m⟩ = w_m and 0 respectively. □
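The equivalence claimed in Proposition 1 can also be checked numerically. The following sketch (a toy example with hypothetical values, not part of the paper) compares the index-based inner product of Equation (4) with the ordinary dot product of the √w_i-scaled orthonormal vectors:

```python
import numpy as np

n = 4                                   # toy dictionary size (hypothetical)
w = np.array([0.5, 1.0, 2.0])           # one weight per context position (l = 3)

def inner(x, x_prime, w):
    """Equation (4): sum of delta(k_i, l_i) over index-coded contexts."""
    return sum(wi for k, kp, wi in zip(x, x_prime, w) if k == kp and k != 0)

def embed(x, w, n):
    """Weighted orthonormal embedding: block i holds sqrt(w_i) * one-hot of symbol k_i;
    the empty symbol (index 0) is mapped to the all-zero block, as in Example 2."""
    v = np.zeros(len(x) * n)
    for i, k in enumerate(x):
        if k != 0:
            v[i * n + (k - 1)] = np.sqrt(w[i])
    return v

x, x_prime = [2, 0, 3], [2, 1, 3]       # two index-coded contexts of length 3
lhs = inner(x, x_prime, w)              # 0.5 + 2.0 = 2.5
rhs = float(np.dot(embed(x, w, n), embed(x_prime, w, n)))
assert abs(lhs - rhs) < 1e-12           # <x|x'> = <x~, x~'> as claimed by Proposition 1
```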

Next, it will be shown how the above inner product, which can be considered a simple linear kernel, can be used in more complex kernel functions that are suited to SVM learning for problems with more complicated non-linear decision boundaries.

3.3. Context-sensitive Kernel Functions

Two kernels will be described in this section, both based on standard kernel functions, i.e. the polynomial kernel and the radial basis kernel [2]. For w = 1 the kernels are actually equal to these standard kernels; however, for w ≠ 1 they are not the same, and in that case we call them context-sensitive, as they take into account the amount of information that is present at every position i in the contexts through the weights w_i. We start with the polynomial-based kernel, which takes the following form:

    K(x, x′) = ( ⟨x|x′⟩ + c )^d        (8)

For w = 1 we call the kernel Ksok (Simple Overlap Kernel) and for w ≠ 1 we call the kernel Kwok (Weighted Overlap Kernel). Notice that in practice K can be normalized, i.e. scaled between 0 and 1, as follows:

    Knorm(x, x′) = K(x, x′) / ( √K(x, x) √K(x′, x′) )

This normalization has the same purpose as the normalization of numerical input vectors in the continuous case; of course, here we cannot normalize the inputs themselves as they are symbolic. As the experimental results will show, normalization does not have a big effect in this case, probably due to the low dimensionality of the contexts, i.e. in cases with very large contexts normalization would have a bigger effect. Next, we describe the radial basis-based kernel, which takes the following form:

    K(x, x′) = exp( −γ ‖x̃ − x̃′‖² )
             = exp( −γ ( ⟨x|x⟩ − 2⟨x|x′⟩ + ⟨x′|x′⟩ ) )
             = exp( −2γ d_w(x, x′) )        (9)

For w = 1 we call the kernel Korbf (Overlap Radial Basis Function) and for w ≠ 1 we call the kernel Kwrbf (Weighted Radial Basis Function). Note that normalization is not necessary in this case, as the function exp is automatically scaled between 0 and 1.
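Putting the pieces together, the following minimal sketch (not from the paper; helper names are hypothetical) implements the weighted overlap kernel Kwok of Equation (8) and the weighted radial basis kernel Kwrbf of Equation (9) on index-coded contexts:

```python
import math

def inner(x, x_prime, w):
    """Weighted inner product of Equation (4) on index-coded contexts."""
    return sum(wi for k, kp, wi in zip(x, x_prime, w) if k == kp and k != 0)

def k_wok(x, x_prime, w, c=0.0, d=2, normalize=False):
    """Polynomial-based kernel of Equation (8), optionally scaled to [0, 1]."""
    k = (inner(x, x_prime, w) + c) ** d
    if normalize:
        k /= math.sqrt((inner(x, x, w) + c) ** d) * math.sqrt((inner(x_prime, x_prime, w) + c) ** d)
    return k

def k_wrbf(x, x_prime, w, gamma=0.5):
    """Radial basis-based kernel of Equation (9), written via the squared distance
    <x|x> - 2<x|x'> + <x'|x'>."""
    sq = inner(x, x, w) - 2 * inner(x, x_prime, w) + inner(x_prime, x_prime, w)
    return math.exp(-gamma * sq)

x, x_prime = [1, 2, 3, 4, 5], [9, 2, 3, 4, 6]     # two index-coded contexts, l = 5
w = [0.1, 0.2, 0.4, 0.2, 0.1]                     # e.g. information gain ratio weights
print(k_wok(x, x_prime, w), k_wrbf(x, x_prime, w))   # ~0.64 and ~0.819
```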

Finally, we briefly repeat the advantages of looking at things from a distance function and context-based viewpoint:

1. (Dis)similarity measures are defined on string vectors and not on orthonormal vectors; in that sense it is much easier to see when two string vectors are equal than it is for orthonormal vectors.

2. In the light of the first point it is conceptually much more intuitive to reason about this type of kernel function in this way, because it makes it easier to incorporate more specific knowledge like, for example, context weights, i.e. it is only by considering things from a context-based distance viewpoint that we came to the finding that orthonormal vectors with √w_i as non-zero components are actually a well-founded choice.

3. Moreover, it is not only conceptually but also computationally better to work with string vectors (represented by their index in D), as we only have to sum the w_i, whereas for orthonormal vectors we have to sum √w_i ∗ √w_i for all equal vector components.

4. Working with string vectors makes it possible to consider a degree of similarity between different vector components, e.g. one could compare strings with the Levenshtein distance [8] or work with similarity matrices [15], etc.

4. Experiments

The following section discusses the experimental results on language independent named entity recognition (LINER). For an introduction to LINER see Example 1 of Section (2.1). Notice however that we consider 8 named entities in total: in addition to the 5 types B-PER, I-PER, B-LOC, B-ORG and I-ORG we also consider I-LOC, B-MISC and I-MISC. So, including the O class, we have a classification problem with 9 classes.

4.1. Experimental Setup

The data we used for our experiments was taken from the CoNLL 2002 shared task [12]. The data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). The data is divided into three separate sets: i) a training set consisting of 202931 instances, ii) a development set consisting of 37761 instances, iii) a test set consisting of 68875 instances. The data in the training, development and test sets consist of three columns: the first column contains the word in consideration, the second column contains a part-of-speech tag for that word and the third column contains the class label for that word. The training set is used to train the classifier, the development set is used to do parameter optimization and the test set is used to assess the accuracy of the trained classifier. The experiments have been done with LIBSVM, a C/C++ and Java library for SVMs [1].

To represent the instances of the LINER problem we made use of 3 types of features next to the words themselves:

1. part-of-speech tags for each of these words, e.g. N(oun), V(erb), Num(ber), etc.,
2. orthographic information for each of these words, e.g. CAP(ital letter),
3. information about the position of a particular word in a sentence.

In this light, and taking into account that we used a window size equal to 5, the resulting (annotated) contexts have length 16.

As a weighting scheme we used a quantity called information gain ratio, which calculates for every feature the amount of information it contains with respect to the determination of the class label. In previous work we used information gain; this quantity however favors features with a smaller number of possible values. For more details we refer to [10, 3, 16].
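A rough sketch of how such per-position weights could be computed from training data follows; it illustrates the textbook definition of information gain ratio [10] and is not the authors' exact implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain ratio of one feature (one context position) with respect
    to the class labels: IG / SplitInfo, with IG = H(C) - sum_v P(v) H(C | v)."""
    total = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    split_info = -sum(len(ys) / total * math.log2(len(ys) / total) for ys in by_value.values())
    ig = entropy(labels) - remainder
    return ig / split_info if split_info > 0 else 0.0

# Hypothetical toy data: one row per instance (here length-3 contexts), one label each.
instances = [("-", "Wolff", ","), ("Wolff", ",", "currently"), (",", "currently", "a")]
labels = ["B-PER", "O", "O"]
weights = [gain_ratio([row[i] for row in instances], labels) for i in range(3)]
print(weights)    # one weight w_i per context position
```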

For the evaluation of the results we used a PERL script that is also used for the final evaluation of the CoNLL 2002 shared task. It calculates for every type of named entity the precision, recall and Fβ=1 rate. Additionally, it also calculates the overall precision, recall and Fβ=1 rate. The Fβ=1 rate serves to assess the global performance; it is calculated as a combination of precision and recall. Due to space limitations only the F rates will be reported.
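For reference, the Fβ=1 rate is the harmonic mean of precision P and recall R, i.e. Fβ=1 = 2 · P · R / (P + R).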

4.2. Results

For the sok and wok kernels the value of c has been fixed to 0 and the value of d has been fixed to 2. In this way we can reduce the number of tunable parameters considerably; this is important as we have to train (9 ∗ 8)/2 = 36 classifiers (one per pair of classes) and thus optimizing the parameters with a fine grid search can take quite some time. Note, however, that in our experience higher values of d lead to worse results and values c ≠ 0 have little effect on the accuracy. The only parameter left to optimize in this way is the trade-off parameter C, which we optimized by trying different values through cross-validation on the development set. For the orbf and wrbf kernels, however, there are 2 tunable parameters, i.e. γ and C. Therefore, in this case we performed a fine grid search to obtain the optimal values [1].
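Schematically, this parameter search could be organized as follows (a sketch; `train_and_score` is a hypothetical stand-in for training an SVM with the given parameters and returning the cross-validated Fβ=1 rate on the development set):

```python
import itertools

def grid_search(train_and_score, Cs, gammas=(None,)):
    """Exhaustive search over (C, gamma); gamma stays None for the polynomial-based
    kernels, whose parameters c and d are fixed to 0 and 2."""
    best = (None, None, float("-inf"))
    for C, gamma in itertools.product(Cs, gammas):
        score = train_and_score(C=C, gamma=gamma)   # e.g. cross-validated F rate
        if score > best[2]:
            best = (C, gamma, score)
    return best

# Exponentially spaced ranges for the fine grid search mentioned above (values hypothetical).
Cs = [2 ** k for k in range(-2, 8)]
gammas = [2 ** k for k in range(-7, 2)]
# best_C, best_gamma, best_f = grid_search(my_train_and_score, Cs, gammas)
```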

The following table shows the results of the different kernel functions on the LINER task.

kernel                              Fβ=1
Ksok (C = 5)                        70.08 ± 1.71
Ksok (C = 5) (normalized)           70.84 ± 1.64
Kwok (C = 30)                       73.09 ± 1.62
Kwok (C = 19) (normalized)          73.07 ± 1.60
Korbf (C = 64, γ = 0.015625)        72.05 ± 1.64
Kwrbf (C = 32, γ = 0.5)             73.17 ± 1.58

The significance intervals for the F rates have been obtained with bootstrap resampling [11]. F rates outside of these intervals are assumed to be significantly different from the related F rate (p < 0.05). From the results it becomes clear that the context-sensitive kernel functions perform better than their unweighted counterparts, although for the orbf and wrbf kernels not with a statistically significant difference. Notice that in the results there is almost no difference between the normalized and non-normalized polynomial-based kernels; this is probably due to the low dimensionality of the contexts, i.e. the effect of the normalization is not really noticeable. Next, also notice that besides an increased accuracy there is a noticeable drop in model complexity, i.e. for the context-sensitive kernel functions the number of support vectors is always lower, as can be seen in the following table.

kernel                              number of SVs
Ksok (C = 5)                        24132
Ksok (C = 5) (normalized)           23779
Kwok (C = 30)                       21886
Kwok (C = 19) (normalized)          22421
Korbf (C = 64, γ = 0.015625)        24012
Kwrbf (C = 32, γ = 0.5)             21792

Note that although the differences are not that big in comparison to the total number of support vectors, this still results in a noteworthy increase of the classification speed, as kernel values have to be calculated between the example to be classified and all support vectors. For applications where the classification of new examples is time critical this can be very important.

5. Conclusion

This paper described the use of SVMs for applications characterized by sliding a window over a sequence of symbols. For this type of application we described a different viewpoint, completely based on distance functions defined on contexts. The advantage of this approach is a much more intuitive setting to design and reason about this type of kernel functions. The latter was illustrated by extending the idea of a weighted distance function to kernel functions and SVMs. In general the approach is also usable on continuous data

and with different distance functions. Finally, the experimental results showed that the weighted kernel functions making use of information gain ratio weights outperform their unweighted counterparts, not only in terms of accuracy but also in terms of model complexity.

References

[1] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. 2004.
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[3] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. TiMBL: Tilburg Memory-Based Learner, version 4.3. Technical report, Tilburg University and University of Antwerp, 2002.
[4] H. Isozaki and H. Kazawa. Efficient support vector classifiers for named entity recognition. In Proceedings of COLING 2002, pages 390–396, 2002.
[5] T. Joachims. Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, 2002.
[6] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, pages 564–575, 2002.
[7] C. Leslie and R. Kuang. Fast kernels for inexact string matching. Lecture Notes in Computer Science, 2777, 2003.
[8] V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[9] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. J. C. H. Watkins. Text classification using string kernels. In NIPS, pages 563–569, 2000.
[10] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[11] E. W. Noreen. Computer-Intensive Methods for Testing Hypotheses. John Wiley & Sons, 1989.
[12] E. F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan, pages 155–158, 2002.
[13] B. Schölkopf. The kernel trick for distances. Technical report, Microsoft Research, 2000.
[14] K. Takeuchi and N. Collier. Use of support vector machines in extended named entity. In Proceedings of CoNLL-2002, Taipei, Taiwan, pages 119–125, 2002.
[15] B. Vanschoenwinkel. Substitution matrix based kernel functions for protein secondary structure prediction. In Proceedings of ICMLA-04 (International Conference on Machine Learning and Applications), 2004.
[16] B. Vanschoenwinkel and B. Manderick. A weighted polynomial information gain kernel for resolving PP attachment ambiguities with support vector machines. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), pages 133–138, 2003.
