CONTEXT-SENSITIVE KERNEL FUNCTIONS: A COMPARISON BETWEEN DIFFERENT CONTEXT WEIGHTS

Bram Vanschoenwinkel∗ and Bernard Manderick

Computational Modeling Lab, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel
{bvschoen,bmanderi}@vub.ac.be, http://como.vub.ac.be/

Abstract

This paper considers weighted kernel functions for support vector machine learning with string data. More precisely, applications that rely on a context of a number of symbols before and after a target symbol will be considered. It will be shown how contexts can be organized in vectors and, subsequently, different weighted kernel functions working directly on such contexts will be described. The weighted kernel functions allow every symbol in a context to be weighted according to its relevance for the determination of the class label of the target symbol. Different weighting schemes will be described and compared to each other on two natural language learning problems. From the results it becomes clear that context weighting has a positive influence not only on the classification accuracies but also on the model complexity.

1 Introduction

In support vector machine (SVM) learning the data is mapped non-linearly from the original input space X to a high-dimensional feature space F and subsequently separated by a maximum-margin hyperplane in that space F. By making use of the kernel trick, the mapping to F can stay implicit and we avoid working in a high-dimensional space. Moreover, because the mapping to F is non-linear, the decision boundary, which is linear in F, corresponds to a non-linear decision boundary in X. One of the most important design decisions in SVM learning is the choice of the kernel function K, because the hyperplane is defined completely by inner products between vectors in F, calculated through the kernel function K. In addition, K takes vectors from the input space X and directly calculates inner products in F without having to represent or even know the exact form of the mapped vectors, hence the implicit mapping and the computational benefit [3]. In the light of the above it is not hard to see that the way in which K is calculated is crucial for the success of the classification process. Notice that an inner product is in fact one of the simplest similarity measures between vectors, as it gives much information about the position of these vectors in relation to each other. In that sense the learning process can benefit a lot from the use of special-purpose similarity or dissimilarity measures in the calculation of K [9, 10, 12].

The problems we will consider here are 1) part-of-speech (POS) tagging and 2) language independent named entity recognition (LINER). The purpose of LINER is to determine for every word in a text whether the word refers to a proper name or not, and, if it does, what kind of name, i.e. the name of a person, organization, location, etc. Part-of-speech tagging is the process of marking up the words in a text with their corresponding parts of speech, i.e. the syntactic categories that words belong to. Each instance of both problems can be represented by a sequence of words (henceforth more generally called symbols) where each symbol belongs to a given dictionary. Moreover, to determine the class label of a symbol we will look at the context of each symbol in a given sequence, i.e. the p symbols before and the q symbols after that symbol. Additionally, in the case of LINER we will extend a context with additional features, like orthographic features or POS tags. In that case we talk about an annotated context. In our case a symbol can be a word, a proper name, an orthographic feature, a punctuation mark, etc.; the dictionary contains all words of the language in consideration together with proper names, punctuation marks and additional features. A sequence of symbols is a sentence and the context of a given word consists of the p words before and the q words after that word.

∗ Author funded by a doctoral grant of the Institute for the Advancement of Scientific and Technological Research in Flanders (IWT).


When classifying contexts, the data is often transformed to real-valued vectors and a standard kernel working on the transformed input examples is then used to do the classification. In this work, however, we use kernel functions that work on the contexts themselves, by making use of a simple similarity measure defined on context vectors. It will be shown that, by introducing feature weights into this similarity measure, every position in a context can be weighted according to its relevance for the determination of the class label of the problem under consideration. The kernels weighted in this way will be called context-sensitive kernels as they are sensitive to the amount of information that is present in the context. From the experiments it will become clear that context-sensitive kernels always outperform their unweighted counterparts and that some weighting schemes are better than others.

2 Sequences of Symbols and Contexts

The following sections give a formalized description of the type of problems we consider here and show how these problems can be represented by contexts formed by sliding a window over a sequence of symbols. Next, three different ways to present such contexts to an SVM, i.e. by orthonormal real vectors, by the symbols themselves or by the indexes of the symbols in a dictionary, will be briefly described, and finally it will be motivated why we choose to represent the contexts by the indexes of their symbols in the dictionary.

2.1 Contexts

Consider a collection S of sequences s. Every sequence s consists of an ordered succession of symbols, i.e. s = (s_{k_0} ... s_{k_{|s|-1}}) with |s| the length of the sequence and with s_{k_i} ∈ D, a set (henceforth called a dictionary) of symbols indexed according to k_i ∈ {0, ..., n}, with |D| = n the cardinality of the dictionary D and i = 0, ..., |s|-1. Contexts are now formed by sliding a window over the sequences s ∈ S, i.e. for every sequence s a set of instances I(s) containing |s| contexts with a window size r = p + q + 1 is constructed as follows:

I(s) = { s[(i-p) : (i+q)] | 0 ≤ i ≤ |s|-1 }

with p the size of the left context, q the size of the right context and with s[i : j] = (s_{k_i} ... s_{k_j}) the subsequence of symbols from index i to j in the sequence s. The total set of contexts is now formed by taking the union of all the I(s), i.e. I(S) = ∪_{s ∈ S} I(s). Notice that for subsequences with indexes i < 0 or j > |s|-1 the corresponding positions in the context are filled with the special symbol '−', which can be considered as the empty symbol and which has index 0 in D. In the following we give an example of this in the setting of language independent named entity recognition (LINER).

Example 1. In LINER the purpose is to distinguish between different types of named entities. Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Consider the following sentence:

[B-PER Wolff] , currently a journalist in [B-LOC Argentina] , played with [B-PER Del I-PER Bosque] in the final years of the seventies in [B-ORG Real I-ORG Madrid] .

Here we recognize 5 types of named entities (B-PER, I-PER, B-LOC, B-ORG and I-ORG); all other words are tagged as O (outside named entity). Instances for such classification problems are often composed of the word for which we want to determine the entity class and a context of a number of words before and after that word. Furthermore, sometimes additional features, like the position of a word in the sentence, are used, but we will not consider these features here. For simplicity we will use a context of 2 words to the left and to the right of the focus word (i.e. a window size of 5). In our setting the set of contexts I(S) is then defined as:

I(S) = ∪_{s ∈ S} I(s) = { s[(i-2) : (i+2)] | 0 ≤ i ≤ 22 }
     = { s[-2 : 2], s[-1 : 3], s[0 : 4], ..., s[18 : 22], s[19 : 23], s[20 : 24] }
     = { (− − Wolff , currently), (− Wolff , currently a), (Wolff , currently a journalist), ..., (seventies in Real Madrid .), (in Real Madrid . −), (Real Madrid . − −) }

Notice that the '−' symbols represent values that are not present because they fall before the beginning or after the end of a sequence. In addition, as mentioned in Example 1, in many cases one would like to use additional features next to the symbols in the sequences themselves. These additional features are defined as a function of the symbols in the sequences and possibly some other information, like the position of the symbol in the sequence.

The possible values that the new features can adopt are added to the dictionary D and indexed accordingly. The new features themselves can be added anywhere in the context as long as they are put on corresponding positions in all contexts. The simplest way is to just add them at the end of the contexts as they are formed by sliding the window over the original sequences. Consider the following example:

Example 2. In the case of LINER we distinguish 3 types of additional features:

1. POS tags: POS tags are defined for every symbol in the original dictionary D and describe what type of word we are considering, e.g. V(erb), N(oun), etc. This feature is defined by the function f_POS : D → P that maps every symbol from D to some POS tag p from the set P of all possible POS tags. It is defined for every symbol in the original context.

2. Orthographic features: information about the capitalization of a word, digits or hyphens in a word, etc. This feature is defined by the function f_ORT : D → O. It is also defined for every symbol in the original context.

3. Position feature: this feature is binary and indicates the position of the focus word in the original sequence, i.e. whether the focus word is the first word in the sentence or not. Consequently, this feature is defined as a function of the position of the symbol in the original sequence s, i.e. f_wp : {0, ..., |s|-1} → {fw, ow}, with fw referring to first word and ow referring to other word.

The total dictionary D_tot is now formed by D ∪ P ∪ O ∪ {fw, ow}, and an annotated context based on a sequence s is made out of 4 blocks as follows:

(s_{k_i}, ..., s_{k_j}, f_POS^i, ..., f_POS^j, f_ORT^i, ..., f_ORT^j, f_wp)
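To make the sliding-window construction of Section 2.1 concrete, here is a minimal sketch, not the authors' code; the helper name extract_contexts and the truncated toy sentence are illustrative assumptions. It builds, for every position in a sequence, a context of p symbols to the left and q symbols to the right, padding with the empty symbol at the boundaries.

def extract_contexts(sequence, p=2, q=2, pad='-'):
    """Return the |s| contexts of a sequence, each of length r = p + q + 1."""
    padded = [pad] * p + list(sequence) + [pad] * q
    return [tuple(padded[i:i + p + q + 1]) for i in range(len(sequence))]

# Toy version of the sentence from Example 1 (truncated for brevity).
sentence = ['Wolff', ',', 'currently', 'a', 'journalist']
for context in extract_contexts(sentence):
    print(context)
# First context printed: ('-', '-', 'Wolff', ',', 'currently')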

From now on we will consider D to be the dictionary containing all symbols, including the values of the additional features. Moreover, note that for the following it does not matter whether we are working with contexts or annotated contexts; therefore, from now on we will not distinguish between them. Next, different ways to represent the contexts as vectors x, x′ ∈ X, in a form suitable for an SVM to work with, will be described.

2.2 Different Ways to Represent Contexts

We make an important distinction between reasoning about contexts and computing with contexts: sometimes a certain representation of a problem is easy to reason about while it is not efficient to do calculations with, or the other way around, i.e. a representation can be difficult to reason about while it is very efficient to do calculations with.

A well-known approach to representing contexts as vectors suitable for SVM learning is to encode them with orthonormal vectors. An orthonormal vector is a vector of length n representing a symbol from D. It has all zero components except for the component corresponding to the index in D of the symbol it represents; in total there are n orthonormal vectors, i.e. one for each symbol in D. Usually the only non-zero component takes the value 1. Complete vectors are now formed by concatenating the orthonormal vectors corresponding to the symbols in the context; the dimensionality of such a complete vector is n * r, with only r non-zero components. Calculating with orthonormal vectors is very efficient since most SVM implementations work with a sparse vector format, i.e. only the non-zero components have to be represented. Reasoning about such vectors, e.g. about the similarity between orthonormal vectors, is not easy, however, because it is not directly observable what context they represent. Moreover, the similarity measures one would like to use in the kernel function are defined on contexts and not on orthonormal representations of such contexts. For these reasons we believe that it is better to reason about contexts at the level of the symbols themselves, i.e. in the way they are represented in Example 1. However, calculating with such context vectors is not efficient at all, because the different symbols of the context have to be checked for equality and the complexity of comparing 2 strings is proportional to the length of the longest string in the comparison. Note that one of the most important requirements for a kernel is that it is computationally efficient, as it has to be calculated for all pairs of vectors in the training set. For these reasons it is a logical choice to represent the symbols not by strings of characters but by their index k_i in the dictionary D. In this way we only have to compare integers and not strings. For LINER and POS tagging the computational benefit obtained in this way is substantial.
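To illustrate the difference between the two representations discussed above, the following sketch (hypothetical helper names and a toy dictionary, not taken from the paper) encodes the same context once as a vector of dictionary indexes and once as a concatenation of orthonormal indicator vectors.

def to_indexes(context, dictionary):
    """Compact representation: the index k_i of every symbol in D."""
    return [dictionary[symbol] for symbol in context]

def to_orthonormal(context, dictionary):
    """Sparse representation: one length-n indicator vector per symbol,
    concatenated into a vector of dimension n * r with r non-zero components."""
    n = len(dictionary)
    vector = [0.0] * (n * len(context))
    for position, symbol in enumerate(context):
        vector[position * n + dictionary[symbol]] = 1.0
    return vector

# Toy dictionary; the empty symbol '-' gets index 0 as in Section 2.1.
D = {'-': 0, 'Wolff': 1, ',': 2, 'currently': 3, 'a': 4}
context = ('-', '-', 'Wolff', ',', 'currently')
print(to_indexes(context, D))            # [0, 0, 1, 2, 3]: cheap integer comparisons
print(len(to_orthonormal(context, D)))   # 25 components, only 5 of them non-zero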

3 Context-sensitive Kernel Functions

From now on we will be working with vectors x, x′ ∈ X with X ⊆ S^l, where S^l is the space of all contexts of length l. Note that contexts are subsequences s[i : j] of the sequence s they have been derived from; for the following, however, it does not matter from which sequence the context vectors have been derived, and thus x, x′ are simply considered to be sequences of symbols with components x_i, x_i′ ∈ D and corresponding indexes k_i, k_i′ ∈ {0, ..., n}. Recall that in practice the contexts will actually be represented by the indexes k_i, k_i′, as discussed in the previous section.

3.1 A Simple Context-sensitive Kernel

We start by considering the most basic context-sensitive (CS) kernel, which is defined as follows:

Definition 1. Let X ⊆ S^l with l-dimensional contexts x and x′ with components x_i and x_i′ ∈ D and with |D| = n as before. Then the basic context-sensitive kernel K_CS : X × X → R is defined as

K_CS(x, x′) = \sum_{i=0}^{l-1} w_i δ(x_i, x_i′)    (1)

with δ : D × D → {1, 0} defined as

δ(x_i, x_i′) = 1 if k_i = k_i′ ≠ 0, and 0 otherwise    (2)

and with k_i, k_i′ the indexes of x_i, x_i′ in D, with w_i > 0 a context weight for the i-th symbol in the context and 0 the index of the empty symbol '−'.

Notice that the function K_CS satisfies the necessary requirements of a kernel function, i.e. it is positive semi-definite (PSD). This can easily be seen because for w = 1 the function K_CS is equal to the standard inner product between the orthonormal binary vectors representing the contexts. For w ≠ 1 the function K_CS is equal to the standard inner product between orthonormal vectors with the non-zero component corresponding to symbol x_i taking the value √w_i; for more details see [11]. But, although we can transform the simple kernel from Equation (1) into an orthonormal equivalent, it is much more intuitive to reason about it at the level of the contexts themselves by considering K_CS, instead of working with inner products between orthonormal vectors with square roots of the weights as non-zero components. Moreover, it is only by reasoning at the level of the contexts, i.e. by considering Equation (1) and analyzing it in more detail, that we came to this finding. Notice that the simple CS kernel is too simple for classification problems with complex, non-linear decision boundaries. Therefore, the kernel will be extended in two ways, making use of the closure properties of PSD kernels [3].
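A minimal sketch of the kernel of Equation (1), assuming contexts are given as index vectors as argued in Section 2.2; this is not the authors' implementation, and following the reconstruction of δ above it assumes that matches on the empty symbol (index 0) do not contribute.

def k_cs(x, x_prime, weights):
    """K_CS(x, x') of Equation (1): every position where the two contexts carry
    the same non-empty symbol contributes its context weight w_i."""
    return sum(w for w, a, b in zip(weights, x, x_prime) if a == b != 0)

# Two index-encoded contexts of length 5.
x = [0, 0, 1, 2, 3]
x_prime = [0, 5, 1, 2, 4]
print(k_cs(x, x_prime, [1.0] * 5))                    # unweighted (w = 1): 2.0
print(k_cs(x, x_prime, [0.1, 0.2, 0.9, 0.9, 0.2]))    # weighted: 0.9 + 0.9 = 1.8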

3.2 More Complex Kernels

First, we consider a polynomial kernel that is similar to the standard polynomial kernel in real space [3]. We call it an overlap kernel after the metric it is based on, i.e. the simple overlap metric (see Equation (1) and [4]):

Definition 2. Let X ⊆ S^l with l-dimensional contexts x and x′ with components x_i and x_i′ ∈ D and with |D| = n as before. Then the overlap kernel K_OK : X × X → R is defined as

K_OK(x, x′) = (K_CS(x, x′) + c)^d    (3)

with K_CS as in Definition 1, with c ≥ 0 and with d > 0. For w = 1 the resulting kernel is called the Simple Overlap Kernel (SOK) and for w ≠ 1 it is called the Weighted Overlap Kernel (WOK).

The second extension is based on the standard radial basis function (RBF) kernel in real space. The fact that we can calculate ‖x − x′‖ by making use of the simple CS kernel from Equation (1) gives rise to the context-sensitive radial basis function kernel, defined as follows:

Definition 3. Let X ⊆ S^l with l-dimensional contexts x and x′ with components x_i and x_i′ ∈ D and with |D| = n as before. Then the context-sensitive radial basis function kernel K_csrb : X × X → R is defined as

K_csrb(x, x′) = exp(−γ (K_CS(x, x) − 2 K_CS(x, x′) + K_CS(x′, x′)))    (4)

with K_CS as in Definition 1 and with γ > 0. For w = 1 the resulting kernel is called the Overlap Radial Basis Function (ORBF) kernel and for w ≠ 1 it is called the Weighted Radial Basis Function (WRBF) kernel.

In the following section the different context weights that can be used in the weighted kernels will be described.
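The two extensions can be sketched directly on top of K_CS. The code below is an illustration rather than the authors' implementation; k_cs is repeated from the previous sketch so that the block is self-contained, and the example contexts and weights are invented.

import math

def k_cs(x, x_prime, weights):
    """K_CS of Equation (1), as in the previous sketch."""
    return sum(w for w, a, b in zip(weights, x, x_prime) if a == b != 0)

def k_ok(x, x_prime, weights, c=0.0, d=2):
    """Overlap kernel of Equation (3): (K_CS(x, x') + c)^d with c >= 0, d > 0."""
    return (k_cs(x, x_prime, weights) + c) ** d

def k_csrb(x, x_prime, weights, gamma=0.1):
    """Context-sensitive RBF kernel of Equation (4)."""
    squared_distance = (k_cs(x, x, weights)
                        - 2.0 * k_cs(x, x_prime, weights)
                        + k_cs(x_prime, x_prime, weights))
    return math.exp(-gamma * squared_distance)

x, x_prime = [1, 2, 3, 4, 5], [1, 2, 3, 4, 6]
w = [0.5, 1.0, 2.0, 1.0, 0.5]             # e.g. gain-ratio weights (Section 4)
print(k_ok(x, x_prime, w))                # (4.5 + 0)^2 = 20.25
print(k_csrb(x, x_prime, w))              # exp(-0.1 * (5 - 9 + 5)) = exp(-0.1)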

4 Context Weights

The unweighted kernels from the previous section (i.e. the cases where w = 1) simply calculate the number of matching symbols in both contexts. If we have no information about the relevance of a certain symbol at a certain position in the context, this is a good choice to measure the similarity between such instances. However, by computing statistics on all contexts in our training set we can get a pretty good idea of which positions in the context are good predictors of the class labels [4]. It is in that sense that we call our weighted kernel functions context-sensitive: they are sensitive to the amount of information that is present at the different positions of the contexts they are working on. Information theory provides many useful tools for computing statistics of this kind [6, 7].

We start with a measure called information gain. Information Gain (IG) weighting looks at every position in the context separately and measures how much information it contributes to our knowledge of the correct class label. The IG of position i in the context is measured by calculating the difference in entropy between the cases with and without knowledge of the value at that position:

w_i(IG) = H(C) − \sum_{v ∈ V_i} P(v) × H(C|v)    (5)

with C the set of class labels, V_i ⊆ D the set of values for position i in the context and with H(C) the entropy of the class labels, calculated as follows:

H(C) = − \sum_{c ∈ C} P(c) log_2 P(c)

However, IG tends to overestimate the relevance of context positions with large numbers of values. To overcome this we can normalize the IG measure; the measure formed in this way is called the gain ratio. Gain Ratio (GR) is IG divided by a quantity called the split info, i.e. the entropy of the context values:

si(i) = − \sum_{v ∈ V_i} P(v) log_2 P(v)    (6)

Next, the GR measure for context position i can be calculated in the following way:

w_i(GR) = w_i(IG) / si(i)    (7)

with w_i(IG) as in Equation (5) and si(i) as in Equation (6). Unfortunately, the GR measure still has an unwanted bias toward context positions with more values [1]. The next measure, called the shared variance measure, tries to overcome this problem. The Shared Variance (SV) measure is calculated based on the χ² statistic [4, 1]. Start by considering the following equation that calculates this statistic:

χ² = \sum_i \sum_j (E_{ij} − O_{ij})² / E_{ij}    (8)

with O_{ij} the observed number of cases with value v_i in class c_j and E_{ij} the expected number of cases with value v_i in class c_j. We can now use the χ² values from Equation (8) directly, or we can use the SV measure, which corrects the χ² measure for the degrees of freedom as follows:

w_i(SV) = χ_i² / (N × (min(|C|, |V_i|) − 1))    (9)

with |C| and |V_i| the number of classes and the number of values of context position i, and with N the number of instances.
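The three weighting schemes can be estimated from a training set of index-encoded contexts and their class labels. The sketch below is not the authors' code (in the experiments the statistics follow the conventions of TiMBL [4]); it computes w_i(IG), w_i(GR) and w_i(SV) per context position following Equations (5)-(9), with small guards against empty split info and zero degrees of freedom.

import math
from collections import Counter, defaultdict

def entropy(counts):
    """Entropy in bits of a distribution given by absolute counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def context_weights(X, y):
    """Return three lists (ig, gr, sv) with one weight per context position."""
    N = len(y)
    class_counts = Counter(y)
    h_c = entropy(class_counts.values())          # H(C)
    ig, gr, sv = [], [], []
    for i in range(len(X[0])):
        # Joint counts of (value at position i, class label).
        joint = defaultdict(Counter)
        for x, c in zip(X, y):
            joint[x[i]][c] += 1
        value_counts = {v: sum(cnts.values()) for v, cnts in joint.items()}
        # Information gain (Eq. 5) and split info (Eq. 6).
        ig_i = h_c - sum((n_v / N) * entropy(joint[v].values())
                         for v, n_v in value_counts.items())
        si_i = entropy(value_counts.values())
        # Chi-squared statistic (Eq. 8) for this position.
        chi2 = 0.0
        for v, n_v in value_counts.items():
            for c, n_c in class_counts.items():
                expected = n_v * n_c / N
                chi2 += (expected - joint[v][c]) ** 2 / expected
        ig.append(ig_i)
        gr.append(ig_i / si_i if si_i > 0 else 0.0)          # Eq. 7
        dof = max(min(len(class_counts), len(value_counts)) - 1, 1)
        sv.append(chi2 / (N * dof))                          # Eq. 9
    return ig, gr, sv

# Toy data: four contexts of length 3 with two classes.
X = [[1, 2, 3], [1, 2, 4], [5, 2, 3], [5, 2, 4]]
y = ['A', 'A', 'B', 'B']
print(context_weights(X, y))   # position 0 is fully predictive, positions 1 and 2 are not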

5 Experiments

The experiments will show that weighting the contexts not only improves the classification accuracy but also considerably reduces both the model complexity and the convergence time of the SVM. For this purpose we make use of the two natural language learning tasks discussed earlier, i.e. LINER and POS tagging. We start this section with a brief overview of the data and software used for the experiments.

5.1 Software and Data

The experiments are conducted with LIBSVM [2], a Java/C++ library for SVM learning. The data we used for the LINER experiments was taken from the CoNLL 2002 shared task [8]. The data is divided into three separate sets: a training set consisting of 202931 instances, a development set consisting of 37761 instances and used for parameter selection, and a test set consisting of 68875 instances. The contexts we used in our experiments consist of 4 parts as described in Example 2; in total the (annotated) context length is equal to 16. We consider 4 types of entity classes, i.e. locations, persons, organizations and miscellaneous. For the POS tagging experiments we used a dataset extracted from the WSJ corpus; it consists of a training set of 497522 instances and a test set of 46512 instances. Every instance consists of the word in the sentence and an annotation similar to that of the LINER instances; in total we consider 37 syntactic categories and we use a window size of 5. This comes down to a classification problem with 37 classes and a context length of 10. Finally, for the multi-class classification a one-against-one method is used [2], and for the evaluation of the LINER results we used the PERL script employed for the final evaluation of the CoNLL 2002 shared task. The indicated significance intervals for the F rates have been obtained with bootstrap resampling [5]; F rates outside of these intervals are assumed to be significantly different from the related F rate (p < 0.05). For POS tagging we simply report the classification accuracies.
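The experiments interface with LIBSVM directly; purely as a hedged illustration of how a context kernel can be plugged into an off-the-shelf SVM, the sketch below uses scikit-learn's SVC with a precomputed Gram matrix instead. The toy contexts, labels and weights are invented, and k_ok repeats the overlap kernel sketched in Section 3.2; multi-class problems are handled one-against-one by SVC, matching the setup described above.

import numpy as np
from sklearn.svm import SVC

def k_ok(x, x_prime, weights, c=0.0, d=2):
    """Weighted overlap kernel of Equation (3), as in the earlier sketch."""
    return (sum(w for w, a, b in zip(weights, x, x_prime) if a == b != 0) + c) ** d

def gram_matrix(A, B, weights):
    """Pairwise overlap-kernel values between the contexts in A and B."""
    return np.array([[k_ok(a, b, weights) for b in B] for a in A])

# Purely illustrative index-encoded contexts, labels and (e.g. gain-ratio) weights.
X_train = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [7, 8, 9, 1, 2], [7, 8, 9, 3, 4]]
y_train = ['B-PER', 'B-PER', 'O', 'O']
X_test = [[1, 2, 3, 4, 7]]
weights = [0.2, 0.5, 1.0, 0.5, 0.2]

clf = SVC(kernel='precomputed', C=25)
clf.fit(gram_matrix(X_train, X_train, weights), y_train)
print(clf.predict(gram_matrix(X_test, X_train, weights)))   # predicted label for the test context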

5.2 Results

We start with the F_{β=1} results for LINER and the polynomial based kernels, i.e. the SOK and the WOK with the different weights GR, IG and SV. We tried different values for c but did not see any significant difference in accuracy; therefore the parameter c has been fixed to 0. In a similar way d has been fixed to 2. For the cost parameter C we tried a wide range of values, from which it could be observed that results stabilized around C = 25: for values higher than 25 there was almost no observable difference in the F rates anymore, i.e. the differences between the measured F rates became insignificantly small.

C/kernel    SOK             WOK(gr)         WOK(ig)         WOK(sv)
0.5         63.93 ± 1.74    63.42 ± 1.79    65.69 ± 1.58    65.19 ± 1.65
1           66.88 ± 1.69    67.64 ± 1.68    69.51 ± 1.63    69.26 ± 1.78
5           70.08 ± 1.71    72.12 ± 1.59    72.30 ± 1.60    71.28 ± 1.61
25          70.07 ± 1.72    73.07 ± 1.58    72.25 ± 1.64    71.34 ± 1.71

Table 1: Results for LINER. F_{β=1} rates for the simple overlap kernel and the context-sensitive kernels with different context weights.

Since the best value of C corresponds to the results in the lower part of the table, we concentrate on these F rates for the comparison. From the results it can be seen that all weighted CS kernels outperform the unweighted SOK. For the GR weights and the IG weights the difference with the SOK is statistically significant: for the GR weights the margin is comfortable, for the IG weights it is rather small, and for the SV weights the difference is not statistically significant. From the above we can conclude that the GR weights have the best influence on the results. The fact that the GR weights perform slightly better than the IG weights is most probably due to the big difference in the number of possible values at different context positions: the context positions that contain words take several hundreds of different values, while the additional features (see Example 2) typically take far fewer, so the normalizing effect of the split info used in the GR measure pays off. For the somewhat disappointing results of the SV measure, however, we do not have an explanation.

[Figure 1: Model complexity (top; number of support vectors, #SV) and convergence (bottom; number of SVM iterations, #iter), plotted against C for the SOK and the weighted kernels. Results for LINER (left) and for POS tagging (right).]

Additionally, it can be seen in the top left part of Figure 1 that the GR weights also result in a smaller model complexity compared to the SOK, i.e. there is a smaller number of support vectors, while at the same time classification is significantly better. In this sense the GR weights exhibit a behavior similar to that in decision tree learning, where the GR measure makes sure that shorter trees are preferred over longer ones [7]. For the other weights the model complexity increases, but this comes with the benefit of better classification. On the bottom left side of Figure 1 we plot the number of iterations the SVM required to converge. For low C we can see that the CS kernel weighted with GR requires fewer iterations than the SOK, although the difference is rather small. For larger C the GR kernel actually requires a little more iterations than the SOK. The other weights require more iterations for lower C and fewer iterations for higher C, although the differences are again rather small. For the POS tagging problem, however, the GR weighted kernel always requires fewer iterations than the SOK, as can be seen on the bottom right of Figure 1. Note that a smaller set of support vectors results not only in a smaller model but also in faster classification of new examples, as every new example has to be multiplied with every vector in the model.

Next we give the results for POS tagging. Notice that for a classification problem with 37 classes, 666 one-against-one classifiers have to be trained! For this reason we only compare the SOK and the WOK weighted with GR weights.

C/kernel    SOK         WOK(gr)
1           94.4659%    95.5925%
5           95.5432%    96.38%
25          95.5022%    96.23%

Table 2: Results for POS tagging. Accuracies for the simple overlap kernel and the context-sensitive kernel with information gain ratio weights.

In the results for POS tagging we see the same tendency as in the LINER results, i.e. the GR weighted kernel outperforms the unweighted SOK. Considering Figure 1 again, we can see on the top right that the model complexity for the CS kernel weighted with GR is much lower and, in contrast with the results for LINER, convergence is considerably faster for the CS kernel

weighted with GR weights for all values of C. Finally, note that the results for the WRBF and ORBF are similar but we don’t have the space to discuss them here.

6 Conclusion

In this paper it was shown how instances that are formed by sliding a window over a sequence of symbols or words can be represented by (annotated) contexts, and it was argued why it is better to work with the contexts directly instead of with an orthonormal real representation of them. Subsequently, a number of context-sensitive kernel functions were introduced; the kernels are called context-sensitive because they allow the context to be weighted according to the information it contains. A number of weighting schemes were proposed and, in the experiments, compared to each other and to an unweighted kernel. From these experiments it becomes clear that gain ratio weighting is the best choice of weighting scheme for LINER and POS tagging, because it results in better classification and reduced model complexity, probably due to the normalizing effect of the split info.

References

[1] A.P. White and W. Liu. Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329, 1994.
[2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. 2004.
[3] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.
[4] Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. TiMBL: Tilburg Memory-Based Learner, version 5.1. Technical report, Tilburg University and University of Antwerp, 2004.
[5] Eric W. Noreen. Computer-Intensive Methods for Testing Hypotheses. John Wiley & Sons, 1989.
[6] J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
[7] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[8] Erik F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan, pages 155–158, 2002.
[9] Bernhard Schölkopf. The kernel trick for distances. Technical report, Microsoft Research, 2000.
[10] Bram Vanschoenwinkel. Substitution matrix based kernel functions for protein secondary structure prediction. In Proceedings of ICMLA'04 (International Conference on Machine Learning and Applications), 2004.
[11] Bram Vanschoenwinkel, Liu Feng, and Bernard Manderick. Weighted kernel functions for SVM learning in string domains: A distance function viewpoint. In Proceedings of ICMLC (International Conference on Machine Learning and Cybernetics), Guangzhou, China, August 19–21, 2005.
[12] Bram Vanschoenwinkel and Bernard Manderick. A weighted polynomial information gain kernel for resolving PP attachment ambiguities with support vector machines. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), pages 133–138, 2003.
