Fast Transpose Methods for Kernel Learning on Sparse Data
Patrick Haffner    [email protected]
AT&T Labs–Research, 180 Park Avenue, Florham Park, NJ 07932
Abstract

Kernel-based learning algorithms, such as Support Vector Machines (SVMs) or the Perceptron, often rely on sequential optimization where a few examples are added at each iteration. Updating the kernel matrix usually requires matrix-vector multiplications. We propose a new method based on transposition to speed up this computation on sparse data. Instead of dot-products over sparse feature vectors, our computation incrementally merges lists of training examples and minimizes access to the data. Caching and shrinking are also optimized for sparsity. On very large natural language tasks (tagging, translation, text classification) with sparse feature representations, a 20 to 80-fold speedup over LIBSVM is observed using the same SMO algorithm. Theory and experiments explain what type of sparsity structure is needed for this approach to work, and why its adaptation to Maxent sequential optimization is inefficient.
1. Introduction

Kernel-based methods, such as SVMs (Cortes & Vapnik, 1995; Vapnik, 1998), represent the state of the art in classification techniques. However, their application is limited by the scaling behavior of their training algorithm, which, in most cases, scales quadratically with the number of training examples. This scaling issue may be one of the reasons why other techniques, such as Maximum Entropy or Maxent (Berger et al., 1996), are often chosen for Natural Language applications. Maxent learning times usually scale linearly with the number of examples (Haffner, 2005). Their optimization techniques operate in the feature space (also called the primal space) and seem particularly appropriate for sparse data.
Recent advances in linear SVM optimization (Joachims, 2006) also provide a linear scaling property for SVMs. However, this paper focuses on non-linear SVMs and other kernel-based methods.

In this paper, data is called sparse when the number of active or non-zero features in a training vector is much lower than the total number of features. It is a very common property of language data. For instance, in a bag-of-n-grams input, features represent the occurrence of a word or token n-gram in a given sentence, and only a small portion of the total n-gram vocabulary is present in any sentence. This type of input can be used for text classification (Joachims, 1998b), Machine Translation, and text annotation and tagging (Bangalore & Joshi, 1999).

Scaling up kernel algorithms (beyond the linear case) has recently been the subject of extensive research, with improved algorithms (Fan et al., 2005), parallelization (Graf et al., 2005), and techniques based on on-line or active sampling (Bordes et al., 2005). Variations on the Perceptron algorithm (Crammer et al., 2004) can also be considered as fast on-line approximations of SVMs. All these iterative algorithms have produced considerable speedups, but none of them reconsiders the central computation step, i.e., the computation of kernel products. In most iterative algorithms, the kernel computation can be folded into a matrix-vector multiplication. This paper shows that, for a large class of kernels, this multiplication can be optimized for sparse data. The methods presented here can be combined with most other SVM speedups.

Usually, the data is represented as a set of examples, where each example is a list of features. The transpose representation, known as the inverted index in information retrieval, views the data as a set of features, where each feature is a list of examples. Based on this transpose representation, this paper shows how to speed up the kernel computation in an existing kernel iterative algorithm. Note that the entire kernel learning problem can also be described as a linear system whose optimization can be simplified with matrix approximation.
Such global approaches (Yang et al., 2005) have been demonstrated on non-sparse, low-dimensional data.

Section 2 introduces kernel learning and provides a generic description of the algorithm for matrix-vector multiplication. Section 3 describes the traditional approaches to this multiplication and introduces new transpose methods. Section 4 provides a theoretical and experimental complexity analysis. Section 5 describes in more detail the implementation in the case of SVMs. Section 7 outlines large-scale experiments on text and language learning problems: Natural Language Understanding, Tagging, and Machine Translation.
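To make the two data layouts concrete (the usual example-major representation versus the transpose, or inverted-index, representation described above), here is a hedged C++ sketch; the type and field names are illustrative assumptions, not the paper's data structures.

    #include <vector>

    // Example-major layout: each training example is a short list of
    // (feature index, value) pairs -- the usual sparse-vector format.
    struct Feature { int index; double value; };
    typedef std::vector<Feature> SparseVector;          // one example
    typedef std::vector<SparseVector> ExampleMajor;     // all examples

    // Transpose (inverted-index) layout: each feature is a list of
    // (example index, value) pairs, i.e. the examples where it is active.
    struct Posting { int example; double value; };
    typedef std::vector<Posting> PostingList;           // one feature
    typedef std::vector<PostingList> FeatureMajor;      // all features

    // Building the transpose takes a single pass over the examples.
    FeatureMajor transpose(const ExampleMajor& data, int num_features) {
        FeatureMajor t(num_features);
        for (int i = 0; i < (int)data.size(); ++i)
            for (size_t j = 0; j < data[i].size(); ++j)
                t[data[i][j].index].push_back(Posting{i, data[i][j].value});
        return t;
    }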
2. Sequential Kernel Learning

Kernel-based algorithms rely on the computation of kernels between pairs of examples to solve classification or clustering problems. Most of them require an iterative learning procedure. The most common of these procedures is Sequential Minimal Optimization (SMO) (Platt, 1998), used for SVMs, but many other procedures are possible (for instance, the Perceptron algorithm). The kernel classifier is represented as a list of support vectors xk and their respective multipliers αk (in the classification case, the label yk ∈ {−1, 1} gives the sign of αk). The classifier score for vector x is f(x) = Σk αk K(x, xk). Each iteration consists of the addition or modification of one (Perceptron), two (SMO), or more (other SVM algorithms) support vectors. At each iteration, we want to find the best candidate support vector to add or update. For that purpose, we need to keep the scores of all training examples, or of a large subset of them (called the active set), up to date. When adding a factor δαk to the multiplier αk of support vector xk, these scores must be incremented as follows:

∀i,  f(xi) = f(xi) + δαk K(xi, xk)        (1)
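A minimal C++ sketch of update (1), assuming cached scores f[i] for the active set; the container and function names are illustrative, and 'kernel' stands for any dot-product-based kernel, such as the Gaussian kernel of equation (2) below.

    #include <functional>
    #include <vector>

    // Sparse vector: index-sorted list of (feature index, value) pairs.
    struct Feature { int index; double value; };
    typedef std::vector<Feature> SparseVector;

    // Update (1): after the multiplier of support vector x_k changes by
    // delta_alpha, every cached score f[i] of the active set is incremented
    // by delta_alpha * K(x_i, x_k).
    void update_scores(std::vector<double>& f,
                       const std::vector<SparseVector>& active_set,
                       const SparseVector& x_k,
                       double delta_alpha,
                       const std::function<double(const SparseVector&,
                                                  const SparseVector&)>& kernel) {
        for (size_t i = 0; i < active_set.size(); ++i)
            f[i] += delta_alpha * kernel(active_set[i], x_k);
    }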
For each modification of a support vector multiplier, the main computation required is that of the kernels K(x, xk) between the support vector xk and each vector x of the active set. The kernel K(x1, x2) is usually a function of the dot product x1 · x2 between the two vectors. This includes most major vector kernels (Vapnik, 1998): polynomial, Gaussian and Sigmoid kernels. For instance, the Gaussian kernel can be written as

K(x1, x2) = exp(−(‖x1‖² + ‖x2‖² − 2 x1 · x2)/σ²),        (2)

where the norms ‖x1‖ and ‖x2‖ are computed in advance.
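In this form, the only part of (2) that depends on both vectors is the dot product x1 · x2, which is where the sparse-data optimization applies. A minimal C++ illustration, assuming the squared norms have been precomputed once per vector (names are illustrative):

    #include <cmath>

    // Gaussian kernel of equation (2): once the dot product x1.x2 is known
    // and the squared norms are cached, the kernel value costs only a few
    // scalar operations.
    double gaussian_kernel(double dot, double sq_norm1, double sq_norm2,
                           double sigma) {
        return std::exp(-(sq_norm1 + sq_norm2 - 2.0 * dot) / (sigma * sigma));
    }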
sMxV(M, x) for (i=0; i
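A hedged C++ reconstruction of a generic sMxV routine (the loop body is an assumption, not the paper's code): row i of M holds training example xi, and out[i] receives the dot product xi · x, computed by merging index-sorted sparse lists.

    #include <vector>

    struct Feature { int index; double value; };     // (feature index, value)
    typedef std::vector<Feature> SparseVector;

    // Generic example-major sparse matrix-vector multiplication:
    // out[i] = M[i] . x, one merge of two index-sorted lists per row.
    std::vector<double> sMxV(const std::vector<SparseVector>& M,
                             const SparseVector& x) {
        std::vector<double> out(M.size(), 0.0);
        for (size_t i = 0; i < M.size(); ++i) {
            size_t a = 0, b = 0;
            while (a < M[i].size() && b < x.size()) {
                if (M[i][a].index == x[b].index) {
                    out[i] += M[i][a].value * x[b].value; ++a; ++b;
                } else if (M[i][a].index < x[b].index) {
                    ++a;
                } else {
                    ++b;
                }
            }
        }
        return out;
    }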