IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005


Fast Modular Network Implementation for Support Vector Machines Guang-Bin Huang, Senior Member, IEEE, K. Z. Mao, Chee-Kheong Siew, Member, IEEE, and De-Shuang Huang, Senior Member, IEEE

Abstract—Support vector machines (SVMs) have been extensively used. However, it is known that SVMs face difficulty in solving large complex problems due to the intensive computation involved in their training algorithms, which are at least quadratic with respect to the number of training examples. This paper proposes a new, simple, and efficient network architecture which consists of several SVMs, each trained on a small subregion of the whole data sampling space, and the same number of simple neural quantizer modules which inhibit the outputs of all the remote SVMs and only allow a single local SVM to fire (produce actual output) at any time. In principle, this region-computing based modular network method can significantly reduce the learning time of SVM algorithms without sacrificing much generalization performance. The experiments on a few real large complex benchmark problems demonstrate that our method can be significantly faster than single SVMs without losing much generalization performance.

Index Terms—Large complex problems, modular network, neural quantizer module, region computing, support vector machines (SVMs).

I. INTRODUCTION

SUPPORT VECTOR MACHINES (SVMs) have attracted a lot of interest from researchers and have been extensively used in widespread applications. However, due to the intensive computational complexity of their training algorithms, which are at least quadratic with respect to the number of training examples, it is difficult to deal with large problems using single conventional SVMs. Recently, many researchers have been investigating efficient methods to make SVMs applicable to large complex applications. Basically, such research activities have focused on two directions: 1) approaches that make single SVMs themselves applicable to large complex problems; and 2) combination models which consist of several SVMs, each solving a small problem.

One approach to making single SVMs tractable on problems with many training examples is to decompose those problems into a series of smaller tasks [1]–[4]. This decomposition (as implemented in the SVMlight package [2], [5]) splits the quadratic programming (QP) optimization problem into an inactive and an active part, the so-called "working set," and can finally solve the problem efficiently.

Manuscript received November 12, 2003; revised December 16, 2004. G.-B. Huang, K. Z. Mao, and C.-K. Siew are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore (e-mail: [email protected]). D.-S. Huang is with the Intelligent Computing Lab, Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China. Digital Object Identifier 10.1109/TNN.2005.857952

Without modifying single SVMs themselves, the combination model of SVMs partitions the whole set of training observations into several subsets (of approximately the same size) and trains each individual SVM on its respective training subset. Each of these SVMs solves a small problem, and the whole problem can then be solved by a combination of these individual SVMs. One of the combination models of SVMs is called the mixture of SVMs [6], [7]. Gate networks are used in such implementations to weight the contribution of the individual SVMs to the whole problem. It is noted that gate networks could become the bottleneck in such SVM mixtures, so the overall performance of such mixture implementations may not be optimal. For instance, the gate network could be implemented using neural networks (usually feedforward networks) [6], [7], and the issues existing in neural networks (i.e., lower learning speed, local minima, etc.) may also be inherited and tend to appear in such mixtures of SVMs. If SVMs are themselves much faster than neural networks in an application, the learning time of the mixture model may be much larger than the sum of the learning time spent in the individual SVMs, due to the slower gate network implementation. With the introduction of gating networks, some computational overhead is most likely introduced into the mixture model as well. It would therefore be attractive if the time spent on training the gating network could be reduced to nearly zero. The Bayesian committee machine (BCM) [8], [9] is another type of combination model of SVMs. Similar to the mixture of SVMs [6], [7], extra efforts such as the estimation of predictive probability densities need to be made by users.

The major aim of this paper is to provide a faster generic implementation approach for any SVM learning algorithm, including the decomposition algorithms [2]–[4], that is as simple as possible and does not modify the single SVM learning algorithms themselves. This paper proposes a generic modular network of SVMs by simply extending the modular network architecture designed for feedforward neural networks in our previous work [10] to SVM applications. Different from subset based approaches [2], [3], [6], [7], in our modular network implementation the whole input space is arbitrarily divided into nonoverlapped subregions and individual SVMs are assigned to solve small problems limited within these subregions. In the new modular network architecture for SVMs, the gate networks of the SVM mixture are replaced by neural quantizer modules which are implemented with almost zero time. The learning time of the modular network is almost equal to the total time spent on training all the individual sub-SVMs



without any extra overhead computation. Since the learning time for individual sub-SVMs can be dramatically decreased, the overall learning time spent on the modular network is much less than the time spent on a single SVM. Simulation results shown in Section IV demonstrate that the learning speed of our modular network can be significantly faster than a single SVM implementation without sacrificing much generalization performance.

In the mixture of SVM models proposed by Kwok [11] and Collobert et al. [6], [7], multiplicative units are used in the second layer of the model (cf. [11, Fig. 1]) to multiply the input from the SVM expert and the input from the gate network. Different from this type of multiplicative unit, within our proposed neural quantizer modular network architecture all the computation units are designed to use summing neurons, whose input is the sum of the weighted inputs from the linked neurons.1 Neural quantizer modules are used to inhibit the outputs of the individual SVMs whose specialized subregions are not the ones the inputs come from, and to allow the local SVMs specializing in the corresponding subregions to fire. Unlike BCM [8], [9] and the mixture of SVMs [6], [7], which may require users to have knowledge of probability or neural networks, the modular network of SVMs proposed in this paper is so simple that its users need no knowledge or background other than SVMs.

This paper is organized as follows. Section II briefs the computation complexity of single SVMs. Section III first reviews the modular network for feedforward networks introduced by Huang [10] and then extends it to a generic modular network applicable to those learning algorithms, including SVMs, whose complexity factors are larger than one.2 Experimental results on a few real benchmarking large complex classification and function approximation problems are given in Section IV. Some discussions are given in Section V and conclusions are drawn in Section VI.

II. COMPUTATION COMPLEXITY OF SINGLE SVMS

A brief discussion of the computation complexity of support vector machines is given here.

A. Support Vector Classification

Given a set of training samples $(\mathbf{x}_i, y_i)$, $i = 1, \dots, N$, where $\mathbf{x}_i \in \mathbf{R}^n$ and $y_i \in \{-1, +1\}$, an SVM is used to find a hyperplane to separate these samples correctly. The decision function of the SVM [13]–[16] is

    $f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b \Big)$    (1)

where each Lagrange multiplier $\alpha_i$ corresponds to a training example $\mathbf{x}_i$ and $K(\mathbf{x}, \mathbf{x}_i)$ is a kernel function. Training an SVM is equivalent to solving the following optimization problem:

    $\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
    subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$, $0 \le \alpha_i \le C$, $i = 1, \dots, N$    (2)

where $C$ is a user-specified constant.

B. Support Vector Machine for Regression (SVR)

Given a set of training samples $(\mathbf{x}_i, y_i)$, $i = 1, \dots, N$, where $\mathbf{x}_i \in \mathbf{R}^n$ and $y_i \in \mathbf{R}$, the output function of the SVR [3], [14], [16] is

    $f(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^{*}) K(\mathbf{x}, \mathbf{x}_i) + b$    (3)

where the Lagrange multipliers $\alpha_i$ and $\alpha_i^{*}$ correspond to the training example $\mathbf{x}_i$. Training an SVR is equivalent to solving the following optimization problem:

    $\max_{\alpha, \alpha^{*}} \; -\varepsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{N} y_i (\alpha_i - \alpha_i^{*}) - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*}) K(\mathbf{x}_i, \mathbf{x}_j)$
    subject to $\sum_{i=1}^{N} (\alpha_i - \alpha_i^{*}) = 0$, $0 \le \alpha_i, \alpha_i^{*} \le C$, $i = 1, \dots, N.$    (4)

Both $C$ and $\varepsilon$ must be selected by users. Without loss of generality, we choose the radial basis function as the kernel in this paper: $K(\mathbf{x}, \mathbf{x}_i) = \exp(-\|\mathbf{x} - \mathbf{x}_i\|^2 / (2\sigma^2))$.

Both optimization problems (2) and (4) are quadratic programming (QP) problems where the number of parameters is equal to $N$, the number of training data. The computation complexity of SVMs may be $O(N^2)$ or even higher [2], [3], [6], [17].
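As a quick illustration of the user-specified constants in problems (2) and (4), the following sketch trains a single RBF-kernel SVM classifier and a single SVR on synthetic data. It uses scikit-learn's LIBSVM-based SVC and SVR classes as a stand-in for the compiled LIBSVM package cited later in the paper; the data, the parameter values, and the gamma parameterization of the RBF width are illustrative assumptions, not settings from the paper.

    # Minimal single-SVM baseline (assumption: scikit-learn's LIBSVM-based SVC/SVR).
    import numpy as np
    from sklearn.svm import SVC, SVR

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 8))                     # N training inputs in R^n
    y_cls = np.where(X[:, 0] + 0.1 * rng.standard_normal(1000) > 0, 1, -1)
    y_reg = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1000)

    # C is the user-specified constant of (2) and (4); epsilon additionally
    # appears in the SVR problem (4); gamma = 1 / (2 * sigma^2) sets the RBF width.
    svc = SVC(C=10.0, kernel="rbf", gamma=0.5).fit(X, y_cls)
    svr = SVR(C=10.0, kernel="rbf", gamma=0.5, epsilon=0.1).fit(X, y_reg)
    print(svc.score(X, y_cls), svr.score(X, y_reg))

The number of dual variables equals the number of training data, which is why the training cost of such a single SVM grows at least quadratically with the size of the training set.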

1Refer to Schmitt [12] for clear descriptions of multiplicative and summing neurons/units.

2If, for an application with $N$ training observations, the computational complexity of a learning algorithm is $O(N^{\gamma})$, the complexity factor of the learning algorithm for this application is defined as $\gamma$ in the context of this paper.

III. GENERIC MODULAR NETWORK MODEL

In this section, we first give a brief introduction to the modular network model for feedforward networks proposed in our earlier work [10], and then extend it to a generic modular network architecture applicable to most applications, including SVM learning algorithms whose computation complexity factor is larger than one.

A. Generic Modular Network Architecture

Given the training set $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$ sampled from the whole input space $X$, a vector $\mathbf{c}$ can be randomly generated and the training samples reindexed such that $\mathbf{c}\cdot\mathbf{x}_1 \le \mathbf{c}\cdot\mathbf{x}_2 \le \cdots \le \mathbf{c}\cdot\mathbf{x}_N$. Let

    $a_0 < a_1 < \cdots < a_L$    (5)

be the points which partition the interval $[\mathbf{c}\cdot\mathbf{x}_1, \mathbf{c}\cdot\mathbf{x}_N]$ into $L$ subintervals $[a_{p-1}, a_p)$, $p = 1, \dots, L$, each containing $\lceil N/L \rceil$ consecutive reindexed training inputs, where $\lceil u \rceil$ denotes the nearest integer greater than or equal to $u$. Thus, the whole input sampling space can be divided into $L$ subregions

    $X_p = \{\mathbf{x} \in X : a_{p-1} \le \mathbf{c}\cdot\mathbf{x} < a_p\}, \quad p = 1, \dots, L$    (6)

which means the whole input space is arbitrarily divided into $L$ subregions by $L-1$ parallel hyperplanes, and each subregion consists of almost $\lceil N/L \rceil$ training inputs.

Fig. 1. A specific TLFN architecture with $L$ neural quantizer modules and $L$ SLFN learning machines $M_p$ ($p = 1, \dots, L$) sharing common hidden neurons. The $p$th neuron in the second hidden layer is the output neuron of the $p$th SLFN learning machine $M_p$. The two neurons $A_p$ and $B_p$ of the $p$th neural quantizer module link only to the $p$th neuron in the second hidden layer, and their connections have the same weight. The weights of the connections linking the input layer to the $A$-type neurons and to the $B$-type neurons are $T\mathbf{c}$ and $-T\mathbf{c}$, respectively [10].

The modular network model proposed in our earlier work [10] is a type of compact two-hidden-layer feedforward network (TLFN) which consists of $L$ neural quantizer modules and $L$ single-hidden-layer feedforward networks (SLFNs), each specializing locally on a subregion $X_p$ (cf. Fig. 1). A sigmoidal activation function is used in all the neurons except for the output neuron. Each neural quantizer module simply consists of two neurons, labeled $A_p$ and $B_p$, respectively. The input neurons are linked to all these SLFNs and neural quantizer modules, which form the first hidden layer of the TLFN. The weight vector linking the input neurons to neuron $A_p$ is $T\mathbf{c}$ and the weight vector linking the input neurons to neuron $B_p$ is $-T\mathbf{c}$, where the parameter $T$, called the neural quantizer factor in our network architecture, should be set large enough; the biases of neurons $A_p$ and $B_p$ are set according to the partition points $a_{p-1}$ and $a_p$, respectively, and the weights linking the neural quantizer modules to the second hidden layer are set accordingly (see [10]). The neural quantizer modules can inhibit the outputs of those SLFNs whose specialized subregions the input does not come from, and make them nearly zero, by adjusting the quantizer factor $T$. Thus, the $p$th neuron in the second layer of the TLFN (the output of the $p$th SLFN) produces the target output if and only if the input comes from the $p$th subregion, whereas the outputs of the other neurons are inhibited by the neural quantizer modules and can be neglected. (Refer to [10] for details.)

In fact, we can obtain a more generic modular network model which may be applicable to many learning algorithms. By simply replacing these SLFNs with any learning machines $M_p$, we obtain the modular network architecture shown in Fig. 2. These learning machines can be similar to or quite different from each other. They can be neural networks, support vector machines, or other types of learning machines. Suppose that the target function is $t(\mathbf{x})$; shift the target function up by a constant $U$ so that the shifted target is nonnegative, and let the bias of the output neuron of the modular network be $-U$. For the sake of simplicity, the activation function used in the neural quantizer modules is chosen as the hardlimit function shown in Fig. 3(a). In this case, the neural quantizer factor can simply be chosen as one. Thus, for all applications/cases the weight vector of the connections linking the input neurons to neuron $A_p$ and the weight vector of the connections linking the input neurons to neuron $B_p$ simply become $\mathbf{c}$ and $-\mathbf{c}$, respectively. The activation function used in the second hidden layer is chosen as the hillside function shown in Fig. 3(b). Since these learning machines may not be neural network based and their output unit may not be a neuron, we can let the output of each learning machine simply link to one and only one neuron in the second hidden layer of the modular network. That is, the output of the $p$th learning machine links to the $p$th neuron in the second hidden layer only (cf. Fig. 2). At any time, only one neuron of the second hidden layer of the network transfers the output of its corresponding learning machine to the output neuron of the modular network architecture, and the outputs of all the other neurons are inhibited.
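To make the subregion construction of (5) and (6) concrete, the sketch below generates a random direction c, sorts the training inputs by their projections c . x, and cuts them into L contiguous, roughly equal groups; the partition points a_p are then placed between neighboring groups. This is a hypothetical helper written for illustration (the function names and the use of NumPy are assumptions, not code from the paper).

    import numpy as np

    def partition_into_subregions(X, L, rng=None):
        """Split the input space into L subregions bounded by parallel hyperplanes
        orthogonal to a random direction c (cf. (5) and (6))."""
        rng = np.random.default_rng(rng)
        c = rng.standard_normal(X.shape[1])        # random projection vector c
        proj = X @ c                               # c . x_i for every training input
        order = np.argsort(proj)                   # reindex so the projections increase
        groups = np.array_split(order, L)          # about ceil(N/L) samples per group
        # place each partition point a_p midway between two neighboring groups
        boundaries = np.array([(proj[g[-1]] + proj[groups[p + 1][0]]) / 2.0
                               for p, g in enumerate(groups[:-1])])
        return c, boundaries, groups

    def subregion_index(x, c, boundaries):
        """Index p of the subregion X_p that contains the input x."""
        return int(np.searchsorted(boundaries, x @ c, side="right"))

Routing any input then costs only one dot product and a binary search over the L-1 partition points, which is why the quantizer stage adds essentially no training or testing overhead.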

Fig. 2. A generic modular network architecture with $L$ learning machines $M_p$ ($p = 1, \dots, L$) and $L$ neural quantizer modules.

Fig. 3. Two types of activation functions: (a) the hardlimit function used in the neural quantizer modules; (b) the hillside function used in the second hidden layer of the designed modular network of SVMs.

For instance, if the input belongs to the $p$th subregion $X_p$, the output of the $p$th neural quantizer module is zero and the outputs of the others are one. Regarding the $p$th neuron in the second hidden layer, the input from the $p$th neural quantizer module is zero and the sum of its inputs is equal to the output of the $p$th learning machine; thus, its output is the same as the output from the $p$th learning machine. For all the other learning machines, the input from their corresponding neural quantizer modules is negative; by setting the neural quantizer factor large enough, the sum of the inputs to all these neurons becomes negative and the outputs of these neurons become zero. By setting the neural quantizer factor large enough, the neural quantizer modules can therefore always inhibit all the outputs from the other learning machines but allow the output of the one learning machine specializing on a subregion to be passed further to the final output neuron of the modular network.

Suppose that the learning complexity of the machine used is $O(N^{\gamma})$, where $N$ is the number of training observations. The learning complexity of each submachine is then $O((N/L)^{\gamma})$, which is $1/L^{\gamma}$ of the complexity of the single SVM. Since the learning time spent on the neural quantizer modules is nearly zero, it can be estimated that the total learning time spent on the whole modular network is almost equal to the sum of the learning times of the individual learning machines: $L \cdot O((N/L)^{\gamma}) = O(N^{\gamma}/L^{\gamma-1})$. That means, in principle, for learning algorithms (i.e., SVMs, etc.) whose complexity factor $\gamma$ is larger than one, the learning speed can increase at least $L^{\gamma-1}$ times if these learning algorithms are implemented with the proposed modular network.

B. Modular Network for SVMs

Now, we can consider a modular architecture of SVMs consisting of $L$ SVMs and $L$ corresponding quantizers. Simply choose SVMs as the individual learning machines in the modular network architecture shown in Fig. 2. Since the radial basis function kernel3 $K(\mathbf{x}, \mathbf{x}_i) = \exp(-\|\mathbf{x} - \mathbf{x}_i\|^2/(2\sigma^2))$ is chosen in this paper, $0 < K(\mathbf{x}, \mathbf{x}_i) \le 1$, the quantizer factor can be chosen as $U_p$ with $w_p = -U_p$ for both the modular network of SVMs and the modular network of SVRs. Thus, for the modular network of SVMs (including SVRs), the weights linking the quantizers and the neurons in the second hidden layer of the modular network architecture are

    $w_p = -U_p = -\Big( C \sum_{i=1}^{N_p} |y_i^{(p)}| + |b_p| \Big), \quad p = 1, \dots, L$    (7)

where $b_p$ is the bias of the $p$th SVM, $y_i^{(p)}$ the label of the $i$th training sample in the $p$th SVM, and $N_p$ the number of training data in the $p$th subregion $X_p$.

3Although we take the radial basis function kernel as an example in the context of this paper, the proposed modular network is applicable to all types of SVM kernels $K(\mathbf{x}, \mathbf{x}_i)$. In general, according to (1) and (3), we can choose $U = \max_p \big( C \sum_{i=1}^{N_p} |y_i^{(p)}| \max_j K(\mathbf{x}_i, \mathbf{x}_j) + |b_p| \big)$ and $w_p = -U$.
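The inhibition mechanism behind (7) can be checked with simple arithmetic: with a hillside activation g(u) = max(u, 0) in the second hidden layer and w_p = -U_p chosen at least as large in magnitude as any (shifted, nonnegative) sub-SVM output, a quantizer output of 1 drives the corresponding neuron's net input below zero, so only the local sub-SVM passes through. The following snippet is a numerical illustration under these stated assumptions; the values and the max(u, 0) form of the hillside function are choices made for the sketch, not quantities taken from the paper.

    import numpy as np

    def hillside(u):
        # assumed hillside activation: passes nonnegative net inputs, outputs 0 otherwise
        return np.maximum(u, 0.0)

    f_shifted = np.array([3.2, 0.7, 5.9])  # shifted (nonnegative) outputs of L = 3 sub-SVMs
    U = 10.0                               # bound on the magnitude of any sub-SVM output
    w = -U                                 # weight from each quantizer module, as in (7)

    Q = np.array([1.0, 0.0, 1.0])          # quantizer outputs: 0 only for the local sub-SVM
    second_layer = hillside(f_shifted + w * Q)
    print(second_layer)                    # [0.  0.7 0. ]  only the local sub-SVM fires

Summing the second-layer outputs (and removing the fixed shift at the output neuron) therefore reproduces the local sub-SVM's prediction alone.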


TABLE I PERFORMANCE COMPARISON OF SINGLE SVMS AND MODULAR NETWORK ON MNIST OCR PROBLEM

The output of the modular network is given by

    $y(\mathbf{x}) = \sum_{p=1}^{L} g\big( f_p(\mathbf{x}) + w_p Q_p(\mathbf{x}) \big)$    (8)

(up to the fixed output-neuron bias introduced by the target shift discussed in Section III-A), where $f_p(\mathbf{x})$ is the output of the $p$th SVM for input $\mathbf{x}$, $Q_p(\mathbf{x})$ is the output of the $p$th corresponding neural quantizer module taking $\mathbf{x}$ also as input, and $g$ is the hillside activation function of the second hidden layer. Suppose that each SVM can exactly learn the (shifted) target function $t(\mathbf{x})$ in its specialized subregion: $f_p(\mathbf{x}) = t(\mathbf{x})$ if $\mathbf{x} \in X_p$. By construction, $Q_p(\mathbf{x}) = 0$ iff $\mathbf{x} \in X_p$ and $Q_p(\mathbf{x}) = 1$ iff $\mathbf{x} \notin X_p$. For any input $\mathbf{x}$ there exists a subregion $X_q$ such that $\mathbf{x} \in X_q$; we then have $g(f_q(\mathbf{x}) + w_q Q_q(\mathbf{x})) = f_q(\mathbf{x}) = t(\mathbf{x})$ and $g(f_p(\mathbf{x}) + w_p Q_p(\mathbf{x})) = 0$ for all $p \ne q$. Thus, $y(\mathbf{x}) = t(\mathbf{x})$.

Algorithm: Given a training set $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$:

step 1 Dividing the training sample space into $L$ subregions:
(a) Randomly generate a vector $\mathbf{c}$ and reindex the training samples such that $\mathbf{c}\cdot\mathbf{x}_1 \le \cdots \le \mathbf{c}\cdot\mathbf{x}_N$.
(b) Divide the whole input space into $L$ groups according to (6): $X_p = \{\mathbf{x} : a_{p-1} \le \mathbf{c}\cdot\mathbf{x} < a_p\}$, $p = 1, \dots, L$.
step 2 Initialization: Set the learning parameters for the sub-SVMs.
step 3 Learning step:
(a) Train sub-SVM $M_p$ on the $p$th subregion $X_p$, $p = 1, \dots, L$.
(b) Train the neural quantizer modules:
(i) Set all the weights $w_p$ of the connections linking the neural quantizer modules to the second hidden layer according to (7).
(ii) Set the weight vector of the connections linking the input neurons to all $A$-type neurons $A_p$ as $\mathbf{c}$ and the weight vector of the connections linking the input neurons to all $B$-type neurons $B_p$ as $-\mathbf{c}$.
(iii) Set the biases of neurons $A_p$ and $B_p$ according to the partition points $a_p$ obtained in (5), $p = 1, \dots, L$.
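Because the quantizer modules amount to routing every input to the single sub-SVM of its subregion, steps 1-3 above can be prototyped compactly. The sketch below reuses the hypothetical partition_into_subregions and subregion_index helpers from the earlier sketch together with scikit-learn's LIBSVM-based SVC; it is a functional illustration of the algorithm under those assumptions, not the authors' implementation, and the class name and parameters are invented for the example.

    import numpy as np
    from sklearn.svm import SVC

    class ModularSVM:
        """Train L independent sub-SVMs, one per subregion (steps 1-3 above)."""

        def __init__(self, L=10, C=10.0, gamma=0.5, seed=0):
            self.L, self.C, self.gamma, self.seed = L, C, gamma, seed

        def fit(self, X, y):
            # step 1: divide the training sample space into L subregions
            self.c, self.boundaries, groups = partition_into_subregions(X, self.L, rng=self.seed)
            # steps 2 and 3(a): set the learning parameters and train one sub-SVM per subregion
            # (assumes every subregion contains samples from at least two classes)
            self.experts = [SVC(C=self.C, kernel="rbf", gamma=self.gamma).fit(X[g], y[g])
                            for g in groups]
            # step 3(b) is implicit here: the routing in predict() plays the role of the quantizers
            return self

        def predict(self, X):
            # each test input is sent to one and only one sub-SVM, mimicking the
            # inhibition of all remote sub-SVMs by the neural quantizer modules
            p = np.array([subregion_index(x, self.c, self.boundaries) for x in X])
            out = np.empty(len(X))
            for q, expert in enumerate(self.experts):
                if np.any(p == q):
                    out[p == q] = expert.predict(X[p == q])
            return out

Each sub-SVM sees only about N/L training samples, so the overall training cost is roughly L times the cost of one small QP rather than the cost of a single large QP.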

IV. BENCHMARKING WITH REAL-WORLD LARGE COMPLEX APPLICATIONS

Some simulations have been conducted to test the performance (learning speed and generalization performance) of the proposed modular network architecture for SVMs. These simulations include eight real-world large-scale benchmark classification and function approximation problems. All simulations were run on an ordinary PC with a single 2.99-GHz CPU. The popular compiled C-coded SVM package LIBSVM [18] is used in our simulations. The basic algorithm of this C-coded SVM package is a simplification of three key works: the original SMO by Platt [19], SMO's modification by Keerthi et al. [20], and SVMlight by Joachims [5].

A. Optical Character Recognition (OCR): MNIST Problem

In the first classification experiment, the MNIST database [21] of handwritten digits is used to test the performance of the proposed algorithm.4 In this database, each observation represents a handwritten digit from zero to nine and is composed of 28 x 28 pixels (784 input attributes per observation). There are in total 70 000 observations in the database. Two simulations, with 20 000 and 60 000 training samples, respectively, have been conducted. Ten trials, each with randomly generated training and testing data, have been conducted for all the cases. As seen from Table I, for the simulation with 20 000 training samples and 50 000 testing samples, the performance for the single SVM obtained by us is similar to the results obtained by Lin et al. [22] if the difference in CPU speed between their computer and ours is considered. When the whole input space is randomly divided into two parallel subregions ($L = 2$), the learning speed is increased 1.6 times. When the number of training samples is increased by three times to 60 000, it takes more than 1 h for a single SVM to complete learning; however, the modular network with six sub-SVMs spends only 12.4 min to reach close generalization performance. Compared to the single SVM implementation, the learning speed of the modular network has been increased 6.03 times. In both cases, the learning speed is increased by around $L$ times, where $L$ is the number of subregions divided from the whole input space.

4We would like to thank C.-J. Lin for providing the scaled MNIST OCR database during our personal communication.


TABLE II PERFORMANCE COMPARISON OF SINGLE SVMS AND MODULAR NETWORK ON FOREST TYPE PREDICTION

TABLE III INFORMATION OF THE BENCHMARKING PROBLEMS: SATIMAGE, SHUTTLE, AND BANANA

On the other hand, the testing time is decreased dramatically as well. Since the neural quantizers are used in the modular network, at any time only one sub-SVM receives the testing data from its specified subregion while the other sub-SVMs are inhibited, so the overall computation for testing data is much lower even though the whole modular network may have more support vectors (computation units) than a single SVM. (Refer to Section V-C for more analysis on testing complexity.)

B. Forest Cover Type Prediction

This is an extremely large complex classification problem with seven classes. The forest cover type [23] for 30 x 30 meter cells was obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. There are 581 012 instances (observations) and 54 attributes per observation. In order to compare with the previous works [6], [7], it was similarly modified into a binary classification problem where the goal is to separate class 2 from the other six classes. Simulations with 100 000 and 400 000 training samples have been conducted, respectively. Fifty trials have been conducted for all the simulations of the modular network and a single trial for the single SVM, since the single SVM needs to spend a huge amount of training and testing time on this dataset. As seen from Table II, for the simulation with 100 000 training observations and 481 012 testing observations, a single SVM spends 480.23 min of CPU time on learning. We choose the

same number of sub-SVMs in our modular network as done in Collobert et al. [6]; the proposed modular network spent only 2.13 min on learning, which is 230 times faster than the single SVM. For this case, the proposed modular network appears to be 2.8 times faster than the hard probabilistic mixture by Collobert et al. [7], after considering the fact that the hard probabilistic mixture is 80 times faster than a single SVM in the single-CPU environment (cf. [7, Tables I and III]). For the simulation with 400 000 training observations and 181 012 testing observations, the training time spent on the single SVM is 3.8 days (single CPU). However, the proposed modular network with 200 sub-SVMs spent only 8.57 min (single CPU) on learning this very large problem. The learning speed has been dramatically increased, by 640 times. The gated SVM mixture [6] needs to spend around 7 h running in parallel on 50 CPUs, and the hard probabilistic mixture by Collobert et al. [7] spent around 10 h on a single CPU.

C. Some Other Benchmarking Classification Problems

The modular network performance has also been tested on the large database Banana5 and some multiclass databases from the Statlog collection [23]: Satimage and Shuttle. We randomly choose 50% of the data as training data and leave the rest as testing data for all cases. The information on the number of data, number of attributes, and number of classes is listed in Table III. Five trial simulations have been conducted for the single SVM running on the Banana problem and 50 trials have been conducted for all the other cases.

5http://www.first.gmd.de/~raetsch/data


TABLE IV PERFORMANCE COMPARISON OF SINGLE SVMS AND MODULAR NETWORK ON BENCHMARKING PROBLEMS: SATIMAGE, SHUTTLE, AND BANANA

TABLE V PERFORMANCE COMPARISON OF SINGLE SVRS AND MODULAR NETWORK ON CALIFORNIA HOUSING PREDICTION

The training of the single SVM for the Banana problem obviously takes much longer than the modular network implementation. The average results have been obtained for each case (cf. Table IV). The generalization performance obtained by the modular network is very close to the generalization performance of the single SVMs. However, different from many other cases, when the Banana dataset was tested it was found that the testing performance clearly improves when the number of subregions is increased. The possible reasons for the performance improvement in the Banana case will be analyzed in Section V-B.

D. Function Approximation Problem: California Housing Prediction

California Housing is a dataset obtained from the StatLib repository.6 There are 20 640 observations for predicting the price of houses in California. Information on the variables was collected using all the block groups in California from the 1990 Census. In this sample, a block group on average includes 1425.5 individuals living in a geographically compact area.

6http://www.niaad.liacc.up.pt/~ltorgo/Regression/cal_housing.html

Naturally, the geographical area included varies inversely with the population density. Distances among the centroids of each block group were computed as measured in latitude and longitude. All the block groups reporting zero entries for the independent and dependent variables were excluded. The final data contained 20 640 observations on nine variables, consisting of eight continuous inputs (median income, housing median age, total rooms, total bedrooms, population, households, latitude, and longitude) and one continuous output (median house value). For simplicity, the eight input attributes and the one output have been normalized to the range [0, 1] in our experiment.

For this problem, the performance comparison between single SVMs and the modular network of SVMs was first conducted on 8000 training data and 12 640 testing data randomly generated from the California Housing database, and then on randomly generated 16 000 training data and 4640 testing data. Fifty trials have been conducted for all the simulations and the average results have been obtained, as shown in Table V. The learning speed of the modular network of SVMs is increased by around $L$ times, where $L$ is the number of subregions


divided from the whole input space. The generalization performance obtained by the modular network is very close to the generalization performance of single SVMs.

V. DISCUSSIONS

A. Architectures of Modular Network and Mixture Network

Kwok [11] incorporated SVMs into the mixture of experts model [24], [25] to form a support vector mixture. Although such a support vector mixture allows different SVM experts in different regions of the whole input space, as analyzed in Kwok [11], compared with single SVMs the QP problems of the support vector mixture model proposed by Kwok [11] are basically unchanged, with only slight changes in the set of linear constraints and an extra term introduced for the gating networks. Thus, it basically faces the same difficulty in solving large scale problems as a single SVM. Collobert et al. [6], [7] further proposed a mixture implementation of SVMs where each SVM is trained on a subset selected from the whole input space based on some criteria.7 Compared with a single SVM implementation, the overall computation complexity for these SVM experts may be dramatically reduced. However, in the mixture of SVM models, a kind of gate network needs to be used to coordinate the contributions of the SVM experts to the overall performance of the mixture model. The gate networks can be implemented with either a neural network or a probabilistic model, which are obviously much more complicated than our proposed neural quantizer modules. Even though the individual SVM experts can run very fast in each subset of the input space, the learning performance of the gate network could become a bottleneck of the mixture of SVM model. In some cases, the computational overhead spent on the gating networks may possibly be much longer than the time spent on the SVM experts themselves. For example, in some cases neural networks may not perform better than SVMs, and the gating network tends to become the bottleneck.

With the implementation of neural quantizer modules, the gate network implementation can be reduced to almost zero and such bottlenecks no longer exist. Since the learning time spent on the neural quantizer modules in our proposed modular network is nearly zero, the total learning time spent on our proposed modular network of SVMs is exactly equal to, and only depends on, the time spent on the individual SVMs. There is no extra computational overhead in the implementation of the newly proposed network. The role of a neural quantizer module is to let the sub-SVM fire if the input belongs to its specialized region and to inhibit the outputs of all the other sub-SVMs whose specialized regions are not the one the input belongs to. The outputs of the other sub-SVMs are generally not zero and their values are unknown (possibly even larger than the output of the sub-SVM specializing in the region), and they will be inhibited by the corresponding neural quantizer modules. On the other hand, the neural quantizer modules do not have any effect on the output of the sub-SVM which is supposed to react and fire correctly.

7For example, these subsets may be reconstructed and iteratively repartitioned during the training loop according to the values of the gaters and the number of training samples in the SVM experts [6], [7].

B. Subregions and Subsets of the Whole Input Space

It may be interesting to briefly analyze the reasons why subregions instead of subsets are preferred in our implementation. Suppose that the whole input space is randomly divided into subsets.8 It may be noted that, from the statistical point of view, these subsets may have no major difference except for the difference in the number of contained training observations. Obviously, the experts specializing in these subsets could tend to have similar performance, since the statistical features of these different subsets are likely similar. When the whole input space is instead divided into nonoverlapped subregions by several parallel hyperplanes, the individual SVM experts tend to specialize in their own regions without any interference from other subregions. In the proposed modular network of SVMs, each individual SVM only cares about its own specialized subregion. For example, for the Banana problem as shown in Section IV, the distribution of its two-dimensional training observations is as shown in Fig. 4(a); it can be arbitrarily divided into two subregions by an arbitrary hyperplane as shown in Fig. 4(b). It can be seen that the distributions of these two subregions are quite different, and an SVM specializing in each such subregion should have its behavior adapted to its corresponding subregion. However, when the whole space is randomly divided into two subsets as shown in Fig. 5, there are obviously no major differences between these two subsets. It is reasonable to think that the SVM experts trained on these two subsets tend to have similar behaviors.

Fig. 4. (a) Distribution of the training data of a two-class problem: Banana. (b) Two different subregions of the input space of Banana.

Fig. 5. Two subsets of the input space of Banana, which look similar. Two SVM experts tend to have similar performance when trained on these two subsets.

Seen from the experimental results, the testing performance of the modular network is often slightly inferior to that of the single SVM (except on the Banana dataset). One could think that, if some performance degradation can be tolerated by users, a much simpler approach to handling large datasets could be to perform data sampling (i.e., choosing one subset as aforementioned) and then train a single SVM. The facts are: 1) if a small number of data is sampled, the generalization performance obtained by the single SVM will be very low in many cases; 2) in order to maintain generalization performance similar to or higher than that obtained by the modular network, the number of sampled data should not be too small, but then the training time and testing time will be very high compared with those spent by the modular network with a larger number of training data. Take the Forest case for example: as shown in Table II, the single SVM still needs to spend 480.23 min on training a small dataset with 100 000 training data, while the proposed modular network only needs to spend 8.57 min on a large dataset with 400 000 training data, and both have the same generalization performance.

"Random partitioning" used in our proposed modular network is not fully "random." The direction of its partitioning hyperplanes can be randomly chosen; however, these hyperplanes are parallel and need to balance the number of data in each subregion. Thus, it may help to balance, in each partitioning, the information which needs to be learned. For example, two different partitionings (with different partitioning directions) are shown in Fig. 6, where the different subregions have almost the same amount of data although the distances among the hyperplanes may be different. Although the Banana dataset (low input dimension) also looks complex as a single dataset trained by a single SVM, when more and more partitions are produced the dataset in each region tends to become separable (for example, the subregions shown in Fig. 6) and the generalization performance of the modular network tends to be improved [cf. Fig. 7(a)]. The modular network may produce slightly worse generalization performance for applications with high input dimension and complicated data distribution. For example, in the Forest case, as shown in Fig. 7(b), the generalization performance will slightly reduce when the number of partitioned subregions is increased.

8It should be noted that the subsets used in Collobert et al. [6], [7] are not randomly partitioned at the end of the training, since there exists a kind of loop in the training which allows these subsets to be reconstructed/repartitioned according to the values of the gaters and the number of training samples in the SVM experts.

C. Computational Complexity

1) Complexity Factor: Given $N$ training data, suppose that the training complexity of a single SVM on the whole input training data is $O(N^{\gamma})$, where $\gamma$, called the complexity factor in the context of this paper, depends on the adopted learning algorithm and the application themselves. In the implementation of the modular network of SVMs, the whole input space is divided into $L$ subregions, each consisting of around $N/L$ training data. In real applications, the distribution of the subregions may be different and the complexity of the subregions may be different region by region.

Fig. 6. Different random partitionings with different partitioning directions.

In general, it can be estimated that the average computational complexity of these sub-SVMs is around $O((N/L)^{\gamma})$. The total computational complexity of the $L$ sub-SVMs is then estimated as $L \cdot O((N/L)^{\gamma}) = O(N^{\gamma}/L^{\gamma-1})$, since the training time of the neural quantizer modules is nearly zero and can be ignored. That means the learning speed can be increased at least $L^{\gamma-1}$ times. For single SVM algorithms, the complexity factor is normally $\gamma \ge 2$. Thus, the learning speed can be increased by the proposed modular network at a factor of $L$, as verified by the simulation results discussed in Section IV. In general, the higher the complexity factor $\gamma$ is, the more speedup the modular network can achieve.
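The estimate above reduces to simple arithmetic: the total sub-SVM cost L * (N/L)^gamma is smaller than the single-SVM cost N^gamma by a factor of L^(gamma - 1), independent of N. The tiny sketch below just evaluates this claimed speedup for a few illustrative values of L and gamma (the numbers are examples, not measurements from Section IV).

    def theoretical_speedup(L, gamma):
        # N**gamma / (L * (N / L)**gamma) simplifies to L**(gamma - 1), independent of N
        return L ** (gamma - 1)

    for L in (2, 6, 50, 200):
        # with the typical complexity factor gamma = 2 the speedup grows linearly in L
        print(L, theoretical_speedup(L, gamma=2), theoretical_speedup(L, gamma=2.5))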

Fig. 7. Random partitioning may have different impacts in different cases. (a) The testing error decreases in the Banana case when more subregions are partitioned. (b) The testing error slightly increases in the forest cover type prediction case when more subregions are partitioned.

2) Scale Factor: Suppose that the total number of training data is increased $k$ times to $kN$. The number of subregions can also be increased $k$ times (to $kL$) so that the number of training data for each region stays fixed at around $N/L$ and the computational complexity of each sub-SVM remains almost unchanged. Different from the gating mixture of SVMs, the training time spent on the neural quantizer modules of the modular network of SVMs is nearly zero and can be ignored. Thus, the total computational complexity of the modular network of SVMs for the increased training set is $kL \cdot O((N/L)^{\gamma})$, which is around $k$ times that of the original modular network implementation before the training data are scaled. That means the computational complexity of the proposed modular network of SVMs increases only linearly with the number of training data. This has also been evidenced by all the simulation results shown in Section IV.
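The linear-scaling argument can be checked the same way: if both the number of training data and the number of subregions are multiplied by k, each sub-problem keeps the same size, so the total cost grows only by the factor k. A hypothetical back-of-the-envelope check (the values of N, L, and gamma are illustrative):

    def total_cost(N, L, gamma):
        # L sub-SVMs, each trained on about N / L samples at cost (N / L)**gamma
        return L * (N / L) ** gamma

    N, L, gamma = 100_000, 50, 2
    for k in (1, 2, 4, 8):
        # multiplying both N and L by k multiplies the total training cost by k only
        print(k, total_cost(k * N, k * L, gamma) / total_cost(N, L, gamma))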


Fig. 8. The relationship between learning speedup and the number of partitioned subregions: forest cover type prediction case. One may need to find an acceptable tradeoff between the expected learning speedup and the expected generalization performance by properly adjusting the number of partitioned subregions in practical applications.

3) Testing Speed: In most of the tested cases, the total number of support vectors obtained by all sub-SVMs of the proposed modular network is larger than that of the corresponding single SVM implementation. However, our modular network still has a much faster testing speed. In the implementation of our proposed modular network, similarly to the training phase, during the testing stage each testing datum is simply input to the one and only one sub-SVM specializing in the subregion where that datum is located, and each sub-SVM has far fewer computational units (support vectors). In the implementation of a single SVM, by contrast, each testing datum needs to be computed on all computational units (support vectors), which are many more than those of each sub-SVM of the modular network.

D. Tradeoff Between Learning Speedup and Generalization Performance

Although the proposed modular network can significantly speed up the learning process, one may need to find an acceptable tradeoff between the expected learning speedup and the expected generalization performance by properly adjusting the number of partitioned subregions in practical applications. One cannot be greedy only about the learning speedup without considering the generalization performance. For example, for the Forest case with 100 000 training data (cf. Table II and Fig. 8), if the number of partitioned subregions is set as 50, the learning speed can be increased at a factor of 230, but the obtained successful testing rate is only around 89.01%, which is lower than the single SVM's 89.90%. That means some loss of generalization performance may need to be tolerated by users if speed, rather than accuracy, is the main concern, which is the price that one has to pay for the simplicity of the method.9

9In this extreme case, one might think that feedforward neural networks trained by a stochastic descent method look more attractive than the proposed modular network. However, as analyzed by LeCun et al. [26], stochastic learning can be picked with "careful tuning" (i.e., learning rate, learning epochs, network architecture (number of hidden neurons and hidden layers), etc.) when the task is classification and the training set is large and redundant. The proposed simple modular network does not face these trivial issues, is efficient for both classification and regression applications, and can be applied easily.


However, in fact, when applying the proposed modular network one can achieve a reasonable learning speedup without significantly losing generalization performance by properly adjusting the number of partitioned subregions.10 For the Forest case (cf. Table II and Fig. 8), if the number of partitioned subregions is set as 10, the learning speed can still be increased by 18 times and the successful testing rate is 89.74%, which is very close to the generalization performance (89.90%) of the single SVM. For the Shuttle case (cf. Table IV), if the number of partitioned subregions is set as 25, the difference in the successful testing rate between the proposed modular network (99.851%) and the single SVM (99.884%) is only 0.03%, while the learning speed can be increased at a factor of 27. For the California Housing Prediction case (cf. Table V), if the number of partitioned subregions is set as 10, the difference in the testing RMSE between the proposed modular network and the single SVM is only around 0.005, while the learning speed can be increased at a factor of 10. The differences between the generalization performance achieved by the single SVM and the proposed modular network are very small and could be neglected in these cases.

10It is known that SVM generally outperforms many other algorithms; thus, it is reasonable to conjecture that the proposed modular network could outperform many other learning algorithms by properly adjusting the number of partitioned subregions. Comparison between the proposed modular network and other non-SVM based learning methods is beyond the scope of this paper and is worth investigating in our future study.

VI. CONCLUSION

This paper has proposed a simple, fast, generic SVM modular network for large complex applications with generalization performance very close to the single SVM implementation. In applications, the whole input space is first arbitrarily divided into several nonoverlapped subregions and each SVM of the modular network is trained in one subregion. For each SVM, there is a corresponding neural quantizer module which simply consists of only two hardlimit neurons. If an input belongs to the subregion in which an SVM specializes, the SVM is allowed to fire and its output finally becomes the actual output of the modular network; otherwise the output of this SVM is inhibited by the corresponding neural quantizer module so that it does not interfere with the other SVMs. Different from subset based methods, the separate learning machines within this modular network do not interfere with each other and, thus, no coordinator among them is required. This modular network is simple in the sense that:
1) overall it is a feedforward neural network;
2) the units of such a feedforward network are either simple summing neurons or complete learning machines;
3) in principle, most learning algorithms whose complexity factor is larger than one could be used in such a designed modular network without any modification;
4) users do not need specialist knowledge (such as neural networks, probability, etc.) other than the used learning machines (i.e., SVMs in this paper); in fact, since the modular network is so simple, average users without background on SVMs can apply the proposed modular network if the popular straightforward compiled C-coded SVM package LIBSVM [18] is used;
5) there are no extra parameters to be tuned except for those required by the learning machines themselves.

Since there is no extra computational complexity introduced into the modular network, the implementation speed can be dramatically increased. As demonstrated by the simulation results, the learning speed of our modular network can be significantly higher than that of its corresponding single SVM implementation without sacrificing much generalization performance. On average, the training time of the proposed modular network of SVMs appears to increase linearly with the number of training observations.

ACKNOWLEDGMENT The authors would like to thank the anonymous Associate Editor and three reviewers for their constructive comments. The authors also wish to thank C.-J. Lin for providing the LIBSVM SVM packages and the scaled MNIST OCR database, and R. Collobert for helpful discussions.

REFERENCES

[1] E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: an application to face detection,” in Proc. CVPR’97, Jun. 17–19, 1997.
[2] T. Joachims, “Making large-scale support vector machine learning practical,” in Advances in Kernel Methods: Support Vector Machines, B. Schölkopf, C. Burges, and A. Smola, Eds. Cambridge: MIT Press, 1998.
[3] A. Smola and B. Schölkopf, “A tutorial on support vector regression,” NeuroCOLT2 Tech. Rep. NC2-TR-1998-030, 1998.
[4] J.-X. Dong, C. Y. Suen, and A. Krzyżak, “A fast SVM training algorithm,” Int. J. Pattern Recogn. Artif. Intell., vol. 17, no. 3, pp. 367–384, 2003.
[5] T. Joachims. (2003) SVMlight—Support Vector Machine. Dep. Computer Science, Cornell Univ., Ithaca, NY. [Online]. Available: http://svmlight.joachims.org/
[6] R. Collobert, S. Bengio, and Y. Bengio, “A parallel mixture of SVMs for very large scale problems,” Neural Comput., vol. 14, pp. 1105–1114, 2002.
[7] R. Collobert, Y. Bengio, and S. Bengio, “Scaling large learning problems with hard parallel mixtures,” Int. J. Pattern Recogn. Artif. Intell., vol. 17, no. 3, pp. 349–365, 2003.
[8] V. Tresp, “A Bayesian committee machine,” Neural Comput., vol. 12, pp. 2719–2741, 2000.
[9] V. Tresp, “Scaling kernel-based systems to large data sets,” Data Mining and Knowledge Discovery, vol. 5, no. 3, 2001.
[10] G.-B. Huang, “Learning capability and storage capacity of two-hidden-layer feedforward networks,” IEEE Trans. Neural Netw., vol. 14, no. 2, pp. 274–281, 2003.
[11] J. T.-Y. Kwok, “Support vector mixture for classification and regression problems,” in Proc. Int. Conf. Pattern Recognition (ICPR’98), Brisbane, Australia, Aug. 1998, pp. 255–258.
[12] M. Schmitt, “On the complexity of computing and learning with multiplicative neural networks,” Neural Comput., vol. 14, pp. 241–301, 2001.
[13] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[14] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[15] C. Cortes and V. N. Vapnik, “Support vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.
[16] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[17] J. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Machines, B. Schölkopf, C. Burges, and A. Smola, Eds. Cambridge: MIT Press, 1998.
[18] C.-C. Chang and C.-J. Lin. (2003) LIBSVM—A Library for Support Vector Machines. Dep. Computer Science and Information Engineering, National Taiwan Univ., Taiwan. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[19] J. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” Microsoft Res. Tech. Rep. MSR-TR-98-14, 1998.
[20] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvement to Platt’s SMO algorithm for SVM classifier design,” Neural Comput., vol. 13, pp. 637–649, 2001.
[21] Y. LeCun. (2003) The MNIST Database of Handwritten Digits. NEC Res. Inst. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[22] K.-M. Lin and C.-J. Lin, “A study on reduced support vector machines,” IEEE Trans. Neural Netw., vol. 14, no. 6, pp. 1449–1459, 2003.
[23] C. Blake and C. Merz. (1998) UCI Repository of Machine Learning Databases. Dep. Information and Computer Sciences, Univ. California, Irvine. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[24] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Comput., vol. 3, pp. 79–87, 1991.
[25] M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,” Neural Comput., vol. 6, pp. 181–214, 1994.
[26] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Lecture Notes in Computer Science (also available from citeseer.ist.psu.edu/lecun98efficient.html), vol. 1524, pp. 9–50, 1998.
[27] G.-B. Huang and C.-K. Siew, “Extreme learning machine: RBF network case,” in Proc. 8th Int. Conf. Control, Automation, Robotics and Vision (ICARCV 2004), Kunming, China, Dec. 6–9, 2004.

Guang-Bin Huang (M’98–SM’04) received the B.Sc. degree in applied mathematics and the M.Eng. degree in computer engineering from Northeastern University, P. R. China, in 1991 and 1994, respectively. He received the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore, in 1999. During the undergraduate period, he also concurrently studied in the Wireless Communication Department, Northeastern University. From June 1998 to May 2001, he was a Research Fellow with Singapore Institute of Manufacturing Technology (formerly known as Gintic Institute of Manufacturing Technology) where he led/implemented several key industrial projects. Since May 2001, he has been an Assistant Professor with the Information Communication Institute of Singapore (ICIS), School of Electrical and Electronic Engineering, Nanyang Technological University. His current research interests include machine learning, bioinformatics, and networking. Dr. Huang is an Associate Editor of Neurocomputing.

K. Z. Mao was born in Shandong, China, on March 11, 1967. He received the Ph.D. degree from University of Sheffield, U.K., in 1998. He worked as a Research Associate with the University of Sheffield for six months, and then joined Nanyang Technological University (NTU), Singapore, as a Research Fellow. Since June 2001, he has been an Assistant Professor with the School of Electrical and Electronic Engineering, NTU, Singapore. His current research interests include machine learning, data mining, biomedical engineering, and bioinformatics.


Chee-Kheong Siew (M’92) received the B.Eng. degree in electrical engineering from the University of Singapore in 1979 and the M.Sc. degree in communication engineering from Imperial College, U.K., in 1987. He is currently an Associate Professor with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. From 1995 to 2005, he served as the Head of Information Communication Institute of Singapore (ICIS) after he managed the transfer of ICIS to NTU and rebuilt the institute in the university environment. After six years in the industry, he joined NTU in 1986 and was appointed as the Head of the Institute in 1996. His current research interests include neural networks, packet scheduling, traffic shaping, admission control, service curves and admission control, QoS framework, congestion control, and multipath routing.


De-Shuang Huang (SM’98) received the B.Sc., M.Sc., and Ph.D. degrees, all in electronic engineering, from the Institute of Electronic Engineering, Hefei, China, the National Defense University of Science and Technology, Changsha, China, and Xidian University, Xi’an, China, in 1986, 1989, and 1993, respectively. From 1993 to 1997, he was a Postdoctoral student with the Beijing Institute of Technology and with the National Key Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing. In September 2000, he joined the Institute of Intelligent Machines, Chinese Academy of Sciences, as a recipient of the “Hundred Talents Program of CAS.” From September 2000 to March 2001, he worked as a Research Associate with the Hong Kong Polytechnic University. From April 2002 to June 2003, he worked as a Research Fellow with the City University of Hong Kong. From August to September 2003, he was a Visiting Professor with George Washington University, Washington, DC. From October to December 2003, he worked as a Research Fellow with the Hong Kong Polytechnic University. From July to December 2004, he was a University Fellow with Hong Kong Baptist University. He has published more than 190 papers. He is the author of Systematic Theory of Neural Networks for Pattern Recognition, which won the Second-Class Prize of the 8th Excellent High Technology Books of China, and, in 2001, of another book, Intelligent Signal Processing Technique for High Resolution Radars.
