Online Feature Selection Algorithm with Bayesian ℓ1 Regularization

Yunpeng Cai¹, Yijun Sun², Jian Li¹, and Steve Goodison³

¹ Department of Electrical and Computer Engineering
² Interdisciplinary Center for Biotechnology Research
³ Department of Surgery, University of Florida, Gainesville, FL 32610, USA
Abstract. We propose a novel online-learning based feature selection algorithm for supervised learning in the presence of a huge number of irrelevant features. The key idea of the algorithm is to decompose a nonlinear problem into a set of locally linear ones through local learning, and then estimate the relevance of features globally in a large margin framework with ℓ1 regularization. Unlike batch learning, the regularization parameter in online learning has to be tuned on-the-fly as training data accumulates. We address this issue within the Bayesian learning paradigm, and provide an analytic solution for automatic estimation of the regularization parameter via variational methods. Numerical experiments on a variety of benchmark data sets are presented that demonstrate the effectiveness of the newly proposed feature selection algorithm.

1 Introduction

High-throughput technologies now routinely produce large datasets characterized by unprecedented numbers of features. This seriously undermines the performance of many data analysis algorithms in terms of both speed and accuracy. Accordingly, across various scientific disciplines, there has been a surge in demand for efficient feature selection methods for high-dimensional data. Not only can a properly designed feature selection method enhance classification performance and reduce system complexity, but it can also provide significant insight into the nature of the problem under investigation in many applications. Feature selection for high-dimensional data is considered one of the current challenges in statistical machine learning [1]. Existing algorithms are traditionally categorized as wrapper or filter methods, with respect to the criterion used to search for relevant features [2]. One major issue with wrapper methods is their high computational complexity. Many heuristic algorithms (e.g., forward and backward selection [3]) have been proposed to alleviate this issue. However, due to their heuristic nature, none of them can provide any guarantee of optimality. In the presence of many thousands of features, a hybrid approach is usually adopted, wherein the number of features is first reduced by a filter method, and a wrapper method is then used on the reduced feature set. Nevertheless, it may still take several hours to perform the search, depending on the classifier used in the wrapper method.

This work is supported in part by the Komen Breast Cancer Foundation under grant No. BCTR0707587. Please address all correspondence to [email protected].

T. Theeramunkong et al. (Eds.): PAKDD 2009, LNAI 5476, pp. 401–413, 2009. © Springer-Verlag Berlin Heidelberg 2009


Embedded methods [4] have recently received increasing interest. In contrast to wrapper methods, embedded methods incorporate feature selection directly into the learning process of a classifier. A feature weighting strategy is usually adopted that uses real-valued numbers, instead of binary ones, to indicate the relevance of features during learning. This strategy has several advantages. For example, there is no need to pre-specify the number of relevant features, and standard optimization techniques can be used to avoid combinatorial search. Consequently, embedded methods are usually computationally more tractable than wrapper methods. Yet computational complexity remains a major issue when the number of features becomes excessively large.

We recently developed a new feature selection algorithm, referred to as LOFE (LOcal learning based FEature selection), that addresses several major issues with existing methods [5]. The key idea is to decompose an arbitrary nonlinear problem into a set of locally linear ones through local learning, and then estimate the relevance of features globally in a large margin framework with ℓ1 regularization. The algorithm is computationally very efficient: it can process many thousands of features within a few minutes on a personal computer, yet maintains a very high accuracy that is nearly insensitive to a growing number of irrelevant features. Theoretical analysis suggests that the algorithm has a logarithmic sample complexity with respect to data dimensionality [5]; that is, the number of samples needed to maintain the same level of learning accuracy grows only logarithmically with the data dimensionality.

In this paper, we extend LOFE to online learning, where data arrive sequentially and the estimate of feature relevance is improved with each new sample without total recalculation. Although online feature selection is increasingly in demand in many real-time systems [6,7], the issue is rarely addressed in the literature. We develop a new online learning algorithm by using stochastic approximation. The algorithm makes no assumption about the data distribution, and is thus applicable to general problems. One major challenge in designing online learning algorithms is the estimation of model parameters, specifically the regularization parameter in our case. Unlike batch learning, where the parameter can be estimated through cross-validation by analyzing the entire training data, online learning has to tune the parameter automatically on-the-fly as training data accumulates. We address this issue within the Bayesian learning paradigm, and provide an analytic solution for automatic estimation of the regularization parameter via variational methods. Numerical experiments based on a variety of benchmark data sets are presented that demonstrate the effectiveness of the newly proposed algorithm.

2 Local Learning Based Feature Selection Algorithm

This section presents a review of the LOFE algorithm. Let $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ denote a training dataset, where $\mathbf{x}_n \in \mathbb{R}^M$ is the n-th data sample and $y_n \in \{0, 1\}$ is its corresponding class label. We are interested in problems where $M \gg N$. We start by defining the margin. Given a distance function, we find two nearest neighbors of each sample x_n, one from the same class (called the nearest hit, NH) and the other from the opposite class (called the nearest miss, NM) [8]. Following the work of [4], the margin of x_n is defined as ρ_n = d(x_n, NM(x_n)) − d(x_n, NH(x_n)), where d(·) is the distance function.


For the purpose of this paper, we use the block distance to define a sample's margin and nearest neighbors, although other standard distance definitions may also be used. An intuitive interpretation of this margin is a measure of how much x_n can "move" in the feature space before being misclassified. By the large margin theory [9], a classifier that minimizes a margin-based error function usually generalizes well on unseen test data. One natural idea, then, is to scale each feature, thus obtaining a weighted feature space parameterized by a nonnegative vector w, so that a margin-based error function in the induced feature space is minimized. The margin of x_n, computed with respect to w, is given by

\[
\rho_n(\mathbf{w}) = d(\mathbf{x}_n, \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w}) - d(\mathbf{x}_n, \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w}). \tag{1}
\]
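To make the construction concrete, here is a minimal sketch (not from the paper; NumPy-based, with illustrative function names) of the block-distance margin of a single sample, assuming a data matrix X of shape N×M and labels y in {0, 1}:

```python
import numpy as np

def block_dist(a, b, w=None):
    """Weighted L1 (block) distance between two samples."""
    d = np.abs(a - b)
    return d.sum() if w is None else (w * d).sum()

def margin(X, y, n, w=None):
    """Margin of sample n: distance to nearest miss minus distance to nearest hit."""
    N = len(X)
    d = np.array([block_dist(X[n], X[i], w) for i in range(N)])
    hits = np.where((y == y[n]) & (np.arange(N) != n))[0]
    misses = np.where(y != y[n])[0]
    nh = hits[np.argmin(d[hits])]      # nearest hit
    nm = misses[np.argmin(d[misses])]  # nearest miss
    return d[nm] - d[nh]
```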

By defining z_n = |x_n − NM(x_n)| − |x_n − NH(x_n)|, where | · | is the element-wise absolute-value operator, ρ_n(w) can be simplified as ρ_n(w) = w^T z_n, which is a linear function of w and has the same form as the sample margin defined in SVM using a kernel function. An important difference, however, is that by construction the magnitude of each element of w in the above margin definition reflects the relevance of the corresponding feature in a learning process. This is not the case in SVM, except when a linear kernel is used, which can capture only linear discriminant information. Note that the margin thus defined requires only information about the neighborhood of x_n, while no assumption is made about the underlying data distribution. This implies that through local learning we can transform an arbitrary nonlinear problem into a set of locally linear ones. The local linearization of a nonlinear problem enables us to estimate the feature weights by using a linear model that has been extensively studied in the literature. The main problem with the above margin definition, however, is that the nearest neighbors of a given sample are unknown before learning. In the presence of thousands of irrelevant features, the nearest neighbors defined in the original space can be completely different from those in the induced space. To account for the uncertainty in defining local information, we develop a probabilistic model in which the nearest neighbors of a given sample are treated as latent variables. Following the principles of the expectation-maximization algorithm [10], we estimate the margin by taking the expectation of ρ_n(w), averaging out the latent variables:

\[
\bar{\rho}_n(\mathbf{w}) = \mathbf{w}^T \Big( \sum_{i \in \mathcal{M}_n} P(\mathbf{x}_i = \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w})\, |\mathbf{x}_n - \mathbf{x}_i| \;-\; \sum_{i \in \mathcal{H}_n} P(\mathbf{x}_i = \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w})\, |\mathbf{x}_n - \mathbf{x}_i| \Big) = \mathbf{w}^T \bar{\mathbf{z}}_n,
\]

where $\mathcal{M}_n = \{i : 1 \le i \le N, y_i \ne y_n\}$, $\mathcal{H}_n = \{i : 1 \le i \le N, y_i = y_n, i \ne n\}$, and P(x_i = NM(x_n)|w) and P(x_i = NH(x_n)|w) are the probabilities that sample x_i is the nearest miss or nearest hit of x_n, respectively. These probabilities are estimated through standard kernel density estimation:

\[
P(\mathbf{x}_i = \mathrm{NM}(\mathbf{x}_n) \mid \mathbf{w}) = \frac{k(\|\mathbf{x}_n - \mathbf{x}_i\|_{\mathbf{w}})}{\sum_{j \in \mathcal{M}_n} k(\|\mathbf{x}_n - \mathbf{x}_j\|_{\mathbf{w}})}, \quad \forall i \in \mathcal{M}_n, \tag{2}
\]

\[
P(\mathbf{x}_i = \mathrm{NH}(\mathbf{x}_n) \mid \mathbf{w}) = \frac{k(\|\mathbf{x}_n - \mathbf{x}_i\|_{\mathbf{w}})}{\sum_{j \in \mathcal{H}_n} k(\|\mathbf{x}_n - \mathbf{x}_j\|_{\mathbf{w}})}, \quad \forall i \in \mathcal{H}_n, \tag{3}
\]


where k(·) is a kernel function. Specifically, we use the exponential kernel k(d) = exp(−d/δ), where the kernel width δ determines the resolution at which the data is locally analyzed. Once the margins are defined, the problem of learning feature weights can be solved directly within a margin framework. For computational convenience, we perform the estimation in the logistic regression formulation. In applications with a huge number of features (e.g., molecular classification [11]), we expect most features to be irrelevant. To encourage sparseness, one commonly used strategy is to add an ℓ1 penalty on w to the objective function [12,13], which leads to the following optimization problem:

\[
\min_{\mathbf{w}} \; \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + \exp(-\mathbf{w}^T \bar{\mathbf{z}}_n)\right) + \lambda \|\mathbf{w}\|_1 \quad \text{s.t.} \quad \mathbf{w} \ge 0, \tag{4}
\]

where λ is a parameter that controls the penalty strength and, consequently, the sparseness of the solution. Since z̄_n implicitly depends on w through the probabilities P(x_i = NH(x_n)|w) and P(x_i = NM(x_n)|w), we use a fixed-point iteration method to solve for w. In each iteration, z̄_n is first computed using the previous estimate of w, which is then updated by solving the optimization problem (4). The iterations are carried out until convergence. It is interesting to note that although local learning is a highly nonlinear process, in each iteration we deal with a linear model. For fixed z̄_n, (4) is a constrained convex optimization problem. Due to the nonnegativity constraint on w, it cannot be solved directly by a gradient descent method. To overcome this difficulty, we reformulate the problem slightly as

\[
\min_{\mathbf{v}} \; \frac{1}{N} \sum_{n=1}^{N} \ln\Big(1 + \exp\Big(-\sum_{m} v_m^2\, \bar{z}_{n(m)}\Big)\Big) + \lambda \|\mathbf{v}\|_2^2, \tag{5}
\]

thus obtaining an unconstrained optimization problem. Here v_m is the m-th element of v. It is easy to show that at the optimum we have w_m = v_m^2, 1 ≤ m ≤ M. The solution of v can be readily found through gradient descent with a simple update rule:

\[
\mathbf{v} \leftarrow \mathbf{v} - \eta \left( \lambda \mathbf{1} - \frac{1}{N} \sum_{n=1}^{N} \frac{\exp\big(-\sum_{m} v_m^2\, \bar{z}_{n(m)}\big)}{1 + \exp\big(-\sum_{m} v_m^2\, \bar{z}_{n(m)}\big)} \, \bar{\mathbf{z}}_n \right) \otimes \mathbf{v}, \tag{6}
\]

where ⊗ is the Hadamard (element-wise) product and η is the learning rate determined by a line search. Note that the objective function (5) is no longer convex, so a gradient descent method may find a local minimizer or a saddle point. It can be shown, however, that if all elements of the initial point are nonzero, the solution obtained when the gradient vanishes is a global minimizer [5]. Moreover, by using the Banach fixed-point theorem [14], it can be proved that the algorithm converges to a unique solution for any nonnegative initial feature weights, under the loose condition that the kernel width is sufficiently large [5]. It is interesting to note that even if the initial feature weights were wrongly selected and the algorithm started with erroneous nearest neighbors for each sample, the theorem assures that the algorithm eventually converges to the same solution obtained with perfect prior knowledge.
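The fixed-point iteration described above can be sketched as follows. This is a simplified full-batch illustration under the block distance and exponential kernel, not the authors' implementation; the helper names are ours, and X, y are assumed to be a NumPy data matrix and binary label vector.

```python
import numpy as np

def zbar(X, y, w, delta=2.0):
    """Expected margin vectors z_bar_n under the current weights w (Eqs. 2-3)."""
    N = len(X)
    Z = np.zeros_like(X, dtype=float)
    for n in range(N):
        diff = np.abs(X - X[n])                         # |x_n - x_i|, shape (N, M)
        dist = diff @ w                                 # weighted block distances
        k = np.exp(-dist / delta)                       # exponential kernel
        hits = (y == y[n]) & (np.arange(N) != n)
        miss = y != y[n]
        p_nm = np.where(miss, k, 0.0) / k[miss].sum()   # P(x_i = NM(x_n) | w)
        p_nh = np.where(hits, k, 0.0) / k[hits].sum()   # P(x_i = NH(x_n) | w)
        Z[n] = p_nm @ diff - p_nh @ diff
    return Z

def batch_step(X, y, v, lam, eta, delta=2.0):
    """One fixed-point iteration: recompute z_bar, then one gradient step on v (Eq. 6)."""
    w = v ** 2                                   # feature weights w_m = v_m^2
    Z = zbar(X, y, w, delta)
    s = 1.0 / (1.0 + np.exp(Z @ w))              # sigma(-w^T z_bar_n)
    grad = lam - (s[:, None] * Z).mean(axis=0)   # bracketed term in Eq. (6)
    return v - eta * grad * v                    # Hadamard product with v
```

In practice one would iterate batch_step, with η chosen by a line search, until v stabilizes, as described above.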


The computational complexity of LOFE is O(N^2 M), which is linear with respect to feature dimensionality. In contrast, some popular greedy search methods (e.g., forward search) require on the order of O(M^2) moves in feature space.

3 Online Learning

LOFE is based on batch learning; that is, feature weights are updated after seeing all of the training data. When the amount of training data is enormous, or when training data arrives sequentially, online learning is computationally much more attractive than batch learning. We propose a new online learning algorithm based on stochastic approximation techniques [15]. The key idea is to estimate the gradient of the objective function from individual samples, and then perform one gradient-descent step to obtain a solution that reduces the objective function. The theory of stochastic gradient descent assures that, with a carefully selected step size, the algorithm converges to a fixed-point solution identical to that obtained by batch learning. To distinguish sequentially arriving samples from those used in batch learning, we use k, instead of n, to index samples in the sequel. At the k-th sample, we approximate the gradient of the objective function (5) as

\[
Q(\mathbf{v}_k) = \left( \lambda \mathbf{1} - \sigma(-\mathbf{w}^T \bar{\mathbf{z}}_k)\, \bar{\mathbf{z}}_k \right) \otimes \mathbf{v}_k, \tag{7}
\]

where σ(x) is the sigmoid function, defined as σ(x) = 1/(1 + exp(−x)). The stochastic gradient method gives the following update rule for v:

\[
\mathbf{v}_{k+1} = \mathbf{v}_k - \alpha_k\, Q(\mathbf{v}_k), \tag{8}
\]

where α_k = η/k and η is a fixed step size. In each update step, the vector z̄_k is calculated from the k-th training sample based solely on the current value of v_k.
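A minimal sketch of one online update (Eqs. 7-8) follows. How z̄_k is formed for a newly arriving sample (e.g., against a buffer of previously seen data) is left open here, and the function name is illustrative.

```python
import numpy as np

def online_step(z_k, v, lam, eta, k):
    """One stochastic-gradient update of v for the k-th sample (Eqs. 7-8)."""
    w = v ** 2                               # current feature weights
    s = 1.0 / (1.0 + np.exp(z_k @ w))        # sigma(-w^T z_bar_k)
    q = (lam - s * z_k) * v                  # Q(v_k), Eq. (7)
    return v - (eta / k) * q                 # alpha_k = eta / k, Eq. (8)
```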

4 Bayesian ℓ1 Regularization with Variational Methods

Although the above online-learning formulation is straightforward, estimating the regularization parameter is not trivial. Unlike batch learning, where the parameter can be estimated through cross-validation by analyzing the full data set, online learning has to determine the parameter on-the-fly as training data accumulates. This issue is rarely addressed in the literature.

4.1 Bayesian Estimation of Regularization Parameters

It has recently been suggested that parameter estimation can be performed by applying a full Bayesian treatment to a parametric model, an approach also called evidence approximation in the literature [16,17]. The basic idea is to connect a hyper-parameter to the prior distribution of a model, and then select the parameter value that makes the model most consistent with the observed data. We give a brief review of evidence approximation below.


Bayesian learning treats a penalized loss function of the form L = L_D + λL_W as the negative logarithm of the following posterior distribution:

\[
p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w} \mid \lambda)}{p(\mathcal{D} \mid \lambda)} = \frac{1}{Z} \exp(-L_D)\, \exp(-\lambda L_W), \tag{9}
\]

where D is the training data and Z is a normalization constant. Thus, the empirical loss L_D is mapped to the likelihood function p(D|w), and the regularization term L_W to the prior distribution p(w|λ) of w. For ℓ1 regularization, p(w|λ) ∝ exp(−λ‖w‖₁). The prior p(w|λ) is usually chosen to be an exponential or Laplace distribution, depending on the range of w. Since in our case w ≥ 0, we use the isotropic exponential distribution

\[
p(\mathbf{w} \mid \lambda) = \lambda^M \exp(-\lambda \|\mathbf{w}\|_1). \tag{10}
\]

In evidence approximation, p(D|λ) = ∫ p(D|w) p(w|λ) dw is called the evidence function. Assuming a prior distribution p(λ), evidence approximation calculates the posterior distribution

\[
p(\lambda \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \lambda)\, p(\lambda)}{\int p(\mathcal{D} \mid \lambda)\, p(\lambda)\, d\lambda},
\]

and picks the most probable value λ_MAP that maximizes this posterior. It has been suggested that one can assume p(λ|D) to be sharply peaked near λ_MAP [18,19] and maximize p(D|λ) directly to obtain λ_MAP; we adopt this simplification in this paper. Obtaining a closed-form expression of the evidence function that allows for direct maximization is difficult, so approximation methods are usually adopted to simplify the optimization. Two commonly used approaches to approximating a probability distribution are the Laplace approximation [18] and variational methods [16,20]. It has been shown in [20] that variational methods usually produce more accurate results than the Laplace approximation. Moreover, the Laplace approximation is applicable only at local optima of a function, while variational methods can be used to approximate a function at arbitrary points. For the purpose of this paper, we use variational methods.

4.2 Variational Methods

Variational methods seek a local approximation of a convex or concave function via its first-order Taylor expansion. For example, for a concave function f(x), its Taylor expansion is y(x|ξ) = f(ξ) + f′(ξ)(x − ξ), and y(x|ξ) ≥ f(x) with equality at x = ξ. By varying y with ξ, f(x) = min_ξ y(x|ξ). Denoting ζ = f′(ξ), we have f(x) = min_ζ (ζx + H(ζ)). For the sigmoid function, ln σ(x) is concave. Hence ln σ(x) = min_ζ (ζx + H(ζ)), where H(ζ) = ζ ln ζ + (1 − ζ) ln(1 − ζ), and thus σ(x) = min_ζ exp(ζx + H(ζ)). By using variational methods, we represent an objective function with a simple linear approximation, with equality at the designated point ξ = x. Variational methods also transform a univariate function f(x) into a bivariate one, y(x|ξ), thus introducing extra degrees of freedom into the original problem we aim to solve. Denote by F(y(x), α) a functional of y(x) with parameter α. The problem of optimizing F(y(x), α) with respect to α is then transformed into the problem of optimizing an approximated functional F(y(x|ξ), ξ, α) with respect to ξ and α individually. The latter is often mathematically more tractable.
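For completeness, a short derivation of the bound quoted above (not spelled out in the paper): the tangent of ln σ(x) at a point ξ has slope ζ = σ(−ξ), and its intercept is exactly H(ζ):

\[
\begin{aligned}
\ln\sigma(x) &\;\le\; \ln\sigma(\xi) + \sigma(-\xi)\,(x-\xi) \quad \text{(tangent line of the concave } \ln\sigma\text{)},\\
\zeta &\;:=\; \sigma(-\xi) = 1-\sigma(\xi) \;\;\Longrightarrow\;\; \xi = \ln\tfrac{1-\zeta}{\zeta},\\
\ln\sigma(\xi) - \zeta\xi &\;=\; \ln(1-\zeta) - \zeta\ln\tfrac{1-\zeta}{\zeta} \;=\; \zeta\ln\zeta + (1-\zeta)\ln(1-\zeta) \;=\; H(\zeta),
\end{aligned}
\]

so ln σ(x) ≤ ζx + H(ζ) for every ζ ∈ (0, 1), with equality at ζ = σ(−x); minimizing over ζ recovers ln σ(x) = min_ζ (ζx + H(ζ)).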

4.3 Parameter Estimation

We apply variational methods to the objective function (4) to obtain its evidence function, and then estimate the regularization parameter by maximizing the so-obtained evidence function. Although our goal is to derive a parameter estimation method for online learning, we first work on the batch learning case, and then extend the result to online learning. We first rewrite Eq. (4) in the standard form of logistic regression by multiplying it by the sample size N, which yields

\[
\min_{\mathbf{w}} \; L(\mathbf{w}) = \sum_{n=1}^{N} \ln\left(1 + \exp(-\mathbf{w}^T \bar{\mathbf{z}}_n)\right) + \tilde{\lambda} \|\mathbf{w}\|_1 \quad \text{s.t.} \quad \mathbf{w} \ge 0, \tag{11}
\]

where λ̃ = Nλ. For notational simplicity, in the following we still write λ, instead of λ̃, when discussing batch learning, unless otherwise specified. We also define φ_n = z̄_n if y_n = 1, and φ_n = −z̄_n if y_n = 0. The likelihood function p(D|w) can then be expressed as

\[
p(\mathcal{D} \mid \mathbf{w}) = \prod_{n=1}^{N} \sigma(\mathbf{w}^T \boldsymbol{\phi}_n)^{y_n} \left(1 - \sigma(\mathbf{w}^T \boldsymbol{\phi}_n)\right)^{1-y_n} = \prod_{n=1}^{N} \exp(\mathbf{w}^T \boldsymbol{\phi}_n y_n)\, \sigma(-\mathbf{w}^T \boldsymbol{\phi}_n). \tag{12}
\]

By using the variational approximation of the sigmoid function together with Eqs. (10) and (12), the evidence function is given by

\[
p(\mathcal{D} \mid \lambda) = \int_{\mathbf{w} \ge 0} e^{-\lambda \|\mathbf{w}\|_1 + \sum_{n=1}^{N} \mathbf{w}^T \boldsymbol{\phi}_n y_n}\; \lambda^M \min_{\{\zeta_1,\cdots,\zeta_N\}} e^{\sum_{n=1}^{N} \left(-\mathbf{w}^T \boldsymbol{\phi}_n \zeta_n + H(\zeta_n)\right)} \, d\mathbf{w}. \tag{13}
\]

Note that the minimization is inside the integral, which makes the optimization mathematically intractable. Following the principles of variational methods [20], as explained in Sec. 4.2, we treat the parameters ζ = [ζ_1, ..., ζ_N] as independent of w so that we can move the minimization outside the integration. Integrating Eq. (13) over w then yields

\[
p(\mathcal{D} \mid \lambda) = \min_{\{\zeta_1,\cdots,\zeta_N\}} \left\{ \lambda^M \exp\Big(\sum_{n=1}^{N} H(\zeta_n)\Big) \prod_{m=1}^{M} \frac{1}{\lambda + \sum_{n=1}^{N} (\zeta_n - y_n)\, \phi_{n(m)}} \right\}. \tag{14}
\]

Denote by ζ* the variational parameters that minimize the right-hand side of (14). By using Eqs. (10), (12) and (14), the posterior distribution of w can then be written as

\[
p(\mathbf{w} \mid \mathcal{D}) = \prod_{m=1}^{M} \Big( \lambda + \sum_{n=1}^{N} (\zeta_n^* - y_n)\, \phi_{n(m)} \Big) \exp\Big( -\mathbf{w}^T \big( \lambda \mathbf{1} + \sum_{n=1}^{N} (\zeta_n^* - y_n)\, \boldsymbol{\phi}_n \big) \Big), \tag{15}
\]

which is a non-isotropic exponential distribution.
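As a brief check of the step from (13) to (14), which the text does not spell out, write a_m = λ + Σ_{n=1}^{N} (ζ_n − y_n)φ_{n(m)}; the integral over w ≥ 0 then factorizes componentwise:

\[
\int_{0}^{\infty} e^{-a_m w_{(m)}}\, dw_{(m)} = \frac{1}{a_m},
\qquad
a_m \int_{0}^{\infty} w_{(m)}\, e^{-a_m w_{(m)}}\, dw_{(m)} = \frac{1}{a_m}.
\]

The first identity gives the product in (14); the second is the mean of the resulting componentwise exponential posterior, which at ζ = ζ* is Eq. (16) below. Both require a_m > 0.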


Denote by w̄ the mean of w, the m-th element of which is

\[
\bar{w}_{(m)} = \Big( \lambda + \sum_{n=1}^{N} (\zeta_n^* - y_n)\, \phi_{n(m)} \Big)^{-1}. \tag{16}
\]

We then seek the optimal estimates of the variational parameters ζ of p(D|λ). Taking the derivative of the logarithm of p(D|λ) with respect to ζ_n and setting it to zero produces

\[
\frac{\partial \ln p(\mathcal{D} \mid \lambda)}{\partial \zeta_n} = \ln\Big(\frac{\zeta_n}{1 - \zeta_n}\Big) - \bar{\mathbf{w}}^T \boldsymbol{\phi}_n = 0. \tag{17}
\]

Hence, for a fixed λ, the optimal estimates of ζ are given by

\[
\zeta_n^* = \sigma(\bar{\mathbf{w}}^T \boldsymbol{\phi}_n), \qquad n = 1, \cdots, N. \tag{18}
\]

It is easy to prove that Eq. (18) has a unique solution, which can be found either by an iterative method or by Newton's method. After obtaining an approximation to the evidence function p(D|λ), we can estimate the hyper-parameter λ by maximizing it. The logarithm of Eq. (14) takes the form

\[
\ln p(\mathcal{D} \mid \lambda) = -\sum_{m=1}^{M} \ln\Big( \lambda + \sum_{n=1}^{N} (\zeta_n^*(\lambda) - y_n)\, \phi_{n(m)} \Big) + \sum_{n=1}^{N} H(\zeta_n^*(\lambda)) + M \ln \lambda. \tag{19}
\]

Since w̄, given by Eq. (16), is a function of λ, we denote it by w̄(λ) in the sequel. Taking the derivative of Eq. (19) with respect to λ and setting it to zero, we obtain the following iterated solution for λ:

\[
\lambda = \Big( \frac{1}{M} \sum_{m=1}^{M} \bar{w}(\lambda)_{(m)} \Big)^{-1}. \tag{20}
\]

With Eqs. (18) and (20), we are now able to determine the optimal choice of the regularization parameter λ by using an iterative method.
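A sketch of the resulting alternating scheme, assuming fixed margin vectors (hence fixed φ_n) and an illustrative iteration count; this is our reading of Eqs. (16), (18), and (20), not the authors' code:

```python
import numpy as np

def estimate_lambda_batch(Phi, y, lam0=1.0, iters=50):
    """Alternate Eqs. (16)/(18) and (20) to estimate the regularization parameter.

    Phi: (N, M) array of phi_n vectors; y: binary labels in {0, 1}.
    """
    N, M = Phi.shape
    lam = lam0
    zeta = np.full(N, 0.5)                          # initial variational parameters
    for _ in range(iters):
        a = lam + (zeta - y) @ Phi                  # a_m = lambda + sum_n (zeta_n - y_n) phi_{n(m)}
        a = np.maximum(a, 1e-12)                    # numerical guard: Eq. (14) needs a_m > 0
        w_bar = 1.0 / a                             # Eq. (16): posterior mean of w
        zeta = 1.0 / (1.0 + np.exp(-(Phi @ w_bar))) # Eq. (18): zeta_n* = sigma(w_bar^T phi_n)
        lam = M / w_bar.sum()                       # Eq. (20): lambda = (mean of w_bar)^{-1}
    return lam
```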

We now proceed to extend the above derivation to online learning. Using k, instead of N, to denote the sample size and applying λ̃ = kλ to Eq. (20) yields

\[
k\lambda = \Big( \frac{1}{M} \sum_{m=1}^{M} \Big( k\lambda + \sum_{n=1}^{k} (\zeta_n^* - y_n)\, \phi_{n(m)} \Big)^{-1} \Big)^{-1}. \tag{21}
\]

Define the k-th estimate of the weight mean w̄ as w̄_k. Using λ = λ̃/k in Eq. (16) yields

\[
\bar{w}_{k(m)} = \Big( k\lambda + \sum_{n=1}^{k} (\zeta_n^* - y_n)\, \phi_{n(m)} \Big)^{-1}. \tag{22}
\]


Hence, the k-th estimate of λ can be calculated as

\[
\lambda_k = \Big( \frac{1}{M} \sum_{m=1}^{M} k\, \bar{w}_{k(m)} \Big)^{-1}, \tag{23}
\]

where w̄_{k(m)} can be computed in an online manner:

\[
(\bar{w}_{k(m)})^{-1} = (\bar{w}_{k-1(m)})^{-1} + (u_{k(m)})^{-1}, \qquad u_{k(m)} = \big( \lambda_{k-1} + (\zeta_k^* - y_k)\, \phi_{k(m)} \big)^{-1}. \tag{24}
\]

Note that Eq. (14) makes sense only if w̄_{(m)} > 0. If this condition is violated, it implies that the currently estimated λ does not provide a sufficient penalty to force the feature weights to zero, and the estimated feature weights do not follow the exponential distribution. In this case, we artificially set w̄_{k(m)} = 0 to increase λ in the next iteration.
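A sketch of the online recursion of Eqs. (18), (23), and (24), with the zero-ing rule from the remark above; the running state and names are illustrative, not from the paper.

```python
import numpy as np

def update_lambda_online(inv_wbar, lam_prev, phi_k, y_k, k):
    """One online update of lambda for the k-th sample (Eqs. 18, 23, 24).

    inv_wbar: running (M,) array holding (w_bar_{k-1,(m)})^{-1}; updated and returned.
    """
    M = phi_k.shape[0]
    w_prev = np.where(inv_wbar > 0, 1.0 / inv_wbar, 0.0)      # previous mean, zeroed if invalid
    zeta_k = 1.0 / (1.0 + np.exp(-(w_prev @ phi_k)))          # Eq. (18), using w_bar_{k-1}
    inv_wbar = inv_wbar + lam_prev + (zeta_k - y_k) * phi_k   # Eq. (24): add (u_{k(m)})^{-1}
    w_bar = np.where(inv_wbar > 0, 1.0 / inv_wbar, 0.0)       # set w_bar_{k(m)} = 0 if non-positive
    lam_k = M / max(k * w_bar.sum(), 1e-12)                   # Eq. (23)
    return inv_wbar, lam_k
```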

5 Experiments

This section presents several numerical experiments that demonstrate the effectiveness of the newly proposed algorithm. We first perform a simulation study on the well-known Fermat's spiral data, a binary classification problem. Each class has 230 samples distributed in a two-dimensional space, forming a spiral shape. In addition to the first two relevant features, each sample is contaminated by a varying number of irrelevant features randomly sampled from the standard normal distribution. The spiral problem, though simple, poses a serious challenge for existing algorithms when contaminated by thousands of irrelevant features. Fig. 1 presents the feature weights learned by our algorithm and by three competing algorithms: SIMBA [4], RELIEF-F [8], and I-RELIEF [21]. We observe that our algorithm performs remarkably well over a wide range of feature dimensionalities, always yielding the largest weights for the first two relevant features, while all other weights remain below 10⁻⁴. The solution is nearly insensitive to a growing number of irrelevant features. In contrast, the three competing algorithms perform substantially worse than ours.

In the second experiment, we apply our algorithm to eight UCI benchmark data sets. The data information is summarized in Table 1. For each data set, the set of original features is augmented by 5000 artificially generated irrelevant features randomly sampled from the standard normal distribution.

Fig. 1. Feature weights learned on spiral data containing a varying number of irrelevant features, ranging from 100 to 10000, by our algorithm (left 4 columns), and by three competing algorithms (RELIEF, I-RELIEF, SIMBA; with 5000 irrelevant features). The x- and y-axes represent the number of features and the values of the feature weights, respectively. Only the first two features are useful.


Table 1. Summary of UCI data sets. The number of irrelevant features artificially added to the original ones is indicated in parentheses.

Data set       Training  Test  Feature
twonorm        400       7000  20 (5000)
flare-solar    666       400   9 (5000)
diabetes       468       300   8 (5000)
splice         1000      2175  60 (5000)
waveform       400       4600  21 (5000)
thyroid        140       75    5 (5000)
heart          170       100   13 (5000)
breast-cancer  200       77    9 (5000)

Table 2. Classification accuracy and standard deviation of SVM performed on UCI data sets using all features, original features, and features selected by our algorithms. The false discovery rate (FDR) and CPU time (measured on a Core2 2.0 GHz PC) of each algorithm are also reported.

Data set       Original feat.  All feat.   Batch learning                   Online learning
               Acc(%)          Acc(%)      Acc(%)      FDR(‰)  Time(s)      Acc(%)      FDR(‰)  Time(s)
twonorm        97.4 ± 0.2      50.0 ± 0.0  97.4 ± 0.2  0       162          94.5 ± 1.2  0.4     49
waveform       89.9 ± 0.6      67.1 ± 0.2  88.3 ± 0.9  1.6     157          85.1 ± 2.5  2.7     39
diabetes       77.3 ± 1.7      65.7 ± 1.9  76.3 ± 1.1  1.8     267          74.1 ± 2.9  0.8     48
breast-cancer  63.7 ± 6.1      69.3 ± 4.5  69.5 ± 4.6  2.2     68           70.3 ± 5.6  3.1     18
thyroid        95.3 ± 2.1      69.9 ± 3.6  94.7 ± 2.4  0.1     13           91.5 ± 3.0  0.1     15
heart          80.8 ± 3.5      54.4 ± 3.4  82.8 ± 4.2  0.4     73           80.2 ± 4.3  2.6     18
splice         88.1 ± 0.8      51.5 ± 1.7  90.5 ± 1.2  1.1     2639         88.3 ± 3.0  0.3     167
flare-solar    65.3 ± 1.7      55.9 ± 1.5  62.0 ± 2.8  0.6     948          63.7 ± 5.1  0.8     76

It should be noted that some features in the original feature sets may be irrelevant or weakly relevant, and hence may receive zero weights in our algorithm. Unlike the spiral problem, however, the relevance of the original features is unknown. To verify that our algorithm indeed identifies all relevant features, we set a high standard by comparing the classification performance of SVM (with the RBF kernel) in two cases: (1) when only the original features are used (i.e., without the 5000 useless features), and (2) when the features selected by our algorithm are used. It is well known that SVM is very robust against noise, and that the presence of a few irrelevant features in the original feature sets should not significantly affect its performance. Hence, the classification performance of SVM in the first case should be very close to that of SVM performed on the optimal feature subset, which is unknown to us a priori. Essentially, we are comparing our algorithm with an optimal feature selection algorithm. If SVM performs similarly in both cases, we may conclude that our algorithm achieves close-to-optimum solutions. The structural parameters of SVM are estimated through ten-fold cross-validation on the training data. For both the online and batch learning algorithms, the kernel width δ used in Eqs. (2) and (3) is set to 2. Though not considered in this paper, the kernel width could also be treated as a hyper-parameter and estimated similarly using the proposed algorithm. For the batch learning algorithm, the regularization parameter λ is set to 1. For the online learning algorithm, the regularization parameter is estimated by using the Bayesian parameter estimation algorithm proposed in Sec. 4. To reduce statistical variation, both the batch and online learning algorithms are run 10 times for each dataset; in each run, the dataset is randomly partitioned into training and test sets. After a feature weight vector is learned, only the features with weights larger than 10⁻⁴ are used for classification. The averaged classification accuracies and standard deviations of SVM are reported in Table 2.

Fig. 2. (a-b) Classification performance of the online learning algorithm using different but fixed regularization parameters, and with the parameter tuned by Bayesian estimation, on the twonorm and waveform data sets; (c) the convergence paths of λ in one sample run. (Axes: (a), (b) classification error versus regularization parameter; (c) regularization parameter versus training step.)

The false discovery rate (FDR), defined as the ratio of the number of artificially added irrelevant features identified by our algorithms as useful to the total number of irrelevant features (i.e., 5000), is also reported in Table 2. For reference, the classification performance of SVM using all features (i.e., the original features plus the useless ones) is reported as well. From these experimental results, we make the following observations:

(1) SVM using all features performs poorly, while SVM using the features identified by our batch learning algorithm performs similarly to, or even slightly better than (e.g., on breast-cancer and splice), SVM using the original features. This suggests that our batch learning algorithm can achieve a close-to-optimum solution in the presence of a huge number of irrelevant features. This result is consistent with that reported in Fig. 1.

(2) Online learning performs slightly worse than batch learning, but at a much lower computational cost. For most data sets, it takes batch learning only a few minutes to process more than 5000 features. Note, however, that the CPU times of batch learning on splice and flare-solar are much larger than on the other data sets, because the computational complexity of batch learning is quadratic in the number of samples. We should emphasize that although batch learning performs slightly better than online learning, the two are intended for different scenarios.

(3) In addition to successfully identifying relevant features, both the batch and online learning algorithms perform remarkably well in removing irrelevant ones. From Table 2, we observe that the false discovery rates of both algorithms are very low. For splice, for example, fewer than 4 of the 5000 irrelevant features are identified by our algorithms as useful.

In the third experiment, we compare the performance of the online learning algorithm using different but fixed regularization parameters against that obtained with the parameter tuned by Bayesian estimation, on the twonorm and waveform data sets. The results are shown in Fig. 2(a-b). For both data sets, the classification errors are heavily influenced by the choice of λ. We also observe that, with the tuned regularization parameter, the performance of the algorithm is very close to the best that can be achieved with a fixed parameter. This result clearly demonstrates the effectiveness of our proposed parameter estimation method.


We also conduct an experiment to study the convergence behavior of the proposed parameter estimation algorithm by applying the online learning algorithm to waveform with different initial values of λ. The learning paths of the parameter are depicted in Fig. 2(c). We observe that the regularization parameter converges regardless of its initial value.

6 Conclusion

This paper addressed the issue of finding sparse solutions for large-scale feature selection problems, and developed a computationally efficient method applicable to both online and batch learning. The batch learning version exhibits near-optimal performance with affordable computational complexity, while the online learning version provides a means to balance speed and accuracy. We also proposed a Bayesian regularization method for online learning that performs very well with the proposed feature selection algorithm.

References

1. Lafferty, J., Wasserman, L.: Challenges in statistical machine learning. Stat. Sinica 16, 307–322 (2006)
2. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
3. Pudil, P., Novovicova, J.: Novel methods for subset selection with respect to problem knowledge. IEEE Intell. Sys. 13(2), 66–74 (1998)
4. Gilad-Bachrach, R., Navot, A., Tishby, N.: Margin based feature selection - theory and algorithms. In: Proc. 21st Int. Conf. Mach. Learn., pp. 43–50 (2004)
5. Sun, Y., Todorovic, S., Goodison, S.: A feature selection algorithm capable of handling extremely large data dimensionality. In: Proc. 8th SIAM Conf. Data Mining, pp. 530–540 (2008)
6. Collins, R.T., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1631–1643 (2005)
7. Jiang, W., Er, G., Dai, Q., Gu, J.: Similarity-based online feature selection in content-based image retrieval. IEEE Trans. Image Proc. 15(3), 702–712 (2006)
8. Kira, K., Rendell, L.: A practical approach to feature selection. In: Proc. 9th Int. Conf. Mach. Learn., pp. 249–256 (1992)
9. Vapnik, V.: Statistical Learning Theory. Wiley, Chichester (1998)
10. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
11. Sun, Y., Goodison, S., Li, J., Liu, L., Farmerie, W.: Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics 23(1), 30–37 (2007)
12. Ng, A.Y.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proc. 21st Int. Conf. Mach. Learn., pp. 78–86 (2004)
13. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
14. Kress, R.: Numerical Analysis. Springer, Heidelberg (1998)
15. Spall, J.C.: Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. John Wiley, Chichester (2003)
16. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
17. MacKay, D.J.: The evidence framework applied to classification networks. Neural Comp. 4(5), 720–736 (1992)
18. Cawley, G.C., Talbot, N.L.: The evidence framework applied to sparse kernel logistic regression. Neurocomputing 64, 119–135 (2005)
19. MacKay, D.J.: Bayesian interpolation. Neural Comp. 4(3), 415–447 (1992)
20. Jaakkola, T.S., Jordan, M.I.: Bayesian parameter estimation via variational methods. Stat. and Comp. 10, 25–37 (2000)
21. Sun, Y.: Iterative RELIEF for feature weighting: Algorithms, theories, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1035–1051 (2007)
