Fast kernel feature ranking using class separability for big data mining

J Supercomput (2016) 72:3057–3072 DOI 10.1007/s11227-015-1481-1

Fast kernel feature ranking using class separability for big data mining Zhiliang Liu1,2

Published online: 14 July 2015 © Springer Science+Business Media New York 2015

Abstract Kernel feature ranking often delivers many benefits for big data mining, e.g., improving generalization performance. However, its efficiency is quite challenging due to the need to tune kernel parameters during the ranking process. In the present work, we propose a computationally light metric based on kernel class separability for kernel feature ranking. In the proposed metric, the kernel parameter is optimized by a proposed analytical algorithm rather than an optimization search algorithm. Experimental results demonstrate that (1) the proposed metric can lead to fast and robust kernel feature ranking; and (2) the proposed analytical algorithm can select the right kernel parameter with much less computation time for two state-of-the-art kernel metrics.

Keywords Feature ranking · Kernel method · Parameter selection · Kernel class separability · Big data

1 Introduction

With the fast development of sensor and information technologies, big data with large numbers of features are common in various artificial intelligence applications. Many features, e.g., skewness and kurtosis in fault diagnosis [1], can be extracted from field-collected data for a specific big data mining problem, while most of them are redundant or even irrelevant. If the feature dimension can be reduced by removing those

Zhiliang Liu
[email protected]

1 School of Mechatronics Engineering, University of Electronic Science and Technology of China, Chengdu 611731, People's Republic of China

2 The State Key Laboratory of Mechanical Transmissions, Chongqing University, Chongqing 400044, People's Republic of China


redundant and irrelevant features, many benefits follow for big data mining, such as better generalization performance, reduced model complexity and storage requirements, and a better interpretation of the data mining process [2–4]. Feature ranking is one approach to achieving feature selection and dimension reduction. However, feature ranking for big data mining becomes quite challenging because of issues such as linearly non-separable data and the need for quick response [2]. The kernel method is an attractive tool for dealing with the linearly non-separable challenge in feature ranking, as the data are more likely to be linearly separable in the high-dimensional space generated by a kernel function. A kernel function maps data from the original space into the high-dimensional kernel space without ever knowing the explicit mapping function. The kernel method has had great success in machine learning, particularly in the support vector machine (SVM), and has been rapidly and widely applied to data mining algorithms as well as feature ranking. In this paper, for convenience, we refer to the original space established by physical features as the feature space, to the virtual space mapped by a kernel function as the kernel space, and to feature ranking with kernel methods as kernel feature ranking. In kernel feature ranking, features are ranked by a kernel-based metric, and those features above a defined threshold are selected. A few metrics for kernel feature ranking have been reported recently. One of them is the so-called margin width, a classical and well-known concept in SVM. The margin width can be used to measure feature effectiveness in the kernel space, since its changes reflect the generalization performance of features: the smaller the changes, the less important the corresponding features. Guyon et al. [5] proposed an intuitive feature ranking method based on the margin width and recursive feature elimination.
Several researchers investigated this kernel metric further, including Gualdrón et al. [6], Chen et al. [7], and Qu et al. [8]. Since kernel feature ranking based on the margin width needs recursive training of SVMs, it usually consumes considerable computation time. Recently, researchers in the field of big data have paid more attention to the computational complexity of feature ranking [9]. To alleviate the computational burden, class separability has been used in kernel feature ranking for linearly separable and non-separable data. The idea behind class separability is that the features leading to larger class separability are more important. In this paper, we use the term kernel class separability to refer to class separability measured in the kernel space. Scatter matrix-based metrics can be adopted to estimate kernel class separability. Yeung et al. [10] defined kernel class separability based on the within-class and between-class scatter matrices for learning the kernel matrix. Wang [2] proposed a metric defined by class separability based on the between-class scatter matrix in the kernel space. Liu et al. [11] proposed a metric based on kernel class-pair separability for learning the kernel parameters of kernel Fisher discriminant analysis. Some metrics that differ from scatter matrix-based metrics have also been proposed recently. Li et al. [12] proposed an automatic method for selecting the parameter of the normalized kernel using kernel class separability. Liu et al. [13] used a genetic algorithm to find the best kernel class separability in the space established by the features and the kernel parameter. Liu et al. [14] evaluated the effectiveness of individual features by their contribution to kernel class separability. It is worth pointing out that kernel parameter selection is important and necessary for class separability-based kernel feature ranking. This is not only


because an optimal kernel parameter is good for recognizing robust features, but also because the number of kernel parameter selections is proportional to the number of feature evaluations and thus to the computation time of kernel feature ranking. Obviously, reducing the computation time of kernel parameter selection can improve the efficiency of kernel class separability-based feature ranking. The papers reviewed above have made efforts to improve the efficiency of kernel feature ranking using optimization search methods such as gradient-based methods [12], the genetic algorithm [11,13], and Newton's method [2,14]. However, optimization search algorithms are still time-consuming because many iterations are needed to find the optimal kernel parameter. Therefore, it remains quite challenging to improve the efficiency of kernel feature ranking for big data where the feature dimension is extremely large. In this paper, we propose a simple but effective metric for kernel feature ranking, together with an analytical method to determine the optimal kernel parameter for the proposed metric. Since the analytical method avoids the optimization search process, the computational load of kernel feature ranking is significantly reduced. Experimental results will demonstrate that (1) the proposed metric can largely reduce the computation time of kernel feature ranking; and (2) the proposed kernel parameter selection method can select the right kernel parameter for the two kernel feature ranking methods reported in [2,14]. The rest of this paper is organized as follows. Section 2 introduces the proposed metric for both binary and multi-class classification. Section 3 introduces the implementation of the proposed kernel feature ranking. Numerical validations are conducted in Sect. 4, and conclusions are drawn in Sect. 5.

2 Methodology

2.1 Kernel class separability for binary classification

Class separability is a classical concept in pattern recognition, and several measures have been proposed to estimate it, such as divergence, the Bhattacharyya distance, and scatter matrices [15,16]. In the feature space, mean distance is a common statistic of a dataset, and it is also often used to estimate class separability, e.g., in the well-known Fisher discriminant analysis. Let (x, y) denote an n-dimensional training pair, where a sample x belongs to the input space R^n and a class label y belongs to {±1}. Let N_1 and N_2 denote the numbers of training samples from the classes +1 and −1, respectively, and let N = N_1 + N_2 be the total number of training samples. In this paper, within-class separability and between-class separability are defined as follows:

$$W = \frac{1}{2}\left[\frac{1}{N_1^2}\sum_{t=1}^{N_1}\sum_{k=1}^{N_1}\left\|x_t^{(1)}-x_k^{(1)}\right\|^2 + \frac{1}{N_2^2}\sum_{t=1}^{N_2}\sum_{k=1}^{N_2}\left\|x_t^{(2)}-x_k^{(2)}\right\|^2\right], \tag{1}$$

$$B = \frac{1}{N_1 N_2}\sum_{t=1}^{N_1}\sum_{k=1}^{N_2}\left\|x_t^{(1)}-x_k^{(2)}\right\|^2, \tag{2}$$
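As a concrete illustration, the within-class and between-class separability of Eqs. (1) and (2) can be computed with a few lines of NumPy. This is our own sketch of the definitions above; the function and variable names are ours, not the paper's:

```python
import numpy as np

def within_between(X1, X2):
    """W and B per Eqs. (1)-(2) for a binary problem.

    X1, X2: arrays of shape (N1, n) and (N2, n) holding the samples of
    classes +1 and -1. The double sums include t = k pairs (distance 0),
    exactly as the equations are written.
    """
    def mean_pairwise_sq(A, C):
        # mean of ||a_t - c_k||^2 over all pairs (t, k)
        d = A[:, None, :] - C[None, :, :]
        return np.mean(np.sum(d ** 2, axis=2))

    W = 0.5 * (mean_pairwise_sq(X1, X1) + mean_pairwise_sq(X2, X2))
    B = mean_pairwise_sq(X1, X2)
    return W, B
```

On a dataset that is "separable" in the paper's sense, this returns W < B.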


where x_j^{(i)} is the jth sample from the ith class, and ||x|| is the Euclidean norm of x. We then define "separable" for a dataset in terms of W and B: we consider the case W < B to be separable and all other cases to be non-separable. In this paper, we consider only the separable case, i.e., W < B. We also exploit the relationship between the Euclidean distance in the original space and that in the kernel space. For the Gaussian radial basis function, the relationship can be derived as

$$\left\|\Phi(x_i) - \Phi(x_j)\right\|^2 = 2 - 2\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \tag{3}$$

where Φ(x) denotes the sample in the kernel space corresponding to x, and σ is the only kernel parameter of the Gaussian radial basis function. In light of the relationship in Eq. (3), the within-class separability in Eq. (1) and the between-class separability in Eq. (2) can be transformed into two measures in the kernel space. To avoid confusion, we use the terms kernel within-class separability and kernel between-class separability for the two corresponding measures. They are defined as follows:

$$W' = 2 - 2\exp\left(-\frac{W}{2\sigma^2}\right), \tag{4}$$

$$B' = 2 - 2\exp\left(-\frac{B}{2\sigma^2}\right). \tag{5}$$

Finally, we combine kernel within-class separability and kernel between-class separability into a single aggregate objective function, which describes the whole-class separability of a training set. For a binary classification problem, the kernel class separability is defined by the following objective function:

$$J_{12}(\sigma) = \omega^T\begin{bmatrix}-W'\\ B'\end{bmatrix} = \omega_w\left(2\exp\left(-\frac{W}{2\sigma^2}\right) - 2\right) + \omega_b\left(2 - 2\exp\left(-\frac{B}{2\sigma^2}\right)\right), \tag{6}$$

where ω = [ω_w, ω_b]^T with ω_w + ω_b = 1; ω is a two-dimensional vector weighting kernel within-class separability and kernel between-class separability. The objective function in Eq. (6) is a linear combination of the two measures; by this definition, large class separability means small within-class separability and large between-class separability. The weighting vector needs to be set beforehand: the larger ω_w (or ω_b), the larger the weight of kernel within-class separability (or kernel between-class separability). Other measures can be defined using different linear or nonlinear combinations of Eqs. (4) and (5).
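Equations (4)–(6) reduce to a one-line computation once W and B are known. The following is a minimal sketch (our own code, not the author's implementation) mapping the feature-space quantities into the kernel space and combining them:

```python
import numpy as np

def kernel_class_separability(W, B, sigma, w_w=0.5, w_b=0.5):
    """J(sigma) of Eq. (6), computed from the feature-space W and B of
    Eqs. (1)-(2). The weights must satisfy w_w + w_b = 1."""
    W_k = 2.0 - 2.0 * np.exp(-W / (2.0 * sigma ** 2))  # Eq. (4): W'
    B_k = 2.0 - 2.0 * np.exp(-B / (2.0 * sigma ** 2))  # Eq. (5): B'
    return w_w * (-W_k) + w_b * B_k                    # Eq. (6)
```

As expected from the definition, J grows when the between-class distance B grows while W is fixed.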


In this section, we have defined kernel class separability for binary classification, where the class labels take only two values: +1 and −1. However, multi-class classification cases also arise in the real world. For example, a gearbox may have several fault modes (classes), such as pitting, crack, tooth-missing, etc. Therefore, kernel class separability for multi-class classification has to be discussed. In the following, we introduce three strategies that extend the kernel class separability defined in Sect. 2.1 to multi-class classification.

2.2 Kernel class separability for multi-class classification

The first two strategies, i.e., one-against-one and one-against-all [17], are two commonly used methodologies for transforming a multi-class classification problem into a set of binary classification problems. The same ideas can be transplanted into the proposed metric of kernel class separability. The third strategy is a straightforward extension of the proposed metric. We introduce the three strategies in detail as follows.

2.2.1 Strategy 1: one-against-one (OAO)

This strategy constructs L(L−1)/2 binary classification models, for each of which kernel class separability can be computed by Eq. (6). The i–jth model is constructed with all samples of the ith class labeled "+1" and all samples of the jth class labeled "−1". Kernel class separability using the OAO strategy is therefore computed by

$$J(\sigma) = \frac{2}{L(L-1)}\sum_{i=1}^{L}\sum_{j=i+1}^{L} J_{ij}(\sigma^*), \tag{7}$$

where L is the number of classes and J_{ij}(σ) is the objective function based on the i–jth model.

2.2.2 Strategy 2: one-against-all (OAA)

This strategy constructs L binary classification models, and kernel class separability is computed by Eq. (6) for each model. The ith model is constructed with all samples of the ith class labeled "+1" and all remaining samples labeled "−1". Kernel class separability using the OAA strategy is therefore computed by

$$J(\sigma) = \frac{1}{L}\sum_{i=1}^{L} J_{i\cdot}(\sigma^*), \tag{8}$$

where J_{i·}(σ) is the objective function based on the ith model.


2.2.3 Strategy 3: direct extension (DEX)

With the first two strategies, the kernel class separability defined for binary classification can be computed for multi-class classification cases. In this section, we propose a definition that handles multi-class classification cases directly. Following this idea, kernel within-class separability and kernel between-class separability are defined as follows:

$$W = \frac{1}{L}\sum_{i=1}^{L}\frac{1}{N_i^2}\sum_{t=1}^{N_i}\sum_{k=1}^{N_i}\left\|x_t^{(i)}-x_k^{(i)}\right\|^2, \tag{9}$$

$$B = \frac{2}{L(L-1)}\sum_{i=1}^{L}\sum_{j=i+1}^{L}\frac{1}{N_i N_j}\sum_{t=1}^{N_i}\sum_{k=1}^{N_j}\left\|x_t^{(i)}-x_k^{(j)}\right\|^2. \tag{10}$$

Using the same integration approach as in Eq. (6), the kernel class separability for multi-class classification is defined by

$$J(\sigma) = \omega^T\begin{bmatrix}-W'\\ B'\end{bmatrix} = \omega_w\left(2\exp\left(-\frac{W}{2\sigma^2}\right) - 2\right) + \omega_b\left(2 - 2\exp\left(-\frac{B}{2\sigma^2}\right)\right). \tag{11}$$
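The DEX quantities of Eqs. (9) and (10) average within-class distances over classes and between-class distances over class pairs. A minimal sketch of this computation (our own code and naming, assuming at least two classes):

```python
import numpy as np

def dex_separability(classes):
    """W and B per Eqs. (9)-(10), the DEX strategy.

    `classes` is a list of (N_i, n) arrays, one per class.
    """
    L = len(classes)

    def mean_pairwise_sq(A, C):
        # mean of ||a_t - c_k||^2 over all sample pairs
        d = A[:, None, :] - C[None, :, :]
        return np.mean(np.sum(d ** 2, axis=2))

    # Eq. (9): average of within-class mean pairwise squared distances
    W = sum(mean_pairwise_sq(Xi, Xi) for Xi in classes) / L
    # Eq. (10): average of between-class mean pairwise squared distances
    B = sum(mean_pairwise_sq(classes[i], classes[j])
            for i in range(L) for j in range(i + 1, L))
    B *= 2.0 / (L * (L - 1))
    return W, B
```

These W and B feed directly into Eq. (11), so the whole multi-class metric needs no binary decomposition.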

2.2.4 Discussion of the three multi-class extension strategies

OAA is the earliest and most common strategy for multi-class extension. It involves L binary models, and the parameter σ needs to be optimized in each model; the time consumed by optimization is proportional to the number of binary models. Therefore, the OAA strategy usually consumes more time than the DEX strategy but less time than the OAO strategy. In addition, its performance may be compromised by unbalanced training sets [18], because the binary problems produced by the OAA strategy have unequal class sizes. The OAO strategy usually consumes the most time because it constructs the largest number of models among the three strategies. The DEX strategy is the most computationally efficient. However, unlike the OAA and OAO strategies, which can select an optimal value of σ for each binary model, its single parameter σ cannot be optimized per class in this way.

2.3 Automatic kernel parameter selection

In Sects. 2.1 and 2.2, we defined kernel class separability for the binary classification case and the multi-class classification case. In both cases, the kernel parameter σ has to be determined before kernel class separability can be applied to kernel feature ranking. A poor setting of σ can easily prevent the metric from distinguishing relevant features from irrelevant ones. In this section, we propose a fast and robust


method of kernel parameter selection to deal with this issue. The proposed method finds the optimal parameter by maximizing kernel class separability with respect to the kernel parameter σ. The optimization problem is defined as

$$\sigma^* = \arg\max_{\sigma\in(0,+\infty)} J(\sigma), \tag{12}$$

where J is the objective function of Eq. (6) for the binary classification case and of Eq. (11) for the multi-class classification case. The objective function J has continuous first-order and second-order derivatives with respect to σ. Hence, its maximizer can be derived as follows.

1. Calculate the first and second derivatives of J(σ):

$$\frac{dJ(\sigma)}{d\sigma} = \left[2\omega_w W \exp\left(-\frac{W}{2\sigma^2}\right) - 2\omega_b B \exp\left(-\frac{B}{2\sigma^2}\right)\right]\sigma^{-3}, \tag{13}$$

$$\frac{d^2 J(\sigma)}{d\sigma^2} = \left[2\omega_w W^2 \exp\left(-\frac{W}{2\sigma^2}\right) - 2\omega_b B^2 \exp\left(-\frac{B}{2\sigma^2}\right)\right]\sigma^{-6} - \left[6\omega_w W \exp\left(-\frac{W}{2\sigma^2}\right) - 6\omega_b B \exp\left(-\frac{B}{2\sigma^2}\right)\right]\sigma^{-4}. \tag{14}$$

2. Set the first derivative in Eq. (13) equal to zero to obtain the stationary point:

$$\frac{dJ(\sigma)}{d\sigma} = 0 \;\Leftrightarrow\; 2\omega_w W \exp\left(-\frac{W}{2\sigma^2}\right) = 2\omega_b B \exp\left(-\frac{B}{2\sigma^2}\right) \;\Leftrightarrow\; \sigma^* = \sqrt{\frac{B-W}{2\log\left(\omega_b B/(\omega_w W)\right)}}. \tag{15}$$

3. Substitute σ* from Eq. (15) into Eq. (14) and test whether the second derivative is less than zero:

$$\left.\frac{d^2 J(\sigma)}{d\sigma^2}\right|_{\sigma=\sigma^*} < 0 \;\Leftrightarrow\; \omega_w W \left(W - 3\sigma^{*2}\right)\frac{\omega_b B}{\omega_w W} < \omega_b B \left(B - 3\sigma^{*2}\right) \;\Leftrightarrow\; W < B. \tag{16}$$
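The closed form of Eq. (15) replaces any iterative search. The sketch below (our own code, assuming the separable case W < B and a log argument greater than 1 so σ* is real) computes σ* and checks numerically that it beats a dense grid search over σ:

```python
import numpy as np

def optimal_sigma(W, B, w_w=0.5, w_b=0.5):
    """Analytical maximizer of J(sigma), Eq. (15)."""
    return np.sqrt((B - W) / (2.0 * np.log((w_b * B) / (w_w * W))))

def J(sigma, W, B, w_w=0.5, w_b=0.5):
    """Objective of Eq. (6)/(11); accepts a scalar or an array of sigmas."""
    return (w_w * (2.0 * np.exp(-W / (2.0 * sigma ** 2)) - 2.0)
            + w_b * (2.0 - 2.0 * np.exp(-B / (2.0 * sigma ** 2))))

# Sanity check: the closed-form sigma* is at least as good as a grid search
W, B = 1.0, 4.0
s_star = optimal_sigma(W, B)
grid = np.linspace(0.1, 10.0, 1000)
assert J(s_star, W, B) >= J(grid, W, B).max() - 1e-9
```

The grid evaluation is only a verification device; in the proposed method σ* is obtained in constant time.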

As stated earlier, the datasets are assumed separable if W < B, so Eq. (16) holds. Thus, the stationary point obtained in Eq. (15) is the maximizer of the objective function and also the optimal σ in the sense of kernel class separability maximization. Figure 1 shows an example of the kernel class separability with respect to σ on the Parkinson dataset retrieved from the UCI Machine Learning Repository [19]; the optimal parameter σ (= 1.66) maximizes kernel class separability.

Fig. 1 Kernel class separability with respect to σ

With the optimal parameter, the kernel class separability proposed in the previous section can be computed efficiently. For a binary classification problem, or a multi-class classification problem with the DEX strategy, the proposed kernel parameter optimization algorithm needs to run only once. For a multi-class classification problem, the kernel parameter selection algorithm has to run L times for the OAA strategy and L(L−1)/2 times for the OAO strategy. Note that the proposed kernel parameter selection algorithm could also be applied to model selection of other kernel methods, e.g., other reported metrics based on kernel class separability and SVM [20]. Next, we introduce the implementation of kernel feature ranking.

3 Kernel feature ranking

Let α be an n-dimensional vector whose elements indicate whether each feature is selected: "0" denotes "unselected" and "1" denotes "selected". For example, α_i = 1 means the ith feature is selected. Feature selection aims to find n′ features out of the original n (n′ < n in general). To achieve this objective, feature selection is conducted using an objective function and a search strategy. Given an objective function C, feature selection can be expressed mathematically as the following optimization problem:

$$\alpha^* = \arg\max_{\alpha\in\{0,1\}^n} C(\alpha \otimes x), \tag{17}$$

where α ⊗ x denotes component-wise multiplication.

Feature ranking is a category of feature selection. It aims to estimate the intrinsic information content of a feature with respect to the label. In feature ranking, a feature set consisting of the top-ranked individual features is not necessarily the best feature subset. Considering this issue, it has been suggested to use the change in the objective after a feature is removed as the ranking metric [5]. Based on the kernel class separability introduced in Sect. 2, the proposed metric for kernel feature ranking is defined by

$$\delta_j = J(\sigma^*) - J_{-F_j}(\sigma^*), \tag{18}$$

where δ_j is the effectiveness score of the jth feature, 1 ≤ j ≤ n, and J_{−F_j}(σ) is the kernel class separability after removing feature F_j. The optimization problem in Eq. (17) is usually computationally intractable when n is large. Feature ranking is a suboptimal method that balances computation time against the performance of feature selection. Best Individual n′ (BIN) is the simplest and most intuitive way to achieve feature selection: the metric in Eq. (18) is computed for all individual features, and the top n′ features are selected according to their scores.

Let us take the Fisher iris dataset [19] as a simple example to illustrate the process of kernel feature ranking. In this dataset, four features were measured from each sample: F1 = sepal length, F2 = sepal width, F3 = petal length, and F4 = petal width. The effectiveness score of F1 is computed as follows. First, we find the optimal σ by the analytical method in Sect. 2.3 and compute the corresponding kernel class separability on the entire feature set {F1, F2, F3, F4}, which gives J = 0.6331. Second, we do the same on the feature set {F2, F3, F4} that excludes F1, which gives J_{−F1} = 0.6950. Finally, the score of F1 is obtained by subtracting J_{−F1} from J, which gives δ1 = −0.0619. The above procedure is repeated for the remaining three features, yielding δ2 = −0.0769, δ3 = 0.0754, and δ4 = 0.0814. Since δ4 > δ3 > δ1 > δ2, the features in decreasing order of importance are F4, F3, F1, and F2.
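Putting Sects. 2 and 3 together, the ranking procedure for a binary problem can be sketched as follows. This is our own NumPy code with equal weights ω = [0.5, 0.5], not the author's implementation:

```python
import numpy as np

def pairwise_means(X, y):
    """W and B of Eqs. (1)-(2) for binary labels y in {+1, -1}."""
    def mps(A, C):
        d = A[:, None, :] - C[None, :, :]
        return np.mean(np.sum(d ** 2, axis=2))
    X1, X2 = X[y == 1], X[y == -1]
    return 0.5 * (mps(X1, X1) + mps(X2, X2)), mps(X1, X2)

def separability(X, y, w_w=0.5, w_b=0.5):
    """J(sigma*) with the analytically optimal sigma, Eqs. (6) and (15)."""
    W, B = pairwise_means(X, y)
    sigma2 = (B - W) / (2.0 * np.log((w_b * B) / (w_w * W)))  # sigma*^2
    return (w_w * (2.0 * np.exp(-W / (2.0 * sigma2)) - 2.0)
            + w_b * (2.0 - 2.0 * np.exp(-B / (2.0 * sigma2))))

def rank_features(X, y):
    """Effectiveness scores of Eq. (18): drop in J when feature j is removed."""
    J_full = separability(X, y)
    scores = []
    for j in range(X.shape[1]):
        X_minus = np.delete(X, j, axis=1)
        scores.append(J_full - separability(X_minus, y))
    return np.array(scores)
```

On synthetic data with one class-separating feature and one pure-noise feature, the separating feature receives the higher (positive) score, mirroring the iris example above.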

4 Numerical validation

In this section, the proposed method is compared with two state-of-the-art methods [2,14], where quasi-Newton's method is used for kernel parameter selection. In our comparison, we also apply the proposed kernel parameter selection method of Sect. 2.3, with ω = [0.5; 0.5], to the two reported methods in [2,14]. In total, five methods are compared with each other; they are summarized in Table 1. We evaluate the performance of feature ranking from two aspects: classification accuracy and CPU time. Classification accuracy is the first important aspect of kernel feature selection. It is estimated by fivefold cross-validation with the SVM classifier implemented in LIBSVM [21]. CPU time is the other important aspect, as we also prefer a quick response. CPU time is collected with the tic and toc functions in MATLAB R2013a (64-bit) on a computer with an Intel Core i5-3337U processor (1.8 GHz), 8 GB RAM, and an operating system


Table 1 Five methods for kernel feature ranking

Method   | Metric                      | Kernel parameter selection
-------- | --------------------------- | ------------------------------
Method 1 | The proposed metric         | The proposed analytical method
Method 2 | The reported metric in [14] | The selection method in [14]
Method 3 | The reported metric in [14] | The proposed analytical method
Method 4 | The reported metric in [2]  | The selection method in [2]
Method 5 | The reported metric in [2]  | The proposed analytical method

of Windows 8 (64-bit). In the following, we evaluate all five methods on a simulated dataset and eight real-world datasets.

4.1 Results on the simulated dataset

The two moon dataset [22] is a well-known nonlinear binary classification problem (y ∈ {+1, −1}) and has been widely used as a benchmark for testing nonlinear methods. In our simulation, only two out of 52 features are statistically relevant to the target classes; they are numbered 1 and 2. Figure 2 shows a scatter plot, grouped by class, of the two relevant features. All the other features, sampled from the normal distribution N(0, 20), are noise features and are numbered 3 to 52. We generated 720 samples for the two moon dataset. Since we clearly know the characteristics of all 52 features, we next investigate whether they can be correctly recognized by the proposed metric.

Fig. 2 Scatter plot of the two moon dataset
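For reproducibility, a dataset of this flavor can be generated as below. This is our own sketch: the half-circle geometry and the reading of N(0, 20) as a variance of 20 are assumptions, since the paper does not give its exact generating recipe.

```python
import numpy as np

def two_moons_with_noise(n=720, n_noise=50, jitter=0.2, seed=0):
    """Two interleaved half-circles (the two relevant features) plus
    n_noise irrelevant Gaussian features; labels are +1 / -1."""
    rng = np.random.default_rng(seed)
    m = n // 2
    t = rng.uniform(0.0, np.pi, m)
    upper = np.column_stack([np.cos(t), np.sin(t)])              # class +1
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])  # class -1
    X_rel = np.vstack([upper, lower]) + rng.normal(0.0, jitter, (2 * m, 2))
    # irrelevant features: N(0, 20) read as variance 20 (std = sqrt(20))
    X_noise = rng.normal(0.0, np.sqrt(20.0), (2 * m, n_noise))
    X = np.hstack([X_rel, X_noise])
    y = np.array([1] * m + [-1] * m)
    return X, y
```

The resulting matrix has 720 rows and 52 columns, with the two relevant features in the first two columns.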



Fig. 3 Effectiveness score obtained by the proposed method

Since the two moon dataset is a binary classification problem, the multi-class extension strategies introduced in Sect. 2.2 are not needed in this experiment. Figure 3 shows the feature effectiveness obtained by the proposed metric. The two relevant features have much higher scores than the others, which means they are correctly recognized as relevant. This experiment shows that the proposed metric can correctly recognize the two nonlinear features in the two moon dataset.

4.2 Results on the real-world datasets

In this section, the five methods are compared with each other on eight real-world datasets. The datasets are selected so that the feature dimension covers a wide range, from dozens to thousands. Table 2 summarizes the profiles of the eight benchmark datasets. The pitting damage dataset [23] comes from an experiment conducted by the Reliability Research Lab at the University of Alberta. The remaining datasets are downloaded from the LIBSVM datasets [21]. We first use the colon cancer dataset as an example to illustrate the efficiency of the proposed method. The colon cancer dataset is a binary classification problem and includes 40 tumor and 22 normal colon tissue samples described by gene expression. As shown in Table 2, it has 62 samples with a dimensionality of 2000. Figure 4 shows the effectiveness scores of the 2000 individual features obtained by the proposed method. From Fig. 4, feature #493 and feature #377 are the top two ranked features among the 2000 features. A scatter plot by class of the top two ranked features is shown in Fig. 5, and the bottom two ranked features are plotted by class in Fig. 6. From the two figures, the top two features can roughly separate the tumor class from the normal class, while samples represented by the


Table 2 Summary of the benchmark datasets

No. | Dataset             | Number of features | Number of instances
--- | ------------------- | ------------------ | -------------------
1   | Sonar               | 60                 | 208
2   | Pitting damage      | 136                | 128
3   | Handwritten numeral | 649                | 400
4   | Mnist38             | 784                | 13,966
5   | Colon cancer        | 2000               | 62
6   | Gisette             | 5000               | 7000
7   | Duke breast cancer  | 7129               | 44
8   | Leukemia            | 7129               | 72

Fig. 4 Effectiveness score obtained by the proposed method

bottom two features overlap completely. Again, the proposed method is demonstrated to rank the features correctly. Figure 7 shows the CPU time of kernel feature ranking for the five methods on the colon cancer dataset. The five methods consume 131.57, 4.20 × 10³, 245.74, 5.51 × 10³, and 327.53 s, respectively. Obviously, the proposed method has a clear advantage in terms of CPU time. After feature ranking, five feature sequences are produced, one per method. Each sequence lists the features in decreasing order of importance; the first is the most important feature. Classification experiments are conducted as follows. Following the order of each sequence, individual features are added to the selected feature subset, starting from an empty set, until all features are included. Classification accuracy is computed after each feature is added. Figure 8 shows classification accuracy with respect to n′ (from



Fig. 5 Scatter plot with the top two ranked features obtained by the proposed method


Fig. 6 Scatter plot with the bottom two ranked features obtained by the proposed method

0 to 100) for the five methods. From Fig. 8, we observe that the accuracies of the five methods are very close. That is, the proposed method reduces the CPU time of kernel feature ranking without accuracy deterioration on the colon cancer dataset. Following the same procedure, experiments are conducted on the remaining benchmark datasets. Table 3 summarizes the CPU time of kernel feature ranking for the eight benchmark datasets. From Table 3, the CPU time of the five methods is proportional



Fig. 7 CPU time of kernel feature ranking for the five methods


Fig. 8 Classification accuracy with respect to n′ for the five methods

to the feature dimension of a dataset; that is, kernel feature ranking needs more CPU time for larger datasets, which means the proposed method shows greater efficiency gains on large datasets. In addition, the proposed kernel parameter selection reduces the CPU time of the two reported methods in [2,14]. In general, the proposed method achieves the lowest CPU time among the five methods, demonstrating its efficiency in reducing the CPU time of kernel feature ranking.


Table 3 CPU time (in s) of kernel feature ranking

Dataset             | Method 1  | Method 2   | Method 3  | Method 4   | Method 5
------------------- | --------- | ---------- | --------- | ---------- | ---------
Sonar               | 1.59      | 17.08      | 1.66      | 21.49      | 2.36
Pitting damage      | 2.42      | 6.89       | 2.71      | 14.36      | 3.46
Handwritten numeral | 14.67     | 195.88     | 24.49     | 198.87     | 29.24
Mnist38             | 14,960.34 | 161,081.12 | 24,202.70 | 202,675.53 | 25,145.77
Colon cancer        | 131.57    | 4204.71    | 245.74    | 5513.48    | 327.53
Gisette             | 2594.01   | 82,901.32  | 4845.03   | 108,705.41 | 6457.60
Duke breast cancer  | 811.70    | 21,793.55  | 1433.97   | 31,590.77  | 1961.67
Leukemia            | 985.59    | 27,788.61  | 1774.05   | 38,966.81  | 2322.65

Bold text in the original table indicates the lowest CPU time in each row

4.3 Benefits of using the proposed metric

In general, the proposed metric is a good approximation of classification accuracy, and it provides the following benefits.

1. Robustness. Linearly non-separable cases are widespread in real-world applications. In fault diagnosis, for instance, the relationship between fault modes and statistical features is often nonlinear. The proposed metric takes advantage of the kernel method to handle not only linearly separable cases but also linearly non-separable cases.

2. Computational efficiency. With the reported metrics (e.g., [2,14]), each evaluation requires solving an optimization search problem, which can considerably prolong the feature ranking process. In contrast, no optimization search is required in the proposed metric, so each evaluation has a much lower computational load. This significantly reduces the time cost (especially for high-dimensional datasets) and thus leads to faster feature ranking.

5 Conclusion

In this paper, a metric based on Gaussian RBF kernel class separability is proposed for kernel feature ranking. It is conceptually and computationally simple. We extend the metric to multi-class classification problems using three strategies: one-against-all, one-against-one, and direct extension. To achieve efficient kernel feature ranking, we propose an analytical algorithm that quickly tunes the kernel parameter of the Gaussian radial basis function. Experimental results demonstrated that the proposed metric has comparable performance in terms of classification accuracy, and the best performance in terms of CPU time, compared with the two reported methods in [2,14]. Therefore, the proposed method leads to fast kernel feature ranking.

Acknowledgments This work was supported by the Open Fund (Contract No. SKLMT-KFKT-201418) of the State Key Laboratory of Mechanical Transmissions, Chongqing University. The anonymous reviewers and editors are highly appreciated for their constructive comments and helpful suggestions.


References

1. Liu Z, Qu J, Zuo MJ, Xu H (2013) Fault level diagnosis for planetary gearboxes using hybrid kernel feature selection and kernel Fisher discriminant analysis. Int J Adv Manuf Technol 67(5–8):1217–1230
2. Wang L (2008) Feature selection with kernel class separability. IEEE Trans Pattern Anal Mach Intell 30(9):1534–1546
3. Liu Z, Zuo MJ, Xu H (2013) Fault diagnosis for planetary gearboxes using multi-criterion fusion feature selection framework. Proc Inst Mech Eng Part C J Mech Eng Sci 227(9):2064–2076
4. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
5. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422
6. Gualdrón O, Brezmes J, Llobet E, Amari A, Vilanova X, Bouchikhi B, Correig X (2007) Variable selection for support vector machine based multisensor systems. Sens Actuators B Chem 122(1):259–268
7. Chen X, Zeng X, Alphen DV (2006) Multi-class feature selection for texture classification. Pattern Recognit Lett 27(14):1685–1691
8. Qu J, Liu Z, Zuo MJ, Huang H (2011) Feature selection for damage degree classification of planetary gearboxes using support vector machine. Proc Inst Mech Eng Part C J Mech Eng Sci 225(9):2250–2264
9. Tan M, Tsang IW, Wang L (2014) Towards ultrahigh dimensional feature selection for big data. J Mach Learn Res 15:1371–1429
10. Yeung DY, Dai G (2007) Learning the kernel matrix by maximizing a KFD-based class separability criterion. Pattern Recognit 40(7):2021–2028
11. Liu J, Zhao F, Liu Y (2013) Learning kernel parameters for kernel Fisher discriminant analysis. Pattern Recognit Lett 34(9):1026–1031
12. Li CH, Ho HH, Liu YL, Lin CT, Kuo BC, Taur JS (2012) An automatic method for selecting the parameter of the normalized kernel function to support vector machines. J Inf Sci Eng 28(SI):1–15
13. Liu Z, Zuo MJ, Xu H (2011) A Gaussian radial basis function based feature selection algorithm. In: 2011 IEEE international conference on computational intelligence for measurement systems and applications (CIMSA), Ottawa, ON, Canada, pp 1–4
14. Liu Z, Zuo MJ, Xu H (2013) Feature ranking for support vector machine classification and its application to machinery fault diagnosis. Proc Inst Mech Eng Part C J Mech Eng Sci 227(9):2077–2089
15. Fukunaga K (1990) Introduction to statistical pattern recognition. Academic Press, London
16. Webb AR (2002) Statistical pattern recognition. Wiley, New York
17. Vapnik V (2000) The nature of statistical learning theory. Springer, Berlin
18. Eichelberger RK, Sheng VS (2013) Does one-against-all or one-against-one improve the performance of multiclass classifications. In: The 27th AAAI conference on artificial intelligence, Bellevue, Washington, pp 1609–1610
19. Lichman M (2013) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine, CA. http://archive.ics.uci.edu/ml
20. Liu Z, Zhao X, Zuo MJ, Xu H (2015) An analytical approach to fast parameter selection of Gaussian RBF kernel for support vector machine. J Inf Sci Eng 31(2):691–710
21. Chang C, Lin C (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:21–27
22. Liu Z, Zhao X, Zou J, Xu H (2013) A semi-supervised approach based on k-nearest neighbor. J Softw 8(4):768–775
23. Liu Z, Zhao X, Zuo MJ, Xu H (2014) Feature selection for fault level diagnosis of planetary gearboxes. Adv Data Anal Classif 8(4):377–401
