ONLINE FEATURE SELECTION AND CLASSIFICATION

Habil Kalkan, Bayram Çetisli
Department of Computer Engineering, Suleyman Demirel University, 32260, Isparta, Turkey

ABSTRACT
This paper presents an online feature selection and classification algorithm. The algorithm is applied to impact acoustic signals to sort hazelnut kernels. The classifier, which is used to determine the most discriminative features, is updated each time a new observation is processed. The algorithm starts by decomposing the signal along both the time and frequency axes in binary tree format. A feature set is obtained by extracting a feature from each node of the trees in the time-frequency (t-f) plane. The information gathered from a new observation is discarded after the model parameters and algorithm states have been updated. The binary trees are pruned along both the time and frequency axes according to the discrimination power of the nodes. This gives the most discriminative sub-bands in the t-f plane. The relevant features are selected from the nodes that remain after the pruning operation. A maximum likelihood classifier, under the assumption of a multivariate Gaussian distribution, is obtained from the relevant model parameters and used for online testing. The developed online learning algorithm gives better learning results than the online AdaBoost algorithm for sorting hazelnut kernels.

Index Terms: Feature extraction, local discriminant bases, online learning, classification, and acoustics.
1. INTRODUCTION
Feature selection reduces an m-dimensional data space to an n-dimensional feature space (n < m). Determining the most discriminative features is an important issue in classification problems. As the relevance of the features increases, the reduced feature set improves the learning capability of the classifier and its performance. There have been many studies on selecting the most relevant features for classification [1,2]. Boosting has been successfully used for feature selection in a wide variety of machine learning problems. The boosting mechanism constructs a strong classifier by combining weak classifiers. Adaptive Boosting, or AdaBoost [3], is an adaptive form of boosting that reweights the misclassified samples during training so that new classifiers can focus more on these samples.

Most learning algorithms operate in batch mode, which requires the entire training set for learning. However, in practical problems, observations arrive sequentially, and their characteristics may differ. A batch-trained classifier may not respond to changing features and their characteristics. In contrast, online learning algorithms process one observation at a time and remove it from the database after updating the predefined model parameters. Only a small amount of memory is required to keep the model parameters. Online learning algorithms are commonly used in recognition, detection and tracking applications [4, 5]. Many of the developed applications operate on a fixed, preselected feature set. However, Robert and Yanxi et al. [6] updated the selected discriminative features and used them for a tracking problem. In addition, Helmut and Horst [7] developed an online version of the AdaBoost algorithm for feature selection.

Recently, we developed an adaptive feature selection procedure based on the Local Discriminant Bases (LDB) algorithm [8] and used it to select the most discriminative features on the time-frequency axes. The selected feature set is fixed and is used to train the classifier in "batch" learning mode. In this study, we developed an online feature extraction algorithm based on the LDB method for one-dimensional signal classification and compared it to the online AdaBoost algorithm [7]. An impact acoustic signal dataset of cracked and un-cracked hazelnut kernels was used in this study. The algorithm starts by generating a candidate feature set in 2D binary tree format for each observation by decomposing the signal into time-frequency patterns. The feature vector of the new observation is used to update the model parameters (mean vector and covariance matrix) constructed from past observations. The most discriminative feature nodes in the feature trees are selected by the LDB algorithm using the updated model parameters. A maximum likelihood classifier is constructed from the relevant coefficients of the model parameters. In the test phase, the observations are classified using the most recently adapted classifier. As a novel approach, the developed algorithm combines feature extraction, selection and classification in an online manner.

The paper is organized as follows. Section 2 describes the online feature selection algorithm. The online classification scheme is given in Section 3. Experimental studies and conclusions are given in Section 4 and Section 5, respectively.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE
2. ONLINE FEATURE SELECTION ALGORITHM
The original LDB algorithm was previously adapted to feature extraction in two dimensions in batch mode [8]. In this study, we have developed an LDB-based online feature extraction and selection algorithm combined with online classification. The proposed algorithm is given in Algorithm 1. The original LDB algorithm was developed by Saito and Coifman [9] to obtain local information for classification. It expands either the time axis using local trigonometric bases or the frequency axis using wavelet packets, in binary tree format. The nodes in the tree are then pruned according to their discrimination power. However, it is stated in [10] that expansion along both the time and frequency axes is important for classification. Therefore, we expanded the impact acoustic signals along both axes to detect the most discriminative features for classification.

The authors would like to thank Dr. Yasemin Yardımcı for her support of this work.

ICASSP 2011
$$\mu_i^{t+1} = \frac{1}{t+1}\left(t\,\mu_i^{t} + x_i^{t+1}\right), \tag{3}$$

$$\Sigma_i^{t+1} = \frac{t}{t+1}\,\Sigma_i^{t} + \frac{t}{(t+1)^2}\,\tilde{x}_i\,\tilde{x}_i^{T}, \tag{4}$$

where $\tilde{x}_i = x_i^{t+1} - \mu_i^{t}$.

2.1 Time Segmentation with Local Cosine Packets
The Short Time Fourier Transform (STFT) is usually used to extract local information from signals. However, the STFT generates side-lobe artifacts because of the disjoint windows used to partition the time axis. Alternatively, local trigonometric bases, as well as other windows, can be used to partition the time axis with reduced side-lobe effects. The advantage of Local Cosine Packets (LCP) over the STFT is also emphasized in [10]. LCP partitions the time axis using smooth bells [11], which are constructed from cut-off functions. In this study, we used LCP to partition the time axis in binary tree format up to level J, as in Fig. 1.
Fig. 1. Time segmentation with LCP.

2.4 Dissimilarity Measure
Each time-frequency (t-f) node in the binary tree is represented by the energy of the samples it contains, and the dissimilarity is defined as the distance between the corresponding nodes of the two classes. The relevance of a node for classification depends on this distance. Various dissimilarity metrics can be used. However, we preferred the Fisher distance as the dissimilarity measure because it can be computed from only a few model parameters that can be evaluated online. The Fisher distance (D) of the kth t-f node is computed as

$$D_k = \frac{\left|\mu_{1k} - \mu_{2k}\right|}{\Sigma_1(k,k) + \Sigma_2(k,k)}, \tag{5}$$
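The per-node computation described in Sec. 2.4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Fisher distance takes the form |μ1k − μ2k| / (Σ1(k,k) + Σ2(k,k)), and the function name `fisher_distances` and the small `eps` guard against zero variance are hypothetical additions.

```python
def fisher_distances(mu1, sigma1, mu2, sigma2, eps=1e-12):
    """Fisher distance of every t-f node between two classes.

    mu1, mu2       : class mean vectors (lists of length m)
    sigma1, sigma2 : class covariance matrices (m x m lists of lists)
    eps            : assumed guard so that a zero variance sum does not divide by zero
    Only the diagonal (variance) entries of the covariance matrices are used.
    """
    m = len(mu1)
    return [
        abs(mu1[k] - mu2[k]) / (sigma1[k][k] + sigma2[k][k] + eps)
        for k in range(m)
    ]
```

Because only the class means and the diagonal of each covariance matrix enter the computation, the distances can be refreshed after every model update without revisiting past observations.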
2.2 Frequency Segmentation with Wavelet Transform
The signals in the time segments are also divided into sub-bands along the frequency axis by the full wavelet transform to obtain their frequency content. This segmentation helps to extract and analyze the local patterns in the signals. In general, the signals in a practical system are not time aligned. A shift in the input signal may result in a large difference at the output of the system. Shift invariance is usually introduced as a requirement for robust classification [12]. Therefore, the un-decimated wavelet transform, which has the shift-invariance property, is used to decompose the signal along the frequency axis up to level J. The decomposition levels may differ between the time and frequency axes. However, we used the same level for both axes in our implementation.

2.3 Model Update
The information content of the signals is localized at different resolution levels, as in Sec. 2.2. In total, $m = (2^J - 1)^2$ t-f nodes are obtained for a J-level segmentation along the time and frequency axes. An energy feature value, $x_k$, is evaluated for each of the m nodes of the signal by

$$x_k = \sum_{j=1}^{n} s_k(j), \qquad k = 1, 2, \ldots, m, \tag{1}$$

and the signal is represented by the feature vector

$$x = [x_1, x_2, \ldots, x_m], \tag{2}$$

where $s_k(j)$ and $n$ denote the jth sample and the number of samples in the kth t-f node, respectively. The model parameters $M_i^t(\mu_i^t, \Sigma_i^t)$ of each class are updated when a new observation $x_i^{t+1}$ is received, where $\mu_i^t$ and $\Sigma_i^t$ are the mean vector and covariance matrix of the features of the ith class after the tth observation has been processed. The index t denotes the number of observations received up to the present state. The parameters, which are also used in [13], are updated according to Eqs. (3) and (4).
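The recursive update of Eqs. (3) and (4) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `update_model` and the list-of-lists matrix representation are hypothetical, and the update is written as μ ← (tμ + x)/(t+1) and Σ ← t/(t+1)·Σ + t/(t+1)²·x̃x̃ᵀ with x̃ = x − μ, which reproduces the batch maximum likelihood (divide-by-N) estimates.

```python
def update_model(mu, sigma, t, x):
    """One online update of a class model (mean vector and covariance matrix).

    mu    : mean vector after t observations (list of length m)
    sigma : covariance matrix after t observations (m x m list of lists)
    t     : number of observations of this class processed so far
    x     : feature vector of the new observation
    Returns the updated (mu, sigma, t + 1); the observation itself can then be discarded.
    """
    m = len(x)
    # Deviation of the new observation from the *old* mean.
    d = [x[k] - mu[k] for k in range(m)]
    # Eq. (3): running mean.
    new_mu = [(t * mu[k] + x[k]) / (t + 1) for k in range(m)]
    # Eq. (4): running covariance.
    a = t / (t + 1)
    b = t / (t + 1) ** 2
    new_sigma = [
        [a * sigma[r][c] + b * d[r] * d[c] for c in range(m)]
        for r in range(m)
    ]
    return new_mu, new_sigma, t + 1


# Usage: start from the zero model and feed observations one at a time.
mu, sigma, t = [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], 0
for obs in ([1.0, 2.0], [3.0, 6.0], [5.0, 4.0]):
    mu, sigma, t = update_model(mu, sigma, t, obs)
```

Only `mu`, `sigma` and `t` need to be stored per class, which is what keeps the memory requirement independent of the number of observations.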
In Eq. (5), $\mu_{ik}$ and $\Sigma_i(k,k)$ are the mean and variance of the kth node of the ith class. Note that $\Sigma_i(k,k)$ is the kth diagonal element of the updated covariance matrix.

2.5 Feature Selection by Pruning the Binary Trees
Partitioning the signals along both the time and frequency axes provides all possible dyadic nodes. However, a large portion of the nodes may not include relevant information for classification. Therefore, the binary trees along both axes are pruned sequentially according to their discrimination power by the following algorithm.

Pruning algorithm
    IF (D_mother < max{D_child1, D_child2})
    THEN set max{D_child1, D_child2} as D_mother
    ELSE remove children.

The algorithm keeps the mother node if the mother has higher discrimination power than its children; otherwise it keeps the children. When the mother node is removed, the discrimination power of its most discriminative child is assigned to the mother for the comparison at the next higher level. The pruning algorithm finds the optimal partitioning of the time axis. The resulting time segments are further decomposed along the frequency axis and pruned separately by the same algorithm. The nodes that survive the two consecutive pruning operations give the locations of the best features for classification. The selected p features are also sorted according to their discrimination power, and the n most discriminative of them, which capture 2/3 of the total discrimination power of the p features, are used for online classification. The locations (j,k) of the selected features are assigned to the n×m matrix $f_{index}$ as

$$f_{index}(j,k)_{n \times m} = \begin{cases} 1, & k\text{th node is selected} \\ 0, & \text{otherwise} \end{cases}$$
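The pruning rule and the 2/3 selection step of Sec. 2.5 can be illustrated on a single binary tree. This is a sketch under assumptions, not the authors' code: the tree is assumed to be complete and stored in heap order (node i has children 2i+1 and 2i+2), the names `prune_tree`, `select_top_features` and `build_findex` are hypothetical, and reading row j of the index matrix as the jth selected feature is an interpretation. The two-stage pruning (time axis first, then frequency axis within each surviving time segment) would apply the same routine to each tree in turn.

```python
def prune_tree(D):
    """Prune one complete binary tree stored in heap order.

    D[i] is the discrimination power of node i; its children are
    nodes 2*i + 1 and 2*i + 2. Returns the indices of surviving nodes.
    """
    n = len(D)
    score = list(D)                 # working copy: D_mother may be overwritten
    kept = [[i] for i in range(n)]  # each node initially represents itself
    for i in range(n // 2 - 1, -1, -1):   # internal nodes, bottom-up
        c1, c2 = 2 * i + 1, 2 * i + 2
        best = max(score[c1], score[c2])
        if score[i] < best:
            score[i] = best                   # children win: pass their best score upward
            kept[i] = kept[c1] + kept[c2]
        else:
            kept[i] = [i]                     # mother wins: remove the children
    return sorted(kept[0])


def select_top_features(D, nodes, fraction=2 / 3):
    """Keep the strongest nodes until `fraction` of their total power is captured."""
    ranked = sorted(nodes, key=lambda k: D[k], reverse=True)
    total = sum(D[k] for k in ranked)
    selected, captured = [], 0.0
    for k in ranked:
        if captured >= fraction * total:
            break
        selected.append(k)
        captured += D[k]
    return selected


def build_findex(selected, m):
    """n x m binary matrix: row j has a 1 in the column of the jth selected node."""
    return [[1 if k == node else 0 for k in range(m)] for node in selected]
```

For example, with `D = [0.5, 0.9, 0.2, 0.3, 0.4, 0.6, 0.1]`, `prune_tree(D)` keeps nodes 1, 5 and 6, and `select_top_features(D, [1, 5, 6])` retains nodes 1 and 5, which together account for more than 2/3 of the surviving discrimination power.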
Algorithm 1: Online Feature Selection and Learning
0. Initialize the model parameters for each class i = 1, 2 and t = 0: $M_i^0(\mu_i^0, \Sigma_i^0) = (0, 0)$
Begin
For every observation signal S
1. Extract features from S
   - Decompose the signal into time-frequency nodes in binary tree format up to the Jth level.
   - Evaluate a feature vector x (Eq. 2)
2. Update the model parameters of the ith class, to which the new observation S belongs: $M_i^{t+1}(\mu_i^{t+1}, \Sigma_i^{t+1}) \leftarrow M_i^t(\mu_i^t, \Sigma_i^t),\ x_i^{t+1}$ (Eqs. 3 and 4)
3. Evaluate the discrimination power of each node by using the Fisher distance (Eq. 5)
4. Extract the p most discriminative features from the model parameters according to the Fisher distance of the nodes by pruning operations
5. Select the first n features that capture 2/3 of the total discrimination power of the p (n