A Parametric Classification Rule Based on the Exponentially Embedded Family

Bo Tang, Student Member, IEEE, Haibo He, Senior Member, IEEE, Quan Ding, Member, IEEE, and Steven Kay, Fellow, IEEE
Abstract— In this paper, we extend the exponentially embedded family (EEF), a new approach to model order estimation and probability density function construction originally proposed by Kay in 2005, to multivariate pattern recognition. Specifically, a parametric classifier rule based on the EEF is developed, in which we construct a distribution for each class based on a reference distribution. The proposed method can address different types of classification problems in either a data-driven manner or a model-driven manner. In this paper, we demonstrate its effectiveness with examples of synthetic data classification and real-life data classification in a data-driven manner, and with the example of power quality disturbance classification in a model-driven manner. To evaluate the classification performance of our approach, the Monte Carlo method is used in our experiments. The promising experimental results indicate many potential applications of the proposed method.

Index Terms— Exponentially embedded family (EEF), multivariate Gaussian classification, parametric classification rule.
I. INTRODUCTION
IN PATTERN recognition, there are two kinds of classifiers that have been widely used in scientific and engineering fields: 1) parametric classifiers, in which the underlying joint distributions/models of the data are known except for the source parameters of their probability density function (pdf), which need to be estimated, such as the maximum a posteriori probability (MAP) classifier; and 2) nonparametric classifiers, in which the classification rules do not explicitly depend on the underlying data distributions, such as the nearest neighbors method, the neural network, and the support vector machine (SVM). The parametric classifier is also known as a model-driven classifier, while the nonparametric classifier is usually considered a data-driven classifier. In the past few decades, both of these classic classifiers have been greatly developed in different ways, but they can be related to each other in many aspects. First, given the data distribution
models for each class (class-wise distribution), the Bayesian classifier provides the optimal classification performance over all decision rules as the number of data samples approaches infinity. For example, the probability of error of the nearest neighbor rule is lower bounded by the Bayes probability of error and upper bounded by twice it, as proved in [2]. Second, nonparametric classifiers can be interpreted as parametric classifiers. For instance, the probabilistic neural network is built on the basis of probability density estimation, and the Gaussian process is designed with a Gaussian noise distribution, which can further be used to interpret the decision rule of a neural network. Because both parametric and nonparametric classifiers make prediction decisions based on the posterior probability, they can be further explained via probability density estimation, such as naive Bayesian classifiers, nearest neighbor-based classifiers, and probabilistic neural networks.

The exponential family of pdfs indexed with natural parameters unifies many existing distributions and holds many useful algebraic properties [3], and it can be used to model various mixed types (continuous or discrete) of measured object properties [4]. The exponentially embedded family (EEF) is a novel pdf construction method based on a reference distribution [1]. The constructed class-wise distributions have the form of the exponential family and inherit many of its useful properties. Kay et al. [5] demonstrate that the distribution constructed with the EEF is asymptotically closest to the true one in the Kullback-Leibler (KL) divergence sense. More importantly, because no assumption is made on the reference distribution used to construct the EEF, this methodology can be applied to many challenging real-world applications.

In this paper, we explore another important capability of the EEF: making a connection between the data space and the natural parameter space of exponential family distributions. Based on this property, we can build a decision rule for classification in either a data-driven manner or a model-driven manner. In the data-driven manner, we model the whole training data with a Gaussian mixture, which is in general non-Gaussian, and use it as the reference distribution. In contrast to the classic parametric or nonparametric classifiers using class-wise probability density estimation, our proposed classification method takes advantage of the distribution of the whole data from all classes. Because the construction of the class-wise distribution is based on the reference distribution, this information about the distribution of the whole data is the key factor in improving the classification performance of our classifier. In the model-driven manner, a prior-known model can be easily integrated into our classification framework.
This gives superior classification performance compared with those data-driven classification methods that have difficulty using the data model information, such as the neural network and the SVM.

To demonstrate the effectiveness of the proposed methodology, we test it on several classification problems, such as synthetic data classification and real-life data set classification in a data-driven manner, and power quality (PQ) disturbance classification in a model-driven manner. Although the underlying data models of these classification problems are quite different, our classifier performs better than other existing methods. Hence, the generality of our classifier based on the EEF is demonstrated, and the classifier can potentially be applied to many other classification problems.

The rest of this paper is organized as follows. In Section II, we formulate the theoretical framework using the EEF to connect the data space and the natural parameter space in a data-driven manner and briefly review the EEF and its properties for detection and classification. In Section III, we build our classifier rule with the constructed EEF. In Section IV, data-driven classification with synthetic and real-life data sets is presented, where we show how to build the target function for classification. In Section V, we apply this parametric classifier to the PQ disturbance classification problem. Finally, the conclusion is given in Section VI.

II. EXPONENTIALLY EMBEDDED FAMILY

A. Theoretical Framework

A data sample can be considered as an observation of a particular object of interest associated with a p-dimensional random variable x. In the formalism of Gaussian processes [6], [7] and probabilistic principal component analysis [8], it is assumed that training space points can be drawn from populations determined by the latent variable α, that is, the classwise distribution has the form p(x; α). Following the definition of the generalized exponential family, as described below, it is further assumed that the classwise distribution p(x; α) is an exponential family density in which α is the natural parameter. For a member of the exponential family, the distribution takes the following form:

p(x; α) = exp[α^T x − G(α)] p(x; 0)    (1)
where the function G(·) is the so-called cumulant-generating function, which is defined as

G(α) = ln ∫_X exp(α^T x) p(x; 0) dx    (2)
where X is the space of the data vector x. The cumulant-generating function G(α) normalizes p(x; α) to be a pdf. Under this assumption, we can map the data space X to the natural parameter space via the nonlinear transformation given by the cumulant-generating function G(α). To embed the data space of all classes into one natural parameter space, the EEF constructs the classwise distribution using a reference distribution. Assuming the reference distribution is denoted by p_0(x), the constructed data distribution p(x; η) with natural parameters η is written as

p(x; η) = exp[η^T x − K_0(η) + ln p_0(x)]    (3)
where K_0(η) is also a cumulant-generating function, defined as

K_0(η) = ln E_0[exp(η^T x)] = ln ∫_X exp(η^T x) p_0(x) dx    (4)
which is the log-expectation of exp(η^T x) with respect to the reference distribution. To avoid confusion between the natural parameters of the exponential family and those of the EEF, we will use η for the natural parameters of the EEF in this paper. The natural parameters η, also named latent variables, determine the classwise distribution.

B. Introduction to EEF

When the pdf of the measurements T = t(x) under a reference hypothesis, p_T(t; H_0), is known, the EEF aims to construct a pdf p̂_T(t; η) that unifies the unknown pdfs under the reference hypothesis H_0 and the M candidate hypotheses H_i, i = 1, ..., M. The pdf p̂_T(t; η) is indexed with the natural parameter vector η. The measurements T = t(x) are the sufficient statistics of x. As stated in Section II-A, [5] demonstrates that the pdf p̂_T(t; η_i) is asymptotically closest to the true pdf p_T(t; H_i) under the M hypotheses in the KL divergence sense. The KL divergence D(p_T(t; H_i) || p̂_T(t; H_i)) has been proved to determine the asymptotic performance for detection by Stein's lemma [9] and for classification [10]. The KL divergence [11] D(p_T(t; H_i) || p̂_T(t; H_i)) is written as

D(p_T(t; H_i) || p̂_T(t; H_i)) = ∫ p_T(t; H_i) ln [p_T(t; H_i) / p̂_T(t; H_i)] dt.    (5)

It is always nonnegative and equals zero if and only if p_T(t; H_i) = p̂_T(t; H_i) for all t. Thus, under the small-signal assumption, the pdf p̂_T(t; H_i) that approximates the true unknown pdf p_T(t; H_i) is found by minimizing the KL divergence in (5) under a moment-matching constraint [5]. The Kullback theorem [11] gives the solution to this problem, where p̂_T(t; H_i) can be written as

p̂_T(t; η) = exp[⟨η, t⟩ − K_0(η) + ln p_T(t; H_0)]    (6)
where η is a p × 1 vector for multivariate classification, η = [η_1, η_2, ..., η_p]^T, p denotes the dimension of the sample space, and ⟨η, t⟩ is the inner product of η and t, i.e., ⟨η, t⟩ = η^T t. K_0(η) is the log-normalizer that makes the pdf integrate to one, written as

K_0(η) = ln E_0[exp(⟨η, t⟩)] = ln ∫ exp(⟨η, t⟩) p_T(t; H_0) dt.    (7)
Note that the pdf p̂_T(t; η) is indexed by the natural parameter vector η in (6). The natural parameter vector η differs from the source parameter vector θ in the original pdf p_T(t) = p_T(t; θ). Equation (6) indicates that the reduced pdf no longer relates to the source parameters θ but to the natural parameters η. In addition, η is identifiable for p̂_T(t; η), since η_1 ≠ η_2 ⇒ p̂_T(t; η_1) ≠ p̂_T(t; η_2). Choosing η = 0, the pdf reduces to the pdf under the reference hypothesis H_0. Thus, all the pdfs under the reference hypothesis and the M candidate hypotheses can be embedded into the reduced pdf with the natural parameter vector.
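To make (6) and (7) concrete, the short sketch below is our own illustration (not part of the original formulation): it builds the embedded pdf for a scalar statistic with a standard normal reference pdf, evaluating K_0(η) by numerical integration. For this particular reference, K_0(η) = η²/2 and the embedded pdf is N(η, 1), which provides a simple sanity check.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def K0(eta, ref_pdf=norm(0.0, 1.0).pdf, lo=-10.0, hi=10.0):
    """Cumulant-generating function K_0(eta) = ln E_0[exp(eta*t)], cf. (7),
    evaluated by numerical integration against the reference pdf."""
    val, _ = quad(lambda t: np.exp(eta * t) * ref_pdf(t), lo, hi)
    return np.log(val)

def eef_pdf(t, eta, ref_pdf=norm(0.0, 1.0).pdf):
    """Embedded pdf p_hat(t; eta) = exp[eta*t - K_0(eta) + ln p_T(t; H0)], cf. (6)."""
    return np.exp(eta * t - K0(eta, ref_pdf) + np.log(ref_pdf(t)))

# Sanity check for a standard normal reference: K_0(eta) = eta^2/2 exactly,
# and the embedded pdf equals the N(eta, 1) density.
print(K0(0.7), 0.7 ** 2 / 2)                       # values should agree closely
print(eef_pdf(1.2, 0.7), norm(0.7, 1).pdf(1.2))    # values should agree closely
```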
C. Properties of EEF

The pdf in the EEF inherits many useful algebraic properties of statistical exponential families [12]. First, the moment-generating function of the EEF can be written as

m(s) = exp[K_0(η + s) − K_0(η)].    (8)
Second, the mean and covariance matrix relate to the natural parameter vector η as

E(T) = ∇K_0(η)    (9)

Var(T) = ∇²K_0(η)    (10)

where ∇²K_0(η) denotes the Hessian of K_0(η) and is a positive definite matrix, since K_0(η) is a strictly convex and differentiable function [13]. Hence, ⟨η, t⟩ − K_0(η) in (6) is concave and the maximum of ⟨η, t⟩ − K_0(η) always exists. The maximum is attained either at a boundary point of the natural parameter space or at the point obtained by setting the derivative to zero

∂K_0(η)/∂η = t.    (11)

III. EEF PDF FOR CLASSIFICATION

In this section, we present how to build our classification rule based on the EEF pdf when labeled data (training data) are available. For the pdf p̂_T(t; η) in (6), we define the log-likelihood function as

l(t; η) = ln p̂_T(t; η).    (12)
Given a set of training data, we can estimate the natural parameter vector η̂ using the maximum likelihood estimator (MLE) [14]

η̂ = arg max_η l(t; η).    (13)
The estimated value η̂_i for each hypothesis H_i, i = 1, 2, ..., M, can be obtained according to (11). Then, we can write the pdf of the measurements under H_i as

p̂_T(t; η̂_i) = exp[⟨η̂_i, t⟩ − K_0(η̂_i) + ln p_T(t; H_0)].    (14)
Since the reduced pdf is parameterized by the natural parameter vector η in the EEF, we consider p̂_T(t; η̂_i) to be the estimate of p_T(t; H_i) under H_i. Unlike the discriminative approach, we employ the generative approach to maximize the posterior probability. Given unknown testing data t_s to be classified, similar to the MAP decision rule, we assign the class H_i to t_s if p(H_i | t_s) ∝ p(t_s | H_i) p(H_i) is maximized over i. The prior probability of each class p(H_i) can be approximated by the fraction of training data belonging to that class, and p(t_s | H_i) can be replaced by its estimate p̂_T(t_s; η̂_i). Then, the target function of our classifier rule is built as follows:

EEF_i = ⟨η̂_i, t_s⟩ − K_0(η̂_i) + ln p(H_i).    (15)

Thus, in the training process of our new classifier rule, the natural parameters of the EEF pdf are first estimated using the MLE from the sets of training data under each hypothesis. Then, the pdf for each hypothesis is available with the estimated natural parameter vectors η̂. In the testing process, the unknown testing data are classified according to the target function built in (15), which is very similar to the MAP rule.
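Implementing the decision rule (15) is straightforward once the η̂_i, K_0(·), and the class priors are available. The sketch below is a minimal illustration of ours (function and variable names are ours, not the paper's); it scores a test statistic against every hypothesis and returns the index of the maximizing class.

```python
import numpy as np

def eef_classify(t_s, eta_hat, K0, priors):
    """Assign a class to the test statistic t_s using (15):
    EEF_i = <eta_hat_i, t_s> - K_0(eta_hat_i) + ln p(H_i).

    t_s      : (p,) test statistic
    eta_hat  : list of (p,) estimated natural parameter vectors, one per class
    K0       : callable, cumulant-generating function of the reference pdf
    priors   : (C,) class priors, e.g., training-set class fractions
    """
    scores = [eta_i @ t_s - K0(eta_i) + np.log(pi)
              for eta_i, pi in zip(eta_hat, priors)]
    return int(np.argmax(scores)), scores
```

As stated above, the priors p(H_i) can simply be taken as the class fractions in the training set.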
IV. DATA-DRIVEN CLASSIFICATION VIA EEF

A. Hypothesis

Given the sufficient statistics of the data observations, the distribution in (14) asymptotically approximates the true distribution of each class with respect to the KL divergence, and the derived decision rule in (15) can provide the same classification performance as the MAP rule, which usually provides an asymptotic upper bound on the classification performance of all classifiers. However, for the general formulas in (14) and (15), it is hard to obtain the sufficient statistics without knowing the classwise distributions. One option is to use the data observation itself as one of its sufficient statistics, i.e., T(x) = x, so that the reduced distribution can be rewritten as in (3). As mentioned above, the reference distribution p_0(x) for the reference hypothesis H_0 could be either known or unknown, and either Gaussian or non-Gaussian. Given the training data, we consider the more general case of a non-Gaussian distribution for the reference hypothesis H_0. We propose to use a Gaussian mixture model to represent the whole training data set. Assume the whole training data can be represented with a Gaussian mixture model with M components as follows:

p_0(x) = Σ_{m=1}^{M} π_m N(μ_m, C_m)    (16)
where π_m is the prior probability (mixing weight) of the m-th Gaussian component, and N(μ_m, C_m) represents the Gaussian distribution with mean vector μ_m and covariance matrix C_m. The expectation-maximization (EM) method is employed to estimate the parameters of the Gaussian mixture model, with the constraint that each Gaussian component fits the data of only one class under the reference hypothesis.

Unlike the classic classification method that uses the EM rule to estimate a Gaussian mixture model for each class, our classification rule constructs a new distribution p_T(t; η) with a reference distribution p_0(x). The natural parameters η can be used to determine the classwise distribution. In this data-driven manner, both the reference distribution p_0(x) and the classwise natural parameters are estimated from the training data. While there are several Gaussian models underlying the training data set, the latent variable η in the EEF can be considered as a variable indicating the probability that the observation x belongs to these models. Then, we use the training data from each class to estimate the classwise natural parameter and build the final classification rule. Because the proposed method makes use of the information of the whole training data distribution, we can expect the EEF pdf to perform better than Bayesian inference methods that do not use that information. To simplify our derivation, we also assume that the data variable is independent and
identically distributed (i.i.d.), which means all observations are conditionally independent.
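Before the classwise natural parameters can be estimated, the reference distribution p_0(x) of (16) has to be fitted to the pooled training data. A minimal sketch of this step is given below, assuming scikit-learn's GaussianMixture for the EM fit and implementing one possible reading of the per-class constraint mentioned above (each class's components are fitted separately and then pooled with weights scaled by the class fractions); the helper name is ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_reference_gmm(X_by_class, n_components_by_class):
    """Estimate the reference distribution p_0(x) of (16) as a pooled
    Gaussian mixture, fitting each class's components separately with EM."""
    weights, means, covs = [], [], []
    n_total = sum(len(X) for X in X_by_class)
    for X, M_i in zip(X_by_class, n_components_by_class):
        gmm = GaussianMixture(n_components=M_i, covariance_type='full').fit(X)
        class_frac = len(X) / n_total
        weights.extend(class_frac * gmm.weights_)   # pooled pi_m
        means.extend(gmm.means_)                    # mu_m
        covs.extend(gmm.covariances_)               # C_m
    return np.array(weights), np.array(means), np.array(covs)
```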
B. Distribution Construction

Given the training data, we assume that a Gaussian mixture model can represent the whole data distribution. Using the EM approach, the parameters (mixing weights π_m, mean vectors μ_m, and covariance matrices C_m) of the mixture components can be estimated. According to the definition of the cumulant-generating function in (4), we have

K_0(η) = ln E_0[exp(η^T x)] = ln Σ_{m=1}^{M} π_m ∫_X exp(η^T x) N(μ_m, C_m) dx.    (17)

For each Gaussian component in the above integral, we obtain the closed form

∫_X exp(η^T x) N(μ_m, C_m) dx = exp[η^T μ_m + (1/2) η^T C_m η]    (18)

where the detailed derivation is given in the Appendix. K_0(η) can then be rewritten as

K_0(η) = ln Σ_{m=1}^{M} π_m exp[η^T μ_m + (1/2) η^T C_m η].    (19)
Thus, the new distribution with the EEF can be constructed as

p(x; η) = exp[η^T x − K_0(η) + ln p_0(x)]
        = [exp(η^T x) Σ_{m=1}^{M} π_m N(μ_m, C_m)] / [Σ_{m=1}^{M} π_m exp(η^T μ_m + (1/2) η^T C_m η)].    (20)

Because the latent variable η determines the class-wise distribution, in the next step we estimate the natural parameter η with the MLE given the training data of each class. Given N training data samples of one class, due to the i.i.d. assumption, we have

p(x_1, x_2, ..., x_N; η) = Π_{i=1}^{N} p(x_i; η) = exp[η^T Σ_{i=1}^{N} x_i − N K_0(η) + ln Π_{i=1}^{N} p_0(x_i)].    (21)

Similar to (11), the natural parameters η can be estimated by

∂ ln p(x_1, x_2, ..., x_N; η)/∂η = 0 ⇒ ∂K_0(η)/∂η = (1/N) Σ_{i=1}^{N} x_i = x̄.    (22)

Given (19), we have

∂K_0(η)/∂η = [Σ_{m=1}^{M} π_m (μ_m + C_m η) exp(η^T μ_m + (1/2) η^T C_m η)] / [Σ_{m=1}^{M} π_m exp(η^T μ_m + (1/2) η^T C_m η)].    (23)

Combining (22) and (23), we further have

Σ_{m=1}^{M} π_m (μ_m + C_m η − x̄) exp[η^T μ_m + (1/2) η^T C_m η] = 0.    (24)

Even though no closed-form solution exists, a numerical solution can always be found using an iterative approach, because K_0(η) is convex. Either searching for the zero point in (24) or searching for the maximum point in (21) can be employed to find the parameter estimate η̂ for each class. For a C-class classification problem, η̂_1, η̂_2, ..., η̂_C can be estimated given the training data set. Finally, given testing data x_t to be classified, we decide the class for which the following is maximized over all classes:

EEF_i = η̂_i^T x_t − ln Σ_{m=1}^{M} π_m exp[η̂_i^T μ_m + (1/2) η̂_i^T C_m η̂_i] + ln p(H_i).    (25)
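A minimal sketch of this estimation and scoring step is given below (our own illustration using SciPy; all names are hypothetical). It evaluates K_0(η) from (19), obtains η̂ for one class by maximizing the concave objective η^T x̄ − K_0(η) implied by (21) and (22), and computes the target function (25) for a test point.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def K0(eta, weights, means, covs):
    """K_0(eta) for the Gaussian mixture reference, cf. (19)."""
    exps = [np.log(w) + eta @ mu + 0.5 * eta @ C @ eta
            for w, mu, C in zip(weights, means, covs)]
    return logsumexp(exps)

def estimate_eta(X_class, weights, means, covs):
    """MLE of the natural parameter for one class: maximize
    eta^T x_bar - K_0(eta), cf. (21), (22), and (24)."""
    x_bar = X_class.mean(axis=0)
    obj = lambda eta: -(eta @ x_bar - K0(eta, weights, means, covs))
    res = minimize(obj, np.zeros_like(x_bar), method='BFGS')
    return res.x

def eef_score(x_t, eta_i, prior_i, weights, means, covs):
    """Target function of (25) for a test point x_t."""
    return eta_i @ x_t - K0(eta_i, weights, means, covs) + np.log(prior_i)
```

Classification then simply picks the class i whose eef_score is largest, exactly as in (15).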
C. Experiments

1) Synthetic Data Classification: To demonstrate the efficiency of the proposed method, we compare its classification performance with the pseudo-MAP rule and the MAP rule for synthetic non-Gaussian (Gaussian mixture) data classification. From a Bayesian perspective, the MAP rule is the optimal classification method, in which the true parameters are assumed to be known. The testing data x_t is assigned the class label for which the following target function is maximized over i:

p(x_t; H_i) = Σ_{m=1}^{M_i} π_{i,m} exp[−(1/2)(x_t − μ_{i,m})^T Σ_{i,m}^{-1} (x_t − μ_{i,m})]    (26)

where M_i is the number of Gaussian components for class i. The pseudo-MAP method first estimates the source parameters π_{i,m} as π̂_{i,m}, μ_{i,m} as μ̂_{i,m}, and Σ_{i,m} as Σ̂_{i,m} with the EM algorithm from the training data of each Gaussian component m = 1, 2, ..., M_i in each class H_i, i = 1, 2, ..., C. Then, the testing data x_t is assigned the class label for which the following target function is maximized over i:

p(x_t; H_i) = Σ_{m=1}^{M_i} π̂_{i,m} exp[−(1/2)(x_t − μ̂_{i,m})^T Σ̂_{i,m}^{-1} (x_t − μ̂_{i,m})].    (27)

We choose (25)-(27) as the target functions for the MAP rule, the pseudo-MAP rule, and our classifier rule, respectively.

For the experiment, there are C = 3 classes and N_s = 1000 training data samples for each class. Let the parameters of the class-wise distributions be

For Class 1: π_{1,1} = 0.6, μ_{1,1} = [0, 0, 0]^T, Σ_{1,1} = σ²I; π_{1,2} = 0.4, μ_{1,2} = [1, 0, 0]^T, Σ_{1,2} = (σ² + 1)I
For Class 2: π_{2,1} = 0.6, μ_{2,1} = [0, 4, 0]^T, Σ_{2,1} = σ²I; π_{2,2} = 0.4, μ_{2,2} = [0, 4, 1]^T, Σ_{2,2} = (σ² + 1)I
For Class 3: π_{3,1} = 0.6, μ_{3,1} = [0, 0, 4]^T, Σ_{3,1} = σ²I; π_{3,2} = 0.4, μ_{3,2} = [1, 0, 4]^T, Σ_{3,2} = (σ² + 1)I.    (28)

Fig. 1. Classification accuracy comparison among EEF-based rule and pseudo-MAP rules.

The probabilities of correct classification (P_c) are shown in Fig. 1. The Monte Carlo method is used in this simulation, in which each result is averaged over 2000 experiments for every σ². Here, we assume that the number of components is known for each class, M_1 = M_2 = M_3 = 2, and use M = Σ_{i=1}^{C} M_i = 6 in the reference pdf of our EEF classifier rule. It is shown that the classification performance of our classifier rule is very close to that of the optimal MAP rule and better than that of the pseudo-MAP rule when the variance increases. Compared with the pseudo-MAP rule in (27), the EEF rule in (25) embeds the relationship of all classes through the reference distribution p_0(x) in our constructed pdf, improving the classification performance.

How to choose the number of components is an important but still unsolved issue in mixture models. If too many components are assigned to fit the data (overfitting), the recognition performance degrades. Several methods of assessing the number of components are reviewed in [15], such as the minimum information ratio criterion and Akaike's information criterion (AIC). Here, we carry out one experiment to show how classification performance is influenced by overfitting. We show the classification performance loss of both the pseudo-MAP and our EEF method with M_i = 2, 3, 4, compared with the optimal MAP method, in Fig. 2.

Fig. 2. Classification accuracy comparison among EEF-based rule and MAP-based rules.
The simulation results are averaged over 500 experiments for every σ². They show that the EEF method is less influenced by overfitting, while the classification performance of the pseudo-MAP method decreases with the severity of overfitting.

2) Real-Life Data Classification: To further demonstrate the generality of the proposed method, we test the classification performance on seven real-life data sets from the UCI repository [16] against the pseudo-MAP, nearest neighbor, and neural network methods. These data sets are summarized in Table I, which shows the number of classes, the number of input dimensions, and the total number of data samples used in our experiments.

TABLE I: DESCRIPTIONS OF THE DATA SETS

The Knowledge data set aims to predict students' knowledge level about the subject of electrical dc machines [17]. The Single Proton Emission Computed Tomography (SPECT) data set is a medical image data set in which each patient is classified into one of two categories: normal or abnormal [18], [19]. With the Indian Liver Patient data set, the aim is to predict whether a subject is a liver patient according to age, gender, total proteins, and so on [20]. The Wine data set attempts to determine the origin of wines using chemical analysis [21]. The Abalone data set is used to predict the age of abalone from physical measurements. Different ages correspond to 29 classes, but in our experiment, we treat it as a three-category classification problem by grouping classes 1-8 as category 1, classes 9 and 10 as category 2, and classes 11 and above as category 3. The Seeds data set
TABLE II: AVERAGE TESTING ACCURACY AND STANDARD DEVIATION OF THE PROPOSED METHOD, PSEUDO-MAP, NEAREST NEIGHBOR, AND NEURAL NETWORK, IN PERCENTAGE. THE BEST VALUE IS HIGHLIGHTED IN BOLD AND THE SIGNIFICANT IMPROVEMENT OF OUR PROPOSED METHOD IS UNDERLINED
TABLE III: CATEGORIES OF POWER QUALITY DISTURBANCE
aims to identify three different varieties of wheat, Kama, Rosa, and Canadian, with 70 samples for each class. Finally, the Vertebral data set is used to classify orthopedic patients into two classes (normal or abnormal) with six biomechanical features.

The testing classification accuracy of the proposed method in comparison with the pseudo-MAP rule, nearest neighbor, and neural network is presented in Table II, in which we highlight the best performance. The results, averaged over 100 runs, show that the proposed EEF outperforms the other three methods on these seven real-life data sets. For each run, we randomly select half of the data as training data and the rest as test data. For the proposed EEF classifier rule, we use the AIC rule [22] to assess the number of components M_i for each class and use M = Σ_{i=1}^{C} M_i in our method. For the neural network, we employ a typical multilayer perceptron (MLP) structure with the back propagation learning method. We use 10 neurons in the hidden layer and set the learning rate to 0.01 with 1000 iterations. The interested reader is referred to [23] for further information. In addition, a bias neuron is used in both the input layer and the hidden layer. We also examine whether two results are significantly different according to a paired t-test with a significance level of 0.01. For some of these data sets, such as Spect and Wine, the improvement as compared
with the pseudo-MAP is not significant, which can also be seen in our synthetic data simulation result shown in Fig. 1 for a small σ². In general, the improvement is still very promising in comparison with the other three classic classification methods.

V. MODEL-DRIVEN CLASSIFICATION VIA EEF: POWER QUALITY DISTURBANCE CLASSIFICATION

A. Hypothesis

In this section, we apply the EEF to the PQ disturbance pattern classification problem, which is a long-standing critical issue in the electrical power industry [24]. Recently, many methods have been proposed to improve the classification performance and thereby decrease the economic loss, such as the self-organizing learning array (SOLAR) system based on the wavelet transform [25], the inductive inference approach [26], and SVM classification. PQ disturbance classification aims to recognize which type of disturbance occurred in a power system so that appropriate measures can be taken. In this paper, we adopt the popular PQ disturbance models with seven different classes (C1-C7), as shown in Table III. They include the normal, swell, sag, harmonic, outage, sag with harmonic, and swell with harmonic models [25], [26]. Assuming that these power signals s_i, i = 1, 2, ..., 7, are affected by additive Gaussian white noise, which is widely
considered in PQ research [27]-[29], we write the observed power signal model as

x[n] = s[n] + w[n]    (29)
where w is assumed to be white Gaussian noise with pdf w ~ N(0, σ²). Thus, we can consider the following hypotheses for this PQ disturbance classification problem:

H_0: x = w
H_i: x = s_i + w,  i = 1, 2, ..., 7    (30)
where x, s_i, and w are all N × 1 vectors, and H_0 is considered as the reference hypothesis. Given one unknown disturbance, we aim to assign it to one of the hypotheses H_1, ..., H_7.
B. Reduced Distribution Construction

According to the PQ disturbance signal models shown in Table III, we can also rewrite these signal models as

s_i = H_i θ_i    (31)

where θ_i is a p_i × 1 unknown vector by which the signal s_i under hypothesis H_i is parameterized, and H_i is an N × p_i known matrix for the i-th signal model. N is the number of data samples for all the signal models, and p_i is the number of unknown parameters of the i-th signal model, since the numbers of unknown parameters under different signal models may not be equal. Thus, by substituting the signal model (31) into (30), we rewrite the signal models as

H_0: x = w
H_i: x = H_i θ_i + w,  i = 1, 2, ..., 7    (32)

which is a classification problem for the linear model. Note that because θ_i represents the i-th hypothesis, we can randomly choose t_1 and t_2 in the step function u(t_1) − u(t_2) for H_i, i = 2, 3, 5, 6, 7, in (32). For this case, we choose the statistic T(x) = x. Then, the pdf of x under the reference hypothesis H_0 is

x ~ N(0, σ²I) under H_0    (33)

and the log-normalizer K_0(η) is

K_0(η) = ln E_0[exp(η^T x)] = σ² η^T η / 2.    (34)

Thus, we construct the EEF pdf of x as

p(x; η) = exp[η^T x − K_0(η) + ln p_0(x)]
        = (1 / (2πσ²)^{N/2}) exp[−x^T x / (2σ²)] exp[η^T x − σ² η^T η / 2].    (35)

Note that the natural parameter vector η_i for each hypothesis H_i, i = 1, 2, ..., 7, can be estimated by maximizing η^T x − K_0(η) over η. There are also some implicit constraints in finding the MLE [5]. Since η_i is the only parameter that represents the signal under H_i, we have

η_i = H_i θ_i.    (36)

Under this constraint, the MLE of θ_i and the estimate of η_i can be obtained. We have

θ̂_i = (H_i^T H_i)^{-1} H_i^T x / σ²    (37)

and therefore

η̂_i = H_i (H_i^T H_i)^{-1} H_i^T x / σ².    (38)

Thus, the class-wise distribution can be rewritten as

p(x; η̂) = exp[η̂^T x − K_0(η̂) + ln p_0(x)].    (39)
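Equations (37) and (38) amount to a least-squares projection scaled by 1/σ². A small sketch of ours (the matrices H_i for the models of Table III are assumed to be given) is:

```python
import numpy as np

def estimate_eta_linear(x, H_i, sigma2):
    """theta_hat_i and eta_hat_i for the linear signal model, cf. (37)-(38)."""
    theta_hat = np.linalg.solve(H_i.T @ H_i, H_i.T @ x) / sigma2   # (37)
    eta_hat = H_i @ theta_hat                                      # (38), since eta_i = H_i theta_i
    return theta_hat, eta_hat
```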
C. Penalty Version of Classification Rule

Given one unknown PQ signal x to be classified, we decide H_i for which the following is maximized over i:

EEF_i = η̂_i^T x − K_0(η̂_i) + ln p(H_i)
      = η̂_i^T x − σ² η̂_i^T η̂_i / 2 + ln p(H_i)
      = x^T H_i (H_i^T H_i)^{-1} H_i^T x / (2σ²) + ln p(H_i)
      = Q_i / 2 + ln p(H_i)    (40)

where Q_i = x^T H_i (H_i^T H_i)^{-1} H_i^T x / σ². Note that the above constructed test statistic has the form of the GLRT for the linear model [30]. It can also be viewed as a generalized energy classifier, since the orthogonal projection matrix P_{H_i} = H_i (H_i^T H_i)^{-1} H_i^T projects the data vector onto the signal subspace [31]. We also notice that different classes have different numbers of unknown parameters. Because the value of Q_i monotonically increases with the number of unknown parameters, the rule in (40) is biased among classes with different numbers of unknown parameters, which degrades the classification performance. Instead of using (40) directly for PQ classification, we introduce a penalty version of the classification rule that is similar to the EEF method for model selection [1]

EEF_i = { Q_i/2 − (p_i/2)[ln(Q_i/p_i) + 1] + ln p(H_i),  if Q_i > p_i
        { 0,                                              otherwise.    (41)

We can also write this rule in the compact form

EEF_i = {Q_i/2 − (p_i/2)[ln(Q_i/p_i) + 1] + ln p(H_i)} u(Q_i/p_i − 1)    (42)

where u(x) is the unit step function. Since Q_i increases with the number of unknown parameters, the penalty term (p_i/2)[ln(Q_i/p_i) + 1] also increases with the number of unknown parameters.
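The penalized rule (41)-(42) needs only Q_i, p_i, σ², and the priors. The sketch below is our own illustration (the observation matrices H_i are assumed to be supplied); it computes Q_i as in (40) and returns the penalized score of (42) for one hypothesis.

```python
import numpy as np

def eef_pq_score(x, H_i, sigma2, prior_i):
    """Penalized EEF score, cf. (42), for one PQ hypothesis.

    x       : (N,) observed signal
    H_i     : (N, p_i) known observation matrix of the i-th signal model
    sigma2  : noise variance
    prior_i : prior probability p(H_i)
    """
    p_i = H_i.shape[1]
    proj = H_i @ np.linalg.solve(H_i.T @ H_i, H_i.T @ x)   # P_Hi x
    Q_i = x @ proj / sigma2                                # cf. (40)
    if Q_i <= p_i:                                         # u(Q_i/p_i - 1) = 0
        return 0.0
    return Q_i / 2 - (p_i / 2) * (np.log(Q_i / p_i) + 1) + np.log(prior_i)

# The decided class is the hypothesis with the largest score, e.g.:
# i_hat = int(np.argmax([eef_pq_score(x, H, sigma2, prior) for H, prior in models]))
```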
TABLE IV: CLASSIFICATION PERFORMANCE FOR PQ DISTURBANCE WITH OUR RULE COMPARED WITH THE SOLAR, INDUCTIVE INFERENCE, AND C-SVM METHODS
D. Simulations

To evaluate the classification performance of our rule, we compare our method to three other existing methods: SOLAR based on wavelet feature extraction [25], the inductive inference approach [26], and the SVM method. There are two kinds of SVM (C-SVM and ν-SVM) with many different kernel functions. Here, we choose C-SVM with a sigmoid kernel function, which has the best performance among these different SVM methods in PQ disturbance classification, as demonstrated in [25]. In this simulation of the PQ disturbance classification problem, let the angular frequency be ω = 100π rad/s, the sampling frequency F_s = 256 Hz, and the amplitude A = 110. We use 200 training PQ disturbance signals to construct the pdf in (35) and estimate the natural parameter vectors via (37) and (38), and 200 testing PQ disturbance signals to evaluate the classification performance according to the target function in (40) for each hypothesis. The overall accuracy for this seven-class PQ disturbance problem is given in Table IV. In Table IV, each 7 × 7 matrix shows the classification performance of one method, and the diagonal elements of each matrix denote the numbers of correctly classified PQ disturbance signals among the 200 PQ disturbance signals. It is easily seen that our rule achieves much better classification performance (98.29%) compared with 94.93% for SOLAR based on wavelet feature extraction, 90.4% for the inductive inference approach, and 94.86% for C-SVM with the sigmoid kernel function. Next, we analyze the classification performance of our classifier rule in different noise environments.
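Since Table IV reports per-method confusion matrices whose diagonals hold the correctly classified counts, the overall accuracy quoted above is simply the trace divided by the total count; a short helper of ours, for completeness:

```python
import numpy as np

def overall_accuracy(confusion):
    """Overall accuracy from a confusion matrix whose diagonal holds the
    correctly classified counts (e.g., the 7 x 7 matrices of Table IV)."""
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()
```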
Fig. 3. Comparison of two versions of the EEF classification rule.
Signal-to-noise ratio (SNR) is a common metric to compare the level of the signal with the level of the noise. The SNR is defined as

SNR = 10 log_10(P_s / P_n) dB    (43)
where P_s and P_n are the powers of the signal and the noise, respectively. SNR values ranging from 20 to 50 dB are used in our simulation. We first compare the two EEF classification rules in a model-driven manner: the original
one in (40) and the one with the penalty term in (42). The results are shown in Fig. 3, which demonstrates that the penalty version of the classification rule performs better than the one without. Second, for the seven PQ disturbance signals, we also evaluate the classification performance of our method versus the number of training data samples. Fig. 4 shows that our classifier rule has high overall classification accuracy even for low SNR and that the classification performance improves as the number of training data samples increases.

Fig. 4. Impacts of SNR and number of training data for overall classification performance.

VI. CONCLUSION

A parametric classifier based on the EEF is presented in this paper. Given a set of training data, we construct a new generalized pdf with respect to a reference distribution. The new distribution inherits many useful properties from the exponential family. It produces a decision rule for classification in either a data-driven manner or a model-driven manner. The proposed method could benefit many classification problems and provides an effective way to address the problem of model parameter learning using an embedded latent-variable exponential family pdf for classification. Several different classification examples, including synthetic data simulation and real-life data set classification in a data-driven manner and PQ disturbance classification in a model-driven manner, are used to demonstrate the generality and effectiveness of our classifier rule. We show that our classifier obtains competitive results on real-life data sets as compared with the pseudo-MAP rule, the nearest neighbor method, and the neural network approach. For the PQ disturbance classification problem, the simulation results show that our classifier rule improves the classification performance dramatically as compared with other existing methods, because the prior-known model information is used in a model-driven manner.

In future research, we will build classifiers based on the EEF for other application scenarios, such as imbalanced learning [32] and ordinal classification. Because the pdf in the EEF aims to minimize the KL divergence with respect to the true pdf, the natural parameters can reflect the ordinal relationship between classes.

APPENDIX
CUMULANT-GENERATING FUNCTION IN EEF

In this appendix, we present the detailed derivation of (19) for classification with the Gaussian mixture model via the EEF. Given (17),

K_0(η) = ln [Σ_{m=1}^{M} π_m ∫_X exp(η^T x) N(μ_m, C_m) dx].    (44)

For each integral component, we have

∫_X exp(η^T x) N(μ_m, C_m) dx
= (1 / ((2π)^{N/2} |C_m|^{1/2})) ∫_X exp[η^T x − (1/2)(x − μ_m)^T C_m^{-1} (x − μ_m)] dx
= (1 / ((2π)^{N/2} |C_m|^{1/2})) ∫_X exp[−(1/2)(x − μ'_m)^T C_m^{-1} (x − μ'_m)] exp[η^T μ_m + (1/2) η^T C_m η] dx
= exp[η^T μ_m + (1/2) η^T C_m η]    (45)

where μ'_m = C_m η + μ_m, and the property ∫ p(x) dx = 1 is used in the above derivation. Substituting (45) into (17), the final cumulant-generating function K_0(η) is obtained as

K_0(η) = ln Σ_{m=1}^{M} π_m exp[η^T μ_m + (1/2) η^T C_m η]    (46)

which is (19).

REFERENCES

[1] S. Kay, “Exponentially embedded families—New approaches to model order estimation,” IEEE Trans. Aerosp. Electron. Syst., vol. 41, no. 1, pp. 333–345, Jan. 2005.
[2] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[3] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, Oct. 2007.
[4] C. Levasseur, “Generalized statistical methods for mixed exponential families,” Ph.D. dissertation, Dept. Elect. Comput. Eng., Univ. California, San Diego, CA, USA, 2009.
[5] S. Kay, Q. Ding, and M. Rangaswamy, “Sensor integration for classification,” in Proc. Conf. Rec. 44th Asilomar Conf. Signals, Syst., Comput., Nov. 2010, pp. 1658–1661.
[6] G. Skolidis and G. Sanguinetti, “Bayesian multitask classification with Gaussian process priors,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2011–2021, Dec. 2011.
[7] M. Seeger, “PAC-Bayesian generalization error bounds for Gaussian process classification,” J. Mach. Learn. Res., vol. 3, pp. 233–269, Oct. 2002.
[8] C. M. Bishop, “Bayesian PCA,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 1999, pp. 382–388.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York, NY, USA: Wiley, 2006.
[10] M. B. Westover, “Asymptotic geometry of multiple hypothesis testing,” IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3327–3329, Jul. 2008.
[11] S. Kullback, Information Theory and Statistics, 2nd ed. New York, NY, USA: Dover, 1997.
[12] F. Nielsen and V. Garcia, “Statistical exponential families: A digest with flash cards,” CoRR, 2009. [Online]. Available: http://arxiv.org/abs/0911.4863
[13] L. D. Brown, Fundamentals of Statistical Exponential Families (Monograph Series). Inst. Math. Statistics, 1986.
[14] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[15] A. Oliveira-Brochado and F. V. Martins, “Assessing the number of components in mixture models: A review,” Faculdade de Economia, Univ. Porto, Porto, Portugal, Tech. Rep. 194, 2005.
[16] K. Bache and M. Lichman, UCI Machine Learning Repository, 2013. [Online]. Available: http://archive.ics.uci.edu/ml/
[17] H. T. Kahraman, S. Sagiroglu, and I. Colak, “The development of intuitive knowledge classifier and the modeling of domain dependent data,” Knowl.-Based Syst., vol. 37, pp. 283–295, Jan. 2013.
[18] K. J. Cios, D. K. Wedding, and N. Liu, “CLIP3: Cover learning using integer programming,” Kybernetes, vol. 26, no. 5, pp. 513–536, 1997.
[19] L. A. Kurgan, K. J. Cios, R. Tadeusiewicz, M. Ogiela, and L. S. Goodenday, “Knowledge discovery approach to automated cardiac SPECT diagnosis,” Artif. Intell. Med., vol. 23, no. 2, pp. 149–169, 2001.
[20] B. V. Ramana, M. S. P. Babu, and N. B. Venkateswarlu, “A critical study of selected classification algorithms for liver disease diagnosis,” Int. J. Database Manage. Syst., vol. 3, no. 2, pp. 101–114, 2011.
[21] S. Aeberhard, D. Coomans, and O. de Vel, “The classification performance of RDA,” Dept. Comput. Sci. Math. Statist., James Cook Univ. North Queensland, Townsville City, Australia, Tech. Rep. 92-01, 1992.
[22] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Autom. Control, vol. 19, no. 6, pp. 716–723, Dec. 1974.
[23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representation by error propagation,” in Parallel Distributed Processing, D. E. Rumelhart et al., Eds. Cambridge, MA, USA: MIT Press, 1986, pp. 318–362.
[24] M. Samotyj, “The cost of power disturbance to industrial and digital economy companies,” Consortium for Electrical Infrastructure to Support a Digital Society, an initiative by EPRI and the Electrical Innovation Institute, Jun. 2001.
[25] H. He and J. A. Starzyk, “A self-organizing learning array system for power quality classification based on wavelet transform,” IEEE Trans. Power Del., vol. 21, no. 1, pp. 286–295, Jan. 2006.
[26] T. Abdel-Galil, M. Kamel, A. M. Youssef, E. F. El-Saadany, and M. M. A. Salama, “Power quality disturbance classification using the inductive inference approach,” IEEE Trans. Power Del., vol. 19, no. 4, pp. 1812–1818, Oct. 2006.
[27] H. Zhang, P. Liu, and O. P. Malik, “Detection and classification of power quality disturbances in noisy conditions,” IEE Proc. Generat. Transmiss. Distrib., vol. 150, no. 5, pp. 567–572, Sep. 2003.
[28] H.-T. Yang and C.-C. Liao, “A de-noising scheme for enhancing wavelet-based power quality monitoring system,” IEEE Trans. Power Del., vol. 16, no. 3, pp. 353–360, Jul. 2001.
[29] P. O’Shea, “A high-resolution spectral analysis algorithm for power-system disturbance monitoring,” IEEE Trans. Power Syst., vol. 17, no. 3, pp. 676–680, Aug. 2002.
[30] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ, USA: Prentice-Hall, 1998.
[31] L. L. Scharf and B. Friedlander, “Matched subspace detectors,” IEEE Trans. Signal Process., vol. 42, no. 8, pp. 2146–2157, Aug. 1994.
[32] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
Bo Tang (S’14) received the B.S. degree from the Department of Information Physics, Central South University, Changsha, China, in 2007, and the M.S. degree from the Institute of Electronics, Chinese Academy of Science, Beijing, China, in 2010. He is currently pursuing the Ph.D. degree with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI, USA. His current research interests include statistical machine learning, computational intelligence, computer vision, and robotics.
Haibo He (SM’11) received the B.S. and M.S. degrees from the Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree from Ohio University, Athens, OH, USA, in 2006, all in electrical engineering. He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, USA, from 2006 to 2009. He is currently the Robert Haas Endowed Professor and an Associate Professor of Electrical Engineering with the University of Rhode Island, Kingston, RI, USA. He has authored one research book (Wiley), edited one research book (Wiley-IEEE) and three conference proceedings (Springer), and authored or co-authored over 150 peer-reviewed journal and conference papers. His current research interests include machine learning, cyber-physical systems, computational intelligence, hardware design for machine intelligence, and various applications, such as smart grid and renewable energy systems. Dr. He was a recipient of the IEEE International Conference on Communications Best Paper Award in 2014, the IEEE Computational Intelligence Society Outstanding Early Career Award in 2014, the National Science Foundation CAREER Award in 2011, and the Providence Business News Rising Star Innovator Award in 2011. His research has been covered by national and international media, such as the IEEE SMART GRID NEWSLETTER, The Wall Street Journal, and Providence Business News. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON SMART GRID.
Quan Ding (M’11) received the B.S. degree from Shanghai Jiao Tong University, Shanghai, China, in 2006, and the M.S. and Ph.D. degrees from the University of Rhode Island, Kingston, RI, USA, in 2008 and 2011, respectively, all in electrical engineering. He has been a Post-Doctoral Research Associate with the University of Rhode Island since 2011, and a Research Associate with the Harvard Medical School, Boston, MA, USA. He is currently a Post-Doctoral Scholar with the University of California at San Francisco, San Francisco, CA, USA. His current research interests include statistical signal processing, detection and estimation theory, machine learning, biomedical signal processing, and physiological/clinical data mining. Dr. Ding has served as an Associate Editor of the IEEE SIGNAL PROCESSING E-NEWSLETTER since 2014.
Steven Kay (F’89) was born in Newark, NJ, USA, in 1951. He received the B.E. degree from the Stevens Institute of Technology, Hoboken, NJ, USA, in 1972, the M.S. degree from Columbia University, New York, NY, USA, in 1973, and the Ph.D. degree from the Georgia Institute of Technology, Atlanta, GA, USA, in 1980, all in electrical engineering. He was with Bell Laboratories, Holmdel, NJ, USA, from 1972 to 1975, where he was involved in transmission planning for speech communications and simulation, and subjective testing of speech processing algorithms. From 1975 to 1977, he attended the Georgia Institute of Technology, where he studied communication theory and digital signal processing. From 1977 to 1980, he was with the Submarine Signal Division, Portsmouth, RI, USA, where he was involved in research on autoregressive spectral estimation and the design of sonar systems. As a leading expert in statistical signal processing, he has been invited to teach short courses to scientists and engineers at government laboratories, including NASA, Houston, TX, USA, and the Central Intelligence Agency, Fairfax, VA, USA.
He has written numerous journal and conference papers and is a contributor to several edited books. He is currently a Professor of Electrical Engineering with the University of Rhode Island, Kingston, RI, USA, and a Consultant to numerous industrial concerns, the Air Force, the Army, and the Navy. He has authored the textbooks Modern Spectral Estimation (Prentice-Hall, 1988), Fundamentals of Statistical Signal Processing, Vol. I: Estimation Theory (Prentice-Hall, 1993), Fundamentals of Statistical Signal Processing, Vol. II: Detection Theory (Prentice-Hall, 1998), and Intuitive Probability and Random Processes Using MATLAB (Springer, 2006). He has recently been included on a list of the 250 most cited researchers in the world in engineering. His current research interests include spectrum analysis, detection and estimation theory, and statistical signal processing. Dr. Kay is a member of Tau Beta Pi and Sigma Xi. He has received the IEEE Signal Processing Society Education Award for outstanding contributions in education and in writing scholarly books and texts. He has been a Distinguished Lecturer of the IEEE Signal Processing Society. He has been an Associate Editor of the IEEE SIGNAL PROCESSING LETTERS and the IEEE TRANSACTIONS ON SIGNAL PROCESSING.