Received March 2, 2016, accepted March 16, 2016, date of publication April 11, 2016, date of current version May 9, 2016. Digital Object Identifier 10.1109/ACCESS.2016.2552478

Feature Selection for Hidden Markov Models and Hidden Semi-Markov Models

STEPHEN ADAMS1, PETER A. BELING1, (Member, IEEE), AND RANDY COGILL2, (Member, IEEE)
1 Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22904, USA
2 IBM Ireland Ltd., IBM House, Shelbourne Road, Dublin D04 NP20, Ireland

Corresponding author: S. Adams ([email protected])

ABSTRACT In this paper, a joint feature selection and parameter estimation algorithm is presented for hidden Markov models (HMMs) and hidden semi-Markov models (HSMMs). New parameters, feature saliencies, are introduced to the model and used to select features that distinguish between states. The feature saliencies represent the probability that a feature is relevant by distinguishing between state-dependent and state-independent distributions. An expectation maximization algorithm is used to calculate maximum a posteriori estimates for model parameters. An exponential prior on the feature saliencies is compared with a beta prior. These priors can be used to include cost in the model estimation and feature selection process. This algorithm is tested against maximum likelihood estimates and a variational Bayesian method. For the HMM, four formulations are compared on a synthetic data set generated by models with known parameters, a tool wear data set, and data collected during a painting process. For the HSMM, two formulations, maximum likelihood and maximum a posteriori, are tested on the latter two data sets, demonstrating that the feature saliency method of feature selection can be extended to semi-Markov processes. The literature on feature selection specifically for HMMs is sparse, and non-existent for HSMMs. This paper fills a gap in the literature concerning simultaneous feature selection and parameter estimation for HMMs using the EM algorithm, and introduces the notion of selecting features with respect to cost for HMMs.

INDEX TERMS Feature selection, hidden Markov models, hidden semi-Markov models, maximum a posteriori estimation.

I. INTRODUCTION

Hidden Markov models (HMMs) and their extension, hidden semi-Markov models (HSMMs), are widely used for modeling sequential data. Speech recognition, video identification, finance, and tool wear monitoring are just a few of the fields in which HMMs and HSMMs have found widespread application. These models are composed of time series of correlated hidden and observed random variables, with the latter taking the form of a vector of features typically calculated from raw data produced by sensors or other observational channels. An important problem in the construction of HMMs and HSMMs is the selection of which features to use in the model. Methods for feature selection specifically designed for HMMs and HSMMs are the focus of this paper. Models constructed using a high-dimensional feature vector may contain features that do not contribute to distinguishing between the states. The principle of parsimony would suggest removing these features if this can be done without significantly affecting the usefulness of the model for estimating the underlying states. One possible approach to reducing the dimensionality of the feature vector is to model every possible subset of features and select the model with the greatest likelihood for explaining the data. This approach becomes impractical as the number of features increases because of exponential growth in the number of feature subsets. Further, this approach to feature selection does not include the cost of collecting each feature in the likelihood evaluation. It is also possible to consider notions of cost, both tangible and abstract, as the basis for feature selection. Sensors often represent major capital expenditures and have significant operating costs. In other scenarios, it may be important to consider data acquisition time or the consumption of system resources, such as bandwidth or payload. Collectively, the price paid (in money, opportunity, time, etc.) to obtain the feature vector is referred to as test cost [13], [17], [20], [27]. Feature selection techniques that incorporate test cost have been studied [7], [9], [15], [16], [21], [25], [29].

2169-3536 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

VOLUME 4, 2016

S. Adams et al.: Feature Selection for HMMs and HSMMs

These methods balance test cost with misclassification cost and primarily focus on decision systems [15], [16], [29], decision trees [7], [25], and K-nearest neighbor [9]. These methods require labeled data because they use a misclassification measure in the feature selection process. Ji and Carin [10] formulate the feature selection with cost problem as a partially observable Markov decision process (POMDP) in which the actions are the selection of features to sample, and the hidden states are mixture components. This POMDP formulation has several limitations, including significant computational requirements and the difficulty of handling continuous features. In this paper, we propose algorithms that bridge the literature on subset-based and test-cost-based methods by efficiently optimizing model parameter estimates and selecting relevant feature subsets given the cost of each feature. Our models, the feature saliency hidden Markov model (FSHMM) and its HSMM extension, simultaneously provide maximum a posteriori (MAP) estimates and select relevant features, under the assumption that the number of states is known. Knowing the number of underlying states makes it possible to use the expectation maximization (EM) algorithm. The EM algorithm results in accurate parameter estimates and reduces computational time and model complexity compared to other alternatives that have been considered in the literature. In our approach, new parameters called feature saliencies are introduced into the model and used to represent the degree to which a given feature can distinguish between states. A truncated exponential distribution with support on [0, 1] is chosen as a prior distribution on the feature saliencies. The hyperparameter for the truncated exponential distribution can be used as a weight forcing estimates for the saliencies towards zero.
By introducing this prior, the resulting estimation procedure is biased toward not including features in the model, unless the training data strongly suggests these features can be used to distinguish states. The weight can be increased until estimated feature saliencies start to fall below a desired threshold, and the features providing the most discriminating information are left. Further, the weight can be used to include the cost of collecting a feature. Features with a higher collection cost must provide more distinguishing information if they are to be included in the model. The obvious choice for a prior on the saliencies would be the beta distribution, given that the values are restricted to [0, 1]. We compare HMMs using both priors and find that the beta prior underestimates the value of relevant features. The beta prior also requires the choice of two hyperparameters, while the exponential only requires a single hyperparameter for each saliency. Zhu et al. [30] use variational Bayesian (VB) methods to jointly estimate model parameters and select features for HMMs. This method does not require a priori knowledge of the number of states, or the number of mixtures if a Gaussian mixture model (GMM) is used for the emission distribution. This lack of information increases computation time, increases the complexity of the model, and, in some cases, can decrease parameter estimation accuracy. When using the mean field assumption, VB can underestimate the variance of the approximate distribution [6]. Chatzis and Kosmopoulos [5] show that in the presence of outliers, VB gives poor estimates for model parameters and the number of states when using an HMM with Gaussian emissions. When predicting a state sequence given data, it is more efficient to use point estimates and the Viterbi algorithm than to calculate the distribution of all possible state sequences. Therefore, knowing the approximate posterior distributions is generally not necessary for prediction. We use a similar model to that of [30], but assume a fixed number of states, allowing us to use the EM algorithm. Further, we extend the feature saliency model to HSMMs. The contributions of this paper are:

• Derivation of an EM-based algorithm for the FSHMM. This method combines ideas from two different bodies of literature and provides simultaneous feature selection and parameter estimation with good practical performance.
• Extension of the FSHMM structure and methodology to HSMMs.
• Empirical comparison of the EM and VB algorithms for the FSHMM when the number of states is known. The principal conclusions are that the MAP FSHMM formulation outperforms the VB method in terms of predictive accuracy, selecting features with lower cost, robustness to the training set, and computational expense. The VB method is sensitive to the training set when selecting features, i.e., changes in the training set yield different selected feature subsets. The MAP FSHMM formulation is robust to changes in the training set due to the use of the prior distributions.
• Empirical investigation into the use of prior distributions as a method for incorporating the cost of features into the feature selection process.
A brief outline of the MAP FSHMM formulation and some preliminary experimental results have been previously presented [1]; however, the extension of feature saliency to HSMMs is novel to this study. This paper is organized as follows. Background information on HMMs, HSMMs, and feature selection is given in Section 2. The feature saliency model for HMMs and HSMMs is described in Section 3. The EM algorithm for both models is outlined in Section 4. Experiments on synthetic and real data are discussed in Section 5. Finally, conclusions are given in Section 6.

II. BACKGROUND

This section provides background information on HMMs and HSMMs, along with a brief overview of the literature on feature selection for HMMs. A HMM is composed of a sequence of unobserved states, modeled by a Markov chain, and a sequence of observed features related to the states. HMMs are generally used in applications where a sequence of unknown (hidden) states must be estimated from a sequence of correlated observations. In the machining problem of tool wear monitoring, to take a concrete example, hidden states may be used to represent an unobserved sequence of tool wear levels, and observations may correspond to features extracted from measured cutting force and vibration signals. The purpose of using a HMM is to enable the accurate estimation of tool wear levels from the observed force and vibration signals. Hidden semi-Markov models (HSMMs) [28] are an extension of the standard HMM in which multiple observations (as opposed to a single one) are emitted from a single state. The underlying Markov chain follows a semi-Markov process where each state has a duration or sojourn time modeled as a random variable. The duration is the number of time steps in the state before a state transition occurs. HSMMs are generally used in the same fields as HMMs, but they outperform HMMs when the duration is not geometrically decaying. When observed features are continuous, both HMMs and HSMMs are typically modeled using Gaussian random variables. Specifically, observed features are modeled as conditionally independent given the state sequence. So, HMMs can be completely specified by the initial probability mass function and transition matrix for the states, together with the parameters of a set of Gaussian distributions, where there is a distinct set of parameters for the conditional distribution for each given state. There are several types of HSMMs depending on the assumptions made about state transitions and duration times. In the most general type, the transition matrix includes the duration. In explicit duration HMMs, the state transition and duration are assumed to be independent. In applications, the parameters of HMMs and HSMMs are generally estimated from data.
When provided with state sequences and observed output features as training data, the parameters can be estimated separately from state transition frequencies and conditional feature statistics for each state. In the case of unsupervised learning, when the state sequence is not known, parameter estimates for a HMM can be obtained using the Baum-Welch algorithm [22], Markov chain Monte Carlo (MCMC) methods [4], or VB methods [14]. The Baum-Welch algorithm requires the number of hidden states to be known, while some MCMC and VB methods can determine the most likely number of states given the data. However, the MCMC and VB methods require significantly more computation time. Similar techniques exist for unsupervised learning of HSMMs [28], but they are more computationally complex than the HMM versions due to the addition of the duration random variable. There is a dearth of literature concerning a single process for feature selection and model estimation specifically for HMMs. To the best of our knowledge, there is no literature that specifically addresses feature selection for HSMMs; however, approaches for HMMs can be easily adapted to HSMMs. In the literature, feature selection and parameter estimation are generally cast as two separate problems.

Principal component analysis (PCA) can be used to reduce the number of features before the model parameters are estimated [2]. Nouza [19] compares PCA, discriminative feature analysis, and sequential search. Nouza concludes that sequential search and discriminative feature analysis outperform PCA but suffer from a higher computational load and require supervised data. Li and Bilmes [12] perform feature pruning given the model parameters, with the primary objective of reducing the computation time of the likelihood calculation for speech recognition systems. Features with little influence on the likelihood calculation are pruned and represented in the model using a simple function. Xie et al. [26] and Montero and Sucar [18] both extract a known set of features from the data rather than selecting features using an algorithm. The previously mentioned method of Zhu et al. [30] appears to be one of the few feature selection techniques specifically designed for HMMs. The vast majority of the techniques conduct feature selection during a pre-processing phase, which is independent of the model. The proposed FSHMM method differs from this VB approach by using the EM algorithm and calculating MAP estimates. This allows for a less complex model and the incorporation of the cost of features into the feature selection process.

III. FEATURE SALIENCY MODEL

In this section, we describe the feature saliency hidden Markov model (FSHMM) and the feature saliency explicit duration hidden Markov model (FSEDHMM). An explicit duration HMM is a specific type of HSMM where it is assumed that duration and state transition are independent.

A. FEATURE SALIENCY HIDDEN MARKOV MODEL

Consider a HMM with continuous emissions and I states. Let y = {y_0, y_1, ..., y_T} be the sequence of observed data, where each y_t ∈ R^L. The observation for the l-th feature at time t, which is represented by the l-th component of y_t, is denoted by y_lt. Let x = {x_0, x_1, ..., x_T} be the unobserved state sequence. The transition matrix of the Markov chain associated with this sequence is denoted as A. The components of this transition matrix are denoted by a_ij = P(x_t = j | x_{t-1} = i), and π is used to denote the initial state distribution. In terms of these quantities, the complete data likelihood can be written as:

$$p(x, y \mid \Lambda) = \pi_{x_0} f_{x_0}(y_0) \prod_{t=1}^{T} a_{x_{t-1}, x_t}\, f_{x_t}(y_t), \tag{1}$$

where Λ is the set of model parameters, and f_{x_t}(y_t) is the emission distribution given state x_t. For feature selection, we use a feature saliency model for the emission distributions [11]. A feature is considered to be relevant if its distribution is dependent on the underlying state and irrelevant if its distribution is independent of the state. Let z = {z_1, ..., z_L} be a set of binary variables indicating the relevancy of each feature. If z_l = 1, then the l-th feature is relevant. Otherwise, if z_l = 0, the l-th feature is irrelevant.


The feature saliency ρ_l is defined as the probability that the l-th feature is relevant. Assuming the features are conditionally independent given the state allows the conditional distribution of y_t given z and x to be written as:

$$p(y_t \mid z, x_t = i, \Lambda) = \prod_{l=1}^{L} r(y_{lt} \mid \mu_{il}, \sigma_{il}^2)^{z_l}\, q(y_{lt} \mid \epsilon_l, \tau_l^2)^{1 - z_l}, \tag{2}$$

where r(y_lt | µ_il, σ²_il) is the Gaussian conditional feature distribution for the l-th feature with state-dependent mean µ_il and state-dependent variance σ²_il, and q(y_lt | ε_l, τ²_l) is the state-independent Gaussian feature distribution with mean ε_l and variance τ²_l. For the remainder of this paper, we treat all features as conditionally independent given the state. For the FSHMM, the set of model parameters Λ is {π, A, µ, σ, ρ, ε, τ}. The marginal probability of z is:

$$P(z \mid \Lambda) = \prod_{l=1}^{L} \rho_l^{z_l} (1 - \rho_l)^{1 - z_l}. \tag{3}$$

The joint distribution of y_t and z given x is:

$$p(y_t, z \mid x_t = i, \Lambda) = \prod_{l=1}^{L} \left[\rho_l\, r(y_{lt} \mid \mu_{il}, \sigma_{il}^2)\right]^{z_l} \left[(1 - \rho_l)\, q(y_{lt} \mid \epsilon_l, \tau_l^2)\right]^{1 - z_l}. \tag{4}$$
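The structure of (3)-(4) can be checked numerically: because (4) factors over features, summing it over all 2^L configurations of z collapses, feature by feature, into a two-component mixture of r and q. A sketch with hypothetical values for a single state and L = 2 features:

```python
import numpy as np
from itertools import product

def gauss_pdf(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def joint_y_z(y_t, z, rho, mu_i, sigma2_i, eps, tau2):
    """p(y_t, z | x_t = i) per (4), with Gaussian r and q."""
    p = 1.0
    for l, z_l in enumerate(z):
        r = gauss_pdf(y_t[l], mu_i[l], sigma2_i[l])  # state-dependent density
        q = gauss_pdf(y_t[l], eps[l], tau2[l])       # state-independent density
        p *= (rho[l] * r) ** z_l * ((1 - rho[l]) * q) ** (1 - z_l)
    return p

# Hypothetical parameters for one state and L = 2 features.
rho = np.array([0.9, 0.3])
mu_i, sigma2_i = np.array([1.0, -1.0]), np.array([1.0, 2.0])
eps, tau2 = np.array([0.0, 0.0]), np.array([4.0, 4.0])
y_t = np.array([0.7, 0.1])

# Summing (4) over every binary vector z gives the marginal emission density,
# which equals the per-feature mixture of r and q.
marginal = sum(joint_y_z(y_t, z, rho, mu_i, sigma2_i, eps, tau2)
               for z in product([0, 1], repeat=2))
mixture = np.prod([rho[l] * gauss_pdf(y_t[l], mu_i[l], sigma2_i[l])
                   + (1 - rho[l]) * gauss_pdf(y_t[l], eps[l], tau2[l])
                   for l in range(2)])
print(np.isclose(marginal, mixture))  # → True
```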

The marginal distribution for y_t given x can be found by summing (4) over z and is:

$$f_{x_t}(y_t) = p(y_t \mid x_t = i, \Lambda) = \prod_{l=1}^{L} \left[\rho_l\, r(y_{lt} \mid \mu_{il}, \sigma_{il}^2) + (1 - \rho_l)\, q(y_{lt} \mid \epsilon_l, \tau_l^2)\right]. \tag{5}$$

The complete data likelihood for the FSHMM is:

$$p(x, y, z \mid \Lambda) = \pi_{x_0}\, p(y_0, z \mid x_0, \Lambda) \prod_{t=1}^{T} a_{x_{t-1}, x_t}\, p(y_t, z \mid x_t, \Lambda). \tag{6}$$

B. FEATURE SALIENCY EXPLICIT DURATION HIDDEN MARKOV MODEL

Consider a HSMM with I states. Let y = {y_0, y_1, ..., y_T} be the sequence of observed data, where each y_t ∈ R^L. The observation for the l-th feature at time t, which is represented by the l-th component of y_t, is denoted by y_lt. Let x = {x_0, x_1, ..., x_T} be the unobserved state sequence. Assume that the hidden states follow a semi-Markov process, where the duration in a state is modeled as a truncated Poisson random variable with parameter λ. Let D = {d_0, ..., d_N} be the sequence of duration random variables, where N + 1 is the number of sojourn periods, or one plus the number of state transitions. Note that there are two sets of subscripts: t represents the time step, and n represents the n-th sojourn period. The duration of the n-th sojourn period can last for multiple time steps, with a minimum value of d_min and a maximum value of d_max. The sum of the individual duration random variables is equal to the total number of time steps: Σ_{n=1}^{N} d_n = T. The probability of duration d in state j is represented by p_j(d). The explicit duration formulation assumes that state transition and duration are independent. The truncated Poisson distribution is used for p_j(d):

$$p_j(d) = \begin{cases} P(d \mid \lambda_j) & d_{\min} \le d \le d_{\max}, \\ 0 & \text{otherwise}, \end{cases} \tag{7}$$

where

$$P(d \mid \lambda_j) = \frac{\lambda_j^d e^{-\lambda_j} / d!}{\sum_{d=d_{\min}}^{d_{\max}} \lambda_j^d e^{-\lambda_j} / d!} = \frac{\lambda_j^d / d!}{\sum_{d=d_{\min}}^{d_{\max}} \lambda_j^d / d!}. \tag{8}$$

As with the FSHMM, let z = {z_1, ..., z_L} be a set of binary variables indicating the relevancy of each feature. If z_l = 1, then the l-th feature is relevant. Otherwise, if z_l = 0, the l-th feature is irrelevant. The state-dependent and state-independent distributions are assumed to be Gaussian, and the same notation as the FSHMM is used. The set of latent random variables for the FSEDHMM is {x, z, D}, and the set of model parameters Λ is {π, A, µ, σ, ρ, ε, τ, λ}. The marginal probability of z is the same as (3), and the joint distribution of y_t and z given x, P(y_t, z | x_t = i, Λ), is the same as (4). The joint probability of all random variables is:

$$P(x, y, z, D) = \pi_{x_0}\, p_{x_0}(d_0) \left[\prod_{\tau=1}^{d_0} P(y_\tau, z \mid x_0 = i, \Lambda)\right] \times \prod_{n=1}^{N} a_{x_{n-1}, x_n}\, p_{x_n}(d_n) \left[\prod_{\tau=\hat{d}_n + 1}^{\hat{d}_n + d_n} P(y_\tau, z \mid x_n = i, \Lambda)\right], \tag{9}$$

where

$$\hat{d}_n = \sum_{\hat{n}=1}^{n-1} d_{\hat{n}}. \tag{10}$$
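The truncated Poisson in (7)-(8) is an ordinary Poisson pmf renormalized over the support [d_min, d_max]. A direct sketch (the parameter values below are hypothetical):

```python
import numpy as np
from math import exp, factorial

def truncated_poisson_pmf(d, lam, d_min, d_max):
    """p_j(d) per (7)-(8): Poisson weights renormalized over [d_min, d_max]."""
    if d < d_min or d > d_max:
        return 0.0  # zero probability outside the support, per (7)
    weight = lambda k: lam ** k * exp(-lam) / factorial(k)
    norm = sum(weight(k) for k in range(d_min, d_max + 1))
    return weight(d) / norm

# Hypothetical duration parameters.
lam, d_min, d_max = 3.0, 1, 10
pmf = [truncated_poisson_pmf(d, lam, d_min, d_max) for d in range(d_min, d_max + 1)]
print(abs(sum(pmf) - 1.0) < 1e-12)          # → True: sums to one over the support
print(truncated_poisson_pmf(0, lam, d_min, d_max))  # → 0.0: outside the support
```

Note the e^{-λ_j} factor cancels between numerator and denominator, which is the second equality in (8); it is kept here only for readability.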

IV. EM ALGORITHM FOR FSHMM AND FSEDHMM

The EM algorithm, referred to as the Baum-Welch algorithm when used with HMMs ([3], [22]), is used to calculate maximum likelihood (ML) estimates for the model parameters. Priors can be placed on the parameters to calculate the MAP estimates [8]. The Baum-Welch algorithm iterates between two steps: the expectation step and the maximization step, often abbreviated as E-step and M-step. The E-step finds the expected value of the complete log-likelihood with respect to the state, given the data and the current model parameters. The M-step maximizes the expectation computed in the E-step to find the next set of model parameters. These two steps are repeated until convergence. The expectation of the complete log-likelihood is designated the Q function, given by:

$$Q(\Lambda, \Lambda') = E[\log p(x, y \mid \Lambda) \mid y, \Lambda']. \tag{11}$$

In (11), Λ represents the set of model parameters for the current iteration, and Λ' is the set of parameters from the previous iteration. For ML estimation, the Q function in (11) is calculated in the E-step and then maximized with respect to Λ for the M-step. For MAP estimation, the Q function is modified by adding terms corresponding to the prior on the model parameters, G(Λ):

$$Q(\Lambda, \Lambda') + \log(G(\Lambda)). \tag{12}$$

That is, the Q function in (11) is calculated for the E-step as in the ML algorithm, but the log of G(Λ) is added to the Q function and maximized for the M-step. For both ML and MAP, the quantities in (11) and (12) are maximized by computing roots of their derivatives with respect to Λ. For the FSHMM, the set of hidden variables is composed of the states and the feature saliencies. The Q function in this case incorporates the hidden variables associated with feature saliencies, and is given by:

$$Q(\Lambda, \Lambda') = E[\log p(x, y, z \mid \Lambda) \mid y, \Lambda'] = \sum_{x, z} \log(p(x, y, z \mid \Lambda))\, P(x, z \mid y, \Lambda'). \tag{13}$$

The Q function for the FSEDHMM includes the duration random variable:

$$Q(\Lambda, \Lambda') = E[\log P(x, y, z, D \mid \Lambda) \mid y, \Lambda'] = \sum_{x, z, D} \log(P(x, y, z, D \mid \Lambda))\, P(x, z, D \mid y, \Lambda'). \tag{14}$$

A. PROBABILITIES FOR E-STEP

For the FSHMM E-step, calculate the following probabilities:

$$\gamma_t(i) = P(x_t = i \mid y, \Lambda'), \tag{15}$$

$$\xi_t(i, j) = P(x_{t-1} = i, x_t = j \mid y, \Lambda'), \tag{16}$$

where γ_t(i) and ξ_t(i, j) are calculated using the forward-backward algorithm ([3], [22]). Additionally, calculate:

$$e_{ilt} = p(y_{lt}, z_l = 1 \mid x_t = i, \Lambda') = \rho_l\, r(y_{lt} \mid \mu_{il}, \sigma_{il}^2), \tag{17}$$

$$h_{ilt} = p(y_{lt}, z_l = 0 \mid x_t = i, \Lambda') = (1 - \rho_l)\, q(y_{lt} \mid \epsilon_l, \tau_l^2), \tag{18}$$

$$g_{ilt} = p(y_{lt} \mid x_t = i, \Lambda') = e_{ilt} + h_{ilt}, \tag{19}$$

$$u_{ilt} = P(z_l = 1, x_t = i \mid y, \Lambda') = \frac{\gamma_t(i)\, e_{ilt}}{g_{ilt}}, \tag{20}$$

$$v_{ilt} = P(z_l = 0, x_t = i \mid y, \Lambda') = \frac{\gamma_t(i)\, h_{ilt}}{g_{ilt}} = \gamma_t(i) - u_{ilt}, \tag{21}$$

where γ_t(i), ξ_t(i, j), u_ilt, and v_ilt will be used in the M-step. For the FSEDHMM E-step, again calculate γ_t(j) and ξ_t(i, j), but use the forward-backward algorithm for an explicit duration HMM outlined in [28]. In addition, η_t(j, d) (also outlined in [28]) must be calculated during the FSEDHMM E-step. Then calculate the probabilities in (17) to (21) using γ_t(j) for HSMMs.

B. ML M-STEP

The parameter estimates for the initial state distribution and the transition probabilities are the same as in the Baum-Welch algorithm. The estimates for the parameters of r(·|·) and q(·|·), and the feature saliencies, use the probabilities defined in the previous section, (20) and (21). Specifically, the parameter estimates for the FSHMM ML M-step are given by:

$$\pi_i = \gamma_0(i), \qquad a_{ij} = \frac{\sum_{t=1}^{T} \xi_t(i, j)}{\sum_{t=0}^{T-1} \gamma_t(i)},$$

$$\mu_{il} = \frac{\sum_{t=0}^{T} u_{ilt}\, y_{lt}}{\sum_{t=0}^{T} u_{ilt}}, \qquad \sigma_{il}^2 = \frac{\sum_{t=0}^{T} u_{ilt} (y_{lt} - \mu_{il})^2}{\sum_{t=0}^{T} u_{ilt}},$$

$$\epsilon_l = \frac{\sum_{t=0}^{T} \sum_{i=1}^{I} v_{ilt}\, y_{lt}}{\sum_{t=0}^{T} \sum_{i=1}^{I} v_{ilt}}, \qquad \tau_l^2 = \frac{\sum_{t=0}^{T} \sum_{i=1}^{I} v_{ilt} (y_{lt} - \epsilon_l)^2}{\sum_{t=0}^{T} \sum_{i=1}^{I} v_{ilt}},$$

$$\rho_l = \frac{\sum_{t=0}^{T} \sum_{i=1}^{I} u_{ilt}}{T + 1}.$$

For the FSEDHMM ML M-step, additionally calculate the parameter estimate for the duration distribution:

$$\lambda_i = \frac{\sum_{t=0}^{T} \sum_{d=d_{\min}}^{d_{\max}} \eta_t(i, d)\, d}{\sum_{t=0}^{T} \sum_{d=d_{\min}}^{d_{\max}} \eta_t(i, d)}. \tag{22}$$

C. MAP M-STEP

The priors used for MAP estimation are listed below. Dir is the Dirichlet distribution, N is the Gaussian distribution, IG is the inverse gamma distribution, and A_i is row i of the transition matrix:

$$\pi \sim \mathrm{Dir}(\pi \mid \beta), \qquad A_i \sim \mathrm{Dir}(A_i \mid \alpha_i),$$

$$\mu_{il} \sim \mathcal{N}(\mu_{il} \mid m_{il}, s_{il}^2), \qquad \sigma_{il}^2 \sim \mathrm{IG}(\sigma_{il}^2 \mid \zeta_{il}, \eta_{il}),$$

$$\epsilon_l \sim \mathcal{N}(\epsilon_l \mid b_l, c_l^2), \qquad \tau_l^2 \sim \mathrm{IG}(\tau_l^2 \mid \nu_l, \psi_l),$$

$$\rho_l \sim \frac{1}{Z} e^{-k_l \rho_l},$$

where Z is the normalizing constant. Figure 1 contains the graphical model for the MAP formulation.
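Once γ_t(i) is available from the forward-backward pass, the quantities in (17)-(21) are elementwise operations over states, features, and time. A vectorized numpy sketch; the inputs below are random placeholders rather than the output of a real forward-backward implementation:

```python
import numpy as np

def gauss_pdf(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def e_step_quantities(y, gamma, rho, mu, sigma2, eps, tau2):
    """e, h, g, u, v of (17)-(21), vectorized over states i, features l, times t.

    y : (T, L) observations; gamma : (T, I) state posteriors;
    rho : (L,); mu, sigma2 : (I, L); eps, tau2 : (L,).
    Returns five arrays of shape (I, L, T).
    """
    e = rho[None, :, None] * gauss_pdf(y.T[None], mu[:, :, None],
                                       sigma2[:, :, None])               # (17)
    h = (1 - rho)[None, :, None] * gauss_pdf(y.T[None], eps[None, :, None],
                                             tau2[None, :, None])        # (18)
    g = e + h                                   # (19)
    u = gamma.T[:, None, :] * e / g             # (20)
    v = gamma.T[:, None, :] * h / g             # (21); equals gamma - u
    return e, h, g, u, v

# Random placeholder inputs (gamma would come from forward-backward).
rng = np.random.default_rng(0)
T, I, L = 5, 2, 3
y = rng.normal(size=(T, L))
gamma = rng.dirichlet(np.ones(I), size=T)
rho = np.full(L, 0.5)
mu, sigma2 = rng.normal(size=(I, L)), np.ones((I, L))
eps, tau2 = np.zeros(L), np.ones(L)
e, h, g, u, v = e_step_quantities(y, gamma, rho, mu, sigma2, eps, tau2)
print(np.allclose(u + v, gamma.T[:, None, :]))  # → True, the identity in (21)
```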


FIGURE 1. Graphical model for MAP formulation. Squares represent hidden variables. Filled circles are observable variables. Open circles are model parameters.

The parameter update equations are:

$$\pi_i = \frac{\gamma_0(i) + \beta_i - 1}{\sum_{i=1}^{I} (\gamma_0(i) + \beta_i - 1)},$$

$$a_{ij} = \frac{\sum_{t=1}^{T} \xi_t(i, j) + \alpha_{ij} - 1}{\sum_{j=1}^{I} \left(\sum_{t=1}^{T} \xi_t(i, j) + \alpha_{ij} - 1\right)},$$

$$\mu_{il} = \frac{s_{il}^2 \sum_{t=0}^{T} u_{ilt}\, y_{lt} + \sigma_{il}^2 m_{il}}{s_{il}^2 \sum_{t=0}^{T} u_{ilt} + \sigma_{il}^2},$$

$$\sigma_{il}^2 = \frac{\sum_{t=0}^{T} u_{ilt} (y_{lt} - \mu_{il})^2 + 2\eta_{il}}{\sum_{t=0}^{T} u_{ilt} + 2(\zeta_{il} + 1)},$$

$$\epsilon_l = \frac{c_l^2 \sum_{t=0}^{T} \sum_{i=1}^{I} v_{ilt}\, y_{lt} + \tau_l^2 b_l}{c_l^2 \sum_{t=0}^{T} \sum_{i=1}^{I} v_{ilt} + \tau_l^2},$$

$$\tau_l^2 = \frac{\sum_{t=0}^{T} \sum_{i=1}^{I} v_{ilt} (y_{lt} - \epsilon_l)^2 + 2\psi_l}{\sum_{t=0}^{T} \sum_{i=1}^{I} v_{ilt} + 2(\nu_l + 1)},$$

$$\rho_l = \frac{\widehat{T} - \sqrt{\widehat{T}^2 - 4 k_l \sum_{t=0}^{T} \sum_{i=1}^{I} u_{ilt}}}{2 k_l},$$

where $\widehat{T} = T + 1 + k_l$. The ML and MAP estimates can be easily adjusted for multiple observation sequences. Another choice for a prior on ρ is the beta distribution, because ρ is restricted to [0, 1]. One would think that a beta prior is more appropriate than the truncated exponential, but we found through experimentation that models using this prior do not perform as well when estimating the saliency of features. The parameter update for ρ using the beta prior B(ρ_l | k_l, κ_l) is:

$$\rho_l = \frac{\sum_{t=0}^{T} \sum_{i=1}^{I} u_{ilt} + k_l - 1}{(T + 1) + k_l + \kappa_l - 2}. \tag{23}$$

In order to ensure that ρ_l ≥ 0, we must set k_l ≥ 1 for all l. When k_l = 1, the parameter estimate for ρ simplifies to:

$$\rho_l = \frac{\sum_{t=0}^{T} \sum_{i=1}^{I} u_{ilt}}{(T + 1) + \kappa_l - 1}. \tag{24}$$

As long as there is more than one observation in y, the denominator of (24) will be greater than 0. For the remainder of the paper, we will designate the formulation using the truncated exponential as MAP, and the formulation using the beta prior as MAP-beta. For the FSEDHMM MAP M-step, a Gamma prior is used on the duration distribution parameter, λ_i ∼ G(λ_i | o_i, ϖ_i), which yields the following parameter update equation:

$$\hat{\lambda}_i = \frac{\sum_{t=1}^{T} \sum_{d=d_{\min}}^{d_{\max}} \eta_t(i, d)\, d + o_i - 1}{\sum_{t=1}^{T} \sum_{d=d_{\min}}^{d_{\max}} \eta_t(i, d) + \varpi_i}. \tag{25}$$

Either the truncated exponential prior or the beta prior can be used on ρ for the FSEDHMM.

V. EXPERIMENTS

In this section, numerical experiments are conducted on a set of synthetic data, a tool wear data set used in the 2010 Prognostics and Health Management (PHM) conference data challenge,1 and an activity recognition data set collected using a Microsoft Kinect2 (see [23], [24] for more information about this data set). For the experiments using a FSHMM, model parameters are estimated using ML, MAP, MAP-beta, and the VB approach outlined in [30]. The parameter estimates are compared with the true values for the synthetic data experiments. Prediction accuracy for a test set is calculated for the PHM and Microsoft Kinect data. For the experiments using the FSEDHMM, model parameters are trained using only the ML and MAP formulations, and results are presented for the PHM and Kinect data sets. Experiments using the FSEDHMM were conducted on synthetic data, but the results are not presented because the conclusions are the same as those for the FSHMM experiments on synthetic data.

A. SYNTHETIC DATA - FSHMM

Three observation sequences are produced by a model with two relevant features. The two-dimensional vectors of relevant features are generated from N(µ_i, Σ). The model has two states, and 500 time steps are generated for each sequence. The model parameters are:

$$\mu_1 = \begin{bmatrix} 10 & 20 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} 30 & 60 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 25 & 0 \\ 0 & 25 \end{bmatrix},$$

$$A = \begin{bmatrix} 0.75 & 0.25 \\ 0.4 & 0.6 \end{bmatrix}, \quad \pi = \begin{bmatrix} 0.4 & 0.6 \end{bmatrix}.$$

Three irrelevant features of random noise, generated from N(0, I), are added to the data, resulting in a model with five features in total. The hyperparameters for the priors in the MAP formulation are: β_i = α_ij = 2, s_il = ζ_il = η_il = ν_l = ψ_l = 0.5, c_l = 1. b_l is the mean of the observations for the l-th feature. m_1l is b_l minus one standard deviation, and m_2l is b_l plus one standard deviation. For ρ_l, the weight parameter k_l is set to 50. (At k_l ≥ 50, the majority of the exponential prior's density lies between zero and one.) For MAP-beta, the hyperparameters are the same except k_l = 1 and κ_l = 49, giving the same expectation as the truncated exponential prior. The hyperparameters for the VB formulation are the same as those described in [30]. The algorithms are initialized using equal initial probabilities and transition probabilities. µ is randomly selected, σ = 4, ε = b, and τ is the standard deviation of the data. ρ for all features is always initially set to 0.5. The algorithms are run for a maximum of 1000 iterations. Convergence is tested by calculating the percent change in the likelihood for ML, the posterior probability for MAP and MAP-beta, and the lower bound for VB. The convergence threshold is 10^{-9}. The estimates for the VB formulation are the expectations of the approximate posterior distributions. The synthetic data set has two observation sequences that start in state 1 and one sequence starting in state 2. The estimates for π for all formulations reflect the training set data, giving a higher probability to initially starting in state 1, and are therefore not presented in detail. The parameter estimates for the transition matrix A, the state-dependent distribution r(·|·), and the state-independent distribution q(·|·) are close to the true values and similar for each formulation, so these results are also not presented. However, the formulations differ when estimating ρ, and these parameter estimates are displayed in Table 1.

TABLE 1. Parameter estimates for feature saliencies of all features. ML overestimates the relevance of irrelevant features. MAP-beta underestimates the relevance of relevant features. MAP and VB both accurately estimate the relevance of all features.

1 http://www.phmsociety.org/competition/phm/10
2 http://bart.sys.virginia.edu/incom2015.tgz
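The generating model above is straightforward to simulate. A sketch (the random seed and sampling order are our own choices, not specified in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of the synthetic generating model.
mu = np.array([[10.0, 20.0], [30.0, 60.0]])   # relevant-feature means per state
cov = 25.0 * np.eye(2)                         # Sigma = diag(25, 25)
A = np.array([[0.75, 0.25], [0.4, 0.6]])
pi = np.array([0.4, 0.6])
T = 500

def simulate_sequence():
    """One sequence: 2 relevant features plus 3 N(0, I) noise features."""
    states = np.empty(T, dtype=int)
    states[0] = rng.choice(2, p=pi)
    for t in range(1, T):
        states[t] = rng.choice(2, p=A[states[t - 1]])
    relevant = np.array([rng.multivariate_normal(mu[s], cov) for s in states])
    irrelevant = rng.normal(size=(T, 3))   # pure noise; saliency should tend to 0
    return states, np.hstack([relevant, irrelevant])

sequences = [simulate_sequence() for _ in range(3)]  # three training sequences
```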


The estimates for ρ of the relevant features are greater than or equal to 0.99 for ML, MAP, and VB. MAP-beta underestimates ρ for the relevant features. We find that underestimating the saliency of relevant features is the primary disadvantage of the MAP-beta formulation. When k_l > 1 and κ_l is chosen so the expectation matches the exponential prior, MAP-beta continues to underestimate ρ for the relevant features. The ML method produces the highest ρ for the irrelevant features, while MAP, MAP-beta, and VB produce smaller estimates. (In order to prevent division by zero in the VB algorithm, an upper and lower bound is placed on the saliencies. For this experiment, these were 1 − 10^{-9} and 10^{-9}.) Overestimating the saliency of irrelevant features is one disadvantage of the ML formulation. After 1000 iterations, the VB method had not converged. MAP converged after 359 iterations, MAP-beta after 366 iterations, and ML after 372 iterations. When either the number of iterations is decreased or the convergence threshold is increased, the VB method produces greater saliency estimates for the irrelevant features. The estimates produced by MAP, MAP-beta, and ML are not significantly affected by changes in the maximum number of iterations or the convergence threshold. Through testing, we found that the VB method is sensitive to the number of observations in the training data. Specifically, the estimated feature saliencies for irrelevant features increase as the number of observations increases. To illustrate, we test on synthetic data. The same model as above is used to generate a sequence of 100,000 observations. Models are trained using an increasing number of observations from the synthetic data. The same starting values and hyperparameters as the previous test are used. However, k is changed in the MAP formulation to scale with the number of observations (k = T/4). For MAP-beta, k = 1 and κ is scaled.
As previously stated, the irrelevant feature saliencies in the VB formulation increase with the number of observations. There is an increase in the relevant feature saliencies, but it is very small in scale. The feature saliencies for relevant and irrelevant features for ML, MAP, and MAP-beta appear unaffected by the number of observations. The estimated ρ for each algorithm for feature 3, an irrelevant feature, are plotted in Figure 2. The estimates for MAP and MAP-beta are indistinguishable because feature 3 is irrelevant.
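The claim that the beta prior with k_l = 1 and κ_l = 49 matches the truncated exponential prior in expectation can be checked directly. The following derivation is ours, added for clarity:

```latex
p(\rho_l) = \frac{k_l e^{-k_l \rho_l}}{1 - e^{-k_l}}, \quad \rho_l \in [0,1],
\qquad
\mathbb{E}[\rho_l] = \frac{1}{k_l} - \frac{1}{e^{k_l} - 1} \approx \frac{1}{50}
\ \text{for } k_l = 50,
```

```latex
\rho_l \sim \mathrm{Beta}(k_l, \kappa_l) \ \Rightarrow \
\mathbb{E}[\rho_l] = \frac{k_l}{k_l + \kappa_l} = \frac{1}{1 + 49} = \frac{1}{50}.
```

For k_l ≥ 50 the correction term 1/(e^{k_l} − 1) is negligible, which is why the truncated exponential prior's mass effectively lies inside [0, 1], as noted above.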

wear value. Models with 5, 10, and 20 hidden states are trained. Full models use all 18 features. We use ρ to remove features and construct reduced models. When predicting states using the Viterbi algorithm, only the state-dependent parameters are used. This reduces the number of parameters that must be stored after training and reduces the computational resources needed for prediction. The model now resembles a standard HMM. The conditional probability used for the Viterbi algorithm is:

$$p(y_t \mid x_t = i, \Lambda) = \prod_{l=1}^{L} r(y_{lt} \mid \mu_{il}, \sigma_{il}^2). \tag{26}$$
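Equation (26) plugs into a standard log-domain Viterbi pass over the reduced model. The sketch below is our own illustration (the names `viterbi` and `gauss_logpdf` are assumptions, not from the paper), with diagonal Gaussian state-dependent emissions:

```python
import numpy as np

def gauss_logpdf(y, mean, sd):
    """Log of the univariate Gaussian density, elementwise."""
    return -0.5 * ((y - mean) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

def viterbi(Y, log_pi, log_A, mu, sigma):
    """Most likely state path under Eq. (26): the emission term is the sum
    over features of the state-dependent Gaussian log densities."""
    T, L = Y.shape
    I = len(log_pi)
    # log p(y_t | x_t = i) = sum_l log r(y_lt | mu_il, sigma_il^2)
    log_b = np.array([[gauss_logpdf(Y[t], mu[i], sigma[i]).sum()
                       for i in range(I)] for t in range(T)])
    delta = np.full((T, I), -np.inf)
    psi = np.zeros((T, I), dtype=int)
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):               # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

Because only µ_il and σ_il enter the emission term, the state-independent parameters and saliencies can be discarded after training, as the text notes.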

FIGURE 2. Plot for estimated ρ for feature 3, an irrelevant feature, as the number of observations in the training set increases.

B. PHM DATA - FSHMM

In this set of experiments, ML, MAP, MAP-beta, and VB are compared using a tool wear data set. The 2010 Prognostics and Health Management (PHM) conference used this data set for its data challenge. The data consist of six tools used for 315 cuts on a CNC milling machine. Three tools (designated Tools 1, 4, and 6) are supervised and have corresponding wear measurements. Three tools (Tools 2, 3, and 5) are unsupervised. Force and vibration in three directions (X, Y, and Z) are collected for each cut. Root mean square (RMS), the sum of log energies (SLE), and maximum energy (ME) are calculated from each sensor and used as features. The features are normalized so their scales are similar. We assume the cost of each force sensor is $2400 and the cost of each vibration sensor is $1200. For this set of numerical experiments, we assume that each direction of each sensor can be removed while leaving the other directions in place. For example, the force sensor in the X direction can be removed from the mill without removing the Y and Z directions. We use a leave-one-out cross-validation testing methodology. One of the supervised tools is removed from the data, and a model is trained on the remaining five tools. The model is then tested on the withheld tool by predicting the wear state. All three supervised tools are tested. The wear measurements are divided into equally spaced discrete bins. The wear bins correspond to the hidden states of the FSHMM. The Viterbi algorithm is used to predict the state sequence of the test tool. The median of the wear bin is used as the predicted wear value. Root mean squared error (RMSE) is calculated between the true wear measurement and the predicted
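The wear binning and error metric can be sketched as follows. This is our own illustrative code, not the authors' implementation; the bin midpoint stands in for the bin median here:

```python
import numpy as np

def wear_bins(train_wear, n_states):
    """Equally spaced bins over the observed wear range; each bin is a
    hidden state, and the bin midpoint is the predicted wear for that state."""
    edges = np.linspace(train_wear.min(), train_wear.max(), n_states + 1)
    midpoints = 0.5 * (edges[:-1] + edges[1:])
    return edges, midpoints

def rmse(true_wear, predicted_states, midpoints):
    """RMSE between measured wear and the wear implied by the Viterbi path."""
    pred = midpoints[predicted_states]
    return float(np.sqrt(np.mean((true_wear - pred) ** 2)))
```

For example, with wear spanning 0–100 µm and 5 states, the states correspond to bins [0, 20), [20, 40), ..., with predicted wear values 10, 30, 50, 70, 90 µm.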

For reduced models, we remove all features associated with a single sensor, reducing the feature set by three features. ρ̄ is the mean of the three feature saliencies associated with a particular sensor. For example, ρ̄_ForceX is the mean of the ρ for RMS, SLE, and ME for the force in the X direction. The sensor with the smallest ρ̄ is removed to form the reduced model. Tool wear is non-decreasing; therefore, we assume the underlying Markov chain is left-to-right. However, the stick-breaking representation of the hierarchical Dirichlet process in the VB formulation does not allow for a strict left-to-right Markov chain. The ML and MAP algorithms are initialized with the same values. The initial self-transition a_ii is 0.95, the transition to the next state a_i,i+1 is 0.05, and π_1 = 1. The state-dependent means µ_il are equally spaced between −2 and 2. The state-dependent standard deviation σ_il is 1 for all states and features. The state-independent parameters are calculated from the training data. For MAP, the prior parameters are α_ii = α_i,i+1 = 2, α_ij = 1 for j ≠ i and j ≠ i + 1, β_1 = 2, β_i≠1 = 1, m = µ_init, s = 0.5, ζ = η = ν = ψ = 0.5, b = 0, and c = 1. We use half the assumed cost of each sensor for k_l (k_l = 1200 for the force features and k_l = 600 for the vibration features). For MAP-beta, the hyperparameters are the same as for MAP except that k_l = 1, and κ_l is half the assumed cost of the sensor. The hyperparameters for VB are the same as in [30]. We attempt to set all initial values for VB as close as possible to the initial values for the EM models. The convergence threshold for this experiment is lowered to 10^−6 for all four algorithms. Tables 2 to 4 contain the RMSE values for the full and reduced models, the ρ̄ for the removed sensor, and the sensor removed during feature selection. Removing a sensor decreases average RMSE in only 4 of the 12 tests.
However, when a sensor is removed, the average RMSE increases by more than 2 µm in only two tests (VB using 20 states and MAP-beta using 10 states). The largest increase in RMSE for the reduced model occurs for the MAP-beta test on Tool 6 using 10 states. We believe this is due to the model underestimating relevant features, resulting in skewed parameter estimates. In the experiments using 20 states, ML yields the lowest average RMSE for the full (21.81 µm) and reduced
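The sensor-level selection rule above can be sketched as follows. The feature layout and sensor names are our hypothetical illustration of the 18-feature setup (6 sensors × 3 features), not taken from the paper's code:

```python
import numpy as np

# Hypothetical layout: features grouped by sensor, three per sensor
# (RMS, SLE, ME), in this fixed order.
SENSORS = ["ForceX", "ForceY", "ForceZ", "VibX", "VibY", "VibZ"]

def sensor_to_remove(rho):
    """rho: length-18 saliency vector, grouped by sensor as above.
    rho_bar is the mean saliency per sensor; the smallest rho_bar
    identifies the sensor whose three features are dropped."""
    rho_bar = np.asarray(rho).reshape(len(SENSORS), 3).mean(axis=1)
    return SENSORS[int(rho_bar.argmin())], rho_bar
```

The reduced model then retains the 15 features belonging to the remaining five sensors.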


TABLE 2. Results for 20 state PHM experiments. MAP and MAP-beta consistently select force in the Y direction for removal, which is the more expensive sensor. ML and VB, which do not consider cost, select varying sensors for removal. The average RMSE and ± 1 standard deviation are given for each formulation.

TABLE 3. Results for 10 state PHM experiments. MAP and MAP-beta consistently remove force in the Y direction, which is the more expensive sensor. VB and ML, which do not consider cost, select varying sensors for removal. The average RMSE and ± 1 standard deviation are given for each formulation.

models (22.14 µm). In the experiments using 10 states, VB yields the lowest average RMSE for the full (20.77 µm) and reduced models (21.52 µm). In the experiments using 5 states, MAP yields the lowest average RMSE for the full model (24.07 µm), and MAP-beta yields the lowest for the reduced model (24.59 µm). For all three assumed numbers of states, MAP produces full and reduced models with average RMSE below 27 µm, while the results for ML, MAP-beta, and VB vary depending on the number of states. For this application, when only prediction accuracy is considered, ML, MAP, and VB are essentially interchangeable depending on the initial assumptions, while MAP-beta gives poor accuracy results for the 10 state models. MAP and MAP-beta stand out during the feature selection process. The MAP-beta formulation gives the smallest estimated feature saliency for all features. The estimated feature saliencies for the MAP algorithm are smaller than either ML or VB, but larger than MAP-beta. This is expected due to the use of

the prior, and the fact that we have previously established that MAP-beta can underestimate ρ. For ML and VB, the lack of a prior results in an estimated ρ for each sensor greater than 0.75. In general, ρ close to 1 would indicate that the feature is relevant and should not be removed from the model. However, our goal is to remove one sensor in order to reduce the cost of the sensor setup. The total cost of all six sensors is $10,800. When a vibration sensor is removed, the total cost of the remaining sensors is $9600. When a force sensor is removed, the total cost is $8400. For all the MAP and MAP-beta tests, the force in the Y direction is removed during feature selection. Further, both MAP algorithms consistently remove the same sensor when the training set is changed. The sensor removed by the ML and VB algorithms varies when the number of states and the training set are changed. However, the removed sensor does stabilize for ML and VB when the number of states is reduced to 5. MAP and MAP-beta consistently yield a reduced model with the lowest


TABLE 4. Results for 5 state PHM experiments. MAP and MAP-beta consistently remove force in the Y direction, which is the more expensive sensor. ML and VB, which do not consider cost, remove a less expensive sensor, vibration in the Y direction. The average RMSE and ± 1 standard deviation are given for each formulation.

FIGURE 3. Plots for Tool 6 of the 20-state model. Wear is the true wear measurement, Full is the full model using 18 features, Reduced is the reduced model using 15 features.

total cost of sensors. The ML and VB algorithms do not give a clear indication of which sensor should be purchased, while MAP consistently indicates that force in the Y direction can be eliminated from the sensor setup. The ML and VB algorithms suggest three different sensors can be removed, depending on the number of states and training sets. As the number of states decreases, the RMSE generally increases. The models with 20 states fit the data better. We decrease the number of states to 5 to show the effect of the number of states on ρ. The estimated saliencies increase with the number of states because, in general, the probability of a series of observations coming from a multi-modal Gaussian distribution (the state-dependent distribution) is greater than the probability of them coming from a single Gaussian (the state-independent distribution). The priors force ρ_l towards 0

and help discriminate between relevant and irrelevant features. This is further support for using a MAP formulation over ML or VB when modeling HMMs with a larger state space. The VB algorithm does not estimate a left-to-right Markov chain. During testing, the VB models can predict decreasing wear estimates. Figure 3 displays plots of the true wear curve for Tool 6 and the wear curves estimated by the full and reduced models. The three EM-based formulations all predict non-decreasing wear curves, while the VB method predicts curves that can transition to lower states. The figure displays only one example of VB predicting decreasing wear estimates, but this occurs in almost every wear experiment when using the VB formulation. Further, a proposed advantage of VB is the ability to estimate the number of states. In this
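The saliency-inflation effect described above can be illustrated numerically: bimodal data assign a higher likelihood to a two-component (state-dependent) mixture than to a single (state-independent) Gaussian. A self-contained check, entirely our own construction:

```python
import numpy as np

rng = np.random.default_rng(1)
# Bimodal data: two well-separated modes, as produced by a two-state HMM.
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])

def gauss_logpdf(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))

# Single Gaussian fit, playing the role of the state-independent density q(.).
ll_single = gauss_logpdf(x, x.mean(), x.std()).sum()

# Equal-weight two-component mixture at the true modes, playing the role of
# the state-dependent densities r(.) marginalized over states.
comp = np.stack([gauss_logpdf(x, -5, 1), gauss_logpdf(x, 5, 1)])
ll_mix = np.log(0.5 * np.exp(comp).sum(axis=0)).sum()
# ll_mix exceeds ll_single, which is what pushes saliency estimates towards 1.
```

With more states the state-dependent mixture becomes more flexible, so without a prior the likelihood alone favors ρ near 1 even for weakly relevant features.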


TABLE 5. FSEDHMM results for 5 state PHM experiments. MAP removes force in the Y direction, while ML removes vibration in the Y direction, which is less expensive than the force sensor. The average RMSE and ± 1 standard deviation are given for each formulation.

experiment, the estimated number of states by VB is the same as the initial number of states.

C. PHM DATA - FSEDHMM

The ML and MAP formulations of the FSEDHMM are tested on the PHM data set. The experimental setup is the same as in the FSHMM experiment. There is no version of the VB formulation for EDHMMs to test, and the MAP-beta formulation for the FSEDHMM is omitted. Only the 5 state model is evaluated. The ML and MAP FSEDHMMs are initialized with the same values. The initial state probability and the transition matrices are not estimated. We assume a left-to-right model. For an EDHMM, a left-to-right model has a probability of 1 of transitioning to the next state, and a zero probability for all other state transitions. The initial distribution is also known to be 1 for the first state, and 0 for all other states, because we assume each tool is new and has no wear. The state-dependent means µ_il are equally spaced between −2 and 2. The state-dependent standard deviation σ_il is 1 for all states and features. The state-independent parameters are calculated from the training data. The initial values for λ_i are (T + 1)/I. For MAP, the prior hyperparameters are m = µ_init, s = ζ = η = ν = ψ = 0.5, b = 0, c = 1, o = λ_init, and ϖ = 1. We use half the assumed cost of each sensor for k_l (k_l = 1200 for the force features and k_l = 600 for the vibration features). The convergence threshold for this experiment is 10^−6. The RMSE values for the full and reduced models are in Table 5. The MAP formulation outperforms the ML formulation in terms of average RMSE and feature reduction. The ML formulation selects vibration in the Y direction for removal, and the MAP formulation selects force in the Y direction. The average estimated ρ for the removed features is lower for MAP than for ML, which is to be expected due to the prior. The reduced model for ML performs better than the full model, while the full model slightly outperforms the reduced model for MAP.
When comparing the FSEDHMM with the FSHMM results in Table 4, the ML FSEDHMM has a lower average RMSE for the full model, but a higher average RMSE for the reduced model. The MAP FSEDHMM has a lower average RMSE for both the full and reduced models. The features selected

for removal are the same for the FSEDHMM and FSHMM. The difference in the average RMSE of the FSEDHMM and FSHMM might not be significant; however, the MAP formulation continues to select the more expensive sensor for removal. One advantage of the FSEDHMM is the ability to interpret the estimated values of λ as the average time in each state. For example, the estimated λ for ML using Tools 1 and 4 as the training data are [56.52, 82.99, 80.15, 63.22, 80.31]. On average, this model will spend 56 cuts in the first state. All values of the estimated λ are in Table 6. The λ estimated by MAP more closely reflect a true wear curve than the ML estimates, because the average number of cuts spent in the first state, which can be interpreted as the initial wear-in state, is shorter than the average number of cuts in all other states.

TABLE 6. FSEDHMM λ's for 5 state PHM experiments. In this table, the tools indicate the training set. These parameters can be interpreted as the average time in each state.

D. MICROSOFT KINECT DATA - FSHMM

We also consider a data set containing Microsoft Kinect data. This data set was collected by observing a worker engaged in a painting process in a manufacturing setting. This painting process can be described as a sequence of six tasks, which are labeled as 'Fetch', 'Paint', 'Dry', 'Load', 'Walk', and 'Other'. These tasks are performed repeatedly by the worker. Our objective is to infer the task (the hidden state) that the worker is engaged in at each time step from body position observations collected by the Kinect. As observable features, the Kinect records the X, Y, and Z coordinates of ten upper body joints, labeled 'Head', 'Shoulder Center', 'Shoulder Left', 'Elbow Left', 'Wrist Left', 'Hand Left', 'Shoulder Right', 'Elbow Right', 'Wrist Right', and 'Hand Right'. The set is composed of observations collected over the course of one hour. The first two-thirds of the data are used for training the models, and the last third is reserved


FIGURE 4. Blue bars indicate saliencies ≥ 0.9 and red bars indicate saliencies < 0.9.

for testing the accuracy of our inferred models. The first 2000 observations of the training set are used to calculate initial parameter estimates. The task being performed by the worker at each time step is actually known, but it is only used for calculating the initial parameter estimates and for validation. The true tasks are not used when training the models using EM or VB. All four algorithms are initialized with the same values. The parameters for r(·|·) are initialized by calculating the state-dependent mean and standard deviation from the initialization set. The parameters for q(·|·) are the means and standard deviations of the data. The transition probabilities are initialized by counting the number of transitions in the initialization set. The MAP hyperparameters are α = 1, β = 1, m = µ_init, s = 0.25, ζ = 0.25, η = 1, b = ε_init, c = 0.5, ν = 0.25, ψ = 0.5, and k = 15,000. (k is set very high to drive saliencies towards zero.) The MAP-beta hyperparameters are the same as MAP except that k_l = 1 and κ_l = 15,000. The cost of collecting each feature is the same for this data set. Therefore, the prior weights in MAP and MAP-beta are used to remove a larger set of features than the ML or VB algorithms. The VB parameters and the convergence threshold are the same as those used in the PHM experiments. We perform three validation experiments. In all experiments, we use the four algorithms to infer the parameters of models. In the first experiment, we treat all 30 features as relevant and test the classification accuracy of the four models on a test set. In the second experiment, we remove features with estimated ρ below a given threshold. In the

third experiment, we sequentially remove features based on the estimated ρ. We then test the reduced models and calculate classification accuracy. The test set is composed of approximately 20 minutes worth of data sampled at a rate of 30 frames per second, resulting in 30,102 time steps. Using the models obtained by each of the algorithms, we infer which of the six tasks was being performed in each time step and compare the inferred task with the known task in that time step. The Viterbi algorithm using (26) is implemented for task prediction. In the first experiment, the fractions of correctly classified states on the test set for ML, MAP, MAP-beta, and VB are 0.7572, 0.7473, 0.7590, and 0.6415. ML, MAP, and MAP-beta all outperform VB by more than 10%. MAP-beta yields the most accurate model using all 30 features. In the second experiment, features with estimated ρ below a removal threshold of 0.9 are removed from the model during feature selection. Figure 4 shows plots of ρ, where features below the threshold are marked in red. Using a truncated exponential prior with support on [0, 1] in the MAP formulation forces ρ estimates towards zero and allows k to be chosen so that relevant and irrelevant features distinguish themselves. All estimated saliencies using MAP-beta are below the 0.9 threshold. ML and VB tend to assign higher estimated values of ρ, making it more difficult to perform feature selection. The fraction of correctly classified states for the reduced ML model is 0.7600, the reduced MAP model is 0.7606, and the reduced VB model is 0.6561. The fraction of correctly classified states for the reduced MAP-beta model is 0.4964,
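The threshold-based removal in the second experiment reduces to a one-line partition of the feature indices. A trivial sketch (our own helper name):

```python
def reduced_feature_set(rho, threshold=0.9):
    """Partition feature indices by the removal threshold: features whose
    estimated saliency falls below the threshold are removed."""
    keep = [l for l, r in enumerate(rho) if r >= threshold]
    removed = [l for l, r in enumerate(rho) if r < threshold]
    return keep, removed
```

As the text notes, the same threshold does not work for every algorithm: with MAP-beta's largest saliency at 0.6525, a 0.9 threshold removes everything, while ML's smallest saliency of 0.8068 would survive a 0.5 threshold.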


TABLE 7. Order of features removed during sequential reduced model testing.

which is to be expected as no features are used in the estimation. The model predicts every observation as 'Paint', and the classification accuracy reflects the proportion of 'Paint' in the test set. For a better comparison, we remove features with estimated ρ below 0.5 and retest MAP-beta. This yields a fraction of correctly classified states of 0.7480. The largest estimated ρ for MAP-beta is 0.6525, while the smallest for VB and ML are 0.7402 and 0.8068. Therefore, we cannot select a single removal threshold that is acceptable for every model. The reduced ML and VB models each remove 5 features, while the reduced MAP model removes 18 features. The reduced models for these three algorithms have improved accuracy over their corresponding full models. When the removal threshold is lowered to 0.5 for MAP-beta, only 6 features are removed, and the accuracy is less than that of the corresponding full model. In the third experiment, features are removed sequentially based on the estimated saliencies. More specifically, for the models removing a single feature, the feature with the lowest estimated ρ is removed and the model is tested. For the models removing two features, the two features with the lowest estimated saliencies are removed and the models are tested. The process continues until all features have been removed and tested, resulting in 30 reduced models per algorithm. The results are displayed in Figure 5. For this third experiment, the EM-based algorithms dominate the VB algorithm until over 25 features are removed from the model. At this point, the algorithms' accuracies

FIGURE 5. Sequential feature removal results for ML, MAP, MAP-beta and VB algorithms.

converge and begin to drop dramatically. When focusing on the EM algorithms, ML has a higher accuracy than MAP and MAP-beta until more than 25 features are removed. MAP and MAP-beta yield similar results, with MAP-beta outperforming MAP in 16 of the 30 reduced models. However, once 24 features have been removed, the accuracy of MAP-beta drops to around 66%. At this point, underestimating relevant features starts to affect the performance of the algorithm. ML and MAP do not have this problem and continue with performance measurements above 70%. Table 7 contains the order in which the features are removed for each algorithm. It is clear that each algorithm removes features
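The sequential-removal experiment amounts to sorting features by estimated saliency and testing nested subsets. A short sketch of that bookkeeping (our own function names):

```python
import numpy as np

def sequential_removal_order(rho):
    """Feature indices sorted by estimated saliency, lowest first: the
    order in which features are removed in the sequential experiment."""
    return list(np.argsort(rho))

def nested_subsets(rho):
    """The nested feature subsets: for k = 1..L, remove the k features
    with the lowest saliencies and keep the rest."""
    order = sequential_removal_order(rho)
    n = len(order)
    return [sorted(set(range(n)) - set(order[:k])) for k in range(1, n + 1)]
```

With 30 features this yields the 30 reduced models per algorithm described above; each subset is retested with the Viterbi decoder to produce the curves in Figure 5.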


in a different order. However, the first feature removed is always the 'Right Hand' in the Y direction. Further, when the same feature subset is used for testing, as in the models that remove 1 and 3 features, the EM-based algorithms outperform the VB algorithm. This indicates that the EM algorithms in general give better parameter estimates than the VB algorithm for this data set. Again, the VB algorithm does not allow transition probabilities of the Markov chain to be 0. For this implementation, specific transitions between tasks could be impossible. For instance, a worker might never go from 'Paint' to 'Load'. The VB algorithm can still allow for such a transition, but if it never occurs in the training set, the ML and MAP algorithms will set its probability to 0.

E. MICROSOFT KINECT DATA - FSEDHMM

The FSEDHMM is also tested on the Kinect data set. The data are divided into training and testing sets as in all previous experiments: the first two-thirds are used for training, the last third is reserved for testing, and the first 2000 observations of the training set are used for initializing the algorithm. This testing procedure only provides point estimates of the accuracy. Due to the significant increase in the computation required to train the FSEDHMM, the training set is broken into smaller sequences so it can be processed in parallel. More specifically, the probabilities for each sequence in the E-step of the EM algorithm can be calculated independently and in parallel. Each sequence is approximately four minutes worth of data. Without parallel processing, the amount of time needed to train the FSEDHMM on the Kinect data set is on the order of months. However, there is a trade-off to segmenting the data sets in this fashion: it introduces discontinuities into the data. For the first observation in each of the smaller sequences, the algorithm has no knowledge of the data that came before it. For the last observation, there is no knowledge of the data that comes after it. As the number of sequences increases, this effect on the data increases. We believe four-minute sequences adequately balance the trade-off between training time and adding discontinuities to the data. ML and MAP are initialized with the same values. The parameters for r(·|·) are initialized by calculating the state-dependent mean and standard deviation from the initialization set. The parameters for q(·|·) are the means and standard deviations of the training data. The transition probabilities are initialized by counting the number of transitions in the initialization set. The duration distribution parameters are calculated from the initialization set by averaging the time spent in each state before a transition.
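The segmentation step above is simple to express in code. This is our own sketch of the chunking, not the authors' parallel implementation; the 30 fps rate and four-minute window come from the text:

```python
import numpy as np

def split_for_parallel_estep(Y, fps=30, minutes=4):
    """Split one long observation sequence into ~4-minute chunks so the
    forward-backward passes of the E-step can run independently in parallel.
    Trade-off: every chunk boundary introduces a discontinuity, since the
    algorithm sees no data before the first or after the last observation
    of each chunk."""
    chunk = fps * 60 * minutes
    return [Y[s:s + chunk] for s in range(0, len(Y), chunk)]
```

Each chunk can then be handed to a separate worker (e.g. via `multiprocessing.Pool`), and the expected sufficient statistics are summed across chunks before the M-step.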
The MAP hyperparameters are α_ij = 1, β_i = 1, m = µ_init, s = 0.25, ζ = 0.25, η = 1, b = ε_init, c = 0.5, ν = 0.25, ψ = 0.5, o = λ_init, ϖ = 1, and k = 15,000. The maximum duration is 2000, and the minimum duration is 30. These values are selected based on the range of times for each task. The maximum duration is just over one minute. The total work sequence should take around 30 seconds to complete, so none of the primary tasks should last for one minute. The 'Other' state is the only possible task

TABLE 8. FSEDHMM λ’s for Kinect data. These parameters can be interpreted as the average time in each task.

that can last longer than one minute, but this rarely occurs. The minimum duration is one second, and each task should take at least this amount of time to complete. The convergence threshold for this experiment is increased to 10^−4 in order to decrease training time. The fraction of correctly classified observations, or the accuracy, for the ML full model is 0.8056, and the accuracy for the reduced model is 0.8067. The accuracies for the MAP full and reduced models are 0.8193 and 0.8037. The full MAP model gives the best accuracy, and the reduced MAP model gives the worst. The full and reduced models are less than 2% apart. Overall, the FSEDHMM performs better on the Kinect data than the FSHMM. ML removes 5 features, while MAP removes 19, one more feature than its FSHMM counterpart. This is possibly due to the higher convergence threshold. While the reduced MAP model does not perform as well as the full model or either ML model in terms of accuracy, it removes more than half of the features with less than a 2% drop in accuracy. As with the PHM data, we can examine the estimates of the duration parameters to gain knowledge about the process. The parameter estimates for λ are displayed in Table 8. On average, 'Walk' takes the least amount of time. This makes sense because the distance between the painting booth and the drying rack is about a step. On average, 'Load' takes the most time because the worker can remain at the loading rack for a significant amount of time performing several tasks.

VI. CONCLUSION

In this paper, a joint feature selection and parameter estimation algorithm is presented for HMMs and HSMMs. It is assumed that the number of hidden states is known for both models. The MAP approach has the added advantage of a prior on ρ that can be adjusted until the estimated saliencies for less relevant features begin to fall. Further, the priors on ρ are used to incorporate the cost of collecting a feature into the feature selection process. When predicting state sequences for given data, point estimates of the parameters are used rather than the full posterior. Therefore, calculating the full posterior as in the VB approach is not necessary. The proposed MAP formulation is compared to ML, MAP-beta, and VB approaches for HMMs on synthetic and collected data. For the HSMMs, the MAP formulation is only compared to the ML formulation. The MAP-beta formulation is omitted because the HMM experiments demonstrate the advantages of MAP when using a prior. There is no published VB approach for HSMMs. The experiments on the synthetic data show that all four of the HMM algorithms give accurate parameter estimates for the transition


probabilities and the emission distribution parameters. However, ML and VB prefer to assign higher saliency values, while the truncated exponential prior in MAP pushes the saliencies towards zero. The beta prior assigns low values to irrelevant features but tends to underestimate the value of relevant features. These points are reiterated by the experiments on the real data sets. The EM methods are less computationally expensive than the VB formulation and converge in fewer iterations. Experiments on synthetic data confirm the differences between ML and MAP when moving to a semi-Markov process, but these results are not presented in this study. The experiments on the PHM tool wear data set demonstrate that the cost of collecting a feature can be integrated into the model estimation and feature selection process. When compared to the other tested algorithms, the MAP algorithm consistently yields a reduced model with a smaller total cost of sensors without significantly affecting prediction accuracy, and it is robust to changes in the training set and the number of states. The MAP-beta formulation yields the same feature set as MAP but can produce larger RMSE. When different training sets are used, ML and VB produce varying results. The ML and VB algorithms do not clearly inform the user which sensor to purchase. The MAP and MAP-beta formulations produce estimated saliencies smaller than ML or VB. Further, the VB formulation does not allow for a left-to-right Markov chain, which is desirable in tool wear applications. The VB approach can be used to determine the number of underlying states that best predict the data in the case where the number of states is unknown. However, for this tool wear data set, the number of initial states always equals the number of states estimated by VB, negating this advantage. The HSMM models outperform the corresponding FSHMM models when assuming 5 states. Further, MAP continues to select the more expensive feature for removal.
MAP produces the highest accuracy for the reduced model on the Kinect data set while also removing the largest number of features when the thresholding technique is used for feature selection. ML and VB remove 5 features, whereas MAP removes 18 features. At the chosen removal threshold, MAP-beta removes all features and produces a poor reduced model, but it does give the highest accuracy of all the full models. The ML and MAP formulations yield full and reduced models with higher accuracy than VB. When features are removed sequentially based on the estimated saliencies, the EM-based algorithms all outperform the VB algorithm. The cost associated with collecting each feature is the same for all features, so the prior is used to drive saliency towards 0 and help discriminate between relevant and irrelevant features. Both FSEDHMM models outperform the MAP FSHMM with accuracies above 0.8. Further, MAP FSEDHMM removes one more feature from the reduced model than the corresponding MAP FSHMM model. In conclusion, when the number of states is known and it is desired to reduce the number of features, the MAP formulation gives both accurate parameter estimates and

feature saliencies weighted towards zero. MAP gives the added advantage of allowing the cost of collecting a feature to be included in the feature selection process. We prefer the exponential prior over the beta prior on the feature saliencies. MAP is also robust to changes in the training data and modeling assumptions. The feature saliency technique can be extended to HSMMs to offer a specific feature selection technique for semi-Markov processes. REFERENCES [1] S. Adams, P. A. Beling, and R. Cogill, ‘‘Infusing prior knowledge into hidden Markov models,’’ in Proc. ECMLPKDD Doctoral Consortium, 2015, pp. 23–32. [2] F. I. Bashir, A. A. Khokhar, and D. Schonfeld, ‘‘Object trajectory-based activity classification and recognition using hidden Markov models,’’ IEEE Trans. Image Process., vol. 16, no. 7, pp. 1912–1919, Jul. 2007. [3] J. A. Bilmes, ‘‘A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models,’’ Int. Comput. Sci. Inst., vol. 4, no. 510, p. 126, 1998. [4] R. J. Boys and D. A. Henderson, ‘‘A comparison of reversible jump MCMC algorithms for DNA sequence segmentation using hidden Markov models,’’ Comput. Sci. Statist., vol. 33, pp. 35–49, 2001. [5] S. P. Chatzis and D. I. Kosmopoulos, ‘‘A variational Bayesian methodology for hidden Markov models utilizing Student’s-t mixtures,’’ Pattern Recognit., vol. 44, no. 2, pp. 295–306, Feb. 2011. [6] G. Consonni and J.-M. Marin, ‘‘Mean-field variational approximate Bayesian inference for latent variable models,’’ Comput. Statist. Data Anal., vol. 52, no. 2, pp. 790–798, Oct. 2007. [7] L. Gao and X. S. Wang, ‘‘Feature selection for building cost-effective data stream classifiers,’’ in Proc. 5th IEEE Int. Conf. Data Mining, Nov. 2005, pp. 1–4. [8] J.-L. Gauvain and C.-H. Lee, ‘‘Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,’’ IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291–298, Apr. 1994. 
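The two feature selection mechanisms compared above can be summarized concretely. After parameter estimation, each feature has an estimated saliency in [0, 1]; thresholding drops every feature whose saliency falls below a cutoff in one pass, while sequential removal ranks features from least to most salient and drops them one at a time, re-evaluating the reduced model after each removal. The sketch below illustrates both selection rules only; the saliency values and the threshold are illustrative placeholders, not estimates or settings from the paper's experiments.

```python
# Hypothetical feature saliencies from a fitted feature saliency HMM
# (values are illustrative, not taken from the paper's experiments).
saliencies = [0.92, 0.05, 0.71, 0.33, 0.01, 0.88]

# Thresholding: keep only features whose saliency exceeds a cutoff.
threshold = 0.5  # illustrative cutoff, not the paper's removal threshold
kept = [i for i, rho in enumerate(saliencies) if rho > threshold]

# Sequential removal: rank features from least to most salient, so the
# lowest-saliency feature is the first candidate for removal before the
# reduced model is re-estimated and re-evaluated.
removal_order = sorted(range(len(saliencies)), key=lambda i: saliencies[i])

print(kept)           # [0, 2, 5]
print(removal_order)  # [4, 1, 3, 2, 5, 0]
```

Under either rule, a prior that pushes irrelevant saliencies toward 0 widens the gap between kept and removed features, which is why the choice of prior matters for the reduced model.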

S. Adams et al.: Feature Selection for HMMs and HSMMs


STEPHEN ADAMS received the master’s degree in statistics and the Ph.D. degree from the Systems and Information Engineering Department, University of Virginia (UVA), in 2015. He is currently a Research Scientist with the Systems and Information Engineering Department, UVA. Prior to joining the Ph.D. program, he was with UVA’s Environmental Health and Safety Department. He is currently part of the Adaptive Decision Systems Laboratory, UVA. His research is applied to several domains, including activity recognition, prognostics and health management for manufacturing systems, and predictive modeling of destination given user geo-information data.

VOLUME 4, 2016

PETER A. BELING received the Ph.D. degree in operations research from the University of California at Berkeley. He is currently an Associate Professor with the Department of Systems and Information Engineering, University of Virginia (UVA). He is the Director of the UVA Adaptive Decision Support Laboratory and the Co-Director of the UVA Financial Decision Engineering Research Group. His research interests are in the area of data analytics and decision-making in complex systems, with an emphasis on adaptive decision support systems. His research has found application in a variety of domains, including prognostics and health management, mission-focused cyber security, and financial decision-making.

RANDY COGILL (M’07) received the Ph.D. degree in electrical engineering from Stanford University, USA, in 2007. From 2007 to 2014, he was an Assistant Professor with the Department of Systems Engineering, University of Virginia, USA. He is a Researcher with the Control, Optimization, and Decision Sciences Group, IBM Research, Ireland. His expertise is in the fields of stochastic modeling, optimization, control theory, and network algorithms. He has authored over 30 peer-reviewed publications, and has served as the Principal Investigator on a number of foundation and industry-sponsored projects. At IBM, he works on the technical development of data and analytics-driven solutions, primarily in the transportation and automotive domains.
