Robust Multi-Sensor Classification via Joint Sparse Representation

Nam H. Nguyen, Nasser M. Nasrabadi, Fellow, IEEE, and Trac D. Tran, Senior Member, IEEE

This work has been partially supported by the National Science Foundation (NSF) under Grants CCF-1117545 and CCF-0728893; the Army Research Office (ARO) under Grant 58110-MA-II and Grant 60219-MA; and the Office of Naval Research (ONR) under Grant N102-183-0208. Nam H. Nguyen and Trac D. Tran are with the Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA (email: nam, [email protected]). Nasser M. Nasrabadi is with the U.S. Army Research Laboratory, Adelphi, MD 20783 USA (email: [email protected]).


Abstract — In this paper, we propose a novel multi-task multivariate (MTMV) sparse representation method for multi-sensor classification, which simultaneously takes into account correlations as well as complementary information between heterogeneous sensors while considering joint sparsity within each sensor's observations. This approach can be seen as a generalized model of the multi-task and multivariate Lasso, in which all the multi-sensor data are jointly represented by a sparse linear combination of the training data. We further extend our MTMV model by including an environmental clutter noise term that is also assumed to be sparse in the feature domain. An efficient algorithm based on the alternating direction method of multipliers is proposed for both models. Extensive experiments are conducted on real data sets, and the results are compared with conventional discriminative classifiers to verify the effectiveness of the proposed methods in the application of automatic border patrol, where one often has to discriminate between human and animal footsteps.

I. INTRODUCTION

Multi-sensor fusion has received a considerable amount of attention over the past few years for both military and non-military tasks [1], [2], [3]. A particular interest in multi-sensor fusion is classification, where the ultimate question is how to take advantage of related information from different sources (tasks) recording the same physical event to improve classification performance. A variety of approaches have been proposed in the literature to answer this question [4], [5]. These methods mostly fall into two categories: decision in - decision out (DI-DO) and feature in - feature out (FI-FO) [3]. In [4], the authors investigated the DI-DO method on a vehicle classification problem using data collected from acoustic and seismic sensors. They proposed to perform local classification (decision) for each sensor signal by conventional methods such as the Support Vector Machine (SVM); these local decisions are then combined via a maximum a posteriori (MAP) estimator to make the final classification decision, hence the name DI-DO. In [5], the FI-FO method is studied for vehicle classification using both visual and acoustic sensors. The authors proposed a method to extract temporal gait patterns from both sensor signals and use them as inputs to an SVM classifier. Furthermore, they compared the DI-DO and FI-FO approaches on their dataset and showed a higher discrimination performance for the FI-FO approach over its DI-DO counterpart. In this paper, we study the multi-sensor classification problem under the FI-FO category. We employ appealing techniques from signal processing, particularly sparse signal representation, to considerably improve the classification performance. In signal processing, most natural signals are inherently sparse in


certain bases or dictionaries, where they can be approximately represented by only a few significant components carrying the most relevant information. In other words, the intrinsic signal information usually lies in a low-dimensional subspace, and the semantic information is often encoded in the sparse representation. Especially with the emergence of the Compressed Sensing (CS) framework [6], [7], sparse representation and the related optimization problems that use sparsity as a prior, called sparse recovery, have increasingly attracted the interest of researchers in diverse disciplines. Sparsity has been successfully employed in a large number of inverse problems, where it acts as a strong prior to alleviate the ill-posed nature of the problem. Recent research has pointed out that sparse representation is also useful for discriminative applications [8], [9], [10]. These applications rely on the crucial observation that a test sample can be represented as a linear combination of the training samples that belong to the same class as the test sample, but not of those from other classes. Thus, if the dictionary is constructed from all the training samples of all the classes, a test sample can be sparsely represented by only a few columns of this dictionary. Therefore, the sparse coefficient vector, which is recovered efficiently via $\ell_1$-minimization techniques, can naturally be considered as the discriminative factor. In [9], the authors successfully applied this idea to the face recognition problem. Since then, many more sophisticated techniques have been developed and applied to various fields such as hyperspectral target detection [11], acoustic signal classification [12], and visual classification [8], [13], [14].
For instance, the authors of [12] proposed a joint sparse model for acoustic signal classification, which exploits the fact that multiple observations from the same class can be simultaneously represented by a few columns of the training dictionary; thus, the coefficient vectors associated with these observations should share the same sparsity pattern. Similarly, [14] investigated a multi-task model for visual classification, which also assumes that tasks belonging to the same class share the same sparse support on their coefficient vectors. To improve classification performance, all of these models require efficient algorithms that take the sparsity into account as prior information. In this paper, we consider a multi-sensor classification problem, focusing on discriminating between human and non-human footsteps. The experimental setup is as follows: a set of four acoustic, three seismic, one passive infrared (PIR), and one ultrasonic sensor is used to measure the same physical event simultaneously in the field. The ultimate goal is to detect whether the event involves human-only or human-animal footsteps. As opposed to previous approaches to this problem, in which only a single sensor is used to perform the classification [15], [16], [17], in this paper we propose a novel regularized regression method, namely a multi-task multivariate Lasso, which effectively incorporates both the multi-task and the multivariate Lasso ideas [8], [18]. This technique imposes joint-sparsity constraints both


within each task (sensor) and across multiple tasks. Furthermore, we extend our model to deal with sparse environmental noise of arbitrarily large magnitude. This high-energy environmental noise frequently appears in sensor data due to the unpredictable or uncontrollable nature of the environment during the data collection process. Although our technique is designed for military purposes, it is not restricted to this specific application. Rather, it can be applied to any classification or discrimination problem where the data are collected from multiple sources.

The remainder of this paper is organized as follows. Section II briefly introduces various sparsity models, with the main focus on our proposed multi-task multivariate joint sparse representation (MTMV-SRC) and MTMV-SRC with sparse environmental noise (MTMV-SRC+E) models in subsections C and D, respectively. Section III presents a fast and efficient algorithm to solve the convex optimization problems arising from these models. Extensive experiments are shown in Section IV, and conclusions are drawn in Section V.

II. SPARSITY MODELS

Consider a multi-task (multi-sensor) C-class classification problem. Suppose we have a training set of p samples, each with D different feature modalities (sensors). For each sensor $i = 1, \ldots, D$, we denote by $X^i = [X^i_1, X^i_2, \ldots, X^i_C]$ an $n \times p$ dictionary consisting of C sub-dictionaries $X^i_j$ corresponding to the C classes. Here, each sub-dictionary $X^i_j = [x^i_{j,1}, x^i_{j,2}, \ldots, x^i_{j,p_j}] \in \mathbb{R}^{n \times p_j}$ represents the set of training data from the i-th sensor labeled with the j-th class. Accordingly, $x^i_{j,k}$, usually referred to as an atom of the dictionary, is the k-th training sample from the i-th sensor with the j-th class label. Note that $p_j$ is the number of training samples for the j-th class and n is the feature dimension of each sample; therefore, the total number of samples is $p = \sum_{j=1}^{C} p_j$. Given a test sample Y comprising D tasks $\{Y^1, Y^2, \ldots, Y^D\}$, where each task from sensor i, $Y^i$, consists of $d_i$ observations, $Y^i = [y^i_1, y^i_2, \ldots, y^i_{d_i}] \in \mathbb{R}^{n \times d_i}$, we would like to decide to which class the sample Y belongs.

A. Sparse representation for classification (SRC)

We first review the single-sensor sparse representation for classification method. For simplicity, we drop the subscript i denoting the i-th sensor. In this sparse representation model, it is assumed that the training samples belonging to the same class approximately lie on a low-dimensional subspace. Given a dictionary representing C distinct classes, $X = [X_1, X_2, \ldots, X_C]$, where the j-th class sub-dictionary $X_j$ has $p_j$ training samples $\{x_{j,k}\}_{k=1,\ldots,p_j}$, a new test sample y belonging to the j-th class will approximately lie in the linear subspace spanned by the training samples of the j-th class:

$$y = Xw + n, \qquad (1)$$

where w is the sparse coefficient vector whose entries are zero except for those associated with the atoms of the j-th class sub-dictionary, $w = [0^T, \ldots, 0^T, w_j^T, 0^T, \ldots, 0^T]^T$, and n is a small noise term due to the imperfection of the test sample. To obtain the sparse vector w, it is natural to consider the optimization

$$\hat{w} = \arg\min_{w} \|w\|_0 \quad \text{subject to} \quad \|y - Xw\|_2 \leq \epsilon, \qquad (2)$$

where $\|w\|_0$ is the $\ell_0$-norm, defined as the number of nonzero entries of w, and $\epsilon$ is the noise energy.

This $\ell_0$-minimization can be interpreted as finding the sparsest solution obeying a quadratic constraint. However, (2) is well known to be NP-hard due to the non-convexity and non-differentiability of the $\ell_0$-norm. Many alternative approaches have been proposed to approximately solve (2), such as greedy pursuit algorithms (Orthogonal Matching Pursuit [19], Subspace Pursuit [20]) and the Iterative Hard Thresholding (IHT) algorithm [21]. Alternatively, the $\ell_0$-minimization can be solved efficiently by recasting it as an $\ell_1$-based convex program [6], [7]. In this paper, we favor the $\ell_1$-minimization approach:

$$\hat{w} = \arg\min_{w} \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_1, \qquad (3)$$

where $\lambda$ is a positive regularization parameter and the $\ell_1$-norm is $\|w\|_1 = \sum_{i=1}^{p} |w_i|$. This optimization is also known as the Lasso [22], which can be solved efficiently in polynomial time by many convex optimization techniques [23], [24], [25], [26].

Once the coefficient vector $\hat{w}$ is estimated, the class label of y is determined by the minimal residual between y and its approximation from each class sub-dictionary:

$$\hat{j} = \arg\min_{j} \|y - X\delta_j(\hat{w})\|_2, \qquad (4)$$

where $\delta_j(\cdot)$ is a vector indicator function, defined by keeping the coefficients corresponding to the j-th class and setting all others to zero, i.e., $\delta_j(w) = [0^T, \ldots, 0^T, w_j^T, 0^T, \ldots, 0^T]^T$.
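As an illustration of the SRC pipeline in (3)-(4), the sketch below solves the Lasso with a plain ISTA (proximal-gradient) loop and then applies the minimal-residual rule. This is a minimal sketch on synthetic data rather than the authors' implementation; the solver choice and the value of λ are our own illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Entry-wise soft-thresholding: the proximal operator of the l1-norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Solve (3): min_w 0.5*||y - Xw||_2^2 + lam*||w||_1, via ISTA."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1/L, L = Lipschitz const. of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                # gradient of the quadratic term
        w = soft_threshold(w - step * grad, step * lam)
    return w

def src_classify(X, class_of_atom, y, lam=0.05):
    """SRC decision rule (4): pick the class with the smallest residual."""
    w = lasso_ista(X, y, lam)
    labels = np.array(class_of_atom)
    residuals = {}
    for j in set(class_of_atom):
        w_j = np.where(labels == j, w, 0.0)     # delta_j(w): keep class-j coefficients
        residuals[j] = np.linalg.norm(y - X @ w_j)
    return min(residuals, key=residuals.get)

# Toy example: two classes of 10 unit-norm atoms each.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 20))
X /= np.linalg.norm(X, axis=0)
labels = [0] * 10 + [1] * 10
y = 0.7 * X[:, 12] + 0.3 * X[:, 15]             # test sample built from class-1 atoms
print(src_classify(X, labels, y))               # class-1 atoms explain y, so this prints 1
```

The residual comparison works because the recovered coefficients concentrate on the atoms that actually generated the test sample.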


Fig. 1. d test segments of the same size from one seismic test signal; cepstral features are then extracted from these segments.
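The segmentation illustrated in Fig. 1 — partitioning a test signal into d equal-length, overlapping segments — could be implemented as follows; the segment length and overlap used here are hypothetical values, not the settings of the experiments.

```python
import numpy as np

def segment_signal(signal, seg_len, overlap):
    """Partition a 1-D signal into overlapping segments of length seg_len.

    Consecutive segments share `overlap` samples; trailing samples that do
    not fill a full segment are dropped.
    """
    hop = seg_len - overlap
    n_seg = 1 + (len(signal) - seg_len) // hop
    return np.stack([signal[k * hop : k * hop + seg_len] for k in range(n_seg)])

# Example: a 1000-sample signal, 256-sample segments with 50% overlap.
sig = np.arange(1000.0)
segs = segment_signal(sig, seg_len=256, overlap=128)
print(segs.shape)   # (6, 256): d = 6 observations for the multivariate model
```

Each row of `segs` would then be mapped to the cepstral feature domain to form one column of the observation matrix Y.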

B. Multivariate sparse representation for classification (MV-SRC)

In the previous section, we considered only a single-task sparse representation, where the test sample is generated by a single sensor and consists of a single observation vector. However, the test sample Y may consist of multiple observations of the same physical event obtained by the same sensor: $Y = [y_1, y_2, \ldots, y_d] \in \mathbb{R}^{n \times d}$. In our application, each observation is one segment of the test signal, obtained by partitioning the test signal into d (overlapping) segments, as shown in Fig. 1. Again supposing the test signal belongs to the j-th class, we can assume that each observation segment $y_i$ is a linear combination of training samples in the sub-dictionary $X_j$, which consists of all training segments. That is, for all $i = 1, \ldots, d$,

$$y_i = X w_i + n_i,$$

where $X = [X_1, X_2, \ldots, X_C]$ is a concatenation of C sub-dictionaries, the $w_i$'s are sparse vectors whose nonzero entries are associated with the j-th class, $w_i = [0^T, \ldots, 0^T, w_{i,j}^T, 0^T, \ldots, 0^T]^T$, and the $n_i$'s are small additive noises. If we denote $W = [w_1, w_2, \ldots, w_d] \in \mathbb{R}^{p \times d}$, then W is a row-sparse matrix with only $p_j$ nonzero rows, associated with the j-th class atoms. This model is the extension of the aforementioned SRC model to multiple observations, and it has been extensively studied in the literature [8], [14], [27]; in these papers, the authors showed significant classification improvement of this model over the simple SRC model. To recover the row-sparse matrix W, the following joint sparse optimization, well known as the multivariate Lasso [18], [28], [29], is proposed with $q \geq 1$:

$$\hat{W} = \arg\min_{W} \frac{1}{2}\|Y - XW\|_F^2 + \lambda \|W\|_{1,q}, \qquad (5)$$

where $\lambda$ is a positive regularization parameter and $\|W\|_{1,q}$ is the norm defined as $\|W\|_{1,q} = \sum_{k=1}^{p} \|w^k\|_q$, with the $w^k$'s the row vectors of W. This norm can be phrased as applying an $\ell_q$-norm across the columns (observations) and then an $\ell_1$-norm along the rows. This $\ell_1/\ell_q$ regularization encourages shared patterns across related observations; thus, the solution of (5) has common support across the columns. The class label is determined by the rule

$$\hat{j} = \arg\min_{j} \|Y - X\delta_j(\hat{W})\|_F, \qquad (6)$$

where $\delta_j(\cdot)$ is the matrix indicator function, defined by keeping the rows corresponding to the j-th class and setting all others to zero.

We demonstrate here the advantage of exploiting the correlation between multiple test observations to improve classification performance. We perform classification between human and human-animal footsteps using data recorded from one acoustic sensor; the details of our experimental setup are explained later in Section IV. The test signal represents a human-animal walking event. The test observations are extracted from multiple signal segments where the footstep events occur; each segment is transformed into the cepstral feature domain [30]. In this experiment, we compare the classification accuracy of the SRC and MV-SRC models and show the superiority of the latter. In the MV-SRC model, we employ 10 segments and perform classification jointly, while in the SRC model, each of the 10 segments is classified independently. The reconstructions with respect to the human (H) and human-animal (HA) cepstral feature coefficients of these two approaches are plotted in Figs. 2 and 3, respectively, where each sub-figure corresponds to a single segment. In these figures, the red curve represents the original test segment, while the blue and green curves are the reconstructions from the human-animal and human cepstral feature coefficients, respectively. As one can see, in the SRC model (Fig. 2), the green curves are closer to the red ones for all 10 segments (we plot only nine segments for clearer presentation), resulting in a wrong classification, while in the MV-SRC model (Fig. 3), the blue curves are closer to the red ones, implying a correct classification. More specifically, we show the reconstruction errors for these two methods in Tables I and II. One can clearly observe that the reconstruction errors for the MV-SRC method are significantly smaller than those of the SRC model.
Moreover, in Table II, the reconstruction errors with respect to the human-animal cepstral coefficients are lower than those with respect to the human cepstral coefficients for all segments, whereas this is not the case in Table I. Consequently, the MV-SRC model consistently returns a more accurate classification.

C. Multi-task multivariate sparse representation for classification (MTMV-SRC)

In the previous sections, we employed a single-sensor sparse representation for classification. In the scenario where an event is captured by multiple heterogeneous sensors, multiple observations from different sensors are available in the test sample. By exploiting the correlation as well as the

TABLE I
RECONSTRUCTION ERRORS FROM HUMAN (H) AND HUMAN-ANIMAL (HA) COEFFICIENTS - SRC MODEL.

Reconst. error  Seg 1  Seg 2  Seg 3  Seg 4  Seg 5  Seg 6  Seg 7  Seg 8  Seg 9  Seg 10
H coeffs         0.65   0.43   0.53   0.51   0.53   0.49   0.56   0.57   0.40   0.44
HA coeffs        0.48   0.71   0.74   0.70   0.62   0.61   0.51   0.56   0.76   0.58

TABLE II
RECONSTRUCTION ERRORS FROM HUMAN (H) AND HUMAN-ANIMAL (HA) COEFFICIENTS - MV-SRC MODEL.

Reconst. error  Seg 1  Seg 2  Seg 3  Seg 4  Seg 5  Seg 6  Seg 7  Seg 8  Seg 9  Seg 10  OA
H coeffs         0.23   0.28   0.45   0.26   0.21   0.33   0.35   0.27   0.32   0.30   0.97
HA coeffs        0.22   0.25   0.32   0.22   0.20   0.27   0.23   0.23   0.23   0.25   0.77

Fig. 2. Original test segments vs. their reconstructions from human and human-animal cepstral coefficients. These nine test segments are extracted from one acoustic test signal, and the SRC model is employed to reconstruct the test segments independently. Only the first 60 of 500 cepstral coefficients are plotted.

complementary information between different sensors, we can potentially improve the classification accuracy. To handle multiple sensors, a naive approach is to employ a simple voting scheme (or DI-DO method): for each sensor, the two-step classification algorithm described in Section II-B is performed and a class label is assigned, and the final decision is made by selecting the label that occurs most frequently. Clearly, this approach does not exploit the relationship between the different sources except at the post-processing step, where the decision is made via the majority-voting


Fig. 3. Original test segments vs. their reconstructions from human and human-animal cepstral coefficients. These nine test segments are extracted from one acoustic test signal, and the MV-SRC model is employed to reconstruct the test segments jointly. Only the first 60 of 500 cepstral coefficients are plotted.

fusion process. In this section, we propose a general framework, called multi-task multivariate sparse representation (MTMV-SRC), which takes into account correlations and complementary information between sensors simultaneously while considering joint sparsity within each sensor's multiple observations. In this model, we exploit a joint sparsity prior on the sparse coefficient matrices obtained from the different sensors' data in order to make a joint classification decision. To illustrate this model, let us first consider a two-task classification with a test sample Y consisting of two tasks $Y^1$ and $Y^2$ collected from two different sensors, as shown in Fig. 4. Suppose that $Y^1$ belongs to the j-th class and is observed by sensor 1; it can be reconstructed by a linear combination of the atoms in the sub-dictionary $X^1_j$. That is, $Y^1 = X^1 W^1 + N^1$, where $W^1$ is a sparse matrix with only $p_j$ nonzero rows, associated with the j-th class, and $N^1$ is a small noise matrix. Similarly, $Y^2$ represents the same physical event observed by sensor 2; it belongs to the same class and thus can be approximated by the training samples in $X^2_j$ with a different set of coefficients $W^2$: $Y^2 = X^2 W^2 + N^2$, where $W^2$ has the same sparsity pattern as $W^1$.

If we denote $W = [W^1, W^2]$, then W is a sparse matrix with only $p_j$ nonzero rows. Therefore, to recover this row-sparse matrix, we should incorporate this common sparsity pattern as a prior in the optimization algorithm. In the more general case where we have D sensors, denote by $\{Y^i\}_{i=1}^{D}$ a


Fig. 4. d test segments of the same size from acoustic and seismic test signals; cepstral features are then extracted from these segments.

set of D observations, each consisting of d segments collected from the D sensors, and let $W \in \mathbb{R}^{p \times dD}$ be the unknown matrix formed by concatenating the coefficient matrices, $W = [W^1, W^2, \ldots, W^D]$. This matrix W can be recovered by solving the following $\ell_1/\ell_q$-regularized least-squares problem:

$$\hat{W} = \arg\min_{W} \frac{1}{2} \sum_{i=1}^{D} \|Y^i - X^i W^i\|_F^2 + \lambda \|W\|_{1,q}, \qquad (7)$$

where $\lambda$ is a positive parameter and q is set greater than 1 to make the optimization convex. The optimization (7) is called the multi-task multivariate Lasso. This framework can be seen as a generalization of both multi-task and multivariate sparse representation. In fact, if only one source of information is used for classification inference, the MTMV model reduces to the MV-SRC presented in the previous section, and the optimization (7) simplifies to (5). On the other hand, if there is only one observation from each sensor, then (7) reduces to the conventional multi-task Lasso, studied extensively in the literature [31].

Again, once $\hat{W}$ is obtained, the class label is decided by the minimal-residual rule

$$\hat{j} = \arg\min_{j} \sum_{i=1}^{D} \|Y^i - X^i \delta^i_j(W^i)\|_F^2, \qquad (8)$$

where $\delta^i_j$ is the matrix indicator function associated with the i-th sensor, defined similarly as in the previous section.
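As a concrete sketch of the multi-task multivariate Lasso (7) with q = 2, the following uses a proximal-gradient loop with row-wise group soft-thresholding — a simpler alternative to the ADMM algorithm of Section III. The dictionary sizes, λ, and iteration count are illustrative assumptions.

```python
import numpy as np

def row_shrink(W, t):
    """Row-wise group soft-thresholding: prox of the l1/l2-norm ||W||_{1,2}."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def mtmv_lasso(Xs, Ys, lam, n_iter=300):
    """Solve (7) for q=2: the W^i share a common row-support across D sensors.

    Xs: list of D dictionaries (n x p); Ys: list of D observations (n x d_i).
    Returns the concatenated coefficient matrix W = [W^1, ..., W^D].
    """
    p = Xs[0].shape[1]
    cols = [Y.shape[1] for Y in Ys]
    W = np.zeros((p, sum(cols)))
    # step size 1/L with L = max_i ||X^i||_2^2 (the gradient is block-separable)
    step = 1.0 / max(np.linalg.norm(X, 2) ** 2 for X in Xs)
    splits = np.cumsum(cols)[:-1]
    for _ in range(n_iter):
        grads = [X.T @ (X @ Wi - Y)
                 for X, Y, Wi in zip(Xs, Ys, np.split(W, splits, axis=1))]
        W = row_shrink(W - step * np.hstack(grads), step * lam)
    return W

# Toy check: two sensors whose coefficients share support on rows {3, 7}.
rng = np.random.default_rng(1)
Xs = [rng.standard_normal((30, 12)) for _ in range(2)]
Ws_true = [np.zeros((12, 4)) for _ in range(2)]
for Wt in Ws_true:
    Wt[[3, 7], :] = rng.standard_normal((2, 4))
Ys = [X @ Wt for X, Wt in zip(Xs, Ws_true)]
W = mtmv_lasso(Xs, Ys, lam=0.1)
print(np.argsort(np.linalg.norm(W, axis=1))[-2:])   # dominant (shared) rows
```

With D = 1 the same loop solves the MV-SRC problem (5), matching the reduction noted above.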

D. Multi-task multivariate sparse representation with sparse noise (MTMV-SRC+E)

During the process of collecting data in the field, there may exist many different types of environmental clutter noise, such as impulsive noise or wind noise, affecting the true characteristics of the signal. Unfortunately, due to the imperfections of the environment, these noise sources are uncontrollable and can have arbitrarily large magnitudes, sometimes dominating the collected signal. It is obvious that removing these clutter noises can improve the overall classification performance. Fortunately, these types of noise usually occupy certain frequency bands, depending on the environment; we therefore expect them to affect only some coefficients in the cepstral feature domain, which is the feature space used in our experiments. In this case, the linear model of the observation $Y^i$, $i = 1, \ldots, D$, with respect to the training data $X^i$ should be modified as

$$Y^i = X^i W^i + Z^i + N^i, \quad i = 1, \ldots, D, \qquad (9)$$

where $N^i$ is a small dense additive noise and $Z^i \in \mathbb{R}^{n \times d_i}$ is a matrix of clutter noise whose entries can have arbitrarily large magnitudes. The nonzero entries of $Z^i$ represent the cepstral features of $Y^i$ corrupted by background clutter noise; note that it is possible for all of these cepstral features to be corrupted. The locations of the corruption can differ across the sensors' data, since heterogeneous sensors with different characteristics exhibit different types of errors. In more general scenarios, one can assume that the observations $Y^i$ are contaminated by errors $Z^i$ that can be sparsely represented with respect to some basis $T^i \in \mathbb{R}^{n \times m_i}$; that is, $Z^i = T^i E^i$ for some sparse matrices $E^i \in \mathbb{R}^{m_i \times d_i}$:

$$Y^i = X^i W^i + T^i E^i + N^i, \quad i = 1, \ldots, D. \qquad (10)$$

The idea of exploiting a sparsity prior on the error was developed by Wright et al. [9] in the context of face recognition, and by Candès et al. [32] for robust principal component analysis. In this section, we propose a new sparse representation method that simultaneously performs classification and removes clutter noise. Taking advantage of the sparsity of the errors $E^i$, we propose to solve the following optimization to retrieve the coefficients $W^i$ as well as the errors $E^i$:

$$(\hat{W}, \hat{E}) = \arg\min_{W, E} \frac{1}{2} \sum_{i=1}^{D} \|Y^i - X^i W^i - T^i E^i\|_F^2 + \lambda_W \|W\|_{1,q} + \lambda_E \|E\|_1, \qquad (11)$$

where $\lambda_W$ and $\lambda_E$ are positive parameters, and E is the matrix formed by the error matrices: $E = [E^1, E^2, \ldots, E^D]$. The $\ell_1$-norm of the matrix E is defined as the sum of the absolute values of its entries, $\|E\|_1 = \sum_{ij} |e_{ij}|$. It is clear from this minimization that we encourage both row sparsity on W and entry-wise sparsity on the error E.

Once the sparse solution W and error E are computed, the clean cepstral features $Y^i_c$ are recovered by setting $Y^i_c = Y^i - T^i E^i$. To identify the class, we slightly modify the label inference in (8) to account for the error $E^i$:

$$\hat{j} = \arg\min_{j} \sum_{i=1}^{D} \|Y^i - X^i \delta^i_j(W^i) - T^i E^i\|_F^2. \qquad (12)$$
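Given solutions of (11), the denoised decision rule (12) is straightforward to apply. The sketch below assumes $T^i = I$ (i.e., clutter sparse directly in the cepstral domain) and uses hypothetical shapes and values; it illustrates the rule only, not the full solver.

```python
import numpy as np

def mtmv_e_classify(Xs, Ys, Ws, Es, class_of_atom):
    """Decision rule (12): per-class residual after subtracting the clutter.

    Assumes T^i = I (clutter sparse directly in the feature domain).
    Xs, Ys, Ws, Es: per-sensor lists; class_of_atom: length-p class labels.
    """
    labels = np.array(class_of_atom)
    scores = {}
    for j in sorted(set(class_of_atom)):
        total = 0.0
        for X, Y, W, E in zip(Xs, Ys, Ws, Es):
            Wj = np.where(labels[:, None] == j, W, 0.0)   # delta_j^i(W^i)
            total += np.linalg.norm(Y - X @ Wj - E) ** 2
        scores[j] = total
    return min(scores, key=scores.get)

# Toy: one sensor, class-1 rows active, plus one large sparse clutter spike.
rng = np.random.default_rng(3)
X = rng.standard_normal((20, 8))
W = np.zeros((8, 3)); W[[5, 6], :] = 1.0                  # support in class 1
E = np.zeros((20, 3)); E[4, 1] = 10.0                     # one corrupted feature
Y = X @ W + E
print(mtmv_e_classify([X], [Y], [W], [E], [0] * 4 + [1] * 4))   # prints 1
```

Note that without subtracting E, the large clutter spike would inflate every class residual equally and could mask the class structure.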

III. ALGORITHM

In this section, we propose a fast algorithm to solve (11). The optimization (7) can be solved similarly by setting the error-regularization parameter $\lambda_E$ in (11) to zero. In a more general form, (11) can be seen as an instance of the convex optimization

$$\min_{x} f(x) + g(x), \qquad (13)$$

where $f, g: \mathbb{R}^n \mapsto \mathbb{R}$ are both convex functions. Algorithms for solving this optimization have been extensively studied in the literature. In this paper, we focus on the class of alternating direction methods of multipliers (ADMM), which are based on the variable-splitting technique combined with the augmented Lagrangian method; these methods have been shown to be particularly efficient for $\ell_1$-norm minimization [33]. A key idea of the ADMM algorithm is to decouple the variable x appearing in the two convex functions: one introduces a new variable y and rewrites problem (13) as

$$\min_{x, y} f(x) + g(y) \quad \text{such that} \quad x = y. \qquad (14)$$

The rationale behind variable splitting is that it may be easier to solve the constrained problem (14) than its unconstrained counterpart (13). Since (14) is an equality-constrained problem, the augmented Lagrangian method (ALM) can be applied by minimizing the augmented Lagrangian function

$$\mathcal{F}_\beta(x, y; b) := f(x) + g(y) + \langle b, y - x \rangle + \frac{\beta}{2} \|y - x\|_2^2, \qquad (15)$$

where b is the multiplier of the linear constraint and $\beta$ is a positive penalty parameter. The ALM consists of minimizing $\mathcal{F}_\beta(x, y; b)$ with respect to x and y jointly, i.e., solving the subproblem

$$(x_{t+1}, y_{t+1}) := \arg\min_{x, y} \mathcal{F}_\beta(x, y; b_t), \qquad (16)$$

and then updating the Lagrange multiplier b via

$$b_{t+1} := b_t + \beta (y_{t+1} - x_{t+1}). \qquad (17)$$

Minimizing (16) with respect to x and y jointly is often not easy. However, by exploiting the separable structure of the objective $\mathcal{F}_\beta(x, y; b)$, we can simplify the problem by minimizing $\mathcal{F}_\beta(x, y; b)$ with respect to x and y separately. The resulting alternating direction method of multipliers (ADMM) for solving (13) is given in Algorithm 1:

Algorithm 1: Alternating direction method of multipliers (ADMM)
1: Choose $\beta$, $b_0$, and $x_0 = y_0$
2: While not converged do
3:   $x_{t+1} := \arg\min_{x} \mathcal{F}_\beta(x, y_t; b_t)$
4:   $y_{t+1} := \arg\min_{y} \mathcal{F}_\beta(x_{t+1}, y; b_t)$
5:   $b_{t+1} := b_t + \beta (y_{t+1} - x_{t+1})$
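To make Algorithm 1 concrete before specializing it to (11), the sketch below instantiates it for the single-sensor Lasso (3), taking f(x) = ½‖y − Xx‖² and g(z) = λ‖z‖₁; both sub-steps then have closed forms (a cached linear solve and a soft-threshold). This is an illustrative instance under our own choices of β and iteration count, not the authors' implementation.

```python
import numpy as np

def lasso_admm(X, y, lam, beta=1.0, n_iter=200):
    """ADMM (Algorithm 1) for min_w 0.5*||y - Xw||_2^2 + lam*||w||_1.

    Split w into (x, z): f(x) = 0.5*||y - Xx||^2, g(z) = lam*||z||_1,
    with the constraint x = z and multiplier b, following (15)-(17).
    """
    p = X.shape[1]
    x = np.zeros(p); z = np.zeros(p); b = np.zeros(p)
    G = X.T @ X + beta * np.eye(p)          # factor for the x-update, cached
    Xty = X.T @ y
    for _ in range(n_iter):
        # step 3: min over x of f(x) - <b, x> + beta/2*||x - z||^2
        x = np.linalg.solve(G, Xty + beta * z + b)
        # step 4: min over z -> soft-threshold of x - b/beta at lam/beta
        v = x - b / beta
        z = np.sign(v) * np.maximum(np.abs(v) - lam / beta, 0.0)
        # step 5: multiplier update (17)
        b = b + beta * (z - x)
    return z

# Sanity check against a sparse ground truth.
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 20))
w_true = np.zeros(20); w_true[[2, 9]] = [1.5, -2.0]
w_hat = lasso_admm(X, X @ w_true, lam=0.1)
print(np.argsort(np.abs(w_hat))[-2:])       # the two largest-magnitude entries
```

Caching the factor of $X^T X + \beta I$ is what makes each x-update cheap; the same idea underlies the W- and E-updates of the next algorithm.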

We now apply the ADMM method to solve the optimization (11). Denote the loss function

$$L(W, E) = \frac{1}{2} \sum_{i=1}^{D} \|Y^i - X^i W^i - T^i E^i\|_F^2.$$

Our ultimate goal is to solve the optimization

$$\min_{W, E} L(W, E) + \lambda_E \|E\|_1 + \lambda_W \|W\|_{1,q}. \qquad (18)$$

First, we introduce auxiliary variables to reformulate the problem as a constrained optimization:

$$\min_{W, E, V, U} L(W, E) + \lambda_E \|U\|_1 + \lambda_W \|V\|_{1,q} \quad \text{subject to} \quad W = V, \; E = U. \qquad (19)$$

We then return this constrained problem to an unconstrained counterpart by introducing the augmented Lagrangian function

$$\mathcal{F}_{\beta_E, \beta_W}(W, E, V, U; B_E, B_W) = L(W, E) + \lambda_E \|U\|_1 + \langle B_E, E - U \rangle + \frac{\beta_E}{2} \|E - U\|_F^2 + \lambda_W \|V\|_{1,q} + \langle B_W, W - V \rangle + \frac{\beta_W}{2} \|W - V\|_F^2, \qquad (20)$$

where $B_E$ and $B_W$ are the multipliers for the two linear constraints, and $\beta_E$, $\beta_W$ are positive penalty parameters. As suggested by the ADMM method, the optimization consists of minimizing $\mathcal{F}_{\beta_E, \beta_W}(W, E, V, U; B_E, B_W)$ with respect to one variable at a time, keeping the other variables fixed, and then updating the multipliers sequentially. The method is formally presented in Algorithm 2:

Algorithm 2: ADMM for MTMV-SRC+E
1: Choose $W_0$, $U_0$, $V_0$, $B_{W,0}$, $B_{E,0}$ and $\beta_E$, $\beta_W$
2: While not converged do
3:   $W_{t+1} = \arg\min_{W} \mathcal{F}_{\beta_E, \beta_W}(W, E_t, U_t, V_t; B_{W,t}, B_{E,t})$
4:   $E_{t+1} = \arg\min_{E} \mathcal{F}_{\beta_E, \beta_W}(W_{t+1}, E, U_t, V_t; B_{W,t}, B_{E,t})$
5:   $U_{t+1} = \arg\min_{U} \mathcal{F}_{\beta_E, \beta_W}(W_{t+1}, E_{t+1}, U, V_t; B_{W,t}, B_{E,t})$
6:   $V_{t+1} = \arg\min_{V} \mathcal{F}_{\beta_E, \beta_W}(W_{t+1}, E_{t+1}, U_{t+1}, V; B_{W,t}, B_{E,t})$
7:   $B_{E,t+1} := B_{E,t} + \beta_E (E_{t+1} - U_{t+1})$
8:   $B_{W,t+1} := B_{W,t} + \beta_W (W_{t+1} - V_{t+1})$

More details of the algorithm, including closed-form solutions of each sub-optimization, are presented in the appendix.

IV. SIMULATION RESULTS AND ANALYSIS

In this section, we perform extensive experiments on a real multi-sensor data set and compare the results with several conventional classification methods, such as logistic regression and SVM, to verify the effectiveness of our proposed approach.

A. Experimental setup

First, let us briefly explain the experimental setup.

1. Data collection. Footstep data were collected over two days by two sets of nine sensors, each set consisting of four acoustic, three seismic, one passive infrared (PIR), and one ultrasonic sensor (see Fig. 5 for the four different types of sensors). Tests with human footsteps include one person walking, one person jogging, two people walking, two people running, and a group of people walking or running. Tests with human-animal footsteps include one person leading a horse or a dog, two people leading a horse and a mule, three people leading a horse, a mule, and a donkey, and a group of multiple people with several dogs. In each test, the people and animals could be carrying varying loads, such as a backpack or a metal pipe. The test subjects comprised both males and females. Ideally, we would like to discriminate between human and wild-animal footsteps; however, such data is difficult to collect and is not currently available for testing. During each run, the test subjects followed a path along which the two sets of sensors were positioned and then returned to the starting point. The two sensor sets (each consisting of all nine sensors) were placed 100 meters apart on the path. A total of 68 round-trip runs were conducted over the two days: 34 runs with human footsteps and another 34 runs with human-animal footsteps. The collected data sets, named DEC09 and DEC10 after the two collection days, December 09 and 10, are summarized in Table III.


TABLE III
TOTAL AMOUNT OF DATA COLLECTED IN TWO DAYS.

Data     Human    Human leading animal
DEC09    16       15
DEC10    18       19

Fig. 5. Four acoustic sensors (left), seismic sensor (middle left), passive infrared (PIR) sensor (middle right) and ultrasound sensor (right).

2. Segmentation. To accurately perform classification, it is necessary to extract the actual events from each run. Although the raw signal of each run might last several minutes, the event itself is much shorter, occurring during the brief period when the test subjects are close to the sensors. In addition, the event can occur at arbitrary locations within the recording. To extract useful features, we first detect the time locations where the physical event occurs by identifying the location with the strongest signal response using the spectral maximum detection method [34]. From this location, ten segments with 75% overlap are taken on both sides of the signal; each segment has 30,000 samples, corresponding to 3 seconds of the signal. This process is performed for all the sensor data. Overall, for each run we have nine signals captured by the nine sensors, and each signal is divided into ten overlapping segments; thus D = 9 and d_i = 10 for i = 1, ..., D. Fig. 6 visually demonstrates the signals captured by four distinct sensors when the event is one person walking. As one can see, different sensors exhibit quite different signal behaviors: the seismic signal clearly shows the cadence of the test person, while this pattern is not observable in the other sensing signals. To provide a closer look, Fig. 7 shows one segment extracted from each of the nine sensing signals. In this figure, the fourth acoustic signal is corrupted due to a sensor failure during the collection process.

3. Feature extraction. After segmentation, we extract the cepstral features [30] of each segment and keep the first 500 coefficients for classification. Cepstral features have proven effective in speech recognition and acoustic signal classification. The feature dimension, i.e., the number of extracted cepstral features, is n = 500.
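The segmentation and cepstral-feature steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the peak detector stands in for the spectral maximum detection of [34], and a plain real cepstrum stands in for the exact cepstral features of [30]; window placement details are assumptions.

```python
import numpy as np

def extract_segments(signal, seg_len=30_000, n_segs=10, overlap=0.75):
    """Take n_segs overlapping windows around the strongest response
    (a crude stand-in for spectral maximum detection)."""
    center = int(np.argmax(np.abs(signal)))           # strongest response
    hop = int(seg_len * (1 - overlap))                # 75% overlap -> 25% hop
    starts = [center - seg_len // 2 + (k - n_segs // 2) * hop
              for k in range(n_segs)]
    segs = []
    for s in starts:
        s = min(max(s, 0), len(signal) - seg_len)     # clip to signal bounds
        segs.append(signal[s:s + seg_len])
    return np.stack(segs)                             # shape (n_segs, seg_len)

def cepstral_features(segment, n_coeffs=500):
    """Real cepstrum of one segment; keep the first n_coeffs coefficients."""
    spectrum = np.fft.rfft(segment)
    log_mag = np.log(np.abs(spectrum) + 1e-12)        # avoid log(0)
    cepstrum = np.fft.irfft(log_mag)
    return cepstrum[:n_coeffs]
```

Stacking the ten 500-dimensional feature vectors of one sensor column-wise yields the 500 x 10 observation matrix Y^i used in the experiments below.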

Fig. 6. Signals captured by four different sensors: acoustic, seismic, PIR, and ultrasonic.

Fig. 7. Signal segments of length 30,000 captured by all nine available sensors, including four acoustic, three seismic, one PIR, and one ultrasonic sensor.

B. Two-class problem

Now we demonstrate the effectiveness of using multiple sensors over a single sensor for a two-class classification problem, specifically discriminating between human and human-animal footsteps. In this experiment, we use the DEC09 data for training and the DEC10 data for testing, which gives 31 training and 37 testing samples. For the i-th sensor, the corresponding training


dictionary X^i is constructed from all the cepstral feature segments extracted from the 31 training signals. In our experiments, ten segments are taken from each individual sensor signal; therefore, each training dictionary X^i, i = 1, ..., 9, is of size 500 × 310, and the associated observation Y^i is of size 500 × 10, where 500 is the feature dimension. The classification performance is summarized in Table IV. The first column lists the methods used in our experiments, which include multivariate sparse representation for classification (MV-SRC) applied to each of seven sensors separately, and multi-task MV-SRC (MTMV-SRC) applied to different combinations of sensors. We note here that the first two sensors are acoustic, the next three (sensors 5 to 7) are seismic, and the last two are PIR and ultrasonic, respectively. The second and third columns report the classification accuracy on human and human-animal footsteps, and the last column is the overall accuracy. In these experiments, we exclude the 3rd and 4th acoustic sensors due to sensor malfunction on December 10. As can be seen from Table IV, MTMV-SRC using the first two acoustic sensors simultaneously outperforms MV-SRC using each sensor separately. Similar behavior can be observed with the three seismic sensors. When all seven sensors are employed, MTMV-SRC yields the best performance.

To provide more insight into the MTMV-SRC algorithm, we vary the regularization parameter λ in (7) and rerun the MTMV-SRC algorithm. This parameter controls the sparsity level of the coefficient matrix W. Classification performance versus λ is displayed in Fig. 8, where the values of λ are selected from 10^−5 to 1. As one can observe, the performance curves of the sensors behave quite differently for values of λ between 10^−3 and 5 × 10^−1.
For values of λ outside [10^−3, 5 × 10^−1], the performance of the MTMV-SRC algorithm tends to saturate. Each sensor's performance curve has its own peak, at which the MTMV-SRC algorithm yields its highest classification accuracy. More importantly, as expected, the classification performance of the combined three sensors 5, 6, and 7 is significantly better than that of each sensor separately over a wide range of the regularization parameter λ. The same behavior can be observed with the set of two sensors 1 and 2. When all seven sensors 1, 2, and 5−9 are used, the MTMV-SRC algorithm achieves the highest classification performance. To further show the efficiency of the MTMV-SRC model, we repeat the same experiments using the DEC10 data for training and the DEC09 data for testing. The classification performances are provided in Table V. As we can see from Table V, incorporating all seven sensors yields the best classification accuracy.
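The class decision in these sparse-representation classifiers is made by comparing class-restricted reconstruction residuals. The sketch below illustrates this rule for one sensor under simplifying assumptions: the coefficient matrix `W` is assumed to have already been recovered by the sparse solver, and `class_cols` (a hypothetical helper mapping each class to its dictionary columns) is not from the paper.

```python
import numpy as np

def classify_by_residual(Y, X, W, class_cols):
    """Assign Y (features x segments) to the class whose training atoms
    reconstruct it best: argmin_c ||Y - X[:, c] @ W[c, :]||_F."""
    residuals = {}
    for label, cols in class_cols.items():
        recon = X[:, cols] @ W[cols, :]      # class-restricted reconstruction
        residuals[label] = np.linalg.norm(Y - recon, 'fro')
    return min(residuals, key=residuals.get), residuals
```

In the experiments above, X would be the training dictionary of one sensor and Y its ten test segments, with the dictionary columns split between the human and human-animal training segments.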

18

75

Classification accuracy

70 65 60 55 50

Sensor 1 Sensor 2 Sensor 5 Sensor 6 Sensor 7 Sensor 8 Sensor 9 Sensors 1,2 Sensors 5−7 Sensors 1,2,5−9

45 40 35 −5

−4.5

−4

−3.5

−3

−2.5

−2

−1.5

−1

−0.5

0

log10(λ)

Fig. 8. λ versus the overall classification performance for different combinations of sensors. We plot the classification performance of sensors 1, 2, 5, 6, 7, 8, and 9 separately; of the sensor combinations (1, 2) and (5−7); and of the combination of all seven sensors (1, 2, 5−9).

TABLE IV
CLASSIFICATION ACCURACY (%) FOR THE TWO-CLASS PROBLEM WITH TRAINING SAMPLES TAKEN FROM DEC09, COMPARING MV-SRC TO MTMV-SRC WITH DIFFERENT COMBINATIONS OF SENSORS.

Methods                     H       HA      OA
MV-SRC sensor 1             94.44   26.32   60.38
MV-SRC sensor 2             94.44   31.58   63.01
MV-SRC sensor 5             44.45   78.95   61.70
MV-SRC sensor 6             44.44   89.47   66.96
MV-SRC sensor 7             50.00   68.42   59.21
MV-SRC sensor 8             50.00   78.95   64.47
MV-SRC sensor 9             94.44    5.26   49.85
MTMV-SRC sensors 1-2        94.44   36.84   65.64
MTMV-SRC sensors 5-7        72.22   68.42   70.32
MTMV-SRC sensors 1,2,5-9    72.22   73.68   72.95

C. Dealing with arbitrarily large errors

In this subsection, we show the significance of imposing an additional term for the environmental clutter noise in the MTMV-SRC+E model (10). This term leads to an additional `1-regularization in the MTMV-SRC+E optimization (11). On one hand, this term is expected to compensate for the unwanted environmental clutter noise introduced into the signals during data collection. On the other hand, it can be seen as compensation for model mismatch. We conduct similar two-class classification experiments with training samples taken from the DEC09 data and report the results in Table VI. In addition, Table VII


TABLE V
CLASSIFICATION ACCURACY (%) FOR THE TWO-CLASS PROBLEM WITH TRAINING SAMPLES TAKEN FROM DEC10, COMPARING MV-SRC TO MTMV-SRC WITH DIFFERENT COMBINATIONS OF SENSORS.

Methods                     H       HA      OA
MV-SRC sensor 1             75.00   46.67   60.84
MV-SRC sensor 2             75.00   53.33   64.17
MV-SRC sensor 5             50.00   66.67   58.33
MV-SRC sensor 6             56.25   66.67   61.46
MV-SRC sensor 7             56.25   66.67   61.46
MV-SRC sensor 8             31.25   73.33   52.29
MV-SRC sensor 9             68.75   53.33   61.04
MTMV-SRC sensors 1-2        81.25   60.00   70.63
MTMV-SRC sensors 5-7        75.00   66.67   70.84
MTMV-SRC sensors 1,2,5-9    81.25   66.67   73.96

TABLE VI
CLASSIFICATION ACCURACY (%) FOR THE TWO-CLASS PROBLEM WITH TRAINING SAMPLES TAKEN FROM DEC09, SHOWING THE SUPERIORITY OF TAKING SPARSE NOISE INTO ACCOUNT FOR BOTH THE MV-SRC AND MTMV-SRC METHODS.

Methods                       H        HA      OA
MV-SRC+E sensor 1             94.44    31.58   63.01
MV-SRC+E sensor 2             72.22    78.95   75.58
MV-SRC+E sensor 5             38.89    94.74   66.81
MV-SRC+E sensor 6             66.67    63.16   64.91
MV-SRC+E sensor 7             22.22    89.47   50.00
MV-SRC+E sensor 8             88.89    21.05   54.97
MV-SRC+E sensor 9             83.33    26.32   54.82
MTMV-SRC+E sensors 1-2        72.22    84.21   78.21
MTMV-SRC+E sensors 5-7        50.00    94.74   72.39
MTMV-SRC+E sensors 1,2,5-9    100.00   73.68   86.84

shows the results with training samples taken from the DEC10 data. It can be seen from these tables that by simultaneously removing the environmental noise while performing classification, we can considerably improve the classification accuracy for both classes as well as the overall performance. To give a more intuitive explanation of how imposing a sparse error term leads to better performance, we perform experiments with one human-animal test sample on both the MTMV-SRC and MTMV-SRC+E models. Seven sensors (two acoustic, three seismic, one passive infrared, and one ultrasonic) are used as the test data. The training dictionary is extracted from the DEC09 data. We show in Figs. 9 and 10 the human-animal test segments from one of the four acoustic sensors


TABLE VII
CLASSIFICATION ACCURACY (%) FOR THE TWO-CLASS PROBLEM WITH TRAINING SAMPLES TAKEN FROM DEC10, SHOWING THE SUPERIORITY OF TAKING SPARSE NOISE INTO ACCOUNT FOR BOTH THE MV-SRC AND MTMV-SRC METHODS.

Methods                       H       HA      OA
MV-SRC+E sensor 1             81.25   53.33   67.29
MV-SRC+E sensor 2             75.00   60.00   67.50
MV-SRC+E sensor 5             68.75   40.00   53.38
MV-SRC+E sensor 6             56.25   73.33   64.79
MV-SRC+E sensor 7             68.75   66.67   67.71
MV-SRC+E sensor 8             37.75   73.33   55.54
MV-SRC+E sensor 9             68.75   53.33   61.04
MTMV-SRC+E sensors 1-2        81.25   66.67   73.96
MTMV-SRC+E sensors 5-7        56.25   86.67   71.46
MTMV-SRC+E sensors 1,2,5-9    81.25   73.33   77.29

and their reconstructions from the human and human-animal cepstral coefficients, respectively. Fig. 9 shows the reconstructed segments from the MTMV-SRC model, while Fig. 10 shows those from the MTMV-SRC+E model. In Fig. 11, the sparse error vectors recovered by the MTMV-SRC+E model are displayed. For clearer visualization, we plot only the first 60 of the 500 cepstral coefficients in these figures, where the red, blue, and green curves display the original test segment, the reconstruction from the human-animal cepstral coefficients, and the reconstruction from the human cepstral coefficients, respectively. As illustrated in Fig. 9, without the sparse clutter-noise term in the MTMV-SRC model, the reconstructions from both the human and human-animal cepstral coefficients are quite poor compared with Fig. 10, where the MTMV-SRC+E model is applied. Moreover, in Fig. 10 the blue curves follow the red ones considerably closely for all segments, whereas the green curves do not. This implies that the reconstruction error from the human-animal cepstral coefficients is lower than that from the human cepstral coefficients, leading to more accurate classification. In more detail, we report the reconstruction errors from the human and human-animal cepstral coefficients in Tables VIII and IX, respectively. As we can see in Table IX, the overall (OA) reconstruction error from the human-animal (HA) cepstral coefficients is significantly smaller than that from the human (H) cepstral coefficients, leading to the correct classification. Furthermore, for every segment (Seg. 1−10), the human-animal cepstral coefficients give smaller reconstruction errors than the human

cepstral coefficients. However, this is not the case with the MTMV-SRC model. In Table VIII, the overall


TABLE VIII
RECONSTRUCTION ERRORS FROM HUMAN (H) AND HUMAN-ANIMAL (HA) COEFFICIENTS - MTMV-SRC MODEL.

Reconst. error  Seg 1  Seg 2  Seg 3  Seg 4  Seg 5  Seg 6  Seg 7  Seg 8  Seg 9  Seg 10  OA
H coeffs        0.72   1.14   1.10   0.87   0.85   0.98   0.76   0.78   0.77   0.96    2.86
HA coeffs       0.81   0.95   0.65   0.90   0.99   1.13   0.80   1.06   1.11   0.86    2.97

TABLE IX
RECONSTRUCTION ERRORS FROM HUMAN (H) AND HUMAN-ANIMAL (HA) COEFFICIENTS - MTMV-SRC+E MODEL.

Reconst. error  Seg 1  Seg 2  Seg 3  Seg 4  Seg 5  Seg 6  Seg 7  Seg 8  Seg 9  Seg 10  OA
H coeffs        0.46   0.47   0.46   0.43   0.45   0.44   0.40   0.42   0.51   0.36    1.43
HA coeffs       0.19   0.44   0.36   0.31   0.28   0.32   0.24   0.39   0.29   0.33    1.02

reconstruction errors for the human and human-animal cepstral coefficients are roughly similar, although the latter is higher. In addition, the human cepstral coefficients give a better reconstruction for most of the segments. Hence, MTMV-SRC misclassifies this test sample. Comparing the two tables, one can see that the MTMV-SRC+E model leads to a considerably smaller overall reconstruction error, as well as smaller errors for every segment, than the MTMV-SRC model. Though we care more about classification accuracy than reconstruction, a better reconstruction usually results in better classification performance. Furthermore, the MTMV-SRC+E model is preferred in applications in which both high classification accuracy and small reconstruction errors are required.

Now we examine the effect of the sparse clutter noise on the classification performance by varying its sparsity level and then executing the MTMV-SRC+E algorithm. As one can see in the minimization (11), this noise sparsity level is mainly controlled by the regularization parameter λE. When λE is small, we expect to obtain a rather dense reconstructed noise term Ê; as the value of λE increases, Ê is expected to become sparser. In this experiment, we use the same training and testing sets as in Table VI and vary λE from 10^−5 to 1 while fixing the other regularization parameter at λW = 10^−2. The classification performance is

visually displayed in Fig. 12. As can be seen from this figure, with varying values of the regularization parameter λE, the performance curve of each sensor behaves quite differently. Each sensor's performance curve has its own peak, at which the MTMV-SRC+E algorithm yields its highest classification accuracy. In more detail, the performance of the MTMV-SRC+E algorithm generally varies for values of λE between 5 × 10^−3 and 10^−1, while outside this range the performance tends to saturate. For λE between 10^−3 and 5 × 10^−3, the MTMV-SRC+E algorithm achieves the best overall accuracy when


Fig. 9. Original test segments vs. their reconstructions from the human and human-animal cepstral coefficients. These 10 test segments are extracted from one acoustic test signal, and the MTMV-SRC model is employed to reconstruct them. Only the first 60 of the 500 cepstral coefficients are plotted.

Fig. 10. Original test segments vs. their reconstructions from the human and human-animal cepstral coefficients. These 10 test segments are extracted from one acoustic test signal, and the MTMV-SRC+E model is employed to reconstruct them. Only the first 60 of the 500 cepstral coefficients are plotted.

Fig. 11. Sparse clutter errors reconstructed by the MTMV-SRC+E model. In each segment, there are a few significant coefficients while most of the others are zero or approximately zero.


incorporating information from all seven sensors. Moreover, the performance of sensors 1 and 2 jointly is significantly better than that of sensor 1 on its own and typically better than that of sensor 2. Similar behavior can be observed from the combination of three sensors 5, 6 and 7. These results are expected from our intuitive explanation and experimental analysis.

Fig. 12. λE versus classification performance for different combinations of sensors. We plot the classification performance of sensors 1, 2, 5, 6, 7, 8, and 9 separately; of the sensor combinations (1, 2) and (5−7); and of all seven sensors (1, 2, 5−9).

D. Comparing with other methods

The next experiment compares our proposed approach with current state-of-the-art classification methods: (joint) sparse logistic regression (SLR) [35], kernel (joint) SLR [36], linear support vector machine (SVM), and kernel SVM [37]. In this experiment, the seven sensors 1, 2, and 5-9 are used. For the conventional classifiers such as SVM and SLR, we incorporate information from

multiple sensors by concatenating all D = 7 sensors’ training dictionaries to form an elongated dictionary X ∈ RDn×p . Atoms of the new dictionary X are considered as the new training samples, which are then

used to train the SVM and SLR classifiers. The coefficient vector obtained by these classifiers is then applied to the concatenated test segments, and a voting scheme is employed to assign a final class label to each test signal. For the kernel versions, we use an RBF kernel with the bandwidth selected via cross-validation.
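The segment-level voting described above can be sketched as a plain majority vote over the per-segment predictions of one test signal; this is a generic illustration, not necessarily the authors' exact decision rule.

```python
from collections import Counter

def vote(segment_labels):
    """Decision-level fusion: assign the label that occurs most often
    among the per-segment predictions of one test signal."""
    return Counter(segment_labels).most_common(1)[0][0]
```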


Another, more effective method to exploit the information across the sensors is to use the heterogeneous model proposed in [36]. This model, called the heterogeneous feature machine (HFM), has been shown to be efficient for problems in which various modalities (sensors) are used simultaneously. The main idea of this model is to associate with the training data in each training dictionary X^i ∈ R^{n×p} an appropriate coefficient vector w^i ∈ R^n, with the logistic loss taken over the sum of all the sensors. Sparsity or joint-sparsity regularization can be incorporated into the optimization to retrieve the coefficient vectors w^i, i = 1, ..., D, where D = 7 in this particular problem. This regularization implies that some features are assumed to be irrelevant for classification and should therefore be removed via the `1/`q-norm penalty. In particular, the probabilistic model of the logistic regression is formulated as follows:

$$P(y_j \mid x_j^1, \ldots, x_j^D) = \frac{\exp\left(y_j \sum_{i=1}^{D} x_j^{iT} w^i\right)}{1 + \exp\left(\sum_{i=1}^{D} x_j^{iT} w^i\right)}, \qquad (21)$$

where the feature vector $x_j^i$ is the $j$-th column of the training dictionary $X^i$ associated with the $i$-th sensor, and the label $y_j$, taking the value 0 or 1, indicates the class.

To recover the $w^i$'s, the authors of [36] propose to apply maximum likelihood over all the training data together with imposing joint sparsity on $W = [w^1, w^2, \ldots, w^D]$. In particular, the following function is minimized:

$$\min_{W}\; L(W) + \lambda \|W\|_{1,q}, \qquad (22)$$

where the loss function $L(W)$ is the negative log-likelihood

$$L(W) = -\sum_{j=1}^{p} y_j \sum_{i=1}^{D} x_j^{iT} w^i + \sum_{j=1}^{p} \log\left(1 + \exp\left(\sum_{i=1}^{D} x_j^{iT} w^i\right)\right). \qquad (23)$$
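As a concrete illustration of (22)-(23), the sketch below evaluates the regularized HFM objective for given coefficient vectors. It is a hedged example, not the code of [36]; the solver step (the actual minimization) is omitted, and the row-wise penalty computation assumes q = 2.

```python
import numpy as np

def hfm_loss(X_list, w_list, y, lam, q=2):
    """Negative log-likelihood of the HFM logistic model plus an l1/lq penalty.
    X_list[i] is the n x p dictionary of sensor i; w_list[i] its length-n vector."""
    # a_j = sum_i <x_j^i, w^i>, computed for all p training samples at once
    a = sum(X.T @ w for X, w in zip(X_list, w_list))          # shape (p,)
    # -sum_j y_j a_j + sum_j log(1 + exp(a_j)), stable via logaddexp
    nll = -y @ a + np.sum(np.logaddexp(0.0, a))
    W = np.column_stack(w_list)                               # n x D matrix [w^1 ... w^D]
    penalty = lam * np.sum(np.linalg.norm(W, ord=q, axis=1))  # l_q per row, summed
    return nll + penalty
```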

Once the coefficient vectors $w^i$ are obtained, each segment of the test sample is assigned to a class, and the final decision is made by selecting the label that occurs most often. This method can also be generalized to the kernel domain by replacing $x_j^i$ with its kernel version $K(x_j^i, X^i)$; this variant is referred to as the kernel HFM in our experiments. The optimization in (22) can also be solved efficiently via the alternating direction method presented in Section III.

All the aforementioned methods (SVM, SLR, and HFM) in our setup can be seen as a combination of the FI-FO and DI-DO categories: feature-level fusion across multiple sensors followed by decision-level fusion over the observation segments of each test signal. Though these methods are efficient in combining information from different sensors, they are clearly suboptimal in combining information within each sensor. One example demonstrating this sub-optimality: if the event is not present in all the

TABLE X
CLASSIFICATION ACCURACY (%) FOR THE TWO-CLASS PROBLEM WITH TRAINING SAMPLES TAKEN FROM DEC09, COMPARING MTMV-SRC WITH CONVENTIONAL CLASSIFIERS. THE 7 SENSORS 1, 2, AND 5-9 ARE USED IN THIS EXPERIMENT.

Methods        H        HA      OA
MTMV-SRC       72.22    73.68   72.95
MTMV-SRC+E     100.00   73.68   86.84
SLR            55.56    84.21   69.89
Kernel SLR     88.89    52.63   70.76
HFM            55.56    89.47   72.52
Kernel HFM     88.89    57.89   73.39
SVM            77.78    63.61   70.47
Kernel SVM     77.78    68.42   73.10

observation segments, then fusing all the observation segments at the decision level will probably result in misclassification. Table X compares the classification accuracy of the aforementioned approaches. As one can clearly observe, our MTMV-SRC approach is comparable to the conventional classifiers. However, by further exploiting the possibility that the test segments are contaminated by environmental clutter noise, our MTMV-SRC+E, which simultaneously removes this noise while performing classification, outperforms all the conventional classifiers. To further show the efficiency of our approach, we repeat the same experiments using the DEC10 data for training and the DEC09 data for testing. The classification performances are provided in Table XI. Similar behavior can be observed in this table: while the MTMV-SRC approach gives comparable classification performance, the MTMV-SRC+E approach considerably outperforms all the conventional classification methods. These experiments validate the potential of our MTMV-SRC+E model for multi-sensor classification.

V. CONCLUSION AND DISCUSSION

In this paper, we propose a novel multi-task multivariate joint structured sparsity-based classification method (MTMV-SRC) for personnel footstep recognition, where the data are collected from nine sensors including acoustic, seismic, PIR, and ultrasonic sensors. Our proposed approach shows how to efficiently exploit correlations as well as complementary information between sensors measuring the same physical events. Simulation results demonstrate that our method yields highly accurate classification performance and outperforms many classical classifiers such as (group) SLR, SVM, and their kernel versions. Furthermore, we extend our model to deal with environmental clutter noise, which is indispensable in


TABLE XI
CLASSIFICATION ACCURACY (%) FOR THE TWO-CLASS PROBLEM WITH TRAINING SAMPLES TAKEN FROM DEC10, COMPARING MTMV-SRC WITH CONVENTIONAL CLASSIFIERS. THE 7 SENSORS 1, 2, AND 5-9 ARE USED IN THIS EXPERIMENT.

Methods        H       HA      OA
MTMV-SRC       81.25   66.67   73.96
MTMV-SRC+E     81.25   73.33   77.29
SLR            81.25   53.33   67.29
Kernel SLR     87.50   46.67   67.09
HFM            75.00   66.67   70.84
Kernel HFM     87.50   53.33   70.42
SVM            81.25   53.33   67.29
Kernel SVM     81.25   60.00   70.63

many practical scenarios. We experimentally illustrate the importance of enforcing an additional sparse-noise regularization in MTMV-SRC to remove arbitrarily large noise; this model significantly improves the overall classification accuracy. Lastly, we propose a fast first-order algorithm based on the classical alternating direction method to solve the aforementioned models.

VI. APPENDIX

We now present details of the closed-form solutions of each sub-optimization.

A. Update step for the W-variables

First, we discuss the minimization of $F_{\beta_E,\beta_W}(W, E, V, U; B_E, B_W)$ with respect to the variable $W$. By

reformulating the Lagrangian function and removing variables that do not contribute into the optimization, we find it sufficient to minimize D

1 X

Y i − X iW i − T iE it 2 + βW min F W 2 2 i=1

2

W − V t + 1 B W,t .

βW F

(24)

This optimization has a quadratic structure, which is easy to solve via setting the first-order derivative W , E ) is a sum of convex functions associated with to zero. Furthermore, since the loss function L(W

sub-matrices W i , one can simultaneously seek for W it+1 , i = 1, ..., D, which have explicit solutions as follows T

T

X i X i + βW I )−1 [X X i (Y Y i − T iE it ) + βW V it − B iV,t ], W it+1 = (X

(25)

where I is an identity matrix of size p × p, and U it , B iU,t , V it , B iV,t are sub-matrices of U t , B U,t , V t and B V,t , respectively.
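As a sanity check, the per-sensor solve in (25) can be sketched in a few lines of NumPy. The function name and the matrix shapes below are our own assumptions for illustration, not the authors' code; in practice one solves the regularized normal equations rather than forming the matrix inverse.

```python
import numpy as np

def update_W_block(X_i, Y_i, T_i, E_i, V_i, B_i, beta_W):
    """Ridge-type closed-form W-update of (25) for one sensor block.

    Assumed shapes (hypothetical): X_i (n x p), Y_i (n x m), T_i (n x r),
    E_i (r x m), and V_i, B_i (p x m).
    """
    p = X_i.shape[1]
    rhs = X_i.T @ (Y_i - T_i @ E_i) + beta_W * V_i - B_i
    # Solve (X_i^T X_i + beta_W I) W_i = rhs; more stable than inverting.
    return np.linalg.solve(X_i.T @ X_i + beta_W * np.eye(p), rhs)
```

Because the loss decouples across sensors, the D blocks can be updated independently (and in parallel).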


B. Update step for the E-variables

The second optimization subproblem, with respect to $E$ (line 4 of Algorithm 1), has a similar structure and can be solved by

$$E_i^{t+1} = \left(T_i^T T_i + \beta_E I\right)^{-1}\left[T_i^T\left(Y_i - X_i W_i^{t+1}\right) + \beta_E U_i^t - B_i^{E,t}\right]. \qquad (26)$$

C. Update step for the U-variables

The third optimization subproblem, with respect to $U$ (line 5 of Algorithm 1), is a standard $\ell_1$-minimization. In fact, via a simple calculus manipulation, this optimization is equivalent to

$$\min_{U}\ \frac{1}{2}\left\|E^{t+1} + \frac{1}{\beta_E}B^{E,t} - U\right\|_F^2 + \frac{\lambda_E}{\beta_E}\left\|U\right\|_1, \qquad (27)$$

which is well known as the shrinkage problem. The explicit solution is given by

$$U^{t+1} = \mathrm{Shrink}\!\left(E^{t+1} + \frac{1}{\beta_E}B^{E,t},\ \lambda_E/\beta_E\right), \qquad (28)$$

where the operator $\mathrm{Shrink}(\cdot)$ acts entry-wise and is defined by $\mathrm{Shrink}(z, \epsilon) = \mathrm{sgn}(z)(|z| - \epsilon)$ for $|z| \geq \epsilon$ and zero otherwise.
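The entry-wise shrinkage operator in (28) is a one-liner in NumPy. This is a generic soft-thresholding sketch of the operator defined above, not the authors' implementation:

```python
import numpy as np

def shrink(Z, eps):
    # Soft-thresholding: sgn(z) * (|z| - eps) where |z| >= eps, else 0.
    # Works element-wise on scalars or arrays of any shape.
    return np.sign(Z) * np.maximum(np.abs(Z) - eps, 0.0)
```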

D. Update step for the V-variables

The last optimization subproblem, with respect to $V$ (line 6 of Algorithm 1), can be recast as

$$\min_{V}\ \frac{1}{2}\left\|W^{t+1} + \frac{1}{\beta_W}B^{W,t} - V\right\|_F^2 + \frac{\lambda_W}{\beta_W}\left\|V\right\|_{1,q}. \qquad (29)$$

It is clear that the objective function in (29) has a separable structure. Therefore, this optimization can be tackled by minimizing with respect to each row of $V$ separately. Specifically, if we denote $w_{i,t+1}$, $b_{W,i,t}$, and $v_{i,t+1}$ as the rows of the matrices $W^{t+1}$, $B^{W,t}$, and $V^{t+1}$, respectively, then for each $i = 1, \dots, p$ we solve the sub-problem

$$v_{i,t+1} = \operatorname*{argmin}_{v}\ \frac{1}{2}\left\|z - v\right\|_{\ell_2}^2 + \gamma\left\|v\right\|_q, \qquad (30)$$

where $z := w_{i,t+1} + b_{W,i,t}/\beta_W$ and $\gamma := \lambda_W/\beta_W$. Although (30) is simple enough to admit an explicit solution for any value of $q$, in this paper we only consider the simplest case $q = 2$. It is now easy to check that the solution of (30) has the explicit form

$$v_{i,t+1} = \left(1 - \frac{\gamma}{\|z\|_2}\right)_{+} z,$$

where $(u)_+$ is a vector with entries $\max(u_i, 0)$.
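For $q = 2$, this row-wise group shrinkage vectorizes over all $p$ rows at once. The helper below is a sketch in the paper's notation (with the rows of $Z = W^{t+1} + B^{W,t}/\beta_W$ playing the role of $z$, following the quadratic term of (29)); the function and variable names are ours, not the authors':

```python
import numpy as np

def update_V_rows(W_next, B_W, beta_W, lam_W):
    """Row-wise group shrinkage solving (30) with q = 2:
    v_i = (1 - gamma / ||z_i||_2)_+ * z_i, with gamma = lam_W / beta_W."""
    Z = W_next + B_W / beta_W
    gamma = lam_W / beta_W
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    # Rows with norm <= gamma are set exactly to zero (group sparsity).
    scale = np.maximum(1.0 - gamma / np.maximum(norms, 1e-12), 0.0)
    return scale * Z
```

Rows whose $\ell_2$-norm falls below $\gamma$ are zeroed out entirely, which is what induces the row-sparse (joint-sparse) structure of $V$.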
