Pattern Recognition 50 (2016) 143–154


Multi-view low-rank dictionary learning for image classification

Fei Wu a,b, Xiao-Yuan Jing a,b,*, Xinge You c, Dong Yue a, Ruimin Hu b, Jing-Yu Yang d

a School of Automation, Nanjing University of Posts and Telecommunications, China
b School of Computer, Wuhan University, China
c School of Electronic Information and Communications, Huazhong University of Science and Technology, China
d College of Computer Science, Nanjing University of Science and Technology, China

Article info

Article history:
Received 17 October 2014
Received in revised form 25 May 2015
Accepted 13 August 2015
Available online 21 August 2015

Abstract

Recently, multi-view dictionary learning (DL) techniques have received much attention. Although some multi-view DL methods have been presented, they suffer from performance degeneration when large noise exists in multiple views. In this paper, we propose a novel multi-view DL approach named multi-view low-rank DL (MLDL) for image classification. Specifically, inspired by low-rank matrix recovery theory, we provide a multi-view dictionary low-rank regularization term to address the noise problem. We further design a structural incoherence constraint for multi-view DL, such that redundancy among dictionaries of different views can be reduced. In addition, to enhance the efficiency of the classification procedure, we design a classification scheme for MLDL based on the idea of collaborative representation based classification. We apply MLDL to face recognition, object classification and digit classification tasks. Experimental results demonstrate the effectiveness and efficiency of the proposed approach. © 2015 Elsevier Ltd. All rights reserved.

Keywords: Multi-view dictionary learning; Multi-view dictionary low-rank regularization; Structural incoherence constraint; Collaborative representation based classification

1. Introduction

Learning effective features plays a crucial role in image classification applications. Dictionary learning (DL) is an important feature learning technique with state-of-the-art classification performance. Most DL methods have been designed to solve single- or two-view based learning problems [1–3]. With the Fisher discriminant criterion, the Fisher discrimination DL method [4] constructs a structured dictionary whose atoms correspond to the class labels. The incoherent DL method [5] learns dictionaries that exhibit low mutual coherence while providing a sparse approximation with a favorable signal-to-noise ratio. The discriminative DL with low-rank regularization (D2L2R2) method [6] applies low-rank regularization to the dictionary and a Fisher discriminant function to the coding coefficients. However, these methods do not apply to multi-view data. In many real-world applications, the same object can be observed from different viewpoints or by different sensors, thus generating multi-view (more than two views) data. For example, given a face, photos can be taken from different viewpoints, resulting in multi-pose face images. Since more useful information exists in multiple views than

* Corresponding author at: School of Computer, Wuhan University, Wuhan, China. E-mail address: [email protected] (X.-Y. Jing).

http://dx.doi.org/10.1016/j.patcog.2015.08.012
0031-3203/© 2015 Elsevier Ltd. All rights reserved.

in a single one, multi-view feature learning techniques have attracted a lot of research interest [7–9]. In this field, multi-view subspace learning is an important research direction. Canonical correlation analysis (CCA) [10] based and discriminant analysis based multi-view subspace learning are two representative techniques in this direction. The multi-view canonical correlation analysis (MCCA) method [11] exploits the correlation features among multiple views. From both intra-view and inter-view perspectives, the multi-view discriminant analysis (MVDA) method [12] maximizes the between-class variations and minimizes the within-class variations of samples in the learned common space. Besides multi-view subspace learning, several other multi-view learning methods have been developed. Co-training based methods [13,14] are usually used for semi-supervised classification, combining both labeled and unlabeled data under the multi-view setting. The low-rank multi-view learning (LR-MVL) method [15] uses a fast spectral method to estimate low-dimensional context-specific word representations from unlabeled data for natural language processing. Based on low-rank matrix completion theory, Mosabbeb et al. [16] adapted the inexact augmented Lagrangian multiplier algorithm to realize multi-view action recognition. Ding and Fu [17] developed a multi-view low-rank representation method, called low-rank common subspace (LRCS), which seeks a common low-rank linear projection to mitigate the semantic gap among different views. From the perspective of DL, most multi-view feature learning methods do not incorporate a DL technique into the multi-view feature learning framework.



Recently, several multi-view based sparse representation or DL methods have been presented [18–23]. The sparse multimodal biometrics recognition (SMBR) method [24] uses the original training sample set as the dictionary and exploits the joint sparsity of the coding coefficients from different biometric modalities to make a joint decision. Monaci et al. [25] developed a multi-view DL method with favorable audio-visual localization effects. Zhuang et al. [26] utilized the class information of samples to jointly learn discriminative multi-modal dictionaries as well as mapping functions between different modalities for multi-modal retrieval. For classifying lung needle biopsy images, the multimodal sparse representation-based classification (MSRC) method [27] aims to select the most discriminative samples for each individual modality while guaranteeing large diversity among different modalities. Gao et al. [28] presented a multi-view discriminative and structured DL method with group sparsity and a graph model (GM-GS-DSDL) to fuse different views and recognize human actions. Based on the Hilbert-Schmidt independence criterion, the multi-view supervised dictionary learning (MSDL) method [29] was developed for speech emotion recognition; it learns one dictionary in each view and subsequently fuses the sparse coefficients in the spaces of the learned dictionaries. By designing a vector-form based uncorrelated constraint, our previous work, the uncorrelated multi-view discrimination DL (UMD2L) method [30], jointly learns multiple uncorrelated discriminative dictionaries from multiple views.

1.1. Motivation

For image classification tasks, the learned dictionary bases can be used to represent test data, which facilitates classification. Multiple views provide descriptions of an object from different viewpoints, so that the traits of an object can be fully exploited. Although some multi-view DL methods have been presented, there remains much room for improvement.
The following two aspects have not been investigated comprehensively and thoroughly: (1) How to handle noisy information in multiple views of images for DL. In real-world applications, noisy information in the training samples leads to noise in multiple views, which undermines the discriminative ability of the learned dictionaries. However, existing multi-view DL methods have not taken this issue into consideration. (2) How to remove redundancy among dictionaries learned from different views and realize multi-view image classification with the learned dictionaries in an efficient and effective manner. Information redundancy in the original multi-view data leads to redundancy in the learned dictionaries, which hampers subsequent classification. Although UMD2L specially designs an uncorrelated constraint for multi-view DL, the provided vector-form based uncorrelation measure is not efficient enough. An effective and efficient classification scheme is crucial for DL methods. However, among the representative multi-view DL methods, MSRC employs a simple majority voting strategy for classification, which cannot make full use of multi-view information; MSDL simply concatenates the sparse coefficients corresponding to different views of samples and utilizes the fused coefficients for recognition, which is not physically meaningful; and the UMD2L method uses the sparse representation based classification algorithm [31] to learn a coding coefficient for each view of a test sample, which is time-consuming.

1.2. Contribution

Several single-view works have integrated low-rank learning techniques into the DL procedure to handle large noise, such as the works in [2,6]. On the other hand, some works have successfully incorporated low-rank learning techniques into multi-view learning, e.g., the low-rank representation based work [17] and the low-rank matrix completion based work [16]. Inspired by these works, we propose a multi-view low-rank dictionary learning (MLDL) approach for image classification. We summarize the contributions of our work in the following three points:

(1) Based on low-rank matrix recovery theory, we provide a multi-view dictionary low-rank regularization term to learn pure dictionaries that can reconstruct the de-noised images even when the training data are contaminated. To the best of our knowledge, we are the first to apply low-rank learning techniques to multi-view dictionary learning.

(2) We design a structural incoherence constraint for multi-view DL, which promotes dictionaries of different views to be independent of each other, so that the redundancy among dictionaries of different views is effectively and efficiently reduced.

(3) Borrowing the idea of collaborative representation based classification, we provide a novel classification scheme for multi-view DL methods, which employs $l_2$-norm regularization instead of $l_1$-norm regularization, so that the complexity of the classification process is significantly decreased.

The proposed MLDL approach is verified on the Multi-PIE [32] and AR [33] face datasets, the COIL-20 object dataset [34], and the MNIST handwritten digit dataset [35]. Experimental results demonstrate its effectiveness and efficiency compared with several related methods.

1.3. Organization

The remainder of this paper is organized as follows. In Section 2, we briefly review the background and related work. In Section 3, we describe the model of MLDL. The optimization and classification schemes of MLDL are presented in Sections 4 and 5, respectively. In Section 6, we discuss the differences between MLDL and related multi-view feature learning methods. Experimental results are reported in Section 7, and we conclude the paper in Section 8.

2. Background and related work

2.1. Dictionary learning

Given a set of data samples $A = [a_1, a_2, \dots, a_n]$, the goal of dictionary learning (DL) is to find a dictionary $D = [d_1, d_2, \dots, d_m]$ with a matrix of representation coefficients $X = [x_1, x_2, \dots, x_n]$, such that each sample can be represented as a sparse linear combination of dictionary atoms (the columns of the dictionary matrix), i.e., $a_i = D x_i$, $i = 1, 2, \dots, n$. DL methods solve the following optimization problem:

$\{\hat{D}, \hat{X}\} = \arg\min_{D, X} \|A - D X\|_F^2 + \tau \|X\|_1,$   (1)

where τ is a scalar constant. Several optimization algorithms have been developed to solve the dictionary and coefficient variables, such as method of optimal directions (MOD) [36], orthogonal matching pursuit (OMP) [37] and KSVD [38].
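For intuition, problem (1) can be approached by alternating between sparse coding and a least-squares dictionary update. The following is a minimal illustrative sketch (an ISTA-based coder with a MOD-style dictionary step; a simplified stand-in, not the cited MOD/OMP/K-SVD implementations):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the l1-norm: shrinks entries toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dictionary_learning(A, m, tau=0.1, n_iter=50, seed=0):
    """Alternating minimization sketch for min_{D,X} ||A - DX||_F^2 + tau*||X||_1.

    A: d x n data matrix; m: number of dictionary atoms.
    """
    rng = np.random.default_rng(seed)
    d, n = A.shape
    D = rng.standard_normal((d, m))
    D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm atoms
    X = np.zeros((m, n))
    for _ in range(n_iter):
        # Sparse coding step: a few ISTA iterations on X with D fixed.
        L = 2.0 * np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
        for _ in range(10):
            grad = 2.0 * D.T @ (D @ X - A)
            X = soft_threshold(X - grad / L, tau / L)
        # Dictionary update step: least squares on D with X fixed (MOD-style).
        D = A @ X.T @ np.linalg.pinv(X @ X.T)
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    return D, X
```

The alternation mirrors the structure of the solvers cited above: each sub-problem is convex when the other variable is held fixed.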

2.2. Low-rank learning

Recently, low-rank learning techniques, such as low-rank matrix recovery [39], low-rank matrix completion [40] and low-rank representation [41], have been applied in many areas. Low-rank matrix


recovery is effective in recovering the underlying low-rank structure of data. By solving a matrix rank minimization problem, the robust principal component analysis method [39] can recover corrupted data in a single subspace; it can be regarded as an extension of sparse representation from vectors to matrices. Chen et al. [42] presented a low-rank matrix recovery method with structural incoherence for face recognition. Zheng et al. [43] proposed a low-rank matrix recovery algorithm with Fisher discrimination regularization for face recognition. D2L2R2 [6] seeks a low-rank dictionary for each class. Let $A$ be a corrupted sample set; low-rank matrix recovery based methods solve the following nuclear norm regularized optimization problem:

$\min_{A_0, E} \|A_0\|_* + \xi \|E\|_1 \quad \text{s.t.} \quad A = B(A_0) + E,$   (2)

where $A_0$ and $E$ are unknown matrices to learn, $B(\cdot)$ is a linear operator, $\|\cdot\|_*$ is the nuclear norm, the $l_1$ norm is used to measure the noise, and $\xi > 0$ is a balance parameter. Several optimization algorithms have been proposed to solve the above problem, such as semi-definite programming (SDP) [44] and accelerated proximal gradient (APG) [45]. However, these algorithms suffer from a large computational burden. Recently, Lin et al. [46] presented an augmented Lagrange multipliers (ALM) algorithm that solves the nuclear norm optimization problem efficiently. In this paper, we adopt the ALM algorithm to solve the low-rank regularized problem.

2.3. Multimodal sparse representation-based classification (MSRC)

Assume that $\{D^{S'}, D^{C'}, D^{T'}\}$ are the original sample sets, where $S'$, $C'$ and $T'$ denote the shape, color and texture modalities, respectively. MSRC [27] uses binary sample selectors $\{\beta^{S'}, \beta^{C'}, \beta^{T'}\}$ to select samples from the original sample sets to construct the dictionaries $\{\beta^{S'}(D^{S'}), \beta^{C'}(D^{C'}), \beta^{T'}(D^{T'})\}$. The selection criteria are that: 1) each sub-dictionary can train a good classifier independently, and 2) the diversity among different sub-dictionaries is encouraged to be large. Denote by $\{x_k^{S'}, x_k^{C'}, x_k^{T'}, y_k\}_{k=1}^{N}$ the $k$th cell nucleus, where $y_k$ is the label of the $k$th cell nucleus and $N$ is the number of training cell nuclei. For the $k$th training cell nucleus, MSRC denotes $f(x_k^m; \beta^m)$, $m \in \{S', C', T'\}$, as the mapping function, which is the predicted label for $x_k^m$ given by the sub-classifier (i.e., single-modal SRC [27]) trained on the learned sub-dictionary $\beta^m(D^m)$. Then, the objective function of MSRC is formulated as:

$\min_{\beta^{S'}, \beta^{C'}, \beta^{T'}} \exp\Big[ -\sum_{k=1}^{N} \sum_{m \in \{S', C', T'\}} \langle y_k, f(x_k^m; \beta^m) \rangle + \frac{\upsilon}{3} \sum_{k=1}^{N} \sum_{\substack{m, \hat{m} \in \{S', C', T'\} \\ \hat{m} \neq m}} \langle f(x_k^m; \beta^m), f(x_k^{\hat{m}}; \beta^{\hat{m}}) \rangle \Big],$   (3)

where $\langle a, b \rangle$ is the Kronecker delta function and $\upsilon$ is a parameter controlling the influence of the second term.


2.4. Multiview supervised dictionary learning (MSDL)

MSDL [29] learns one sub-dictionary from the data samples in each view. For the $k$th view, MSDL learns the sub-dictionary $D^{(k)}$ by solving the following optimization problem:

$\max_{U^{(k)}} \operatorname{tr}\big( U^{(k)t} A^{(k)} H L H A^{(k)t} U^{(k)} \big) \quad \text{s.t.} \quad U^{(k)t} U^{(k)} = I,$   (4)

where $A^{(k)}$ denotes the data sample set of the $k$th view, $H$ is a centering matrix, $L$ is a kernel on the labels, $U^{(k)}$ is the transformation that maps the data to the space of maximum dependency with the labels, and $I$ is the identity matrix. In formula (4), $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, and $(\cdot)^t$ represents the transposition operation. By computing the eigenvectors corresponding to the largest eigenvalues of $A^{(k)} H L H A^{(k)t}$, the sub-dictionary $D^{(k)}$ can be obtained.

2.5. Uncorrelated Multi-view Discrimination Dictionary Learning (UMD2L)

For the given $M$ views $A_k$ ($k = 1, \dots, M$), UMD2L [30] aims to jointly learn multiple uncorrelated dictionaries $D_k$ ($k = 1, \dots, M$) and corresponding coefficient matrices $X_k$ ($k = 1, \dots, M$), each corresponding to one view. The objective function of UMD2L is formulated as follows:

$\min_{D_1, \dots, D_M, X_1, \dots, X_M} \sum_{k=1}^{M} \Big( \sum_{i=1}^{C} \big( \|A_k^i - D_k X_k^i\|_F^2 + \|A_k^i - D_k^i X_k^{ii}\|_F^2 + \sum_{j=1, j\neq i}^{C} \|D_k^j X_k^{ij}\|_F^2 \big) + \zeta \|X_k\|_1 \Big) \quad \text{s.t.} \quad \mathrm{Corr}(D_k, D_l) = 0, \; l \neq k,$   (5)

where $\zeta$ is a scalar parameter. $q(A_k^i, D_k, X_k^i) = \|A_k^i - D_k X_k^i\|_F^2 + \|A_k^i - D_k^i X_k^{ii}\|_F^2 + \sum_{j=1, j\neq i}^{C} \|D_k^j X_k^{ij}\|_F^2$ is a discriminative fidelity term requiring that $D_k$ have powerful reconstruction capability for $A_k$ and powerful discriminative capability over the samples in $A_k$. Here, $C$ is the number of classes, $A_k^i$ is the training sample subset from the $i$th class in $A_k$, $D_k^i$ denotes the class-specified dictionary associated with class $i$ in $D_k$, $X_k^i$ is the coding coefficient matrix of $A_k^i$ over $D_k$, and $X_k^{ij}$ denotes the coding coefficient matrix of $A_k^i$ over $D_k^j$. $\mathrm{Corr}(D_k, D_l) = 0$ is the uncorrelated constraint for DL, with the correlation coefficient between $D_k$ and $D_l$ (the dictionary corresponding to the $l$th view, $l \neq k$) defined as:

$\mathrm{Corr}(D_k, D_l) = \frac{\sum_{i=1}^{N_k} \sum_{j=1}^{N_l} \mathrm{Corr}(d_k^i, d_l^j)}{N_k \cdot N_l}.$

Here, $N_k$ and $N_l$ denote the numbers of dictionary atoms in $D_k$ and $D_l$, respectively, and

$\mathrm{Corr}(d_k^i, d_l^j) = \frac{(d_k^i - \bar{d}_k^i \cdot I)^T (d_l^j - \bar{d}_l^j \cdot I)}{\|d_k^i - \bar{d}_k^i \cdot I\| \, \|d_l^j - \bar{d}_l^j \cdot I\|}$

indicates the correlation coefficient between $d_k^i$ (the $i$th dictionary atom in $D_k$) and $d_l^j$ (the $j$th dictionary atom in $D_l$), where $\bar{d}_k^i$ and $\bar{d}_l^j$ are the mean values of these two atoms, and $I$ is a vector with all elements equal to one.

3. The Model of MLDL

In this section, we first describe the discriminative reconstruction error term, and then introduce the designed multi-view dictionary low-rank regularization term and structural incoherence constraint. Finally, we give the objective function of MLDL.

3.1. Discriminative reconstruction error term

Assume that there exist $M$ views of training data, denoted by $A_k$ ($k = 1, \dots, M$). For the $k$th view, let $A_k^i$ denote the training sample subset from the $i$th class, and let $D_k^i$ and $D_k^j$ denote the class-specified sub-dictionaries associated with classes $i$ and $j$, respectively. Suppose that the coding coefficient matrix of $A_k^i$ over $D_k$, namely $X_k^i$, can be written as $X_k^i = [X_k^{i1}; X_k^{i2}; \dots; X_k^{iC}]$, where $X_k^{ij}$ is the coding coefficient of $A_k^i$ over the sub-dictionary $D_k^j$. $C$ denotes the number of classes. First of all, the dictionary $D_k$ should have the capacity to well represent the samples in $A_k^i$, so we require the minimization of



$\|A_k^i - D_k X_k^i\|_F^2$. Second, since the sub-dictionary $D_k^i$ is associated with the $i$th class, it is expected to well represent samples from the $i$th class, so we should minimize $\|A_k^i - D_k^i X_k^{ii}\|_F^2$. Third, $D_k^i$ should not be able to represent samples from other classes, that is, $\sum_{j=1, j\neq i}^{C} \|D_k^i X_k^{ji}\|_F^2$ should be as small as possible. Thus, the discriminative reconstruction error for sub-dictionary $D_k^i$ can be defined as:

$R(D_k^i, X_k^i) = \|A_k^i - D_k X_k^i\|_F^2 + \|A_k^i - D_k^i X_k^{ii}\|_F^2 + \sum_{j=1, j\neq i}^{C} \|D_k^i X_k^{ji}\|_F^2.$   (6)

We should minimize the value of $\sum_{k=1}^{M} \sum_{i=1}^{C} R(D_k^i, X_k^i)$. By making dictionary atoms correspond to the class labels such that the obtained reconstruction error is discriminative, we aim to jointly learn multiple dictionaries with favorable overall discriminative power.

3.2. Multi-view dictionary low-rank regularization term

For image classification tasks, training samples belonging to the same class are usually linearly correlated and reside in a low-dimensional subspace. The sub-dictionary for representing samples from one class should therefore reasonably be of low rank. In addition, based on low-rank matrix recovery theory, low-rank regularization on the dictionary yields a compact and pure dictionary that can reconstruct the de-noised images even when the training samples are contaminated. For the $k$th view, to achieve the most compact basis to represent the samples of the $i$th class, we design the dictionary low-rank regularization as $L(D_k^i) = \|D_k^i\|_*$. Therefore, we should minimize the value of

$\sum_{k=1}^{M} \sum_{i=1}^{C} L(D_k^i).$   (7)

3.3. Structural incoherence constraint for multi-view DL

Information redundancy in the original multi-view data will lead to redundancy in the learned dictionaries, which hampers subsequent classification. Therefore, besides the requirements of desirable discriminative reconstruction capability and the low-rank characteristic of the dictionary, we also require non-redundancy among the dictionaries of different views. Our previous work UMD2L [30] tries to reduce redundancy among the dictionaries of multiple views by minimizing the correlation between different dictionaries. However, the correlation between any two dictionaries is computed as the sum of dictionary-atom level (i.e., vector versus vector) correlations, which is time-consuming. Recently, some works have introduced incoherence to encourage the matrices (dictionary matrices or low-rank intrinsic matrices) corresponding to different classes to be independent [42,47,48], enhancing the discrimination ability of the learning model. Motivated by these works, we provide a structural incoherence constraint for multi-view DL as follows:

$\sum_{k=1}^{M} \sum_{i=1}^{C} \sum_{l=1, l\neq k}^{M} \|D_k^{it} D_l\|_F^2 = 0,$   (8)

where $(\cdot)^t$ denotes the transposition operation. The introduction of such incoherence encourages the resulting dictionaries corresponding to different views to be as independent as possible, which effectively reduces redundancy among the dictionaries. Furthermore, the introduced structural incoherence realizes redundancy reduction directly at the dictionary level (i.e., matrix versus matrix). This facilitates the subsequent optimization of variables, especially the dictionary matrices, since a variable like the dictionary can be optimized as a whole, rather than atom by atom.

3.4. The objective function of MLDL

Considering the discriminative reconstruction error, the low-rank regularization on the sub-dictionaries and the structural incoherence constraint on the multiple dictionaries, we can formulate the objective function of MLDL as follows:

$J(D_1, \dots, D_M; X_1, \dots, X_M) = \arg\min_{D_1, \dots, D_M, X_1, \dots, X_M} \sum_{k=1}^{M} \sum_{i=1}^{C} R(D_k^i, X_k^i) + \lambda \sum_{k=1}^{M} \|X_k\|_1 + \alpha \sum_{k=1}^{M} \sum_{i=1}^{C} L(D_k^i)$

$\text{s.t.} \quad \sum_{k=1}^{M} \sum_{i=1}^{C} \sum_{l=1, l\neq k}^{M} \|D_k^{it} D_l\|_F^2 = 0,$   (9)

where $\lambda$ and $\alpha$ are parameters that control the relative contributions of the discriminative reconstruction error, the coding coefficient sparsity and

Fig. 1. Illustration of the model of MLDL. Dictionaries of different views are mutually incoherent. Each sub-dictionary is of low rank, which reduces the negative effect of noise contained in the original samples.


low-rank regularization terms. Fig. 1 illustrates the idea of MLDL. In the next section, we will give the optimization procedure of MLDL.
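To make the pieces of objective (9) concrete, the following sketch evaluates the objective value for given per-view dictionaries and coefficient matrices. The data layout (class sample columns and class atom columns given as slices) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def mldl_objective(A, D, X, classes, lam, alpha):
    """Evaluate the MLDL objective (9) for M views.

    A, D, X: lists over views; A[k] is d x n, D[k] is d x m, X[k] is m x n.
    classes: list of (sample_slice, atom_slice) pairs, one per class, giving
    each class's columns in A[k]/X[k] and its atoms in D[k] (assumed layout).
    """
    total = 0.0
    for Ak, Dk, Xk in zip(A, D, X):
        for i, (cols_i, atoms_i) in enumerate(classes):
            Ai, Di = Ak[:, cols_i], Dk[:, atoms_i]
            Xi = Xk[:, cols_i]             # coding of class-i samples over Dk
            Xii = Xk[atoms_i][:, cols_i]   # block over the class-i sub-dictionary
            r = np.linalg.norm(Ai - Dk @ Xi) ** 2        # global fidelity
            r += np.linalg.norm(Ai - Di @ Xii) ** 2      # class-specific fidelity
            for j, (cols_j, _) in enumerate(classes):
                if j != i:
                    Xji = Xk[atoms_i][:, cols_j]  # class-j samples over sub-dictionary i
                    r += np.linalg.norm(Di @ Xji) ** 2   # cross-class penalty
            total += r
            total += alpha * np.linalg.norm(Di, 'nuc')   # low-rank term L(D_k^i)
        total += lam * np.abs(Xk).sum()                  # sparsity term
    return total
```

The structural incoherence constraint (8) is handled in the optimization itself (as the penalty β in Section 4) rather than in this plain objective evaluation.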


4. The optimization of MLDL

The variables in the objective function (9) can be optimized alternately with an iterative process. In each iteration we use the following optimization strategy: 1) when updating the variables of the $k$th ($k = 1, 2, \dots, M$) view, the variables of the other views are fixed; 2) for the $k$th view, $X_k$ and $D_k$ are updated alternately.

4.1. Updating the coding coefficient matrix $X_k$

Assume that $D_k$ is fixed, and write the coding coefficient matrix as $X_k = [X_k^1, X_k^2, \dots, X_k^C]$. We compute $X_k^i$ class by class. When computing $X_k^i$, all $X_k^j$ ($j \neq i$) are fixed. Thus, the objective function in Eq. (9) reduces to:

$J(X_k^i) = \arg\min_{X_k^i} \Big\{ \|A_k^i - D_k X_k^i\|_F^2 + \|A_k^i - D_k^i X_k^{ii}\|_F^2 + \sum_{j=1, j\neq i}^{C} \|D_k^j X_k^{ij}\|_F^2 + \lambda \|X_k^i\|_1 \Big\}.$   (10)

This is a sparse coding problem and can be solved using the iterative projection method (IPM) [49]. The detailed implementation of IPM can be found in the literature [4,49].

4.2. Updating the sub-dictionary $D_k$

With the updated $X_k$, we update $D_k^i$ class by class. When updating $D_k^i$, all $D_k^j$ ($j \neq i$) are fixed, and we simultaneously update the coding coefficients of $A_k^i$ over $D_k^i$, namely $X_k^{ii}$. Thus, Eq. (9) reduces to:

$J(D_k^i) = \arg\min_{D_k^i, X_k^{ii}} \Big\| A_k^i - D_k^i X_k^{ii} - \sum_{j=1, j\neq i}^{C} D_k^j X_k^{ij} \Big\|_F^2 + \|A_k^i - D_k^i X_k^{ii}\|_F^2 + \sum_{j=1, j\neq i}^{C} \|D_k^j X_k^{ij}\|_F^2 + \alpha \|D_k^i\|_* \quad \text{s.t.} \quad \sum_{l=1, l\neq k}^{M} \|D_k^{it} D_l\|_F^2 = 0.$   (11)

Eq. (11) can be rewritten as:

$J(D_k^i) = \arg\min_{D_k^i, X_k^{ii}} \Big\| A_k^i - D_k^i X_k^{ii} - \sum_{j=1, j\neq i}^{C} D_k^j X_k^{ij} \Big\|_F^2 + \|A_k^i - D_k^i X_k^{ii}\|_F^2 + \sum_{j=1, j\neq i}^{C} \|D_k^j X_k^{ij}\|_F^2 + \alpha \|D_k^i\|_* + \beta \sum_{l=1, l\neq k}^{M} \|D_k^{it} D_l\|_F^2,$   (12)

where $\beta$ is a tuning parameter. Denoting $q(D_k^i) = \| A_k^i - D_k^i X_k^{ii} - \sum_{j=1, j\neq i}^{C} D_k^j X_k^{ij} \|_F^2 + \sum_{j=1, j\neq i}^{C} \|D_k^j X_k^{ij}\|_F^2 + \beta \sum_{l=1, l\neq k}^{M} \|D_k^{it} D_l\|_F^2$, the above equation can be converted to the following formulation:

$\min_{D_k^i, E_k^i, X_k^{ii}} \|X_k^{ii}\|_1 + \alpha \|D_k^i\|_* + \eta \|E_k^i\|_{2,1} + \mu q(D_k^i) \quad \text{s.t.} \quad A_k^i = D_k^i X_k^{ii} + E_k^i,$   (13)

where $E_k^i$ is the error matrix corresponding to $A_k^i$, $\eta$ and $\mu$ are tuning parameters, and $\|\cdot\|_{2,1}$ is the $l_{2,1}$-norm, which is usually employed to measure sample-specific corruption or noise. The objective function in (13) can be solved as the following, essentially equivalent, problem:

$\min_{D_k^i, E_k^i, X_k^{ii}} \|Z\|_1 + \alpha \|G\|_* + \eta \|E_k^i\|_{2,1} + \mu q(D_k^i) \quad \text{s.t.} \quad A_k^i = D_k^i X_k^{ii} + E_k^i, \; D_k^i = G, \; X_k^{ii} = Z,$   (14)

where $Z$ and $G$ are two relaxation variables. The problem can then be solved using the inexact augmented Lagrange multiplier (ALM) algorithm [50]. The augmented Lagrangian function of Eq. (14) is:

$\min_{D_k^i, E_k^i, X_k^{ii}} \|Z\|_1 + \alpha \|G\|_* + \eta \|E_k^i\|_{2,1} + \mu q(D_k^i) + \operatorname{tr}\big[T_1^t (A_k^i - D_k^i X_k^{ii} - E_k^i)\big] + \operatorname{tr}\big[T_2^t (D_k^i - G)\big] + \operatorname{tr}\big[T_3^t (X_k^{ii} - Z)\big] + \frac{\delta}{2} \big( \|A_k^i - D_k^i X_k^{ii} - E_k^i\|_F^2 + \|D_k^i - G\|_F^2 + \|X_k^{ii} - Z\|_F^2 \big),$   (15)

where $\operatorname{tr}[\cdot]$ denotes the trace of a matrix, $T_1$, $T_2$ and $T_3$ are Lagrange multipliers, and $\delta$ ($\delta > 0$) is a balance parameter. The details of solving this problem are given in Algorithm 1. A proof similar to the one demonstrating the convergence property of Algorithm 1 can be found in Lin et al.'s work [46]. Algorithm 2 summarizes the optimization procedure of MLDL.

Algorithm 1. Inexact augmented Lagrange multiplier (ALM) algorithm for Eq. (15)

Input: The given sub-dictionaries $D_k^i$ and $D_l$ ($l \neq k$), sample set $A_k$, and parameters $\alpha$, $\eta$ and $\mu$.
Output: $D_k^i$, $E_k^i$ and $X_k^{ii}$.
Initialize: $G = 0$, $E_k^i = 0$, $T_1 = 0$, $T_2 = 0$, $T_3 = 0$, $\delta = 10^{-6}$, $\delta_{\max} = 10^{30}$, $\varepsilon = 10^{-8}$, $\vartheta = 1.1$.
While not converged do
1. Fix the other variables and update $Z$ by: $Z = \arg\min_Z \frac{1}{\delta}\|Z\|_1 + \frac{1}{2}\|Z - X_k^{ii} + \frac{T_3}{\delta}\|_F^2$.
2. Fix the other variables and update $X_k^{ii}$ by: $X_k^{ii} = (D_k^{it} D_k^i + I)^{-1} \big( D_k^{it}(A_k^i - E_k^i) + Z + \frac{D_k^{it} T_1 - T_3}{\delta} \big)$, where $I$ denotes the identity matrix.
3. Fix the other variables and update $G$ by: $G = \arg\min_G \frac{\alpha}{\delta}\|G\|_* + \frac{1}{2}\|G - D_k^i + \frac{T_2}{\delta}\|_F^2$, with length normalization for each column in $G$.
4. Fix the other variables and update $D_k^i$ by solving the equation $H Y + Y Q = V$, where $Y = D_k^i$, $H = \frac{2\mu\beta}{\delta} \sum_{l=1, l\neq k}^{M} D_l D_l^t$, $Q = \frac{2\mu X_k^{ii} X_k^{iit}}{\delta} + X_k^{ii} X_k^{iit} + I$, and $V = \frac{2\mu}{\delta} \big( A_k^i X_k^{iit} - \sum_{j=1, j\neq i}^{C} D_k^j X_k^{ij} X_k^{iit} \big) + A_k^i X_k^{iit} - E_k^i X_k^{iit} + G + \frac{T_1 X_k^{iit} - T_2}{\delta}$. Here $H$ and $Q$ are square matrices, and the converted problem can be solved using the algorithms in the literature [51,52]. Length normalization is then applied to each atom in $D_k^i$.
5. Fix the other variables and update $E_k^i$ by: $E_k^i = \arg\min_{E_k^i} \frac{\eta}{\delta}\|E_k^i\|_{2,1} + \frac{1}{2}\|E_k^i - (A_k^i - D_k^i X_k^{ii}) + \frac{T_1}{\delta}\|_F^2$.



6. Update $T_1$, $T_2$ and $T_3$ by: $T_1 = T_1 + \delta (A_k^i - D_k^i X_k^{ii} - E_k^i)$, $T_2 = T_2 + \delta (D_k^i - G)$, $T_3 = T_3 + \delta (X_k^{ii} - Z)$.
7. Update $\delta$ by: $\delta = \min(\vartheta \delta, \delta_{\max})$.
8. Check the stopping criterion: $\|D_k^i - G\|_\infty < \varepsilon$, $\|A_k^i - D_k^i X_k^{ii} - E_k^i\|_\infty < \varepsilon$ and $\|X_k^{ii} - Z\|_\infty < \varepsilon$.
End while
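The sub-problems in steps 1, 3 and 5 of Algorithm 1 are standard proximal operators: entrywise soft-thresholding for the $l_1$ term, singular value thresholding for the nuclear norm, and column-wise shrinkage for the $l_{2,1}$ norm. A minimal numpy sketch (illustrative; not the authors' implementation):

```python
import numpy as np

def prox_l1(V, t):
    # argmin_Z t*||Z||_1 + 0.5*||Z - V||_F^2: entrywise soft-thresholding.
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def prox_nuclear(V, t):
    # argmin_G t*||G||_* + 0.5*||G - V||_F^2: singular value thresholding.
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def prox_l21(V, t):
    # argmin_E t*||E||_{2,1} + 0.5*||E - V||_F^2: shrink each column
    # (columns correspond to samples, matching sample-specific noise).
    norms = np.linalg.norm(V, axis=0)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return V * scale
```

Reading the updates as printed, step 1 amounts to `Z = prox_l1(X_ii - T3/delta, 1/delta)`, step 3 to `G = prox_nuclear(D_i - T2/delta, alpha/delta)`, and step 5 to `E = prox_l21(A_i - D_i @ X_ii - T1/delta, eta/delta)` (variable names here are the obvious numpy stand-ins for the symbols above).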

Algorithm 2. Optimization procedure of MLDL

1. Initialize $\{D_1, D_2, \dots, D_M\}$: all the atoms of each $D_k$ ($k = 1, 2, \dots, M$) are initialized as random vectors with unit $l_2$-norm.
2. For $k = 1$ to $M$ do
   2.1 Update the sparse coding coefficient matrix $X_k$: fix $D_k$ and solve for $X_k^i$ class by class via Eq. (10) with the IPM algorithm.
   2.2 Update the dictionary $D_k$: fix $X_k$ and solve for $D_k^i$ class by class via Eq. (15) using Algorithm 1.
   End
3. Iterative learning: repeat Step 2 until the values of the objective function in adjacent iterations are close enough, or the maximum number of iterations is reached.
4. Output $\{X_1, X_2, \dots, X_M\}$ and $\{D_1, D_2, \dots, D_M\}$.

5. The classification scheme of MLDL

Sparse representation based classification (SRC) has been widely used for face recognition [31] and image classification [53]. It codes a query image as a sparse linear combination of all the pre-defined dictionary bases via $l_1$-norm minimization. Despite the wide use of sparse representation for classification, researchers have recently begun to question the role of sparsity in classification [54,55]. Collaborative representation based classification with regularized least squares (CRC_RLS) [54] reveals that it is the collaborative representation (i.e., representing the query image collaboratively by dictionary bases from all the classes), not the $l_1$-norm sparse representation, that makes SRC effective for image classification. It has been shown that the $l_2$-regularized CRC_RLS achieves very competitive image classification accuracy with significantly lower complexity than the $l_1$-regularized SRC.

Inspired by CRC_RLS, we design a classification scheme for MLDL. For a given test sample $z = \{z_1, z_2, \dots, z_M\}$, we code $z_k$ over the learned $D_k$ for $k = 1, \dots, M$. The representation coefficients can be obtained by solving:

$\{\hat{\rho}_1, \dots, \hat{\rho}_M\} = \arg\min_{\rho_1, \dots, \rho_M} \sum_{k=1}^{M} \big( \|z_k - D_k \rho_k\|_2^2 + \gamma \|\rho_k\|_2^2 \big),$   (16)

where $\rho_k$ ($k = 1, \dots, M$) is the coding vector of $z_k$ over $D_k$, and $\gamma$ is a positive constant. The variables in Eq. (16) can be optimized with an alternating strategy, updating $\rho_k$ for the $k$th ($k = 1, \dots, M$) view while fixing the coding coefficients corresponding to the other views. For updating $\rho_k$, Eq. (16) reduces to:

$\hat{\rho}_k = \arg\min_{\rho_k} \|z_k - D_k \rho_k\|_2^2 + \gamma \|\rho_k\|_2^2.$   (17)

The solution of Eq. (17) can be derived analytically as:

$\hat{\rho}_k = (D_k^t D_k + \gamma \cdot I)^{-1} D_k^t z_k.$   (18)

Once $\{\hat{\rho}_1, \dots, \hat{\rho}_M\}$ is obtained, we use the following strategy for classification. For the given query sample $z$, the coding error by class $i$ is computed as:

$e_i = \sum_{k=1}^{M} \|z_k - D_k^i \hat{\rho}_k^i\|_2 / \|\hat{\rho}_k^i\|_2,$   (19)

where $\hat{\rho}_k^i$ is the coefficient vector associated with sub-dictionary $D_k^i$. Then we do classification via:

$\mathrm{identity}(z) = \arg\min_i \{e_i\}.$   (20)

The designed classification scheme for MLDL can also be used for other multi-view DL methods.
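Eqs. (16)–(20) amount to one ridge regression per view followed by a class-wise residual comparison. A compact sketch (the per-class atom layout is an assumption of this illustration, shared across views for simplicity):

```python
import numpy as np

def mldl_classify(z_views, D_views, atom_slices, gamma=1e-3):
    """Classify a multi-view test sample via Eqs. (16)-(20).

    z_views: list of M view vectors; D_views: list of M dictionaries (d x m);
    atom_slices: per-class column slices into each dictionary.
    """
    rhos = []
    for z, D in zip(z_views, D_views):
        m = D.shape[1]
        # Eq. (18): closed-form l2-regularized coding of z over D.
        rho = np.linalg.solve(D.T @ D + gamma * np.eye(m), D.T @ z)
        rhos.append(rho)
    errors = []
    for atoms in atom_slices:  # one entry per class
        e = 0.0
        for z, D, rho in zip(z_views, D_views, rhos):
            rho_i = rho[atoms]
            # Eq. (19): class-wise residual normalized by coefficient energy.
            e += np.linalg.norm(z - D[:, atoms] @ rho_i) / max(np.linalg.norm(rho_i), 1e-12)
        errors.append(e)
    return int(np.argmin(errors))  # Eq. (20)
```

Because the coding uses a closed-form linear solve rather than an $l_1$ solver, the per-sample cost is dominated by one $m \times m$ system per view, which is the efficiency gain argued for above.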

6. Comparison with related multi-view feature learning methods

6.1. Comparison with non-DL based multi-view feature learning methods

Multi-view subspace learning is an important research direction of multi-view feature learning. CCA based and discriminant analysis based multi-view subspace learning (e.g., MCCA [11] and MVDA [12], respectively) are two representative techniques. CCA based multi-view subspace learning methods are dedicated to learning features that depict the intrinsic correlation among multiple views. Discriminant analysis based multi-view subspace learning methods usually aim to achieve multiple linear transformations with which the between-class variations of the low-dimensional embeddings are maximized and the within-class variations are minimized. Besides multi-view subspace learning methods, some works introduce low-rank learning techniques into multi-view feature learning, such as the low-rank representation based method LRCS [17] and the low-rank matrix completion based method of Mosabbeb et al. [16]. The proposed MLDL approach differs from these non-DL based multi-view feature learning methods in that we aim to fully extract complementary discriminant information from multiple views by learning multiple structurally incoherent discrimination dictionaries to aid classification.

6.2. Comparison with existing multi-view DL methods

DL aims to learn from the training samples a space in which a given signal can be well represented or coded for processing. As multi-view feature learning techniques have developed, the DL technique has been successfully incorporated into the multi-view learning framework. Several multi-view sparse representation classification methods, such as SMBR [24] and Zhang et al.'s work [20], regard the original training sample set as the dictionary without specially designing a DL procedure.
The linear decomposition of data over the atoms of a learned dictionary, instead of over the original training samples, has led to state-of-the-art classification results; MSRC [27], MSDL [29] and UMD2L [30] are representative multi-view DL methods with state-of-the-art classification performance. Considering the diversity of different modalities, MSRC uses a simple scheme to guarantee large disagreement among the dictionaries of different modalities. MSDL learns one dictionary and the corresponding coefficients in each view, fuses the representation coefficients, and feeds the fused coefficients into an SVM classifier for recognition. To make full use of the complementary discriminative information among different views, our previous work UMD2L provides a multi-view structured DL strategy based on the Fisher discrimination DL method and designs a vector-form based uncorrelated constraint for multi-view DL; it adopts a multi-view sparse representation classification scheme. Compared with these multi-view DL methods, the proposed MLDL approach focuses on multi-view image classification with large noise; therefore, low-rank regularization is applied to the class-specific dictionaries. Moreover, to improve the time efficiency of multi-view DL, MLDL provides the structural incoherence constraint and the collaborative representation based classification scheme. To clarify the differences between the proposed MLDL approach and related multi-view feature learning methods, we provide a brief summary in Table 1.

F. Wu et al. / Pattern Recognition 50 (2016) 143–154

7. Experimental results and analysis

7.1. Compared methods and experimental settings

In this section, we compare the proposed MLDL approach with multi-view subspace learning methods (MCCA [11] and MVDA [12]), the multi-view low-rank matrix completion method of Mosabbeb et al. [16], the multi-view low-rank representation method LRCS [17], the multi-view sparse representation classification method SMBR [24], and multi-view DL methods (MSRC [27], MSDL [29] and UMD2L [30]) on benchmark image datasets from diverse domains. In all experiments, the tuning parameters of MLDL (λ, α, β, η and μ in the dictionary learning phase, and γ in the classification phase) and all parameters of the compared methods are selected by 5-fold cross validation to avoid over-fitting. Concretely, the parameters of MLDL are set as λ = 0.005, α = 1, β = 0.05, η = 0.01, μ = 1 and γ = 0.005 for all four datasets. In addition, the default number of dictionary atoms per view in MLDL is set to the number of training samples.

7.2. Face recognition

To verify the effectiveness of MLDL for face recognition, we perform experiments on two widely used face image datasets, Multi-PIE and AR.

7.2.1. Experiment on the Multi-PIE dataset

The Multi-PIE dataset contains more than 750,000 images of 337 people under various views, illuminations and expressions; see [32] for more details. Here, a subset of 68 people (24 samples per person) with 5 different poses (C05, C07, C09, C27 and C29) is used. The image size is 64 × 64 pixels. We replace a certain percentage of randomly selected pixels of each image with the pixel value 255. Fig. 2 shows examples of random pixel corruption on face images of one subject. Principal component analysis (PCA) [56] is used to reduce the sample dimension to 100. We randomly select 8 samples (each with 5 different poses) per class for training and use the remaining samples for testing; the random selection is repeated 20 times and average results are reported. Table 2 lists the classification accuracies of all compared methods under various noise percentages. Our approach consistently performs best at every noise level. When there is no noise in the images (0% corruption), MLDL improves the average classification accuracy by at least 0.06% (= 95.91% − 95.85%). A likely reason for the improvement is that the low-rank regularization on each sub-dictionary (the dictionary corresponding to one class), which takes the within-class correlation of samples into account, yields a compact dictionary, and a compact dictionary helps achieve favorable discriminative power. When the noise percentage is large, for example 40% pixel corruption, MLDL improves the average classification accuracy by at least 12.88% (= 68.19% − 55.31%). The improvement is mainly due to the designed multi-view dictionary low-rank regularization term, which requires each sub-dictionary Dik to be pure so that the noise Eik in the training samples Aik can be effectively separated.

7.2.2. Experiment on the AR dataset

The AR face dataset [33] contains images of 119 individuals (26 images per person), including frontal views of faces under different lighting conditions and with various occlusions. Each image is scaled to 60 × 60. The images are manually corrupted by an unrelated block image at a random location. Fig. 3 shows example images of one subject with 20% block corruption. We extract Gabor transformation [57], Karhunen–Loève (KL) transformation [58] and Local Binary Patterns (LBP) [59] features to construct three views (feature sets). Following the Multi-PIE setting, we employ PCA to reduce the dimension of these features to 100. We randomly select 8 samples per class for training, use the remaining samples for testing, and run all compared methods 20 times to report average results. Table 3 shows the classification accuracies of the compared methods under different levels of block corruption on the AR dataset. The proposed approach consistently outperforms the related methods. When there is no corruption, MLDL improves the average classification accuracy by at least 0.05% (= 96.42% − 96.37%).
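The random pixel corruption protocol used in these experiments (replacing a chosen percentage of pixels of each image with the value 255) can be sketched as follows; the image size and corruption level below are illustrative:

```python
import numpy as np

def corrupt(images, pct, rng):
    """Replace pct% of randomly chosen pixels of each (flattened) image
    with the value 255, as in the corruption protocol described above."""
    out = images.copy()
    n_pix = out.shape[1]
    k = int(round(pct / 100.0 * n_pix))
    for img in out:                       # each row is one image (a view into out)
        img[rng.choice(n_pix, size=k, replace=False)] = 255
    return out

# Illustrative usage: four fake 64x64 images with intensities in 0..254.
rng = np.random.default_rng(0)
imgs = rng.integers(0, 255, size=(4, 64 * 64)).astype(float)
noisy = corrupt(imgs, 10, rng)            # 10% of each image's pixels set to 255
```

In the experiments this corruption is applied per image before feature extraction; the random train/test splitting and the averaging over 20 runs are independent of it.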

Table 1
Comparison of MLDL and related multi-view feature learning methods (DL method; low-rank learning method; manner of redundancy reduction; classifier).

MCCA: no DL; no low-rank learning; maximizes the total correlation between any two views without considering redundancy reduction; the nearest neighbor (NN) classifier can be used.
MVDA: no DL; no low-rank learning; no redundancy reduction; the NN classifier can be used.
Mosabbeb et al. [16]: no DL; distributed nuclear norm minimization for matrix completion; no redundancy reduction; test samples are classified in the process of matrix completion.
LRCS: no DL; low-rank regularization on the projection and representation coefficient matrices; no redundancy reduction; the NN classifier can be used.
SMBR: no DL; no low-rank learning; no redundancy reduction; robust multimodal multivariate SRC.
MSRC: for each view, selects the most discriminative samples to construct the dictionary; no low-rank learning; encourages the classification results predicted for a sample by any two dictionaries to differ; majority voting over views for each testing sample.
MSDL: learns one dictionary in each view; no low-rank learning; no redundancy reduction; learns sparse coefficients with SRC for each view and feeds the fused coefficients of different views into an SVM.
UMD2L: jointly learns dictionaries for different views; no low-rank learning; vector-form based uncorrelated constraint for multi-view DL; multi-view SRC.
MLDL: jointly learns dictionaries for different views; low-rank regularization on each sub-dictionary; structural incoherence constraint for multi-view DL; multi-view collaborative representation based classification.


And when the corruption percentage is large (40% pixel corruption), MLDL improves the average classification accuracy by at least 9.85% (= 49.28% − 39.43%).

7.3. Object classification

The COIL-20 object dataset [34] contains 1440 grayscale images of 20 objects (72 images per object) under various poses. The objects are rotated through 360° and imaged at intervals of 5°.

The size of each image is 64 × 64 pixels. We replace a certain percentage of randomly selected pixels of each image with the pixel value 255. Fig. 4 shows example images of one object with 30% random pixel corruption. We extract Gabor transformation, KL transformation and LBP features to construct three feature sets. PCA is employed to reduce the dimension of these features to 100. 36 samples per class are randomly chosen to form the training set, and the remaining samples form the testing set. The

Fig. 2. Demo images (with 10% random pixel corruptions) of one subject from the Multi-PIE dataset.

Table 2
Average classification accuracies (± standard deviation, %, averaged over 20 random runs) of the compared methods on the Multi-PIE dataset with various corruption percentages. In the original, bold indicates the best results.

Method                 0%             10%            20%            30%            40%
MCCA                   94.06 ± 1.09   87.86 ± 1.63   76.34 ± 2.41   62.76 ± 2.63   45.01 ± 2.81
MVDA                   94.28 ± 0.92   91.12 ± 1.85   79.77 ± 2.06   60.20 ± 2.18   40.22 ± 2.30
Mosabbeb et al. [16]   93.17 ± 1.15   90.89 ± 2.47   87.25 ± 2.43   74.58 ± 3.44   53.28 ± 3.36
LRCS                   93.52 ± 1.32   91.95 ± 1.63   87.83 ± 2.26   77.21 ± 2.35   55.31 ± 2.76
SMBR                   91.57 ± 1.22   86.81 ± 1.75   81.38 ± 2.84   65.89 ± 2.72   42.69 ± 3.67
MSRC                   93.02 ± 1.31   77.25 ± 2.02   47.20 ± 3.41   35.59 ± 2.53   28.17 ± 3.11
MSDL                   92.85 ± 0.84   88.21 ± 1.78   82.93 ± 2.05   66.07 ± 2.69   45.20 ± 2.99
UMD2L                  95.85 ± 1.02   91.17 ± 1.93   85.53 ± 2.31   69.78 ± 2.16   47.72 ± 2.81
MLDL                   95.91 ± 1.05   94.53 ± 1.60   91.17 ± 1.84   82.69 ± 2.10   68.19 ± 2.44

Fig. 3. Demo images (with 20% block corruptions) of one subject from the AR dataset.

Table 3
Average classification accuracies (± standard deviation, %, averaged over 20 random runs) of the compared methods on the AR dataset with various corruption percentages. In the original, bold indicates the best results.

Method                 0%             10%            20%            30%            40%
MCCA                   93.61 ± 1.23   83.74 ± 1.72   63.26 ± 2.64   51.94 ± 3.22   35.05 ± 3.17
MVDA                   94.19 ± 1.31   84.76 ± 1.31   66.51 ± 2.42   54.25 ± 2.46   36.88 ± 2.87
Mosabbeb et al. [16]   92.12 ± 1.44   84.42 ± 1.51   70.85 ± 2.06   55.16 ± 2.52   39.43 ± 3.15
LRCS                   93.33 ± 1.09   84.38 ± 1.64   70.59 ± 2.06   55.27 ± 2.43   39.04 ± 2.54
SMBR                   91.83 ± 1.24   82.59 ± 1.80   69.32 ± 2.48   52.61 ± 3.18   36.71 ± 3.32
MSRC                   94.64 ± 1.83   79.12 ± 2.47   47.91 ± 3.45   39.10 ± 3.83   30.62 ± 3.70
MSDL                   93.29 ± 1.51   82.62 ± 1.73   70.05 ± 2.37   53.77 ± 2.42   37.48 ± 3.00
UMD2L                  96.37 ± 1.36   81.04 ± 1.38   70.49 ± 2.28   54.62 ± 2.40   38.11 ± 3.20
MLDL                   96.42 ± 1.15   87.59 ± 1.57   75.65 ± 2.19   63.69 ± 2.53   49.28 ± 2.75


Fig. 4. Demo images (with 30% random pixel corruptions) of one object from the COIL-20 dataset.

Table 4
Average classification accuracies (± standard deviation, %, averaged over 20 random runs) of the compared methods on the COIL-20 dataset with various corruption percentages. In the original, bold indicates the best results.

Method                 0%             10%            20%            30%            40%
MCCA                   95.64 ± 0.62   93.72 ± 1.19   89.05 ± 1.59   82.27 ± 1.62   75.44 ± 2.03
MVDA                   95.33 ± 0.64   94.36 ± 0.81   91.02 ± 1.47   84.88 ± 1.89   79.53 ± 2.27
Mosabbeb et al. [16]   95.26 ± 0.80   93.35 ± 1.33   90.35 ± 1.67   87.51 ± 1.85   81.19 ± 2.16
LRCS                   95.47 ± 0.68   94.69 ± 0.71   91.72 ± 1.43   87.73 ± 1.61   81.25 ± 1.84
SMBR                   92.85 ± 1.66   91.46 ± 2.14   88.15 ± 2.42   84.17 ± 2.96   75.85 ± 3.38
MSRC                   96.56 ± 1.21   89.54 ± 1.32   79.90 ± 1.92   76.59 ± 1.88   64.95 ± 2.14
MSDL                   96.33 ± 0.71   93.66 ± 0.95   90.54 ± 1.38   86.08 ± 1.73   78.20 ± 2.42
UMD2L                  97.74 ± 0.83   95.67 ± 0.87   90.70 ± 1.22   86.36 ± 1.54   79.43 ± 2.15
MLDL                   97.96 ± 0.73   97.04 ± 0.80   95.31 ± 0.84   90.41 ± 1.15   85.66 ± 1.60

Fig. 5. Demo images (with 40% random pixel corruptions) of ten digits from the MNIST dataset.

Fig. 6. Average classification accuracies of compared methods on the MNIST dataset with various corruption percentages (averaging over 20 random tests).

random selection process is performed 20 times, and we record the average experimental results for all compared methods. Table 4 lists the classification accuracies of all compared methods. As shown in Table 4, MLDL consistently outperforms all other compared methods under different levels of corruption. The multi-view DL methods UMD2L, MSDL and MSRC obtain good

Fig. 7. Rank of the MLDL sub-dictionary versus iteration number on the four datasets with 20% pixel corruption in the images.

classification results when no noise exists in the original images; however, their classification accuracies decrease quickly as the noise increases. In contrast, Mosabbeb et al.'s method, LRCS and our MLDL approach maintain desirable classification accuracies under large noise in the images. This demonstrates the advantage of low-rank regularization in handling noise.


7.4. Handwritten digit classification

The MNIST dataset [35] used in this experiment contains 1000 handwritten digit images (100 images per digit). The image size is 28 × 28 pixels. We replace a certain percentage of randomly selected pixels of each image with the pixel value 255. Fig. 5 shows examples of random pixel corruption on images of the ten digits. Gabor transformation, KL transformation and LBP features are extracted to build three feature sets. The feature dimension is reduced to 100 by PCA. We randomly select 40 samples per class for training, use the remainder for testing, and run all compared methods 20 times. Fig. 6 shows the average classification accuracies of all compared methods. MLDL consistently outperforms all competing methods under every corruption level, and the gap between MLDL and the other methods widens as the percentage of corrupted pixels increases.

7.5. Evaluation of the multi-view dictionary low-rank regularization term

In this subsection, we evaluate the effectiveness of the multi-view dictionary low-rank regularization term. We observe how the rank of each sub-dictionary changes as the iterations proceed. Fig. 7 shows the rank versus the number of iterations on the four datasets with 20% pixel corruption in the images; the ranks of the sub-dictionaries decrease quickly as the iterations increase on all datasets.
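The decreasing ranks in Fig. 7 are the expected effect of nuclear-norm based low-rank regularization, whose proximal step is singular value thresholding. A minimal sketch (the threshold and test matrix below are illustrative; this is not the paper's full ALM solver):

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_*.
    Shrinks every singular value by tau, zeroing the small ones."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Illustrative usage: a rank-3 matrix plus small dense noise.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))
X = low_rank + 0.01 * rng.standard_normal((20, 15))   # full rank due to noise
X_low = svt(X, 1.0)   # rank collapses toward that of the low-rank part
```

Repeatedly applying such a shrinkage step inside an alternating scheme is what drives the sub-dictionary ranks in Fig. 7 downward.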

Furthermore, we run MLDL with and without the multi-view dictionary low-rank regularization term on the four datasets. The classification accuracies on all datasets with 20% pixel corruption in the images are listed in Table 5. The designed term improves the classification accuracies, which demonstrates the effectiveness of the multi-view dictionary low-rank regularization.

7.6. Evaluation of the structural incoherence constraint

To evaluate the designed structural incoherence constraint, we run MLDL with and without the constraint. Table 6 shows the classification results on all datasets with 20% pixel corruption in the images. The structural incoherence constraint improves the classification results.

7.7. Evaluation of statistical significance of differences

To statistically analyze the classification results on the four datasets, we conduct McNemar's test [60], which provides the statistical significance of the difference between MLDL and each compared method. We use a significance level of 0.05: if the p-value is below 0.05, the performance difference between two methods is considered statistically significant. Table 7 shows the p-values between MLDL and the other compared methods on the four datasets with 20% pixel corruption in the images. According to the table, the proposed approach makes a statistically significant difference in comparison with the related methods.

7.8. Evaluation of time cost
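McNemar's test compares two classifiers through their disagreement counts on the same test set; a minimal sketch using the standard continuity-corrected chi-square statistic (the counts below are illustrative):

```python
import math

def mcnemar_p(b, c):
    """McNemar's test with continuity correction.
    b: test samples classifier A gets right and B gets wrong;
    c: the converse. Returns the two-sided p-value."""
    if b + c == 0:
        return 1.0
    chi2 = (abs(b - c) - 1.0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 dof:
    # P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))

p = mcnemar_p(2, 10)   # strongly asymmetric disagreements: significant at 0.05
```

A p-value below 0.05, as in the example above, means the two classifiers' error patterns differ beyond what chance would explain.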

Table 5
Average classification accuracies (%) of MLDL without/with the designed multi-view dictionary low-rank regularization term on all datasets with 20% pixel corruption in the images.

Dataset     Without low-rank term   With low-rank term
Multi-PIE   85.56                   91.17
AR          70.50                   75.65
COIL-20     90.78                   95.31
MNIST       70.48                   74.09

Table 6
Average classification accuracies (%) of MLDL without/with the designed structural incoherence constraint on all datasets.

Dataset     Without incoherence constraint   With incoherence constraint
Multi-PIE   89.42                            91.17
AR          73.97                            75.65
COIL-20     94.79                            95.31
MNIST       72.86                            74.09

Table 7
p-values between MLDL and the other compared methods on the four datasets with 20% pixel corruption in the images.

Method                 Multi-PIE        AR               COIL-20          MNIST
MCCA                   2.37 × 10^-11    3.68 × 10^-9     2.05 × 10^-7     6.08 × 10^-13
MVDA                   2.64 × 10^-8     1.95 × 10^-12    7.22 × 10^-13    1.82 × 10^-5
Mosabbeb et al. [16]   1.22 × 10^-9     4.02 × 10^-8     1.60 × 10^-12    6.50 × 10^-10
LRCS                   5.46 × 10^-7     7.80 × 10^-6     5.10 × 10^-12    3.87 × 10^-7
SMBR                   1.52 × 10^-10    4.25 × 10^-10    3.49 × 10^-13    4.99 × 10^-12
MSRC                   2.70 × 10^-10    8.44 × 10^-9     2.55 × 10^-8     7.35 × 10^-6
MSDL                   3.28 × 10^-9     1.62 × 10^-7     7.54 × 10^-9     1.43 × 10^-10
UMD2L                  9.53 × 10^-13    5.39 × 10^-10    4.97 × 10^-11    6.99 × 10^-11

Table 8
Average running time (s) of the different methods on the four datasets with 20% pixel corruption in the images.

Method                 Multi-PIE   AR        COIL-20   MNIST
MCCA                   1.49        1.14      0.49      0.30
MVDA                   7.45        4.51      2.06      1.28
Mosabbeb et al. [16]   276.63      232.59    108.47    40.81
LRCS                   294.17      194.51    62.08     15.33
SMBR                   1045.07     692.84    434.78    265.47
MSRC                   261.57      205.37    129.21    55.82
MSDL                   13.75       6.55      4.43      1.96
UMD2L                  3476.20     2206.57   1630.34   912.88
MLDL                   1665.22     813.59    582.39    469.81

In this section, we evaluate the computational cost of MLDL and the other compared methods on the four datasets. Table 8 reports the average running time of each compared method on all datasets with 20% pixel corruption in the images. The experiments use 32-bit computers with a 2.09-GHz dual-core processor and 4 GB of RAM. As Table 8 shows, multi-view sparse representation classification and DL methods usually cost much more time than


multi-view subspace learning methods like MCCA and MVDA. We can also see that the running time of MLDL lies between those of SMBR and UMD2L, which demonstrates the efficiency of the designed structurally incoherent dictionary learning procedure and collaborative representation based classification scheme.
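Part of the speedup credited above comes from the collaborative representation based classification scheme, which replaces the iterative l1 sparse coding with a closed-form l2-regularized code. A minimal single-view sketch in the spirit of CRC [54] (the paper's actual scheme fuses residuals over multiple views; γ = 0.005 matches the setting reported in Section 7.1):

```python
import numpy as np

def crc_classify(D, labels, y, gamma=0.005):
    """Collaborative representation based classification (single view):
    solve the ridge problem min ||y - D a||^2 + gamma * ||a||^2 in closed
    form, then pick the class with smallest regularized residual."""
    D = D / np.linalg.norm(D, axis=0)   # unit-norm atoms
    alpha = np.linalg.solve(D.T @ D + gamma * np.eye(D.shape[1]), D.T @ y)
    classes = np.unique(labels)
    res = []
    for c in classes:
        m = labels == c
        res.append(np.linalg.norm(y - D[:, m] @ alpha[m])
                   / (np.linalg.norm(alpha[m]) + 1e-12))
    return classes[int(np.argmin(res))]
```

Because the coding step is a single linear solve (whose system matrix can even be pre-factorized once for all test samples), it is far cheaper than the iterative l1 minimization used by SRC-style schemes, which matches the running-time gap observed in Table 8.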

8. Conclusions

In this paper, we incorporate, for the first time, the low-rank learning technique into multi-view dictionary learning and propose the MLDL approach for image classification. MLDL provides a dictionary low-rank regularization term to address the multi-view DL problem under large noise, and designs a structural incoherence constraint for multi-view DL so that redundancy among the dictionaries of different views is reduced. MLDL jointly learns multiple discriminative dictionaries by making dictionary atoms correspond to class labels. In addition, MLDL provides a novel classification scheme for multi-view DL methods, which uses l2-norm regularization instead of l1-norm regularization so that the complexity of the classification process is significantly decreased. We evaluate the classification performance of our approach and several multi-view feature learning methods on various multi-view datasets. The experimental results demonstrate that MLDL is superior to state-of-the-art multi-view DL methods and other representative multi-view feature learning methods, especially in the presence of noise or corruption. The experiments also validate the effectiveness of the designed multi-view dictionary low-rank regularization term and structural incoherence constraint, and indicate that the designed structurally incoherent DL procedure and classification scheme significantly reduce computing time.

Conflict of interest statement

None declared.

Acknowledgments

The work described in this paper was supported by the National Natural Science Foundation of China under Project nos. 61272273, 61233011, 61272203, 61533010, 61231015, and the 863 Program under Project no. 2015AA016306.

References

[1] T. Lin, S. Liu, H. Zha, Incoherent dictionary learning for sparse representation, in: Proceedings of the 21st International Conference on Pattern Recognition (ICPR), 2012, pp. 1237–1240.
[2] L. Ma, C. Wang, B. Xiao, W. Zhou, Sparse representation for face recognition based on discriminative low-rank dictionary learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2586–2593.
[3] Z. Jiang, Z. Lin, L.S. Davis, Label consistent K-SVD: learning a discriminative dictionary for recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2651–2664.
[4] M. Yang, L. Zhang, X. Feng, D. Zhang, Fisher discrimination dictionary learning for sparse representation, in: Proceedings of the 13th IEEE International Conference on Computer Vision (ICCV), 2011, pp. 543–550.
[5] D. Barchiesi, M.D. Plumbley, Learning incoherent dictionaries for sparse approximation using iterative projections and rotations, IEEE Trans. Signal Process. 61 (8) (2013) 2055–2065.
[6] L. Li, S. Li, Y. Fu, Learning low-rank and discriminative dictionary for image classification, Image Vis. Comput. 32 (10) (2014) 814–823.
[7] R. Memisevic, On multi-view feature learning, in: Proceedings of the 29th International Conference on Machine Learning (ICML), 2012, pp. 161–168.
[8] C. Xu, D. Tao, C. Xu, A survey on multi-view learning, arXiv preprint arXiv:1304.5634, 2013.


[9] Y. Guo, Convex subspace representation learning from multi-view data, in: Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI), 2013, pp. 387–393.
[10] Y.H. Yuan, Q.S. Sun, H.W. Ge, Fractional-order embedding canonical correlation analysis and its applications to multi-view dimensionality reduction and recognition, Pattern Recognit. 47 (3) (2014) 1411–1424.
[11] J. Rupnik, J. Shawe-Taylor, Multi-view canonical correlation analysis, in: Proceedings of the International Conference on Data Mining and Data Warehouses (SiKDD), 2010, pp. 1–4.
[12] M. Kan, S. Shan, H. Zhang, S. Lao, X. Chen, Multi-view discriminant analysis, in: Proceedings of the 12th European Conference on Computer Vision (ECCV), 2012, pp. 808–821.
[13] A. Kumar, H. Daumé, A co-training approach for multi-view spectral clustering, in: Proceedings of the 28th International Conference on Machine Learning (ICML), 2011, pp. 393–400.
[14] A. Kumar, P. Rai, H. Daumé, Co-regularized multi-view spectral clustering, in: Proceedings of the 24th Advances in Neural Information Processing Systems (NIPS), 2011, pp. 1413–1421.
[15] P. Dhillon, D.P. Foster, L.H. Ungar, Multi-view learning of word embeddings via CCA, in: Proceedings of the 24th Advances in Neural Information Processing Systems (NIPS), 2011, pp. 199–207.
[16] E.A. Mosabbeb, K. Raahemifar, M. Fathy, Multi-view human activity recognition in distributed camera sensor networks, Sensors 13 (7) (2013) 8750–8770.
[17] Z. Ding, Y. Fu, Low-rank common subspace for multi-view learning, in: Proceedings of the International Conference on Data Mining (ICDM), 2014, pp. 110–119.
[18] I. Tošić, P. Frossard, Dictionary learning for stereo image representation, IEEE Trans. Image Process. 20 (4) (2011) 921–934.
[19] S. Zheng, B. Xie, K. Huang, D. Tao, Multi-view pedestrian recognition using shared dictionary learning with group sparsity, in: Proceedings of the 18th International Conference on Neural Information Processing (ICONIP), 2011, pp. 629–638.
[20] H. Zhang, N.M. Nasrabadi, Y. Zhang, T.S. Huang, Joint dynamic sparse representation for multi-view face recognition, Pattern Recognit. 45 (4) (2012) 1290–1298.
[21] X. Zhang, D.S. Pham, S. Venkatesh, W. Liu, D. Phung, Mixed-norm sparse representation for multiview face recognition, Pattern Recognit. 48 (9) (2015) 2935–2946.
[22] F. Zhu, L. Shao, M. Yu, Cross-modality submodular dictionary learning for information retrieval, in: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), 2014, pp. 1479–1488.
[23] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview Hessian discriminative sparse coding for image annotation, Comput. Vis. Image Underst. 118 (2014) 50–60.
[24] S. Shekhar, V.M. Patel, N.M. Nasrabadi, R. Chellappa, Joint sparse representation for robust multimodal biometrics recognition, IEEE Trans. Pattern Anal. Mach. Intell. 36 (1) (2014) 113–126.
[25] G. Monaci, P. Jost, P. Vandergheynst, B. Mailhe, S. Lesage, R. Gribonval, Learning multimodal dictionaries, IEEE Trans. Image Process. 16 (9) (2007) 2272–2283.
[26] Y. Zhuang, Y. Wang, F. Wu, Y. Zhang, W. Lu, Supervised coupled dictionary learning with group structures for multi-modal retrieval, in: Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI), 2013, pp. 1070–1076.
[27] Y. Shi, Y. Gao, Y. Yang, Y. Zhang, D. Wang, Multi-modal sparse representation-based classification for lung needle biopsy images, IEEE Trans. Biomed. Eng. 60 (10) (2013) 2675–2685.
[28] Z. Gao, H. Zhang, G.P. Xu, Y.B. Xue, A.G. Hauptmann, Multi-view discriminative and structured dictionary learning with group sparsity for human action recognition, Signal Process. 112 (2015) 83–97.
[29] M.J. Gangeh, P. Fewzee, A. Ghodsi, M.S. Kamel, F. Karray, Multiview supervised dictionary learning in speech emotion recognition, IEEE/ACM Trans. Audio Speech Lang. Process. 22 (6) (2014) 1056–1068.
[30] X. Jing, R. Hu, F. Wu, X. Chen, Q. Liu, Y. Yao, Uncorrelated multi-view discrimination dictionary learning for recognition, in: Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), 2014, pp. 2787–2795.
[31] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[32] D. Cai, X. He, J. Han, H.J. Zhang, Orthogonal Laplacianfaces for face recognition, IEEE Trans. Image Process. 15 (11) (2006) 3608–3614.
[33] A.M. Martinez, R. Benavente, The AR Face Database, CVC Technical Report 24, 1998.
[34] H. Murase, S.K. Nayar, Visual learning and recognition of 3-D objects from appearance, Int. J. Comput. Vis. 14 (1) (1995) 5–24.
[35] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[36] K. Engan, S.O. Aase, J.H. Husoy, Method of optimal directions for frame design, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1999, pp. 2443–2446.
[37] J.A. Tropp, Greed is good: algorithmic results for sparse approximation, IEEE Trans. Inf. Theory 50 (10) (2004) 2231–2242.
[38] M. Aharon, M. Elad, A.M. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[39] E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J. ACM 58 (3) (2011) 11.


[40] E.J. Candès, B. Recht, Exact matrix completion via convex optimization, Found. Comput. Math. 9 (6) (2009) 717–772.
[41] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: Proceedings of the 27th International Conference on Machine Learning (ICML), 2010, pp. 663–670.
[42] C.F. Chen, C.P. Wei, Y.C. Wang, Low-rank matrix recovery with structural incoherence for robust face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2618–2625.
[43] Z. Zheng, M. Yu, J. Jia, H. Liu, D. Xiang, X. Huang, J. Yang, Fisher discrimination based low rank matrix recovery for face recognition, Pattern Recognit. 47 (11) (2014) 3502–3511.
[44] V. Chandrasekaran, S. Sanghavi, P.A. Parrilo, A.S. Willsky, Rank-sparsity incoherence for matrix decomposition, SIAM J. Optim. 21 (2) (2011) 572–596.
[45] A. Ganesh, Z. Lin, J. Wright, L. Wu, M. Chen, Y. Ma, Fast algorithms for recovering a corrupted low-rank matrix, in: Proceedings of the 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2009, pp. 213–216.
[46] Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, arXiv preprint arXiv:1009.5055, 2010.
[47] I. Ramirez, P. Sprechmann, G. Sapiro, Classification and clustering via dictionary learning with structured incoherence and shared features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3501–3508.
[48] Y. Zhang, Z. Jiang, L.S. Davis, Discriminative tensor sparse coding for image classification, in: Proceedings of the British Machine Vision Conference (BMVC), 2013.
[49] L. Rosasco, A. Verri, M. Santoro, S. Mosci, S. Villa, Iterative Projection Methods for Structured Sparsity Regularization, MIT Technical Report MIT-CSAIL-TR-2009-050, CBCL-282, Massachusetts Institute of Technology, 2009.
[50] D.P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, 1982.
[51] R.A. Smith, Matrix equation XA + BX = C, SIAM J. Appl. Math. 16 (1) (1968) 198–201.
[52] G. Golub, S. Nash, C. Van Loan, A Hessenberg–Schur method for the problem AX + XB = C, IEEE Trans. Autom. Control 24 (6) (1979) 909–913.
[53] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1794–1801.
[54] L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: which helps face recognition? in: Proceedings of the 13th IEEE International Conference on Computer Vision (ICCV), 2011, pp. 471–478.
[55] J. Waqas, Z. Yi, L. Zhang, Collaborative neighbor representation based classification using l2-minimization approach, Pattern Recognit. Lett. 34 (2) (2013) 201–208.
[56] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cogn. Neurosci. 3 (1) (1991) 71–86.
[57] S.E. Grigorescu, N. Petkov, P. Kruizinga, Comparison of texture features based on Gabor filters, IEEE Trans. Image Process. 11 (10) (2002) 1160–1167.
[58] K. Fukunaga, W.L. Koontz, Application of the Karhunen–Loève expansion to feature selection and ordering, IEEE Trans. Comput. 19 (4) (1970) 311–318.
[59] T. Ahonen, A. Hadid, M. Pietikainen, Face description with local binary patterns: application to face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 28 (12) (2006) 2037–2041.
[60] B.A. Draper, W.S. Yambor, J.R. Beveridge, Analyzing PCA-based face recognition algorithms: eigenvector selection and distance measures, in: Empirical Evaluation Methods in Computer Vision, World Scientific Press, Singapore, 2002, pp. 1–15.

Fei Wu is now a visiting student at the School of Computer, Wuhan University, China and a Ph.D. student at the Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, China. His research interest includes pattern recognition, biometrics and machine learning.

Xiao-Yuan Jing is now a Professor at the School of Computer, Wuhan University, and a Professor at the Nanjing University of Posts and Telecommunications, China. His research interest includes pattern recognition, feature extraction and feature learning, machine learning, artificial intelligence and software engineering.

Xin-Ge You is now a Professor and Vice Dean at the School of Electronic Information and Communications, Huazhong University of Science and Technology, China. His research interest includes pattern recognition, image processing, document recognition and machine learning.

Dong Yue is now a Professor and Dean at the Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, China. His research interest includes automatic control, data analysis and processing.

Rui-Min Hu is now a Professor and Dean at the School of Computer, Wuhan University, China. His research interest includes pattern recognition, multimedia, biometrics and information security.

Jing-Yu Yang is now a Professor at the College of Computer Science, Nanjing University of Science and Technology, China. His research interest includes pattern recognition, image processing, machine learning and artificial intelligence.
