Exploiting Attribute Correlations: A Novel Trace Lasso-Based Weakly Supervised Dictionary Learning Method Lin Wu, Yang Wang, and Shirui Pan
Abstract—It is now well established that sparse representation models work effectively for many visual recognition tasks and have pushed forward the success of dictionary learning therein. Recent studies on dictionary learning focus on learning discriminative atoms instead of purely reconstructive ones. However, the existence of intraclass diversity (i.e., data objects within the same category that exhibit large visual dissimilarities) and interclass similarity (i.e., data objects from distinct classes that share substantial visual similarity) makes it challenging to learn effective recognition models. A large number of labeled data objects would be required to learn models that can characterize these subtle differences. However, labeled data objects are often scarce, making it difficult to learn a monolithic dictionary that is discriminative enough. To address the above limitations, in this paper we propose a weakly-supervised dictionary learning method that automatically learns a discriminative dictionary by fully exploiting visual attribute correlations rather than label priors. In particular, the intrinsic attribute correlations are used as a critical cue to guide the grouping of objects into categories, and a set of subdictionaries is then jointly learned, one for each category. The resulting dictionary is highly discriminative and leads to intraclass-diversity-aware sparse representations. Extensive experiments on image classification and object recognition demonstrate the effectiveness of our approach.

Index Terms—Sparse representation, trace lasso, weakly-supervised dictionary learning.
I. INTRODUCTION

DICTIONARY learning and sparse representation models [16], [27], [34], [48], [53] have achieved state-of-the-art performance in various visual recognition tasks such as image classification [2], [37], face recognition [46], [51], and zero-shot learning [18].
Manuscript received July 19, 2016; accepted September 18, 2016. This work was supported by the UTS Early Career Researcher under Grant PRO16-1383. This paper was recommended by Associate Editor M. Shin. (Corresponding author: Lin Wu.) L. Wu is with the University of Adelaide, Adelaide, SA 5005, Australia (e-mail:
[email protected]). Y. Wang is with the University of New South Wales, Sydney, NSW 2052, Australia (e-mail:
[email protected]). S. Pan is with the Centre of Quantum Computation and Intelligent Systems, University of Technology Sydney, Broadway, NSW 2007, Australia (e-mail:
[email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2016.2612686
Compared with handcrafted features, sparse representations are more representative and discriminative with lower dimensionality, and can thus potentially improve recognition performance. In sparse coding, a high-dimensional visual object is recovered as a sparse linear combination of a set of nonparametric basis vectors, known as a dictionary. Originally, predefined dictionaries based on various types of wavelets were used. However, a wealth of existing studies [24], [25], [31], [44] have shown that learning a dictionary and the corresponding sparse representations, instead of using predefined ones, can dramatically improve recognition performance.

Original formulations for dictionary learning and sparse representation [1], [20] (also known as unsupervised learning) are developed by minimizing the reconstruction error between the original signal and its sparse representation in the space of the learned dictionary. Although this paradigm is optimal for problems such as denoising and coding, it may lead to suboptimal solutions in classification tasks, where the ultimate goal is to make the learned dictionary and the corresponding sparse representations as discriminative as possible. Orthogonally, in order to promote the discriminative power of a learned dictionary, supervised dictionary learning methods [28], [50], [51] incorporate label information into the learning of dictionary atoms and sparse approximation coefficients. Specifically, they usually learn a set of subdictionaries, each of which corresponds to a specific category/class. In spite of substantial research efforts on supervised dictionary learning, this pipeline still suffers from some limitations.

1) Intraclass Diversity and Interclass Similarity Curse: Visual objects from the same category may be visually different, while those from different categories can be visually similar. Such visual dissimilarities (similarities) within (between) categories, induced by pose, viewpoint, and appearance variations, make it difficult to learn a monolithic dictionary in the context of diverse samples. For example, in Fig. 2(a), three fine-grained subcategories of tower, "watch tower," "water tower," and "common tower," are visually quite different from each other yet belong to the same category. Also, watch towers look more similar to buildings because they have only subtle visual differences. Thereby, the existence of intraclass diversity and interclass similarity prohibits learning a discriminative dictionary for a particular category.
Fig. 1. Our framework consists of two phases. The first phase is to group visual objects into different categories through the learned sparse representations by exploiting attribute correlations. Then, we learn multiple subdictionaries, each of which corresponds to a subcategory. In the second phase, a test sample is recovered by a sparse linear combination of atoms from the learned dictionary.
2) Large amounts of labeled data samples are often unavailable in practice. Annotating images with class labels that precisely describe their visual content is extremely laborious. Worse still, it is inevitably user-biased.

To address the salient limitations above, we need alternative sources of information that relate object classes. Visual attributes [10], [11], [18], [35], [38], which describe well-known common characteristics of objects, are an appealing source of information, and they can be easily obtained through crowdsourcing techniques [6], [30] or generated by well-developed toolkits [4], [36]. Attributes provide a means to describe such intraclass diversity. They model shared characteristics of objects such as color and texture, which are easily annotated by humans and converted to a machine-readable vector format [2] [see examples in Fig. 2(b)].1 The emergence of visual attributes has motivated another stream in dictionary learning and sparse representation, known as weakly-supervised models. As a consequence, visual attributes are well studied, leading to impressive results in a variety of visual recognition tasks [3], [5], [33], [39], [41], [43], [45], [47]. Nonetheless, these methods consider each attribute independently, despite the informative knowledge revealed by attribute correlations. To illustrate this insight, we show an example below.

Example 1: In Fig. 2(b), attributes "white," "eats fish," and "water" are combined to represent "polar bear." When it comes to "bird," attributes "has beak," "has wing," and "feather" are inter-related. Thereby, attribute correlations are critical priors to be exploited to form distinct visual object categories, from which a discriminative dictionary can be learned.

1 In this paper, the association between an attribute and a category is defined as a binary value indicating the presence/absence of the attribute.
Instead of learning a common dictionary over the whole data set, in this paper we propose to jointly learn multiple subdictionaries, each of which corresponds to a particular subcategory. Each subcategory is defined to contain visual objects whose attributes are statistically and semantically correlated [7]. In this paper, subcategories are formed by grouping samples through their sparse representations regularized by trace lasso [13]. This leads to a novel weakly-supervised, discriminative dictionary learning framework for visual recognition.

We remark that effectively discovering attribute correlations is technically nontrivial because attribute associations are local. That is, an attribute correlation may hold only over a subset of visual objects rather than statistically over the whole data set. This may unfavorably place dissimilar visual objects into the same subcategory, as illustrated in Fig. 3. To combat this problem, we propose to embed the co-occurrence of attributes into dictionary learning. By doing so, visual objects with similar sparse representations and statistically correlated attributes can be grouped into one cluster. Specifically, our framework can be decomposed into two phases. First, encoded by multiple attributes, visual objects are grouped in terms of their sparse codes, where a trace lasso [13] based method is proposed to automatically reveal the grouping structure. Second, a set of subdictionaries is learned, each of which corresponds to a subcategory/group. We show a sketch of this framework in Fig. 1. To the best of our knowledge, we are the first to exploit attribute correlations in dictionary learning and sparse representation coupled with trace lasso.

The main contributions of this paper can be summarized as follows.

1) We present a novel weakly-supervised dictionary learning method that exploits visual attribute correlations. The proposed method is label free, cost-effective, and widely applicable.

2) We propose a novel objective function, regularized by trace lasso, to adaptively cluster visual objects into different groups based on attribute correlations.
Fig. 2. (a) The tower category includes three distinct subcategories, namely, watch tower, water tower, and common tower. This intraclass diversity makes it difficult to learn a monolithic dictionary for the category tower, and a single categorical dictionary is usually unable to discriminatively encode ambiguous samples such as watch towers. (b) Visual objects usually contain multiple attributes, which are combined to represent a class of objects.
Fig. 3. Three images selected from the AWA data set [17] are associated with three attributes: sky, wing, and feather. The middle image is more related to the right image than to the left one, since the attribute wing is found to be statistically closer to feather than to sky in the AWA data set.
These correlations, revealed by the attribute distribution over visual objects, can serve as a critical cue to produce better grouping effects. Multiple discriminative subdictionaries are jointly learned, each corresponding to a group.

3) Extensive experiments on real-world benchmarks are conducted to show the superiority of our approach.

The rest of this paper is structured as follows. Section II surveys related works. We formulate the problem and present our approach in Section III. The optimization solution is derived in Section III-C, followed by the dictionary learning process in Section IV. Experimental evaluations are given in Section V, and we conclude this paper in Section VI.

II. RELATED WORK

This paper focuses on boosting visual object recognition by endowing dictionary learning with more discriminative properties. Thus, it is closely related to dictionary learning and sparse coding. State-of-the-art approaches to dictionary learning and sparse representation can be roughly categorized into three divisions: 1) unsupervised; 2) supervised; and 3) weakly supervised methods.

Prior works on unsupervised dictionary learning and sparse representation [1], [5], [8], [20] focused on learning a dictionary through a reconstruction optimization that minimizes the residual error of reconstructing the original signals. An appealing algorithm, the method of optimal directions, was presented by Engan et al. [8], which iteratively updates the dictionary by taking the centroids of nearest-neighbor clustering as atoms. Aharon et al. [1] generalized the K-means clustering process and proposed the K-singular value decomposition (SVD) algorithm to learn an over-complete dictionary from image patches. More recently, Chiang et al. [5] augmented K-means clustering with attribute similarities, and the resulting clusters were regarded as dictionaries.
Since these methods focus on the reconstruction power of the dictionary while ignoring its discrimination capability, the learned dictionaries usually yield impressive reconstruction quality but are not effective for image classification.

To make the learned dictionary discriminative, many supervised dictionary learning approaches have been developed to learn category-dependent dictionaries [14], [28], [50]–[52]. Building on K-SVD, Zhang and Li [51] achieved reconstructive and discriminative dictionary learning in a unified process. Yang et al. [50] augmented the reconstructive objective function with the Fisher discriminant criterion as an additional discrimination term. Considering that objects from different categories can be visually similar to each other, a joint dictionary learning algorithm was presented to leverage interobject visual correlations [52]; it jointly learns multiple category-specific dictionaries and a commonly shared dictionary for a group of visually correlated objects. In spite of the effectiveness of these approaches, image-level labels are required for subdictionary learning, and data labels are not always available in practice. Moreover, manual labeling is rather laborious and inevitably user-biased. Even if training samples are precisely labeled, intraclass diversities and interclass similarities still make it difficult to learn a monolithic dictionary that best fits the data. In contrast, the solution proposed in this paper is based on attribute-guided subcategory learning. Visual attributes have intrinsic human-nameable qualities and can be easily computed. Also, subcategories formed by grouping visual objects with highly correlated attributes can yield a more discriminative dictionary.

Recent studies show that weakly-supervised dictionary learning and sparse representation are attracting increasing attention. A method proposed by Cao et al. [4] optimizes attribute-guided sparse codes over a visual codebook. However, it does not explicitly learn category-specific dictionaries, and attributes are only used in the sparse coding step. Moreover, each attribute is treated individually while informative attribute correlations are ignored. A work closely related to ours is attribute-aware dictionary learning (AttrDL) [3], where multiple dictionaries are learned and each dictionary corresponds to a single attribute. In fact, attributes are essentially inter-related rather than independent. Hence, we account for attribute correlations and group visual objects into subcategories by their membership in attribute associations, from which subcategory-aware subdictionaries can be learned.
III. OUR APPROACH

In this section, we first introduce the notation used throughout the rest of this paper, and then we shed light on multiattribute categorization, where visual data objects with semantically common attributes are grouped automatically. With all data categories collected, we learn a subdictionary for each category, which together form a discriminative dictionary.

A. Notations

Given a visual data set $X = \{x_1, \ldots, x_i, \ldots, x_N\}$ and a set of $K$ attributes associated with $X$, each $x_i$ can be represented by an attribute vector $x_i = \{x_{i1}, \ldots, x_{ij}, \ldots, x_{iK}\}$, where each entry is a binary value that indicates the presence of an attribute. That is, $x_{ij} = 1$ if $x_i$ has attribute $j$, and $x_{ij} = 0$ otherwise. Thus, we have the attribute-encoded data representation $X \in \mathbb{R}^{K \times N}$. On the other hand, visual attributes can also be represented by their distribution over visual data objects. In particular, each attribute $a_i$ can be encoded by an $N$-dimensional vector, where each entry is a binary value that indicates the occurrence of that attribute on the data samples. That is, $a_{ij} = 1$ if $x_j$ contains $a_i$, and $0$ otherwise. Accordingly, we obtain an attribute representation matrix $X^{(a)} \in \mathbb{R}^{N \times K}$ from the data distribution perspective.
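To make the two views of the data concrete, the following minimal sketch (Python/NumPy; the toy attribute and object names are purely illustrative and not taken from the paper) builds the attribute-encoded data matrix $X \in \mathbb{R}^{K \times N}$ and the attribute representation matrix $X^{(a)} \in \mathbb{R}^{N \times K}$; under the binary-encoding convention above, $X^{(a)}$ is simply the transpose of $X$.

import numpy as np

# Hypothetical toy setup: K = 4 attributes, N = 3 visual objects (illustrative only).
attributes = ["white", "eats fish", "water", "feather"]
objects = ["polar bear", "seal", "duck"]

# Binary annotations: presence[n][k] = 1 if object n has attribute k.
presence = np.array([[1, 1, 1, 0],   # polar bear
                     [0, 1, 1, 0],   # seal
                     [0, 1, 1, 1]])  # duck

# Attribute-encoded data representation: column i is the attribute vector of x_i.
X = presence.T            # shape (K, N)

# Attribute representation from the data-distribution view:
# column j is the occurrence pattern of attribute a_j over the N objects.
X_a = presence            # shape (N, K), i.e., X_a equals X.T

assert np.array_equal(X_a, X.T)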
B. Exploiting Attribute Correlations for Data Object Grouping

Given $X$, it is straightforward to group into the same category visual data objects that share a common subset of attributes. To this end, we learn sparse representations by considering the linear-representation relationships among dictionary atoms, where each data object is linearly represented and reconstructed by the remaining objects from the same category. Specifically, for each $x_i$, the learning of its sparse representation $s_i$ can be formulated as

$$\min_{s_i} \frac{1}{2}\left\|x_i - \hat{X}_i s_i\right\|_2^2 + \alpha \left\|\hat{X}_i \mathrm{Diag}(s_i)\right\|_* \quad (1)$$

where $\hat{X}_i \in \mathbb{R}^{K \times (N-1)}$ is obtained by excluding $x_i \in \mathbb{R}^K$ from $X$, $\mathrm{Diag}(s_i)$ converts $s_i \in \mathbb{R}^{N-1}$ into a diagonal matrix whose $j$th diagonal entry is the $j$th entry of $s_i$, and $\alpha > 0$ is a balancing parameter between the fitting accuracy and sparsity. The regularizer $\|X\mathrm{Diag}(s_i)\|_*$, named trace lasso [13], is a recently established norm that interpolates between the $\ell_1$-norm and the $\ell_2$-norm of $s_i$ in terms of the grouping effect. The main difference between trace lasso and existing norms is its superior grouping effect: it assigns nearly equal positive values (nearly 0) to coefficients of $s_i$ if and only if the corresponding data objects have high (weak or no) correlation with $x_i$, as proved by Liu et al. [23]. It has been shown that sparse representations optimized under trace lasso reflect data correlations well [40], [42]. Thus, in the attribute space, visual objects $x_i$ can be clustered through trace lasso if they overlap in some attribute dimensions.

However, (1) treats each attribute independently and equally, overlooking the fact that the strengths of attribute correlations vary. This means we need to consider attribute distributions over data objects globally, i.e., the co-occurrence of attributes over the whole set of data. Doing so incorporates co-occurrence statistics of attributes that frequently appear together within the data set. Intuitively, the co-occurrence statistics encode meaningful correlations, since semantically similar attributes such as "ice" and "water" occur together more frequently than semantically dissimilar attributes such as "ice" and "fashion." To illustrate this observation, we provide an example in Fig. 3. The middle object is encoded as $S_2 = [1, 1, 1]$; however, it is difficult to determine which object is more closely related to it, because both the left and the right objects share two common attributes with the middle one. This raises a question: is the correlation between "sky" and "wing" stronger than that between "wing" and "feather"? The answer is no, as wing and feather are found to co-occur more frequently than wing and sky in the animals with attributes (AWA) data set [17].

To obtain the distribution of attribute correlation strengths, we propose an objective function that optimizes $s_i^{(a)}$ for the attribute $x_i^{(a)}$:

$$\min_{s_i^{(a)}} \frac{1}{2}\left\|x_i^{(a)} - \hat{X}_i^{(a)} s_i^{(a)}\right\|_2^2 + \beta \left\|\hat{X}_i^{(a)} \mathrm{Diag}\left(s_i^{(a)}\right)\right\|_* \quad (2)$$

where $\hat{X}_i^{(a)} \in \mathbb{R}^{N \times (K-1)}$ is obtained by excluding $x_i^{(a)} \in \mathbb{R}^N$ from $X^{(a)}$, $\mathrm{Diag}(s_i^{(a)})$ converts the sparse vector $s_i^{(a)} \in \mathbb{R}^{K-1}$ into a diagonal matrix, and $\beta$ is a balancing parameter. The sparse code $s_i^{(a)}$ obtained by minimizing (2) encodes the attribute correlation strengths between $x_i^{(a)}$ and $x_j^{(a)}$ $(j \neq i)$ globally and statistically over all visual objects. We provide more explanation of (2) below.

Remarks on (2): For each attribute $x_i^{(a)}$, its sparse code $s_i^{(a)}$ revealed by trace lasso identifies a set of attributes that are highly and statistically-semantically correlated with $x_i^{(a)}$ over a specific data set (e.g., the AWA data set [17]). Apparently, for different attributes $x_i^{(a)}$ and $x_k^{(a)}$ $(i \neq k)$, the associated sparse codes $s_i^{(a)}$ and $s_k^{(a)}$ can differ, each reflecting the correlation strengths toward the remaining attributes.

Combining (1) and (2), we have our objective

$$\min_{S, G, S^{(a)}} \left\|X - {S^{(a)}}^T G S\right\|_2^2 + \gamma_1 J + \gamma_2 Q \quad \text{s.t.}\ S^{(a)} \geq 0,\ S \geq 0 \quad (3)$$

where

1) the matrix $G \in \mathbb{R}^{(K-1) \times (N-1)}$, bridging $S^{(a)} \in \mathbb{R}^{(K-1) \times K}$ and $S \in \mathbb{R}^{(N-1) \times N}$, performs a more general tri-factorization reconstruction of $X$. More intuitively, $G$ implicitly models the mutual promotion between the data correlations characterized by $S$ and the attribute correlations encoded by $S^{(a)}$. That is, in each iteration of the optimization (see Section III-C for details), the update of the data correlation matrix $S$ affects the update of the attribute correlation matrix $S^{(a)}$, and vice versa;

2) $J = \sum_{i=1}^{N} \frac{1}{2}\|x_i - \hat{X}_i s_i\|_2^2 + \alpha\|\hat{X}_i \mathrm{Diag}(s_i)\|_*$ is identical to (1);

3) $Q = \sum_{i=1}^{K} \frac{1}{2}\|x_i^{(a)} - \hat{X}_i^{(a)} s_i^{(a)}\|_2^2 + \beta\|\hat{X}_i^{(a)} \mathrm{Diag}(s_i^{(a)})\|_*$ is identical to (2); $s_i^{(a)}$ and $s_i$ denote the $i$th columns of $S^{(a)}$ and $S$, respectively.
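Since the trace lasso term $\|\hat{X}_i \mathrm{Diag}(s_i)\|_*$ is what drives the grouping behavior described above, a minimal sketch follows (Python/NumPy; random data, purely illustrative). It evaluates the penalty as a nuclear norm and checks the interpolation property cited from [13]: the penalty reduces to the $\ell_1$-norm when the design columns are orthonormal and to the $\ell_2$-norm when they are identical (perfectly correlated).

import numpy as np

def trace_lasso(X, s):
    """Trace lasso penalty ||X Diag(s)||_* (sum of singular values)."""
    return np.linalg.norm(X @ np.diag(s), ord='nuc')

rng = np.random.default_rng(0)
s = rng.standard_normal(5)

# Orthonormal columns: trace lasso equals the l1-norm of s.
X_orth = np.linalg.qr(rng.standard_normal((8, 5)))[0]
print(np.isclose(trace_lasso(X_orth, s), np.abs(s).sum()))      # True

# Identical (perfectly correlated) unit-norm columns: trace lasso equals the l2-norm of s.
x = rng.standard_normal(8)
x /= np.linalg.norm(x)
X_same = np.tile(x[:, None], (1, 5))
print(np.isclose(trace_lasso(X_same, s), np.linalg.norm(s)))    # True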
C. Optimization

Owing to the nonsmoothness of (3) with respect to $G$, $S^{(a)}$, and $S$ caused by the nuclear norm, we adopt the augmented Lagrange multiplier (ALM) method [22], alternately optimizing each variable while fixing the others in each iteration. This procedure repeats until convergence.

1) Computation of G: This is equivalent to optimizing

$$\min_G \left\|X - {S^{(a)}}^T G S\right\|_2^2. \quad (4)$$

Setting the derivative with respect to $G$ to zero, we obtain the update rule

$$G = \left(S^{(a)} {S^{(a)}}^T\right)^{-1} S^{(a)} X S^T \left(S S^T\right)^{-1}. \quad (5)$$

2) Computation of $S^{(a)}$: We optimize each column $s_i^{(a)}$ of $S^{(a)}$ individually. Following the optimization used in low-rank minimization [21], we adopt the ALM [22] to solve the problem regarding $s_i^{(a)}$:

$$\min_{s_i^{(a)}} \left\|X(i,\cdot) - {s_i^{(a)}}^T G S\right\|_2^2 + \gamma_1 \left(\frac{1}{2}\left\|x_i^{(a)} - \hat{X}_i^{(a)} s_i^{(a)}\right\|_2^2 + \beta \left\|R_i^{(a)}\right\|_*\right)$$
$$\text{s.t.}\ R_i^{(a)} = \hat{X}_i^{(a)} \mathrm{Diag}\left(s_i^{(a)}\right),\ g_i^{(a)} \geq 0,\ g_i^{(a)} = s_i^{(a)} \quad (6)$$

where $R_i^{(a)}$ and $g_i^{(a)}$ are auxiliary variables. The ALM method operates on the following augmented Lagrange function:

$$L\left(R_i^{(a)}, s_i^{(a)}\right) = \left\|X(i,\cdot) - {s_i^{(a)}}^T G S\right\|_2^2 + \gamma_1 \left(\frac{1}{2}\left\|x_i^{(a)} - \hat{X}_i^{(a)} s_i^{(a)}\right\|_2^2 + \beta \left\|R_i^{(a)}\right\|_*\right)$$
$$+ \left\langle k_i^{(a)}, s_i^{(a)} - g_i^{(a)}\right\rangle + \mathrm{Tr}\left(Y_a^T \left(R_i^{(a)} - \hat{X}_i^{(a)} \mathrm{Diag}\left(s_i^{(a)}\right)\right)\right)$$
$$+ \frac{\mu}{2}\left(\left\|R_i^{(a)} - \hat{X}_i^{(a)} \mathrm{Diag}\left(s_i^{(a)}\right)\right\|_2^2 + \left\|s_i^{(a)} - g_i^{(a)}\right\|_2^2\right) \quad (7)$$

where the matrix $Y_a$ is the Lagrange multiplier and $\mu > 0$ is the penalty parameter. We decompose $L(R_i^{(a)}, s_i^{(a)})$ into two subproblems by minimizing with respect to $R_i^{(a)}$ and $s_i^{(a)}$ independently, and iteratively solve the two subproblems, each of which has a closed-form solution.

a) Updating $R_i^{(a)}$ with the others fixed: This follows the update rule

$$R_i^{(a)}(t+1) = \arg\min_{R_i^{(a)}} \frac{\beta}{\mu}\left\|R_i^{(a)}\right\|_* + \frac{1}{2}\left\|R_i^{(a)} - \hat{X}_i^{(a)} \mathrm{Diag}\left(s_i^{(a)}\right) + \frac{1}{\mu} Y_a(t)\right\|_2^2. \quad (8)$$

b) Updating $s_i^{(a)}$ with the others fixed: We update $s_i^{(a)}$ as follows:

$$s_i^{(a)}(t+1) = U_i^{(a)}\left(2\, X(i,\cdot)\, [GS]^T + \gamma_1 M_i^{(a)}\right) \quad (9)$$

where $M_i^{(a)} = [\hat{X}_i^{(a)}]^T x_i^{(a)} + \mathrm{diag}\left([\hat{X}_i^{(a)}]^T \left(Y_a(t) + \mu R_i^{(a)}(t+1)\right)\right) + k_i^{(a)}$ and $U_i^{(a)} = \left([GS][GS]^T + [\hat{X}_i^{(a)}]^T \hat{X}_i^{(a)} + \mu\, \mathrm{Diag}\left(\mathrm{diag}\left([\hat{X}_i^{(a)}]^T \hat{X}_i^{(a)}\right)\right) + \mu I\right)^{-1}$.

c) Updating $Y_a$ with the others fixed: The update rule for $Y_a$ is

$$Y_a(t+1) = Y_a(t) + \mu\left(R_i^{(a)}(t+1) - \hat{X}_i^{(a)} \mathrm{Diag}\left(s_i^{(a)}(t+1)\right)\right). \quad (10)$$

d) Updating $g_i^{(a)}$ with the others fixed: The update rule for $g_i^{(a)}$ enjoys a closed form:

$$g_i^{(a)} = s_i^{(a)} + \frac{k_i^{(a)}}{\mu}. \quad (11)$$

e) Updating $k_i^{(a)}$ with the others fixed: The update rule for $k_i^{(a)}$ is

$$k_i^{(a)}(t+1) = k_i^{(a)}(t) + \mu\left(s_i^{(a)}(t) - g_i^{(a)}(t)\right) \quad (12)$$

where $\mu$ is updated according to the adaptive rule in [22].

3) Computation of S: We optimize each column $s_i$ of $S$ individually. Again, we use the ALM method to update $s_i$, which optimizes the objective function

$$\min_{s_i} \left\|x_i - {S^{(a)}}^T G s_i\right\|_2^2 + \gamma_2 \left(\frac{1}{2}\left\|x_i - \hat{X}_i s_i\right\|_2^2 + \alpha \left\|\hat{X}_i \mathrm{Diag}(s_i)\right\|_*\right)$$
$$\text{s.t.}\ R_i = \hat{X}_i \mathrm{Diag}(s_i),\ s_i = g_i,\ g_i \geq 0. \quad (13)$$

This leads to the following augmented Lagrange function:

$$P(R_i, s_i) = \left\|x_i - {S^{(a)}}^T G s_i\right\|_2^2 + \gamma_2 \left(\frac{1}{2}\left\|x_i - \hat{X}_i s_i\right\|_2^2 + \alpha \left\|R_i\right\|_*\right)$$
$$+ \mathrm{Tr}\left(Y^T\left(R_i - \hat{X}_i \mathrm{Diag}(s_i)\right)\right) + \langle k_i, s_i - g_i\rangle$$
$$+ \frac{\mu}{2}\left(\left\|R_i - \hat{X}_i \mathrm{Diag}(s_i)\right\|_2^2 + \left\|s_i - g_i\right\|_2^2\right) \quad (14)$$

which we again decompose into two subproblems by minimizing with respect to $R_i$ and $s_i$ independently, fixing one variable while optimizing the other. The two subproblems are solved iteratively and each has a closed-form solution.

a) Updating $R_i$ with the others fixed: This follows the update rule

$$R_i(t+1) = \arg\min_{R_i} \frac{\alpha}{\mu}\left\|R_i\right\|_* + \frac{1}{2}\left\|R_i - \hat{X}_i \mathrm{Diag}(s_i) + \frac{1}{\mu} Y(t)\right\|_2^2. \quad (15)$$

b) Updating $s_i$ with the others fixed: We update $s_i$ as follows:

$$s_i(t+1) = U_i\left(2\, x_i^T {S^{(a)}}^T G + \gamma_2 \hat{X}_i^T x_i + \mathrm{diag}\left(\hat{X}_i^T\left(Y(t) + \mu R_i(t+1)\right)\right) + k_i\right) \quad (16)$$

where $U_i = \left([{S^{(a)}}^T G]^T [{S^{(a)}}^T G] + \hat{X}_i^T \hat{X}_i + \mu\, \mathrm{Diag}\left(\mathrm{diag}\left(\hat{X}_i^T \hat{X}_i\right)\right) + \mu I\right)^{-1}$.
c) Updating $Y$ with the others fixed: The update rule for $Y$ is

$$Y(t+1) = Y(t) + \mu\left(R_i(t+1) - \hat{X}_i \mathrm{Diag}\left(s_i(t+1)\right)\right). \quad (17)$$

d) Updating $g_i$ with the others fixed: The update rule for $g_i$ is

$$g_i(t+1) = s_i(t) + \frac{k_i(t)}{\mu}. \quad (18)$$

e) Updating $k_i$: the update is performed according to

$$k_i(t+1) = k_i(t) + \mu\left(s_i(t) - g_i(t)\right) \quad (19)$$

where $\mu$ is updated according to the adaptive rule in [22].

As mentioned above, we alternately update $G$, $S^{(a)}$, and $S$ until convergence. This is guaranteed by the employed ALM method, and it converges globally owing to the convexity of (3). As we focus on the sparse code vector $s_i$ for each $x_i$, we define the convergence condition as $\|s_i(t+1) - s_i(t)\|_\infty \leq \varepsilon$ on all data points, and set $\varepsilon = 0.05$ in our experiments. We initialize each column $s_i(0)$ of $S$ and $s_i^{(a)}(0)$ of $S^{(a)}$ by optimizing (1) and (2) with the ALM method. The parameter $\mu$ and the multipliers $Y(0)$ and $Y_a(0)$ are randomly initialized with positive values. The overall procedure of the optimization is shown in Algorithm 1.

Algorithm 1: ALM Optimization of (3)
Input: training samples X = {x_1, ..., x_N}, K attributes
Output: G, S^(a), and S
Initialize: S and S^(a) column-wise; set ε = 0.05; randomly initialize μ, Y(0), and Y_a(0).
repeat
  Update G: solve G using (4).
  Update S^(a): solve S^(a) using (6); solve R_i^(a) using (8); solve Y_a using (10); solve g_i^(a) using (11); solve k_i^(a) using (12).
  Update S: solve S using (13); solve R_i using (15); solve Y using (17); solve g_i using (18); solve k_i using (19).
until ||s_i(t+1) − s_i(t)||_∞ ≤ ε and ||s_i^(a)(t+1) − s_i^(a)(t)||_∞ ≤ ε
Return G, S^(a), and S.
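The nuclear-norm subproblems (8) and (15) are standard proximal steps and admit closed-form solutions by singular value thresholding. A minimal sketch of the update in (15) follows (Python/NumPy; written from the general form of the thresholding operator and the sign convention of (14), not from any released implementation by the authors): shrink the singular values of $\hat{X}_i \mathrm{Diag}(s_i) - Y(t)/\mu$ by $\alpha/\mu$.

import numpy as np

def svt(A, tau):
    """Singular value thresholding: argmin_R tau*||R||_* + 0.5*||R - A||_F^2."""
    U, sig, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(sig - tau, 0.0)) @ Vt

def update_R(X_hat_i, s_i, Y, mu, alpha):
    """One instance of the update in (15), assuming the +Y/mu completion of the square."""
    A = X_hat_i @ np.diag(s_i) - Y / mu
    return svt(A, alpha / mu)

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
K, N = 6, 5
X_hat_i = rng.standard_normal((K, N - 1))
s_i = rng.standard_normal(N - 1)
Y = np.zeros((K, N - 1))
R_next = update_R(X_hat_i, s_i, Y, mu=1.0, alpha=0.1)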
D. Similarity Matrix Construction and Object Categorization

Once all sparse codes for a training set are collected, we can construct the data similarity matrix $W$, whose element $w_{ij}$ defines the similarity between a pair of objects $x_i$ and $x_j$:

$$w_{ij} = \frac{s_i(j) + s_j(i)}{2} \quad (20)$$

where $s_i(j)$ is the $j$th entry of $s_i$, reflecting the correlation between $x_i$ and $x_j$. To categorize visual objects into groups, we apply spectral clustering on $W$ to form different clusters. The number of groups, $M$, is obtained by inspecting the eigenvalues of the Laplacian matrix derived from $W$: $\lambda_1, \lambda_2, \ldots, \lambda_M$ are very small, while $\lambda_{M+1}$ is relatively large.
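A minimal sketch of this categorization step is given below (Python/NumPy; scikit-learn's KMeans is assumed to be available for the final clustering; the symmetric Laplacian normalization and the eigengap heuristic are standard choices and reflect our reading of the description above rather than the authors' exact implementation).

import numpy as np
from sklearn.cluster import KMeans   # assumed available; any k-means would do

def categorize(S, n_max=10):
    """Group objects from their sparse codes.

    S: N x N matrix whose (j, i) entry is the coefficient of x_j in the code s_i of x_i
       (zero diagonal; nonnegativity follows from the constraint S >= 0 in (3)).
    """
    N = S.shape[0]
    # Similarity matrix from (20): w_ij = (s_i(j) + s_j(i)) / 2.
    W = (S + S.T) / 2.0
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(N) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

    # Eigengap heuristic: lambda_1, ..., lambda_M small, lambda_{M+1} relatively large.
    evals, evecs = np.linalg.eigh(L)
    M = int(np.argmax(np.diff(evals[:n_max + 1]))) + 1

    # Standard spectral clustering on the rows of the first M eigenvectors.
    emb = evecs[:, :M]
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    labels = KMeans(n_clusters=M, n_init=10).fit_predict(emb)
    return M, labels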
IV. WEAKLY SUPERVISED DICTIONARY LEARNING

Let $X_i \in \mathbb{R}^{L \times N_i}$, $i \in \{1, \ldots, M\}$, be the collection of training samples in the $i$th group, and let $D_i \in \mathbb{R}^{L \times Z_i}$ be the corresponding subdictionary, where $L$ is the dimension of a training sample, $N_i$ is the number of training samples in the $i$th group, and $Z_i$ is the number of atoms in $D_i$. The dictionary learning model can then be formulated as

$$\min_{D_i, S_i} \sum_{i=1}^{M} \left\|X_i - D_i S_i\right\|_2^2 + \lambda \left\|S_i\right\|_1 + \theta\, \Phi(S_1, \ldots, S_M) \quad (21)$$

where $S_i$ is the sparse coefficient matrix of $X_i$ over $D_i$, $\Phi(S_1, \ldots, S_M)$ is a discrimination-promoting term defined below, and the parameter $\theta$ controls the tradeoff between reconstruction and discrimination.

The term $\Phi(S_1, \ldots, S_M)$ is devised to couple the learning of the multiple subdictionaries while promoting the discrimination of the sparse coefficients. Herein, we aim to obtain more discriminative coefficients by minimizing the intragroup scatter and maximizing the intergroup scatter of the decomposition coefficients of the different groups. In our setting, the intragroup scatter matrix is defined as $B_{intra} = \sum_{j=1}^{M} \sum_{s_i \in S_j} (s_i - \mu_j)(s_i - \mu_j)^T$, and the intergroup scatter matrix is $B_{inter} = \sum_{j=1}^{M} N_j (\mu_j - \mu)(\mu_j - \mu)^T$, where $\mu_j$ and $\mu$ are the mean vectors of $S_j$ and $S = \{S_i\}_{i=1}^{M}$, respectively. Thus, the discrimination-promoting term is defined as $\Phi(S_1, \ldots, S_M) = \mathrm{Tr}(B_{intra}) - \mathrm{Tr}(B_{inter})$, where $\mathrm{Tr}(\cdot)$ is the matrix trace operator.

The optimization of (21) iterates over two subprocedures: 1) computing the sparse coefficients $S_i$ with the dictionaries fixed and 2) updating the dictionaries $D_i$ with the coefficients fixed. Mathematically, we update $S_i$ by fixing $D_i$, $i = 1, \ldots, M$, and $S_j$, $j \neq i$, and the objective function reduces to

$$\left\|X_i - D_i S_i\right\|_2^2 + \lambda \left\|S_i\right\|_1 + \theta\, \Phi(S_i) \quad (22)$$

where $\Phi(S_i)$ is the discrimination term when the other coefficient matrices are fixed, given by $\Phi(S_i) = \|S_i - M_i\|_2^2 - \sum_{j=1}^{M} \|M_j - \bar{M}\|_2^2$, where $M_i \in \mathbb{R}^{Z_i \times N_i}$ consists of $N_i$ copies of the mean vector $\mu_i$ as its columns, and $M_j \in \mathbb{R}^{Z_j \times N_j}$ and $\bar{M} \in \mathbb{R}^{Z_j \times N_j}$ are produced by stacking $N_j$ copies of $\mu_j$ and $\mu$, respectively, as their columns. Only the $\ell_1$-norm term is nondifferentiable; thus, we employ the feature-sign search algorithm [20] to iteratively solve the $\ell_1$-penalized problem, which is guaranteed to converge to a local minimum.

With the coefficients fixed, we update $D_i$ as follows:

$$\min_{D_i} \left\|X_i - D_i S_i\right\|_2^2, \quad \text{s.t.}\ \|d_i\|_2^2 \leq 1,\ \forall i = 1, \ldots, Z_i. \quad (23)$$

Equation (23) is a least squares problem with quadratic constraints, which can be efficiently solved using its Lagrange dual [20].
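To make the discrimination-promoting term concrete, the sketch below (Python/NumPy; function and variable names are ours, not the authors') evaluates $\Phi(S_1, \ldots, S_M) = \mathrm{Tr}(B_{intra}) - \mathrm{Tr}(B_{inter})$ directly from the per-group coefficient matrices, assuming for simplicity that all subdictionaries have the same number of atoms so the coefficients can be stacked.

import numpy as np

def discrimination_term(S_groups):
    """Phi = Tr(B_intra) - Tr(B_inter) for a list of coefficient matrices S_j (Z x N_j each)."""
    mu_all = np.mean(np.hstack(S_groups), axis=1, keepdims=True)   # global mean vector
    tr_intra, tr_inter = 0.0, 0.0
    for S_j in S_groups:
        mu_j = S_j.mean(axis=1, keepdims=True)                     # group mean vector
        diff = S_j - mu_j
        tr_intra += float(np.sum(diff * diff))                     # Tr of sum (s - mu_j)(s - mu_j)^T
        tr_inter += S_j.shape[1] * float(np.sum((mu_j - mu_all) ** 2))
    return tr_intra - tr_inter

# Toy usage: three groups of codes over 64-atom subdictionaries (illustrative only).
rng = np.random.default_rng(0)
S_groups = [rng.standard_normal((64, n)) for n in (30, 25, 40)]
print(discrimination_term(S_groups))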
V. EXPERIMENTS

We evaluate the effectiveness of our approach on two visual recognition tasks: 1) image classification and 2) object recognition. All results are reported in terms of recognition accuracy under the following recognition rule: a test sample $t$ is assigned to the group with the lowest reconstruction error, i.e.,

$$j^* = \arg\min_{j \in \{1, \ldots, M\}} \left\|t - D_j s_j\right\|_2^2. \quad (24)$$

A. Image Classification

1) Data Sets: We conduct image classification on two publicly available data sets.

1) The AWA data set [17]2 contains 30 475 animal images belonging to 50 classes. Twenty of the 50 classes are selected for testing, and all 50 classes are used for training. This data set offers 85 attributes to describe animals, such as "black," "stripe," and "eats fish." The real-valued association strengths [15] between the 85 attributes and 50 classes are used to determine the binary attribute indicators of each image.

2) The SUN attribute data set [32] consists of 14 340 scene images belonging to 717 scene classes. A set of 102 manually labeled attributes is available for each image. In accordance with the setting suggested by [32], the updated SUN attribute data set (v2.1) is split into 12 906 training and 1434 test images.

These two data sets have predefined attributes available; hence each image can be encoded by a binary attribute vector, making them suitable for weakly-supervised coding.

2) Competitors: We compare the performance of our method with competitors including two unsupervised approaches, K-SVD [1] and multiattributed dictionary learning (MADL) [5]; two supervised methods, discriminative K-SVD (D-KSVD) [51] and FDDL [50];3 and two weakly-supervised methods, namely AttrDL [3] and WSC-GCP [4].

1) K-SVD learns a dictionary by K-means clustering, where a subdictionary is learned from each cluster. It learns sparse codes with the $\ell_1$-norm.

2) MADL first defines a visual attribute similarity metric and then conducts K-SVD on that metric.

3) AttrDL learns low-level feature-based subdictionaries corresponding to predefined attributes. The training images are divided into different sets, each representing a specific attribute. The dictionary and sparse codes are learned using $\ell_1$- and $\ell_2$-norms.

4) WSC-GCP is an attribute-guided bag-of-features representation learning method, which refines sparse codes by imposing attribute-based label consistency on the $\ell_1$-norm regularizer. Each attribute is treated independently.

2 http://attributes.kyb.tuebingen.mpg.de/
3 We deliberately select D-KSVD and FDDL as competitors because the two pipelines are widely accepted as representative and effective. Moreover, many recent supervised successors [14], [52] stem from these two methods.
5) D-KSVD is a supervised method that equips the K-SVD algorithm [1] with discriminative power by using category priors.

6) FDDL adopts the Fisher discrimination criterion in dictionary learning in order to learn a set of class-specific subdictionaries.

3) Other Settings: As the evaluation metric, we use the per-class classification precision, which is the criterion used by the competitors, so we can directly compare our approach with their reported performance. In terms of parameter setting, $\alpha$, $\beta$, and $\lambda$ are set by searching the grid $\{10^{-5}, 10^{-3}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$. We study the influence of these parameters in Section V-C to determine the optimal values. The regularization parameters $\gamma_1$ and $\gamma_2$ are fixed at 0.5 after a full study of their effect in Section V-D. The parameter $\theta$ in (21) is fixed at 0.1. Parameters of the baselines are tuned to give the best performance in their own implementations. The subdictionary size for each category is set to 1024. We use dense $\ell_2$-normalized scale-invariant feature transform (SIFT) [26] as the local image descriptor owing to its reasonably good performance in object recognition [12], [14], [49]. Specifically, following previous work [12], [49], we densely sample the interest regions from which SIFT descriptors are extracted, with the patch size and step fixed to 16 and 8, respectively. We also resize the maximum side of each image to 300 pixels. To reduce the computational complexity, we subsample the generated SIFT descriptors. For a given image, the spatial pyramid feature [19] is computed as the representation by max pooling4 the sparse codes of the SIFT descriptors in a three-layer spatial pyramid configuration.

K-SVD classifies a test sample by computing the residual errors over the subdictionaries. Meanwhile, a linear classifier is trained simultaneously with the dictionary learning in D-KSVD. The residual errors plus the distances between sparse coefficients and class centroids are used as the classification rule in FDDL. In AttrDL, a new sample is classified by a winner-takes-all strategy, in which one-versus-all linear SVMs are used as the classifier and the test object is recognized as the category with the highest classification score. The classifier setting of WSC-GCP is identical to that of AttrDL.

4) Evaluation on AWA Data Set: The classification results are shown in Table I. We can see the following.

1) Our approach consistently outperforms the unsupervised competitors, K-SVD and MADL, by a large margin. For K-SVD, the dependence on low-level feature representations in K-means clustering lacks semantic information, making the learned dictionary less discriminative for classification. For MADL, attributes are simply incorporated into distance metric learning, while the correlation between attributes, which turns out to be helpful in boosting the discriminative power of a dictionary, is ignored.

4 WSC-GCP uses its proposed geometric pooling strategy.
TABLE I
IMAGE CLASSIFICATION PERFORMANCE (MEAN±STD) OF THE COMPARED METHODS OVER THE AWA DATA SET
Fig. 4. Multiclassification confusion matrix on AWA data set (%). The entry in the ith row and jth column indicates the percentage of images from class i that are misclassified as class j, whilst the diagonal entries show the average classification rates for corresponding classes.
2) The supervised class-specific dictionary learning algorithms FDDL and D-KSVD are inferior to the attribute-aware methods. This is because some animal categories, such as "cow" and "buffalo," are visually similar to each other. Also, the problem of intraclass diversity renders them not very effective in classifying fine-grained objects.

3) The weakly-supervised counterparts AttrDL and WSC-GCP show less competitive results than our algorithm. This is because they overlook the correlations between attributes, which can in fact benefit visual dictionary learning.

To further justify the effectiveness of our approach, we plot a multiclass confusion matrix, as shown in Fig. 4. It can be seen that our method achieves a high classification rate over a wide collection of animal samples. Since we group animal images based on their membership in multiple correlated attributes, we are able to establish subcategories that are aware of intraclass diversity and interclass similarity. Thereafter, by learning dictionaries over these subcategories, we endow the dictionaries with semantic, subcategory-aware properties, leading to a high classification rate.
5) Evaluation on SUN Attribute Data Set: We report the classification accuracy of our approach and the competitors in Table II. Due to space limitations, we present results for 24 of the original 717 scene classes. We can see the following.

1) Our method obtains a considerable gain in classification performance and outperforms all the compared baselines.

2) The weakly-supervised methods AttrDL and WSC-GCP are superior to both the unsupervised and supervised alternatives, which demonstrates the effectiveness of exploiting attributes.

3) The supervised methods FDDL and D-KSVD can still be regarded as effective tools for classifying visual objects when labeled training data are available.

B. Object Recognition

In this experiment, we compare our approach with state-of-the-art algorithms on the NUS-WIDE-OBJECT and Pascal VOC 2012 benchmarks.5

5 http://pascallin.ecs.soton.ac.uk/challenges/VOC/
TABLE II
IMAGE CLASSIFICATION PERFORMANCE (MEAN±STD) OF THE COMPARED METHODS OVER THE SUN ATTRIBUTE DATA SET
Fig. 5. Recognition accuracy of the compared methods over the NUS-WIDE-OBJECT data set. Please note that AttrDL may exhibit lower values in some categories compared with the results reported in the original paper. This is due to the subsampling of SIFT descriptors in our experiment.
1) Data Sets: The NUS-WIDE-OBJECT data set contains 30 000 manually labeled images belonging to 31 object classes, all crawled from the Flickr Web site. The data set is separated into two subsets: 1) 17 927 images as the training subset and 2) the remaining 12 073 images as the testing subset. The Pascal VOC 2012 benchmark [9] contains 11 530 images spread over 20 classes; it was collected to recognize objects from a number of visual object classes in realistic scenes.

2) Competitors: The competitors include K-SVD, MADL, D-KSVD, FDDL, AttrDL, and WSC-GCP.

3) Weakly Supervised Coding: No predefined attributes are available for the above two data sets. However, visual attributes are cheap to obtain. Given a reference image, we adopt the Classemes-based attribute detector [36] to convert the image into a Classemes vector, whose definition comes from the large scale concept ontology for multimedia [29], comprising 2659 attribute categories. This process typically takes less than 0.5 s per image on a regular PC.

4) Other Settings: Following [9], we use the per-category object recognition accuracy as our evaluation metric, which is directly comparable with the selected baselines.
Fig. 6. Performance comparison on the Pascal VOC 2012 data set.
The other settings for parameter tuning, feature sampling, and the classifier are configured in the same way as in the previous experiment.

5) Evaluation on NUS-WIDE-OBJECT: Fig. 5 shows the performance comparison of our approach with AttrDL and WSC-GCP on the NUS-WIDE-OBJECT data set. We plot the recognition accuracy for each individual class and make the following observations.
Fig. 7. Parameter study against (α, β) and λ on the AWA and SUN attribute data sets.
1) The performance of WSC-GCP is slightly worse than that of AttrDL, mainly because WSC-GCP only incorporates attribute guidance in its sparse coding step, whereas AttrDL effectively uses visual attributes to learn a semantic dictionary. Thus, AttrDL is able to optimize discriminative sparse codes and enhance its recognition performance.

2) Our method outperforms the competitors in all categories. Instead of learning an individual attribute-aware dictionary, we automatically learn a set of subdictionaries corresponding to distinct image groups revealed by attribute correlations. This leads to a highly discriminative dictionary and boosts performance accordingly.

6) Evaluation on Pascal VOC 2012 Data Set: We show the recognition accuracy of the compared methods in Fig. 6, where we vary the number of training images sampled from the data set. Our method consistently outperforms the competitors across varied fractions of training samples.

C. Parameter Study

In this section, we study the influence of the parameters on our method. The parameters α and β in (3) control the effect of sparsity in the trace lasso based data categorization step. The parameter λ in (22) controls the sparsity term in learning the dictionary and sparse representation. These parameters are tuned over $\{10^{-5}, 10^{-3}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$. As α and β appear simultaneously in (3), we first investigate how our method varies against different combinations of them. The results are shown in Fig. 7. We observe that our algorithm achieves relatively higher recognition accuracy when $\alpha = \beta = 10^{-1}$. In terms of λ, better performance is obtained when $\lambda = 10^{-3}$. Thus, we use $\alpha = \beta = 10^{-1}$ and $\lambda = 10^{-3}$ in all experiments.

D. Discussion

In this section, we conduct a quantitative study of the improvement contributed by the awareness of attribute correlations. To this end, we vary the effect of the attribute correlation term in (3) by setting $\gamma_2$ to different values ranging from 0 to 1. The classification performance against different settings of $(\gamma_1, \gamma_2)$ is shown in Table III.
TABLE III
IMAGE CLASSIFICATION PERFORMANCE (MEAN±STD) AGAINST DIFFERENT SETTINGS OF (γ1, γ2) OVER THE AWA AND SUN ATTRIBUTE DATA SETS
It can be seen that the attribute correlation plays a positive role in improving performance.

VI. CONCLUSION

In this paper, to mitigate the lack of large amounts of labeled data objects in discriminative dictionary learning, we proposed a novel weakly-supervised dictionary learning method for visual recognition that exploits attribute correlations via the trace lasso norm, by which subcategory-aware visual data objects can be automatically grouped. By doing this, a discriminative dictionary is learned that can be decomposed into a set of subdictionaries corresponding to different subcategories. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods on various visual recognition tasks without relying on labeled data objects during training. Our future work will attempt to improve existing semi-supervised learning methods by exploiting attribute correlations as well as the correlation structure among unlabeled data objects.

REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, "Evaluation of output embeddings for fine-grained image classification," in Proc. CVPR, Boston, MA, USA, 2015, pp. 2927–2936.
[3] J. Cai, Z.-J. Zha, H. Luan, S. Zhang, and Q. Tian, "Learning attribute-aware dictionary for image classification and search," in Proc. ICMR, Dallas, TX, USA, 2013, pp. 33–40.
[4] L. Cao, R. Ji, Y. Gao, Y. Yang, and Q. Tian, "Weakly supervised sparse coding with geometric consistency pooling," in Proc. CVPR, Providence, RI, USA, 2012, pp. 3578–3585.
[5] C.-K. Chiang, T.-F. Su, C. Yen, and S.-H. Lai, "Multi-attributed dictionary learning for sparse coding," in Proc. ICCV, Sydney, NSW, Australia, 2013, pp. 1137–1144.
[6] J. Deng, J. Krause, and L. Fei-Fei, "Fine-grained crowdsourcing for fine-grained recognition," in Proc. CVPR, Portland, OR, USA, 2013, pp. 580–587.
[7] J. Dong et al., “Subcategory-aware object classification,” in Proc. CVPR, Portland, OR, USA, 2013, pp. 827–834. [8] K. Engan, S. O. Aase, and J. H. Husoy, “Method of optimal directions for frame design,” in Proc. ICASSP, Phoenix, AZ, USA, 1999, pp. 2443–2446. [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. (2012). The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. [Online]. Available: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/ [10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in Proc. CVPR, Miami, FL, USA, 2009, pp. 1778–1785. [11] V. Ferrari and A. Zisserman, “Learning visual attributes,” in Proc. NIPS, Vancouver, BC, Canada, 2009, pp. 433–440. [12] S. Gao, I. W.-H. Tsang, L.-T. Chia, and P. Zhao, “Local features are not lonely-Laplacian sparse coding for image classification,” in Proc. CVPR, Colorado Springs, CO, USA, 2011, pp. 3555–3561. [13] E. Grave, G. R. Obozinski, and F. R. Bach, “Trace lasso: A trace norm regularization for correlated designs,” in Proc. NIPS, South Lake Tahoe, CA, USA, 2012, pp. 2187–2195. [14] Z. Jiang, Z. Lin, and L. S. Davis, “Learning a discriminative dictionary for sparse coding via label consistent K-SVD,” in Proc. CVPR, Colorado Springs, CO, USA, 2011, pp. 1697–1704. [15] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda, “Learning systems of concepts with an infinite relational model,” in Proc. AAAI, Boston, MA, USA, 2006, pp. 381–388. [16] K. Kreutz-Delgado et al., “Dictionary learning algorithms for sparse representation,” Neural Comput., vol. 15, no. 2, pp. 349–396, Feb. 2003. [17] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in Proc. CVPR, Miami, FL, USA, 2009, pp. 951–958. [18] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, pp. 453–465, Mar. 2014. [19] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. CVPR, New York, NY, USA, 2006, pp. 2169–2178. [20] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Proc. NIPS, Vancouver, BC, Canada, 2006, pp. 801–808. [21] Z. Lin, M. Chen, L. Wu, and Y. Ma, “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,” Coordinated Sci. Lab., Univ. Illinois Urbana-Champaign, Champaign, IL, USA, Tech. Rep. UILU-ENG-09-2215, 2009. [22] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in Proc. NIPS, Granada, Spain, 2011, pp. 612–620. [23] C. Liu, J. Feng, Z. Lin, and S. Yan, “Correlation adaptive subspace segmentation,” in Proc. ICCV, Sydney, NSW, Australia, 2013, pp. 1345–1352. [24] L. Liu, L. Shao, X. Li, and K. Lu, “Learning spatio-temporal representations for action recognition: A genetic programming approach,” IEEE Trans. Cybern., vol. 46, no. 1, pp. 158–170, Jan. 2016. [25] L. Liu, M. Yu, and L. Shao, “Unsupervised local feature hashing for image similarity search,” IEEE Trans. Cybern., vol. PP, no. 99, pp. 1–11, Oct. 2015. doi: 10.1109/TCYB.2015.2480966. [26] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. ICCV, 1999, pp. 1150–1157. [27] X. Lu, Y. Yuan, and P. 
Yan, “Alternatively constrained dictionary learning for image superresolution,” IEEE Trans. Cybern., vol. 44, no. 3, pp. 366–377, Mar. 2014. [28] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised dictionary learning,” in Proc. NIPS, Vancouver, BC, Canada, 2008, pp. 1033–1040. [29] M. Naphade et al., “Large scale concept ontology for multimedia,” ACM Multimedia, vol. 13, no. 3, pp. 86–91, Jul. 2006. [30] D. Parikh and K. Grauman, “Relative attributes,” in Proc. ICCV, Barcelona, Spain, 2011, pp. 503–510. [31] V. M. Patel, Q. Qiu, and R. Chellappa, “Dictionaries for image-based recognition,” in Proc. ITA, San Diego, CA, USA, 2013, pp. 1–8. [32] G. Patterson, X. Chen, and J. Hays. SUN Attribute Database: Discovering, Annotating, and Recognizing Scene Attributes. (2011). [Online]. Available: http://cs.brown.edu/~gen/sunattributes.html [33] Q. Qiu, Z. Jiang, and R. Chellappa, “Sparse dictionary-based representation and recognition of action attributes,” in Proc. ICCV, Barcelona, Spain, 2011, pp. 707–714. [34] L. Shao, R. Yan, X. Li, and Y. Liu, “From heuristic optimization to dictionary learning: A review and comprehensive comparison of image denoising algorithms,” IEEE Trans. Cybern., vol. 44, no. 7, pp. 1001–1013, Jul. 2014.
[35] Y. Su and F. Jurie, “Improving image classification using semantic attributes,” Int. J. Comput. Vis., vol. 100, no. 1, pp. 59–77, 2012. [36] L. Torresani, M. Szummer, and A. W. Fitzgibborn, “Efficient object category recognition using classemes,” in Proc. ECCV, Heraklion, Greece, 2010, pp. 776–789. [37] J. Wang et al., “Locality-constrained linear coding for image classification,” in Proc. CVPR, San Francisco, CA, USA, 2010, pp. 3360–3367. [38] S. Wang, J. Joo, Y. Zhu, and S.-C. Zhu, “Weakly supervised learning for attribute localization in outdoor scenes,” in Proc. CVPR, Portland, OR, USA, 2013, pp. 3111–3118. [39] Y. Wang, X. Lin, L. Wu, and W. Zhang, “Effective multi-query expansions: Robust landmark retrieval,” in Proc. ACM Multimedia, Brisbane, QLD, Australia, 2015, pp. 79–88. [40] Y. Wang, X. Lin, L. Wu, W. Zhang, and Q. Zhang, “Exploiting correlation consensus: Towards subspace clustering for multi-modal data,” in Proc. ACM Multimedia, Orlando, FL, USA, 2014, pp. 981–984. [41] Y. Wang, X. Lin, L. Wu, W. Zhang, and Q. Zhang, “Lbmch: Learning bridging mapping for cross-modal hashing,” in Proc. ACM SIGIR, Santiago, Chile, 2015, pp. 999–1002. [42] Y. Wang et al., “Robust subspace clustering for multi-view data by exploiting correlation consensus,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3939–3949, Nov. 2015. [43] Y. Wang, X. Lin, and Q. Zhang, “Towards metric fusion on multi-view data: A cross-view based graph random walk approach,” in Proc. ACM CIKM, San Francisco, CA, USA, 2013, pp. 805–810. [44] Y. Wang et al., “Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering,” in Proc. IJCAI, New York, NY, USA, 2016, pp. 2153–2159. [45] Y. Wang, W. Zhang, L. Wu, X. Lin, and X. Zhao, “Unsupervised metric fusion over multiview data by graph random walk-based cross-view diffusion,” IEEE Trans. Neural Netw. Learn. Syst., no. 99, pp. 1–14, Dec. 2015. [46] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009. [47] L. Wu, Y. Wang, and J. Shepherd, “Efficient image and tag co-ranking: A bregman divergence optimization method,” in Proc. ACM Multimedia, Barcelona, Spain, 2013, pp. 593–596. [48] Y. Xie et al., “Discriminative object tracking via sparse representation and online dictionary learning,” IEEE Trans. Cybern., vol. 44, no. 4, pp. 539–553, Apr. 2014. [49] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. CVPR, Miami, FL, USA, 2009, pp. 1794–1801. [50] M. Yang, L. Zhang, X. Feng, and D. Zhang, “Fisher discrimination dictionary learning for sparse representation,” in Proc. ICCV, Barcelona, Spain, 2011, pp. 543–550. [51] Q. Zhang and B. Li, “Discriminative K-svd for dictionary learning in face recognition,” in Proc. CVPR, San Francisco, CA, USA, 2010, pp. 2691–2698. [52] N. Zhou, Y. Shen, J. Peng, and J. Fan, “Learning inter-related visual dictionary for object recognition,” in Proc. CVPR, Providence, RI, USA, 2012, pp. 3490–3497. [53] F. Zhu and L. Shao, “Weakly-supervised cross-domain dictionary learning for visual recognition,” Int. J. Comput. Vis., vol. 109, no. 1, pp. 42–59, 2014.
Lin Wu received the Ph.D. degree from the University of New South Wales, Sydney, NSW, Australia, in 2014.
She is currently an ARC Senior Research Associate with the School of Computer Science, University of Adelaide, Adelaide, SA, Australia. Her current research interests include computer vision, machine learning, and multimedia analytics. She has published over 30 academic papers in competitive venues, such as CVPR, ACM Multimedia, IJCAI, ACM SIGIR, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON CYBERNETICS, and Computer Vision and Image Understanding.
Dr. Wu was a co-recipient of the Best Research Paper Runner-Up Award at PAKDD 2014. She regularly serves as a conference program committee member and as an invited reviewer for leading journals, such as the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA, and Pattern Recognition.
Yang Wang received the Ph.D. degree from the University of New South Wales, Kensington, NSW, Australia, in 2015.
He is currently a Research Fellow with the School of Computer Science and Engineering, University of New South Wales. He has published over 25 research papers, including one book chapter, most of which have appeared in major conferences and journals, such as ACM Multimedia, ACM SIGIR, IJCAI, IEEE ICDM, ACM CIKM, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON CYBERNETICS, and KAIS.
Dr. Wang was a recipient of the Best Research Paper Runner-Up Award at PAKDD 2014. He was a program committee member for ECML/PKDD 2014 and ECML/PKDD 2015, and regularly serves as an invited journal reviewer for the IEEE TRANSACTIONS ON IMAGE PROCESSING and the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.
Shirui Pan received the Ph.D. degree in computer science from the University of Technology Sydney (UTS), Ultimo, NSW, Australia, in 2015.
He is a Research Associate with the Centre for Quantum Computation and Intelligent Systems, UTS. He has published over 20 research papers in top-tier journals and conferences, including the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, the IEEE TRANSACTIONS ON CYBERNETICS, Pattern Recognition, IJCAI, the IEEE International Conference on Data Mining, SDM, CIKM, and PAKDD. His current research interests include data mining and machine learning.