
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 6, JUNE 2012

Image Annotation by Input–Output Structural Grouping Sparsity

Yahong Han, Fei Wu, Qi Tian, Senior Member, IEEE, and Yueting Zhuang, Member, IEEE

Abstract—Automatic image annotation (AIA) is very important to image retrieval and image understanding. Two key issues in AIA are explored in detail in this paper: structured visual feature selection and the exploitation of hierarchically correlated structures among multiple tags to boost the performance of image annotation. This paper introduces input and output structural grouping sparsity simultaneously into a regularized regression model for image annotation. For input high-dimensional heterogeneous features such as color, texture, and shape, different kinds (groups) of features have different intrinsic discriminative power for the recognition of certain concepts. The proposed structured feature selection by structural grouping sparsity can be used not only to select groups of features but also to conduct within-group selection. Hierarchical correlations among output labels are well represented by a tree structure, and the proposed tree-structured grouping sparsity can therefore be used to boost the performance of multitag image annotation. In order to solve the proposed regression model efficiently, we relax the solving process into a framework of a bilayer regression model for multilabel boosting by the selection of heterogeneous features with structural grouping sparsity (Bi-MtBGS). The first-layer regression selects the discriminative features for each label. The second-layer regression refines the feature selection model learned from the first layer, which can be taken as a multilabel boosting process. Extensive experiments on public benchmark image data sets and real-world image data sets demonstrate that the proposed approach achieves better performance of multitag image annotation and leads to a readily interpretable model for image understanding.

Index Terms—Automatic image annotation (AIA), regularized regression, structural grouping sparsity, structured feature selection, tree-structured grouping sparsity.

Manuscript received March 20, 2011; revised October 18, 2011; accepted December 23, 2011. Date of publication January 12, 2012; date of current version May 11, 2012. This work was supported in part by the National Basic Research Program of China under Grant 2012CB316400, by the Natural Science Foundation of China under Grant 61070068 and Grant 90920303, and by the Fundamental Research Funds for the Central Universities. The work of Q. Tian was supported in part by the National Science Foundation Information and Intelligent Systems under Grant 1052851, by Faculty Research Awards by Google, by FX Palo Alto Laboratory (FXPAL), and by NEC Laboratories of America. The work of Y. Han was supported in part by Scholarship Award for Excellent Doctoral Student granted by the Ministry of Education of China. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Joseph P. Havlicek. Y. Han is with the School of Computer Science and Technology, Tianjin University, Tianjin 300072, China (e-mail: [email protected]). F. Wu and Y. Zhuang are with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]; [email protected]). Q. Tian is with the Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX 78249-0667 USA (e-mail: [email protected]. edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2012.2183880

I. INTRODUCTION

WITH the booming of social media websites such as Facebook and Flickr, we are witnessing an explosive growth of image data, which brings great challenges to the research of image retrieval. In the early stage, a popular framework was to first manually annotate images and then use text-based retrieval technologies to perform image retrieval [1]. Due to the vast amount of labor required by manual image annotation and the subjectivity of human perception, imprecise annotation may cause unrecoverable mismatches in the retrieval process [1]. Content-based image retrieval aims to index images by their low-level visual features for similar-image retrieval [2], [3]. However, the performance of image retrieval by visual features is bounded by the semantic gap between low-level visual features and high-level semantics [1], [2]. One of the key techniques for reducing the semantic gap is to use machine learning methods to associate low-level features with high-level semantic concepts [4], also known as automatic image annotation (AIA) [5]. The main idea of AIA techniques is to automatically learn semantic concept models from a large number of image samples and to use these concept models to label new images [5]. From a machine learning point of view, AIA approaches can be roughly classified into generative and discriminative models. A generative model learns a joint distribution over image features and concepts (or annotation tags); to annotate a new image, the learned generative model computes the conditional probability over tags given the visual features [6], [7]. A discriminative model, on the other hand, trains a separate classifier from visual features for each tag, and these classifiers are used to predict particular tags for test image samples [8], [9]. Similarly, we can also train a regression model (regression coefficients) to predict tags for test images, taking features as predictors (input variables) and tags as responses (output labels).
When using the discriminative model to perform AIA, two key issues should be addressed. First, the “input structure” of visual features should be taken into consideration. In real-world images, we can extract high-dimensional heterogeneous features from one given image, such as global features (color, shape, and texture) or local features (SBN [10], scale-invariant feature transform, shape context, and gradient location and orientation histogram). Taking each type of visual features as a group, we could obtain input structures of feature groups and within-group features. Furthermore, different subsets of heterogeneous features have different intrinsic discriminative

1057-7149/$31.00 © 2012 IEEE


power to characterize the semantics of images [11]. That is to say, only a limited number of groups of heterogeneous features distinguish each label from the others. Therefore, the selected visual features for the prediction of certain image concepts are usually sparse. Considering the input structure of visual features, a process of structured visual feature selection should be integrated into the learned discriminative model to better annotate images with suitable tags. Second, due to the visually polysemous nature of images [12], one image may be labeled with multiple tags. Correlations or structures among multiple labeled tags usually indicate certain correlations of visual content. Previous studies have shown that effective utilization of the latent information hidden in related labels can boost the performance of multilabel annotation [13]–[17]. Furthermore, some works even defined hierarchical structures among multiple labeled tags and exploited these structures for better AIA [18]–[20]. Therefore, the "output structure" among the correlated multiple tags should also be considered during multitag image annotation. This paper proposes to train a multiresponse (or multilabel) regression model to predict multiple tags for new images. Taking high-dimensional heterogeneous visual features as input variables and multiple labeled tags as multiple output labels, the goal of the training process is to learn regression coefficients that can be used to fit the input features to their corresponding labels. A benefit of this regression formulation is the ability to introduce certain prior structures on the input or output through regularized penalties. First, considering the input structures, lasso [22] uses the $\ell_1$-norm of the coefficient vector as a penalty to shrink many coefficients to exactly zero and thereby make the learned regression model sparse. Therefore, lasso can simultaneously perform regression and feature selection.
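The shrinkage behavior of the $\ell_1$ penalty has a one-dimensional closed form, the soft-thresholding operator, which coordinate-wise lasso solvers apply repeatedly. The sketch below (NumPy; the coefficient values are made up for illustration) shows how small coefficients are set exactly to zero:

```python
import numpy as np

def soft_threshold(z, lam):
    """One-dimensional lasso solution: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Toy least squares estimates shrunk by the l1 penalty:
ols = np.array([2.5, -0.3, 0.0, -1.7, 0.1])
shrunk = soft_threshold(ols, lam=0.5)
print(shrunk)  # two coefficients survive; three are shrunk exactly to zero
```

Coefficients whose magnitude falls below the regularization level are removed from the model entirely, which is exactly the feature selection effect described above.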
Furthermore, by imposing a penalty consisting of a sum of (not squared) $\ell_2$-norms over groups of coefficients, each group corresponding to a group of input features, group lasso [23] can induce sparsity at the group level. As a result, the group lasso method can perform group-of-feature selection when the input features can be partitioned into disjoint groups. For example, in computer vision research, Cao et al. [11] proposed to learn different metric kernel functions for different features in order to adaptively select visual features for the recognition of certain semantics or events. They formulated the heterogeneous feature machines (HFMs) [11] as a sparse logistic regression with an $\ell_1$-norm penalty at the group level. On the other hand, considering the output structures among multiple labels, early studies in multiresponse regression showed [24] that, if the responses (or labels) are correlated, a more accurate prediction may be obtained by a linear combination. For example, in multimedia research, a canonical correlation analysis (CCA)-based [25] approach [the Curds and Whey (C&W) algorithm] was employed in [26] and [27] as a multilabel boosting process for the selection of heterogeneous features of images. In multitask learning research [28], [29], a mixed-norm penalty was imposed to couple the tasks and ensure that common features are selected across them. Furthermore, in order to integrate certain structures, such as a hierarchical tree structure or even a graph structure, among multiple tasks or multiple labels into the multiresponse


regression process, the tree-guided group lasso method [30] and the graph-structured multitask regression model [31] were proposed, respectively. Inspired by the general structured sparsity-inducing norm penalty [32], these two methods [30], [31] impose well-defined structured sparsity penalties on the coefficient matrix along the output direction. In this paper, we propose to combine the structures of input features and output multiple tags into one regression framework for better multitag image annotation. The training process of the proposed framework is illustrated in Fig. 1. For each image, we can extract heterogeneous visual features, e.g., color, texture, and shape, to form matrix $X$. Each coefficient vector in matrix $B$, e.g., column vector $\beta^k$, is used to fit the values of the visual features to the values of the corresponding label indicators in matrix $Y$. Since there are multiple labeled tags, the proposed framework is a process of multiresponse regression. As shown in Fig. 1, we propose to impose a well-defined structural grouping sparsity penalty on the input direction of matrix $B$, i.e., on each column vector of $B$, to perform heterogeneous visual feature selection. Simultaneously, we propose to impose a tree-structured grouping sparsity penalty on the output direction of matrix $B$, i.e., on each row vector of $B$, to perform tree-structured multilabel boosting. The penalty of structural grouping sparsity is defined as follows. Taking each type of features as a group (see Fig. 1 for an example), we get three groups of visual features, corresponding to color, texture, and shape. We first propose to impose the penalty of a sum of (not squared) $\ell_2$-norms on the groups of coefficients, each corresponding to a group of input features, to perform group selection. In Fig. 1, a nonzero coefficient block denotes that the corresponding group of features (e.g., the color features) is selected, and a blank block denotes that it is not. In order to further discern the most discriminative features within groups, we propose to simultaneously impose an $\ell_1$-norm penalty on the whole coefficient vector (e.g., column vector $\beta^k$) to induce within-group sparsity. Therefore, the goal of the structural grouping sparsity is to select a limited number of groups of heterogeneous features and to identify discriminative subsets within homogeneous features for predicting particular tags of images. As introduced above, however, correlations or structures among multiple tags are neglected by the input sparsity, since this penalty is defined on each column vector (the input direction of $B$) and is therefore used to predict each tag separately. As shown in Fig. 1, the correlations among the five tags can be well modeled by a tree structure, which can be obtained by performing an agglomerative hierarchical clustering algorithm on the label indicator matrix $Y$. An internal node near the bottom of the tree indicates that the tags of its subtree are highly correlated, whereas an internal node near the root represents relatively weaker correlations among the tags of its subtree. Therefore, the most correlated tag pairs are "sea–sunset" and "mountain–tree," which faithfully correspond to the statistics of coexisting tags in the multi-instance multilabel learning (MIML) data set, i.e., "sea–sunset" and "mountain–tree" are the top two coexisting tag pairs [21]. We propose to impose a tree-structured grouping sparsity penalty on each row vector (the output direction) of $B$. By this penalty, the hierarchical correlations among multiple tags are well


represented and can therefore be used to conduct multilabel boosting for better multitag image annotation. Note that the penalties of structural grouping sparsity and tree-structured grouping sparsity are imposed on different directions of the regression coefficient matrix $B$, i.e., the input and output directions. Such an input–output penalized regression problem cannot be easily solved due to the nonsmoothness and nonconvexity of the objective function. In this paper, we propose to relax the solving process into a two-step regression model, i.e., to sequentially solve the input- and output-sparsity penalized regression problems, which can be taken as a bilayer regression model regularized by input–output structural grouping sparsity. As a whole, this paper proposes a framework of a bilayer regression model for multilabel boosting by the selection of heterogeneous features with structural grouping sparsity (Bi-MtBGS). In the experiments, we first demonstrate the effectiveness of structured feature selection using two open benchmark image data sets. The performance of multitag image annotation by Bi-MtBGS is then evaluated and compared with that of state-of-the-art algorithms on a real-world image data set. The remainder of this paper is organized as follows. We introduce related works in Section II. The framework of Bi-MtBGS and its solutions are introduced in Sections III and IV, respectively. In Section V, we discuss some computational issues of Bi-MtBGS. The experimental analysis and conclusion are given in Sections VI and VII, respectively.

II. RELATED WORKS

In this section, we first review some related AIA techniques that address the two key issues presented in Section I, i.e., exploiting the input or output structures for multitag image annotation.
Then, we introduce the definition of the structured sparsity-inducing norm and group lasso, based on which we will propose the input–output structural grouping sparsity in the following sections.

A. Related AIA Techniques

To address the first issue (i.e., the structures of heterogeneous visual features for image annotation), three kinds of related approaches are available: 1) multiple-kernel-based approaches [11], [18]; 2) group-of-feature selection approaches [11], [33]; and 3) feature-correlation exploiting methods [34]. The extracted heterogeneous global and local visual features can be used to characterize various visual properties of the images. The authors of [18] first partitioned the high-dimensional heterogeneous visual features into multiple subsets, with each feature subset used to characterize one particular type of visual property of the images. As a result, the underlying visual similarity relationships between the images are more homogeneous, and the homogeneous visual similarity within each subset can be approximated more precisely by one particular type of kernel. This multiple-kernel-based method was successfully applied in multilevel image annotation [18]. Based on similar observations, Cao et al. [11] argued that modeling such heterogeneous features has become an increasingly critical issue in image understanding. The HFM

model [11] built a kernel logistic regression model on similarities that combine different features and distance metrics. Experimental results in [11] showed that different kinds of heterogeneous features have different intrinsic discriminative power for the recognition of a particular event or concept. A straightforward method of combining different kernels or features for image annotation is to combine them by linear weights. In this way, however, the weights remain the same across different image samples. To adaptively handle features of different types with different metrics, a group lasso penalty [23] was imposed on the logistic regression to obtain a heterogeneous feature selection model [11]. In [33], taking each type of feature as a group, the so-called clustering prior of features was utilized in an image annotation task. The regression model of group-of-feature selection in [33] is similar to that in our Bi-MtBGS framework, i.e., a sum of (not squared) $\ell_2$-norms is imposed on the groups of coefficients, each corresponding to a group of input features. However, the regression responses (or outputs) in [33] are similarity vectors of training image pairs, which are obtained from a time-consuming process. By the $\ell_1$-norm penalty, lasso can simultaneously perform regression and feature selection. However, if pairwise correlations among a group of features are very high, lasso tends to select only one of the correlated features and does not induce a grouping effect. Inspired by the elastic net [35] and sparse discriminant analysis (SDA) [36], multitask SDA (MtSDA) [34] was proposed to avoid overfitting with highly correlated features and to identify the grouping effect in feature selection. MtSDA takes a quadratic penalty matrix as a generalized $\ell_2$-norm penalty and adds it to the lasso model. Experiments in [34] demonstrated that MtSDA achieves good performance in multitag image annotation.
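The grouping-effect claim can be illustrated with a toy coordinate descent solver (NumPy; the data and penalty values are made up, and this is a minimal sketch rather than the MtSDA method itself): with two identical feature columns, the plain $\ell_1$ penalty keeps only one of them, while adding a quadratic term, as in the elastic net, spreads the weight across both.

```python
import numpy as np

def cd_penalized(X, y, lam1, lam2, iters=500):
    """Coordinate descent for 0.5*||y - Xb||^2 + lam1*||b||_1 + 0.5*lam2*||b||_2^2.
    lam2 = 0 gives the lasso; lam2 > 0 adds a quadratic (elastic-net style) penalty."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual excluding feature j
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lam1, 0.0) / (col_sq[j] + lam2)
    return b

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
X = np.column_stack([x, x, rng.standard_normal(50)])   # two identical columns
y = 2 * x + 0.1 * rng.standard_normal(50)

b_lasso = cd_penalized(X, y, lam1=5.0, lam2=0.0)
b_enet = cd_penalized(X, y, lam1=5.0, lam2=50.0)
# Lasso keeps only one of the duplicated features; the quadratic
# penalty spreads the weight across both (the grouping effect).
n_lasso = np.count_nonzero(np.round(b_lasso[:2], 6))
n_enet = np.count_nonzero(np.round(b_enet[:2], 6))
print(n_lasso, n_enet)  # 1 2
```

The quadratic term makes the objective strictly convex, so the unique solution must treat the two identical columns symmetrically, which is precisely why the weight is shared.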
As introduced in Section I, to address the second issue of considering structures of multiple tags, previous studies [13]–[17] have successfully utilized the latent information hidden in related labels to boost the performance of multilabel image annotation. Although many works have exploited hierarchical structures among multiple tags, i.e., defined an ontology over the tags, to improve the performance of image annotation [18]–[20], such hierarchical structures have not been integrated into a regression-based image annotation framework. The work most similar to our Bi-MtBGS framework may be the multilabel sparse coding (MLSC) approach to AIA proposed in [13]. MLSC also induces sparsity simultaneously on the input visual features and the output multiple labels through two sparse coding phases, namely, image feature sparse reconstruction and multiple-label sparse reconstruction. To label new images, MLSC requires an additional label propagation process based on the learned sparse reconstruction coefficients.

B. Structured Sparsity-Inducing Norm and Group Lasso

In this section, we first provide the notations used in the rest of this paper. Then, we introduce the definition of the structured sparsity-inducing norm and group lasso. In addition, we discuss how group lasso can induce sparsity at the group level.

1) Notations: Assume that we have a training set of $n$ labeled images with $m$ labels (tags). Matrix $X \in \mathbb{R}^{n \times p}$ represents the image data matrix, where $p$ is the dimensionality of the image features. Matrix $Y \in \{0, 1\}^{n \times m}$ represents the label indicator matrix of the image data; $Y_{ik} = 1$ if the $i$th image has the $k$th label and $Y_{ik} = 0$ otherwise. Matrix $B \in \mathbb{R}^{p \times m}$ represents the coefficient matrix, and $\beta^k$ is the coefficient vector for the $k$th label. In the following, we let $b^i$ and $\beta^k$ denote the $i$th row and $k$th column vectors of matrix $B$, respectively. Subscript $j$ in $\beta_j$ is used to index the $j$th element of vector $\beta$. Given a $p$-dimensional vector $\beta$, we let $\|\beta\|_1$ and $\|\beta\|_2$ denote the $\ell_1$-norm and the $\ell_2$-norm of $\beta$, respectively. Moreover, we let $\beta^T$ and $X^T$ represent the transposes of vector $\beta$ and matrix $X$, respectively. Throughout this paper, we assume that the label indicator matrix $Y$ is centered and the feature matrix $X$ is centered and standardized.

2) Definition of the Structured Sparsity-Inducing Norm: Jenatton et al. proposed a general definition of the structured sparsity-inducing norm in [32], from which many sparsity penalties such as lasso, group lasso, and even the tree-guided group lasso [30] can be instantiated.

Definition 1 (Structured Sparsity-Inducing Norm): Given a $p$-dimensional input vector $\beta$, let the set of groups $\mathcal{G}$ be defined as a subset of the power set of $\{1, \dots, p\}$. The structured sparsity-inducing norm is defined as

$$\Omega(\beta) = \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2$$

where $\beta_g$ is the subvector of $\beta$ restricted to the input indices in group $g$ and $w_g$ is the predefined weight for group $g$. In Definition 1, if we ignore the weights $w_g$ and let $\mathcal{G}$ be the set of singletons, i.e., $\mathcal{G} = \{\{1\}, \dots, \{p\}\}$, then $\Omega(\beta)$ is instantiated as the $\ell_1$-norm of vector $\beta$.

3) Group-of-Feature Selection by Group Lasso: Suppose we can extract $p$-dimensional heterogeneous features from images, and these heterogeneous features are divided into $G$ disjoint groups of homogeneous features, with $p_g$ as the number of features in the $g$th group, i.e., $\sum_{g=1}^{G} p_g = p$. For ease of notation, we use a matrix $X_g \in \mathbb{R}^{n \times p_g}$ to represent the features of the training images corresponding to the $g$th group, with a corresponding regression coefficient vector $\beta_g^k$ for the $k$th label. Therefore, we have $X\beta^k = \sum_{g=1}^{G} X_g \beta_g^k$. For the $k$th label, the group-of-feature selection by group lasso [23] can be formulated as the following regularized regression problem:

$$\min_{\beta^k}\; \frac{1}{2}\Big\|y_k - \sum_{g=1}^{G} X_g \beta_g^k\Big\|_2^2 + \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\|\beta_g^k\|_2 \qquad (1)$$
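As a concrete reading of Definition 1, the sketch below (NumPy; the vector, partition, and weights are illustrative) evaluates $\Omega(\beta)$ for two choices of the group set $\mathcal{G}$, recovering the $\ell_1$-norm from singleton groups and the group lasso penalty of (1) from a disjoint partition:

```python
import numpy as np

def omega(beta, groups, weights=None):
    """Structured sparsity-inducing norm: sum_g w_g * ||beta_g||_2."""
    if weights is None:
        weights = [1.0] * len(groups)
    return sum(w * np.linalg.norm(beta[list(g)]) for g, w in zip(groups, weights))

beta = np.array([3.0, -4.0, 0.0, 12.0])

# Singleton groups with unit weights: Omega reduces to the l1-norm.
singletons = [[0], [1], [2], [3]]
assert np.isclose(omega(beta, singletons), np.abs(beta).sum())  # 19.0

# A disjoint partition into two groups with weights sqrt(p_g), as in (1):
partition = [[0, 1], [2, 3]]
w = [np.sqrt(len(g)) for g in partition]
print(omega(beta, partition, w))  # sqrt(2) * (5 + 12), roughly 24.04
```

Because each group contributes through its (not squared) $\ell_2$-norm, driving an entire group's subvector to zero reduces the penalty linearly, which is what makes whole-group selection possible.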


where $\lambda$ is the regularization parameter and $y_k$ is the $k$th column of $Y$. Clearly, the group lasso penalty is a special example of $\Omega(\beta)$ in Definition 1, where $w_g = \sqrt{p_g}$ and $\mathcal{G}$ is a partition of $\{1, \dots, p\}$ according to the disjoint groups of homogeneous features. In particular, when the length of vector $\beta_g^k$ is 1, which means that there is only one feature in the $g$th group, we have $\|\beta_g^k\|_2 = |\beta_g^k|$. Therefore, the sum of (not squared) $\ell_2$-norms in group lasso can be taken to define the $\ell_1$-norm at the group level and can be used to conduct group-of-feature selection [11], [33].

III. TREE-STRUCTURED MULTILABEL BOOSTING BY STRUCTURAL GROUPING SPARSITY

In this section, we describe the framework of bilayer regression for multilabel boosting by selecting heterogeneous features with structural grouping sparsity (Bi-MtBGS). In order to solve this framework efficiently, we relax the solving process into a bilayer regression model regularized by input–output structural grouping sparsity.

A. Problem Statement and Formulation

As introduced in Section I and illustrated in Fig. 1, we aim to train a regression model with input- and output-sparsity penalty terms for automatic multitag annotation of test images:

$$\min_{B}\; \frac{1}{2}\|Y - XB\|_F^2 + \lambda\,\Omega_{\mathrm{in}}(B) + \gamma\,\Omega_{\mathrm{out}}(B) \qquad (2)$$

where $\lambda$ and $\gamma$ are the regularization parameters for the input-sparsity penalty $\Omega_{\mathrm{in}}(B)$ and the output-sparsity penalty $\Omega_{\mathrm{out}}(B)$, respectively. Suppose that the final solution of (2), i.e., the regression coefficient matrix $\hat B$, is learned; we can then predict the score that a test image $x$ belongs to the $k$th label as follows:

$$\hat y_k = x^T \hat\beta^k \qquad (3)$$

As we present in the following, we propose to instantiate $\Omega_{\mathrm{in}}(B)$ as a structural grouping sparsity defined on the input direction of $B$ to conduct structured visual feature selection and $\Omega_{\mathrm{out}}(B)$ as a tree-structured grouping sparsity defined on the output direction of $B$ to conduct hierarchical (or tree-structured) multilabel boosting. Note that the penalties of input–output structural grouping sparsity in (2) are defined on different directions of $B$. Therefore, the optimization problem in (2) cannot be easily solved due to the nonsmoothness and nonconvexity of the objective function. In the following, we relax the solving process of (2) into a two-step regression (or bilayer regression) model, i.e., we sequentially solve the input- and output-sparsity penalized regression problems.

B. Visual Feature Selection by Structural Grouping Sparsity

For the $k$th label, the first-layer regression is formulated as the following regularized regression model:

$$\min_{\beta^k}\; \frac{1}{2}\Big\|y_k - \sum_{g=1}^{G} X_g \beta_g^k\Big\|_2^2 + \lambda_1 \sum_{g=1}^{G} \|\beta_g^k\|_2 + \lambda_2 \|\beta^k\|_1 \qquad (4)$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters. The last two terms of (4) instantiate the input-sparsity penalty $\Omega_{\mathrm{in}}$ in (2) and are called the penalty of structural grouping sparsity. As discussed in Section II-B3, when the extracted heterogeneous features are divided into $G$ disjoint groups of homogeneous features, the penalty $\sum_{g}\|\beta_g^k\|_2$ in (4) can induce sparsity at the group level in order to conduct group selection (see Fig. 1). Since the $\ell_1$-norm penalty can induce sparsity on the whole coefficient vector $\beta^k$, the structural grouping sparsity can also conduct within-group feature selection. Therefore, the structural grouping sparsity in (4) can be used to perform structured visual feature selection in the input direction.

Fig. 1. Training process of the proposed framework. In matrix $X$, different colors of rectangles indicate different types of heterogeneous visual features. Rectangles with dashed borders indicate the features' corresponding coefficients. Blank rectangles indicate zero values in coefficient matrix $B$ and indicator matrix $Y$. Dark rectangles in $Y$ indicate that images are labeled by the corresponding tags. Images and corresponding labeled tags are from the MIML data set [21]. ("mou." is the abbreviation for "mountain.")

C. Multilabel Boosting by Tree-Structured Grouping Sparsity

Let $\hat B$ be the solution of (4) for all $m$ labels. The second-layer regression is formulated as the following regularized regression model:

$$\min_{W}\; \frac{1}{2}\|Y - \hat Y W\|_F^2 + \gamma \sum_{i=1}^{m} \sum_{v \in V} w_v \|w^i_{G_v}\|_2 \qquad (5)$$

For ease of notation, in (5), we let $\hat Y = X\hat B$, and $w^i_{G_v}$ denotes the subvector of the $i$th row of $W$ indexed by the group of labels $G_v$. The second term of (5) instantiates the output-sparsity penalty $\Omega_{\mathrm{out}}$ in (2) and is called the penalty of tree-structured grouping sparsity. Specifically, it is a special example of $\Omega$ in Definition 1 (see Section II-B2) when the set of groups is induced from a tree structure $T$, which is defined as follows. Suppose the correlations among the $m$ labels (outputs) can be represented as a tree $T$ with a set of vertices $V$. In this tree, each leaf node corresponds to an output label (tag), and each internal node represents the group of output labels located at the leaves of the subtree rooted at that node (see the tree structure in Fig. 2 for an example). Taking each leaf node as a singleton group and each internal node as the group of tags at the leaves of its subtree, we obtain a tree-structured set of groups $\{G_v : v \in V\}$, where $G_v$ represents the group of tags corresponding to node $v$. For example, in Fig. 2(a), the leaf node for "sea" induces the group {sea}, and the root node induces the group {sea, sunset, desert, mountain, tree}. The weight $w_v$ in (5) is defined as [30]

$$w_v = \begin{cases} s_v \prod_{u \in \mathrm{Anc}(v)} g_u, & v \text{ is an internal node} \\ \prod_{u \in \mathrm{Anc}(v)} g_u, & v \text{ is a leaf node} \end{cases} \qquad (6)$$

where $\mathrm{Anc}(v)$ denotes the ancestors of node $v$, and the two quantities $s_v$ and $g_v$ are associated with each internal node $v$ of the tree $T$ and satisfy the condition $s_v + g_v = 1$. Furthermore, $s_v$ represents the weight for selecting the labels associated with each of the children of node $v$ separately, and $g_v$ represents the weight for selecting them jointly. For the details of the weighting scheme, please refer to [30]. By this weighting scheme, formula (5) induces a balanced tree-structured grouping sparsity on the output variables (labels) in order to incorporate multilabel correlations into the regression process. Taking the tree structure in Fig. 2(a) as an example, we assign values of $s_v$ and $g_v$ to each internal node, and then the weight for each node [see the underlined numbers associated with the nodes in Fig. 2(a)] is calculated by formula (6).
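To illustrate the weighting scheme in (6), the sketch below (pure Python; the tree shape and the $s_v$ values are illustrative, not the ones used in Fig. 2) computes $w_v$ for every node of a small five-tag tree, with $s_v + g_v = 1$ at each internal node:

```python
# Hypothetical tag tree: leaves are tags; internal nodes group their children.
# `children` maps each internal node to its child nodes; `s` holds chosen s_v.
children = {
    "root": ["inner1", "inner2", "desert"],
    "inner1": ["sea", "sunset"],
    "inner2": ["mountain", "tree"],
}
s = {"root": 0.5, "inner1": 0.8, "inner2": 0.8}   # g_v = 1 - s_v

def weights(node="root", anc_g=1.0, out=None):
    """Compute w_v per (6): an internal node gets s_v times the product of
    g over its ancestors; a leaf gets the product of g over its ancestors."""
    if out is None:
        out = {}
    if node in children:                  # internal node
        out[node] = s[node] * anc_g
        for child in children[node]:
            weights(child, anc_g * (1.0 - s[node]), out)
    else:                                 # leaf node (a tag)
        out[node] = anc_g
    return out

w = weights()
print(w["inner1"])  # s_inner1 * g_root = 0.8 * 0.5 = 0.4
print(w["sea"])     # g_root * g_inner1, roughly 0.1
```

Nodes deep in the tree receive small weights through the product of ancestor $g_u$ terms, so tight tag clusters are penalized gently as a unit while broad groupings near the root are discouraged from dominating.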




Fig. 2. Correlations among the labels are exploited by agglomerative hierarchical clustering on the label indicator matrix $Y$ in Fig. 1 to form the tree structure in Fig. 2(a). Underlined numbers associated with each node denote the corresponding weight. Each node in Fig. 2(a) is named, from $v_1$ onward, to form a node set $V$ for the tree $T$. The corresponding tree-structured groups of regression coefficients are given in Fig. 2(b). (a) Tree structure of the annotated labels in the MIML data set [21]. (b) Tree-structured groups of regression coefficients.
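The clustering step described in the caption can be sketched as follows (pure Python; the toy label matrix and the co-occurrence-based average linkage are illustrative stand-ins for the actual data and distance used in the paper). The most correlated tag pair merges first, mirroring the "sea–sunset" and "mountain–tree" subtrees:

```python
from itertools import combinations

# Toy label indicator matrix (rows: images, columns: tags); values chosen so
# that "sea"/"sunset" and "mountain"/"tree" co-occur most often, as in [21].
tags = ["desert", "mountain", "sea", "sunset", "tree"]
Y = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]

def cooccur(i, j):
    """Number of images labeled with both tag i and tag j."""
    return sum(row[i] and row[j] for row in Y)

# Average-linkage agglomerative clustering on the tag columns.
clusters = [frozenset([i]) for i in range(len(tags))]
merges = []
while len(clusters) > 1:
    best = max(
        combinations(clusters, 2),
        key=lambda ab: sum(cooccur(i, j) for i in ab[0] for j in ab[1])
        / (len(ab[0]) * len(ab[1])),
    )
    clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    merges.append(tuple(sorted(" ".join(tags[i] for i in sorted(c)) for c in best)))

print(merges[0])  # the most correlated pair merges first: ('sea', 'sunset')
```

Each merge creates one internal node of the tree, so the merge order directly determines the tree-structured groups of Fig. 2(b).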


, and then the weight [see the underlined number asis sociated with each node in Fig. 2(a)] for each node calculated by formula (6).

minimization of (9) can be solved by the Gauss-Seidel coordinate descent (GSCD) algorithm [26], [27], [38]. The subgradient is equation of (9) with respect to

IV. SOLUTIONS AND ALGORITHMS In this section, we will introduce the solutions and algorithms of solving (4) and (5), respectively, which presents the final solutions of the proposed framework Bi-MtBGS.

(10) , we

Let define

A. Regularized Regression With Structural Grouping Sparsity The objective function in (4) is convex and separable (considering the zero point in -norm penalty ) so that the blockwise coordinate descent at the group level [37] and piecewise coordinate descent within group [38] for individual features can be used for optimization. 1) Group Selection: The subgradient equation of the first two is terms in (4) with respect to (7) where if Therefore, if

if

and

is a vector with

.

(8)

then , which means that the th group is dropped out of the model. Therefore, (8) can be taken as the condition of performing the blockwise coordinate descent at the group level. Now, suppose that the condition in (8) does not hold, i.e., , then we can perform the piecewise coordinate descent within the th group. , 2) Within-Group Selection: Let , and . We need to minimize the following equation to conduct feature selection within the th group: (9) which is the sum of a convex differentiable function (first two terms) and a separable penalty, i.e., . Therefore,

, , .

viol

(11)

According to the Karush–Kuhn–Tucker (KKT) conditions, the first-order optimality conditions for (9) can be written as voil

(12)

where is the error tolerance. In fact, we refer to (12) as optimality with tolerance . In solving (9) by the Gauss–Seidel with the maximum viol that violates method, one variable the optimality conditions in (12) is chosen and the optimization alone, subproblem is solved with respect to this variable keeping the other ’s fixed. This procedure is repeated as long as there exists a variable that violates the conditions in (12). A combination of the bisection method and the Newton method [38] is used to optimize (9). Let us define the following two sets and . Two points and for which the derivative of the objective function in (9) has opposite signs are chosen, which ensures that the root . The minimizer computation always lies in an interval through a Newton update is (13) where

is the diagonal element of the Hessian matrix of (9). Now, we summarize the solution of (4) in Algorithm 1.

Algorithm 1 Regularized Regression with Structural Grouping Sparsity

Input: Training images containing n samples of dimensionality p, the corresponding label indicator matrix, and an initialization of the coefficient matrix. Suppose that G types of heterogeneous visual features are extracted from each image sample, so that the columns of the design matrix are partitioned into G groups.
Output: The estimated coefficient matrix.
1: for each label
2: repeat: iterate over the groups
3: if the condition in (8) holds then
4: drop the current group by setting its coefficient subvector to zero;
5: else
6: iterate over the coordinates within the current group:
7: initialize the set of violators;
8: while an optimality violator exists
9: find the maximum violator viol and move the corresponding variable to the working set;
10: for each element in the working set
11: while viol > τ
12: update the coordinate by the Newton step in (13);
13: if the updated point falls outside the bracketing interval then
14: recompute the derivative using (10) and reduce the step to the interval boundary;
15: if the derivative at the new point is positive, then shrink the upper endpoint;
16: else if it is negative, then shrink the lower endpoint;
17: end while
18: end for
19: end while
20: until convergence
21: output the coefficients of the current group;
22: end if
23: until convergence
24: output the coefficient vector of the current label;
25: end for
26: output the estimated coefficient matrix.

B. Tree-Structured Multilabel Boosting Methods

By introducing additional variables, an alternative optimization algorithm for solving (5) was developed in [30] to estimate the coefficient matrix. However, the initialization of the coefficient matrix was not discussed in [30]. In this paper, by taking the coefficient matrix estimated from solving (4) as the initialization input for (5), we form a bilayer regression model.

Based on the fact that the variational formulation of mixed-norm regularization is equal to a weighted ℓ2-norm regularization [28], an alternating optimization method was proposed in [30] to solve a relaxation of the original problem in (5) obtained by introducing the additional variables in (14). According to [30], the problem in (5) can then be rewritten as (15), where the formulation contains only smooth functions. Differentiating the objective function in (15) with respect to the coefficient matrix and setting the derivative to zero, we obtain the update equation (16) [30], in which the weighting term is a diagonal matrix determined by the additional variables. Therefore, problem (15) can be solved by optimizing the coefficient matrix and the additional variables alternately over iterations until convergence. Taking the output of Algorithm 1 as the initialization, we summarize the method of tree-structured multilabel boosting in Algorithm 2.

Algorithm 2 Multilabel Boosting with Tree-Structured Grouping Sparsity

Input: Training data; the coefficient matrix is initialized with the output of Algorithm 1.
Output: The estimated coefficient matrix.
1: repeat
2: update the additional variables using (14);
3: for each label
4: update the coefficient vector using (16);
5: end for
6: until convergence
7: output the estimated coefficient matrix.
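The within-group step of Algorithm 1 is, at its core, a max-violator coordinate descent. The following is a hedged sketch, not the paper's implementation: since the exact forms of (8)–(13) are not reproduced above, it uses the standard ℓ1-penalized least-squares subproblem, picks the coordinate with the largest violation of the optimality conditions (the role of viol), and applies the closed-form soft-threshold update in place of the bisection–Newton scheme of [38]. The names `X`, `y`, and `lam` are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    # Elementwise soft-thresholding: the closed-form minimizer of
    # 0.5*(x - z)**2 + t*|x|.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(X, y, lam, tol=1e-6, max_updates=1000):
    """Gauss-Seidel coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1.

    At each step, the coordinate with the largest violation of the
    first-order optimality conditions (within tolerance `tol`) is
    updated, mirroring the max-violator rule described in the text.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)           # diagonal of X^T X
    r = y - X @ b                            # residual
    for _ in range(max_updates):
        grad = -X.T @ r                      # gradient of the smooth part
        # Violation of the subgradient optimality conditions:
        #   b_j = 0  requires |grad_j| <= lam
        #   b_j != 0 requires grad_j + lam*sign(b_j) = 0
        viol = np.where(b == 0.0,
                        np.maximum(np.abs(grad) - lam, 0.0),
                        np.abs(grad + lam * np.sign(b)))
        j = int(np.argmax(viol))
        if viol[j] <= tol:                   # optimality with tolerance tol
            break
        # Exact minimization over coordinate j (closed form for the squared
        # loss; the paper instead uses a bisection/Newton scheme [38]).
        z = b[j] + (X[:, j] @ r) / col_sq[j]
        b_new = soft_threshold(z, lam / col_sq[j])
        r += X[:, j] * (b[j] - b_new)        # incremental residual update
        b[j] = b_new
    return b
```

At termination, the maximum violation is below `tol`, which corresponds to the "optimality with tolerance" stopping rule described for (12); each exact coordinate update also strictly decreases the objective, as asserted by Theorem 1.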

V. COMPUTATIONAL DISCUSSION

Some computational issues are discussed in this section. We mainly discuss the computational complexity of the least-squares regression with the proposed structural grouping sparsity for feature selection and the convergence of the GSCD optimization process.

A. Complexity

The computational complexity is crucial for the successful application of an algorithm. In Tibshirani's original paper [22], he found that model selection with the ℓ1-norm can usually be finished within a limited number of iterations


in practice. The runtime performance of the regression model with structural grouping sparsity is also very efficient when it is implemented by the cyclic coordinate descent method. From the description of the coordinate descent by the Gauss–Seidel method, we can see that a complete cycle through all the coordinates takes a number of operations proportional to the number of nonzero elements when the sparsity of the data is considered. Therefore, the complexity of the regression model with structural grouping sparsity is roughly linear in the number of nonzero elements.

B. Convergence

Zhang and Oles [39] also used the coordinate descent optimization method to solve ridge logistic regression, and the convergence of coordinate descent is verified in [39]. The basic idea is that, after each iteration, the value of the objective function strictly decreases; because of the convexity of the objective function, the optimization converges to its global minimum. As discussed above, the objective function in (9) is convex. Therefore, the solution for the coefficient vector of structural grouping sparsity for each label is guaranteed to converge to the global optimum.

Theorem 1: At each step (each round of the for iteration from steps 10 to 18) of Algorithm 1, the KKT condition does not hold, and the objective function in (9) decreases.

Theorem 1 guarantees the convergence of Algorithm 1 due to the convexity of the objective function in (9). A proof of Theorem 1 can be found in the Appendix.

VI. EXPERIMENTS

In this section, we evaluate the performance of our proposed Bi-MtBGS framework in automatic multitag image annotation. The effectiveness of the structured feature selection is demonstrated using two open benchmark image data sets, and then we evaluate and compare the performance of Bi-MtBGS with state-of-the-art methods.

A.
Experimental Configuration
1) Data Sets: Three benchmark image data sets are used in our experiments, namely, Microsoft Research Cambridge (MSRC), MIML [21], and NUS-WIDE [40]. Note that NUS-WIDE can be taken as a real-world image data set since its images were downloaded from the Flickr website. In addition, 23 and 5 class labels (tags) are associated with the images in MSRC and MIML, respectively; these images are multitagged, and the labels can be used as annotation ground truth. We randomly sampled 10 000 images from NUS-WIDE in our experiments. For the ground truth of NUS-WIDE, we chose two indicator matrices from the selected data samples to form two data sets, i.e., NUS-6 and NUS-16, in which the top 6 and 16 tags that label the maximum numbers of positive instances were selected, respectively.
Multiple heterogeneous features were extracted and concatenated into a visual feature vector for each image. Taking each type of homogeneous features as a group, we sequentially number the feature groups in the following sections. The features, their dimensionality, and the sequence numbers for each of the three data sets are listed as follows:


MSRC and MIML: 638-D features are sequentially divided into seven groups, i.e., 1: 256-D color histogram; 2: 6-D color moments; 3: 128-D color coherence; 4: 15-D textures; 5: 10-D Tamura-texture coarseness; 6: 8-D Tamura-texture directionality; and 7: 80-D edge orientation histogram; moreover, 8 (only for MIML): 135-D SBN colors [10]. Note that the eighth group of features is used only in the MIML data set; therefore, the dimensionality of the visual feature vector for images in MSRC is 503-D.
NUS: 634-D features are sequentially divided into five groups, i.e., 1: 64-D color histogram; 2: 144-D color correlogram; 3: 73-D edge direction histogram; 4: 128-D wavelet texture; and 5: 225-D blockwise color moments.
2) Evaluation Metrics: The area under the receiver operating characteristic curve (AUC) and the F1 score are used to measure the performance of image annotation. Since there are multiple labels (tags) in our experiments, to measure both the global performance across multiple tags and the average performance over all tags, we use both the microaveraging and macroaveraging methods, following [17] and [41]. Microaveraged values are calculated by constructing a global contingency table and then computing the measures from its sums; in contrast, macroaveraged scores are obtained by computing each measure for each category and then taking the average. Therefore, the evaluation metrics we use are micro-AUC, macro-AUC, micro-F1, and macro-F1. Furthermore, the estimated indicator matrix for the test data is continuous rather than binary; in order to calculate the F1 score, we use the threshold-calculation algorithm in [34] to learn thresholds and then quantize the estimated indicator matrix to binary.
In particular, for the training data, we randomly sampled 300, 500, and 1000 samples from the MSRC, MIML, and NUS data sets, respectively. The remaining data were used as the corresponding test data.
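The micro- and macroaveraging just described can be made concrete with a short sketch. This is an illustrative computation of the two F1 variants from binary indicator matrices, not the paper's evaluation code; the function and variable names are assumptions.

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    # Standard F1 from contingency counts; defined as 0 in the degenerate case.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def micro_macro_f1(Y_true, Y_pred):
    """Micro- and macro-averaged F1 over the columns (tags) of two
    binary indicator matrices of shape (n_samples, n_tags)."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    tp = (Y_true & Y_pred).sum(axis=0)
    fp = (~Y_true & Y_pred).sum(axis=0)
    fn = (Y_true & ~Y_pred).sum(axis=0)
    # Micro: pool the counts over all tags into one global contingency table.
    micro = f1_from_counts(tp.sum(), fp.sum(), fn.sum())
    # Macro: compute F1 per tag, then average over tags.
    macro = float(np.mean([f1_from_counts(t, f, n)
                           for t, f, n in zip(tp, fp, fn)]))
    return micro, macro
```

The micro score is dominated by frequent tags (one global table), while the macro score weights every tag equally, which is why the paper reports both.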
For each data set, this process was repeated ten times to generate ten random training/test partitions, and the average performance in terms of AUC and F1 score, together with the standard deviations, is reported.

B. Parameter Setting and Tuning

The two regularization parameters in (4) need to be tuned. For the first training/test partition, we choose these parameters by fivefold cross validation on the training data; the chosen values are then fixed to train the feature selection model for all ten partitions. Note that different features play different roles in the feature selection for different labels; therefore, the parameter tuning is performed separately for each tag. We depict three examples of parameter tuning by fivefold cross validation with respect to micro-AUC in Fig. 3, where we can observe that the highest performance on the three concepts is achieved at some intermediate values of the two parameters.

Two further parameters in the tree-structured multilabel boosting process of (5), the regularization parameter and the weighting parameter s, need to be set and tuned. The complementary weight g is determined by s through the constraint s + g = 1; for simplicity, we fix s to the same value throughout. Both parameters are tuned by the same method and configuration as above. In order to show their different impacts, we depict the tuning results of the regularization parameter and of s in Figs. 4 and 5, respectively. As shown in Fig. 5, better prediction performance


Fig. 3. Parameter tuning examples of Bi-MtBGS for three labels: “bird” from the MSRC data set, “sunset” from the MIML data set, and “ocean” from the NUS data set. The respective ranges of the two regularization parameters for MSRC, MIML, and NUS are {0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0}, {0.1, 0.5, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0}, and {0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.2, 0.3, 0.4, 0.5}. (a) Bird. (b) Sunset. (c) Ocean.
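The per-tag fivefold cross-validation tuning illustrated in Fig. 3 follows a standard grid-search pattern. The sketch below is a hedged stand-in: it scores a closed-form ridge model (not the paper's structural-grouping-sparsity regression) by held-out AUC over a one-dimensional grid, and all names (`tune_by_cv`, `ridge_fit`, `grid`) are illustrative.

```python
import numpy as np

def auc_score(y_true, scores):
    # AUC via the Mann-Whitney statistic: the probability that a random
    # positive sample is scored above a random negative one (ties count half).
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.5  # convention for a degenerate fold
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: a simple stand-in for the paper's
    # first-layer regression with structural grouping sparsity.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def tune_by_cv(X, y, grid, n_folds=5, seed=0):
    """Pick the regularization value in `grid` with the best mean
    cross-validated AUC for a single tag (one column of the label matrix)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_lam, best_auc = None, -1.0
    for lam in grid:
        aucs = []
        for k in range(n_folds):
            train = np.concatenate([f for j, f in enumerate(folds) if j != k])
            b = ridge_fit(X[train], y[train], lam)
            aucs.append(auc_score(y[folds[k]], X[folds[k]] @ b))
        if np.mean(aucs) > best_auc:
            best_lam, best_auc = lam, float(np.mean(aucs))
    return best_lam, best_auc
```

In the paper's setting this search runs over a two-dimensional grid of the two regularization parameters, once per tag, on the first training/test partition only.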

Fig. 4. Parameter tuning of the regularization parameter in (5) for the MIML, NUS-6, and NUS-16 data sets. The respective ranges for the MIML and NUS data sets are {1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100} and {1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10}. (a) MIML. (b) NUS-6. (c) NUS-16.

Fig. 5. Impact of the weighting parameter s on the performance of tree-structured multilabel boosting: micro-AUC for different values of s on the three data sets MIML, NUS-6, and NUS-16. The scope of s is {0.1, 0.2, ..., 0.9}. Note that s + g = 1. (a) MIML. (b) NUS-6. (c) NUS-16.

Fig. 6. Sample images for labels “tree” and “bird” from MSRC data set.

is reached when the value of s is smaller (or, equally, when g is bigger). As g represents the weight for jointly selecting the labels associated with the current node, we can draw the conclusion that, generally, when more label correlations are incorporated, we obtain better multilabel boosting performance.

C. Heterogeneous Feature Selection

We explored the differences in heterogeneous feature selection among group lasso, lasso, and Bi-MtBGS on the MSRC and MIML data sets for each label. Sample images from MSRC and MIML are listed in Figs. 6 and 7, respectively. As can be seen, images with different labels (semantics) have different heterogeneous low-level features, such as color and texture. Therefore, training a model for heterogeneous feature selection is crucial for understanding the semantic content of these images.

Fig. 7. Sample images for labels “mountain” and “sunset” from the MIML data set.

In order to uncover the different mechanisms of heterogeneous feature selection, we output the coefficient vectors from the three algorithms after a 10-round repetition and investigate the results of group selection for each round. The results of heterogeneous feature selection for the label “bird” in Fig. 6 are depicted in Fig. 8. We observe that, although group lasso and Bi-MtBGS can both select groups of features, the coefficient values within groups are obviously different. Bi-MtBGS successfully induces sparsity of the coefficient values within groups, i.e., shrinks them to zero, much as lasso does; on the contrary, group lasso tends to include all the coefficients of a selected group in the model. Comparing the group selection results between group lasso and Bi-MtBGS, Bi-MtBGS produces more consistent group selection over the 10-round repetition of the training process. Moreover, for the label “bird,” Bi-MtBGS adaptively includes more groups of heterogeneous features in the model since the low-level features of images with this label are more variable.

Fig. 8. Heterogeneous feature selection results from group lasso, lasso, and Bi-MtBGS over a 10-round repetition for the label “bird” in the MSRC data set. Different colors of the histograms indicate regression coefficients obtained from different repetition rounds. Numbers “1”–“7” index the heterogeneous feature groups. (a) Coefficient values (group lasso). (b) Groups (group lasso). (c) Coefficient values (lasso). (d) Coefficient values (Bi-MtBGS). (e) Groups (Bi-MtBGS).

Furthermore, we explore the results of heterogeneous feature selection by Bi-MtBGS for different sizes of training data. Figs. 9 and 10 depict the results for images with the labels “mountain” and “sunset” in the MIML data set, respectively, for different sizes of training data. Note that the coefficient values are plotted from one round, whereas the group selection is plotted over the 10-round repetition. As can be seen, the heterogeneous feature selection becomes more consistent and interpretable as the size of the training data increases. For example, the most discriminative features for the label “mountain” (see sample images in Fig. 7) are color and shape; the corresponding extracted feature groups in our experiments are 1: 256-D color histogram; 2: 6-D color moments; 3: 128-D color coherence; 8: 135-D SBN; and 7: 80-D edge orientation histogram. Comparing the results in Fig. 9, noisy feature groups, i.e., the texture feature groups, are almost excluded from the models when the number of training samples reaches 900; in particular, the coefficient vector obtained from 900 training samples is sparser. For the tag “sunset,” we can draw a similar conclusion: as Fig. 10 shows, feature groups such as textures and edges, which are not very discriminative for the concept “sunset” (see sample images in Fig. 7), are nearly dropped out of the model as the size of the training data increases.

D. Performance of Multilabel Learning and Annotation

To show the overall performance of multitag image annotation by the Bi-MtBGS framework, we compare Bi-MtBGS with five state-of-the-art algorithms: CCA-ridge, support vector machine (SVM), MTL-LS [17], MLSC [13], and C&W [26], [27]. Note that MTL-LS, MLSC, and C&W are multilabel learning methods, whereas the other two algorithms perform binary classification for each label separately. We follow the configuration of the C&W algorithm in [26]. Furthermore, as introduced at the end of Section II-A, MLSC conducts sparse reconstruction on the input features and the output multiple labels simultaneously, which is very similar to our Bi-MtBGS framework. The sparsity parameters and the reduced dimensionality of MLSC are tuned in {1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10} and {50, 100, 150, 200, 250, 300}, respectively. For details of CCA-ridge, SVM, and MTL-LS, please refer to [17]. The average performance and standard deviations over the 10-round repetition of each algorithm are reported in Tables I and II. We can see that the Bi-MtBGS framework yields the best image annotation results in terms of both AUC and F1 score.


Fig. 9. Heterogeneous feature selection results from Bi-MtBGS for different sizes of training data, i.e., 500, 600, and 900, respectively, for the concept “mountain” in the MIML data set. Different colors of the histograms indicate regression coefficients obtained from different repetition rounds. Numbers “1”–“8” index the heterogeneous feature groups. (a) Coefficient values (500). (b) Groups (500). (c) Coefficient values (600). (d) Groups (600). (e) Coefficient values (900). (f) Groups (900).

Fig. 10. Heterogeneous feature selection results from Bi-MtBGS for different sizes of training data, i.e., 500, 600, and 900, respectively, for the concept “sunset” in the MIML data set. Different colors of the histograms indicate regression coefficients obtained from different repetition rounds. Numbers “1”–“8” index the heterogeneous feature groups. (a) Coefficient values (500). (b) Groups (500). (c) Coefficient values (600). (d) Groups (600). (e) Coefficient values (900). (f) Groups (900).
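The qualitative patterns visible in Figs. 8–10 (whole groups dropped by group lasso, within-group zeros produced by lasso-style shrinkage, and both at once in Bi-MtBGS) can be traced to the proximal operators of the three penalties. The following is a hedged numpy sketch of those operators for a single feature group, not the paper's solver; the thresholds `t`, `t1`, and `t2` are illustrative.

```python
import numpy as np

def prox_l1(v, t):
    # Lasso prox: elementwise soft-thresholding -> zeros inside the vector.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_group(v, t):
    # Group-lasso prox: shrink the whole group according to its norm;
    # coefficients inside a surviving group stay dense.
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def prox_sparse_group(v, t1, t2):
    # Sparse-group prox: elementwise shrinkage followed by group shrinkage,
    # so groups can be dropped AND coefficients within a kept group can be
    # zeroed, mirroring the two-level selection discussed in the text.
    return prox_group(prox_l1(v, t1), t2)
```

This is why group lasso "tends to include all the coefficients" of a kept group, while the structural grouping penalty also produces within-group zeros.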

We also compare the learning performance of Bi-MtBGS with that of C&W and MLSC for different sizes of training data. This process is also repeated for ten rounds by randomly sampling ten times for each training size. For better illustration, we plot the learning performance of Bi-MtBGS and MLSC on the NUS data set in terms of AUC and F1 score in Fig. 12(c) and (d), respectively, unlike the other results, which show the microevaluations and macroevaluations in separate panels. From Figs. 11 and 12, we can observe that the learning performances of Bi-MtBGS and C&W improve with more training samples and are more stable than that of MLSC. Furthermore, the performance of Bi-MtBGS is better than those of C&W and MLSC for all training sizes.

VII. CONCLUSION

This paper has proposed a multilabel learning framework for image annotation called Bi-MtBGS. The Bi-MtBGS method is attractive due to its subgroup feature identification by a structural grouping penalty in heterogeneous feature settings, along with its tree-structured multilabel boosting


TABLE I
PERFORMANCE COMPARISON OF MULTITAG IMAGE ANNOTATION ON THE NUS-6 DATA SET. THE BEST PERFORMANCES ARE HIGHLIGHTED IN BOLDFACE (AVERAGE VALUE ± STANDARD DEVIATION)

TABLE II
PERFORMANCE COMPARISON OF MULTITAG IMAGE ANNOTATION ON THE NUS-16 DATA SET. THE BEST PERFORMANCES ARE HIGHLIGHTED IN BOLDFACE (AVERAGE VALUE ± STANDARD DEVIATION)

Fig. 11. AUC and F1 score versus different sizes of training data. The respective ranges of training size for MIML and NUS-6 data sets are {500, 600, 700, 800, 900, 1000} and {1000, 1200, 1400, 1600, 1800, 2000}. “C&W” denotes C&W multilabel boosting in [26], and “Tree” denotes tree-structured multilabel boosting in Bi-MtBGS. (a) MIML (micro). (b) MIML (macro). (c) NUS-6 (micro). (d) NUS-6 (macro).

capability. Experiments show that Bi-MtBGS not only is more interpretable for image annotation but also achieves better multitag image annotation performance than state-of-the-art methods.

APPENDIX A
PROOF OF THEOREM 1

Proof: Consider the smooth part of the objective function restricted to a single coordinate; it has a continuous first-order derivative and a piecewise continuous second-order derivative. Let an upper bound on this second derivative in the vicinity of the current point be given. Then the minimum will be approximately located at the corresponding Newton step from the current point.

Fig. 12. AUC and F1 score versus different sizes of training data. The respective ranges of training size for MIML and NUS-6 data sets are {500, 600, 700, 800, 900, 1000} and {1000, 1200, 1400, 1600, 1800, 2000}. “MLSC” denotes the method in [13], and “Tree” denotes tree-structured multilabel boosting in Bi-MtBGS. (a) MIML (micro). (b) MIML (macro). (c) NUS-6 (AUC). (d) NUS-6 (F1 score).

By this approximation, if the Newton step happens to fall inside the current interval, it is taken as the next step of the algorithm; otherwise, it is reduced to the interval boundary. Turning to the objective function in (9), it is convex and smooth everywhere except at the points where a coordinate equals zero. Steps 10 to 18 of Algorithm 1 guarantee that the root of interest lies in the interval defined above. Furthermore, the while condition in step 8 of Algorithm 1 ensures that viol > τ is satisfied for each round of the for iteration (steps 10–18 in Algorithm 1), which means that the KKT condition in (12) does not hold. Therefore, the GSCD method guarantees that the objective function decreases.

In the case when the current value of the coordinate is 0, we try both directions and see if either of them works. Note that, since the objective function in (9) is convex, it does not matter which direction is tried first, for it is not possible for both of them to be successful.
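The safeguarded one-dimensional search described in the proof (Newton steps restricted to a sign-change bracket, with bisection as the fallback, in the spirit of [38]) can be sketched as follows. This is an illustrative routine, not the paper's code; the function names and the tolerance convention are assumptions.

```python
def newton_bisection_min(dfunc, d2func, lo, hi, tol=1e-10, max_iter=100):
    """Find the root of dfunc (the derivative of a 1-D convex objective)
    in [lo, hi], where dfunc(lo) and dfunc(hi) have opposite signs.
    Newton steps are accepted when they stay inside the bracket;
    otherwise the method falls back to bisection."""
    f_lo = dfunc(lo)
    assert f_lo * dfunc(hi) <= 0, "bracket must contain the root"
    x = 0.5 * (lo + hi)
    for _ in range(max_iter):
        fx = dfunc(x)
        if abs(fx) <= tol:
            break
        # Shrink the bracket so that it still contains the sign change.
        if fx * f_lo > 0:
            lo, f_lo = x, fx
        else:
            hi = x
        h = d2func(x)
        step = x - fx / h if h > 0 else None   # Newton candidate
        # Accept Newton only if it lands strictly inside the bracket.
        x = step if step is not None and lo < step < hi else 0.5 * (lo + hi)
    return x
```

Because the bracket never loses the sign change, the iterate cannot escape the interval, which is the safeguard the proof relies on.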


REFERENCES [1] Y. Rui, T. Huang, and S. Chang, “Image retrieval: Current techniques, promising directions, and open issues,” J. Visual Commun. Image Represent., vol. 10, no. 1, pp. 39–62, Mar. 1999. [2] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Contentbased image retrieval at the end of the early years,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1349–1380, Dec. 2000. [3] R. Datta, D. Joshi, J. Li, and J. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Comput. Surveys (CSUR), vol. 40, no. 2, p. 5, Apr. 2008. [4] Y. Liu, D. Zhang, G. Lu, and W. Ma, “A survey of content-based image retrieval with high-level semantics,” Pattern Recognit., vol. 40, no. 1, pp. 262–282, Jan. 2007. [5] D. Zhang, M. Islam, and G. Lu, “A review on automatic image annotation techniques,” Pattern Recognit., vol. 45, no. 1, pp. 346–362, Jan. 2012. [6] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. Blei, and M. Jordan, “Matching words and pictures,” J. Mach. Learn. Res., vol. 3, pp. 1107–1135, Mar. 2003. [7] P. Duygulu, K. Barnard, J. De Freitas, and D. Forsyth, “Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary,” in Proc. ECCV, 2002, pp. 97–112. [8] D. Grangier and S. Bengio, “A discriminative kernel-based approach to rank images from text queries,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 8, pp. 1371–1384, Aug. 2008. [9] Y. Chen, J. Wang, and D. Geman, “Image categorization by learning and reasoning with regions,” J. Mach. Learn. Res., vol. 5, pp. 913–939, Dec. 2004. [10] O. Maron and A. Ratan, “Multiple-instance learning for natural scene classification,” in Proc. ICML, 1998, pp. 341–349. [11] L. Cao, J. Luo, F. Liang, and T. Huang, “Heterogeneous feature machines for visual recognition,” in Proc. IEEE ICCV, 2009, pp. 1095–1102. [12] K. Saenko and T. Darrell, “Unsupervised learning of visual sense models for polysemous words,” in Proc. Adv. Neural Inform. Process. 
Syst., 2008, vol. 13, pp. 1–8. [13] C. Wang, S. Yan, L. Zhang, and H. Zhang, “Multi-label sparse coding for automatic image annotation,” in Proc. IEEE Comput. Soc. Conf. CVPR, 2009, pp. 1643–1650. [14] F. Kang, R. Jin, and R. Sukthankar, “Correlated label propagation with application to multi-label learning,” in Proc. IEEE Comput. Soc. Conf. CVPR, 2006, pp. 1719–1726. [15] Y. Zhang and Z. Zhou, “Multi-label dimensionality reduction via dependence maximization,” in Proc. AAAI Conf., 2008, pp. 1503–1505. [16] P. Rai and H. Daumé, III, “Multi-label prediction via sparse infinite CCA,” in Proc. Adv. Neural Inform. Process. Syst., 2010, vol. 22, pp. 1518–1526. [17] S. Ji, L. Tang, S. Yu, and J. Ye, “Extracting shared subspace for multilabel classification,” in Proc. ACM SIGKDD Int. Conf. KDD, 2008, pp. 381–389. [18] J. Fan, Y. Gao, and H. Luo, “Integrating concept ontology and multitask learning to achieve more effective classifier training for multilevel image annotation,” IEEE Trans. Image Process., vol. 17, no. 3, pp. 407–426, Mar. 2008. [19] M. Srikanth, J. Varner, M. Bowden, and D. Moldovan, “Exploiting ontologies for automatic image annotation,” in Proc. 28th. Annu. Int. Conf. ACM SIGIR, 2005, pp. 552–558. [20] R. Shi, C. Lee, and T. Chua, “Enhancing image annotation by integrating concept ontology and text-based Bayesian learning model,” in Proc. 15th Int. Conf. Multimedia, 2007, pp. 341–344. [21] Z. Zhou and M. Zhang, “Multi-instance multi-label learning with application to scene classification,” in Proc. NIPS, 2007, pp. 1609–1616. [22] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” J. Roy. Stat. Soc. Ser. B (Stat. Methodol.), vol. 58, no. 1, pp. 267–288, 1996. [23] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. Roy. Stat. Soc. Ser. B (Methodol.), vol. 68, no. 1, pp. 49–67, Feb. 2006. [24] L. Breiman and J. Friedman, “Predicting multivariate responses in multiple linear regression,” J. Roy. Stat. 
Soc. Ser. B (Methodol.), vol. 59, no. 1, pp. 3–54, 1997. [25] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.

[26] F. Wu, Y. Han, Q. Tian, and Y. Zhuang, “Multi-label boosting for image annotation by structural grouping sparsity,” in Proc. ACM MM, 2010, pp. 15–24.
[27] Y. Han, F. Wu, and Y. Zhuang, “Multi-label image annotation by structural grouping sparsity,” in Social Media Modeling and Computing. New York: Springer-Verlag, 2011, pp. 97–118.
[28] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task feature learning,” Mach. Learn., vol. 73, no. 3, pp. 243–272, Dec. 2008.
[29] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,” in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2007, p. 41.
[30] S. Kim and E. Xing, “Tree-guided group lasso for multi-task regression with structured sparsity,” in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 543–550.
[31] X. Chen, S. Kim, Q. Lin, J. Carbonell, and E. Xing, “Graph-structured multi-task regression and an efficient optimization method for general fused Lasso,” 2010. [Online]. Available: http://arxiv.org/abs/1005.3579
[32] R. Jenatton, J. Audibert, and F. Bach, “Structured variable selection with sparsity-inducing norms,” 2009. [Online]. Available: http://arxiv.org/abs/0904.3523
[33] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D. Metaxas, “Automatic image annotation using group sparsity,” in Proc. IEEE Conf. CVPR, 2010, pp. 3312–3319.
[34] Y. Han, F. Wu, J. Jia, Y. Zhuang, and B. Yu, “Multi-task sparse discriminant analysis (MtSDA) with overlapping categories,” in Proc. 24th AAAI Conf., 2010, pp. 469–474.
[35] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J. Roy. Stat. Soc. Ser. B (Stat. Methodol.), vol. 67, no. 2, pp. 301–320, Apr. 2005.
[36] L. Clemmensen, T. Hastie, and B. Ersbøll, “Sparse discriminant analysis,” 2008. [Online]. Available: http://www-stat.stanford.edu/~hastie/Papers/
[37] J. Friedman, T. Hastie, and R. Tibshirani, “A note on the group Lasso and a sparse group lasso,” 2010. [Online]. Available: http://www-stat.stanford.edu/~tibs/ftp/sparse/grlasso.pdf
[38] S. Shevade and S. Keerthi, “A simple and efficient algorithm for gene selection using sparse logistic regression,” Bioinformatics, vol. 19, no. 17, pp. 2246–2253, 2003.
[39] T. Zhang and F. Oles, “Text categorization based on regularized linear classification methods,” Inf. Retrieval, vol. 4, no. 1, pp. 5–31, Apr. 2001.
[40] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: A real-world web image database from National University of Singapore,” in Proc. ACM Int. Conf. Image Video Retrieval, 2009, pp. 1–9.
[41] D. Lewis, “Evaluating text categorization,” in Proc. Speech Nat. Lang. Workshop, 1991, pp. 312–318.

Yahong Han received the Ph.D. degree from Zhejiang University, Hangzhou, China. He is currently an Associate Professor with the School of Computer Science and Technology, Tianjin University, Tianjin, China. His current research interests include multimedia analysis, retrieval, and machine learning.

Fei Wu received the B.S. degree from Lanzhou University, Lanzhou, China, the M.S. degree from the University of Macau, Taipa, Macau, and the Ph.D. degree from Zhejiang University, Hangzhou, China. He is currently a Professor with the College of Computer Science and Technology, Zhejiang University. His current research interests include multimedia analysis, retrieval, statistic learning, and pattern recognition.


Qi Tian (SM’04) received the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Urbana, in 2002. He is currently an Associate Professor with the Department of Computer Science, The University of Texas at San Antonio, San Antonio. He has authored or coauthored more than 100 refereed journal and conference papers. His research interests include multimedia information retrieval and computer vision. Dr. Tian is a member of the Association for Computing Machinery (ACM). He is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and has served as a Guest Coeditor of the IEEE TRANSACTIONS ON MULTIMEDIA. He has served in various capacities for more than 120 IEEE and ACM conferences. He has been a Guest Coeditor of the Journal of Computer Vision and Image Understanding, ACM Transactions on Intelligent Systems and Technology, and EURASIP Journal on Advances in Signal Processing. He is a member of the Editorial Board of the Journal of Multimedia.


Yueting Zhuang (M’92) received the B.S., M.S., and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 1986, 1989, and 1998, respectively. Currently, he is a Professor and a Ph.D. Supervisor with the College of Computer Science and Technology, Zhejiang University. His current research interests include multimedia databases, artificial intelligence, and video-based animation.
