Uniforming Residual Vector Distribution for Distinctive Image Representation

Zhen Liu, Houqiang Li, Wengang Zhou, Ting Rui, Qi Tian, Senior Member, IEEE
Abstract—Recently, the vector of locally aggregated descriptors (VLAD) has been demonstrated to be a highly efficient image representation. However, because it divides the feature space coarsely, its discriminative power is limited. An intuitive remedy is to construct the VLAD descriptor with a larger vocabulary, but this leads to a higher-dimensional VLAD and greater computational complexity when learning the PCA parameters used to project VLAD onto a low-dimensional space. In this paper, we propose a hierarchical scheme to build the VLAD descriptor. In our approach, a hidden-layer visual vocabulary is constructed by generating sub-words for each visual word of a coarse vocabulary. With the hidden-layer vocabulary, the feature space is divided more finely. We then aggregate the residuals from the hidden layer back to the coarse layer to obtain an image descriptor with the same dimension as the original VLAD. In addition, we show that whitening the local descriptors further enhances the discriminative power of the VLAD descriptor. We validate our approach with experiments on three benchmark datasets, i.e., the Holidays, UKBench, and Oxford Building datasets, with Flickr1M as distractors, and compare it with related VLAD algorithms. The experimental results demonstrate the effectiveness of our algorithm.

Index Terms—Image Representation, Hierarchical VLAD, Hidden Layer, Image Search, Image Descriptor, Local Features
I. INTRODUCTION

In recent years, content-based image search has attracted much attention from both the multimedia and computer vision communities [1]–[23]. Many approaches are based on the Bag-of-Visual-Words (BoVW) model and adopt invariant local features for image representation [24], [25]. In the BoVW model, an image is represented as a visual word vector based on the quantization results of its local features. For quantization, a visual codebook is usually trained beforehand with techniques such as standard k-means, hierarchical k-means (HKM) [6], or approximate k-means (AKM) [26]. Local features are quantized to visual words with nearest neighbor search or an approximate nearest neighbor search approach. To obtain a full representation of an image, hundreds of thousands or even millions of visual words are used, which results in a very high-dimensional vector representation.

Z. Liu, H. Li and W. Zhou are with the CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230027 China (e-mail: [email protected]; [email protected]; [email protected]). T. Rui is with the Information Technology Department, PLA University of Science and Technology, Nanjing, China. Q. Tian is with the Computer Science Department, University of Texas at San Antonio, San Antonio, TX 78249 USA (e-mail: [email protected]).
Compared to image representation based on the BoVW model, recent research reveals that the VLAD descriptor is a more efficient alternative [27]–[30]. Instead of counting the number of features quantized to each visual word as in the BoVW model, VLAD aggregates the residual vectors between local features and their visual words. To obtain a more compact representation, the VLAD descriptor is reduced to a low-dimensional space by PCA, for example 128-D. An image can then be represented by a 128-D vector, the same dimension as a single local SIFT feature [25], [31]. This is very attractive in large-scale vision applications, since the corresponding memory overhead is very low.

In VLAD, a small visual vocabulary is adopted, typically 64 words [27], [30], which leads to a coarse division of the feature space and limits the discriminative power of the final representation. The most direct way to address this issue is to construct the VLAD descriptor with a larger vocabulary, which yields a finer division of the feature space. However, this results in a higher-dimensional descriptor and more computational complexity when learning the PCA parameters used to project the full VLAD descriptor onto a low-dimensional space [28].

In this paper, we propose a hierarchical method to construct the VLAD descriptor (HVLAD) that inherits the benefit of the finer division brought by a larger vocabulary while preserving the same dimension as the original VLAD descriptor. In our HVLAD descriptor, a hidden-layer visual vocabulary is constructed by generating sub-words for each word of the coarse vocabulary used to build the original VLAD descriptor. With the hidden-layer vocabulary, the feature space is finely divided. The residual vectors between local features and sub-words are first aggregated at the hidden layer and then aggregated to the coarse layer. Usually the discriminability of each dimension of a local descriptor is different, which is reflected by the different variances of the dimensions. In our experiments, we found that if the variance of each dimension of the local descriptor is balanced (whitened), we can obtain better retrieval accuracy with the VLAD representation.

The rest of the paper is organized as follows. We first introduce the related work in Section II. In Section III, we present our hierarchical VLAD scheme and discuss its motivation; the local descriptor whitening operation is also discussed. Our experimental results are provided in Section IV. Finally, we conclude the paper in Section V.
II. RELATED WORK

Since our paper focuses on the VLAD image descriptor for large-scale image search, this section discusses related work on this image representation approach in that scenario.

The VLAD descriptor was first proposed in [27] as a compact image representation. Instead of representing an image with a visual word histogram as in the BoVW model [1], it aggregates the residual vectors between the SIFT [25], [31] features and their quantized visual words. These residual vectors are then concatenated to form a super-vector representing the image. As the SIFT descriptor is 128-D, the dimension of the final image representation is 128 times the vocabulary size. Due to the loss of sparsity in this representation, it is infeasible to plug it into the traditional image search framework with an inverted indexing structure [26], [32], [33]. To address this problem, PCA is applied to reduce the dimension of the aggregated vector, so that each image can be represented by a low-dimensional vector, for example 128-D, which is the same dimension as a local SIFT feature. As a compact global representation, it inherits the invariance property of local SIFT features and significantly reduces the memory cost. Such merits make it attractive for large-scale computer vision applications.

An improved version of VLAD is proposed in [28] with two additional pre- and post-processing operations, namely applying PCA to the SIFT descriptors before aggregation and performing power normalization afterwards. The power normalization suppresses the burstiness [34] of local features. Due to the nonlinearity of power normalization, its effect varies with different coordinate systems, as demonstrated in [30]. Hence [30] proposes a visual-word-based local coordinate system to enhance the effect of power normalization in the VLAD descriptor. In [28], it is also shown that the VLAD descriptor is a non-probabilistic Fisher vector [35]. The Fisher vector transforms a set of high-dimensional vectors into a single vector representation based on a generative model in the parameter space, and is suitable for image representation based on local features. In [36], multiple vocabularies and a signal whitening approach are used to suppress the co-occurrence and over-counting problems between visual words.

Feature centering is important in the VLAD representation [29]. As the visual vocabulary is trained on independent data, it is usually biased with respect to the dataset used in the experiments. To address this issue, [29] proposes a vocabulary adaptation approach in which the visual vocabulary is updated with the local features of the database. In [29] and [30], the impact of normalization is also well studied, including intra-normalization [29] and residue normalization [30], which improve the original VLAD descriptor. With normalization, the residual vectors share the same magnitude and their directions become the dominant factor in the discriminability of the generated VLAD descriptor.

III. HIERARCHICAL VLAD
In this section, we first review the original VLAD descriptor in detail and then introduce our algorithm for constructing the hierarchical VLAD from the perspective of uniforming the residual vector distribution. Our motivation is also discussed.

A. VLAD Review

Given an image with some local features, we can generate its VLAD descriptor [28] with a pre-trained visual vocabulary. First, a visual vocabulary with K words is constructed offline by k-means, with the following optimization formulation:

    \min_{\{c_i\}} \sum_{i=0}^{K-1} \sum_{y_j \in C_i} ||y_j - c_i||^2    (1)

with C_i = {y_j | ||y_j - c_i||^2 < ||y_j - c_{i'}||^2, ∀i' ≠ i}, where {y_j} is a set of randomly selected training samples. The visual vocabulary is denoted by {c_i | i = 0, 1, 2, ..., K-1}. With the vocabulary, the feature space is divided into Voronoi cells {C_i | i = 0, 1, 2, ..., K-1}, and the centroid of Voronoi cell C_i is c_i. An image with N local features is denoted by {x_n | n = 0, 1, 2, ..., N-1}. The basic idea of the VLAD descriptor is to accumulate the differences x_n - c_i over the features x_n belonging to Voronoi cell C_i, denoted as v_i:

    v_i = \sum_{x_n \in C_i} (x_n - c_i)    (2)
Then, all accumulated vectors are concatenated to generate the VLAD descriptor V = [v_i] with d × K dimensions, where d is the dimension of the local features. An L2-normalization operation is usually applied to the VLAD descriptor for distance comparison.

B. Hierarchical VLAD

Since the dimension of the full VLAD descriptor is d times the vocabulary size (d = 128 for SIFT features), it is difficult to perform the PCA analysis that generates the compact image representation if the full VLAD descriptors are constructed with a large visual vocabulary. As shown in [27], [28], [30], the vocabulary used to construct VLAD is usually no larger than 256 words. However, there are two benefits of building the VLAD descriptor with a larger vocabulary. First, a larger vocabulary has more discriminative power, since the feature space is divided into more Voronoi cells. Second, the local features are better centered, which alleviates the bias of a coarse vocabulary. The impact of feature centering is illustrated in Fig. 2. We therefore propose to construct sub-words for each visual word to obtain a finer feature space division and better feature centering. By aggregating the residual vectors from sub-words to visual words, we build a VLAD descriptor from a finer feature space division while keeping the same dimension as the original VLAD descriptor.

To obtain a finer division of the feature space based on a small visual vocabulary, we divide the Voronoi cell C_i into M sub-cells {C_i^m | m = 0, 1, 2, ..., M-1}; then Eq. 2 can be rewritten as

    v_i = \sum_{m=0}^{M-1} \sum_{x_n \in C_i^m} (x_n - c_i)    (3)
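For concreteness, the following is a minimal sketch (not the authors' released implementation) of the baseline VLAD construction in Eqs. 1 and 2, using k-means from scikit-learn for the vocabulary; the function names and array shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_vocabulary(train_descs, K=64, seed=0):
    """Eq. 1: learn K visual words {c_i} by k-means on training descriptors."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(train_descs)
    return km.cluster_centers_                          # shape (K, d)

def vlad(descs, centers):
    """Eq. 2: accumulate residuals x_n - c_i per word, concatenate, L2-normalize."""
    K, d = centers.shape
    # hard-assign each local descriptor to its nearest visual word
    assign = ((descs[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    v = np.zeros((K, d))
    for i in range(K):
        members = descs[assign == i]
        if len(members):
            v[i] = (members - centers[i]).sum(axis=0)   # Eq. 2
    v = v.reshape(-1)                                   # d*K-dimensional super-vector
    return v / (np.linalg.norm(v) + 1e-12)              # global L2 normalization
```

With d = 128 SIFT descriptors and K = 64 words, the output is an 8192-D vector before any PCA compression.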
Fig. 1. The solid polygon stands for a Voronoi cell and the dashed polygon stands for a sub-cell. The blue crosses stand for local features. (a) The purple circle stands for the centroid of the Voronoi cell, c_i. (b) The purple circle stands for the centroid of the sub-cell, c_i^m.
If we treat \sum_{x_n \in C_i^m} (x_n - c_i) in the above equation as a stochastic variable and denote it as ξ_i^m, Eq. 3 can be rewritten as

    v_i = \sum_{m=0}^{M-1} ξ_i^m    (4)
If the feature space is divided coarsely, the local features in each sub-cell C_i^m will not be well centered, since the anchor point c_i is not the centroid of the sub-cell C_i^m, which leads to a bias in the distribution of ξ_i^m. As shown in Fig. 1(a), for a given sub-cell C_i^m, the direction of the residual vector x_n - c_i is bounded, as illustrated by the green lines. By the rules of vector addition, the direction of ξ_i^m will also lie between the two green lines. Conversely, as shown in Fig. 1(b), if the features in C_i^m are well centered, ξ_i^m can point in any direction. If ξ_i^m is distributed more uniformly, it carries more information from the perspective of information theory, and it is reasonable to deduce that v_i will be more discriminative if ξ_i^m carries more information. To make ξ_i^m more discriminative, the centroid of the sub-cell C_i^m, denoted as c_i^m in Fig. 1(b), is chosen as the anchor point in Eq. 3 rather than c_i.

Based on the above analysis, we propose to construct a hidden-layer vocabulary {c_i^m} to build the VLAD descriptor. The new VLAD descriptor is denoted by HVLAD in the following. To construct the hidden-layer vocabulary, we generate M sub-words {c_i^m | m = 0, 1, 2, ..., M-1} for each c_i with k-means:

    \min_{\{c_i^m\}} \sum_{m=0}^{M-1} \sum_{y_j \in C_i \wedge y_j \in C_i^m} ||y_j - c_i^m||^2    (5)

with C_i^m = {y_j | y_j \in C_i \wedge ||y_j - c_i^m||^2 < ||y_j - c_i^{m'}||^2, ∀m' ≠ m}. These sub-words divide the Voronoi cell C_i into M sub-cells {C_i^m} and serve as the centroids of the sub-cells. Then we have

    v_i^m = \sum_{x_n \in C_i^m} (x_n - c_i^m)    (6)

where v_i^m plays the same role as ξ_i^m in Eq. 4, and Eq. 3 becomes

    v_i = \sum_{m=0}^{M-1} v_i^m    (7)
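The hidden-layer construction of Eqs. 5–7 can be sketched as follows. This is an illustrative reading of the scheme (a per-word sub-k-means, residuals taken against the sub-word centroids, and the sums folded back to the coarse word), not the authors' code; the helper names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_hidden_layer(train_descs, centers, M=256, seed=0):
    """Eq. 5: for each coarse word c_i, learn M sub-words from the training
    descriptors falling into its Voronoi cell C_i."""
    K, d = centers.shape
    assign = ((train_descs[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    sub_centers = np.zeros((K, M, d))
    for i in range(K):
        cell = train_descs[assign == i]                 # assumed to hold at least M samples
        km = KMeans(n_clusters=M, n_init=4, random_state=seed).fit(cell)
        sub_centers[i] = km.cluster_centers_
    return sub_centers

def hvlad(descs, centers, sub_centers):
    """Eqs. 6-7: aggregate residuals against the sub-word centroids c_i^m, then
    sum the M per-sub-word aggregates of each word, keeping the d*K dimension."""
    K, d = centers.shape
    assign = ((descs[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    v = np.zeros((K, d))
    for i in range(K):
        cell = descs[assign == i]
        if not len(cell):
            continue
        sub = sub_centers[i]                            # (M, d)
        sub_assign = ((cell[:, None, :] - sub[None, :, :]) ** 2).sum(-1).argmin(1)
        for m in np.unique(sub_assign):
            v[i] += (cell[sub_assign == m] - sub[m]).sum(axis=0)   # v_i^m summed as in Eq. 7
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)
```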
Fig. 2. The distributions of θ(ξ_i^m) and θ(v_i^m) when the SIFT descriptor is reduced to d dimensions by PCA; the panels correspond to d = 2, 4, 8, 32, 64, and 128. Note that as d increases, the angle between two randomly selected vectors tends to π/2, as shown in Fig. 3.
Fig. 3. The distribution of the angle between two randomly selected vectors in d-dimensional space, for d = 2, 4, 8, 32, 64, and 128.
The proposed HVLAD descriptor, V = [v_i], thus has the same dimension as the original VLAD descriptor in Eq. 2.

To validate our observation that the direction of v_i^m is distributed more uniformly than that of ξ_i^m, we conduct an experiment on the first 100K images of Flickr1M [33]. With a randomly generated direction p as the basis vector, we compute the angles between p and ξ_i^m and between p and v_i^m:

    θ(ξ_i^m) = arccos( (p · ξ_i^m) / \sqrt{||p||^2 ||ξ_i^m||^2} ),
    θ(v_i^m) = arccos( (p · v_i^m) / \sqrt{||p||^2 ||v_i^m||^2} )    (8)

The distributions of θ(ξ_i^m) and θ(v_i^m) are shown in Fig. 2. It can be seen that θ(v_i^m) is distributed more uniformly than θ(ξ_i^m) in all of the tested dimensions.
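This check can be reproduced along the following lines: draw a random direction p, compute the angles of Eq. 8 for a collection of aggregated residual vectors, and, as in Table I, measure the entropy of the resulting histogram. The bin count and the natural-log base are assumptions, since the paper does not specify them.

```python
import numpy as np

def angles_to_direction(p, vectors):
    """Eq. 8: angle between a fixed random direction p and each row of `vectors`
    (e.g. the per-sub-word aggregates xi_i^m or v_i^m)."""
    cos = vectors @ p / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(p) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))           # values in [0, pi]

def angle_entropy(theta, bins=64):
    """Entropy of the empirical angle distribution (cf. Table I); the bin count
    and logarithm base are assumed."""
    hist, _ = np.histogram(theta, bins=bins, range=(0.0, np.pi))
    prob = hist / max(hist.sum(), 1)
    prob = prob[prob > 0]
    return float(-(prob * np.log(prob)).sum())

# usage sketch: p = np.random.randn(d); theta = angles_to_direction(p, aggregates)
```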
TABLE I
THE ENTROPY OF THE DISTRIBUTIONS OF θ(v_i^m) AND θ(ξ_i^m) WITH d = 128 SIFT FEATURES.

K →           16                64                256
M →           64       256      64       256      64       256
θ(ξ_i^m)      3.0052   3.1450   3.2846   3.2371   3.3141   3.3204
θ(v_i^m)      3.5128   3.5322   3.5285   3.5181   3.5191   3.5181
There is a phenomenon that, as d increases, the bounds of both θ(ξ_i^m) and θ(v_i^m) become narrower. This is because in a higher-dimensional space the similarity between two randomly generated vectors becomes smaller; as shown in Fig. 3, the angle between two randomly generated vectors tends to converge to π/2 as d increases. In Table I, we show the entropy of the distributions of θ(v_i^m) and θ(ξ_i^m) with respect to K and M. It can be seen that θ(v_i^m) has larger entropy than θ(ξ_i^m).

To improve the discriminative power of HVLAD, we enhance it with the following processing.
1) Residue Normalization and Intra-Normalization: To make the residual vectors in Eq. 2 contribute equally to the final aggregation result, residue normalization is proposed in [30]. With residue normalization, Eq. 6 becomes

    v_i^m = \sum_{x_n \in C_i^m} (x_n - c_i^m) / \sqrt{||x_n - c_i^m||^2}    (9)

Intra-normalization is proposed in [29] to further improve the discriminability. With intra-normalization, Eq. 7 becomes

    v_i = ( \sum_{m=0}^{M-1} v_i^m ) / \sqrt{ || \sum_{m=0}^{M-1} v_i^m ||^2 }    (10)

In the following, Eq. 9 and Eq. 10 are used to construct our basic HVLAD descriptor.

2) Multiple Assignment in Hierarchical VLAD: When local features are quantized to visual words in a hard manner, quantization error (i.e., similar features being quantized to different words) is unavoidable. As an effective alternative, a multiple assignment strategy can be adopted to address this issue [32], [34], in which each local feature is quantized to several visual words. In the following, we describe how to integrate multiple assignment [34] into our hierarchical VLAD scheme. Considering that a vocabulary of size no larger than 256 already divides the feature space coarsely, multiple assignment is only performed on the sub-words. With the multiple assignment strategy, Eq. 9 can be rewritten as

    v_i^m = \sum_{C_i^m \in NN_γ^2(x_n)} (x_n - c_i^m) / \sqrt{||x_n - c_i^m||^2}    (11)

where NN_γ^2(x_n) denotes the γ nearest sub-cells of x_n (γ ≤ M). With the multiple assignment strategy, our HVLAD descriptor is then constructed by Eq. 11 and Eq. 10.
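A sketch of the per-word aggregation with residue normalization, multiple assignment, and intra-normalization (Eqs. 9–11 together with Eq. 10) is given below; as before, this is an illustrative rendering under assumed variable names rather than the authors' implementation.

```python
import numpy as np

def hvlad_ma(descs, centers, sub_centers, gamma=8):
    """Eqs. 9-11 + Eq. 10: L2-normalized residuals to the gamma nearest sub-words
    of each descriptor (multiple assignment), then per-word intra-normalization."""
    K, d = centers.shape
    assign = ((descs[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    v = np.zeros((K, d))
    for i in range(K):
        cell = descs[assign == i]
        if not len(cell):
            continue
        sub = sub_centers[i]                                        # (M, d)
        dist = ((cell[:, None, :] - sub[None, :, :]) ** 2).sum(-1)  # (n_i, M)
        nearest = np.argsort(dist, axis=1)[:, :gamma]               # gamma nearest sub-cells
        for x, subs in zip(cell, nearest):
            for m in subs:
                r = x - sub[m]
                v[i] += r / (np.linalg.norm(r) + 1e-12)             # Eq. 9 / Eq. 11
        v[i] /= (np.linalg.norm(v[i]) + 1e-12)                      # intra-normalization, Eq. 10
    return v.reshape(-1)                                            # a global L2 normalization may follow
```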
3) Local Descriptor Whitening in Hierarchical VLAD: The discriminability of each dimension of a local descriptor is different, which is reflected by the different variances of the dimensions. This affects the discriminability of the final image descriptor after the local descriptors are aggregated. In this section, we propose to perform a whitening operation that balances the per-dimension discriminability of the local descriptors and thereby improves the representability of the image descriptor. A whitening operation is also adopted in [36], but there it is used jointly with dimensionality reduction to reduce the co-occurrence over-counting problem; here we use whitening to balance the discriminability of each dimension of the local descriptor, which is also demonstrated to be effective in a recently published work [37]. As the local descriptors have already been assigned to different words in the quantization step, the whitening operation is performed on the descriptors belonging to each visual word. Then Eq. 11 becomes

    v_i^m = \sum_{C_i^m \in NN_γ^2(x_n)} ( [1/\sqrt{λ_i}] · R_i^d · (x_n - c_i^m) ) / \sqrt{ || [1/\sqrt{λ_i}] · R_i^d · (x_n - c_i^m) ||^2 }    (12)

where R_i^d is a matrix whose rows are the d eigenvectors with the largest eigenvalues, learned by PCA over all the features in the corresponding word cell, and [1/\sqrt{λ_i}] is a diagonal matrix consisting of the reciprocal square roots of the eigenvalues associated with R_i^d. From the above equation, it can be seen that the variance of each dimension of the SIFT descriptors is balanced before aggregation. With the local descriptor whitening operation integrated, our HVLAD descriptor is constructed by Eq. 12 and Eq. 10.
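The per-word whitening of Eq. 12 can be sketched as below: the rotation R_i^d and eigenvalues λ_i are estimated by PCA over the training descriptors assigned to word i, and each residual is whitened and L2-normalized before aggregation. The eigendecomposition details and the small regularizing constant are assumptions.

```python
import numpy as np

def learn_word_whitening(cell_descs, d_keep=128):
    """Per-word PCA: returns (R_i^d, lambda_i), the top-d eigenvectors (as rows)
    and eigenvalues of the covariance of the descriptors assigned to one word."""
    X = cell_descs - cell_descs.mean(axis=0)
    cov = X.T @ X / max(len(X) - 1, 1)
    eigval, eigvec = np.linalg.eigh(cov)                 # ascending eigenvalues
    order = np.argsort(eigval)[::-1][:d_keep]
    return eigvec[:, order].T, eigval[order]             # R: (d_keep, d), lambda: (d_keep,)

def whitened_residual(x, c_im, R, lam, eps=1e-8):
    """One term of Eq. 12: rotate, scale by 1/sqrt(lambda), then L2-normalize."""
    z = (R @ (x - c_im)) / np.sqrt(lam + eps)
    return z / (np.linalg.norm(z) + 1e-12)
```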
C. Generating a Compact Representation

To obtain a compact representation, standard principal component analysis (PCA) is usually adopted to project the full image descriptor constructed above onto a lower-dimensional space. In addition, to suppress burstiness (i.e., a single pattern appearing many times in an image), the power normalization operation is usually adopted [28]. Since burstiness is suppressed more effectively after the correlations are reduced [30], the power normalization is performed in the low-dimensional space:

    T = R_V^D · (V - \bar{V}),    V^D = sign(T) · |T|^α    (13)

The L2-normalized V^D is the final image representation with D dimensions. R_V^D consists of the first D eigenvectors corresponding to the D largest eigenvalues obtained by the PCA analysis, and \bar{V} is the mean of the samples used in the PCA analysis. sign(·) and |·|^α are applied to each component of T. The impact of α is studied in Section IV-D.

IV. EXPERIMENTAL RESULTS

In this section, we evaluate our approach and compare it with recent related algorithms based on VLAD. We adopt three public datasets, i.e., Holidays [33], UKBench [6], and
Fig. 4. The impact of the number of sub-words per visual word, M, in Eq. 7 on Holidays, for vocabulary sizes K = 16, 32, 64, 128, and 256.
Oxford Building [26], for evaluation.

The Holidays dataset contains 1491 high-resolution personal holiday photos with 500 queries. The SIFT descriptors of the Holidays images are publicly available. A training dataset, Flickr60K, is also publicly released with Holidays; it contains 60K images downloaded from Flickr and is used as an independent dataset to train the visual vocabulary offline. The retrieval performance is measured in terms of mean average precision (mAP) [26].

The UKBench dataset contains 2550 objects or scenes, each with four images taken under different views or imaging conditions, resulting in 10200 images in total. The top-4 accuracy [6] is used as the evaluation metric: for each query, the retrieval accuracy is measured by counting the number of correct images among the top-4 returned results, and the performance is averaged over all test queries. We use the SIFT descriptors of the UKBench images provided by Jégou.

The Oxford Building dataset [26] consists of 5062 images of buildings and 55 query images corresponding to 11 distinct buildings in Oxford. The retrieval performance is measured in terms of mAP. For this dataset, the vocabulary is usually learned from the Paris6K dataset [29], [30] with the SIFT descriptors of [38]; we also adopt this setting.

For the large-scale experiments, we download the pre-computed local features of the Flickr1M database [27], [33] as distractors; these are Hessian-Affine SIFT descriptors [31], the same type as those of Holidays and UKBench [27]. Our experiments are implemented on a server with 32 GB memory and a 2.4 GHz Intel Xeon CPU.
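For reference, the two evaluation measures can be computed as in the following sketch; the input format (a ranked list of database ids per query plus the set of relevant ids, with the query itself excluded from the ranking as in the Holidays protocol) is an assumption.

```python
import numpy as np

def average_precision(ranked_ids, relevant):
    """AP of one query for a full ranking of the database (mAP = mean over queries)."""
    hits, ap = 0, 0.0
    for rank, idx in enumerate(ranked_ids, start=1):
        if idx in relevant:
            hits += 1
            ap += hits / rank
    return ap / max(len(relevant), 1)

def mean_average_precision(rankings, relevants):
    return float(np.mean([average_precision(r, s) for r, s in zip(rankings, relevants)]))

def top4_score(rankings, relevants):
    """UKBench metric: number of correct images among the top-4 results,
    averaged over all queries (maximum value 4)."""
    return float(np.mean([len(set(r[:4]) & set(s)) for r, s in zip(rankings, relevants)]))
```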
A. The Performance of Hierarchical VLAD

In this section, we illustrate the impact of the number of sub-centers, namely the parameter M in Eq. 10. Five vocabularies with different sizes K are tested: 16, 32, 64, 128, and 256. The results are shown in Fig. 4. Note that when M = 1 our HVLAD descriptor degrades to the VLAD descriptor, which means that the feature space of each visual word is not further divided into sub-spaces. It can be seen that the best accuracy is achieved with M greater than 1 for all the tested vocabularies. This is because the local features can be better centered in the sub-spaces, resulting in more uniformly distributed residual vectors, as analyzed in Fig. 2 and Table I of Section III. For example, when K = 256 on Holidays, the VLAD descriptor obtains 0.6035 mAP (M = 1), while we obtain 0.671 mAP (improved by 11.9%) with M = 512. With K = 256 and M = 512, the feature space is divided into 256 × 512 sub-cells, a 512-times finer division than the original VLAD strategy with a vocabulary of 256 words. More detailed comparisons are given after multiple assignment is integrated to alleviate the quantization error resulting from hard quantization.

B. The Performance of Multiple Assignment

In this section, we illustrate the performance of the HVLAD descriptor after the multiple assignment strategy is integrated in Eq. 11. The impact of multiple assignment is controlled by the parameter γ. In Table II and Table III, we report the accuracy with respect to the vocabulary size K, the number of sub-words M, and the multiple assignment parameter γ. Note that when M = 1 and γ = 1, our HVLAD is identical to the VLAD descriptor, i.e., it gives the baseline accuracy. With a larger M, the feature space is divided more finely, but similar features are more easily quantized to different sub-words (quantization error), which limits the benefit of the finer division. For example, in Table III, when K = 256 and γ = 1, the top-4 accuracy decreases from 3.5117 to 3.4940 as M increases from 8 to 256. However, increasing γ alleviates the quantization error: in Table III, the top-4 accuracy increases from 3.4940 to 3.5415 as γ increases from 1 to 8. As γ increases further, the top-4 accuracy decreases again, because the discriminative power of the finer feature space division is suppressed.

From Table II, the baseline accuracy on Holidays is improved from 0.5265, 0.5747, and 0.6035 mAP to 0.5943 (improved by 12.9%), 0.6405 (improved by 11.4%), and 0.6754 (improved by 11.9%) mAP for vocabulary sizes 16, 64, and 256, respectively. Similarly, from Table III, the baseline accuracy on UKBench is improved from 3.2237, 3.3943, and 3.5008 to 3.3444 (improved by 3.7%), 3.4590 (improved by 1.9%), and 3.5415 (improved by 1.2%) in terms of the top-4 score for vocabulary sizes 16, 64, and 256, respectively. Our HVLAD descriptor thus demonstrates consistent gains over the baseline. In the following, we set M = 256 and γ = 8; as Table II and Table III show, good accuracy is obtained with this setting on both Holidays and UKBench.

C. The Performance of Whitened SIFT

In this section, we illustrate the impact of the whitening operation in Eq. 12. Table IV reports the performance of our HVLAD after the whitening operation is integrated, with the SIFT descriptor reduced to different dimensions d.
TABLE II
THE mAP PERFORMANCE OF OUR HVLAD DESCRIPTOR WITH THE MULTIPLE ASSIGNMENT STRATEGY (EQ. 11 AND EQ. 10) ON THE HOLIDAYS DATASET. "-" INDICATES γ > M (NOT APPLICABLE).

K = 16
γ \ M     1        8        16       64       256
1         0.5265   0.5648   0.5740   0.5706   0.5713
4         -        0.5552   0.5693   0.5840   0.5941
8         -        0.5226   0.5582   0.5843   0.5943
16        -        -        0.5271   0.5755   0.5864
30        -        -        -        0.5608   0.5875
50        -        -        -        0.5426   0.5819
80        -        -        -        -        0.5732
120       -        -        -        -        0.5646
160       -        -        -        -        0.5527
200       -        -        -        -        0.5421

K = 64
γ \ M     1        8        16       64       256
1         0.5747   0.6110   0.6069   0.6034   0.6081
4         -        0.6015   0.6086   0.6248   0.6276
8         -        0.5678   0.6018   0.6267   0.6388
16        -        -        0.5684   0.6189   0.6405
30        -        -        -        0.6090   0.6345
50        -        -        -        0.5996   0.6299
80        -        -        -        -        0.6213
120       -        -        -        -        0.6140
160       -        -        -        -        0.6106
200       -        -        -        -        0.6062

K = 256
γ \ M     1        8        16       64       256
1         0.6035   0.6362   0.6401   0.6482   0.6616
4         -        0.6228   0.6449   0.6552   0.6737
8         -        0.5991   0.6292   0.6521   0.6754
16        -        -        0.6018   0.6461   0.6717
30        -        -        -        0.6230   0.6704
50        -        -        -        0.6482   0.6726
80        -        -        -        -        0.6663
120       -        -        -        -        0.6556
160       -        -        -        -        0.6466
200       -        -        -        -        0.6346
TABLE III
THE TOP-4 PERFORMANCE OF OUR HVLAD DESCRIPTOR WITH THE MULTIPLE ASSIGNMENT STRATEGY (EQ. 11 AND EQ. 10) ON THE UKBENCH DATASET. "-" INDICATES γ > M (NOT APPLICABLE).

K = 16
γ \ M     1        8        16       64       256
1         3.2237   3.3066   3.3117   3.3025   3.2822
4         -        3.2949   3.3224   3.3383   3.3353
8         -        3.2326   3.2982   3.3339   3.3444
16        -        -        3.2342   3.3194   3.3439
30        -        -        -        3.2956   3.3341
50        -        -        -        3.2619   3.3216
80        -        -        -        -        3.3067
120       -        -        -        -        3.2865
160       -        -        -        -        3.2719
200       -        -        -        -        3.2547

K = 64
γ \ M     1        8        16       64       256
1         3.3943   3.4362   3.4284   3.4188   3.4114
4         -        3.4315   3.4448   3.4538   3.4535
8         -        3.3974   3.4310   3.4544   3.4583
16        -        -        3.3956   3.4446   3.4590
30        -        -        -        3.4283   3.4524
50        -        -        -        3.4123   3.4431
80        -        -        -        -        3.4328
120       -        -        -        -        3.4227
160       -        -        -        -        3.4159
200       -        -        -        -        3.4099

K = 256
γ \ M     1        8        16       64       256
1         3.5008   3.5117   3.5106   3.5040   3.4940
4         -        3.5172   3.5248   3.5347   3.5401
8         -        3.4993   3.5179   3.5351   3.5415
16        -        -        3.4993   3.5303   3.5383
30        -        -        -        3.5210   3.5344
50        -        -        -        3.5104   3.5296
80        -        -        -        -        3.5233
120       -        -        -        -        3.5198
160       -        -        -        -        3.5157
200       -        -        -        -        3.5107
TABLE IV
THE PERFORMANCE OF LOCAL DESCRIPTOR WHITENING (EQ. 12) ON HOLIDAYS AND UKBENCH.

d →               32       64       96       128
Holidays (mAP)    0.6606   0.6984   0.7153   0.7207
UKBench (top-4)   3.4926   3.5458   3.5564   3.5577
TABLE V
THE FINAL COMPARISON OF OUR HVLAD AND VLAD ON HOLIDAYS, UKBENCH AND OXFORD BUILDING.

                            K = 16   K = 64   K = 256
Holidays (mAP)     VLAD     0.5265   0.5747   0.6035
                   HVLAD    0.6235   0.6657   0.7207
UKBench (top-4)    VLAD     3.2237   3.3943   3.5008
                   HVLAD    3.3929   3.4828   3.5577
Oxford Building (mAP)  VLAD     0.4123   0.5132   0.5741
                   HVLAD    0.4862   0.5762   0.6378
It can be seen that higher accuracy is obtained with a higher-dimensional SIFT descriptor, and that balancing the dimensional variance of the SIFT descriptor with the whitening operation further improves the retrieval accuracy. As illustrated in Tables II, III, and IV, when d = 128 the accuracy is improved from 0.6754 to 0.7207 on Holidays and from 3.5415 to 3.5577 on UKBench.

The final comparison of our HVLAD descriptor with the baseline (VLAD descriptor) is given in Table V. The VLAD descriptor is improved by about 18% in mAP on Holidays, about 3% in the top-4 score on UKBench, and about 12% in mAP on Oxford Building. Two query examples are shown in Fig. 5. It can be seen that images with cluttered backgrounds are easily retrieved as false positives.
TABLE VI
THE COMPARISON BETWEEN THE PROPOSED HVLAD DESCRIPTOR AND SEVERAL RECENT VLAD VARIANTS.

Method                          K        D        Holidays (mAP)   UKBench (top-4)   Oxford Building (mAP)
BoVW [28]                       20000    20000    0.437            2.87              0.354
BoVW [28]                       200000   200000   0.54             2.81              N/A
Fisher [28]                     16       1024     0.54             N/A               N/A
VLAD improved [28]              16       1024     0.52             N/A               N/A
VLAD [27]                       16       2048     0.496            3.07              N/A
Fisher [28]                     64       4096     0.595            3.35              0.418
VLAD improved [28]              64       4096     0.556            3.28              0.378
VLAD [27]                       64       8192     0.526            3.17              N/A
VLAD Intra [29], [30]           64       8192     0.565            N/A               0.448
VLAD LCS+RN [30]                64       8192     0.658            N/A               0.517
Fisher [28]                     256      16384    0.625            N/A               N/A
VLAD improved [28]              256      16384    0.587            N/A               N/A
VLAD Intra+Adapt [29]           256      32768    0.646            N/A               0.555
HVLAD                           16       2048     0.624            3.39              0.486
HVLAD                           64       8192     0.666            3.48              0.576
HVLAD                           256      32768    0.721            3.56              0.638
The SIFT features extracted from such cluttered backgrounds are less distinctive, which negatively impacts the final aggregated image representations. To alleviate this problem, the discriminability of the aggregation model needs to be improved. With the improved discriminative power of the residual vectors in the proposed HVLAD algorithm, the impact of these cluttered-background images is effectively suppressed.
D. The Performance of Compact Image Representation

In this section, we demonstrate the performance after our HVLAD descriptor is reduced to a lower-dimensional space with Eq. 13.
Fig. 5. Two query examples, (a) and (b), for different methods (BoVW, VLAD, Fisher, and HVLAD). Only a few top search results are shown here.
TABLE VII
THE COMPARISONS WHEN THE DESCRIPTORS MENTIONED ABOVE ARE PROJECTED ONTO THE 128-D SUBSPACE. OUR 128-D DESCRIPTOR IS GENERATED WITH K = 256.

Method                             D      Holidays (mAP)   UKBench (top-4)
BoVW [28]                          128    0.452            2.95
VLAD [27]                          128    0.51             3.15
VLAD improved [28]                 128    0.557            3.35
Fisher [28]                        128    0.565            3.33
Multivoc-VLAD [36]                 128    0.614            3.36
VLAD Intra+Adapt+Multivoc [29]     128    0.625            N/A
HVLAD                              128    0.64             3.4
To obtain the compact image representation, as shown in Eq. 13, PCA together with power normalization is adopted. We train the PCA matrix R_V^D with the pre-computed local features of the first 100K images of the Flickr1M dataset [28]. The accuracy with different α and D in Eq. 13 under different vocabulary sizes K is shown in Fig. 6. It can be seen that, for all the tested vocabulary sizes, the accuracy is better when more dimensions are used to represent the image. The power normalization is a nonlinear transformation: in Eq. 13, when α = 0 the elements of V^D are discretized to {-1, 0, 1}; when α = 1, V^D is unchanged; and when 0 < α < 1, the values of V^D are stabilized. Since V^D has been decorrelated by PCA, repeatedly occurring patterns and co-occurrences can be effectively suppressed. For example, when K = 256 and D = 512, the mAP on Holidays increases from 0.635 to 0.68 as α decreases from 1.0 to 0.6. Considering the performance on both Holidays and UKBench, α = 0.6 is a good choice.
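A sketch of this compression step (Eq. 13) is given below: learn R_V^D and the mean from training HVLAD vectors, project, apply the power normalization componentwise, and L2-normalize; the function names are illustrative.

```python
import numpy as np

def learn_projection(train_vectors, D=128):
    """Learn R_V^D (top-D eigenvectors, as rows) and the mean from training
    full-dimensional HVLAD vectors (e.g. those of the first 100K Flickr1M images)."""
    mean = train_vectors.mean(axis=0)
    X = train_vectors - mean
    cov = X.T @ X / max(len(X) - 1, 1)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:D]
    return eigvec[:, order].T, mean                      # shapes (D, d*K) and (d*K,)

def compact_descriptor(V, R, mean, alpha=0.6):
    """Eq. 13: project, power-normalize componentwise, then L2-normalize."""
    T = R @ (V - mean)
    VD = np.sign(T) * np.abs(T) ** alpha
    return VD / (np.linalg.norm(VD) + 1e-12)
```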
Fig. 6. The impact of α for different dimensions D of the final compact image representation, with (a) K = 16, (b) K = 64, and (c) K = 256 (curves: D = 64, 128, 256, 512). The top-row figures are the results on Holidays and the bottom-row figures are the results on UKBench.
E. Comparisons

In this section, we present comparison results between the proposed HVLAD descriptor and several recent variants of the VLAD descriptor. As shown in Table VI, our HVLAD descriptor obtains the best retrieval accuracy among the compared algorithms. In Table VII, we compare the results when the descriptors are reduced to 128-D, with the PCA parameters learned from the pre-computed SIFT descriptors of the Flickr1M dataset.

To demonstrate the scalability of our approach for image search, we mix the Holidays dataset with the Flickr1M database and follow the settings in [29], [36]. The HVLAD descriptor is projected onto the 128-D subspace and exhaustive nearest neighbor search is performed to find the relevant images. For Holidays with Flickr1M, HVLAD obtains 0.43 mAP, while variants of VLAD obtain 0.335 mAP (VLAD*) [30], 0.370 mAP (Multivoc-VLAD) [29], 0.378 mAP (Intra+Adapt+Multivoc-VLAD) [29], and 0.392 mAP (LCS+RN-VLAD) [30].

Video Search. To evaluate the performance of our approach for video search, we download the CC Web Video dataset [39], which contains 13129 videos and 24 queries. For each video, we extract SIFT descriptors with [33] on its frames and construct image descriptors for these frames; the video descriptor is obtained by averaging its frame descriptors. We obtain 0.883 mAP with the HVLAD algorithm and 0.855 mAP with the VLAD algorithm.
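A sketch of this video representation and of the exhaustive nearest-neighbor search used above is given below; the array shapes and function names are assumptions.

```python
import numpy as np

def video_descriptor(frame_descriptors):
    """Average the compact per-frame descriptors (shape (num_frames, D)) and
    re-normalize to obtain a single video-level descriptor."""
    v = np.asarray(frame_descriptors).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def exhaustive_search(query_vec, database_vecs, topk=10):
    """Exhaustive nearest-neighbor search over L2-normalized descriptors:
    a larger dot product corresponds to a smaller Euclidean distance."""
    scores = database_vecs @ query_vec
    return np.argsort(-scores)[:topk]
```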
V. CONCLUSIONS

By aggregating residual vectors, the VLAD descriptor achieves a very efficient representation of images. However, the distribution of the residual vectors impacts the discriminability of the constructed VLAD descriptor. In this paper, by generating sub-words for each visual word, we obtain better feature centering, resulting in a more uniform distribution of the residual vectors. With a more uniform residual distribution, we obtain a more discriminative image representation. In the future, we will explore building mid-level image features with the HVLAD algorithm, which have been shown to be very promising by many works [40], [41], for large-scale vision applications such as image classification [42], image annotation, and video representation [43], [44].

VI. ACKNOWLEDGMENT

This work was supported in part to Prof. Houqiang Li by the 973 Program under contract No. 2015CB351803 and NSFC under contracts No. 61325009 and No. 61390514; in part to Dr. Wengang Zhou by NSFC under contract No. 61472378 and the Fundamental Research Funds for the Central Universities under contracts No. WK2100060014 and WK2100060011; in part to Prof. Ting Rui by NSFC under Grants 61472444 and 61472392; and in part to Prof. Qi Tian by ARO grant W911NF-12-1-0057 and Faculty Research Awards by NEC Laboratories of America, respectively. This work was also supported in part by NSFC under contract No. 61429201.

REFERENCES

[1] J. Sivic and A. Zisserman, "Video google: A text retrieval approach to object matching in videos," Proceedings of the IEEE International Conference on Computer Vision, pp. 1470–1477, 2003.
[2] Z. Lu, X. Yang, W. Lin, H. Zha, and X. Chen, "Inferring user image-search goals under the implicit guidance of users," Proceedings of IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 3, pp. 394–406, 2014.
[3] M. Kan, D. Xu, S. Shan, and X. Chen, "Semi-supervised hashing via kernel hyperplane learning for scalable image search," Proceedings of IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 4, pp. 704–713, 2014.
[4] B. C. Song, M. J. Kim, and J. B. Ra, "A fast multiresolution feature matching algorithm for exhaustive search in large image databases," Proceedings of IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 5, pp. 673–678, 2001.
[5] K.-H. Yap and K. Wu, “A soft relevance framework in content-based image retrieval systems,” Proceedings of IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 12, pp. 1557–1568, 2005. [6] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2161–2168, 2006. [7] R. Arandjelovic and A. Zisserman, “Three things everyone should know to improve object retrieval,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2911–2918, 2012. [8] G. Tolias, Y. Avrithis, H. J´egou et al., “To aggregate or not to aggregate: Selective match kernels for image search,” in Proceedings of the International Conference on Computer Vision, 2013. [9] W.-L. Zhao, H. J´egou, and G. Gravier, “Sim-min-hash: An efficient matching technique for linking large image collections,” in Proceedings of the 21st ACM international conference on Multimedia, 2013, pp. 577– 580. [10] L. Xie, Q. Tian, W. Zhou, and B. Zhang, “Fast and accurate nearduplicate image search with affinity propagation on the imageweb,” Computer Vision and Image Understanding, vol. 124, pp. 31–41, 2014. [11] L. Zheng, S. Wang, and Q. Tian, “Coupled binary embedding for largescale image retrieval,” Proceedings of IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3368–3380, 2014. [12] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, “Semantic-aware coindexing for image retrieval,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1673–1680. [13] W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian, “Towards codebook-free: Scalable cascaded hashing for mobile image search,” Proceedings of IEEE Transaction on Multimedia. [14] Z. Liu, H. Li, W. Zhou, and Q. Tian, “Embedding spatial context information into inverted file for large-scale image retrieval,” Proceedings of the ACM international conference on Multimedia, pp. 199–208, 2012. [15] Z. Liu, H. Li, L. Zhang, W. Zhou, and Q. Tian, “Cross-indexing of binary SIFT codes for large-scale image search,” IEEE Transaction in Image Processing, vol. 23, no. 5, 2014. [16] Z. Liu, H. Li, W. Zhou, R. Zhao, and Q. Tian, “Contextual hashing for large-scale image search,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1606–1614, 2014. [17] X. Yang, X. Gao, D. Tao, and X. Li, “Improving level set method for fast auroral oval segmentation,” IEEE Transactions on Image Processing, vol. 23, no. 7, pp. 2854–2865, 2014. [18] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, “Spatial coding for large scale partial-duplicate web image search,” Proceedings of the ACM international conference on Multimedia, pp. 511–520, 2010. [19] W. Zhou, H. Li, R. Hong, Y. Lu, and Q. Tian, “Bsift: towards dataindependent codebook for large scale image search,” 2015. [20] W. Zhou, Q. Tian, Y. Lu, L. Yang, and H. Li, “Latent visual context learning for web image applications,” Pattern Recognition, vol. 44, no. 10, pp. 2263–2273, 2011. [21] X. Yang, X. Gao, D. Tao, X. Li, and J. Li, “An efficient mrf embedded level set method for image segmentation,” IEEE transactions on image processing, vol. 24, no. 1, pp. 9–21, 2015. [22] S. Zhang, Q. Tian, K. Lu, Q. Huang, and W. Gao, “Edge-sift: Discriminative binary descriptor for scalable partial-duplicate mobile search,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2889–2902, 2013. [23] L. Zheng and S. 
Wang, “Visual phraselet: Refining spatial constraints for large scale image search,” IEEE Signal Processing Letters, vol. 20, no. 4, pp. 391–394, 2013. [24] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” Proceedings of the European Conference on Computer Vision, pp. 404–417, 2006. [25] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004. [26] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, 2007. [27] H. J´egou, M. Douze, C. Schmid, and P. P´erez, “Aggregating local descriptors into a compact image representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3304–3311. [28] H. J´egou, F. Perronnin, M. Douze, C. Schmid et al., “Aggregating local image descriptors into compact codes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, 2012. [29] R. Arandjelovic and A. Zisserman, “All about vlad,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[30] J. Delhumeau, P.-H. Gosselin, H. J´egou, P. P´erez et al., “Revisiting the vlad image representation,” in Proceedings of the ACM International Conference on Multimedia, 2013. [31] K. Mikolajczyk and C. Schmid, “An affine invariant interest point detector,” Proceedings of the European Conference on Computer Vision, pp. 128–142, 2002. [32] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8. [33] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” Proceedings of the European Conference on Computer Vision, pp. 304–317, 2008. [34] H. J´egou, M. Douze, and C. Schmid, “On the burstiness of visual elements,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1169–1176, 2009. [35] F. Perronnin, Y. Liu, J. S´anchez, and H. Poirier, “Large-scale image retrieval with compressed fisher vectors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3384–3391. [36] H. J´egou and O. Chum, “Negative evidences and co-occurences in image retrieval: The benefit of pca and whitening,” in Proceedings of the European Conference on Computer Vision, 2012, pp. 774–787. [37] H. J´egou and A. Zisserman, “Triangulation embedding and democratic aggregation for image search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Apr. 2014. [38] M. Perd’och, O. Chum, and J. Matas, “Efficient representation of local geometry for large scale object retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 9– 16. [39] X. Wu, C.-W. Ngo, A. G. Hauptmann, and H.-K. Tan, “Real-time nearduplicate elimination for web video search with content and context,” Proceedings of IEEE Transactions on Multimedia, vol. 11, no. 2, pp. 196–207, 2009. [40] M. Juneja, A. Vedaldi, C. Jawahar, and A. Zisserman, “Blocks that shout: Distinctive parts for scene classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 923– 930. [41] A. Jain, A. Gupta, M. Rodriguez, and L. S. Davis, “Representing videos using mid-level discriminative patches,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2571–2578. [42] L. Xie, Q. Tian, M. Wang, and Z. Bo, “Spatial pooling of heterogeneous features for image classification,” Proceedings of IEEE Transactions on Image Processing, vol. 23, no. 5, pp. 1994–2008, 2013. [43] M. Wang, R. Hong, X.-T. Yuan, S. Yan, and T.-S. Chua, “Movie2comics: Towards a lively video content presentation,” Proceedings of IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 858–870, 2012. [44] R. Hong, M. Wang, Y. Gao, D. Tao, X. Li, and X. Wu, “Image annotation by multiple-instance learning with discriminative feature mapping and selection,” Proceedings of IEEE Transactions on Cybernetics, vol. 44, no. 5, pp. 669–680, 2014.
Zhen Liu received his B.E. degree in Electronic Information Engineering from Dept. of Electronic Engineering and Information Science (EEIS) at University of Science and Technology of China (USTC), Hefei, P.R.China, in 2010. He is currently working toward a Ph.D. degree in Signal and Information Processing at the same university. His research interests include image/video processing, multimedia information retrieval and computer vision.
Houqiang Li (S12) received the B.S., M.Eng., and Ph.D. degree from University of Science and Technology of China (USTC) in 1992, 1997, and 2000, respectively, all in electronic engineering. He is currently a professor at the Department of Electronic Engineering and Information Science (EEIS), USTC. His research interests include multimedia search, image/video analysis, video coding and communication, etc. He has authored or co-authored over 100 papers in journals and conferences. He served as Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology from 2010 to 2013, and has been in the Editorial Board of Journal of Multimedia since 2009. He has served on technical/program committees, organizing committees, and as program cochair, track/session chair for over 10 international conferences. He was the recipient of the Best Paper Award for Visual Communications and Image Processing (VCIP) in 2012, the recipient of the Best Paper Award for International Conference on Internet Multimedia Computing and Service (ICIMCS) in 2012, the recipient of the Best Paper Award for the International Conference on Mobile and Ubiquitous Multimedia from ACM (ACM MUM) in 2011, and a senior author of the Best Student Paper of the 5th International Mobile Multimedia Communications Conference (MobiMedia) in 2009.
Qi Tian (M96-SM03) received the B.E. degree in electronic engineering from Tsinghua University, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University in 1996 and the Ph.D. degree in electrical and computer engineering from the University of Illinois, UrbanaChampaign in 2002. He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008-2009. Dr. Tian’s research interests include multimedia information retrieval and computer vision. He has published over 230 refereed journal and conference papers. His research projects were funded by NSF, ARO, DHS, SALSI, CIAS, and UTSA and he also received faculty research awards from Google, NEC Laboratories of America, FXPAL, Akiira Media Systems, and HP Labs. He received the Best Paper Awards in PCM 2013, MMM 2013 and ICIMCS 2012, the Top 10% Paper Award in MMSP 2011, the Best Student Paper in ICASSP 2006, and the Best Paper Candidate in PCM 2007. He received 2010 ACM Service Award. He is the Guest Editors of IEEE Transactions on Multimedia, Journal of Computer Vision and Image Understanding, Pattern Recognition Letter, EURASIP Journal on Advances in Signal Processing, Journal of Visual Communication and Image Representation, and is in the Editorial Board of IEEE Transactions on Multimedia (TMM), and IEEE Transactions on Circuit and Systems for Video Technology (TCSVT), Multimedia Systems Journal, Journal of Multimedia(JMM) and Journal of Machine Visions and Applications (MVA).
Ting Rui received the M.S. and Ph.D. degrees from the PLA University of Science and Technology, Nanjing, China, in 1998 and 2001, respectively. He is a Professor in the Information Technology Department of the PLA University of Science and Technology. His research mainly involves computer vision, machine learning, multimedia, and video surveillance. He has authored and co-authored more than 80 scientific articles.
Wengang Zhou received the B.E. degree in electronic information engineering from Wuhan University, China, in 2006, and the Ph.D. degree in electronic engineering and information science from the University of Science and Technology of China in 2011. He was a research intern in the Internet Media Group at Microsoft Research Asia from December 2008 to August 2009. From September 2011 to 2013, he worked as a post-doctoral researcher in the Computer Science Department at the University of Texas at San Antonio. He is currently an associate professor at the Department of Electronic Engineering and Information Science, University of Science and Technology of China. His research interest is mainly focused on multimedia information retrieval. He received the best paper award at ACM ICIMCS 2012.