Multimedia Systems (2016) 22:127–136 DOI 10.1007/s00530-014-0406-9
SPECIAL ISSUE PAPER
BHoG: binary descriptor for sketch-based image retrieval

Haiyan Fu · Hanguang Zhao · Xiangwei Kong · Xianbo Zhang
Published online: 9 August 2014
© Springer-Verlag Berlin Heidelberg 2014
Abstract Due to the popularity of devices with touch screens, it is convenient to match images with a hand-drawn sketch query. However, existing methods usually care little about memory space and time efficiency, and are thus inadequate for the rapid growth of multimedia resources. In this paper, a BHoG descriptor is proposed for sketch-based image retrieval. Firstly, the boundary image is detected from the natural image using the Berkeley boundary detector and then divided into many blocks. Secondly, we calculate the gradient feature of each block and find the principal gradient orientation. Finally, the principal gradient orientation is encoded into binary codes, which is shown to be efficient and discriminative. We evaluated the performance of BHoG on a large-scale social media dataset. The experimental results show that BHoG not only offers better flexibility and efficiency, but also occupies little memory.

Keywords Sketch-based image retrieval · Sketch · Image · HoG · BHoG · Principal gradient orientation
H. Fu · H. Zhao · X. Kong (✉) · X. Zhang
School of Information and Communication Engineering, Dalian University of Technology, Dalian 116023, China
e-mail: [email protected]

H. Fu
e-mail: [email protected]

H. Zhao
e-mail: [email protected]

X. Zhang
e-mail: [email protected]
1 Introduction

Image retrieval based on visual features has been a major research field due to its many applications, such as web and mobile image search [1, 6]. Nowadays, with the popularity of devices with touch screens, searching for images that match a hand-drawn sketch query has become a highly desired application. In the sketch-based image retrieval (SBIR) framework, the sketch image is used as the query, and images with similar sketches are returned to users. Due to its simplicity and convenience, the hand-drawn sketch could be used in many fields, such as enhancing traditional keyword-based image search and inspiring designers' drawing. Extensive study of sketch-based image search started in the 1990s, and much research has been performed in this field. These methods fall into two groups: methods based on statistical features and template matching methods. Statistical feature methods may produce poor results when dealing with images with complex backgrounds. Template matching is challenging when there are many objects to match, and it is very time-consuming. In [11], the authors propose a system called Sketch2Photo which composes a realistic picture from a hand-drawn sketch annotated with text labels. Cao et al. [12] proposed an index structure which uses the local features of the hand-drawn sketch and makes it possible to search millions of images accurately. A system called MindFinder has been developed based on this method, and it allows slight translation, scaling, and deformation. Eitz et al. [13] introduce a benchmark for evaluating the performance of large-scale SBIR systems and propose a sketch-based feature named SHoG (Sketched feature lines in the Histogram of Oriented Gradients).
The benchmark can objectively evaluate the accuracy of hand-drawn sketch systems. In [14], the authors propose a way to search for face images according to facial attributes and face similarity. Existing methods mainly focus on image matching based on global and local edges. In practice, with the rapid development of Web 2.0 and social networks such as Flickr, Twitter, and Renren, millions or even billions of images can be collected on the Internet; the need for an efficient, compact descriptor makes sketch-based image search even more challenging [2, 4, 18]. In this paper, we propose a binary descriptor for SBIR. A global feature is extracted to describe the contour images, and it is then encoded into a binary code for image matching. The Hamming distance is used to measure the similarity between binary features, which is much faster than the Euclidean distance. The paper is organized as follows. In Sect. 2, we review related methods and systems for SBIR. In Sect. 3, we present the implementation of our method and its specific details. Section 4 presents the experiments, and we conclude in Sect. 5.
2 Sketch extraction

Sketch extraction is the basis of SBIR methods. For any image, we obtain its sketch image using the Berkeley boundary detection method [5]. The main idea of the Berkeley boundary detection algorithm is to transform the color image into intensity space and Lab color space, and then extract intensity, color, and texture information. After that, logistic regression is used to train the combination coefficients of these features. After calculating the probability that each pixel lies on a boundary, we obtain the boundary probability map. From the probability map, we can obtain the boundary by binarization or any other quantization method. We use the Berkeley boundary detector to extract the sketch image since it has the following advantages: (1) the boundary detector returns a probability map, where the value indicates the probability of each pixel being a boundary point; this probability map is consistent with human subjective perception. (2) By modeling texture and combining various local cues in a statistically optimal manner, the boundary detector describes the boundary accurately. In Fig. 1, for a given image, we show the boundary results detected using the Canny detector and the Berkeley detector. As shown in the figure, compared with the Canny detector, the sketch image generated using the Berkeley detector is smoother.
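As a concrete illustration of the quantization step above, the sketch below turns a boundary-probability map into a binary sketch image. The Berkeley detector itself is not reimplemented here; the probability values and the threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def binarize_boundary(prob_map, threshold=0.3):
    """Quantize a boundary-probability map (values in [0, 1]) into a
    binary sketch image. The threshold is illustrative; the paper does
    not specify the quantization it uses."""
    return (np.asarray(prob_map) >= threshold).astype(np.uint8)

# Toy 3x3 probability map standing in for the Berkeley detector output.
prob = np.array([[0.1, 0.8, 0.2],
                 [0.9, 0.05, 0.7],
                 [0.0, 0.4, 0.95]])
sketch = binarize_boundary(prob)
```

Any monotone quantization of the probability map would do here; a fixed threshold is simply the most common choice.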
3 Binary code of principal gradient orientation

In this section, we discuss the implementation of our SBIR method, shown in Fig. 2. First of all, the sketch image of an input image is generated using the Berkeley boundary detector and divided into small patches. For each image patch, we extract the HoG feature and compute the dominant gradient orientation. Cyclic encoding is then used to encode the feature. In Sect. 3.1, we introduce the extraction of the HoG feature and the selection of the principal component. In Sect. 3.2, we describe the binary encoding of the feature.

3.1 Feature extraction

Generally speaking, most sketch images are binary or gray-scale images with no color information. Thus, we focus on the relative position of each stroke and its gradient direction to describe a sketch image, and propose a binary descriptor based on the histogram of oriented gradients (HoG) feature. As shown in Fig. 3, the overall sketch image is taken as an HoG detection window. In Fig. 3, the region shown is the interest window of the image. The region surrounded by red solid lines is the first block, from which we extract our partial BHoG feature. The red dotted lines indicate the horizontal increment of the block and the green ones indicate the vertical increment. W_h and W_w are the height and width of
Fig. 1 The image and its sketch images. a Digital image, b sketch image generated using Canny detector, c sketch image generated using Berkeley detector
Fig. 2 The framework of BHoG feature extraction. After detecting the boundary of the image, we use block shifting to extract the BHoG feature of each block. For each block, we select the dominant gradient orientation and encode it into binary codes. The final BHoG feature is composed of the binary codes of the blocks
Fig. 3 HoG detection window

the detection window. B_h and B_w are the height and width of the HoG block. S_h and S_w indicate the vertical and horizontal sliding increments of the block, respectively. D is the dimension of the HoG feature, and the gradient is quantized into Bin directions. The feature extraction includes the following steps:

1. Calculate the gradient. Let I(x, y) be the intensity value at (x, y) in the image. Let G(x, y) be the amplitude of the oriented gradient, and α(x, y) be the orientation. It is not necessary to apply gamma normalization to the amplitude because there is no illumination influence in the sketch image. G(x, y) and α(x, y) are computed as in Eqs. 1 and 2:

G(x, y) = [(I(x+1, y) − I(x−1, y))² + (I(x, y+1) − I(x, y−1))²]^{1/2}   (1)

α(x, y) = tan⁻¹[(I(x, y+1) − I(x, y−1)) / (I(x+1, y) − I(x−1, y))]   (2)

2. Build the histogram. A cell unit is a rectangular region of the image. The sketch image is divided into a fixed number of cell units; thus, the size of the cell unit changes as the size of the image changes. As a result, images of different sizes will have the same feature dimension. In our method, the sketch image is divided into 36 cell units, so C_w = W_w/6 and C_h = W_h/6. The range of the gradient orientation is [0, 180] degrees; we quantize the orientation into 8 bins, so the interval is 22.5°.

3. Normalize the feature. Since the feature has a wide range, the histogram of gradient orientation should be normalized. We combine 2 × 2 cell units into a block, thus B_w = 2C_w and B_h = 2C_h. Within the block, the cells are concatenated in series and the normalization is applied. It is worth noting that the blocks have overlapping areas, and the length of the shift step is the length of a cell unit.

4. Generate the constraint parameters. Combining the features of the blocks, we obtain the HoG feature of the sketch image. To keep the dimension of the feature an integer, the parameters should be subject to the following constraints:

B_w mod C_w = 0,  B_h mod C_h = 0   (3)

(W_w − B_w) mod S_w = 0,  (W_h − B_h) mod S_h = 0   (4)
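Steps 1 and 2 above can be sketched as follows. This is our own illustrative implementation, not the authors' code: the function names are ours, rows are treated as the x axis, and the orientation is folded into [0, 180) with `arctan2` so that the undefined ratio in Eq. 2 at zero horizontal gradient is handled gracefully.

```python
import numpy as np

def gradient_mag_ori(I):
    """Eqs. 1-2: central-difference gradient magnitude and orientation
    (mapped into [0, 180) degrees) of a gray-scale image."""
    I = np.asarray(I, dtype=np.float64)
    dx = np.zeros_like(I)
    dy = np.zeros_like(I)
    dx[1:-1, :] = I[2:, :] - I[:-2, :]   # I(x+1, y) - I(x-1, y)
    dy[:, 1:-1] = I[:, 2:] - I[:, :-2]   # I(x, y+1) - I(x, y-1)
    G = np.hypot(dx, dy)                 # Eq. 1
    alpha = np.degrees(np.arctan2(dy, dx)) % 180.0   # Eq. 2, unsigned
    return G, alpha

def cell_histogram(G, alpha, n_bins=8):
    """Step 2: accumulate gradient magnitude into n_bins orientation
    bins of width 180/n_bins degrees (22.5 deg for 8 bins)."""
    bins = np.minimum((alpha / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b in range(n_bins):
        hist[b] = G[bins == b].sum()
    return hist
```

For a patch containing a vertical intensity edge, the accumulated magnitude concentrates in the bin containing 90°, which is the behavior the principal-orientation selection in Sect. 3.2 relies on.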
Thus, the dimension of the HoG feature is:

D = Bin × (B_w B_h)/(C_w C_h) × ((W_w − B_w)/S_w + 1) × ((W_h − B_h)/S_h + 1)   (5)

In Eq. 5, we have Bin = 8, C_w = W_w/6, C_h = W_h/6, B_w = 2C_w, and B_h = 2C_h, so the dimension of our HoG feature is 800. Assuming that a float occupies 4 bytes of memory, the feature of a sketch image then occupies 3.125 KB. The HoG feature of the sketch image proposed in this section not only keeps invariance to translation and scaling, but also has a lower dimension; thus it is faster and more suitable for sketch images.

3.2 Binary code

Among all gradient orientations, we find that there always exists an orientation whose amplitude is obviously higher than the others when the patch is not empty, as shown in Fig. 4. Based on this fact, there is no need to save all the gradient orientations. We select the orientation with the largest gradient amplitude as the HoG principal component of the cell unit. Thus, each cell unit is represented by a one-dimensional feature, namely the index of the dominant gradient orientation. The dimension of the feature is then reduced to 1/Bin of the original. The principal component X(i) of cell i is defined as:

X(i) = argmax_{h∈H} F_h  if max_{h∈H} F_h ≠ 0;  X(i) = 0 otherwise   (6)

where H is the collection of the 8 orientations and F_h indicates the amplitude in orientation h. If the cell is empty, there is no principal component and X(i) equals 0.
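To make Eq. 5 and Eq. 6 concrete, the following sketch (our own helper names, assuming the parameter choices above: Bin = 8, a 6 × 6 grid of cells, 2 × 2 cells per block, and a stride of one cell) checks that the feature dimension comes out to 800 and selects the principal orientation of a cell.

```python
import numpy as np

def hog_dim(Ww, Wh, n_bins=8):
    """Eq. 5 with the paper's parameter choices: 6x6 cells,
    2x2 cells per block, block stride of one cell."""
    Cw, Ch = Ww // 6, Wh // 6            # cell size
    Bw, Bh = 2 * Cw, 2 * Ch              # block size
    Sw, Sh = Cw, Ch                      # stride = one cell
    blocks = ((Ww - Bw) // Sw + 1) * ((Wh - Bh) // Sh + 1)
    cells_per_block = (Bw // Cw) * (Bh // Ch)
    return n_bins * cells_per_block * blocks

def principal_orientation(hist):
    """Eq. 6: index (1..8) of the dominant orientation of a cell
    histogram, or 0 for an empty cell."""
    hist = np.asarray(hist)
    return 0 if hist.max() == 0 else int(np.argmax(hist)) + 1
```

With these parameters there are (6 − 2 + 1)² = 25 blocks of 4 cells each, and 25 × 4 × 8 = 800, matching the dimension stated above.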
Let X_1 and X_2 denote the principal components of gradient orientation from two sketch images. The angular difference of the ith dimension is:

Dist_i = |X_1(i) − X_2(i)|  if |X_1(i) − X_2(i)| ≤ Bin/2;  Dist_i = Bin − |X_1(i) − X_2(i)| otherwise   (7)

and the distance between the two features is:

Dist = ( Σ_{i=1}^{D/Bin} Dist_i² )^{1/2}   (8)

Generally, most of the sketch image is empty, which makes the principal component equal to 0. Intuitively, the difference between an empty cell and a cell with any orientation should always be the same. Therefore, it is not reasonable to use the Euclidean distance of Eq. 8 to measure similarity; moreover, the Euclidean distance is computationally expensive. To solve this problem, we encode the principal component into the binary codes shown in Table 1. The Hamming distance between any two orientation codes belongs to {0, 2, 4, 6, 8}, and each pair of adjacent orientations has a Hamming distance of 2. The empty cell unit has an all-0 encoding, and its Hamming distance to any orientation code is 4. As a result, we solve the problem, while the computational complexity of the Hamming distance is much lower than that of the Euclidean distance. To generalize, the length of the binary code should equal the number of orientations used, and the number of consecutive 1s should be half of the code length. Finally, the dimension of BHoG is D/Bin × 8. For example, in our experiment, each image is represented as an 800-bit binary code, which occupies 0.1 KB of memory, only 1/32 of the memory of the HoG feature of the sketch image.
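A minimal sketch of the cyclic encoding in Table 1 and its Hamming distance, assuming orientation indices 1–8 and index 0 for an empty cell (the helper names are ours):

```python
def bhog_code(index, n_bins=8):
    """Cyclic encoding of Table 1: orientation index k (1..8) maps to
    an n_bins-bit code with n_bins/2 consecutive 1s starting at
    position k-1 (cyclically); index 0 (empty cell) maps to all 0s."""
    if index == 0:
        return [0] * n_bins
    ones = n_bins // 2  # half the code length is 1s
    return [1 if (i - (index - 1)) % n_bins < ones else 0
            for i in range(n_bins)]

def hamming(a, b):
    """Number of differing bit positions between two codes."""
    return sum(x != y for x, y in zip(a, b))
```

With this scheme, adjacent orientations are at Hamming distance 2, opposite orientations at 8, and the empty code is at distance 4 from every orientation, exactly the properties argued for above. In practice the per-cell codes would be concatenated into the 800-bit descriptor and compared with XOR/popcount.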
Table 1 Cyclic binary code

Index  Encoding
1      11110000
2      01111000
3      00111100
4      00011110
5      00001111
6      10000111
7      11000011
8      11100001
0      00000000

Fig. 4 Orientation gradients of the edge

4 Experiments

4.1 Benchmark

We use rank correlation as the benchmark in the experiments [13]. This benchmark compares the ranking of images with human subjective judgment. Thirty-one sketches are drawn as the benchmark images, as shown in Fig. 5. Each sketch image corresponds to a set of 40 images with different matching degrees. The matching degree is ranked from 1 to 7 by the experimenters; the lowest degree means the highest similarity. Finally, the mean of the 28 experimenters' ranks is taken as the reference rank.

Fig. 5 Benchmark sketch-based images

Kendall's rank coefficient is used to judge whether the result of the SBIR system corresponds to human subjective perception. Let x_i, y_i denote the rank of the ith element in x and y, respectively. A pair of ranks (x_i, y_i) and (x_j, y_j) is considered consistent if (x_i − x_j)(y_i − y_j) > 0, and inconsistent otherwise. Given n pairs of ranks in x and y, let n_c be the number of consistent pairs and n_d the number of the other pairs; then Kendall's rank coefficient of x and y is defined as:

τ = (n_c − n_d) / (n(n − 1)/2)   (9)
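Eq. 9 can be computed directly; the sketch below (our own code, not the authors') counts consistent pairs exactly as defined above and treats all remaining pairs, including ties, as inconsistent, matching the assumption that n_c + n_d covers all n(n−1)/2 pairs.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Eq. 9: Kendall's rank coefficient. A pair (i, j) is consistent
    when (x_i - x_j)(y_i - y_j) > 0; every other pair counts toward
    n_d, so n_c + n_d = n(n-1)/2."""
    n = len(x)
    nc = sum(1 for i, j in combinations(range(n), 2)
             if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    nd = n * (n - 1) // 2 - nc
    return (nc - nd) / (n * (n - 1) / 2)
```

Identical rankings give τ = 1 and fully reversed rankings give τ = −1, the two extremes of the range discussed next.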
From Eq. 9, the range of τ is [−1, 1]. τ = 0 means the two rankings are independent, while τ = 1 means the two rankings are completely consistent. Thus, the larger τ is, the better the system performs.

4.2 Parameter settings

In the experiment, a large social image dataset containing 100,000 images from Flickr is used. The image labels include buildings, landscapes, portraits, animals, plants, etc. The size of the block and the number of cell units influence the retrieval performance. The rank coefficient with different parameters is shown in Fig. 6.

Fig. 6 Rank coefficient with different parameters. The coefficient is influenced by the combination of the proportion of the block in the image and the number of cell units in the block

As the block becomes smaller, the retrieval performance first increases and then decreases. The method achieves the best performance when the size of the block is 1/9 of the image. The number of cell units also influences the retrieval performance. For a large block, a smaller cell unit performs better because the feature describes the sketch information more meticulously. For a small block, a small cell unit brings in noise. In summary, the performance is best when the block is 1/9 of the image and each block contains 2 × 2 cell units. In this case, the rank coefficient is 0.298.

4.3 Comparison of rank coefficient

In this section, we compare BHoG with GIST [15], EHD [16], HoG [17] and SHoG [13]. For global features, we calculate the rank coefficients directly from the retrieval results. For local features, we combine the benchmark images with the image dataset to calculate the rank coefficient after training the local feature retrieval structure model. The result is shown in Fig. 7.

Fig. 7 The rank coefficient of five methods

The performance of BHoG is 16 % higher than GIST, 15 % higher than EHD, 8 % higher than HoG, and 3 % higher than SHoG. In Fig. 7, GIST is a global feature, which performs badly at describing the sketch information. Although EHD describes the sketch information, it has a histogram of only five orientations; the rough division of the image and the limited ability to describe sketch information result in mediocre retrieval performance. HoG performs well in describing the sketch and local information; however, it was proposed for human detection and performs only moderately on sketches of other objects. SHoG does better in describing the local features of the object, and it is invariant to rotation and scaling; but like other local features, SHoG pays no attention to the position of the sketch. Both BHoG and SHoG improve on HoG. BHoG is a global feature and is able to describe the position information of the sketch as well as the local information. BHoG has the best retrieval performance, but it is inferior to SHoG in invariance to rotation and translation. The ranked results of the BHoG algorithm are shown in Fig. 8. The left column shows the sketch images we drew, and the right four columns are the returned images.

Fig. 8 Examples of BHoG retrieval results

4.4 Comparison of precision

In Ref. [12] the authors proposed a hand-drawn sketch retrieval structure, EI-S, in which precision is chosen as the evaluation metric. In this section, we compare different SBIR methods based on precision. The
same as EI-S, images in the dataset are scaled to 200 × 200 before feature extraction.

4.4.1 Basic shape sketch retrieval

We take seven basic shapes, namely horizontal line, vertical line, diagonal, reverse diagonal, rectangle, triangle, and circle, as sketch queries. Most complex sketches can be composed of these basic shapes. For each shape, the retrieval is performed 4 times by four different users. The users rank the returned images depending on whether the main contents are similar to the sketch query. The precision of the first 20 returned images is shown in Fig. 9. In Fig. 9, the precision of BHoG is higher than that of EI-S for most basic shapes. EI-S regards the triangle as three lines and lacks the overall position information of the sketch image, so its precision decreases when the triangle has a high bias. HoG is robust to the bias of the basic shapes, and performs better than EHD and GIST.

4.4.2 Specific sketch retrieval
Fig. 9 Performance comparison for basic shapes search: horizontal line (H), vertical line (V), diagonal (L), reverse diagonal (R), rectangle (Rect), triangle (Tri), circle (Cir)
We conduct an experiment to search for specific images. There are 50 target images, shown in Fig. 10. For each target image, the user draws its sketch, as shown in Fig. 11. These sketches are used as query images, and similar images are searched for in the dataset.
Fig. 10 50 target images
Fig. 11 Hand-drawn sketches of target images
Fig. 12 Performance comparison for specific image search
For each query sketch, we calculate the precision of the top N returned images. The mean precision over the 50 query sketches is shown in Fig. 12. In Fig. 12, the precision of BHoG is slightly lower than that of EI-S when only one image is returned, but the precision of BHoG is higher when more images are returned. In the top 50 returned images, the precision of BHoG is 11.4 % higher than EI-S, and 52.8, 66, and 68 % higher than HoG, GIST, and EHD, respectively.

Fig. 13 Performance comparison for similar image search

4.4.3 Similar sketch retrieval

In practice, users generally draw the sketch image based on the concept in their minds. Compared to specific image retrieval, it is difficult to measure the correctness of this
Fig. 14 Search results of BHoG method

Table 2 Comparison of five methods in memory consumption and run-time

Feature  Memory (MB)  Run-time (s)
EHD      30.52        1.883
GIST     195.31       2.054
HoG      878.91       2.458
EI-S     316.95       0.154
BHoG     9.54         0.127
task. In this section, we conduct an extended experiment to evaluate the precision of the entire top N returned images in terms of structure and semantics. The result is shown in Fig. 13. The precision of BHoG is 86 %, which is 12, 38, 48, and 60 % higher than EI-S, HoG, GIST, and EHD, respectively. Part of the retrieval results of BHoG is shown in Fig. 14.

4.5 Comparison of efficiency

To evaluate the efficiency of the BHoG method, we compare the above methods in memory cost and computation speed. The Matlab code is run on a computer with an Intel i5-3450 (3.1 GHz) CPU and 4 GB memory. The memory and run-time performance are shown in Table 2. The memory consumption of BHoG is 1/33 of that of EI-S, and BHoG is much faster than the other methods. The results in Table 2 show that BHoG is effective for large-scale datasets.
5 Conclusion

In this paper, we proposed an SBIR method based on the BHoG descriptor. The method takes advantage of the principal gradient orientation and uses binary encoding for acceleration and better measurement. The experiments show that the method performs better than state-of-the-art methods in a variety of aspects, and the retrieval results correspond to human perception. It needs less memory and retrieves much faster; thus it is applicable to datasets with a huge quantity of data. The method could also be easily tuned for applications in different fields. For future work, we intend to focus on representing the image invariantly to translation and rotation. The retrieval method could also be improved by approximate nearest neighbor search or other methods.

Acknowledgments The work is supported by the Fundamental Research Funds for the Central Universities DUT14QY03.
References

1. Bao, B.K., Zhu, G., Shen, J., Yan, S.: Robust image analysis with sparse representation on quantized visual features. IEEE Trans. Image Process. 22(3), 860–871
2. Fu, H., Kong, X., Lu, J.: Large-scale image retrieval based on boosting iterative quantization hashing with query-adaptive reranking. Neurocomputing 122, 480–489 (2013)
3. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986)
4. Fu, H., Kong, X., Guo, Y., Lu, J.: Weakly principal component hashing with multiple tables. MMM, pp. 293–304 (2013)
5. Martin, D.R., Fowlkes, C.C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 26(5), 530–549 (2004)
6. Bao, B.K., Li, T., Yan, S.: Hidden-concept driven multilabel image annotation and label ranking. IEEE Trans. Multimedia 14(1), 199–210 (2012)
7. Calonder, M., Lepetit, V., Strecha, C., et al.: BRIEF: binary robust independent elementary features. Computer Vision – ECCV 2010, pp. 778–792. Springer, Heidelberg (2010)
8. Leutenegger, S., Chli, M., Siegwart, R.Y.: BRISK: binary robust invariant scalable keypoints. IEEE International Conference on Computer Vision (ICCV), pp. 2548–2555 (2011)
9. Rublee, E., Rabaud, V., Konolige, K., et al.: ORB: an efficient alternative to SIFT or SURF. IEEE International Conference on Computer Vision (ICCV), pp. 2564–2571 (2011)
10. Alahi, A., Ortiz, R., Vandergheynst, P.: FREAK: fast retina keypoint. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 510–517 (2012)
11. Chen, T., Cheng, M., Tan, P., Shamir, A., Hu, S.: Sketch2Photo: internet image montage. ACM Trans. Graph. 28(5), 1–10 (2009)
12. Cao, Y., Wang, H., Wang, C.: MindFinder: interactive sketch-based image search on millions of images. Proceedings of the International Conference on Multimedia, pp. 1605–1608. New York, USA (2010)
13. Eitz, M., Hildebrand, K., Boubekeur, T.: Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE Trans. Vis. Comput. Graph. 17(11), 1624–1636 (2011)
14. Lei, Y., Chen, Y., Chen, B., Su, H., et al.: Photo search by face positions and facial attributes on touch devices, pp. 651–654. ACM Multimedia, Scottsdale (2011)
15. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42(3), 145–175 (2001)
16. Won, C.S., Park, D.K., Park, S.J.: Efficient use of MPEG-7 edge histogram descriptor. ETRI J. 24(1), 35–42 (2002)
17. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. CVPR, San Diego (2005)
18. Zhou, J., Fu, H., Kong, X.: A balanced semi-supervised hashing method for CBIR. ICIP, pp. 2481–2484 (2011)