Visual Words Sequence Alignment for Image Classification

Paweł Drozda, Krzysztof Sopyła, Przemysław Górecki, Piotr Artiemjew
Department of Mathematics and Computer Sciences, University of Warmia and Mazury, Olsztyn, Poland
Email: [email protected]
Abstract—In recent years, the field of image processing has attracted growing interest in many scientific domains. In this paper, the attention is focused on one of the fundamental image processing problems, that is image classification. In particular, a novel approach bridging the content based image retrieval and sequence alignment domains is introduced. For this purpose, the dense version of the SIFT keypoint descriptor, k-means clustering for visual dictionary construction and the Needleman-Wunsch method for sequence alignment were implemented. The performed experiments, which evaluated classification accuracy, showed the great potential of the proposed solution, indicating new directions for the development of image classification algorithms.
I. INTRODUCTION

The Content Based Image Retrieval (CBIR) domain has recently gained major importance in many fields of everyday life. It has been widely studied and applied in areas such as radiology [1], microscopy [18], astronomy [5], mineralogy [19] and e-commerce [15]. For example, in radiology the search for visually similar parts of medical images is performed in order to facilitate the assignment to a particular category of diseases. Among the main tasks of the CBIR domain, this paper focuses on classification, which assigns each queried image to the predefined category containing the most visually similar images. The quality of the results obtained during the classification process largely depends on the features taken into account during comparisons. In the literature, many different ways of visually describing an image have been proposed. In particular, color and texture can represent the image globally, while for local description visual keypoints are introduced. On the basis of these characteristics, feature descriptors are formed [21]. For the purposes of this paper, the SIFT [16] algorithm was chosen, as it is one of the best state-of-the-art image descriptors. On the basis of local image descriptors,
different image representations can be constructed, most often a vector with a predefined number of features. One of the most commonly used for classification purposes is the Bag of Visual Words (BoVW) representation, which derives from text categorization and encodes the image as the frequencies of quantized local descriptors. The main aim of this paper is to perform classification for the e-commerce domain using a dataset of categorized shoe images. In particular, the idea of bridging the BoVW image representation with a sequence alignment algorithm for determining image similarities is introduced. This enriches the classification process by capturing the spatial relationships in the images, and not only the similarity of local and global features. For this reason, the SIFT descriptors are employed in a structured form: each image is represented by a sequence of visual words arranged in a linear order. To determine the similarity between each pair of images, the Needleman-Wunsch algorithm was introduced, which is one of the most commonly used sequence alignment methods for finding the best global matching of two sequences. The idea of bridging sequence alignment and CBIR was first formulated by Kim et al. [12], [13], where the authors use biological methods for determining image similarities. This paper extends their approach by introducing the BoVW image representation and loosening the biological restrictions in sequence alignment. The performed experiments confirmed the validity of the presented method and indicated new potential solutions to the image classification problem. The rest of this paper is organized as follows. The next section describes existing approaches in the Content Based Image Retrieval field which use spatial information. In
section III we propose the novel method for image classification, in which sequence alignment and CBIR are merged. Then, the classification accuracy is evaluated in the experimental section. Finally, in section V we conclude the paper and point out future work.

II. RELATED WORK

The idea of the Bag of Visual Words (BoVW) approach has its origins in the Information Retrieval (IR) discipline. The first experiments with BoVW mimicked the procedure of obtaining a numerical representation derived from the IR field. The most computationally intensive and crucial part of this procedure is extracting the visual words, which consists in detecting interest points in pictures, forming local descriptors around those points and quantizing them into visual words. In the literature many detectors and descriptors have been proposed, such as ASIFT [17], [26], SIFT [16], SURF [4] and MSER [8]; each method tries to be as invariant as possible to translation, rotation, scale change, illumination and viewpoint. Simple histograms of quantized descriptors work quite well in many CBIR tasks, but pictures have a far richer spatial structure than text, and exploiting this additional information is beneficial for various image understanding tasks. The first attempt to use spatial information can be seen in Lazebnik et al. [14]. Their spatial pyramid matching algorithm was motivated by the pyramid matching of Grauman et al. [10], which was designed for finding correspondences between sets of high-dimensional points. Lazebnik uses this idea to find correspondences between sets of visual words, by repeatedly subdividing the image and computing histograms of local features at increasingly coarse resolutions. Utilization of SIFT descriptors on a dense grid with the spatial pyramid matching kernel improves classification accuracy on the Caltech 101 dataset.
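The spatial pyramid idea described above can be sketched in a few lines: the grid of quantized visual words is repeatedly subdivided, and a visual-word histogram is computed per cell at every level. This is a minimal illustration, not Lazebnik's implementation; the grid contents, vocabulary size and number of levels are made up.

```python
import numpy as np

n_words = 4
word_grid = np.array([[0, 1, 2, 3],
                      [1, 1, 0, 2],
                      [3, 0, 2, 2],
                      [1, 3, 3, 0]])

def pyramid_histograms(grid, n_words, levels=2):
    """Concatenate per-cell visual-word histograms over pyramid levels."""
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level  # cells per side at this level
        for band in np.array_split(grid, cells, axis=0):
            for block in np.array_split(band, cells, axis=1):
                feats.append(np.bincount(block.reshape(-1), minlength=n_words))
    return np.concatenate(feats)

# Levels 0..2 give 1 + 4 + 16 = 21 cells, each a 4-bin histogram.
feature = pyramid_histograms(word_grid, n_words)
```

The level-0 histogram simply counts each word over the whole grid; finer levels add increasingly localized counts.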
Savarese et al. [20] try to exploit relative spatial locations and build correlograms of visual words, much as Huang et al. [11] use correlograms of quantized colors. Yang et al. [24], [25] go a step further and, instead of using the absolute spatial location, use relative spatial co-occurrence. First, the image is partitioned into a sequence of increasingly coarser grids and in each grid cell the co-occurrence matrix is computed. To get the final similarity score between two images, a weighted histogram intersection kernel is used. Another approach utilizing spatial relations for image matching was suggested by Kim et al. [12], [13]. In their work the image is represented as a DNA or protein sequence converted from features extracted from colors and texture. Matching was implemented with the BLAST [2] sequence alignment algorithm. Their solution suffers from two limitations. The first is connected with the utilized WU-BLAST implementation, which works only with 4-element (DNA) or 23-element (protein) alphabets. In order to create such an alphabet they had to perform a translation from the feature space to the particular alphabet. To perform such a conversion the "Composite Conversion Table" was developed, but it is hard to create a universal mapping for different features. The second issue is connected with the substitution matrix, which
determines the similarity between two alphabet letters and the penalty when one is substituted by the other. They use a simple uniform matrix (1 on the diagonal, -1 for all other elements), which means that letters are only similar to themselves, and a Gaussian distributed matrix which tries to express similarity between the mapped features; for example, the color red is more similar to orange than to blue. Our work improves on the solution proposed by Kim et al. by using the BoVW approach, which in our opinion is more suitable for the sequence alignment framework. Firstly, it is more natural to obtain the substitution matrix by computing the distance between each pair of vocabulary centroids; secondly, our solution is not limited to particular alphabet lengths. All the implementation details are described in section III.

III. METHODOLOGY

This section describes in detail the whole image classification process proposed in this paper. Two main phases can be pointed out. The first involves creating the sequence image representation with the use of dense SIFT descriptors, and the second incorporates two different methods for image classification based on the similarities of images from the considered dataset. The similarities are obtained by means of the Needleman-Wunsch sequence alignment algorithm.

A. Image as a Sequence of Visual Words

In this study an image is represented as a sequence of letters from a predefined alphabet; in this context each visual word is treated as a letter. In order to obtain such a representation, the standard methodology of creating the Bag of Visual Words model was used. To describe local image patches, the SIFT algorithm was chosen, since it has proven to be resistant to changes of scale, rotation and viewpoint [16], [21], [23]. In particular, the SIFT descriptor on a dense regular mesh with a predetermined grid size (8, 16, 24 or 32 pixels) was applied to the images from the classified dataset; the DSIFT implementation from the VLFeat library [22] was utilized.
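The dictionary-building and word-assignment steps (and, anticipating section III-B, the centroid-based substitution matrix) can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for real DSIFT output, and the vocabulary size, grid shape and seed are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-ins for the 128-dimensional dense SIFT descriptors collected
# from the whole dataset (real DSIFT output would be used instead).
descriptors = rng.random((500, 128))
n_words = 50  # illustrative vocabulary size k

# Each k-means cluster becomes one visual word of the dictionary.
kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)

# For a single image, every grid cell's descriptor is replaced by the id
# of its nearest centroid, turning the image into a grid of "letters".
image_descriptors = rng.random((6 * 8, 128))  # e.g. a 6x8 grid of cells
word_grid = kmeans.predict(image_descriptors).reshape(6, 8)
sequence = word_grid.reshape(-1)  # linear, row-by-row ordering

# Substitution matrix of section III-B, derived from pairwise centroid
# distances and rescaled so identical words score 1, the farthest pair -1.
c = kmeans.cluster_centers_
d = np.sqrt(((c[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1))
S = -2.0 * d / d.max() + 1.0
```

Flattening `word_grid` row by row yields the linear sequence later fed to the alignment algorithm.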
It should be noted that all images were scaled to the same size (the wider edge equals 256px) in order to provide the same number of grid cells, which makes the classification process more reliable. As a result, for each image a set of SIFT descriptors (128-dimensional vectors enriched with information about scale, orientation and location) was collected. An example of an image with dense SIFT descriptors is presented in Figure 1(b). On the basis of all the descriptors obtained from the whole dataset, the visual dictionary is created. The dictionary is generated with the k-means algorithm, where each cluster is a visual word of the generated dictionary. This step is performed because many of the obtained SIFT descriptors are similar, so it is worthwhile to group them into clusters. The next step of the algorithm involves assigning a visual word to each cell descriptor, in such a way that the particular cell receives the visual word most similar to its descriptor. At this stage the visual word can be treated as a single letter. Having obtained
(a) Original Shoe
(b) Shoe with SIFT descriptors on a dense grid
(c) Shoe with assigned visual words, vocabulary size 50
Fig. 1. Procedure of creating sequences
the visual word grid representation of the image (illustrated in Figure 1(c)), different orderings can be applied to obtain the image sequence representation needed for the Needleman-Wunsch algorithm. In particular, the visual words can be set in a linear order, where grid cells are ordered row by row or column by column; the Hilbert order, where cells are addressed according to the Hilbert space-filling curve; or the Z-order (Morton order). For the purposes of this study only the linear ordering is taken into consideration.

B. Sequence Alignment for Classification

The first phase of the algorithm generates the sequences of visual words for all images from the given dataset, which are processed in the second phase by the Needleman-Wunsch sequence alignment algorithm in order to determine the similarity of any pair of images. This method was originally designed for bioinformatics tasks such as aligning protein or nucleotide sequences. It finds the best possible global matching between two sequences A = A_1 A_2 ... A_n and B = B_1 B_2 ... B_m, where A_i and B_j are letters from the predefined alphabet W. In our case the alphabet is the visual dictionary generated in the first phase, which consists of visual words. Besides the two sequences of visual words, the Needleman-Wunsch algorithm takes as input a scoring (substitution) matrix S(W_i, W_j) = {s_ij}. Each value of the matrix indicates the similarity between two alphabet letters. A higher value of s_ij means that words i and j
are more similar and the substitution of these words during alignment entails a smaller penalty. In the bioinformatics domain the creation of the substitution matrix is considered a hard problem, and as yet no optimal matrix for protein or DNA sequence alignment has been proposed. Many different types of matrices have been suggested, among which BLOSUM [7] and PAM [6] should be mentioned as the leading ones. In the case of visual sequence alignment, however, we construct the similarity matrix on the basis of the distance between two given visual words (precisely, between the corresponding clusters from the k-means algorithm). This allows for capturing the correlation between visual words, which in the case of the standard BoVW approach is difficult. In particular, the distances between k-means clusters are normalized to the [-1, 1] interval, where -1 corresponds to the maximum distance between clusters and 1 to equal clusters. The elements of the similarity matrix S created for the purposes of the Needleman-Wunsch algorithm are calculated by the following formula:

S(W_i, W_j) = (-2) * d(W_i, W_j) / max_{k,l} d(W_k, W_l) + 1,   (1)
where d(W_i, W_j) is the distance between visual words i and j. The idea of the Needleman-Wunsch method is illustrated by the recursion in Algorithm 1. In this algorithm p is the penalty parameter, which decreases the total similarity when, instead of a match, there occurs a
Algorithm 1 Needleman-Wunsch algorithm
1: F_{i,0} = p * i,  i ∈ {0, ..., n}
2: F_{0,j} = p * j,  j ∈ {0, ..., m}
3: F_{i,j} = max(F_{i-1,j-1} + S(A_i, B_j),  F_{i,j-1} + p,  F_{i-1,j} + p)
gap in one of the sequences. The value of this parameter is set for each case individually, and in this paper several different values were tested. The output similarity value for any given pair of sequences is stored in F_{n,m}, where n and m are the lengths of the sequences. As the obtained similarities strongly depend on the lengths of the sequences, they should be normalized in order to weaken this correlation. The normalization is performed by the following formula:

Sim'_{A,B} = Sim_{A,B} / (length(A) + length(B)),   (2)
where Sim_{A,B} is the similarity score obtained from the Needleman-Wunsch algorithm for sequences A and B. As a result of the described procedure, the matrix of normalized similarities of all possible pairs of sequences is obtained. As the last step of the proposed method, classification is performed. For this purpose two different algorithms were implemented. Firstly, the standard k-NN algorithm was used with 5-fold cross-validation. The second method finds, for each image, the average similarity to each class and assigns the image to the class with the highest average. The results of classification are described in detail in the experimental section.

IV. EXPERIMENTS

The goal of the experimental session was to verify the effectiveness of the proposed methodology for image classification. In particular, we wanted to evaluate the proposed method on a dataset containing visually similar categories. For this reason, a dataset consisting of 200 images of shoes, assigned to five distinct visual categories, was prepared; the categories contained 59, 20, 34, 29 and 58 images, respectively. Sample images from each of the categories are presented in Figure 2. To investigate the influence of the grid size on classification accuracy, dense sampling over grid patches of size 8, 16, 24 and 32 pixels was considered. It should be noted that the choice of grid size significantly affects the length of the visual word sequence for each image, which, apart from accuracy, determines the classification time. Intuitively, a smaller grid size should increase accuracy, but it also extends the computation time. Successively, the densely sampled SIFT descriptors were quantized into visual dictionaries of size k = {5, 10, 20, 50, 100, 200, 500, 1000}, in order to determine the optimal size of the visual dictionary for the classification process.
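The second phase (Algorithm 1, the normalization of Eq. (2), and the highest-class-average classifier) can be sketched as follows. This is a minimal illustration under a toy two-word alphabet; the substitution matrix, sequences, gap penalty and labels are made up, not values from the paper.

```python
import numpy as np

def needleman_wunsch(a, b, S, p):
    """Global alignment score of visual-word sequences a and b, given
    substitution matrix S and gap penalty p (Algorithm 1)."""
    n, m = len(a), len(b)
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = p * np.arange(n + 1)  # line 1: leading gaps in b
    F[0, :] = p * np.arange(m + 1)  # line 2: leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i - 1, j - 1] + S[a[i - 1], b[j - 1]],  # substitution
                          F[i, j - 1] + p,                          # gap
                          F[i - 1, j] + p)                          # gap
    return F[n, m]

def normalized_similarity(a, b, S, p):
    """Eq. (2): divide the raw score by the total sequence length."""
    return needleman_wunsch(a, b, S, p) / (len(a) + len(b))

# Toy two-word alphabet: a word matches itself (+1), mismatches cost -1.
S = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

# Tiny training set of word sequences with class labels, plus a query.
train = [[0, 1, 0], [0, 1, 1], [1, 1, 1], [1, 0, 1]]
labels = np.array([0, 0, 1, 1])
query = [0, 1, 0]

sims = np.array([normalized_similarity(query, t, S, p=-0.5) for t in train])

# Second classifier: assign the query to the class with the highest
# average normalized similarity.
averages = [sims[labels == c].mean() for c in np.unique(labels)]
predicted = int(np.argmax(averages))
```

A k-NN classifier over the same `sims` values would simply vote among the most similar training images instead of averaging per class.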
The penalty parameter of the Needleman-Wunsch algorithm was identified as the next factor which can greatly affect accuracy. In particular, our method was evaluated for various values of the penalty parameter
TABLE I
CLASSIFICATION ACCURACY OF 1-NN CLASSIFIER (PENALTY PARAMETER = 0.5)

Vocab. size   Grid 8   Grid 16   Grid 24   Grid 32
5             0.97     0.9       0.84      0.77
10            0.975    0.92      0.86      0.82
20            0.995    0.945     0.875     0.87
50            0.985    0.96      0.915     0.815
100           0.98     0.96      0.895     0.87
200           1        0.955     0.945     0.915
500           1        0.975     0.925     0.91
1000          1        0.98      0.93      0.905
TABLE II
CLASSIFICATION ACCURACY OF 3-NN CLASSIFIER (PENALTY PARAMETER = 0.01)

Vocab. size   Grid 8   Grid 16   Grid 24   Grid 32
5             0.96     0.885     0.845     0.815
10            0.97     0.91      0.825     0.875
20            0.99     0.935     0.9       0.83
50            0.98     0.96      0.92      0.87
100           0.99     0.955     0.895     0.89
200           0.99     0.98      0.91      0.9
500           0.99     0.97      0.93      0.905
1000          0.99     0.975     0.94      0.925
p = {0.01, 0.1, 0.2, 0.4, 0.5}. Since our approach relies on the k-NN classifier, we calculated classification accuracy for 1, 3 and 5 neighbors. Tables I, II and III present the results obtained using the k-NN classifiers with 5-fold cross-validation. Precisely, for each value of the parameter k = 1, 3, 5 only the best set of results, for the chosen penalty parameter, is presented. For different choices of the penalty parameter the accuracy changes only slightly (up to 3 percent). This, together with the data presented in Tables I, II and III, shows that the selection of both the k-NN classifier and the penalty parameter has very little impact on classification accuracy. On the other hand, it can be noted that the accuracy is very high for very dense grids (grid size of 8) and dictionaries with more than 20 visual words, and it significantly decreases with increasing grid size. The best reported accuracy is 100% for the 1-NN classifier with a visual dictionary of 200 or more words, and in many other cases the accuracy is 98% or more.

TABLE III
CLASSIFICATION ACCURACY OF 5-NN CLASSIFIER (PENALTY PARAMETER = 0.2)

Vocab. size   Grid 8   Grid 16   Grid 24   Grid 32
5             0.965    0.885     0.82      0.82
10            0.97     0.905     0.86      0.845
20            0.985    0.94      0.88      0.84
50            0.97     0.935     0.895     0.86
100           0.975    0.945     0.89      0.905
200           0.96     0.955     0.91      0.92
500           0.99     0.95      0.915     0.91
1000          0.99     0.96      0.91      0.91

Our previous experiments on the same dataset [3], [9], which used classical visual word frequencies, k-NN
Fig. 2. Sample images from each of 5 shoe categories (a-e)
TABLE IV
CLASSIFICATION ACCURACY OF THE HIGHEST CLASS AVERAGE METHOD (PENALTY PARAMETER = 0.2)

Vocab. size   Grid 8   Grid 16   Grid 24   Grid 32
5             0.895    0.815     0.685     0.67
10            0.885    0.835     0.695     0.735
20            0.935    0.81      0.7       0.725
50            0.93     0.825     0.7       0.715
100           0.915    0.835     0.7       0.725
200           0.92     0.84      0.765     0.75
500           0.935    0.855     0.78      0.785
1000          0.95     0.875     0.825     0.785
and SVM classifiers, resulted in 96% accuracy. Therefore, classification based on sequences of visual words turns out to be more reliable than the classical approaches. Sample results of the second method, based on average similarities for each class, are presented in Table IV. In this case the accuracy is lower (the best result is 95%) compared to the k-NN classifiers. The grid size parameter has the biggest impact on classification accuracy: accuracy is very high for small grid patches and degrades as patches get larger. However, a denser grid results in longer visual sequences. The average sequence length for a grid of size 8 is 497.1 (σ = 158.12), while it is 18.2 (σ = 12.3) for a grid size of 32. As the Needleman-Wunsch algorithm requires O(mn) time to calculate the similarity of two images with sequences of lengths m and n, high classification accuracy comes at the price of high computational cost. For example, it takes roughly 746 times less time to calculate the similarity of two images at a grid size of 32 than at 8.

V. CONCLUSION AND FUTURE WORK

This paper presents a novel approach to image classification which uses the Needleman-Wunsch sequence alignment algorithm for finding image similarity. The accuracy obtained during the experimental session reaches 100% for properly chosen parameters, which is better than the standard
SVM classifier, whose highest accuracy in classification of the same dataset was 96.5%. The experimental results showed the great potential of combining these two research areas and point out new directions in the Content Based Image Retrieval field. In the future we are going to test our approach on state-of-the-art image datasets as well as implement different sequence alignment algorithms. Moreover, we plan to test different methods of sequence ordering.

ACKNOWLEDGMENTS

The research has been supported by grant N N516 480940 from The National Science Center of the Republic of Poland and by grant 1309-802 from the Ministry of Science and Higher Education of the Republic of Poland. We would like to thank the Geox and Salomon companies for providing a set of shoe images for scientific purposes.

REFERENCES

[1] Akgül, C.B., Rubin, D.L., Napel, S., Beaulieu, C.F., Greenspan, H., Acar, B.: Content-based image retrieval in radiology: Current status and future directions. J. Digital Imaging 24(2), 208–222 (2011)
[2] Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990)
[3] Artiemjew, P., Górecki, P., Sopyła, K.: Categorization of similar objects using bag of visual words and k nearest neighbour classifier. Technical Sciences (2012)
[4] Bay, H., Tuytelaars, T., Gool, L.V.: SURF: Speeded up robust features. In: ECCV. pp. 404–417 (2006)
[5] Csillaghy, A., Hinterberger, H., Benz, A.O.: Content-based image retrieval in astronomy (2000)
[6] Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5(suppl 3), 345–351 (1978)
[7] Eddy, S.R.: Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 22(8), 1035–1036 (Aug 2004), http://dx.doi.org/10.1038/nbt0804-1035
[8] Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions.
In: British Machine Vision Conference. pp. 384–393 (2002)
[9] Górecki, P., Sopyła, K., Drozda, P.: Different SVM kernels for bag of visual words. Studies and Proceedings of Polish Association for Knowledge Management (2012)
[10] Grauman, K., Darrell, T.: The pyramid match kernel: Discriminative classification with sets of image features. In: ICCV. pp. 1458–1465 (2005)
[11] Huang, J., Kumar, S.R., Mitra, M., Zhu, W.J., Zabih, R.: Image indexing using color correlograms. In: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR '97). pp. 762–. IEEE Computer Society, Washington, DC, USA (1997), http://dl.acm.org/citation.cfm?id=794189.794514
[12] Kim, H.s., Chang, H.W., Lee, J., Lee, D.: BASIL: effective near-duplicate image detection using gene sequence alignment. In: Proceedings of the 32nd European Conference on Advances in Information Retrieval. pp. 229–240. ECIR'2010, Springer-Verlag, Berlin, Heidelberg (2010)
[13] Kim, H.s., Chang, H.W., Liu, H., Lee, J., Lee, D.: BIM: Image matching using biological gene sequence alignment. In: ICIP. pp. 205–208. IEEE (2009)
[14] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2. pp. 2169–2178. CVPR '06, IEEE Computer Society, Washington, DC, USA (2006)
[15] Li, J.: The application of CBIR-based system for the product in electronic retailing. In: Computer-Aided Industrial Design and Conceptual Design (CAIDCD), 2010 IEEE 11th International Conference. pp. 1327–1330 (November 2010)
[16] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (November 2004)
[17] Morel, J.M., Yu, G.: ASIFT: A new framework for fully affine invariant image comparison. SIAM J. Img. Sci. 2(2), 438–469 (Apr 2009)
[18] Najgebauer, P., Nowak, T., Romanowski, J., Ryga, J., Korytkowski, M., Scherer, R.: Novel method for parasite detection in microscopic samples. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds.) Artificial Intelligence and Soft Computing, pp. 551–558.
Lecture Notes in Computer Science, Springer Berlin / Heidelberg (2012)
[19] Painter, T.H., Dozier, J., Roberts, D.A., Davis, R.E., Green, R.O.: Retrieval of subpixel snow-covered area and grain size from imaging spectrometer data (2003)
[20] Savarese, S., Winn, J., Criminisi, A.: Discriminative object class models of appearance and shape by correlatons. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2. pp. 2033–2040. CVPR '06, IEEE Computer Society, Washington, DC, USA (2006)
[21] Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: a survey. Found. Trends. Comput. Graph. Vis. 3, 177–280 (July 2008)
[22] Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms (2008)
[23] Yang, J., Jiang, Y.G., Hauptmann, A.G., Ngo, C.W.: Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of the International Workshop on Multimedia Information Retrieval. pp. 197–206. MIR '07, ACM, New York, NY, USA (2007), http://doi.acm.org/10.1145/1290082.1290111
[24] Yang, Y., Newsam, S.: Bag-of-visual-words and spatial extensions for land-use classification. In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. pp. 270–279. GIS '10, ACM, New York, NY, USA (2010), http://doi.acm.org/10.1145/1869790.1869829
[25] Yang, Y., Newsam, S.: Spatial pyramid co-occurrence for image classification. In: Metaxas, D.N., Quan, L., Sanfeliu, A., Gool, L.J.V. (eds.) ICCV. pp. 1465–1472. IEEE (2011)
[26] Yu, G., Morel, J.M.: ASIFT: An Algorithm for Fully Affine Invariant Comparison. Image Processing On Line 2011 (2011)