Application of Diffusion Kernel in Multimodal Image Retrieval

Rajeev Agrawal 1, William Grosky 2, Farshad Fotouhi 1, Changhua Wu 3

1 Wayne State University, 431 State Hall, Detroit, MI 48202, USA
[email protected], [email protected]
2 The University of Michigan – Dearborn, 4901 Evergreen Road, Dearborn, MI 48128, USA
[email protected]
3 Kettering University, 1700 West Third Ave., Flint, MI 48504, USA
[email protected]

Abstract

In this paper, we propose an approach to narrow the gap between low-level image features and the human interpretation of an image. Taking a cue from text-based retrieval techniques, we construct "visual keywords" using vector quantization of small image tiles. The visual and text keywords are then combined to represent an image as a single multimodal vector. This multimodal image vector is analogous to a term vector in text document representation and helps in unfolding the hidden inherent relationships between images, between text terms, and between text and images. We use a diffusion kernel based non-linear approach to identify the relationship between the visual and text modalities. By comparing the performance of this approach with a low-level feature based approach, we demonstrate that visual keywords, when combined with textual keywords, improve the image retrieval results significantly.
1. Introduction
In the early years of content-based image retrieval (CBIR), low-level features such as color and texture [1, 2, 3, 4] were used either in a narrow domain, to separate images into one of two categories, or in a broad domain, where images are classified into different categories using clustering techniques. Low-level features have been successful in specific applications such as face, fingerprint, and object recognition. However, researchers have identified the limitations of low-level features for querying and browsing large image collections. Many techniques have also been developed to apply information retrieval methods in the context of image retrieval; an immediate advantage is the ability to reuse text information retrieval techniques for images. These techniques have been applied to text annotations alone and also in conjunction with the low-level features of the objects. The biggest problem with this approach is obtaining good-quality annotations, which is time consuming and highly subjective. To extend the annotation list, WordNet [5] has been used in several systems, which further improved the results. Annotations have also been used to group images into a certain number of concept bins, but the mapping from image space to concept space is not one-to-one, since it is possible to describe an image using a large number of words. No generic, direct transformation exists to map a low-level representation into high-level concepts. The gap between low-level features and text annotations has been identified as the "semantic gap". In this paper, we propose a technique based on Latent Semantic Analysis (LSA) [6] to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. The information retrieval problem is posed as matching queries with words to documents with words. The document collection is organized as a term-document matrix, in which rows represent the textual keywords and columns represent the documents. LSA is then applied to this term-document matrix to discover the latent relationships between correlated words and documents. However, it is not straightforward to generalize information retrieval to image retrieval. What is a "keyword" in an image? One simple answer is to consider each pixel a word, but this is the same as using the low-level contents of an image. Therefore, we need a proper definition of a "keyword" that is simple and independent of context and content. In our approach, a "keyword" is a tile of any size in an image. The number of keywords can vary depending on the chosen tile template size, but it is preferable to use a constant template size across an image collection so that all "keywords" have a similar resolution. Defining a "keyword" simply as a tile makes our approach completely unsupervised and independent of image segmentation techniques. We combine these visual "keywords" with text keywords and use the resulting multimodal representation for querying and retrieving images from a collection. The rest of the paper is organized as follows: Section 2 describes related work in the areas of multimodal keywords and fusion of modalities. Our framework for multimodal image retrieval is given in Section 3, and Section 4 presents experimental results that evaluate our method under several variations of the image data. Finally, in Section 5 we offer some conclusions and discuss future work.
2. Related Work
In one of the earliest papers, Swain and Ballard [7] use color for high-speed image location and identification. They use color histograms of multicolored objects, which provide a robust and efficient representation for indexing images. In [8], an image is represented by three 1-D color histograms in R, G, B space, while a histogram of the directions of the edge points is used to represent general shape information. A so-called blobworld representation [9] is used to retrieve images; it treats an image as a combination of objects, making both querying and learning in the blobworld more meaningful to the user. In all of these works, color is basically the fundamental unit used to represent an image, much like keywords in a text document, although the representation of the color features varies across systems, ranging from histogram approaches to indexing structures. The idea of using visual keywords or a visual thesaurus first seems to appear in [10]. A "visual thesaurus" works with pictures, not words; it helps in recognizing visually similar events, or "visual synonyms", including both spatial and motion similarity. A visual keyword can be based on color, texture, pattern, objects in the image, or any other user-defined features, depending on the domain. In [11], visual keywords are created by cropping domain-relevant regions from sample images; these regions are assigned labels and sub-labels to form a thesaurus and a vocabulary, respectively. Additionally, low-level features are extracted from the cropped regions to form a feature vector representing that visual keyword. A keyblock-based approach [12] encodes each image as a set of one-dimensional index codes linked to the keyblocks in a codebook, analogous to considering a text document as a linear list of keywords. In this approach, a corresponding codebook is generated for each semantic class; however, the approach does not have any invariant properties and requires domain knowledge while encoding the images. More recently, the visual keyword approach has been used for visual categorization with SVM and Naïve Bayes classifiers [13], for discovering objects and their locations in images using a visual analogue of a word formed by vector-quantizing low-level features [14], and by extracting a large number of overlapping square sub-windows of random sizes at random positions from training images [15]. In the biomedical domain, ViVo (visual vocabulary) has been developed to summarize an image automatically, to identify patterns that distinguish image classes, and to highlight interesting regions in an image [16]. In all of the above approaches, only low-level features have been used to construct the visual keywords. We augment the visual keyword approach by utilizing text keywords to form a multimodal vector representation of an image [17]. An important question in any multimodal approach is what constitutes a "modality". For a multimedia element, features can be extracted from different sources, e.g., a video shot may have visual features such as color and texture, audio features, and text features, with each source being one modality. In most of the existing image retrieval work, all features are considered to belong to one modality, but recently multimodality has been used for more effective image retrieval [18, 19, 20]. In the presence of many modalities, it is important to identify the best way to fuse them to represent the images. More discussion on modality independence, the curse of dimensionality, and fusion-modal complexity can be found in [21]. In [19], many popular fusion strategies, such as product combination, weighted sum, voting, and min-max aggregation, are discussed. An iterative similarity propagation approach has been proposed to explore the relationship between web images and their textual annotations for image retrieval [22]. In our method, we combine visual keywords, as explained above, with text keywords and use the resulting multimodal vector to represent an image for retrieval. In this work, we use a diffusion kernel based non-linear approach to identify the relationships between the different modalities.
3. A Framework for Multimodal Based Image Retrieval

3.1 Visual Keyword Construction

This section describes our approach to generating visual keywords by mapping the low-level feature space to a quantized visual keyword space. For clustering and retrieval applications, it is crucial to use proper visual semantics to represent images. Let I be a set of n images. Each image is divided into non-overlapping tiles, which results in a tile matrix T = {t1i,…,t1j, t2i,…,t2j, ……, tni,…,tnj | R^k}, where k is the number of features representing each tile. Let V be the desired number of visual keywords. The tile matrix is mapped to the V bins (visual keywords), and the membership of each tile in a bin is found using the following criterion function:
\text{maximize} \; \sum_{i=1}^{V} \sum_{v, u \in t_i} \mathrm{sim}(v, u)
The proposed visual keyword approach treats each tile like a word in a text document. Images have been represented using simple low-level features, such as color histograms and textures, and also using the more sophisticated Scale Invariant Feature Transform (SIFT) descriptors [23] and MPEG-7 descriptors. SIFT descriptors are multi-image representations of an image neighborhood: Gaussian derivatives computed at 8 orientation planes over a 4x4 grid of spatial locations, giving a 128-dimensional vector. We prefer to use MPEG-7, a standard for describing multimedia content data that supports some degree of interpretation of the information's meaning and can be passed on to, or accessed by, a device or computer code [24]. The procedure for creating visual keywords is completely unsupervised and does not involve any image segmentation. The number of visual keywords is a parameter selected by the user and may vary with the domain: in a narrow domain, a small number of visual keywords is appropriate because of the similarity among tiles, while in a broad domain a larger number may be desired. Another important parameter is the template size used to create tiles, since this size has a direct effect on the computational cost. A small template size results in a large number of tiles and hence higher computational cost; this is a trade-off between quality and speed. A template size of 32 x 32 pixels is appropriate when using MPEG-7 descriptors. We use the scalable color descriptor (SCD) with 64 coefficients, which is good enough to provide reasonably good performance [25], the color layout descriptor (CLD) with 12 coefficients, found to be the best trade-off between storage cost and retrieval efficiency, and the color structure descriptor (CSD) with 64 coefficients, sufficient to capture the important features of a tile. Hence, a tile vector has 140 coefficients. We note that the three MPEG-7 descriptors have different feature space sizes; therefore, they are normalized within their own feature spaces. We use the high-dimensional clustering algorithm vcluster [25] to cluster the tile matrix into the desired number of visual keywords. In the final step, an image vector, whose size equals the number of clusters, is created for each image: the j-th element of this vector equals the number of tiles from the given image that belong to the j-th cluster. The visual keyword-image matrix is then formed, using the image vectors as columns. Finally, we normalize each column vector to unit length and obtain the final visual keyword-image matrix Tvis.
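To make the construction concrete, the following sketch outlines the pipeline of this section in Python. It is an illustration rather than our implementation: a crude per-channel color histogram stands in for the MPEG-7 SCD/CLD/CSD descriptors extracted by MILOS, and scikit-learn's k-means stands in for CLUTO's vcluster; the 32 x 32 template size and the number of visual keywords are the user-chosen parameters discussed above.

```python
import numpy as np
from sklearn.cluster import KMeans

TILE = 32  # template size in pixels (32 x 32, as in the paper)

def color_features(tile, bins=4):
    """Crude per-channel color histogram, standing in for the MPEG-7
    SCD/CLD/CSD descriptors (140 coefficients) extracted by MILOS."""
    hist = [np.histogram(tile[..., c], bins=bins, range=(0, 256))[0]
            for c in range(tile.shape[-1])]
    return np.concatenate(hist).astype(float)

def tile_image(img):
    """Cut an H x W x C image into non-overlapping TILE x TILE tiles."""
    h, w = img.shape[:2]
    return [img[r:r + TILE, c:c + TILE]
            for r in range(0, h - TILE + 1, TILE)
            for c in range(0, w - TILE + 1, TILE)]

def visual_keyword_matrix(images, n_keywords):
    """Build the visual keyword-image matrix Tvis (keywords x images)."""
    # 1. Tile matrix T: one feature row per tile, remembering the owner image.
    feats, owner = [], []
    for i, img in enumerate(images):
        for t in tile_image(img):
            feats.append(color_features(t))
            owner.append(i)
    T = np.asarray(feats)

    # 2. Vector-quantize the tiles into n_keywords bins (visual keywords);
    #    k-means maximizes within-cluster similarity, playing the role of
    #    CLUTO's vcluster here.
    labels = KMeans(n_clusters=n_keywords, n_init=4,
                    random_state=0).fit_predict(T)

    # 3. Image vectors: the j-th entry counts the image's tiles in cluster j.
    Tvis = np.zeros((n_keywords, len(images)))
    for img_idx, cluster in zip(owner, labels):
        Tvis[cluster, img_idx] += 1

    # 4. Normalize each column (image vector) to unit length.
    return Tvis / (np.linalg.norm(Tvis, axis=0, keepdims=True) + 1e-12)
```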
3.2 Multimodal Image Representation Using Fusion of Visual and Text Keywords

A variety of information is associated with images in addition to low-level features. It may take the form of content-independent metadata, such as a time stamp, location, or image format, or of content-bearing metadata, which describes the higher-level concepts of the image. The text associated with images has been found to be very useful in practice for image retrieval; for example, newspaper archivists index largely on captions [26]. Smeaton and Quigley [27] use Hierarchical Concept Graphs (HCG) derived from WordNet [5] to estimate the semantic distance between caption words. In this paper, we use text keywords and fuse them with visual keywords to create a multimodal image representation. We first create an initial term-document matrix (Ttex). To control for morphological variations of words, we use Porter's stemming algorithm [28]. The minimum and maximum term (word) length thresholds are set to 2 and 30, respectively. Ttex is then normalized to unit length. We thus obtain a large number of modalities (features) in the form of visual and text keywords. These modalities are not completely independent of each other, so we need an effective strategy to fuse them. PCA and ICA have been used for this purpose, but they have limitations: they need a good estimate of the number of independent components, and they perform best only under certain error-minimization criteria [21]. In our approach, we adopt the diffusion kernel proposed by Lafon [29] for the diffusion of the multiple modalities of the matrix Tvis-tex, which is obtained by concatenating the visual keyword matrix Tvis and the text keyword matrix Ttex. Below is a brief description of the nonlinear diffusion kernel used in our approach. Let Ω represent the set of columns of Tvis-tex and let x, y be any two vectors in Ω. Then we can define a finite graph G = (Ω, Wσ) with n nodes, where the weight matrix Wσ(x, y) is defined as:
W_\sigma(x, y) = \exp\left( -\frac{|x - y|^2}{\sigma} \right)    (1)
where |x - y| is the L2 distance between the vectors x and y. Let

q_\sigma(x) = \sum_{y \in \Omega} W_\sigma(x, y).

Then we can have a new kernel:

W_\sigma^{\alpha}(x, y) = \frac{W_\sigma(x, y)}{q_\sigma^{\alpha}(x)\, q_\sigma^{\alpha}(y)}    (2)

The parameter α is used to specify the amount of influence of the density in the infinitesimal transitions of the diffusion. More description of α can be found in [29]. We can obtain the anisotropic transition kernel pσ(x, y) after applying the normalized graph Laplacian construction to W_σ^α(x, y):
d_\sigma(x) = \sum_{y \in \Omega} W_\sigma^{\alpha}(x, y)    (3)

p_\sigma(x, y) = \frac{W_\sigma^{\alpha}(x, y)}{d_\sigma(x)}    (4)

The matrix pσ(x, y) can be viewed as the transition kernel of a Markov chain on Ω. The diffusion distance Dt between x and y at time t of a random walk is defined as

D_t^2(x, y) = \| p_t(x, \cdot) - p_t(y, \cdot) \|_{1/\phi_0}^2    (5)

            = \sum_{z \in \Omega} \frac{(p_t(x, z) - p_t(y, z))^2}{\phi_0(z)}    (6)

where φ0 is the stationary distribution of the Markov chain.

The diffusion distance can be represented by the right eigenvectors and eigenvalues of the matrix pσ(x, y):

D_t^2(x, y) \cong \sum_{j \ge 1} \lambda_j^{2t} \left( \psi_j(x) - \psi_j(y) \right)^2    (7)

ψ0 does not show up because it is a constant. Since the eigenvalues tend to 0 and have a modulus strictly less than 1, the above sum can be computed to a preset accuracy δ > 0 with a finite number of terms. If we define

s(\delta, t) = \max\{\, j \in \mathbb{N} \text{ such that } |\lambda_j|^t > \delta\, |\lambda_1|^t \,\}    (8)

then, up to relative precision δ, we have

D_t^2(x, y) \cong \sum_{j=1}^{s(\delta, t)} \lambda_j^{2t} \left( \psi_j(x) - \psi_j(y) \right)^2    (9)

Therefore, we can have a family of diffusion maps Ψt, t ∈ N, given by

\Psi_t : x \mapsto \left( \lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), \ldots, \lambda_{s(\delta, t)}^t \psi_{s(\delta, t)}(x) \right)^T    (10)

The mapping Ψt : Ω → R^{s(δ,t)} provides a parameterization (fusion) of the data set Ω; in other words, a parameterization of the graph G in a lower-dimensional space R^{s(δ,t)}, where the rescaled eigenvectors are the coordinates. The dimensionality reduction and the weight of the relevant eigenvectors are dictated by both the time t of the random walk and the spectral fall-off of the eigenvalues. This diffusion mapping represents an effective fusion of the visual and text keywords and is a low-dimensional representation of the image set. The values of σ and α in the diffusion kernel are set as 10 and 1, respectively, for all of our experiments.
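To make the fusion step concrete, the sketch below applies the diffusion-map construction of equations (1)-(10) to the concatenated matrix Tvis-tex. It is an illustrative reading, not our code: it uses dense NumPy linear algebra, treats each image (a column of Tvis-tex) as a point of Ω, and simply keeps a fixed number s of leading non-trivial eigenvectors instead of choosing s(δ, t) adaptively.

```python
import numpy as np

def diffusion_map(X, sigma=10.0, alpha=1.0, t=1, s=30):
    """X: columns are image vectors (concatenation of the visual and text
    keyword vectors). Returns one s-dimensional diffusion coordinate per
    image, i.e. the fused multimodal representation."""
    pts = X.T                                        # one row per image (point in Omega)

    # Equation (1): Gaussian weight matrix on the graph G = (Omega, W_sigma).
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / sigma)

    # Equation (2): density normalization controlled by alpha.
    q = W.sum(axis=1)
    W_alpha = W / np.outer(q ** alpha, q ** alpha)

    # Equations (3)-(4): row-normalize to get the Markov transition kernel.
    d = W_alpha.sum(axis=1)
    P = W_alpha / d[:, None]

    # Equations (7)-(10): spectral decomposition; the diffusion map uses the
    # right eigenvectors scaled by lambda_j^t, dropping the constant psi_0.
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(vals))
    vals, vecs = np.real(vals[order]), np.real(vecs[:, order])
    return (vals[1:s + 1] ** t) * vecs[:, 1:s + 1]

# Example usage (hypothetical shapes):
# Tvis_tex = np.vstack([Tvis, Ttex])     # concatenate the two modalities
# fused = diffusion_map(Tvis_tex, sigma=10, alpha=1, t=1, s=30)
```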
3.3 Image Retrieval Using Multimodal Image Representation

LSA has been used for document retrieval, where documents are represented as vectors, each dimension corresponding to a term. The main idea of LSA is to lower the dimensionality of the documents and reconstruct the matrix using this lower dimension. When searching for relevant documents, the rank-k approximation A_k of the original matrix A with the smallest acceptable error is used. This approximation translates the term and document vectors into a concept space. We write this approximation as A_k = U_k S_k V_k^T, where U and V are the matrices of the left and right singular vectors and S is the diagonal matrix of singular values. The query vector q then has k entries, giving the number of occurrences of each of the k concepts. We transform the query as q_new = q^T U_k S_k^{-1} before comparing it, via the cosine similarity, with the document vectors in the concept space. For the purpose of image retrieval, the diffusion kernel based, low-dimensional representation plays the role of the term-document matrix in the reduced-dimension space of information retrieval.

It has been argued in [30] that the ability to identify pairs of related terms is the core concept of spectral retrieval. In almost all versions of LSA, a fixed low-dimensional subspace is selected when creating the reduced-rank term-document matrix, so the quality of these schemes depends on this selection. By varying the dimension and examining, for each term pair, the curve of the relatedness score, the authors found that the shape of this curve indicates the term-pair relatedness; therefore, no fixed choice of dimension is appropriate for all term pairs. In the algorithm proposed by the authors, the number of dimensions varies for each term pair, depending on when the relatedness score falls to or below zero. We have adopted this approach in our multimodal keyword based retrieval.
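As a small illustration of the retrieval step just described, the sketch below performs the rank-k LSA projection and query folding with plain NumPy. It is an assumed, simplified implementation: it uses a fixed k and ranks documents by cosine similarity in the concept space, and it does not reproduce the per-term-pair, dimension-varying scheme of [30] that we actually adopt.

```python
import numpy as np

def lsa_retrieve(A, q, k=30):
    """A: keyword-document matrix (rows = keywords, columns = images).
    q: query vector in the same keyword space.
    Returns document indices ranked by cosine similarity in the
    k-dimensional concept space."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk = U[:, :k], S[:k]          # A_k = Uk diag(Sk) Vk^T
    docs = Vt[:k, :].T                # document vectors in the concept space

    # Fold the query into the concept space: q_new = q^T Uk Sk^{-1}.
    q_new = (q @ Uk) / Sk

    sims = docs @ q_new / (np.linalg.norm(docs, axis=1)
                           * np.linalg.norm(q_new) + 1e-12)
    return np.argsort(-sims)
```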
4. Experiments

In this section, we discuss various experiments conducted to investigate the effectiveness of the multimodal keyword framework on the well-known Corel image dataset and on the LabelMe collection, available through the MIT AI Lab [31]. We selected
999 images belonging to 10 categories from the Corel dataset and 658 images belonging to 15 categories from LabelMe. The details of the datasets are given in Table 1. The MILOS software [32], based on the MPEG-7 XM model, is used to extract the color descriptors SCD, CSD, and CLD.

Table 1. Details of the datasets

Dataset    #Images   #Tiles    #Visual Keywords   #Text Keywords
LabelMe    658       165750    1500               506
Corel      999       95904     900                924

For both datasets, we measure the precision at 10% and 30% recall, as well as the average precision over 10%, 20%, …, and 100% recall. The values vary between 0 and 1. We used the entire collection of images in our database as the query set, to avoid favoring certain query results. We are interested in answering the following two questions:
1. Is there any improvement in the image retrieval results when using the multimodal keyword representation over the low-level feature representation?
2. Is the diffusion kernel an effective approach to fuse different modalities?
To answer the first question, we conduct experiments using the entire image as one visual keyword and compare the results with the multimodal keywords. To answer the second question, we apply the diffusion kernel before the retrieval experiments. Additionally, to examine the effectiveness of the diffusion kernel, we also use Principal Component Analysis (PCA) to fuse the keywords, both in the low-level feature space and in the multimodal feature space. Here is the list of all the variations of the data used in the experiments:
• Full size images using PCA (fspca) / Tiles of each image using PCA (tspca).
• Full size images (fsdk) / Tiles of each image (tsdk) using the diffusion kernel.
• Full size images + text keywords using PCA (fstkpca) / Tiles of each image + text keywords using PCA (tstkpca).
• Full size images + text keywords (fstkdk) / Multimodal (tiles of each image + text) keywords using the diffusion kernel (tstkdk).
• Only text keywords (txt) using the diffusion kernel.
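For completeness, here is one way the reported figures could be computed. The paper's exact interpolation is not spelled out, so this is an assumed implementation: precision at a recall level is taken as the maximum precision achieved at or beyond that recall (interpolated precision), and the average precision is the mean over the ten levels 10%, 20%, …, 100%.

```python
import numpy as np

def precision_at_recall(ranked_relevant, n_relevant, levels=(0.1, 0.3)):
    """ranked_relevant: boolean array, True where the i-th retrieved image is
    relevant to the query; n_relevant: total relevant images in the collection.
    Returns interpolated precision at the requested recall levels."""
    rel = np.asarray(ranked_relevant, dtype=float)
    precision = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    recall = np.cumsum(rel) / n_relevant
    out = []
    for r in levels:
        mask = recall >= r
        out.append(precision[mask].max() if mask.any() else 0.0)
    return out

def average_precision_10pt(ranked_relevant, n_relevant):
    """Average of interpolated precision at recall 10%, 20%, ..., 100%."""
    levels = [i / 10 for i in range(1, 11)]
    return float(np.mean(precision_at_recall(ranked_relevant,
                                             n_relevant, levels)))
```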
For the diffusion kernel experiments, we set the number of modalities to 30; we find that beyond this number, the curse of dimensionality starts having an effect. The dimension-less LSI algorithm of [30] was used for all the retrieval experiments. The results are shown in Tables 2 and 3; in each column, the first value is for the LabelMe dataset and the second for the Corel dataset.
Table 2. Precision results (full size images)

Experiment type   10% recall        30% recall        Av. Prec.
                  LabelMe / Corel   LabelMe / Corel   LabelMe / Corel
fspca             .67 / .74         .51 / .59         .41 / .45
fsdk              .68 / .75         .51 / .60         .42 / .46
fstkpca           .81 / .83         .64 / .69         .49 / .53
fstkdk            .84 / .85         .67 / .70         .55 / .54
txt               .77 / .81         .68 / .69         .52 / .51

Table 3. Precision results (tiles)

Experiment type   10% recall        30% recall        Av. Prec.
                  LabelMe / Corel   LabelMe / Corel   LabelMe / Corel
tspca             .69 / .72         .54 / .57         .44 / .45
tsdk              .69 / .75         .57 / .61         .45 / .50
tstkpca           .80 / .87         .68 / .76         .56 / .62
tstkdk            .81 / .86         .69 / .77         .59 / .63
The above results show that the multimodal keywords give the best results in terms of average precision. In most cases, they also show improved performance at the 10% recall level, and as the recall level increases, the multimodal keywords consistently perform better. The next best results are obtained when the full-size image and text keywords are combined, with text keywords alone close behind; using only low-level image features gives worse results than the multimodal keywords. We also observe that, when using only low-level features, there is no significant difference between using the entire image and using the tiles. However, using text keywords alone is better than using the low-level features of the full-size image, which indicates that at higher recall values the utility of the low-level features diminishes. These results are consistent across both datasets used in the experiments. Another important observation is that visual keywords alone are better than low-level features of the entire image. Based on the experimental results, we can answer the questions listed at the beginning of this section:
1. There is a significant improvement in the retrieval results when we use multimodal keywords, as is evident from Tables 2 and 3. This shows that the multimodal representation is better than either the low-level feature or the text representation.
2. The diffusion kernel saves computational cost, avoids high dimensionality, and finds the modalities that are most representative of a large feature set.
5. Conclusions

In this paper, we have presented a framework for content-based image retrieval based on multimodal keywords. The experiments show that visual keywords and text annotations, when used together, can improve the quality of the retrieval results to a great extent. Latent semantic analysis remains a key to our approach and helps in establishing the relationship between visual and textual keywords. The model uses a diffusion kernel based method to explore the relationships between the different modalities of the data and to extract the most representative ones. We do not use any complex image segmentation technique to create visual keywords; a simple division of the image into tiles provides good results. The visual keywords describe the image more comprehensively and are semantically rich. We would like to extend this work to find a set of visual keywords representing the semantic concepts in an image collection, which could then be compared against the tiles extracted from a query image to retrieve similar images very quickly. We would also like to explore the effect of the template size on the retrieval results and the selection of the number of modalities during the fusion of different types of keywords.
6. References
[1] J. Huang, S. R. Kumar, M. Mitra, W. J. Zhu, and R. Zabih, "Image Indexing Using Color Correlograms", CVPR, San Juan, Puerto Rico, pp. 762-768, 1997.
[2] M. Stricker and M. Orengo, "Similarity of Color Images", SPIE: Storage and Retrieval for Image and Video Databases, pp. 381-392, 1995.
[3] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. H. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin, "The QBIC Project: Querying Images by Content Using Color, Texture, and Shape", SPIE: Storage and Retrieval for Image and Video Databases, pp. 173-187, 1993.
[4] A. Pentland, R. Picard, and S. Sclaroff, "Photobook: Content-Based Manipulation of Image Databases", SPIE: Storage and Retrieval for Image and Video Databases, pp. 34-47, 1994.
[5] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, Massachusetts, USA, 1998.
[6] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, 41(6): 391-407, 1990.
[7] M. Swain and D. Ballard, "Color Indexing", International Journal of Computer Vision, 7(1): 11-32, 1991.
[8] A. K. Jain and A. Vailaya, "Image Retrieval Using Color and Shape", Pattern Recognition, 29(8): 1233-1244, 1996.
[9] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying", IEEE PAMI, 24(8): 1026-1038, 2002.
[10] R. W. Picard, "Toward a Visual Thesaurus", MIRO, Glasgow, 1995.
[11] J. H. Lim, "Building Visual Vocabulary for Image Indexation and Query Formulation", Pattern Analysis and Applications, 4: 125-139, 2001.
[12] L. Zhu, A. Rao, and A. Zhang, "Theory of keyblock-based image retrieval", ACM Transactions on Information Systems, 20(2): 224-257, 2002.
[13] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints", ECCV Workshop on Statistical Learning in Computer Vision, 2004.
[14] J. Sivic, B. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, "Discovering Objects and Their Location in Images", ICCV, 2005.
[15] R. Maree, P. Geurts, J. Piater, and L. Wehenkel, "Random Subwindows for Robust Image Classification", CVPR, Vol. 1, pp. 34-40, 2005.
[16] A. Bhattacharya, V. Ljosa, J. Pan, M. R. Verardo, H. Yang, C. Faloutsos, and A. K. Singh, "ViVo: Visual Vocabulary Construction for Mining Biomedical Images", ICDM, 2005.
[17] R. Agrawal, W. I. Grosky, and F. Fotouhi, "Image Clustering Using Multimodal Keywords", SAMT, pp. 113-123, 2006.
[18] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, "Query by image and video content: the QBIC system", Intelligent Multimedia Information Retrieval, MIT Press, Cambridge, MA, 1997.
[19] J. Kittler, M. Hatef, and R. P. W. Duin, "Combining classifiers", International Conference on Pattern Recognition, pp. 897-901, 1996.
[20] R. Agrawal, W. I. Grosky, and F. Fotouhi, "Image Retrieval Using Multimodal Keywords", ISM, pp. 817-822, 2006.
[21] Y. Wu, E. Y. Chang, K. C. Chang, and J. R. Smith, "Optimal multimodal fusion for multimedia data analysis", ACM Multimedia, pp. 572-579, 2004.
[22] X. Wang, W. Ma, G. Xue, and X. Li, "Multi-model similarity propagation and its application for web image retrieval", ACM Multimedia, pp. 944-951, 2004.
[23] D. G. Lowe, "Object Recognition from Local Scale-Invariant Features", ICCV, 1999.
[24] B. S. Manjunath, P. Salembier, and T. Sikora (Eds.), Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons, Indianapolis, Indiana, 2002.
[25] G. Karypis, "CLUTO: A Clustering Toolkit", Release 2.1.1, Technical Report 02-017, Department of Computer Science, University of Minnesota, 2003.
[26] M. Markkula and E. Sormunen, "Searching for photos - journalists' practices in pictorial IR", in The Challenge of Image Retrieval, Electronic Workshops in Computing, 1998.
[27] A. F. Smeaton and I. Quigley, "Experiments on Using Semantic Distances Between Words in Image Caption Retrieval", SIGIR, pp. 174-180, 1996.
[28] C. J. van Rijsbergen, S. E. Robertson, and M. F. Porter, "New Models in Probabilistic Information Retrieval", British Library Research and Development Report No. 5587, 1980.
[29] R. R. Coifman and S. Lafon, "Diffusion maps", Applied and Computational Harmonic Analysis, 21(1): 5-30, 2006.
[30] H. Bast and D. Majumdar, "Why Spectral Retrieval Works", SIGIR, Salvador, Brazil, pp. 11-18, 2005.
[31] A. Torralba, K. P. Murphy, W. T. Freeman, and B. C. Russell, "LabelMe: A Database and Web-Based Tool for Image Annotation", MIT AI Lab, 2005.
[32] G. Amato, C. Gennaro, P. Savino, and F. Rabitti, "Milos: A Multimedia Content Management System for Digital Library Applications", ECDL, Vol. 3232, pp. 14-25, 2004.