Mobile Netw Appl DOI 10.1007/s11036-014-0561-4
Multi-modal Similarity Retrieval with Distributed Key-value Store
David Novak
© Springer Science+Business Media New York 2015
Abstract We propose a system architecture for large-scale similarity search in various types of digital data. The architecture combines contemporary highly scalable distributed data stores with recent efficient similarity indexes and with other types of search indexes. The system enables various types of data access: distance-based similarity queries, standard term and attribute queries, and advanced queries combining several search aspects (modalities). The first part of this work describes the generic architecture and the PPP-Codes similarity index, which is well suited for our system. In the second part, we describe two specific instances of this architecture that manage two large collections of digital images and provide content-based visual search, keyword search, attribute-based access, and their combinations. The first collection is the CoPhIR benchmark with 106 million images accessed by MPEG-7 visual descriptors; the second collection contains 20 million images with complex features obtained from a deep convolutional neural network.

Keywords Similarity search · Multi-modal search · Big Data · Scalability · Distributed hash table
1 Introduction and motivation

D. Novak, Masaryk University, Brno, Czech Republic; e-mail: [email protected]

The nature and volume of data have changed dramatically in recent years, which puts permanent pressure on data management and retrieval techniques. The volume and velocity of Big Data require that database systems be highly scalable in terms of data volume and query throughput, and the variety of contemporary data types calls for novel
search paradigms for meaningful access to data of semi-structured or unstructured character. For these kinds of data, it is often suitable or even essential that the access methods be based on mutual similarity of the data objects, because this corresponds to the human perception of the data or because exact matching would be too restrictive. Similarity-based data retrieval has two fundamental aspects that can be, to a certain extent, treated separately: effectiveness and efficiency. The effectiveness of the retrieval refers to the way in which the data objects are compared and to the quality of this comparison with respect to the application or to the human notion of similarity; the efficiency represents various performance aspects of the retrieval process. Effective similarity comparison often requires approaches that are strongly data- and application-specific; typically, the data is preprocessed to extract descriptors (features, stimuli) that capture the data characteristics important for a given application. Effective methods of similarity comparison for various data types have been studied intensively; for instance, the utilization of deep neural networks recently resulted in a revolution in the area of computer vision and content-based image retrieval (CBIR) [17, 29].

Further in this work, we focus primarily on the efficiency aspect of similarity retrieval, motivated by recent application scenarios. We believe that for a similarity retrieval system to be practically usable, it must also be able to efficiently combine different types of data access; this should cover direct combination of multiple similarity modalities, filtering of query results by attribute values, and re-ranking of a query result by different criteria. Thus, the objective of this work is to propose a universal distributed data management and retrieval system that should 1) be highly scalable in terms of data volume and query throughput, 2) provide online data access
based on a widely applicable similarity model, and 3) also allow traditional attribute- or keyword-based access and an efficient combination of conventional and similarity search. The similarity model we adopt is very generic; it treats the data as unstructured objects together with a distance function that assesses the dissimilarity between each pair of objects from the data domain. The area of distance-based similarity indexing [30] covers many nontrivial tasks that have been the subject of research for many years, leading to a number of interesting results. In Section 2, we analyze the current state of research in this area, including scalable distributed index structures; the general characteristic of current distributed structures is that they organize the data collection according to the similarity relationships of the data items. As we argue, this is not very convenient for an efficient combination of search modalities. The fundamental feature of the architecture proposed in this paper (Section 3) is that the data objects are stored only once, in a central distributed key-value store, and are directly accessible by object ID. Various (similarity) indexes are built “around” this central store; they process incoming queries, typically generating a candidate set of IDs that is then post-processed in the data store in a distributed way. The post-processing can be a simple refinement of the candidate set, re-ranking with the aid of a different modality, or, e.g., refinement combined with additional attribute filters. We assume that the search indexes store only metadata and are capable of managing large collections; recently, a few similarity indexes were proposed that are well suited to our needs because they generate a very small set of candidate objects identified by their IDs [1, 25].
This architecture is generic and very flexible, allowing many access patterns and their combinations; we mention a few general options and describe in detail two instances of this architecture that enable large-scale multi-aspect image retrieval (Section 4). Scalability of the system is mainly assured by the central distributed store,
Fig. 1 General schema of an approximate similarity search index IX
which can be any mature key-value store that provides efficient data distribution, replication and dynamic utilization of hardware resources. Moreover, this store can manage data from different collections, each with its own search indexes; this approach results in effective resource utilization and is suitable for running the application as a service.
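The ID-based sharding at the heart of such a store can be sketched with consistent hashing and virtual nodes. The following is a minimal illustration (node names and the number of virtual nodes are arbitrary), not the mechanism of any particular store:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string key to a point on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes for ID-based sharding."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # (hash point, node) pairs, sorted by hash point
        for node in nodes:
            for v in range(vnodes):
                self._ring.append((_hash(f"{node}#{v}"), node))
        self._ring.sort()
        self._points = [h for h, _ in self._ring]

    def node_for(self, object_id: str) -> str:
        """Return the node responsible for the given object ID."""
        i = bisect.bisect(self._points, _hash(object_id)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
shard = ring.node_for("image_1")  # deterministic node assignment by ID
```

The virtual nodes spread each physical node over many ring segments, so adding or removing a node relocates only the keys of the neighboring segments.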
2 Preliminaries and related work

In this section, we formalize the similarity model and describe current achievements in the field of similarity search and other related areas.

2.1 Distance-based similarity search

We adopt a broad similarity model based on mutual object distances: D is a domain of data and δ is a total distance (dissimilarity) function δ : D × D → R0+ (the non-negative real numbers); we further assume that this space satisfies the metric postulates of identity, symmetry and triangle inequality [30]. A similarity index IX organizes the data collection X ⊆ D so that it can be searched efficiently using the query-by-example paradigm; we mainly focus on the nearest neighbors query k-NN(q), which returns the k objects x ∈ X with the smallest distances δ(q, x) to a given q ∈ D (ties broken arbitrarily). The field of distance-based similarity search has been studied for almost two decades [30]. In general, the problem of efficient distance-based search is nontrivial, and for large data collections it is necessary to assume approximate search, which means that the search result may be an approximation of the precise k-NN answer as defined above. A number of authors have turned their attention in this direction [26]. A typical approach of an approximate distance-based index IX is that the dataset X is split into partitions that are stored on disk. Given a query k-NN(q), the index IX determines “query-relevant” partitions and data
Fig. 2 Schema of similarity search using metadata index IX
from these partitions form a candidate set C(q) ⊆ X, which is retrieved from the disk and refined by explicit evaluation of δ(q, c), c ∈ C(q) – see Fig. 1. There are many examples of efficient indexes of this kind [13, 22, 26, 31]. Recently, a few techniques have emerged that maintain a memory index which does not store the similarity objects themselves but only rich metadata [1, 25]; given a query, the index determines a (very small) set of candidate objects identified by their IDs, and these objects are retrieved from an external storage and checked one by one – see Fig. 2 for a schema of this process. In this work, we exploit this type of index; specifically, we use the PPP-Codes index [25], which is briefly described in Section 3.3.

2.2 Distributed distance-based search

Besides efficient centralized indexes, a number of distributed structures for generic similarity search were proposed during the last decade [5, 14, 19, 23, 24]. The general basis of these techniques is similar to that sketched in Fig. 1, only in a distributed environment: the data collection X is partitioned according to the distance δ and distributed; given a query point, the index can selectively access the query-relevant data partitions in the distributed system and refine them. One of our objectives is to propose a system that allows efficient access based on different search modalities and their combinations. A general disadvantage of partitioning and distributing data according to one search modality is inefficient access by the other ones – via object ID, attributes, or another similarity; moreover, similarity-based data partitioning is typically dynamic, which complicates building efficient indexes on other modalities. An available solution is to keep the data replicated and organized separately for each modality, but this introduces consistency and efficiency issues, and combined multi-aspect search also remains relatively difficult.
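The candidate-refinement scheme of Fig. 2 can be sketched as follows; the metadata index is abstracted away and simply supplies object IDs, and toy 2-D points stand in for real descriptors:

```python
import heapq
from math import dist  # Euclidean distance as the example metric δ

def knn_refine(query, candidate_ids, store, k=10, delta=dist):
    """Refine a candidate set of object IDs (delivered by a metadata index
    such as PPP-Codes) into an approximate k-NN answer by evaluating the
    distance δ(q, c) for every candidate c."""
    scored = ((delta(query, store[cid]), cid) for cid in candidate_ids)
    return heapq.nsmallest(k, scored)  # (distance, ID) pairs, ties arbitrary

# toy 2-D points standing in for real descriptors
store = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)}
result = knn_refine((0.1, 0.0), ["a", "b", "c"], store, k=2)
# → [(0.1, 'a'), (0.9, 'b')]
```

The quality of the approximation depends entirely on whether the true nearest neighbors appear among the candidate IDs; the refinement itself is exact.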
We should also consider the popular MapReduce approach to the distribution of data processing jobs. In the area of similarity indexing and searching, it has been successfully applied to the processing of similarity joins [18, 28] and to similarity operations executed in large batches of jobs [20]. Our primary focus is on distributed systems that
can process individual online-issued similarity queries; the approach of distributed file systems and MapReduce processing, with its overhead, seems unsuitable for this kind of task.

2.3 Multi-modal search

The area of multi-modal search is very well studied, with well-established terminology and many indexing and searching techniques [2]. From this variety of approaches, we focus on query-by-example (k-NN) queries whose answer objects are ranked by a fusion of different search aspects; in particular, we use a technique called (asymmetric) late fusion [2], where a candidate set is determined by one search modality and this set is then refined using a combination of modalities. A special case of such fusion is a query whose result objects are filtered by a given attribute. There are works that integrate similarity indexes with centralized relational databases in order to enable selected multi-modal operations [9, 27]. We take these techniques into account when designing our distributed system.

2.4 Distributed key-value and document stores

Recently, a number of horizontally scalable data stores have emerged that often provide attribute or keyword access to the data objects. These systems typically follow the design of the Amazon Dynamo system [11] and are classified as key-value stores (for instance, Riak1, Redis2, Voldemort3, or Infinispan4) or are designed to manage sets of documents (typically JSON-structured) and are referred to as document stores (MongoDB5, CouchDB6, etc.). Both these types primarily manage and distribute the data objects according to their unique IDs and they often provide secondary
1 http://basho.com/riak/
2 http://redis.io
3 http://www.project-voldemort.com
4 http://www.jboss.org/infinispan/
5 http://www.mongodb.org
6 http://couchdb.apache.org
attribute or full-text indexes to speed up selective access to the data. For instance, MongoDB also provides geo-spatial indexes, but we are not aware of any work that would integrate generic similarity indexes into these distributed stores. Nevertheless, we exploit the strong features of these systems in our proposal.
3 Distributed system for multi-modal similarity search

In this section, we specify our objectives and the data model, and then describe the generic distributed architecture for multi-aspect similarity retrieval on Big Data. Finally, we sketch the principles of the PPP-Codes similarity index, which is key to the proposed distributed system.

3.1 Objectives and data model

In our experience, there are more and more complex data types for which it is meaningful to manage diverse types of meta-information: textual data (annotations, keywords, descriptions), content descriptors that enable distance-based similarity access as described in Section 2.1, and standard “attributes”. We model such compound data types by assuming that each object x is composed of several fields, each x.field being from a data domain Dfield. Every object has a unique identifier x.ID and we can represent the whole object in JSON format as in the following example:

x1 = {
  "ID": "image_1",
  "keywords": "summer, beach, ocean, sun",
  "color_histogram": [25, 36, 0, 17, 69,...],
  "shape_descriptor": [0.35, 1.24, 0.1,...],
  "author": "David Novak",
  "date": 20140327
}

Fig. 3 Schema of a distributed system for multi-modal similarity retrieval
In general, different objects can have different fields. We assume that if a domain Dfield is to be explored via similarity, it has a corresponding distance function δfield; other fields can be used for different types of data access. In general, we assume three types of indexes: a similarity index denoted simI, a keyword (text) index denoted textI, and an attribute index attrI. Further, the overall indexed set X can be partitioned into several collections X = X1 ∪ X2 ∪ · · · ∪ Xs. Then:

– a similarity index simI^field_Xi can process a query k-NN(q.field), q.field ∈ Dfield, that returns the k objects from Xi that are the closest according to distance δfield(q.field, x.field), x ∈ Xi; alternatively (and this is the case we actually investigate), simI^field_Xi returns a candidate set C(q.field) that is further refined (see below);
– a text index textI^field2_Xi indexes a text field field2 of objects from Xi and allows standard keyword-based (full-text) search;
– an attribute index attrI^field3_Xi is assumed to be a standard attribute index for exact or interval queries on field3 from collection Xi.
The area of keyword and attribute search is well established, and we will further focus especially on similarity-oriented indexing and searching.
3.2 Generic system schema

At this point, we can describe the proposed architecture – see Fig. 3 for its schema. The core of the system is a distributed key-value store that manages the whole data collection X according to object IDs – the ID-object pairs are distributed among cooperating nodes by a hashing function, typically using consistent hashing [16]; the concept of virtual nodes can also be used to ensure balanced sharding of the data. Further, data replication methods can be employed to overcome both temporary and fatal failures of individual nodes and to increase read/write query throughput. Practically any of the structures described in Section 2.4 can be used, such as Riak, Infinispan, or others. Various search indexes are connected to this core component, each built on a sub-collection Xi (or on several sub-collections); these indexes can be of the types described above.

3.2.1 Similarity queries
Having a similarity index simI^field_Xi, we do not require that it can fully evaluate k-NN(q.field) queries; rather, we assume usage of the “metadata” index as described in Section 2.1 and sketched in Fig. 2. The processing of a query k-NN(q.field) then follows these steps (also depicted in Fig. 3):

1. the index generates a candidate set C(q.field) composed of object IDs from Xi;
2. this candidate set is then refined in a distributed way with the aid of the key-value core: the set C(q.field) is partitioned according to the IDs and sent to the respective nodes that store parts of C(q.field); the partial candidate sets are refined on these nodes by evaluation of δ(q.field, c), c ∈ C(q.field);
3. the partial answers are gathered back at the initiating node, merged, and the final k-NN(q.field) result is returned.

In this way, we do not simply attach an independent similarity index to a key-value store; we exploit the store to effectively distribute the reading of the candidate set from the disk and its refinement, which is typically the most demanding part of similarity query processing. Generation of the candidate set by the similarity index can become a bottleneck, but we assume that the index can cope with large data collections (see Section 3.3). We assume that the index applies both intra-query and inter-query parallelism to locally speed up candidate set generation. Alternatively, the index can be replicated, or it can actually be a distributed structure like M-Chord [24]. It is clear that the scalability of the core of the system is very high – a new worker node can be added if the data volume or
query traffic grows. This should lead to very good scalability of the whole system, both in terms of stored data and search query traffic.

3.2.2 Attribute and keyword queries

First, the system can naturally answer ID-object queries, which is actually very useful for query-by-example search (like k-NN(q.field)) since it inherently requires having the example q.field at hand. In real systems, the query object q is often from within the dataset X and is specified by the user only by q.ID; in this case, the core key-value store of our system can be used directly to retrieve object q and initiate the similarity search with q.field. Evaluation of keyword and attribute queries is a straightforward usage of the corresponding textI or attrI indexes. Some contemporary key-value or document stores directly provide such “secondary” attribute or full-text indexes to speed up selective access to the data; these built-in indexes can be directly incorporated in our architecture. We do not formalize these queries or their results.

3.2.3 Multi-modal queries

Our system can also process several types of queries that combine multiple search modalities. Their processing exploits the fact that the nodes of the core data store have access to the whole compound data objects; given any set of object IDs, the system can, in a distributed way, retrieve and rank this set according to any ranking function that combines any data fields (aspects, search modalities). In the following, we describe three specific types of multi-modal queries and their processing in the proposed system.

k-NN query with filtering The system can directly process a query k-NN(q.field) where the answer objects must match some additional attribute filter. The index simI^field_Xi is employed to generate the candidate set C(q.field) (as described in Section 3.2.1) and the filtering is then applied during the distributed candidate set refinement, since all attributes are available during this phase.
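A sequential sketch of this filtered, scatter/gather refinement (in the real system, the per-partition work runs in parallel on the storage nodes; names and toy data are illustrative):

```python
import heapq
from math import dist  # Euclidean distance as the example metric δ

def refine_partition(query, candidates, local_store, attr_filter, k, delta=dist):
    """Runs on one storage node: refine the locally stored part of the
    candidate set; the attribute filter costs almost nothing here because
    the whole compound object is already in hand."""
    hits = []
    for cid in candidates:
        obj = local_store[cid]
        if attr_filter(obj):
            hits.append((delta(query, obj["descriptor"]), cid))
    return heapq.nsmallest(k, hits)

def filtered_knn(query, candidate_ids, partitions, attr_filter, k=10):
    """Coordinator: scatter the candidate IDs to the partitions that hold
    them, refine each part (sequentially here), then merge the partials."""
    partials = []
    for local_store in partitions:
        local = [c for c in candidate_ids if c in local_store]
        partials.extend(refine_partition(query, local, local_store,
                                         attr_filter, k))
    return heapq.nsmallest(k, partials)

# two toy partitions of the ID-object store; "author" is the filter attribute
partitions = [
    {"a": {"descriptor": (0.0, 0.0), "author": "novak"},
     "b": {"descriptor": (1.0, 0.0), "author": "smith"}},
    {"c": {"descriptor": (0.2, 0.0), "author": "novak"}},
]
hits = filtered_knn((0.0, 0.0), ["a", "b", "c"], partitions,
                    lambda o: o["author"] == "novak", k=2)
# "b" is dropped by the filter; "a" and "c" are ranked by distance
```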
Additional costs of the filtering are negligible in comparison to the other processing phases.

Fusion query For many data types, e.g. multimedia, a direct combination of several search modalities seems necessary to achieve satisfactory results. As mentioned in Section 2.3, the principles of multi-modal search have been studied in the literature, and our system inherently allows the so-called late fusion approach [2]. Let us assume processing of a k-NN(q) query under a combination function f over field1 and field2, which should retrieve the k objects x ∈ Xi that are the most similar to q according to f, which combines distances δfield1(q.field1, x.field1) and
δfield2(q.field2, x.field2). Having similarity indexes simI^field1_Xi and simI^field2_Xi, our system can separately retrieve candidate sets C(q.field1) and C(q.field2) and refine them according to the combination function f. Again, this refining ranking can be done efficiently in a distributed way. This approach does not give guarantees on the completeness of the answer – it is an approximation of the precise answer.
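A minimal sketch of this late-fusion refinement; the combination function f, the weights, and the field names are illustrative:

```python
def late_fusion_knn(q, cand1, cand2, store, delta1, delta2, f, k=10):
    """Approximate k-NN(q) under the fused distance f(δ1, δ2): take the
    union of the candidate sets delivered by the two single-modality
    indexes and re-rank it by the combination function (late fusion)."""
    union = set(cand1) | set(cand2)
    ranked = sorted(
        (f(delta1(q["field1"], store[c]["field1"]),
           delta2(q["field2"], store[c]["field2"])), c)
        for c in union)
    return ranked[:k]

# toy objects with two scalar "descriptors" and absolute-difference distances
store = {"x": {"field1": 0.0, "field2": 10.0},
         "y": {"field1": 5.0, "field2": 0.0}}
q = {"field1": 1.0, "field2": 1.0}
absd = lambda a, b: abs(a - b)
f = lambda d1, d2: 0.6 * d1 + 0.4 * d2   # illustrative weights
top = late_fusion_knn(q, ["x"], ["y"], store, absd, absd, f, k=2)
# "y" wins: its fused distance 0.6*4 + 0.4*1 is below x's 0.6*1 + 0.4*9
```

Note the approximation: an object ranked highly by f but absent from both candidate sets can never appear in the answer.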
Re-ranking Another useful search mode is to allow the user to initiate re-ranking of a query result by a different (combined) modality; this can be realized in our system efficiently, in a distributed manner, in the same way as described above. Naturally, the insert and delete operations in the proposed system must be realized on the search indexes of the corresponding sub-collection Xi, and the compound object x is inserted into or deleted from the central store according to x.ID. Any updates of the stored data objects (adding, removal, or modification of individual fields) are realized directly in the central store (the data is stored only once) and, if need be, the search index on the particular field is updated.

3.3 PPP-Codes similarity index

The key to efficient similarity search in the proposed system is the similarity index simI that determines a candidate set of objects close to a given query object q. Recently, we have proposed a technique called PPP-Codes [25] that is well suited for this purpose. Let us briefly describe this technique; to simplify the notation, we use the problem formulation from Section 2.1 with a simple search domain (D, δ). The basic task of a distance-based search technique is to partition the indexed collection X ⊆ D only with the aid of the black-box pair-wise distance δ : D × D → R0+. The majority of distance-based approaches use pivots – objects selected from X (or from D) that form certain anchors for data space partitioning and search space pruning [30]. The PPP-Codes use a static set of pivots and apply recursive Voronoi partitioning of the data space [22, 25]; such a partitioning is sketched in Fig. 4 (top) for four pivots. The thick solid lines depict borders between standard Voronoi cells (points x ∈ D for which pivot pi is the closest one) and the dashed lines further partition each cell using the other pivots.
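Such a recursive Voronoi cell can be computed simply by ordering the pivots by their distance to an object; a sketch with 0-based pivot indexes and one-dimensional toy data:

```python
def ppp(obj, pivots, delta, length):
    """Pivot permutation prefix: indexes of the `length` pivots closest
    to `obj`; this tuple identifies the recursive Voronoi cell of `obj`."""
    order = sorted(range(len(pivots)), key=lambda i: delta(obj, pivots[i]))
    return tuple(order[:length])

# 1-D toy space with four pivots and absolute-difference distance
cell = ppp(4.5, [0.0, 2.0, 5.0, 9.0], lambda a, b: abs(a - b), 2)
# → (2, 1): pivot 2 is the closest, pivot 1 the second closest
```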
This principle is used in several techniques [10, 13, 22] and it is usually formalized as pivot permutations (PPs) and pivot permutation prefixes (PPPs); specifically, each recursive Voronoi cell is identified by the indexes of the closest pivots; for instance, cell C4,1 contains all points for which pivot p4 is the closest one and p1 the second closest. These vectors of indexes (PPPs) are the foundation of the
Fig. 4 Top: Example of second-level Voronoi partitioning using four pivots. Bottom: Rank aggregation by percentile of candidate ranks
PPP-Codes approach. Based on this principle, a hierarchical index structure is proposed that organizes the data and supports the search algorithms; this index structure does not actually store the data objects x ∈ X but only their PPPs and the corresponding object IDs. Given such a space partitioning, another task is to identify the cells relevant to a given query object q ∈ D. This is done based on the query-pivot distances (depicted also in Fig. 4, top); in complex data spaces, it is not easy to decide which cells are the closest to the query point, and thus a relatively large number of objects x ∈ X must be accessed (they form the candidate set) and then refined by direct evaluation of distance δ(q, x). The complexity of the search task is caused by the fact that the data partitions typically span relatively large areas of the space, and thus the candidate sets are either large or imprecise. The key idea of the PPP-Codes is to use several independent partitionings of the data space; given a query, each partitioning generates a ranked candidate set, and the PPP-Codes index has a way to effectively and efficiently aggregate these rankings. This aggregation is exemplified in Fig. 4 (bottom); each of the five space partitionings generates a candidate ranking of the indexed objects ψq^j, j ∈ {1, . . . , 5}, and the final aggregated rank of each object is determined as a certain percentile of its candidate ranks. In Fig. 4 (bottom), the median (0.5-percentile) of the five ranks of object x is 3 and thus the final rank Ψ^0.5(q, x) = 3.
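The percentile-based rank aggregation of the example above can be sketched as follows (a simple index-based percentile; the actual PPP-Codes algorithm computes these ranks incrementally without materializing full rankings [25]):

```python
def aggregated_rank(candidate_ranks, percentile=0.5):
    """Aggregate the per-partitioning ranks of one object by taking the
    given percentile; 0.5 picks the median for an odd number of ranks."""
    ranks = sorted(candidate_ranks)
    idx = min(int(percentile * len(ranks)), len(ranks) - 1)
    return ranks[idx]

# five partitionings rank object x at positions 7, 3, 2, 9 and 3;
# the median of these candidate ranks is 3, as in Fig. 4 (bottom)
final_rank = aggregated_rank([7, 3, 2, 9, 3])   # → 3
```

An object must be ranked well by several partitionings to obtain a good aggregated rank, which is what shrinks the candidate set without losing accuracy.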
The PPP-Codes propose an indexing structure and an algorithm to efficiently compute the individual candidate ranks and their aggregation [25]. This aggregation results in a significant shrinkage of the candidate set while preserving its accuracy; the exact numbers are data dependent but, for instance, to achieve 90 % recall with respect to precise 10-NN search, the average candidate set size must be about 0.01 % of the dataset (measured on three diverse datasets) [25]. This result is about two orders of magnitude smaller than the results of a single pivot space partitioning [13, 22, 25].
4 Large-scale content-based image management

Let us now instantiate the general system architecture proposed in the previous section by two specific applications that manage large collections of digital images and provide various types of access to this data, with a primary focus on content-based image retrieval (CBIR). The first variant of the system is realized on the very large benchmark dataset CoPhIR (Section 4.1); the other is built on our own dataset, which is very interesting from the effectiveness point of view but very challenging from the efficiency one (Section 4.3).

4.1 100M CoPhIR visual search

The CoPhIR dataset [3, 7] is a benchmark in large-scale CBIR. It contains rich metadata for 106 million images downloaded from Flickr7; in particular, it has five content descriptors suitable for global visual similarity comparison of the images. An image record can be represented as follows:

{
  "ID": "002561195",
  "title": "Me & my wife on Gold Coast",
  "tags": "summer, beach, ocean, sun, sand",
  "mpeg7_scalable_color": [25, 6, 0, 69,...],
  "mpeg7_color_layout": [[25,...], [32,...]],
  "mpeg7_color_structure": [0, 0,1, 25,...],
  "mpeg7_edge_histogram": [5, 1, 2, 6,...],
  "mpeg7_hom_texture": [232, 201, [198,...]],
  "GPS_coordinates": [45.50382, -73.59921],
  "flickr_user": "david_novak"
}

First, let us have a look at the overall architecture of the system as sketched in Fig. 5. The core of the system is a distributed ID-object store implemented using Infinispan8 – a Java-based project that provides an API for executing operations on those nodes that manage certain subsets of keys; this is exactly the operation we need for post-processing of the candidate sets from the search indexes (see Section 3.2). Each worker node of the distributed structure is composed of two layers: one participates in the core ID-object store and the other contains the search index(es). In the figure, the specific indexes are magnified; they are described below.

Visual Similarity Index The CoPhIR objects contain fields with five MPEG-7 global visual descriptors [21] exploited for content-based visual similarity search in our system; the first three descriptors capture the color characteristics of the image and the other two are texture descriptors. There is a similarity measure recommended for each of the descriptors [3, 21] and we build a single search index on a combination of these descriptors; this combination is realized as a weighted sum of distances between the corresponding descriptors of the query and data objects [3, 4]. We use the PPP-Codes index (Section 3.3) to build a similarity search index on this combination – it is denoted simI^mpeg7 in Fig. 5. As mentioned in Section 3.2, the index generates a candidate set of image IDs and, during its refinement, it can be further filtered and/or re-ranked by a combination with another search modality (e.g. annotations).

Annotation Search The CoPhIR dataset contains several text annotation fields; we have built a Lucene9 index on the tags and titles. This index textI^tags,title is used 1) for standard annotation-based search on the images and 2) for search that combines text and visual search in the following way: given a text search result, the user can issue a query that re-ranks this result with respect to visual similarity to a given selected example image. This re-ranking is done in a distributed manner on the Infinispan nodes.

GPS Location Search About 8 % of the CoPhIR images contain information about their GPS location [7]. We can build the PPP-Codes index simI^GPS on the geographic distance for this subset of images. Again, the GPS location information can also be used during the processing of candidate sets from other modalities.

Attribute Search In order to demonstrate the universality of our approach, we can build any standard B+-tree or hash table index on attribute fields like flickr_user; such an index attrI^flickr_user would contain object IDs that can be further processed by the Infinispan store.
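The geographic distance behind the GPS Location Search is not specified in the text; one common metric choice is the haversine great-circle distance, which satisfies the metric postulates and can therefore serve as δ for a PPP-Codes index. A minimal sketch:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in
    degrees; a metric, so it can back a distance-based similarity index.
    (The haversine choice is an assumption, not stated in the text.)"""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km: mean Earth radius

montreal = (45.50382, -73.59921)  # GPS_coordinates from the example record
```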
7 http://www.flickr.com
8 http://www.jboss.org/infinispan/
9 http://lucene.apache.org
Fig. 5 Schema of the distributed system for large-scale image retrieval on CoPhIR (each worker combines an Infinispan (Ispn) node of the ID-object store with a search index; the indexes shown are the PPP-Codes simI^mpeg7, the Lucene textI^tags,title, the PPP-Codes simI^GPS, and a B+-tree attrI^flickr_user)
4.2 Performance of 100M CoPhIR visual search

Let us have a closer look at the performance of the above-mentioned image management system, with a focus on the MPEG-7 similarity search. The whole system has been implemented in Java with the aid of the MESSIF similarity search framework [6]. The system is very flexible in terms of the hardware infrastructure it can run on, ranging from a single node to large clusters.

Settings The whole CoPhIR dataset occupied over 100 GB of disk space, of which the five MPEG-7 descriptors constitute over half. For the purpose of a basic efficiency evaluation, we ran the system on a single server with an 8-core Intel Xeon @ 2.0 GHz, 12 GB of main memory and a SATA SSD disk (measured transfer rate about 300 MB/s with random accesses). The PPP-Codes index simI^mpeg7 occupied around 4.5 GB of memory, with five independent pivot spaces, each with 512 pivots (see Section 3.3 or the PPP-Codes paper [25]). The two main measures that we observe are the recall, which is the percentage of the precise k-NN answer returned by the approximate k-NN search, and the response time – the wall-clock time of the whole query processing pipeline. All results are averaged over 1000 queries randomly selected from the data collection (and excluded from the actually indexed dataset); the disk caches are dropped after each batch of 1000 queries.

Search Efficiency Results The key parameter of the similarity search, which influences both the recall and the response time, is the size of the candidate set C(q.mpeg7) obtained from
the simI^mpeg7 index. Figure 6 shows the development of the k-NN recall for several values of k (left vertical axis) and the search time (right vertical axis) with respect to the candidate set size (horizontal axis). We can see that our approach can achieve very high recall while accessing thousands of objects out of 106 million. The recall grows very steeply in the beginning, achieving almost 90 % for 1-NN and 10-NN around |C(q)| = 5000; as expected, the time grows practically linearly. For instance, for |C(q)| = 5000, the average response time is about 750 ms and the system can resolve about three queries per second.

4.3 Visual search with deep neural networks

Let us describe another CBIR application built using the proposed generic architecture. Unlike in the first example,
Fig. 6 Recall and search times of k-NN(q.mpeg7) queries on the system with 106M CoPhIR images
Fig. 7 Examples of visual queries using the DeCAF descriptors. The copyright of the depicted images belongs to their authors; the images are used purely for research purposes according to the Profiset usage agreement at http://disa.fi.muni.cz/results/software/profiset/
we do not use the standard MPEG7 global visual descriptors but the DeCAF features [12] – cutting edge technique for measuring visual similarity of images. These features are based on a very successful image classifier that exploits convolutional deep neural networks [17] and that gained significant attention by winning the 2012 ImageNet challenge, defeating other approaches by a significant margin [17]. This neural network was trained on about 1.2M images classified into 1000 categories. However, it was soon observed that intermediate outputs of hidden layers of the network can be used as features for assessment of general image similarity for various types of images – even without retraining the neural network anyhow [12, 17]. Specifically, we use the DeCAF7 feature produced by the last hidden layer of the neural network model provided
by Caffe10 [15], which followed the training procedure described in the original paper [17]. We have extracted this descriptor from a collection called Profiset11 consisting of 20 million high-quality images provided for research purposes by a microstock photography company [8]. Each of the images is accompanied by a set of keywords, and we have also extracted its dominant color as an MPEG-7 descriptor [21], which can serve for searching (or filtering) the images by color palette. An overall image entry has the following JSON representation:

{
  "ID": "0009876549",
  "keywords": ["summer", "beach", "ocean"],
  "decaf7": [5.431, 0.0042, 0.0, 0.97, ... ],
  "dom_color": [[0x9E, 0xC2, 0x13, 0.3], ... ]
}
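As a sketch of how such a record can be handled in memory (the ImageEntry class and its field names are our illustrative constructs mirroring the JSON above, not part of MESSIF), together with the Euclidean distance used later for the decaf7 field:

```java
import java.util.List;

// Illustrative in-memory form of one image record; fields mirror the JSON entry above.
class ImageEntry {
    final String id;
    final List<String> keywords;
    final float[] decaf7;   // 4096-dimensional DeCAF7 feature vector

    ImageEntry(String id, List<String> keywords, float[] decaf7) {
        this.id = id;
        this.keywords = keywords;
        this.decaf7 = decaf7;
    }

    // Euclidean (L2) distance between two DeCAF7 vectors.
    static double euclidean(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```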
The DeCAF7 descriptor of an image is a 4096-dimensional vector of real numbers, and thus the whole image record occupies almost 20 KB on disk; the whole 20M dataset thus amounts to some 400 GB of uncompressed data.

Indexes and queries The overall system architecture is very similar to the one described in Section 4.1 and depicted in Fig. 5; only the specific indexes differ, corresponding to the actual data fields:
Fig. 8 Recall of approximate k-NN search on collection of 20 million 4096-dimensional vectors using PPP-Codes index sim I decaf7
10 http://caffe.berkeleyvision.org 11 http://disa.fi.muni.cz/profiset/
Fig. 9 Left: average times of candidate set generation using sim I decaf7 . Right: average search times of k-NN(q.decaf 7) on system with one node
– sim I decaf7 is a PPP-Codes index on the decaf7 field; the similarity distance δ decaf7 is the Euclidean distance, which is recommended by the original papers [12, 17];
– sim I dom color is a distance-based PPP-Codes index that can match images according to their most significant colors; the MPEG-7 standard [21] recommends a specific similarity distance δ dom color, which we use; the dominant color descriptor can be well used in combination with the decaf7 similarity search;
– text I keywords is a Lucene-based index that can be used in the same way as described in Section 4.1.
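To illustrate how these indexes combine in a late-fusion query (a simplified single-process sketch: the class name is ours and plain HashMaps stand in for the distributed Infinispan ID-object store), a keyword-filtered visual k-NN can be expressed as refining a candidate set of IDs produced by the visual index:

```java
import java.util.*;
import java.util.stream.Collectors;

// Simplified late-fusion search: a candidate set of object IDs produced by the
// visual index is refined against the ID-object store and filtered by a keyword.
class LateFusionSearch {
    // Stand-ins for the distributed Infinispan ID-object store.
    final Map<String, float[]> decaf7Store = new HashMap<>();
    final Map<String, Set<String>> keywordStore = new HashMap<>();

    void put(String id, float[] decaf7, Set<String> keywords) {
        decaf7Store.put(id, decaf7);
        keywordStore.put(id, keywords);
    }

    // Keep candidates containing the keyword, rank by L2 distance, return the top k IDs.
    List<String> knnFiltered(float[] query, List<String> candidateIds, String keyword, int k) {
        return candidateIds.stream()
                .filter(id -> keywordStore.get(id).contains(keyword))
                .sorted(Comparator.comparingDouble((String id) -> l2(query, decaf7Store.get(id))))
                .limit(k)
                .collect(Collectors.toList());
    }

    static double l2(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```

The same refinement skeleton covers post-ranking by the dominant color descriptor: the filter predicate or the ranking distance is simply swapped for the other modality.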
Application of convolutional deep neural networks in the area of content-based image retrieval was a breakthrough because the results match the human notion of general image similarity remarkably well. See Fig. 7 for example results of four 6-NN similarity queries on the Profiset (the first image is always the query object). A proper evaluation of the similarity effectiveness is beyond the scope of this paper, but our experience confirms the extremely high quality of the search results achieved by this approach [15]; besides purely visual information, the descriptors seem to carry certain semantic information about the image.
Performance Management of such a dataset for efficient k-NN search is a nontrivial task because of the data volume (400 GB of uncompressed data) and because of the complexity of the search space (4096 dimensions). The experiments were realized on one to four computational nodes, each with the same parameters as specified in Section 4.2. First, let us have a look at the efficiency of the PPP-Codes similarity index sim I decaf7 per se, measured in the same way as in the previous sections – the dependence of the k-NN recall on the size of the candidate set C(q.decaf7). These results are depicted in Fig. 8 for different values of k (averages over 1000 random queries that are not in the data collection). We can see that even though the search space is very complex, the PPP-Codes can achieve high recall values with candidate sets under 0.05 % of the dataset size (10,000 objects out of 20 million).

The left graph in Fig. 9 shows the processing times of the candidate set generation by the sim I decaf7 index. The horizontal axis shows the size of the generated set C(q.decaf7), and the individual curves in the graph correspond to processing times for query loads with various query frequencies; the load is simulated by issuing queries at regular intervals according to the required frequency (e.g., a 250 ms interval for 4 queries per second). We can see that for light query loads, the index search times grow linearly with |C(q.decaf7)| (as expected), and the index starts to be overloaded only at a frequency of 6 queries per second and more, and only for larger candidate sets. The PPP-Codes index resides in memory and exploits multi-threading for both intra-query and inter-query parallelism [25]; thus, the query throughput depends only on the number of CPU cores of the machine.
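The constant-frequency query load used in these experiments (e.g., one query every 250 ms for 4 queries per second) can be simulated along the following lines (a sketch; the class and method names are ours, not the MESSIF API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Simulates a fixed-rate query load: queries are issued at regular intervals
// given by the required frequency (e.g., 4 queries per second -> 250 ms interval),
// independently of how long each query takes to complete.
class QueryLoad {
    // Planned submission times (ms from start) for 'count' queries at the given rate.
    static List<Long> schedule(int count, double queriesPerSecond) {
        long intervalMs = Math.round(1000.0 / queriesPerSecond);
        List<Long> times = new ArrayList<>(count);
        for (int i = 0; i < count; i++) times.add(i * intervalMs);
        return times;
    }

    // Issues the query task at the fixed rate on a background scheduler.
    static ScheduledFuture<?> run(Runnable query, double queriesPerSecond) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        long intervalMs = Math.round(1000.0 / queriesPerSecond);
        return pool.scheduleAtFixedRate(query, 0, intervalMs, TimeUnit.MILLISECONDS);
    }
}
```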
Fig. 10 Average search times of approximate k-NN search on 20 million DeCAF descriptors for different number of refining nodes
The right graph in Fig. 9 shows the overall wall-clock times of the full k-NN(q.decaf7) search on a system with one node; the node hosts both the sim I decaf7 index and the Infinispan storage. Again, individual curves correspond to different query frequencies (this time only up to 4 queries per second); the process of candidate set generation (the left graph) is also included in these times. We can see that, for instance, for a candidate set size of 8000, the system can handle up to two simultaneous queries per second; beyond that, the queries start to queue up, which results in an unacceptable increase of the average response times. The next graph, in Fig. 10, shows the same search times, but this time the individual curves correspond to different numbers of nodes in the distributed ID-object store, which forms the core of the system. The frequency of the queries is one per second, so this graph shows how the distributed candidate set refinement can speed up the processing of a single query. We can see that the influence of the distribution is significant only for larger candidate sets; this is caused by the fixed costs of the candidate set generation (Fig. 9, left) and by the additional costs of inter-node communication. In the next experiment, we set the candidate set size to 8000, which corresponds to recall values of almost 90 % for 10-NN, and we test how the distribution of the system influences the query throughput. Figure 11 shows the average query response times for variable query frequencies (horizontal axis) and for different numbers of nodes in the Infinispan core. We can see that the distributed system can clearly speed up the overall performance so that higher query frequencies can be handled gracefully. If we double the number of nodes in the system, the query throughput is not fully doubled, which is again caused by the fixed costs of the query processing and by the costs of communication among the nodes.

Fig. 11 Average search times of k-NN search on 20 million DeCAF descriptors with different query load; candidate set size 8000

5 Conclusions
An efficient generic similarity search at a large scale is one of the desired goals in the area of data management, since it would be very beneficial for many information retrieval tasks. We have described a flexible architecture that combines the latest similarity indexing techniques with contemporary highly scalable data stores. The core of the system is a distributed key-value store that organizes objects from one or more data collections based on unique object IDs. To this core, various secondary indexes can be connected, with a focus on generic distance-based similarity search. Given a query, the index produces a candidate set of object IDs, which is refined in a distributed manner by the core data store. The data objects can have various fields of different types, and the system supports especially the following types of search queries: 1) similarity queries by example on a single search modality, 2) similarity search enriched by filtering or post-ranking with the aid of other modalities (multi-modal search by so-called late fusion), and 3) re-ranking of a search query result by different ranking criteria. We have built two specific systems that manage image collections for content-based image retrieval. The first collection is the CoPhIR benchmark with 106 million images together with five MPEG-7 visual descriptors [7] and other metadata such as tags or geographic location. The second dataset is composed of 20 million images with complex and powerful visual DeCAF features obtained using deep convolutional neural networks [17]; every image record also contains keywords and information about dominant colors. Both prototype systems use the Infinispan data grid as the core distributed key-value store and the PPP-Codes index [25] for visual similarity search, for geographic distance search, and for similarity on the dominant color descriptors; further, we use Lucene for search on annotations and a B+-tree index for attribute data.
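The candidate-refinement step summarized above can be sketched as follows (a simplified single-process model: the "nodes" are plain maps and the partitioning by ID hash stands in for Infinispan's consistent hashing; all names are illustrative):

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of distributed candidate refinement: each node refines the part of the
// candidate set it stores, and the partial results are merged into the final top-k.
class DistributedRefinement {
    final List<Map<String, float[]>> nodes;  // one ID-object map per node

    DistributedRefinement(int nodeCount) {
        nodes = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) nodes.add(new HashMap<>());
    }

    // Objects are partitioned among nodes by ID hash (stand-in for consistent hashing).
    void put(String id, float[] vector) {
        nodes.get(Math.floorMod(id.hashCode(), nodes.size())).put(id, vector);
    }

    // Each node ranks its share of the candidates in parallel; results are merged.
    List<String> refine(float[] query, Collection<String> candidateIds, int k) {
        return nodes.parallelStream()
                .flatMap(node -> candidateIds.stream()
                        .filter(node::containsKey)
                        .map(id -> Map.entry(id, l2(query, node.get(id)))))
                .sorted(Map.Entry.comparingByValue())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    static double l2(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```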
The similarity search in our systems is very efficient, achieving online response times even on the very large and complex DeCAF neural network descriptors. If the search queries are issued in parallel with a high frequency, the refinement of the candidate set starts to slow the system down, and this is the point where the distributed nature of the central data store can increase the query throughput of the system. According to our performance experiments, the speedup achieved by adding nodes (the scalability) is significant especially for higher query frequencies. The system for visual search on the DeCAF descriptors of 20 million images is available as an online demonstration application12.

12 http://disa.fi.muni.cz/demos/profiset-decaf/
Acknowledgements This work was supported by the Czech Research Foundation project P103/12/G084.
References

1. Amato G, Gennaro C, Savino P (2012) MI-File: Using inverted files for scalable approximate similarity search. Multimed Tools Appl:1–30
2. Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS (2010) Multimodal fusion for multimedia analysis: A survey. Multimed Syst 16:345–379
3. Batko M, Falchi F, Lucchese C, Novak D, Perego R, Rabitti F, Sedmidubsky J, Zezula P (2010) Building a web-scale image similarity search system. Multimed Tools Appl 47(3):599–629
4. Batko M, Kohoutkova P, Novak D (2009) CoPhIR Image Collection under the Microscope. In: Proceedings of SISAP 2009. IEEE Computer Society, pp 47–54
5. Batko M, Novak D, Falchi F, Zezula P (2006) On scalability of the similarity search in the world of peers. In: Proceedings of InfoScale ’06. ACM Press, New York, p 12
6. Batko M, Novak D, Zezula P (2007) MESSIF: Metric Similarity Search Implementation Framework. In: Digital Libraries: Research and Development, LNCS, vol 4877. Springer, pp 1–10
7. Bolettieri P, Esuli A, Falchi F, Lucchese C, Perego R, Piccioli T, Rabitti F (2009) CoPhIR: A Test Collection for Content-Based Image Retrieval. CoRR abs/0905.4
8. Budikova P, Batko M, Zezula P (2011) Evaluation Platform for Content-based Image Retrieval Systems. In: International Conference on Theory and Practice of Digital Libraries, LNCS. Springer, Berlin, Heidelberg, pp 130–142
9. Budikova P, Batko M, Zezula P (2012) Query language for complex similarity queries. In: Advances in Databases and Information Systems, LNCS. Springer, Berlin, Heidelberg, pp 85–98
10. Chávez E, Figueroa K, Navarro G (2008) Effective Proximity Retrieval by Ordering Permutations. IEEE Trans Pattern Anal Mach Intell 30(9):1647–1658
11. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's Highly Available Key-value Store. ACM SIGOPS Oper Syst Rev 41(6):205–220
12. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
13. Esuli A (2012) Use of permutation prefixes for efficient and scalable approximate similarity search. Inf Process Manag 48(5):889–902
14. Gil-Costa V, Marin M (2011) Approximate Distributed Metric-Space Search. In: Proceedings of LSDS-IR ’11, Glasgow, UK, October 28. ACM Press, New York, pp 15–20
15. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv:1408.5093
16. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees. In: Proceedings of STOC ’97. ACM Press, New York, pp 654–663
17. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet Classification with Deep Convolutional Neural Networks. Adv Neural Inf Process Syst:1106–1114
18. Lu W, Shen Y, Chen S, Ooi B (2012) Efficient processing of k nearest neighbor joins using MapReduce. Proceedings of the VLDB Endowment:1016–1027
19. Malkov Y, Ponomarenko A, Logvinov A, Krylov V (2012) Scalable Distributed Algorithm for Approximate Nearest Neighbor Search Problem in High Dimensional General Metric Spaces. In: Similarity Search and Applications, LNCS, vol 7404. Springer, Berlin, Heidelberg, pp 132–147
20. Moise D, Shestakov D, Gudmundsson G, Amsaleg L (2013) Terabyte-scale Image Similarity Search: Experience and Best Practice. In: 2013 IEEE International Conference on Big Data, pp 674–682
21. MPEG-7 (2002) Multimedia content description interfaces. Part 3: Visual. ISO/IEC 15938-3:2002
22. Novak D, Batko M, Zezula P (2011) Metric Index: An Efficient and Scalable Solution for Precise and Approximate Similarity Search. Inf Syst 36(4):721–733
23. Novak D, Batko M, Zezula P (2012) Large-scale similarity data management with distributed Metric Index. Inf Process Manag 48(5):855–872
24. Novak D, Zezula P (2006) M-Chord: A Scalable Distributed Similarity Search Structure. In: Proceedings of InfoScale ’06. ACM Press, New York, pp 1–10
25. Novak D, Zezula P (2014) Rank Aggregation of Candidate Sets for Efficient Similarity Search. In: Database and Expert Systems Applications: 25th International Conference, DEXA 2014, Proceedings, Part II, LNCS, vol 8645. Springer, pp 42–58
26. Patella M, Ciaccia P (2009) Approximate similarity search: A multi-faceted problem. J Discrete Alg 7(1):36–48
27. Silva YN, Pearson SS, Cheney JA (2013) Database Similarity Join for Metric Spaces. In: Similarity Search and Applications, pp 266–279
28. Silva YN, Reed JM (2012) Exploiting MapReduce-based similarity joins. In: Proceedings of SIGMOD ’12. ACM Press, New York, p 693
29. Wan J, Wang D, Hoi S, Wu P, Zhu J, Zhang Y, Li J (2014) Deep Learning for Content-Based Image Retrieval: A Comprehensive Study. In: Proceedings of the 22nd ACM International Conference on Multimedia
30. Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity Search: The Metric Space Approach. Advances in Database Systems, vol 32. Springer
31. Zezula P, Savino P, Amato G, Rabitti F (1998) Approximate similarity retrieval with M-Trees. VLDB J 7(4):275–293