Information Processing and Management 48 (2012) 855–872
Large-scale similarity data management with distributed Metric Index
David Novak, Michal Batko, Pavel Zezula
Masaryk University, Brno, Czech Republic
Article info
Article history: Received 10 February 2010; Received in revised form 4 December 2010; Accepted 19 December 2010; Available online 26 January 2011.
Keywords: Distributed data structures; Performance tuning; Similarity search; Scalability; Peer-to-peer structured networks; Metric space.
Abstract
Metric space is a universal and versatile model of similarity that can be applied in various areas of non-text information retrieval. However, a general, efficient and scalable solution for metric data management is still an open research challenge. In this work, we try to make an important step towards such a management system that would be able to scale to data collections of billions of objects. We propose a distributed index structure for similarity data management called the Metric Index (M-Index), which can answer queries in a precise and approximate manner. This technique can take advantage of any distributed hash table that supports interval queries and utilize it as an underlying index. We have performed numerous experiments to test various settings of the M-Index structure and we have proved its usability by developing a full-featured, publicly available Web application.
1. Introduction

Multimedia information retrieval is one of the computer science areas in which current data-management technologies lag behind user desiderata. Recently, we have witnessed that two major Internet search engines, Google and Bing, introduced a brand new service – finding images that are visually similar to a selected example image. Taking a closer look at the process behind them, these services do not execute a classic nearest-neighbor similarity search on the full image databases – they both apply similarity ranking methods on a collection of images selected by a preceding keyword search. Public estimates say that the image database managed by Google contains billions or tens of billions of items – currently, there is no generally used data-management technology that would provide a full-fledged similarity search on such complex data and would scale to these data volumes. In this work, we try to make an important step towards such a database system. We first propose a distributed version of a data structure for similarity management called the Metric Index (M-Index) (Novak, Batko, & Zezula, in press) that combines the valued properties of the structured peer-to-peer networks with the ability to search by similarity defined by a general metric. Then we thoroughly analyze its efficiency from various points of view – the performance tests are conducted on a collection of half a billion MPEG-7 global image descriptors (MPEG-7, 2002) from the CoPhIR dataset (Bolettieri et al., 2009) (five descriptors extracted from each of the 100 million Flickr images). Finally, we analyze the performance behavior of such a large-scale distributed system within a fully functional Web application – we focus on real query response times, throughput, and building costs on various hardware infrastructures. The M-Index approach adopts the metric space as a universal and versatile model of similarity that can be applied in various areas of non-text information retrieval. The core of the M-Index is a general mapping from the data space to a numeric domain. This approach enables the data to be actually stored in well-established structures such as the B+-tree, or a distributed system to be built based on any structure for efficient key-object management such as P-Grid (Aberer, 2001) or Skip
Graphs (Aspnes & Shah, 2003). We propose novel algorithms for both precise and approximate strategies for similarity search on the distributed M-Index and we also describe a way to combine the distributed and the centralized M-Index variants. The efficiency and performance evaluation of the proposed system results in several findings. The precise similarity search scales to hundred-million data volumes with expected costs development as the dataset size increases – the overall costs grow slightly sublinearly. Nevertheless, nearly-constant parallel costs can be achieved if the hardware resources grow simultaneously with the data volume. To the best of our knowledge, this work is the first to provide experiments with precise similarity search on dataset sizes of such order of magnitude. Furthermore, the approximate strategy of M-Index delivers query answers of a stable quality even when the dataset size increases by two orders of magnitude while keeping the search costs fixed. The approximate answer recall is almost constant for close nearest-neighbors of the query point. The real-life analysis proved that the distributed M-Index system can run on a wide variety of hardware infrastructures from a singleCPU machine to a large computer cluster – mapping of the system to specific hardware can be used for tuning the system performance in terms of response times and query throughput. Section 2 provides discussion about current theoretical works and existing systems that have objectives similar to ours. In Section 3, the distributed M-Index is introduced – its principles, architecture, and search algorithms. Section 4 evaluates the efficiency of the structure and its algorithms and Section 5 examines its performance in real-application conditions and on various hardware infrastructures. Finally, Section 6 summarizes and concludes the text. 2. Related work The metric-based data management has become a significant research stream in the last decade (Zezula, Amato, Dohnal, & Batko, 2006, 2005). The research has explored the principles of metric indexing and has developed both basic and advanced index structures. In general, the precise metric-based similarity search is relatively expensive and the costs typically grow linearly with the dataset size (Zezula et al., 2006). This challenge is being solved: (1) by sacrificing the preciseness and adopting approximate search strategies, (2) by designing distributed data structures, or (3) by a combination of these two approaches. A well-established metric indexing structure M-Tree (Ciaccia, Patella, & Zezula, 1997), which is a balanced disk-oriented tree-structure, uses the first approach by proposing approximate evaluation strategies (Zezula, Savino, Amato, & Rabitti, 1998). We have shown that M-Index is more efficient for both precise and appropriate search in Novak et al. (in press) by numerous experiments. Recently, several approximate techniques emerged (Amato & Savino, 2008, Chávez, Figueroa, & Navarro, 2008, 2009b) that index data based on a similar principle as the M-Index, i.e. according to pivot permutations (Skala, 2009). These centralized approaches were compared with the centralized M-Index (Novak et al., in press). However, they are designed purely for approximate search and cannot be straightforwardly adopted to a distributed environment. The only exception is the PP-Index (Esuli, 2009b), where the author sketches a way to distribute the index but does not provide any distributed experiments. 
The suggested solution also requires a single entry-point where the necessary tree structure is maintained and a dedicated set of computers among which the calculations are distributed. On the other hand, our proposed distributed M-Index builds on the peer-to-peer paradigms, thus avoiding any single bottleneck, being more tolerant to failures, and allowing practically unlimited distributed resources to be utilized. Several distributed data structures for metric-based searching were proposed: GHT* (Batko, Gennaro, & Zezula, 2005), MCAN (Falchi, Gennaro, & Zezula, 2007), or M-Chord (Novak & Zezula, 2006). A detailed comparison of these techniques (Batko, Novak, Falchi, & Zezula, 2006, 2008) indicates that the M-Chord slightly outperforms the others under most conditions. There is also a distributed approximate algorithm for the M-Chord (Novak, Batko, & Zezula, 2008). The M-Index deepens the core ideas of M-Chord; the centralized M-Index is significantly more efficient than a centralized version of M-Chord (Novak et al., in press) and experiments conducted further in this work show that its approximate algorithm also outperforms the M-Chord search that was realized on the same dataset (Novak et al., 2008). There are several on-line demonstrations of large-scale techniques for content-based image retrieval. ALIPR1 searches a set of images according to automatically generated annotations. ImBrowse2 allows searching about 750,000 images by color, texture, and shapes (and combinations of these), employing five independent engines. Furthermore, the idée system3 searches a commercial database of 2.8 million images according to image signatures and GazoPa4 is a private service by Hitachi searching 80 million images by color and shape. Also the aforementioned PP-Index structure was employed in an image-retrieval system called MiPai (Esuli, 2009a), indexing 100 million images by visual similarity. All these systems, except for the very recent projects GazoPa and MiPai, search databases two orders of magnitude smaller than the one presented in this work. They are also typically designed only for digital images searched by a specific method, which contrasts with the highly versatile metric-based approach. On the other hand, the GazoPa and MiPai systems do not use the peer-to-peer approach but rather centralized indexing solutions with limited throughput and scalability.
1 http://www.alipr.com/.
2 http://media-vibrance.itn.liu.se/.
3 http://labs.ideeinc.com/.
4 http://www.gazopa.com/.
A decade ago, Gionis, Indyk, and Motwani (1999) proposed to apply Locality Sensitive Hashing (LSH) to the task of distance-based similarity search. The key idea is to map the dataset into a set of buckets in such a way that close objects tend to be hashed into the same bucket and this tendency decreases with the objects' mutual distance. When searching for data similar to a given example, the index accesses the bucket where the example is hashed – this approach is distributable and relatively scalable. This research area evolved by introducing new techniques such as multi-probe index access (Lv, Josephson, Wang, Charikar, & Li, 2007) or a self-tunable LSH Forest (Bawa, Condie, & Ganesan, 2005). In general, we must define a specific LSH function for each particular (dis)similarity function to which we want to apply an LSH approach; LSH functions exist for basic vector distances such as the Minkowski distances, the Hamming distance, or the Jaccard coefficient on the domain of sets. The M-Index also defines a hash function that preserves the locality of the data; it is applicable to any generic metric space, and its efficiency outperforms standard LSH functions on Euclidean vector spaces (Novak, Kyselak, & Zezula, 2010).

3. Distributed M-Index

In order to deal with the scalability challenge, we develop a distributed version of the Metric Index (M-Index) – a metric-based data structure for similarity data management (Novak et al., in press). The M-Index can be seen as a general hashing principle which transforms metric objects into certain M-Index hash keys and is independent of the actual data-storage structure. The centralized version of the M-Index (Novak et al., in press) utilizes the B+-tree for storage and search speed-up. In this work, we propose to store the data in a distributed hash table like Skip Graphs (Aspnes & Shah, 2003) and we describe the corresponding distributed search algorithms that exploit parallelism. The index has constant building costs and, at the same time, its similarity search algorithms employ practically all known principles of metric space partitioning, pruning and filtering (Zezula et al., 2006), and thus achieve high search performance. In the following, we first briefly summarize the principles of M-Index data management fully described in Novak et al. (in press) and then propose two variants of a generic distributed M-Index architecture. We also elaborate on a method to combine distributed and centralized M-Indexes into an efficient and scalable similarity management system.

3.1. M-Index principles

The M-Index treats the data purely as a metric space M = (D, d), where D is a domain of objects and d is a total distance function d : D × D → R satisfying the metric postulates (non-negativity, identity, symmetry, and triangle inequality) (Zezula et al., 2006). Let us assume that this distance function is normalized: d : D × D → [0, 1), which can be achieved by dividing the distance by a constant greater than the maximal value of d expected in the data domain. In practice, such a constant can always be determined and the transformed space retains all properties of the original one. The metric space as a model of similarity (Zezula et al., 2006) is typically searched according to the query-by-example paradigm – we focus on (1) the range query R(q, r), which retrieves all objects o ∈ X within the range r from q (where X ⊆ D is the dataset stored by the data structure), and (2) the nearest-neighbors query kNN(q, k), which returns the k objects from X with the smallest distances from q.
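To make the preceding definitions concrete, the following is a minimal, self-contained sketch (in Java, with our own helper names rather than the MESSIF API) of the metric-space model: a distance function, its normalization by a constant, and naive linear-scan reference implementations of the R(q, r) and kNN(q, k) queries. The toy L2 distance and the normalization constant 10 are assumptions made only for this illustration.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the metric-space model of Section 3.1 (hypothetical helper names).
public class MetricModelSketch {

    /** Total distance function d : D x D -> R satisfying the metric postulates. */
    interface Metric<T> {
        double distance(T a, T b);
    }

    /** Normalizes d by a constant larger than the maximal distance expected in the domain. */
    static <T> Metric<T> normalized(Metric<T> d, double maxDistance) {
        return (a, b) -> d.distance(a, b) / maxDistance;
    }

    /** Range query R(q, r): all objects within distance r from q (naive linear scan). */
    static <T> List<T> rangeQuery(List<T> dataset, Metric<T> d, T q, double r) {
        List<T> answer = new ArrayList<>();
        for (T o : dataset) {
            if (d.distance(q, o) <= r) {
                answer.add(o);
            }
        }
        return answer;
    }

    /** Nearest-neighbors query kNN(q, k): the k objects with the smallest distances from q. */
    static <T> List<T> knnQuery(List<T> dataset, Metric<T> d, T q, int k) {
        List<T> sorted = new ArrayList<>(dataset);
        sorted.sort(Comparator.comparingDouble(o -> d.distance(q, o)));
        return sorted.subList(0, Math.min(k, sorted.size()));
    }

    public static void main(String[] args) {
        // Toy example: 2-D points under the Euclidean (L2) distance, normalized by 10.
        Metric<double[]> l2 = (a, b) ->
                Math.sqrt(Math.pow(a[0] - b[0], 2) + Math.pow(a[1] - b[1], 2));
        Metric<double[]> d = normalized(l2, 10.0);

        List<double[]> data = List.of(
                new double[]{0, 0}, new double[]{1, 1}, new double[]{3, 4}, new double[]{6, 8});
        double[] q = {0, 0};

        System.out.println("R(q, 0.2): " + rangeQuery(data, d, q, 0.2).size() + " objects");
        System.out.println("2-NN first: " + java.util.Arrays.toString(knnQuery(data, d, q, 2).get(0)));
    }
}

The linear scan above only defines the query semantics; the point of the M-Index is to answer the same queries without examining the whole dataset.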
As many applications focus on query-processing efficiency and may tolerate a certain level of inaccuracy, the queries are often evaluated in an approximate manner. The M-Index proposes both precise and approximate query-evaluation strategies that successfully compete with other approaches (Novak et al., in press). For more details about metric indexing and searching in general, see some recent monographs (Zezula et al., 2006, 2005).

3.2. Data-space mapping in M-Index

The fundamental idea of the M-Index is to define a universal mapping schema from a metric space D to a numeric domain. This schema uses a fixed set of n reference objects (pivots) {p_0, p_1, ..., p_{n-1}} and has the ability to preserve the proximity of the data. To describe the mapping schema precisely, we need one preliminary definition. For an object o ∈ D, let (·)_o be any permutation of the pivot indexes {0, 1, ..., n-1} such that
\[ d(p_{(0)_o}, o) \le d(p_{(1)_o}, o) \le \cdots \le d(p_{(n-1)_o}, o). \]
In other words, the sequence p_{(0)_o}, p_{(1)_o}, ..., p_{(n-1)_o} is ordered with respect to the distances between the pivots and object o. The M-Index recursively partitions the data space in a Voronoi-like manner: on the first level, each object o ∈ D is assigned to its closest pivot p_i – clusters C_i are formed in this way (in other words, (0)_o = i for all objects o ∈ C_i). On the second level, each cluster C_i is partitioned into n-1 clusters by the same procedure using the set of n-1 pivots {p_0, ..., p_{i-1}, p_{i+1}, ..., p_{n-1}}, creating clusters C_{i,j}, where j is the index of the second closest pivot to the objects in cluster C_{i,j}, i.e. (1)_o = j. Fig. 1 (left) shows an example of the M-Index partitioning for two levels. This partitioning process is repeated l times for an M-Index with l levels, where l is an integer, 1 ≤ l ≤ n. The M-Index with l levels further defines a mapping of the data space to a numeric domain key_l : D → R, where the integral part of key_l(o), o ∈ D, results from a numbering schema of the clusters. Specifically, cluster C_{i_0,i_1,...,i_{l-1}} is assigned the number:
\[ \mathrm{cluster}(C_{i_0,i_1,\ldots,i_{l-1}}) = \sum_{j=0}^{l-1} i_j \cdot n^{\,l-1-j}. \tag{1} \]
Fig. 1. Principles of a two-level M-Index: partitioning (left) and mapping (right).
The fractional part of key_l(o) is the distance between the object o and its closest pivot, d(p_{(0)_o}, o). Altogether:
\[ \mathrm{key}_l(o) = d(p_{(0)_o}, o) + \mathrm{cluster}(C_{(0)_o,(1)_o,\ldots,(l-1)_o}). \tag{2} \]
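The mapping of Eqs. (1) and (2) can be illustrated by a short sketch; it assumes the object-pivot distances are already computed and all names are our own hypothetical helpers, not the authors' implementation.

import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

// Sketch of the static M-Index mapping of Section 3.2 (hypothetical helper names).
public class MIndexKeySketch {

    /** Pivot indexes 0..n-1 ordered by increasing distance d(p_i, o), i.e. (0)_o, (1)_o, ... */
    static int[] pivotPermutation(double[] pivotDistances) {
        return IntStream.range(0, pivotDistances.length)
                .boxed()
                .sorted(Comparator.comparingDouble(i -> pivotDistances[i]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    /** cluster(C_{i_0,...,i_{l-1}}) = sum_{j=0}^{l-1} i_j * n^(l-1-j)   -- Eq. (1) */
    static long clusterNumber(int[] permutation, int l, int n) {
        long cluster = 0;
        for (int j = 0; j < l; j++) {
            cluster += permutation[j] * Math.round(Math.pow(n, l - 1 - j));
        }
        return cluster;
    }

    /** key_l(o) = d(p_{(0)_o}, o) + cluster(C_{(0)_o,...,(l-1)_o})   -- Eq. (2) */
    static double key(double[] pivotDistances, int l) {
        int n = pivotDistances.length;
        int[] perm = pivotPermutation(pivotDistances);
        return pivotDistances[perm[0]] + clusterNumber(perm, l, n);
    }

    public static void main(String[] args) {
        // Toy example with n = 4 pivots and l = 2 levels; distances are normalized to [0, 1).
        double[] distances = {0.2, 0.3, 0.5, 0.25};   // d(p_0,o), d(p_1,o), d(p_2,o), d(p_3,o)
        int[] perm = pivotPermutation(distances);      // -> [0, 3, 1, 2]
        System.out.println("permutation: " + Arrays.toString(perm));
        System.out.println("cluster C_{0,3}: " + clusterNumber(perm, 2, 4)); // 0*4 + 3 = 3
        System.out.println("key_2(o) = " + key(distances, 2));               // 3.2
    }
}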
Fig. 1 (right) sketches this M-Index mapping principle with the level l = 2 (only for clusters C_{1,j}) – see the M-Index paper (Novak et al., in press) for details. The M-Index as described so far has a static partitioning and mapping for a given level l. Since neither the data distribution nor the space partitioning are uniform, the M-Index has a dynamic variant that further partitions only clusters that exceed a certain data volume limit. In this case, the M-Index maintains a dynamic cluster tree to keep track of the actual depth of individual clusters. The formula for the key(o) calculation is modified to take the actual tree depth into account (Novak et al., in press) – the tree has an a priori given maximal level 1 ≤ l_max ≤ n. Such a key-assignment approach can be considered a variant of extensible hashing (Fagin, Nievergelt, Pippenger, & Strong, 1979). The schema of this tree-like structure for l_max = 3 is sketched in Fig. 2. The nodes of the cluster tree have the following structure: ⟨l, (i_0, ..., i_{l-1}), (ptr^{l+1}_0, ..., ptr^{l+1}_{n-1})⟩, where l is the node level, (i_0, ..., i_{l-1}) are the pivot indexes identifying this cluster, and ptr^{l+1}_j are pointers to nodes on level l + 1. Non-existing clusters on level l + 1 are indicated by undefined pointers ptr^{l+1}_j (equal to null) – leaf nodes have all these pointers undefined. The cluster tree always has the following root node: ⟨0, (), (ptr^1_0, ..., ptr^1_{n-1})⟩.

3.3. Distributed index architecture

The M-Index mapping and searching principles are independent of the data storage layer. The M-Index requires that the data is stored according to the M-Index key, ideally with efficient evaluation of interval queries on the keys, since such queries are exploited by the M-Index search algorithms – the B+-tree is an ideal centralized structure for this purpose. The underlying storage structure can also be distributed, which allows the query evaluation to run in parallel and improves both the query response time and the throughput of the system. We aim for a highly scalable solution, so we decided to use a peer-to-peer paradigm. It enables us to (1) have virtually unlimited computational and storage resources that can be adjusted on demand, (2) avoid bottlenecks caused by dedicated components, and (3) utilize the inherent fault-tolerance mechanisms of the network. In the following, we propose a distributed M-Index that operates on the structured peer-to-peer network Skip Graphs (Aspnes & Shah, 2003). In this structure, every participating peer manages data with M-Index keys from a given interval and queries are efficiently forwarded to the peers that overlap with the given query interval. In the case of a dynamic M-Index, which requires the existence of a cluster tree, there are two general variants of the overall system structure:
1. The cluster tree is provided via an external centralized service available to all the peers. The service is synchronously notified whenever a data partition is moved between peers, e.g. by splitting incurred by inserts, and the cluster tree is kept up-to-date. This service can be replicated to improve the fault-tolerance.
Fig. 2. Dynamic cluster tree, lmax = 3.
2. The system fully adopts the peer-to-peer principles, where every peer is an independent entry-point for data modifications and queries.
The former scenario implies less complicated search and update algorithms, but the directory becomes a point of potential failure and a performance bottleneck. The latter schema introduces another level of complexity for the dynamic version of the M-Index, because the cluster tree, as introduced in the previous section, must be distributed over the participating peers. The peers can organize their local data in an independent similarity structure, for instance an M-Index – see Fig. 3 for the overall schema of the system.

3.4. Distributed M-Index with full cluster tree

Let us assume that a full cluster tree exists, which happens (1) either when the system has a single point of entry or (2) when the M-Index is static and has no dynamic cluster tree – in this case, the cluster tree is static and is replicated on all peers. We propose distributed algorithms for the precise evaluation of range and kNN queries and an approximate strategy for kNN.

3.4.1. Range algorithm

The range query R(q, r) can be used in applications where the user knows the maximal distance of interest; the query retrieves all the objects that are within the specified radius r from the query object q. Algorithm 1 shows the main procedure of the distributed search algorithm for a range query R(q, r) that is realized on the cluster tree. First, the procedure calculates the distances d(p_i, q), i = 0, ..., n-1, and sorts them to find the pivot permutation (·)_q (lines 2–4 of Algorithm 1). The algorithm traverses the cluster tree from the root to the leaves in a breadth-first manner using a queue Q of tree nodes (initialization on lines 5–6). The algorithm tries to prune the cluster tree, i.e. to skip accessing certain tree nodes, due to the repetitive application of the Voronoi partitioning. According to the double-pivot distance constraint (Zezula et al., 2006), cluster C_i can be skipped if
\[ d(p_i, q) - d(p_j, q) > 2r, \tag{3} \]
where p_j can be any pivot, j ∈ {0, ..., n-1}. To maximize the distance difference on the left side of Eq. (3), we set p_j = p_{(0)_q}. Because the Voronoi partitioning is repeated l times for cluster C_{i_0,...,i_{l-1}}, this rule can be applied l times in order to skip this cluster – once for each pivot p_{i_0}, ..., p_{i_{l-1}}. If pivot p_{(0)_q} is among the pivots p_{i_0}, ..., p_{i_{l-2}} (for l ≥ 2), it was not considered for the Voronoi partitioning on level l. Therefore, this pivot cannot be used as pivot p_j in the pruning condition (3) and p_j is identified as the pivot with the smallest distance d(p_j, q) that is not among p_{i_0}, ..., p_{i_{l-2}} (see lines 10–12 of Algorithm 1). The descendants of the non-pruned internal nodes need to be further explored (lines 13–15). Keys of objects stored in cluster C_{i_0,...,i_{l-1}} contain distances from pivot p_{i_0} and, therefore, Algorithm 1 can determine an interval of the M-Index key domain to be searched within that cluster (line 17):
\[ \big[\mathrm{cluster}(C_{i_0,\ldots,i_{l-1}}) + d(p_{i_0}, q) - r,\ \mathrm{cluster}(C_{i_0,\ldots,i_{l-1}}) + d(p_{i_0}, q) + r\big]. \tag{4} \]
This is a direct application of the object-pivot distance constraint (Zezula et al., 2006) (recall that cluster(C) returns the M-Index number of cluster C). A set of query-relevant key intervals, denoted intervals, is created by this mechanism. Every peer P in the distributed structure is responsible for data from a certain interval of M-Index keys, P.interval. At this point, the query request is sent to every peer whose interval intersects with the list intervals. The routing mechanism provided by Skip Graphs (Aspnes & Shah, 2003) or another structure is used, and Algorithm 2 is executed on every target peer P to collect a partial answer A_P based on its data.
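A compact sketch of the two ingredients just described – the double-pivot pruning test of Eq. (3) and the key interval of Eq. (4) – follows; the helper names and the numeric values in the example are ours, chosen only for illustration.

// Sketch of the pruning and interval-generation steps of Algorithm 1 (hypothetical names).
public class RangePruningSketch {

    /**
     * Double-pivot distance constraint (Eq. 3): a cluster branch with pivot p_i can be
     * skipped if d(p_i, q) - d(p_j, q) > 2r, where p_j is the closest pivot to q that was
     * still available at this level of the Voronoi partitioning.
     */
    static boolean canPruneCluster(double dPivotQuery, double dBestAvailablePivotQuery, double r) {
        return dPivotQuery - dBestAvailablePivotQuery > 2 * r;
    }

    /**
     * Key interval of Eq. (4) for leaf cluster C_{i_0,...,i_{l-1}}:
     * [cluster(C) + d(p_{i_0}, q) - r, cluster(C) + d(p_{i_0}, q) + r].
     */
    static double[] queryInterval(long clusterNumber, double dFirstPivotQuery, double r) {
        double center = clusterNumber + dFirstPivotQuery;
        return new double[]{center - r, center + r};
    }

    public static void main(String[] args) {
        double r = 0.1;
        // Pivot p_2 is far from q (0.5) while the best available pivot is at 0.2:
        // 0.5 - 0.2 = 0.3 > 2 * 0.1, so the whole C_2 branch can be skipped.
        System.out.println("prune C_2: " + canPruneCluster(0.5, 0.2, r));
        // A non-pruned leaf cluster with number 3 and d(p_{i_0}, q) = 0.2 yields [3.1, 3.3].
        double[] interval = queryInterval(3, 0.2, r);
        System.out.println("interval: [" + interval[0] + ", " + interval[1] + "]");
    }
}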
Fig. 3. Schema of distributed M-Index.
We assume that the peer's local data is organized separately for each individual cluster – each interval I from intervals uniquely identifies a cluster C (see lines 2–3 of Algorithm 2). The peer can maintain the minimal (C.minKey) and maximal (C.maxKey) key of the objects stored in each cluster C and these values can be used for skipping the whole cluster if the condition on lines 4–5 is satisfied. Looking at the construction of the intervals (4), this condition is an application of the range-pivot distance constraint (Zezula et al., 2006), which allows skipping a data cluster if
dðp; qÞ þ r < rmin
or
dðp; qÞ r > r max
where r_min and r_max are the minimal and maximal object-pivot distances within cluster C (note that r_min and r_max are the fractional parts of C.minKey and C.maxKey, respectively).

Algorithm 1. Distributed range search algorithm
Input: query object q, query radius r
Output: set A = {o ∈ X | d(q, o) ≤ r}
1   A ← ∅
2   for i ← 0 to n-1 do
3       calculate d(p_i, q)
4   sort the pivots to find p_{(0)_q}, ..., p_{(n-1)_q}
5   intervals ← empty list of intervals
6   Q ← empty queue; Q.enqueue(root)
7   while ¬Q.empty do
8       node ← Q.dequeue()          // node = ⟨l, (i_0,...,i_{l-1}), (ptr^{l+1}_0,...,ptr^{l+1}_{n-1})⟩
9       // double-pivot distance constraint
10      j ← smallest j ≥ 0 such that (j)_q ∉ {i_0,...,i_{l-2}}
11      if l > 0 ∧ d(p_{i_{l-1}}, q) - d(p_{(j)_q}, q) > 2r then
12          continue
13      if node is internal then
14          for i ← 0 to n-1 do
15              Q.enqueue(dereference(ptr^{l+1}_i))
16      else                          // node is a leaf node; object-pivot distance constraint
17          intervals.add([cluster(C_{i_0,...,i_{l-1}}) + d(p_{i_0}, q) - r, cluster(C_{i_0,...,i_{l-1}}) + d(p_{i_0}, q) + r])
    // use routing algorithm, e.g. Skip Graphs
18  subanswers ← sendSearchRequest(intervals)
19  foreach answer set A_P in subanswers do
20      A.addAll(A_P)
Algorithm 2. Range search algorithm at peer P
Input: search request for R(q, r) with the list intervals
Output: set A_P = {o ∈ X | key(o) ∈ intervals ∩ P.interval ∧ d(q, o) ≤ r}
1   A_P ← ∅
2   foreach I ∈ intervals ∩ P.interval do
3       C ← P.getCluster(clusterNumber(I))
        // range-pivot distance constraint
4       if I.upperBound < C.minKey ∨ I.lowerBound > C.maxKey then
5           continue
6       if C.hasMetricIndex() then
7           A_P.addAll(C.processQuery(R(q, r)))
8       else
9           data ← C.getDataForInterval(I)
10          foreach object o in data do
                // pivot filtering
11              if max_{i=0..n-1} |d(p_i, q) - d(p_i, o)| > r then
12                  continue with the next o
13              if d(q, o) ≤ r then
14                  A_P.addObject(o)
15  P.sendAnswer(A_P, originator)
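The following is a small sketch of the two filtering tests used by Algorithm 2 – the range-pivot skip over a cluster's [minKey, maxKey] summary (explained above) and the pivot filtering of individual objects (discussed in the text below). The helper names and example values are our own assumptions, not the MESSIF API.

// Sketch of the per-peer cluster skipping and pivot filtering of Algorithm 2 (hypothetical names).
public class PeerFilteringSketch {

    /**
     * Range-pivot distance constraint: the whole cluster can be skipped when
     * d(p, q) + r < r_min or d(p, q) - r > r_max, where r_min and r_max are the
     * fractional parts of C.minKey and C.maxKey (smallest/largest object-pivot distance).
     */
    static boolean canSkipCluster(double dPivotQuery, double r, double rMin, double rMax) {
        return dPivotQuery + r < rMin || dPivotQuery - r > rMax;
    }

    /**
     * Pivot filtering: object o can be discarded without computing d(q, o) whenever
     * max_i |d(p_i, q) - d(p_i, o)| > r (a triangle-inequality lower bound on d(q, o)).
     */
    static boolean canDiscardByPivotFiltering(double[] queryPivotDist, double[] objectPivotDist, double r) {
        double lowerBound = 0;
        for (int i = 0; i < queryPivotDist.length; i++) {
            lowerBound = Math.max(lowerBound, Math.abs(queryPivotDist[i] - objectPivotDist[i]));
        }
        return lowerBound > r;
    }

    public static void main(String[] args) {
        double r = 0.1;
        // A cluster whose stored object-pivot distances lie in [0.45, 0.60] cannot
        // contain an object within r = 0.1 of a query with d(p, q) = 0.2.
        System.out.println("skip cluster: " + canSkipCluster(0.2, r, 0.45, 0.60));
        // An object whose stored pivot distances differ from the query's by 0.3 is filtered out.
        double[] qDist = {0.2, 0.3, 0.5};
        double[] oDist = {0.5, 0.35, 0.45};
        System.out.println("discard object: " + canDiscardByPivotFiltering(qDist, oDist, r));
    }
}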
The cluster data can be locally organized in an independent similarity structure (see Fig. 3). In such a case, the query is executed on this index and the result is added to the set A_P (lines 6–7). If not, and the data is organized according to M-Index keys, the interval of keys I is examined and the objects that satisfy the query condition are added to the partial answer A_P (lines 9–14). The number of computations of the distance function d can be further reduced by a standard technique called pivot filtering (Zezula et al., 2006), which requires that the object-pivot distances are stored with the objects and used at this point (lines 11–12). Finally, the partial answers A_P are returned to the query originator and merged into the final answer (lines 18–20 of Algorithm 1). The presented search algorithm is also the base for other similarity queries.

3.4.2. Nearest-neighbors search algorithm

The nearest-neighbors query kNN(q, k) is more user-friendly than the range query, since the user requests the given number k of objects most similar to q instead of specifying the radius. The algorithm for kNN(q, k) queries adopts a standard two-phase strategy:
1. employ a heuristic to locate k data objects ''near'' q and measure the distance ρ_k to the kth nearest object found – the value ρ_k is an upper bound on the distance to the actual kth nearest-neighbor of q;
2. execute the R(q, ρ_k) query and return the k objects with the lowest distances from the query result.
The heuristic accesses cluster C_{(0)_q,...,(l-1)_q}, in which object q would be stored, as ''the most promising cluster''. We could also use other radius-estimation heuristics, for instance based on a statistical analysis of the dataset distance distribution (Doulkeridis, Vlachou, Kotidis, & Vazirgiannis, 2007). During the evaluation of the R(q, ρ_k) query, only the k nearest objects are kept and the actual radius ρ_k may shrink as more objects are explored. The two steps of the algorithm are consecutive, which increases the overall query response time. This overhead can be partially reduced by skipping the data searched in the first phase during the second-stage R(q, ρ_k) processing.

3.4.3. Approximate kNN search

The distributed approximate search strategy for kNN(q, k) queries follows the schema of Algorithms 1 and 2. The cluster queue Q used in the first phase becomes a priority queue ordered according to a heuristic designed to prioritize clusters that should contain objects from the precise kNN(q, k) answer (see below). Because the query radius is unknown, neither the double-pivot constraint (lines 10–12) nor the interval specification (line 17) are applied. Instead, only the midpoint of the cluster's interval is specified as cluster(C_{i_0,...,i_{l-1}}) + d(p_{i_0}, q) and added to the list of clusters to be explored. This Q-based cluster-tree traversal ends after a given number c of ''promising'' clusters is identified and search requests are forwarded to the peers responsible for these c interval midpoints. The cluster-ordering heuristic (Novak et al., in press) is based purely on an analysis of the distances d(p_0, q), ..., d(p_{n-1}, q). Cluster C_{(0)_q,...,(l-1)_q}, in which object q would be stored in the M-Index (let us denote it C_q), gets the highest priority – it is assigned a penalty equal to 0. Cluster C_q has the smallest sum of distances d(p_{(0)_q}, q) + ... + d(p_{(l-1)_q}, q) and this sum grows for other clusters. This is reflected by the penalty in order to express the ''proximity'' of the cluster to q. Specifically:
\[ \mathrm{penalty}_q(C_{i_0,\ldots,i_{l-1}}) = \frac{1}{l} \sum_{j=0}^{l-1} \max\big\{ d(p_{i_j}, q) - d(p_{(j)_q}, q),\ 0 \big\}. \tag{5} \]
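A minimal sketch of the heuristic of Eq. (5) follows; it assumes the query-pivot distances are precomputed and, using the Fig. 4 example introduced below, it reproduces the penalty values 0.025, 0.025 and 0.05. All names are our own illustration helpers.

// Sketch of the cluster-ordering penalty of Eq. (5) (hypothetical helper names).
public class PenaltySketch {

    /**
     * penalty_q(C_{i_0,...,i_{l-1}}) = (1/l) * sum_j max{ d(p_{i_j}, q) - d(p_{(j)_q}, q), 0 }.
     * queryPivotDist holds d(p_i, q); sortedQueryDist holds the same values sorted
     * increasingly, so sortedQueryDist[j] = d(p_{(j)_q}, q).
     */
    static double penalty(int[] clusterPivots, double[] queryPivotDist, double[] sortedQueryDist) {
        double sum = 0;
        for (int j = 0; j < clusterPivots.length; j++) {
            sum += Math.max(queryPivotDist[clusterPivots[j]] - sortedQueryDist[j], 0);
        }
        return sum / clusterPivots.length;
    }

    public static void main(String[] args) {
        double[] d = {0.2, 0.3, 0.5, 0.25};          // d(p_0,q), d(p_1,q), d(p_2,q), d(p_3,q)
        double[] sorted = {0.2, 0.25, 0.3, 0.5};     // d(p_{(0)_q},q) <= d(p_{(1)_q},q) <= ...
        System.out.println(penalty(new int[]{0, 3}, d, sorted)); // C_{0,3} (= C_q): 0.0
        System.out.println(penalty(new int[]{0, 1}, d, sorted)); // C_{0,1}: 0.025
        System.out.println(penalty(new int[]{3, 0}, d, sorted)); // C_{3,0}: 0.025
        System.out.println(penalty(new int[]{1, 0}, d, sorted)); // C_{1,0}: 0.05
    }
}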
Note that the penalty is normalized by 1/l in order to make the penalty values comparable for clusters on different levels. Fig. 4 provides an example in which C_q = C_{0,3} and the query-pivot distances are d(p_0, q) = 0.2, d(p_3, q) = 0.25, d(p_1, q) = 0.3, and d(p_2, q) = 0.5. We know that penalty_q(C_{0,3}) = 0 and the penalties of other ''close'' clusters are, for instance:
Fig. 4. Principles of approximate strategy.
\[ \mathrm{penalty}_q(C_{0,1}) = \tfrac{1}{2}\big((d(p_0, q) - d(p_0, q)) + (d(p_1, q) - d(p_3, q))\big) = 0.025, \]
\[ \mathrm{penalty}_q(C_{3,0}) = \tfrac{1}{2}\big((0.25 - 0.2) + \max\{0.2 - 0.25, 0\}\big) = 0.025, \]
\[ \mathrm{penalty}_q(C_{1,0}) = \tfrac{1}{2}\big((0.3 - 0.2) + \max\{0.2 - 0.25, 0\}\big) = 0.05. \]
The approximate variant of Algorithm 2 is also slightly modified. It requires a parameter local – the maximal volume of data to be accessed within every cluster (specified by a number of objects or block reads). As the intervals I from intervals are specified only by their midpoints, lines 4–5 are skipped. If the cluster is locally organized, the approximate query is processed on that structure (a local M-Index, for instance (Novak et al., in press)), passing the parameter local. If not, objects of the cluster are explored starting from the specified midpoint and walking alternately one block right and left (according to the M-Index keys) until a data volume of size local is explored. After accessing the first k objects, the approximate algorithm uses the current query radius ρ_k for pivot filtering (lines 11–12). Naturally, this approximate strategy has a number of alternatives and modifications; for instance, the number of clusters to be visited c can be determined adaptively for each query point (Novak et al., 2008) or the query can be forwarded to neighboring peers in case the cluster spreads over more than one peer (Novak et al., 2008).

3.5. Dynamic M-Index with distributed cluster tree

Let us briefly sketch a distributed M-Index system with a dynamic number of levels and with the cluster tree distributed over the participating peers. Such an architecture turns from a system based on (dynamic) hashing into a peer-to-peer structure with an Address Search Tree (AST) as proposed, for instance, in GHT* (Batko et al., 2005) or P-Tree (Crainiceanu, Linga, Gehrke, & Shanmugasundaram, 2004). In this concept, every peer maintains a part of the global search tree that covers the root-leaf paths for all leaves for which the peer stores some data. Each pointer in the local tree then points either to another local tree node or to a different peer that contains the given part of the tree. The insert operation typically follows a single root-leaf path, being forwarded from peer to peer, if necessary. Any modification in the tree structure (caused by node splitting/merging or by peers joining/leaving) is spread to the subset of peers influenced by this change. The search requests wander over the distributed structure (they often fork) according to what part of the tree structure the search requires, and the replies with partial answers are finally sent back to the originator. Such a concept could be applied to the M-Index as well, but we do not specify it precisely because we would like to keep the hashing nature of the M-Index. Moreover, the static index (with a low number of levels l) can be sufficiently efficient, as large clusters can be distributed over several peers and all clusters are further indexed (partitioned) by local M-Index structures. Individual clusters can even be indexed by other M-Indexes. In this way, we define an M-Index with multiple ''layers'', each of which uses a different set of pivots and thus brings an orthogonal mapping of the dataset.

4. M-Index efficiency evaluation

In this section, we present experiments conducted on the distributed M-Index architecture described above. The objective of these trials is to analyze the efficiency of the presented precise and approximate search algorithms in terms of various cost measures.
The structure was implemented in Java using the MESSIF (Batko, Novak, & Zezula, 2007) framework, which provides extensive support for creating prototypes of metric-based indexing techniques. Data partitioning and query navigation of the system are provided by the peer-to-peer structure Skip Graphs (Aspnes & Shah, 2003) as proposed in Sections 3.3 and 3.4. All the distributed M-Index systems under test have a static number of levels and, thus, apply the algorithms from Section 3.4. A summary of the various symbols and M-Index parameters is provided in Fig. 5.

4.1. Settings and measurements

The experiments were realized on a voluminous real-life dataset consisting of visual descriptors from digital images – five MPEG-7 features were extracted from every image: Scalable Color, Color Structure, Color Layout, Edge Histogram, and Homogeneous Texture (MPEG-7, 2002). The images with the descriptors were taken from the CoPhIR Database (Bolettieri et al., 2009, 2010; Batko, Kohoutkova, & Novak, 2009). Each of these descriptors is compared using a metric function and we use a weighted sum of these distances to aggregate them into a single metric space (Batko et al., 2010). In total, this representation can be viewed as a 280-dimensional vector (occupying about 1 kB of memory) together with a complex aggregation distance function – the evaluation of one distance takes approximately 0.01 ms on standard hardware. The intrinsic dimensionality (Chávez & Navarro, 2000) of this dataset is 12.9, which makes it rather difficult to index. We used sets of various sizes from 1 million to the full 100 million images. Table 1 summarizes the M-Index configurations used in this section for various dataset sizes. The one-million dataset is organized by a small distributed M-Index with l = 2, n = 16 and with five peers. This network is created only for comparison reasons – such data volumes could be conveniently managed by a single centralized M-Index. The 10-million dataset is distributed among 50 peers by a static distributed M-Index with l = 2 or l = 3. The partitioning of the data among the peers is done by the Skip Graphs mechanisms (Aspnes & Shah, 2003), resulting in a roughly uniform data distribution. The full 100 million CoPhIR dataset is distributed over 500 peers by an M-Index with the same settings.
Fig. 5. Symbols used in this work.
Table 1. M-Index system settings for various dataset sizes.

Dataset volume  | Settings                                               | # of peers
1,000,000       | Distributed static M-Index, l = 2 and n = 16           | 5
10,000,000      | Distributed static M-Index, l = 2, 3 and n = 8, 16, 32 | 50
100,000,000     | Distributed static M-Index, l = 2 and n = 32           | 500
Individual clusters (or their parts, if a cluster is spread over several peers) are locally organized by separate centralized M-Indexes as discussed in Section 3.4.1. These local indexes are dynamic with lmax = 6 and a set of 32 pivots (n = 32). This set of pivots was selected uniformly at random from the dataset and it is, naturally, different from the set of pivots of the distributed M-Index (but all local M-Indexes work with the same set of pivots). The search efficiency of the structure is measured by I/O costs, either in terms of the number of 4 kB-block reads realized during the search or by the number of objects accessed. Further, the computational costs are measured as the number of evaluations of the distance function d (distance computations) and we also report on average response times. The network communication costs are expressed as the number of messages exchanged during the query processing. For the approximate evaluation strategies, we measure the answer quality by the recall, i.e. the percentage of the precise query result that is returned by the approximate search. All presented experiment results are taken as an average over 50 queries randomly chosen from the dataset.

4.2. Precise similarity queries

In the first set of experiments, we focus on the efficiency of the precise evaluation strategy for kNN queries, which incorporates the evaluation of a range query (see Section 3.4.2). First, let us compare the efficiency of various M-Index settings for the 10-million CoPhIR dataset – we present results for three configurations of the distributed M-Index:
1. l = 2 and n = 16, which results in maximally 16 · 15 = 240 clusters distributed among 50 peers,
2. l = 2 and n = 32 with a maximum of 991 clusters,
3. l = 3 and n = 8 with a maximum of 336 clusters.
We present these parameter combinations because they result in ''reasonable'' numbers of clusters (with reasonable data volumes) with respect to the number of peers. Hand in hand with this fact, these settings have efficiency superior to the other combinations we have tested; the results are summarized in Table 2. The table confirms one well-known general truth – the precise similarity search for complex data types is relatively expensive. All the M-Index networks have to access about 40% of the data and compute over 1.2 million distances to process
Table 2. Costs of precise kNN(q, 30) on the distributed M-Index with various settings on the 10 million CoPhIR dataset.

M-Index settings  | Clusters | d comp.   | Obj. accessed (%) | I/O (blocks) | Messages | Response (ms)
Level l = 2, n = 16 | 240    | 1,283,000 | 40.50             | 965,000      | 75       | 9420
Level l = 2, n = 32 | 991    | 1,283,000 | 40.46             | 963,000      | 79       | 9350
Level l = 3, n = 8  | 336    | 1,291,000 | 42.42             | 1,010,000    | 89       | 10,260
a kNN in a precise manner. The second configuration, which partitions the space into more clusters than the first one, seems to be slightly more efficient. Although the number of potential clusters grows four times, the number of messages sent grows only from 75 to 79, which indicates that majority of these additional clusters are pruned by the double-pivot distance constraint (see Section 3.4.1) so that search requests for these clusters are not sent to the network. On the other hand, the third configuration with eight pivots and three levels obviously prunes the clusters less effectively – there is a higher number of sent messages and larger volume of accessed data. The numbers of distance computations are very close for all three settings due to the pivot filtering (see Section 3.4.1) but the higher I/O costs cause longer response times of the third configuration. Naturally, the response times are strongly influenced by the hardware infrastructure – this aspect is discussed in detail in Section 5 of this work. Fig. 6 shows costs of the precise kNN processing for M-Index systems with various dataset sizes. We chose the second configuration (n = 32) for the 10-million dataset and applied it also for the full hundred-million set (see Table 1). The graph shows the three main cost measures introduced above and they all exhibit an expected linear trend (note that both axes have logarithmic scale). Taking a closer look, the one-million M-Index accesses 50% of the data objects, this measure is about 40% for the 10-million M-Index (Table 2), and the hundred-million system keeps the number at 40%. The one-million M-Index computes distances to almost 250,000 objects (25%), the 10-million system decreases the relative number of d computations to about 12.8% (Table 2), and this value is 10% for the full dataset, as observable from the second line in Fig. 6. Overall, these total costs seem to have slightly sublinear trends for growing dataset volume. One of the key objectives of distributed data structures is parallel processing of search requests. Let us define parallel versions of the cost measures discussed above as the ‘‘maximal search costs realized in a sequential manner during the query distributed processing’’. These parallel measures assume that every peer runs on an independent hardware infrastructure (CPU and disk). Fig. 7 shows these results for the precise kNN experiment on various data volumes. The parallel numbers
Fig. 6. Costs of precise kNN(q, 30) for various data volumes.
Fig. 7. Parallel costs of precise kNN(q, 30).
Fig. 8. Recall of approximate kNN(q, 30) on the 10-million dataset with various local-search settings.
of accessed objects and distance computations are slightly growing (note that, this time, the y-axis has a linear scale). This trend is caused by the growth of cluster volumes – large clusters may occupy a whole peer (or more peers) and, thus, a query can sequentially access all data objects stored by that peer. On the other hand, the data of such peers is less fragmented, which is indicated by the almost constant parallel number of block reads.

4.3. Approximate similarity searching

In this subsection, we analyze the efficiency of the M-Index algorithm for approximate kNN searching. The basic indicator we observe is the query recall with respect to the query costs. As described in Section 3.4.3, the M-Index approximate algorithm has two main parameters: c is the number of ''the most promising'' clusters to be accessed by the algorithm, and local is the maximal number of objects probed by the local centralized M-Index structures at the peers visited by the algorithm. Individual clusters have different sizes (they can even be empty) and we define the total number of accessed objects as the sum of the numbers of objects accessed on all peers visited during the approximate search. Fig. 8 shows the recall for kNN(q, 30) with respect to this measure. This experiment was performed on the M-Index structure with the l = 2, n = 32 settings on the 10-million dataset. Individual curves were created by increasing the number of accessed clusters c with a different fixed parameter local. We can see that with local = 2000 we can reach a high recall very quickly – the system has 90% recall while accessing only 0.1% of the managed data (10,000 objects). This setting apparently cannot reach a higher recall than about 92% because some of the relevant objects are skipped during the very restricted local M-Index processing. With local = 6000 or higher, the search can achieve even 99% recall. In the kNN(q, k) experiment presented in Fig. 9, we set local = 6000 and present the recall for various values of k. This graph bears an important message – if the M-Index approximate kNN search misses some objects from the precise answer, then these objects are more likely to be from the higher k-positions. Overall, we can see that the M-Index is able to provide very high recall while accessing a tiny fraction of the database. The graph in Fig. 10 compares the results of the same experiment (local = 6000 and growing c) for the three 10-million M-Index settings proposed above. We can see that the general trends are very similar. The settings with n = 32 would have a slightly higher recall if we wanted the algorithm to access about 0.1% of the data, while n = 16 delivers better results for higher volumes of accessed data. The final experiment of this section answers an important question: ''How scalable is the distributed M-Index approach?'' We processed the approximate queries on M-Indexes with various dataset sizes and with the same configurations as in Figs. 6
Fig. 9. Recall of approximate kNN(q, k) for various k on the 10-million dataset with local = 6000.
Fig. 10. Recall of approximate kNN(q, 30) on the 10-million dataset with various M-Index settings.
Fig. 11. Recall of approximate kNN(q, k) for various k and dataset size with fixed absolute total cost.
and 7. For each network, we set local = 6000 and the parameter c was set so that the queries accessed on average about 50,000 objects in total (c = 20 for the 1- and 10-million structures and c = 10 for the hundred-million network). These results are presented in Fig. 11 (note the logarithmic scale of the x-axis). We can see that the approximate kNN search has a nearly constant recall for lower values of k and it slightly decreases for k = 50 and k = 100. This again proves that the approximate search rarely loses the closest neighbors. In other words, the M-Index reached nearly constant scalability for kNN(q, k) searches up to k = 30, which was shown by increasing the dataset size by two orders of magnitude.

5. Developing a real similarity-search application

In the previous sections we have presented the distributed variant of the M-Index structure and proved the concept by numerous experiments. However, deploying an indexing structure as a real search system still requires some additional steps. In this section, we would like to describe our experience with creating such a system, from data preparation and hardware design to a fully functional application with a Web interface. We have performed a set of experiments that allowed us to better understand the searching performance characteristics of the structure while deployed and, as a result, to tune its performance. Namely, we have tested various back-end architectures – memory-based and disk-based with various numbers of CPUs – and measured the query response times and the throughput of the application.

5.1. Image retrieval by M-Index

To prove our concept by developing a real application, the distributed M-Index structure was used as an underlying engine in the image-retrieval demonstration5 of the MUFIN project (Novak, Batko, & Zezula, 2009). The goal of the demonstration is to search for images that are visually similar to a given example (represented by a user-supplied image or an image from the database) – this searching paradigm is usually referred to as the query-by-example. In order to search for similar images, we need to define what ''similar'' means. A common practice in image-retrieval applications is to derive content information from the image, such as a color histogram or an edge map, which can be used
5 http://mufin.fi.muni.cz/imgsearch/.
computationally to express the similarity. Such extracted features typically cover only a particular visual aspect, so we combine several such features to get an overall similarity measure.

5.1.1. Dataset

The extraction of image features can be a lengthy process, especially if the collection is very large. Therefore, we decided to use an existing dataset, CoPhIR (Batko et al., 2010), which is a collection of 100 million images from the photo-sharing system Flickr.6 This collection was crawled and the features extracted over a period of 18 months using 35 machines from the EGEE European GRID. Each image in the collection is represented by five MPEG-7 descriptors that cover the color, texture and layout of images as numeric vectors. The similarity between images is then defined as a weighted sum of the metric functions defined for each descriptor – three descriptors use weighted L1 or L2 distances, the other two are more complex (Batko et al., 2009); a schematic sketch of such a weighted aggregation is given below, after the interface description. Since the resulting similarity function satisfies the metric postulates (Section 3.1), we can build a distributed M-Index structure over such data.

5.1.2. Queries

The query-by-example paradigm in our application is implemented by (approximate) kNN(q, k) queries, where q represents the features extracted from the user-supplied image (example) and k is the number of images displayed back to the user as an answer. A certain level of approximation is tolerable, since image similarity is highly subjective and thus missing some images (especially the more distant ones) is not a problem. Since multiple users can access the system at the same time, the queries should be served in parallel, utilizing the available resources as much as possible in order to keep the response times low.

5.1.3. Hardware

The application works with 100 million images or, in other words, half a billion descriptors, which can be kept either in main memory or on a disk. The five descriptors of each image require about 1 kB on disk and 1.8 kB in memory (the increase is caused by decompression and by the overhead of encapsulating the data in processable objects), which means that about 100 GB of disk storage or nearly 180 GB of RAM is needed. Some additional space is also required to store the M-Index control data – the cluster tree and the routing tables – but it is a negligible fraction of the data space (about 300 MB). During a search, the index structure needs to access the data (this incurs the I/O costs) and do the necessary computations (most time is spent evaluating the distance functions). Thus, the search response times are influenced heavily by the speed of the storage (disks or main memory) and the number of CPUs.

5.1.4. Interface

The user interface of our application is represented by a Web page where a user can choose a query image and run the search (see Fig. 12). In general, the search procedure is the following:
1. A user provides an example image for which the similar images should be retrieved. It can be picked randomly, selected from the previous search result, searched for using keywords, or given explicitly by uploading an image.
2. The image features (five MPEG-7 descriptors in our case) are extracted – if the image is already in the database, we can skip this step.
3. A peer of the M-Index structure is contacted and a kNN(q, k) query is executed. The query returns the identifiers of the images in the database and their distances from the query image q.
4. Finally, a Web page is generated showing thumbnail images for the returned identifiers.
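As announced in Section 5.1.1, the following is a hedged sketch of a weighted aggregation of per-descriptor distances. The descriptor names follow the five MPEG-7 features used by CoPhIR, but the concrete weights and the use of the L1 distance for every descriptor are simplifying assumptions of this sketch; the real CoPhIR aggregation function differs in both respects (Batko et al., 2009).

import java.util.Map;

// Hedged illustration of a weighted-sum similarity measure over several descriptors.
public class AggregateDistanceSketch {

    /** An image is represented by one numeric vector per descriptor. */
    record Image(Map<String, double[]> descriptors) {}

    // Hypothetical weights; the real CoPhIR aggregation uses its own weighting and metrics.
    static final Map<String, Double> WEIGHTS = Map.of(
            "ScalableColor", 2.0, "ColorStructure", 3.0, "ColorLayout", 2.0,
            "EdgeHistogram", 4.0, "HomogeneousTexture", 0.5);

    static double l1(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Weighted sum of per-descriptor distances; a metric as long as every component is a metric. */
    static double distance(Image x, Image y) {
        double total = 0;
        for (Map.Entry<String, Double> e : WEIGHTS.entrySet()) {
            total += e.getValue() * l1(x.descriptors().get(e.getKey()), y.descriptors().get(e.getKey()));
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy two-dimensional descriptors only, to keep the example self-contained.
        Image a = new Image(Map.of(
                "ScalableColor", new double[]{0.1, 0.2}, "ColorStructure", new double[]{0.3, 0.1},
                "ColorLayout", new double[]{0.0, 0.5}, "EdgeHistogram", new double[]{0.2, 0.2},
                "HomogeneousTexture", new double[]{0.4, 0.4}));
        Image b = new Image(Map.of(
                "ScalableColor", new double[]{0.2, 0.2}, "ColorStructure", new double[]{0.3, 0.3},
                "ColorLayout", new double[]{0.1, 0.5}, "EdgeHistogram", new double[]{0.2, 0.0},
                "HomogeneousTexture", new double[]{0.4, 0.1}));
        System.out.println("aggregated distance: " + distance(a, b));
    }
}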
In the following, we will focus on the search execution, since submitting user requests, feature extraction and Web page generation are beyond the scope of this paper.

5.1.5. M-Index setup

Considering the lessons learned from the experiments in Section 4 and the application requirements described above, we have decided to use the following M-Index settings: 50 peers, l = 2, n = 32 for the 10M collection; 500 peers, l = 2, n = 32 for the 100M collection. Both these networks use local indexes that are organized by separate dynamic M-Indexes with lmax = 6 and n = 32. Each peer organizes 200,000 objects on average.

5.2. Building the index structure

A basic approach to creating an index for a given collection, provided that the index is dynamic, is to insert the objects one-by-one into the index and let it adjust its internal structure. Indeed, this is possible with the M-Index and we will establish it as a baseline strategy.
6 http://flickr.com/.
Fig. 12. User interface of the demonstration application.
However, if the data collection is known in advance, we can preprocess the data and speed up the insertion significantly by parallelization.

Algorithm 3. Insert algorithm
Input: object o ∈ D
Output: insert confirmation
1   for i ← 0 to n-1 do
2       calculate d(p_i, o)
3       bind d(p_i, o) with o for pivot filtering
4   sort the pivots to find p_{(0)_o}, ..., p_{(n-1)_o}
    // for a static level or via the cluster tree
5   calculate key(o)
    // use the routing algorithm (Skip Graphs)
6   send o to the peer P responsible for key(o)
7   store object o in the local index of peer P
8   handle a split if the local storage of peer P overflows
9   confirm the object insertion
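To complement Algorithm 3, here is a simplified, hypothetical sketch of the key-based routing and the peer-split trigger; peers are modeled as plain key intervals, and the Skip Graphs overlay, the local M-Index and the actual data movement are omitted.

import java.util.List;

// Sketch of the insert path of Algorithm 3 (hypothetical names, not the MESSIF API).
public class InsertSketch {

    static final int SPLIT_THRESHOLD = 200_000;   // soft limit on objects per peer (see Section 5.2.1)

    static class Peer {
        double minKey, maxKey;                     // responsible key interval [minKey, maxKey)
        int storedObjects;

        Peer(double minKey, double maxKey) { this.minKey = minKey; this.maxKey = maxKey; }

        boolean isResponsibleFor(double key) { return key >= minKey && key < maxKey; }

        void store(double key) {
            storedObjects++;                       // a real peer would insert into its local M-Index
            if (storedObjects > SPLIT_THRESHOLD) {
                System.out.println("peer split needed for interval [" + minKey + ", " + maxKey + ")");
            }
        }
    }

    /** Routes the object (represented by its M-Index key) to the responsible peer and stores it. */
    static void insert(List<Peer> peers, double key) {
        for (Peer p : peers) {
            if (p.isResponsibleFor(key)) {
                p.store(key);
                return;
            }
        }
        throw new IllegalStateException("no peer responsible for key " + key);
    }

    public static void main(String[] args) {
        List<Peer> peers = List.of(new Peer(0, 500), new Peer(500, 1024));
        insert(peers, 3.2);       // goes to the first peer
        insert(peers, 700.75);    // goes to the second peer
        System.out.println("peer 0 stores " + peers.get(0).storedObjects + " object(s)");
    }
}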
Table 3. Building costs of a distributed M-Index for 100M images using the peer-to-peer and bulk-insertion techniques.

Method                      | Peers | Time        | # of objects per peer (thousands) | # of clusters per peer
Peer-to-peer insert         | 587   | 13 h 12 min | 120–210                           | 1–17, avg. 1.6
Single machine bulk-insert  | 500   | 13 h 57 min | 180–220                           | 1–26, avg. 1.8
GRID bulk-insert            | 500   | 3 h 29 min  | 180–220                           | 1–26, avg. 1.8
Algorithm 4. Bulk-insert algorithm (static level)
Input: data collection X ⊆ D, pivots p_0, ..., p_{n-1}, level l
Output: peer-to-peer network with the data
1   for o ∈ X do
2       for i ← 0 to n-1 do
3           calculate d(p_i, o)
4           bind d(p_i, o) with o for pivot filtering
5       sort the pivots to find p_{(0)_o}, ..., p_{(n-1)_o}
6       calculate key_l(o)
7   split X into clusters C_{i_0,...,i_{l-1}}
8   Q ← all clusters sorted by cluster(C)
9   for i ← 1 to #peers do
10      while #objects(P_i) < (#objects(X) / #peers) do
            // use a cluster part if the size is too big
11          assign dequeue(Q) to peer P_i
12  build the peer-to-peer network
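The preprocessing phase of Algorithm 4 can be sketched as follows: keys are computed in parallel, objects are grouped by their cluster number, and whole clusters are assigned to peers in increasing cluster(C) order so that each peer receives a contiguous key interval. The splitting of oversized clusters is omitted and all names are our own illustration helpers, not the authors' code.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Simplified sketch of the bulk-insert preprocessing of Algorithm 4 (hypothetical names).
public class BulkInsertSketch {

    record KeyedObject(long clusterNumber, double key, Object data) {}

    /** Groups the keyed objects by cluster number, ordered by increasing cluster(C). */
    static TreeMap<Long, List<KeyedObject>> groupByCluster(List<KeyedObject> objects) {
        return objects.parallelStream().collect(Collectors.groupingBy(
                KeyedObject::clusterNumber, TreeMap::new, Collectors.toList()));
    }

    /** Assigns whole clusters to peers; each peer ends up with a contiguous range of keys. */
    static List<List<KeyedObject>> assignToPeers(TreeMap<Long, List<KeyedObject>> clusters,
                                                 int peerCount, int totalObjects) {
        int target = totalObjects / peerCount;
        List<List<KeyedObject>> peers = new ArrayList<>();
        List<KeyedObject> current = new ArrayList<>();
        for (List<KeyedObject> cluster : clusters.values()) {
            current.addAll(cluster);              // a real implementation would split oversized clusters
            if (current.size() >= target && peers.size() < peerCount - 1) {
                peers.add(current);
                current = new ArrayList<>();
            }
        }
        peers.add(current);                        // the last peer takes the remainder
        return peers;
    }

    public static void main(String[] args) {
        List<KeyedObject> data = List.of(
                new KeyedObject(3, 3.2, "a"), new KeyedObject(3, 3.4, "b"),
                new KeyedObject(7, 7.1, "c"), new KeyedObject(12, 12.9, "d"));
        List<List<KeyedObject>> peers = assignToPeers(groupByCluster(data), 2, data.size());
        System.out.println("peer 0 gets " + peers.get(0).size() + " objects, peer 1 gets "
                + peers.get(1).size());
    }
}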
6 calculate keyl(o); 7 split X into clusters C i0 ;;il1 ; 8 Q all clusters sorted by cluster(C); 9 for i 1 to #peers do 10 while #objects(Pi) < (#objects(X)/#peers) do // use cluster part if the size is too big 11 assign dequeue(Q) to peer Pi; 12 build peer-to-peer network; The procedure for inserting a single object into the index is formalized by Algorithm 3. We start the insertion of object o at any peer by executing the insert algorithm. The algorithm first evaluates the distances to pivots p0, ,pn1 and computes the M-Index key(o). Then the routing table of the underlying peer-to-peer network is consulted and the object is sent to the peer responsible for key(o). There, the object is stored into the peer’s local index. We keep a soft threshold on the number of objects that can be kept per peer. When this threshold is reached and a new peer is available, we move half of the data kept by the overloaded peer to the new one. The peers respect the cluster boundaries, so the adjacent clusters from the end of the peer’s key interval are moved. If there is a cluster that is too big to be moved as a whole, this cluster is divided into two parts moving the created part to the new peer. We refer to this method as a peer split. Observe that, given a static level l and a set of pivots p0, ,pn1, we can compute the M-Index key(o) for any object o 2 X, provided that the dataset X is known in advance, and determine its target cluster (see Section 3.2). Since all these computations are independent, the task of computing keys and dividing objects into clusters can be easily run in parallel. In this case, each parallel task processes a part of X and provides the objects clustered according to their key(o). Then, the respective cluster parts from all the parallel tasks are merged and the clusters are divided among a given number of peers, so that the number of objects in each peer is more-or-less equivalent. Note that clusters C are assigned in the order of their cluster(C) numbers, so the peer is always responsible for a whole interval of keys. Finally, the peer-to-peer network is created – each peer initializes its local index from the assigned clusters independently in parallel. The procedure is schematically depicted by Algorithm 4 and we will refer to it as the bulk-insert. This algorithm is also a typical candidate for stream-based parallel processing in the MapReduce fashion (Dean et al., 2008). 5.2.1. Building costs In this section, we summarize the costs of building the distributed M-Index for the settings specified in Section 5.1. Table 3 shows the costs of building the index by (1) inserting the objects one-by-one from one peer using an infrastructure of 6 computers with 8 CPUs each, (2) bulk-inserting the objects using a single machine with 8 CPUs, and (3) bulk-insertion on a GRID infrastructure with 42 CPUs. We applied a split threshold of 200,000 objects per peer with a tolerance of 20,000 objects used for cluster boundaries adjustments. In the first scenario, the peer-to-peer network was growing spontaneously by peer splits whenever the threshold was exceeded and a new peer was always available to handle the split. The objects were inserted into the network as fast as possible In the second and third scenarios, the described HW infrastructure was exploited as much as possible with an exception of one sequential phase – chopping of the sorted data domain among the peers (lines 8–11 of Algorithm 4). 
As expected, building the network by bulk-insertion in the highly parallel environment of the GRID infrastructure was by far the fastest. On the other hand, the GRID job control, the need to transfer the data between GRID nodes, and the resubmission of a small percentage of interrupted jobs resulted in a time slightly higher than 8/42 of the building costs required by the 8 CPU machine. What is rather interesting is that a single machine, when bulk-inserting the network, finished the job in a time similar to that of the eight-times more powerful cluster of computers. We can also observe that the preprocessing allowed us to build a more compact network: the peers created by the peer-to-peer insert approach hold a little less data per peer, which resulted in a higher number of peers. The M-Indexes for the experiments in the next section, as well as for the experiments in Section 4, were prepared by the bulk-insert technique.
5.3. Searching the index

Having the index prepared, we can measure the search performance of the distributed M-Index on different hardware infrastructures. As described in Section 5.1, the demonstration application uses approximate kNN(q, k) queries when searching for similar images. Therefore, we fix the query settings that give the best cost-effectiveness ratio as experimentally discovered in Section 4. Specifically, we set local = 6,000, c = 10, and k = 30. We will use two distributed M-Indexes, one with the full 100 million collection and a smaller one with a subset of 10 million images. Both indexes are built using the bulk-insertion, the former having 500 peers and the latter 50 peers. We map the peer-to-peer network to a specific hardware configuration and run a batch of 1000 queries in each experiment.

5.3.1. Hardware and measurements

We have two hardware architectures available for the experiments: a computing-server machine and a cluster of back-end servers. The computing server has more memory and a higher number of CPUs, but its disks are slower. The cluster consists of middle-class servers with less memory and fewer CPUs, but each has a powerful disk array with 4 ms seek times and transfer rates of nearly 400 MB per second. All the machines run a GNU/Linux operating system with a server Java virtual machine on which the M-Index software runs. The specific hardware configurations are summarized in Table 4.

In each experiment, we simulate the behavior of 50 users that are simultaneously executing a batch of 20 queries (different for each user). This is achieved by executing a batch from any 50 peers at a pre-set time. We measure the response time of each query as well as the time of the whole experiment. The overall time allows us to compute the throughput of the system, i.e. the number of queries answered per second.

5.3.2. Experiment results

Fig. 13 reports on the throughput (left) and query times (right) for the 10M collection. These experiments were performed on the single Sun Fire server and we present two results in the graphs – one for the M-Index kept in memory and the other for a disk-based storage. We varied the number of CPUs on which the M-Index was mapped. We can observe that the system with one CPU only can serve 0.6 queries per second. Also, the computational load incurred by the 50 queries executed in parallel while competing for the single CPU resulted in response times of 15 s on average, and we have observed a maximal response time of nearly 42 s. On the other hand, the 16 CPU variant achieved a throughput of 8 queries per second while queries were answered in 1 s on average. The total running time of the whole batch was 1660 s using one CPU and 124 s for 16 CPUs. We can see that for the memory-based storage, the M-Index achieves a nearly linear speed-up – with 16 times more computational resources we achieve a 13.4 times higher throughput. For the disk-based index, the queries are also blocked by disk I/O, which resulted in lower throughput and higher response times. Moreover, the disk subsystem started to become a bottleneck for experiments with more than 8 CPUs, and we can observe that the improvements in throughput and response times slow down.

Similar characteristics are presented in Fig. 14 for the 100M collection. In this case, the experiments were performed on the cluster of servers (as specified in Section 5.3.1). However, even the total of 120 GB RAM of all machines in the cluster cannot hold the whole 100M index in memory (180 GB would be required).
For this reason, we do not provide results for the memory-based 100M collection but rather for the disk-based index in which the necessary parts of the disk were in a memory cache (this was done by enabling the disk caches and running the experiment twice for the same data).
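Each of these approximate kNN queries is, in essence, evaluated in a scatter-gather fashion: the initiating peer contacts the peers covering the most promising clusters and merges their partial answers into the final k best objects. The generic sketch below shows only this merging step; it is a simplified stand-in, not the exact M-Index approximation strategy of Section 4, and the Hit record and the per-peer Callable tasks are hypothetical.

import java.util.*;
import java.util.concurrent.*;

final class ScatterGatherKnn {
    record Hit(String objectId, double distance) {}

    /** Queries every contacted peer in parallel and keeps the k closest objects overall. */
    static List<Hit> approximateKnn(List<Callable<List<Hit>>> peerQueries, int k)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(peerQueries.size());
        try {
            // max-heap keyed by distance, so the currently farthest hit can be dropped
            PriorityQueue<Hit> best = new PriorityQueue<>(
                    Comparator.comparingDouble((Hit h) -> h.distance()).reversed());
            for (Future<List<Hit>> f : pool.invokeAll(peerQueries))
                for (Hit h : f.get()) {
                    best.add(h);
                    if (best.size() > k) best.poll();
                }
            List<Hit> result = new ArrayList<>(best);
            result.sort(Comparator.comparingDouble((Hit h) -> h.distance()));
            return result;                                  // k nearest, closest first
        } finally {
            pool.shutdown();
        }
    }
}

With the settings used in these experiments, one Callable per contacted peer would be issued and k = 30 hits returned to the user.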
Table 4
Hardware architectures for experiments.

Configuration          # Machines   CPUs per machine   RAM     Disks
Sun Fire 4600 server   1            16 cores           64 GB   3 × 70 GB 7k SATA (striping)
IBM 3400 cluster       6            8 cores            20 GB   5 × 72 GB 15k SAS (RAID5)
[Figure: left panel – throughput (queries per second) vs. #CPUs; right panel – response time [s] vs. #CPUs; curves: 10M memory, 10M disk array.]
Fig. 13. Throughput and response times of 1000 approximate 30NN queries run on 10M collection.
[Figure: left panel – throughput (queries per second) vs. #CPUs (6–48); right panel – response time [s] vs. #CPUs (6–48); curves: 100M cached, 100M disk array.]
Fig. 14. Throughput and response times of 1000 approximate 30NN queries run on 100M collection.
On the other hand, we have switched off the caching for the disk-only experiments, so the data was actually read from the disk. Compared to the 10M experiment, the disk arrays were able to supply the data faster, and their influence shows only as a slightly lower throughput when more CPUs were used. We can also observe that even though the collection size increased from 10M to 100M images, the results for the same amount of resources (i.e. from 6 to 16 CPUs) exhibit similar response times and throughput. For example, for 12 CPUs the 10M index had a 1.3 s average response time and 6.8 queries per second, while the 100M index showed a 1.8 s query time and a throughput of five queries per second. Furthermore, we can see that the M-Index exhibits a nearly linear speed-up as the computational resources grow, provided the data is cached. On the other hand, even when the data was actually read from the disk, the queries were served in a reasonable time suitable for an on-line Web application, and the throughput of nearly nine simultaneous queries per second is acceptable too.

Note that in all cases, the response times were measured on highly loaded systems with no spare resources. Therefore, some of the queries were actually starving, especially in configurations with lower numbers of CPUs. The response times for an idle 10M network on a machine with 2 CPUs are 1 s on average, and the idle 100M network handles a single query in 600 ms using 6 CPUs and in 300 ms on an 18 CPU configuration.

Our experiments have revealed the behavior of our index structure in a real application environment. We have confirmed that there is a nearly linear correlation between the number of CPUs and the throughput when the index is in memory. This also holds for a disk-based index up to the point where the disks become a bottleneck. However, we can tune these parameters by mapping the index structure to suitable hardware – for example, to overcome the disk bottleneck problem, mirroring the disks instead of using RAID5 can be applied.
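For completeness, the measurement methodology of Section 5.3.1 can be mimicked by a small load driver like the one below: 50 simulated users are released at a common start time, each fires its batch of 20 queries, per-query response times are recorded, and the throughput is derived from the wall-clock time of the whole run. The SearchClient interface is a hypothetical stand-in for the demonstration application's query entry point, not part of the M-Index code base.

import java.util.*;
import java.util.concurrent.*;

final class LoadTestSketch {
    interface SearchClient { void approx30NN(String queryId); }   // assumed query call

    static void runExperiment(SearchClient client, int users, int queriesPerUser)
            throws InterruptedException {
        List<Long> responseTimesMs = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(users);
        CountDownLatch start = new CountDownLatch(1);              // common pre-set start time
        for (int u = 0; u < users; u++) {
            final int user = u;
            pool.execute(() -> {
                try {
                    start.await();
                    for (int q = 0; q < queriesPerUser; q++) {
                        long t0 = System.nanoTime();
                        client.approx30NN("user" + user + "-query" + q);
                        responseTimesMs.add((System.nanoTime() - t0) / 1_000_000);
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        long begin = System.nanoTime();
        start.countDown();                                          // all simulated users start together
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        double totalSec = (System.nanoTime() - begin) / 1e9;
        double avgMs = responseTimesMs.stream().mapToLong(Long::longValue).average().orElse(0);
        System.out.printf("throughput: %.2f queries/s, avg response: %.0f ms%n",
                          users * queriesPerUser / totalSec, avgMs);
    }
}

Invoked as runExperiment(client, 50, 20), the driver issues the 1000-query batches used in the experiments above and reports the two quantities plotted in Figs. 13 and 14.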
6. Conclusions

We have presented a distributed index structure for similarity data management called the Metric Index (M-Index). We have designed algorithms for efficient data management and similarity searching in both a precise and an approximate manner – range and k-nearest-neighbor queries were considered. The technique can take advantage of any distributed hash table that supports interval queries, such as the structured peer-to-peer networks P-Grid or Skip Graphs. We have performed numerous experiments to test various settings of the M-Index structure and proved its ability to scale well by showing results of both precise and approximate search on collections of one, ten, and one hundred million objects with a complex data type. The overall costs for precise searching grow slightly sublinearly and, for the first time, precise nearest-neighbor queries were evaluated on a collection of half a billion high-dimensional vectors. Moreover, we have shown that the designed approximation strategy provides practically stable recall even if the indexed data collection grows by two orders of magnitude.

Finally, we have proved the usability of the M-Index technique by developing a full-featured, publicly available Web application. This system can search in real time for images similar to a given example in a collection of 100 million photos from the Flickr portal. We have shown response times and throughput characteristics of this application when running on various hardware infrastructures. The ability to map the once-built index to different hardware can be used to tune the performance of the search system and thus customize the application for various environments.
Acknowledgements This work has been supported by national research projects VF20102014004, GACR 201/09/0683, GACR 103/10/0886, GACR P202/10/P220, and MSMT 1M0545. The hardware infrastructure was provided by the METACentrum under the research intent MSM6383917201.
References

Aberer, K. (2001). P-Grid: A self-organizing access structure for P2P information systems. Lecture Notes in Computer Science, 2172, 179–194.
Amato, G., & Savino, P. (2008). Approximate similarity search in metric spaces using inverted files. In InfoScale '08: Proceedings of the 3rd international conference on scalable information systems (pp. 1–10). Brussels, Belgium: ICST.
Aspnes, J., & Shah, G. (2003). Skip graphs. In SODA '03: Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms (pp. 384–393). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.
Batko, M., Falchi, F., Lucchese, C., Novak, D., Perego, R., Rabitti, F., et al. (2010). Building a Web-scale image similarity search system. Multimedia Tools and Applications, 47, 599–629.
Batko, M., Gennaro, C., & Zezula, P. (2005). Similarity grid for searching in metric spaces. In DELOS workshop: Digital library architectures. Lecture notes in computer science (Vol. 3664, pp. 25–44).
Batko, M., Kohoutkova, P., & Novak, D. (2009). CoPhIR image collection under the microscope. In SISAP '09: Proceedings of the 2009 second international workshop on similarity search and applications (pp. 47–54). Washington, DC, USA: IEEE Computer Society.
Batko, M., Novak, D., Falchi, F., & Zezula, P. (2006). On scalability of the similarity search in the world of peers. In Proceedings of the first international conference on scalable information systems (INFOSCALE '06), Hong Kong, May 30–June 1 (pp. 1–12). New York, NY, USA: ACM Press.
Batko, M., Novak, D., Falchi, F., & Zezula, P. (2008). Scalability comparison of peer-to-peer similarity search structures. Future Generation Computer Systems, 24(8), 834–848.
Batko, M., Novak, D., & Zezula, P. (2007). MESSIF: Metric similarity search implementation framework. In First international DELOS conference, Pisa, Italy, revised selected papers. Lecture notes in computer science (Vol. 4877, pp. 1–10). Springer.
Bawa, M., Condie, T., & Ganesan, P. (2005). LSH forest: Self-tuning indexes for similarity search. In WWW '05: Proceedings of the 14th international conference on World Wide Web (pp. 651–660). New York, NY, USA: ACM.
Bolettieri, P., Esuli, A., Falchi, F., Lucchese, C., Perego, R., Piccioli, T., et al. (2009). CoPhIR: A test collection for content-based image retrieval. CoRR abs/0905.4627v2.
Chávez, E., Figueroa, K., & Navarro, G. (2008). Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9), 1647–1658.
Chávez, E., & Navarro, G. (2000). Measuring the dimensionality of general metric spaces. Tech. Rep. TR/DCC-00-1, Department of Computer Science, University of Chile.
Ciaccia, P., Patella, M., & Zezula, P. (1997). M-Tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd international conference on very large data bases (pp. 426–435), August 25–29, 1997, Athens, Greece.
Crainiceanu, A., Linga, P., Gehrke, J., & Shanmugasundaram, J. (2004). Querying peer-to-peer networks using P-Trees. In WebDB '04: Proceedings of the 7th international workshop on the web and databases (pp. 25–30). New York, NY, USA: ACM Press.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Doulkeridis, C., Vlachou, A., Kotidis, Y., & Vazirgiannis, M. (2007). Peer-to-peer similarity search in metric spaces. In VLDB 2007: 33rd international conference on very large data bases (pp. 986–997), September 23–27, 2007, University of Vienna, Austria: ACM.
Esuli, A. (2009a). MiPai: Using the PP-Index to build an efficient and scalable similarity search system. In Second international workshop on similarity search and applications (SISAP '09) (pp. 146–148).
Esuli, A. (2009b). PP-Index: Using permutation prefixes for efficient and scalable approximate similarity search. In Proceedings of the 7th workshop on large-scale distributed systems for information retrieval (LSDS-IR '09) (pp. 17–24).
Fagin, R., Nievergelt, J., Pippenger, N., & Strong, H. R. (1979). Extendible hashing – A fast access method for dynamic files. ACM Transactions on Database Systems, 4(3), 315–344.
Falchi, F., Gennaro, C., & Zezula, P. (2007). A content-addressable network for similarity search in metric spaces. In Databases, information systems, and peer-to-peer computing, international workshops, DBISP2P 2005/2006, Trondheim, Norway, August 28–29, 2005, Seoul, Korea, September 11, 2006, revised selected papers. Lecture notes in computer science (Vol. 4125, pp. 98–110). Springer.
Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. In VLDB '99: Proceedings of the 25th international conference on very large data bases (pp. 518–529). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Lv, Q., Josephson, W., Wang, Z., Charikar, M., & Li, K. (2007). Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB '07: Proceedings of the 33rd international conference on very large data bases (pp. 950–961). VLDB Endowment.
MPEG-7 (2002). Multimedia content description interfaces. Part 3: Visual. ISO/IEC 15938-3:2002.
Novak, D., Batko, M., & Zezula, P. (2008). Web-scale system for image similarity search: When the dreams are coming true. In Proceedings of the sixth international workshop on content-based multimedia indexing (CBMI 2008) (p. 8). IEEE.
Novak, D., Batko, M., & Zezula, P. (2009). Generic similarity search engine demonstrated by an image retrieval application. In Proceedings of the 32nd annual international ACM SIGIR conference on research and development in information retrieval (p. 840).
Novak, D., Batko, M., & Zezula, P. (in press). Metric Index: An efficient and scalable solution for precise and approximate similarity search. Information Systems. doi:10.1016/j.is.2010.10.002.
Novak, D., Kyselak, M., & Zezula, P. (2010). On locality-sensitive indexing in generic metric spaces. In Proceedings of the third international conference on similarity search and applications (SISAP '10) (pp. 59–66). New York, NY, USA: ACM.
Novak, D., & Zezula, P. (2006). M-Chord: A scalable distributed similarity search structure. In Proceedings of the first international conference on scalable information systems (INFOSCALE 2006), Hong Kong, May 30–June 1, 2006 (pp. 1–10). New York, NY, USA: ACM Press.
Samet, H. (2005). Foundations of multidimensional and metric data structures. Computer graphics and geometric modeling. San Francisco, CA, USA: Morgan Kaufmann Publishers.
Skala, M. (2009). Counting distance permutations. Journal of Discrete Algorithms, 7(1), 49–61.
Zezula, P., Amato, G., Dohnal, V., & Batko, M. (2006). Similarity search: The metric space approach. Advances in database systems (Vol. 32). Springer.
Zezula, P., Savino, P., Amato, G., & Rabitti, F. (1998). Approximate similarity retrieval with M-Trees. The VLDB Journal, 7(4), 275–293.