Multidimensional Descriptor Indexing: Exploring the BitMatrix
Catalin Calistru 1,2, Cristina Ribeiro 1,2, Gabriel David 1,2
1 FEUP, Faculdade de Engenharia da Universidade do Porto
2 INESC Porto
Rua Dr. Roberto Frias s/n, 4200-465 Porto, Portugal
[email protected],
[email protected],
[email protected]
Abstract. Multimedia retrieval brings new challenges, mainly derived from the mismatch between the level of the user interaction (high-level concepts) and that of the automatically processed descriptors (low-level features). The effective use of the low-level descriptors is therefore mandatory. Many data structures have been proposed for managing the representation of multidimensional descriptors, each geared toward efficiency in some set of basic operations. The paper introduces a highly parametrizable structure called the BitMatrix, along with its search algorithms. The BitMatrix is compared with existing methods, all implemented in a common framework. The tests have been performed on two datasets, with parameters covering significant ranges of values. The BitMatrix has proved to be a robust and flexible structure that can compete with other methods for multidimensional descriptor indexing.
1 Introduction
People need to automatically search on multimedia objects. The retrieval task is characterized by the specification of user queries and the selection of appropriate objects by the system. While textual data allows an easy identification of high-level concepts, images, for instance, do not directly provide such concepts, even if they are subject to state-of-the-art analysis. Multimedia data has to conform to convenient models in order to be used in retrieval. Extraction techniques are used to analyze the streams, generating higher-level representations of features such as color or texture, in the form of multidimensional descriptors. Descriptors constitute a possible search space on which similarity calculations are required [1]. We concentrate here on the specific task of retrieving objects that satisfy some (sharp) similarity criterion in a database of objects represented by multidimensional descriptors. A BitMatrix is proposed along with methods for searching according to similarity criteria. The BitMatrix is compared with other approaches using an extensible test platform.
Partially supported by FCT under project POSC/EIA/61109/2004 (DOMIR)
2 Multimedia Object Models
Multimedia objects typically have a complex structure. It is common for objects to encapsulate parts in different media, to have references to components they do not directly include, and to have complex relationships among them. This is reflected in the models for multimedia objects adopted in recent standards such as MPEG-7 [2]. The goal of our work is to build systems capable of managing structured multimedia objects and offer retrieval functions that range from text-based to structure and content-based browsing and querying. We are using an operational model accounting for the structure of current standards [3] and need to accommodate various descriptors in a flexible way. As computing power is cheap, it is viable to have image and video processing algorithms analyzing object features and generating large amounts of descriptor data. It is therefore crucial to adopt descriptor representations and indexing methods that, besides accommodating the expected diversity of descriptors, can be effective in the basic retrieval operations. Moreover, as similarity search is a core task for these descriptors, metrics are required and they must be fit for the nature of descriptors. A flexible model for metrics is likely to require a fine-grained representation of the descriptor values and their types.
Multidimensional indexing requires assumptions on the nature of data and the algorithms to search it. Several requirements from the application domain may condition the choice of the indexing method. A common requirement is that updates to the object database are allowed. This will lead to indexing methods able to incrementally update their data structures. Another important aspect is the ability to add new descriptors of varying dimensionality. To meet this requirement, it is necessary to allow varying dimensionality in descriptors, and again to be able to extend the indexing structure piecemeal.
Handling a large number of multidimensional descriptors may be inconvenient in some parts of the retrieval process. Being able to search on a chosen subset is therefore a desirable feature. Metrics can take many forms, and the choice of metric is also important from the point of view of flexibility of the retrieval process.
3 Multidimensional Indexing Methods
The similarity between objects is evaluated with a comparison between their representations. Each object $o$ from the universe of objects $O$ is characterized by a set $F$ of features (color, motion activity, ...), where each feature $f_i$ is captured as a feature vector $v_i$. In the resulting vector space, with dimensionality $N$, the similarity between objects is given by a similarity measure which is not necessarily a metric distance. There are, however, distance-only datasets, for which the information available is the distance between the objects. Given a query object $q$ and a metric function $d(o_x, o_y)$, the search task amounts to finding a particular set of objects. Possible result sets are the Nearest Neighbor Set $NN(q)$, the set of objects $o$ such that $\forall v \in O,\ d(q, o) \le d(q, v)$; the k-Nearest Neighbors Set $NN_k(q)$, a set of $k$ elements closest to $q$ in $O$, i.e. objects $o$ such that $\forall v \in O,\ d(q, v) < d(q, o) \rightarrow v \in NN_k(q)$; and the approximate Nearest Neighbor Set $NN_A(q)$, a set of objects $o$ such that $d(q, o) \le (1 + \epsilon) \cdot d(q, NN(q))$ for some $\epsilon > 0$.
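The three result sets can be illustrated with a plain sequential scan. The following is only a sketch: the Euclidean metric, the function names and the four sample points are assumptions made for the example and do not come from the paper.

```python
import math

def euclidean(x, y):
    """Euclidean distance, used here as the metric d (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nn_k(query, objects, k, d=euclidean):
    """Ids of the k objects closest to the query, found by sequential scan."""
    ranked = sorted(objects, key=lambda oid: d(query, objects[oid]))
    return ranked[:k]

def nn_approx(query, objects, eps, d=euclidean):
    """Ids of all objects within (1 + eps) times the nearest-neighbor distance."""
    best = min(d(query, v) for v in objects.values())
    return [oid for oid, v in objects.items() if d(query, v) <= (1 + eps) * best]

# Hypothetical two-dimensional objects and query.
objects = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.1, 0.0), "d": (5.0, 5.0)}
query = (2.0, 0.0)

print(nn_k(query, objects, 1))         # ['c']
print(nn_k(query, objects, 2))         # ['c', 'b']
print(nn_approx(query, objects, 0.2))  # ['b', 'c']
```

With $\epsilon = 0.2$ the approximate set accepts both b and c, because b lies within 1.2 times the distance of the true nearest neighbor c.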
Generally, the high-dimensional indexing methods divide the search space in a set of ranges with the goal of pruning them at search time. The remaining partitions have to be exhaustively scanned. A great diversity of indexing methods emerged from the differences in the models of the underlying search space (vectorial or metric), partitioning strategies and similarity measures.
Spatial Access Methods (SAM) are based on a tree data structure with the data nodes (the leaves) grouped in directory nodes. The partitioning strategies of the SAMs can be data partitioning (DP), which uses minimum bounding regions (MBR) such as R-tree, R*-tree, X-tree, bounding spheres such as SS-tree, MBR and bounding spheres such as SR-tree, generic minimum bounding regions (hyper-rectangle, cube, sphere) such as TV-tree, and space partitioning (SP) methods such as the kDB-tree, Hybrid-tree, SH-tree [4].
Metric Access Methods (MAM) are distance-based indexing methods that work with relative distances between the objects rather than their absolute positions. MAMs are also based on tree structures that recursively partition the data set into subsets (ball partitioning or generalized hyperplane partitioning) at each node level [5]. The applicability of such methods ranges from native distance-only datasets to high-dimensional datasets for which conventional SAMs are no longer efficient [6].
Single-Dimensional Mapping techniques map the points in the high-dimensional space to single-dimensional values for which efficient techniques exist. In [7] a spatial data partitioning (DP) technique is applied, followed by a single-dimensional mapping technique within each partition. The mapping process consists of sorting the objects in each partition on the distance to a specific reference point (such as the center). The reference points are then indexed in a B+-tree structure.
Aggregation algorithms treat each dimension as a separate list, and their goal is to obtain the result of the query by accessing a minimum number of lists and as few objects as possible in each of the visited lists. These methods operate in middleware systems [8], like Fagin's Algorithm, the Threshold Algorithm and Quick-Combine [9], or directly on the original data, like BOND [10].
Data Approximation Structures make the assumption that a sequential scan is inevitable [11] and construct a vector of approximations (VA-file), significantly smaller than the original data. Each dimension $D$ is divided in $k_D$ ranges, obtaining a grid of approximations. In the first step the approximations are pruned based on their minimum/maximum distance to the query point. For the remaining approximations the corresponding exact data points have to be analyzed. Using a similar grid of ranges the IGrid [12] maintains separate lists for the objects in each range. The similarity between any two objects uses only the set of dimensions for which the two objects lie in the same range (the proximity set). The number of objects that are accessed is kept small as dimensionality increases, at the cost of a large storage overhead (100%) and significant edge-effects. Bitmap variants have been proposed [13,14].
As dimensionality increases, the distances from the query object to the nearest and the farthest objects are harder to distinguish [15]. In such conditions, a query region that includes the nearest neighbor will overlap most of the other partitions, dramatically decreasing selectivity. With low selectivity, the search methods end up accessing all the nodes of their structures, which means pseudo-random access to all the objects in the dataset. Sequential scan therefore becomes a robust competitor. Recent works in this area have established concepts like the meaningfulness [15,16], and the distinctiveness [17] of the retrieved objects in order to characterize the retrieval process in high-dimensional spaces.
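The loss of contrast between the nearest and the farthest object can be observed on synthetic data. The sketch below is an illustration under assumed settings (uniform random points, Euclidean distance, 200 points, a fixed seed); the function name `contrast` and the measure (d_max - d_min) / d_min are choices made for this example, not definitions taken from [15].

```python
import math
import random

def contrast(n_dims, n_points=200, seed=0):
    """Relative contrast (d_max - d_min) / d_min for one random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

low_dim = contrast(2)
high_dim = contrast(256)
# In 2 dimensions the farthest point is many times farther than the nearest;
# in 256 dimensions the two distances differ only by a small fraction.
print(low_dim > high_dim)  # True
```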
4 The BitMatrix-based methods
Given the high cost of random disk access as compared to sequential access, the idea is to construct a collection of signatures that can be sequentially analyzed and used to effectively prune the search space. The BitMatrix method follows a data approximation approach in the spirit of VA-File [11] and IGrid [12], partitioning each of the $N$ dimensions in $k$ ranges. A partition of a dimension $D$ is a set of ranges $\pi_D = \{r_i = [l_i, u_i],\ i = 1 \ldots k_D\}$, where $l_i$, $u_i$ are the lower and upper bounds of range $i$. The partitioning scheme $(k_1, k_2, \ldots, k_N)$ is used to obtain bitmap signatures for all the objects in a dataset $O$, arranged as lines in a matrix.
4.1 Building the BitMatrix
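[Figure 1: the ten objects o1, ..., o10 and the query q plotted in the two-dimensional space (D1, D2), each dimension split into three ranges (0, 1, 2), with the expansion thresholds et1 and et2 marked; together with the BitMatrix holding the object signatures, the query signatures q_naive and q_exp, and the cardinality of each object without (Naive) and with (Exp) range expansion.]

Object   o1  o2  o3  o4  o5  o6  o7  o8  o9  o10
Naive     2   1   2   1   1   0   1   1   0    1
Exp       2   2   2   1   1   1   1   1   1    2

Fig. 1. BitMatrix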
The first step in the construction of the BitMatrix is to choose a partitioning scheme such as equi-width, equi-depth or k-means partitioning. In the case of equi-width partitioning the ranges have the same length, while in the case of equi-depth the ranges contain an equal number of objects. The k-means partitioning requires a previous k-means clustering step, where $k$ clusters are identified and their centroids calculated and sorted. The bounds of the $k$ ranges are obtained from the centroids: the lower bound for range $i$ is $l_i = (C_{i-1} + C_i)/2$ and its upper bound is $u_i = (C_i + C_{i+1})/2$.
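The three partitioning schemes can be sketched for a single dimension as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the k-means clustering step itself is omitted (the sorted centroids are taken as given), and the outermost bounds `lo` and `hi` are an assumption, since the text does not say how $C_0$ and $C_{k+1}$ are chosen.

```python
def equi_width_bounds(values, k):
    """k ranges of equal length covering [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [(lo + i * step, lo + (i + 1) * step) for i in range(k)]

def equi_depth_bounds(values, k):
    """k ranges holding (nearly) the same number of values each."""
    s = sorted(values)
    n = len(s)
    cuts = [s[(i * n) // k] for i in range(1, k)]
    edges = [s[0]] + cuts + [s[-1]]
    return list(zip(edges[:-1], edges[1:]))

def centroid_bounds(centroids, lo, hi):
    """Range bounds at the midpoints between consecutive sorted centroids.

    lo and hi close the first and last range (assumed domain limits).
    """
    c = sorted(centroids)
    mids = [(a + b) / 2 for a, b in zip(c, c[1:])]
    edges = [lo] + mids + [hi]
    return list(zip(edges[:-1], edges[1:]))

print(equi_width_bounds([0.0, 3.0, 6.0, 9.0], 3))  # [(0.0, 3.0), (3.0, 6.0), (6.0, 9.0)]
print(equi_depth_bounds([1, 2, 3, 4, 5, 6], 3))    # [(1, 3), (3, 5), (5, 6)]
print(centroid_bounds([1.0, 4.0, 9.0], 0.0, 10.0)) # [(0.0, 2.5), (2.5, 6.5), (6.5, 10.0)]
```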
Definition 1 (Signature). Given a partitioning scheme $(k_1, k_2, \ldots, k_N)$, an object's signature is a bitmap of length $\sum_{D=1}^{N} k_D$. For each dimension the signature contains 1 for the range where the object belongs and 0 for the other ranges.
The cardinality of a signature is the number of bits set to 1. The example in Figure 1 has 2 dimensions ($N = 2$) and 3 ranges per dimension ($k_1 = k_2 = 3$). Object o2 has signature 001010 as the object lies in range 2 for dimension 1, and in range 1 for dimension 2. Arranging each signature as a line in a matrix we obtain the BitMatrix with $|O|$ lines and $\sum_{D=1}^{N} k_D$ columns.
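Definition 1 can be made concrete with a short sketch. The range bounds and the coordinates given to o2 below are hypothetical (Figure 1 does not list them); they are chosen only so that o2 falls in range 2 of dimension 1 and range 1 of dimension 2, as in the example. Ranges are treated as half-open intervals, which is also an assumption.

```python
def range_index(value, bounds):
    """Index of the range [l_i, u_i) that contains value (last range if beyond)."""
    for i, (_, upper) in enumerate(bounds):
        if value < upper:
            return i
    return len(bounds) - 1

def signature(obj, partitions):
    """Concatenate, per dimension, a one-hot bitmap marking the object's range."""
    bits = []
    for value, bounds in zip(obj, partitions):
        r = range_index(value, bounds)
        bits.extend(1 if i == r else 0 for i in range(len(bounds)))
    return bits

# Two dimensions, three ranges each (k1 = k2 = 3), hypothetical bounds.
partitions = [[(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]] * 2
o2 = (2.4, 1.2)  # hypothetical coordinates

sig = signature(o2, partitions)
print("".join(map(str, sig)))  # 001010
print(sum(sig))                # cardinality: 2
```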
4.2 Searching with the BitMatrix
We now propose two algorithms for approximate nearest neighbor search, exploring the sequential access to the BitMatrix. The naïve approach selects objects based on the cardinality of the bitwise AND between object and query signatures, as follows.

The naïve approach.
Step 1: Given a query object $q = [q_1, \ldots, q_N]$, obtain its signature, i.e. find for each dimension $D$ the range in which the query coordinate $q_D$ lies.
Step 2: Iterate through the objects, performing bitwise AND between their signatures and the query object's signature. If the cardinality of the resulting bitmap reaches a predefined cardinality threshold (ct) the object is retained for the next phase.
Step 3: Access the full vector values of the remaining objects, compute their exact distance to the query object and rank them.

We will use cardinality of an object in the sequel to refer to the cardinality of the bitmap resulting from the bitwise AND between the signatures for the object and the query. In Figure 1, the signature $q_{naive} = 010010$ of the query object is AND'ed with the signatures of all the other objects o1, o2, ..., o10. With the cardinality threshold set to 2 only objects o1 and o3 remain for Step 3.

The example above shows that the naïve approach prunes object o2, which happens to be the nearest neighbor. This effect, known as the edge-effect [12], appears because for all dimensions $D$, only the objects in the same range as $q_D$ are considered. The range expansion heuristic is a modification of the naïve approach affecting Step 1: for a dimension in which the query object is close enough to one of the edges, the query's signature is set to 1 for both the query's range and the range next to it. Assuming that $q_D$ lies in range $i$ for dimension $D$, the expansion takes place to the left if $\|q_D - l_i\| < et_i \|u_i - l_i\|$ or to the right if $\|q_D - u_i\| < et_i \|u_i - l_i\|$, where $et_i$, the expansion threshold, takes values in $[0, 0.5]$. The cardinalities column (with expansion) in Figure 1 shows that with the query signature $q_{exp} = 011010$ and the same cardinality threshold, objects o2 and o10 are not pruned, as their cardinalities are now 2.

Fig. 2. Subspace selection
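The naïve approach and the range expansion heuristic can be sketched together, since expansion only changes how the query signature is built. The code below is an illustration under stated assumptions: the three objects and the query coordinates are hypothetical (they are not the points of Figure 1), a single expansion threshold `et` is used for all ranges, an object is retained when its cardinality is at least `ct`, and Euclidean distance is used in Step 3.

```python
import math

def range_index(value, bounds):
    """Index of the range [l_i, u_i) that contains value (last range if beyond)."""
    for i, (_, upper) in enumerate(bounds):
        if value < upper:
            return i
    return len(bounds) - 1

def encode(vector, partitions, et=0.0):
    """Signature of a vector; with et > 0 a neighbouring range is also set
    when the coordinate lies within et * (u_i - l_i) of a range edge."""
    bits = []
    for value, bounds in zip(vector, partitions):
        k = len(bounds)
        i = range_index(value, bounds)
        lower, upper = bounds[i]
        width = upper - lower
        row = [0] * k
        row[i] = 1
        if i > 0 and abs(value - lower) < et * width:
            row[i - 1] = 1  # expand to the left
        if i < k - 1 and abs(value - upper) < et * width:
            row[i + 1] = 1  # expand to the right
        bits.extend(row)
    return bits

def search(q, objects, bitmatrix, partitions, ct, et=0.0):
    q_sig = encode(q, partitions, et)                           # Step 1
    candidates = [oid for oid, o_sig in bitmatrix.items()       # Step 2
                  if sum(a & b for a, b in zip(o_sig, q_sig)) >= ct]
    return sorted(candidates,                                   # Step 3
                  key=lambda oid: math.dist(q, objects[oid]))

partitions = [[(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]] * 2
objects = {"o1": (1.2, 1.8), "o2": (2.1, 1.4), "o3": (0.5, 2.5)}  # hypothetical
bitmatrix = {oid: encode(vec, partitions) for oid, vec in objects.items()}
q = (1.9, 1.5)

print("".join(map(str, encode(q, partitions))))          # 010010
print("".join(map(str, encode(q, partitions, et=0.2))))  # 011010
print(search(q, objects, bitmatrix, partitions, ct=2))          # ['o1']
print(search(q, objects, bitmatrix, partitions, ct=2, et=0.2))  # ['o2', 'o1']
```

In this toy data the nearest neighbor o2 sits just across a range edge from the query, so the naïve pass prunes it and the expanded query signature recovers it, mirroring the edge-effect discussed above.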
Subspace selection. The increase of dimensionality makes the task of finding the nearest neighbor harder because the distances between objects become very similar. For the majority of the high-dimensional search methods, the nearest neighbor becomes indistinguishable from the rest of the objects. In order to improve the quality of the nearest neighbor selection, we have considered the subspace selection approach, where subsets with smaller dimensionality are successively explored using the BitMatrix algorithms. Let $s$ be the number of dimensions of the subspace to be processed in the current iteration:

Step 1: Apply Step 1 and 2 of the naïve or expansion approaches on the selected $s$ dimensions ($\sum_{D=1}^{s} k_D$ columns of the BitMatrix).
Step 2: Combine the partial result set obtained in this iteration with the previous result set (intersection, union). If the stop condition is false repeat Step 1 on the next subspace.
Step 3: Same as Step 3 of the naïve approach.

The stop condition becomes true if enough dimensions have been processed or enough objects have been pruned. Figure 2 illustrates a space with 256 dimensions processed with $s = 86$. In the left part, intersection between the partial result sets is performed, using a low cardinality threshold in each subspace, and in the right part union is performed with a high cardinality threshold.
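Steps 1 and 2 of subspace selection can be sketched as a loop over groups of dimensions. The code is a simplified illustration: signatures are stored as one bit list per dimension, the helper names and the parameter `max_candidates` are hypothetical, and the stop condition (all subspaces processed, or, for intersection, few enough candidates left) is only one possible reading of the condition described above. Step 3, the exact ranking, is omitted.

```python
def onehot(r, k=3):
    """Bit list of length k with only position r set."""
    return [1 if i == r else 0 for i in range(k)]

def subspace_cardinality(o_sig, q_sig, dims):
    """Cardinality of the bitwise AND restricted to the given dimensions."""
    return sum(a & b for D in dims for a, b in zip(o_sig[D], q_sig[D]))

def subspace_filter(q_sig, matrix, subspaces, ct,
                    combine="intersection", max_candidates=1):
    result = None
    for dims in subspaces:
        partial = {oid for oid, o_sig in matrix.items()
                   if subspace_cardinality(o_sig, q_sig, dims) >= ct}
        if result is None:
            result = partial
        elif combine == "intersection":
            result = result & partial
        else:
            result = result | partial
        # Assumed stop condition: enough objects have been pruned.
        if combine == "intersection" and len(result) <= max_candidates:
            break
    return result

# Four dimensions, three ranges each; range indices are hypothetical.
ranges = {"a": [1, 1, 1, 1], "b": [1, 1, 0, 2], "c": [0, 1, 1, 1], "d": [2, 0, 0, 0]}
matrix = {oid: [onehot(r) for r in rs] for oid, rs in ranges.items()}
q_sig = [onehot(1) for _ in range(4)]
subspaces = [[0, 1], [2, 3]]

low_ct_intersection = subspace_filter(q_sig, matrix, subspaces, ct=1)
high_ct_union = subspace_filter(q_sig, matrix, subspaces, ct=2, combine="union")
print(sorted(low_ct_intersection))  # ['a', 'c']
print(sorted(high_ct_union))        # ['a', 'b', 'c']
```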
4.3 Insert, Update, Delete
The insertion of a new object in the BitMatrix amounts to computing its signature and adding it as a new line in the matrix. The size of the BitMatrix grows linearly with the number of objects and with the dimensionality. The precise size of the BitMatrix is $\left(\sum_{D=1}^{N} k_D\right) \cdot |O|$ bits. To update an existing object its signature has to be modified through bitwise operations. To delete an object, the corresponding line is removed from the matrix.
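The three maintenance operations and the size formula can be sketched with signatures stored as Python integers. The class below is a hypothetical in-memory illustration (the class and method names are not from the paper); objects are described by their range index in each dimension, and the leftmost bits of a signature belong to the first dimension.

```python
class BitMatrix:
    """Minimal in-memory sketch: one integer signature per object."""

    def __init__(self, ks):
        self.ks = list(ks)  # number of ranges k_D for each dimension
        self.rows = {}      # object id -> signature

    def _signature(self, ranges):
        sig = 0
        for k, r in zip(self.ks, ranges):
            sig = (sig << k) | (1 << (k - 1 - r))
        return sig

    def insert(self, oid, ranges):
        self.rows[oid] = self._signature(ranges)

    def update(self, oid, dim, new_range):
        """Move the object to another range of one dimension with bitwise operations."""
        k = self.ks[dim]
        offset = sum(self.ks[dim + 1:])  # bits to the right of this dimension
        mask = ((1 << k) - 1) << offset
        cleared = self.rows[oid] & ~mask
        self.rows[oid] = cleared | (1 << (offset + k - 1 - new_range))

    def delete(self, oid):
        del self.rows[oid]

    def size_bits(self):
        return sum(self.ks) * len(self.rows)

    def as_string(self, oid):
        return format(self.rows[oid], "0{}b".format(sum(self.ks)))

bm = BitMatrix([3, 3])
bm.insert("o2", [2, 1])
print(bm.as_string("o2"))  # 001010
bm.update("o2", 0, 1)
print(bm.as_string("o2"))  # 010010
print(bm.size_bits())      # 6
bm.delete("o2")
print(bm.size_bits())      # 0

# Size for Dataset 1 of Section 5: 256 dimensions, 7 ranges each, 9908 objects.
print(sum([7] * 256) * 9908)  # 17755136 bits, about 2.2 MB if bits are packed
```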
Fig. 3. Time Comparison

5 Experimental Results
One of the difficulties encountered in the evaluation of the various high-dimensional indexing methods was the lack of a common platform on which they can be objectively tested. Indexing methods depend on parameters and storage models, which make them better suited for some domains. This makes them difficult to compare, and the majority of the proposed high-dimensional retrieval methods are only compared to the sequential scan. High-dimensional indexing methods, however, follow common steps such as preprocessing the object data, partitioning the search space, index construction, query processing, searching the index, accessing the remaining objects. A first step in the experimentation work has been to develop a Java framework for the integration and benchmark of the various indexing methods [18]. We have included Sequential Scan, Bond [10], VA-File [11], GridBitmap [13], and the proposed BitMatrix.

The first experiment has been designed to test the time performance of the various methods as memory-based indexing methods. The time columns in Figure 3 have two components: the time to build the index in memory (build time) and the time to search it (engine time). The engine time for the BitMatrix is clearly smaller than the values for Sequential Scan and Bond and is in the same range as that of the GridBitmap. The time for the VA-File does not do justice to the method as it is tested using the same partitioning scheme as GridBitmap and BitMatrix; with 7 ranges in each of the 256 dimensions, there are $7^{256}$ cells and the non-empty ones have at most one object.

The second set of experiments was geared toward finding a good parametrization for the BitMatrix. The quality of the parametrization is evaluated comparing the approximate results with the k-nearest neighbors. If $R$ is the set $NN_k(q)$ and $A$ is the approximate result ($|A|$ varies with the query object, ct, et, partitioning scheme, subspace), the quality is computed as $|A \cap R| / |R|$. This measure can be regarded as a formal recall rate, taking $R$ as the relevant set and $A$ as the answer set. The percentage of objects that remain after pruning is also recorded.

Two datasets have been used: a dataset of 9908 image histograms with $N = 256$ dimensions obtained from real images (Dataset 1) and a synthetic dataset of 10000 objects with $N = 80$ dimensions following an IID (independent and identically distributed) uniform distribution (Dataset 2). The cardinality threshold was set as a percentage of the maximum cardinality encountered up to that moment for the subspace. The histograms of cardinalities in Fig. 4 use dark bins for the cardinalities of the $NN_{10}$. On a subspace of 86 dimensions from the original 256, after expansion, all the 10 nearest neighbors have cardinalities larger than 40.

(a) No Expansion (b) Exp threshold=0.01
Fig. 4. Histogram of cardinalities
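The two measures recorded in the experiments, the formal recall rate $|A \cap R| / |R|$ and the share of objects that survive pruning, can be computed as in the sketch below. The function names and the object ids are hypothetical and serve only as a worked example.

```python
def recall(answer, relevant):
    """Formal recall rate |A ∩ R| / |R|."""
    return len(set(answer) & set(relevant)) / len(relevant)

def fraction_accessed(answer, dataset_size):
    """Share of the dataset that remains after pruning and must be accessed."""
    return len(answer) / dataset_size

relevant = ["o2", "o5", "o7", "o9"]       # R = NN_k(q) with k = 4 (hypothetical)
answer = ["o2", "o3", "o5", "o9", "o10"]  # A, the approximate result (hypothetical)

print(recall(answer, relevant))       # 0.75
print(fraction_accessed(answer, 10))  # 0.5
```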
Table 1 shows average values of the two measures (formal recall rate, and % of objects accessed) with respect to $NN(q)$ and $NN_{10}(q)$ across random sets of 100 queries. With a cardinality threshold of 0.55, less than 3% of Dataset 1 is accessed, the average recall rate is 0.93 (relative to $NN$) and 0.83 (relative to $NN_{10}$). The experiments have shown that the tradeoff between quality of retrieval and speed can be tuned with the expansion mechanism. For example, with the cardinality threshold 0.47, about 6% of Dataset 1 is accessed, and the average recall rate relative to $NN_{10}$ is 0.91. If expansion is performed the recall is 0.94 at 7.5% accessed, while for a smaller cardinality threshold (ct = 0.4) the recall is 0.93 at 10.9% accessed. Thus, expansion should be preferred in this case. The results for the synthetic IID uniformly distributed Dataset 2 (second half of Table 1) show worse performance. Much larger amounts of Dataset 2 have to be analyzed in order to obtain acceptable recall rates. The expansion mechanism clearly improves the recall rate in this case as well.
Table 1. Testing the BitMatrix on two datasets

Dataset 1: N = 256, k_D = 7, D = 1...256 (k-means partitioning)

        NN(q)                               NN_10(q)
        Naïve (et=0)      et=0.01           Naïve (et=0)      et=0.01
ct      Rate   accessed   Rate   accessed   Rate   accessed   Rate   accessed
0.73    0.6    0.2%       0.7    0.4%       0.39   0.2%       0.48   0.4%
0.67    0.78   0.8%       0.87   1.0%       0.61   0.8%       0.68   1.0%
0.55    0.93   2.9%       0.95   3.8%       0.86   2.9%       0.9    3.8%
0.47    0.97   6.0%       0.99   7.4%       0.91   6.0%       0.94   7.4%
0.40    1.0    10.9%      1.0    13.5%      0.93   10.9%      0.96   13.5%

Dataset 2: N = 80, k_D = 7, D = 1...80 (k-means partitioning)

        NN(q)                               NN_10(q)
        Naïve (et=0)      et=0.01           Naïve (et=0)      et=0.01
ct      Rate   accessed   Rate   accessed   Rate   accessed   Rate   accessed
0.60    0.24   0.19%      0.25   0.21%      0.08   0.19%      0.09   0.21%
0.50    0.55   1.99%      0.57   2.31%      0.33   1.99%      0.35   2.31%
0.40    0.78   8.16%      0.84   9.32%      0.57   8.16%      0.62   9.32%
0.35    0.90   17.0%      0.93   19.2%      0.77   17.0%      0.80   19.2%
0.30    0.90   32.2%      0.94   34.21%     0.85   32.2%      0.87   34.21%

6 Conclusions
The purpose of this work has been to study the BitMatrix with as few assumptions as possible on the underlying structure of the descriptors. The BitMatrix is highly parametrizable, offering a large space for experimentation: cardinality threshold, expansion threshold, number of dimensions processed in each step, dimension processing order for the case of weighted dimensions. While the majority of the high-dimensional indexing approaches are only compared to Sequential Scan, the current experiments were run on top of a prototype framework for integration of high-dimensional indexing methods, and include a set of 5 methods: Sequential Scan, Bond, VA-File, GridBitmap, and BitMatrix. The experiments revealed that the BitMatrix retains most of sequential scan's flexibility with good quality of the approximations and a much better time performance. It can be conveniently arranged for efficient sequential access and optimized bitwise operations. It can further be broken into segments for distributed or parallel processing. It supports weighted queries and accommodates query feedback mechanisms. Relevant feature selection (dimensionality reduction) and multiple clustering techniques can be used with the BitMatrix as long as the metric is fixed. Future work includes research on the expansion mechanism, extensive testing with larger collections and integration into a multimedia retrieval system.
References
1. Mojsilovic, A.: Semantic Metric for Image Library Exploration. IEEE Transactions on Multimedia 6 (2004) 828-838
2. MPEG-7 Requirements Group (Editor José M. Martínez): MPEG-7 Overview v.10. ISO/IEC JTC1/SC29/WG11 N6828 (2004)
3. Calistru, C., Ribeiro, C., David, G.: A flexible model for multimedia content structure and description: MetaMedia and its applications. In preparation (2006)
4. Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33 (2001) 322-373
5. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33 (2001) 273-321
6. Digout, C., Nascimento, M.A.: High-dimensional similarity searches using a metric pseudo-grid. In: ICDE Workshops 1174. (2005)
7. Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 30 (2005) 364-397
8. Carey, M.J., Haas, L.M., Schwarz, P.M., Arya, M., Cody, W.F., Fagin, R., Flickner, M., Luniewski, A.W., Niblack, W., Petkovic, D., Thomas, J., Williams, J.H., Wimmers, E.L.: Towards heterogeneous multimedia information systems: the Garlic approach. In: RIDE '95: Proceedings of the 5th International Workshop on Research Issues in Data Engineering-Distributed Object Management (RIDE-DOM'95), IEEE Computer Society (1995) 124-131
9. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, ACM Press (2001) 102-113
10. de Vries, A.P., Mamoulis, N., Nes, N., Kersten, M.: Efficient k-NN search on vertically decomposed data. In: SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data, ACM Press (2002) 322-333
11. Weber, R., Schek, H.J., Blott, S.: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In: Proc. 24th Int. Conf. Very Large Data Bases, VLDB. (1998) 194-205
12. Aggarwal, C.C., Yu, P.S.: The IGrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, ACM Press (2000) 119-129
13. Cha, G.H.: Bitmap indexing method for complex similarity queries with relevance feedback. In: MMDB '03: Proceedings of the 1st ACM international workshop on Multimedia databases, New York, NY, USA, ACM Press (2003) 55-62
14. Goldstein, J., Platt, J.C., Burges, C.J.C.: Redundant Bit Vectors for Quickly Searching High-Dimensional Regions. In: Deterministic and Statistical Methods in Machine Learning. (2004) 137-158
15. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When Is Nearest Neighbor Meaningful? Lecture Notes in Computer Science 1540 (1999) 217-235
16. Aggarwal, C.C.: Towards meaningful high-dimensional nearest neighbor search by human-computer interaction. In: Data Engineering, 2002. Proceedings. 18th International Conference on. (2002) 593-604
17. Katayama, N., Satoh, S.: Distinctiveness-Sensitive Nearest Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. In: Proceedings of the 17th International Conference on Data Engineering, Washington, DC, USA, IEEE Computer Society (2001) 493-502
18. Gonçalves, B., Calistru, C., Ribeiro, C., David, G.: Experimental results for multidimensional multimedia descriptor indexing. Submitted for publication (2006)