Multidimensional Descriptor Indexing: Exploring the BitMatrix *


Catalin Calistru 1,2, Cristina Ribeiro 1,2, Gabriel David 1,2

1 FEUP, Faculdade de Engenharia da Universidade do Porto
2 INESC Porto
Rua Dr. Roberto Frias s/n, 4200-465 Porto, Portugal
[email protected], [email protected], [email protected]

Abstract. Multimedia retrieval brings new challenges, mainly derived from the mismatch between the level of the user interaction (high-level concepts) and that of the automatically processed descriptors (low-level features). The effective use of the low-level descriptors is therefore mandatory. Many data structures have been proposed for managing the representation of multidimensional descriptors, each geared toward efficiency in some set of basic operations. The paper introduces a highly parametrizable structure called the BitMatrix, along with its search algorithms. The BitMatrix is compared with existing methods, all implemented in a common framework. The tests have been performed on two datasets, with parameters covering significant ranges of values. The BitMatrix has proved to be a robust and flexible structure that can compete with other methods for multidimensional descriptor indexing.

1 Introduction

People need to search multimedia objects automatically. The retrieval task is characterized by the specification of user queries and the selection of appropriate objects by the system. While textual data allows an easy identification of high-level concepts, images, for instance, do not directly provide such concepts, even if they are subject to state-of-the-art analysis. Multimedia data has to conform to convenient models in order to be used in retrieval. Extraction techniques are used to analyze the streams, generating higher-level representations of features such as color or texture, in the form of multidimensional descriptors. Descriptors constitute a possible search space on which similarity calculations are required [1].

We concentrate here on the specific task of retrieving objects that satisfy some (sharp) similarity criterion in a database of objects represented by multidimensional descriptors. A BitMatrix is proposed along with methods for searching according to similarity criteria. The BitMatrix is compared with other approaches using an extensible test platform.

* Partially supported by FCT under project POSC/EIA/61109/2004 (DOMIR)

2 Multimedia Object Models

Multimedia objects typically have a complex structure. It is common for objects to encapsulate parts in different media, to have references to components they do not directly include, and to have complex relationships among them. This is reflected in the models for multimedia objects adopted in recent standards such as MPEG-7 [2]. The goal of our work is to build systems capable of managing structured multimedia objects and offering retrieval functions that range from text-based to structure and content-based browsing and querying. We are using an operational model accounting for the structure of current standards [3] and need to accommodate various descriptors in a flexible way.

As computing power is cheap, it is viable to have image and video processing algorithms analyzing object features and generating large amounts of descriptor data. It is therefore crucial to adopt descriptor representations and indexing methods that, besides accommodating the expected diversity of descriptors, can be effective in the basic retrieval operations. Moreover, as similarity search is a core task for these descriptors, metrics are required and they must be fit for the nature of descriptors. A flexible model for metrics is likely to require a fine-grained representation of the descriptor values and their types.

Multidimensional indexing requires assumptions on the nature of data and the algorithms to search it. Several requirements from the application domain may condition the choice of the indexing method. A common requirement is that updates to the object database are allowed. This will lead to indexing methods able to incrementally update their data structures. Another important aspect is the ability to add new descriptors of varying dimensionality. To meet this requirement, it is necessary to allow varying dimensionality in descriptors, and again to be able to extend the indexing structure piecemeal.

Handling a large number of multidimensional descriptors may be inconvenient in some parts of the retrieval process. Being able to search on a chosen subset is therefore a desirable feature. Metrics can take many forms, and the choice of metric is also important from the point of view of flexibility of the retrieval process.

3 Multidimensional Indexing Methods

The similarity between objects is evaluated with a comparison between their representations. Each object $o$ from the universe of objects $O$ is characterized by a set $F$ of features (color, motion activity, ...), where each feature $f_i$ is captured as a feature vector $v_i$. In the resulting vector space, with dimensionality $N$, the similarity between objects is given by a similarity measure which is not necessarily a metric distance. There are, however, distance-only datasets, for which the information available is the distance between the objects.

Given a query object $q$ and a metric function $d(o_x, o_y)$, the search task amounts to finding a particular set of objects. Possible result sets are the Nearest Neighbor Set $NN(q)$, the set of objects $o$ such that $\forall v \in O,\ d(q, o) \le d(q, v)$; the k-Nearest Neighbors Set $NN_k(q)$, a set of $k$ elements closest to $q$ in $O$, i.e. objects $o$ such that $\forall v \in O,\ d(q, v) < d(q, o) \rightarrow v \in NN_k(q)$; and the approximate Nearest Neighbor Set $NN_A(q)$, a set of objects $o$ such that $d(q, o) \le (1 + \epsilon) \cdot d(q, NN(q))$ for some $\epsilon > 0$.
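The three result sets can be made concrete with a brute-force sketch. It assumes Euclidean distance as the metric $d$ and objects given as coordinate tuples; the function names are illustrative and do not come from the paper.

```python
import math

def euclidean(a, b):
    """Metric d(a, b), assumed here to be the Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(q, objects, d=euclidean):
    """NN(q): every object at minimal distance from q (ties are kept)."""
    best = min(d(q, o) for o in objects)
    return [o for o in objects if d(q, o) == best]

def k_nearest_neighbors(q, objects, k, d=euclidean):
    """NN_k(q): the k objects closest to q, found by a full sort."""
    return sorted(objects, key=lambda o: d(q, o))[:k]

def approximate_nearest_neighbors(q, objects, eps, d=euclidean):
    """NN_A(q): objects within (1 + eps) times the nearest-neighbor distance."""
    best = min(d(q, o) for o in objects)
    return [o for o in objects if d(q, o) <= (1 + eps) * best]

objects = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (2.0, 0.0)
print(nearest_neighbors(q, objects))
print(k_nearest_neighbors(q, objects, 2))
print(approximate_nearest_neighbors(q, objects, 0.5))
```

Every function scans the whole collection, which is the cost that the indexing methods discussed next try to avoid.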

Generally, the high-dimensional indexing methods divide the search space into a set of ranges with the goal of pruning them at search time. The remaining partitions have to be exhaustively scanned. A great diversity of indexing methods emerged from the differences in the models of the underlying search space (vectorial or metric), partitioning strategies and similarity measures.

Spatial Access Methods (SAM) are based on a tree data structure with the data nodes (the leaves) grouped in directory nodes. The partitioning strategies of the SAMs can be data partitioning (DP), which uses minimum bounding regions (MBR) such as R-tree, R*-tree, X-tree, bounding spheres such as SS-tree, MBR and bounding spheres such as SR-tree, generic minimum bounding regions (hyper rectangle, cube, sphere) such as TV-tree, and space partitioning methods (SP) such as the kDB-tree, Hybrid-tree, SH-tree [4].

Metric Access Methods (MAM) are distance-based indexing methods that work with relative distances between the objects rather than their absolute positions. MAMs are also based on tree structures that recursively partition the data set into subsets (ball partitioning or generalized hyperplane partitioning) at each node level [5]. The applicability of such methods ranges from native distance-only datasets to high-dimensional datasets for which conventional SAMs are no longer efficient [6].

Single-Dimensional Mapping techniques map the points in the high-dimensional space to single-dimensional values for which efficient techniques exist. In [7] a spatial data partitioning (DP) technique is applied, followed by a single-dimensional mapping technique within each partition. The mapping process consists of sorting the objects in each partition on the distance to a specific reference point (such as the center). The reference points are then indexed in a B+-tree structure.
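The mapping idea can be illustrated on a single partition. The sketch below is not the method of [7]: it assumes Euclidean distance, one reference point, and a sorted Python list standing in for the B+-tree. The names `build_keys` and `range_candidates` are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right

def build_keys(points, ref):
    """Sort the points of one partition by their distance to the reference point."""
    keyed = sorted((math.dist(p, ref), p) for p in points)
    return [key for key, _ in keyed], [p for _, p in keyed]

def range_candidates(keys, points, ref, q, r):
    """Candidates for a range query of radius r around q.

    By the triangle inequality, a point p with d(q, p) <= r satisfies
    |d(p, ref) - d(q, ref)| <= r, so only that 1-D key interval is scanned.
    The candidates still need an exact distance check afterwards.
    """
    dq = math.dist(q, ref)
    lo = bisect_left(keys, dq - r)
    hi = bisect_right(keys, dq + r)
    return points[lo:hi]

ref = (0, 0)
keys, pts = build_keys([(0, 0), (3, 0), (0, 4), (6, 8)], ref)
print(range_candidates(keys, pts, ref, q=(3, 1), r=1.5))
```

In this example the point (0, 4) is returned as a candidate although it lies outside the query radius, which is why the exact check remains necessary.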

Aggregation algorithms treat each dimension as a separate list, and their goal is to obtain the result of the query by accessing a minimum number of lists and as few objects as possible in each of the visited lists. These methods operate in middleware systems [8], like Fagin's Algorithm, the Threshold Algorithm and Quick-Combine [9], or directly on the original data, like BOND [10].

Data Approximation Structures make the assumption that a sequential scan is inevitable [11] and construct a vector of approximations (VA-file), significantly smaller than the original data. Each dimension $D$ is divided in $k_D$ ranges, obtaining a grid of approximations. In the first step the approximations are pruned based on their minimum/maximum distance to the query point. For the remaining approximations the corresponding exact data points have to be analyzed. Using a similar grid of ranges, the IGrid [12] maintains separate lists for the objects in each range. The similarity between any two objects uses only the set of dimensions for which the two objects lie in the same range (the proximity set). The number of objects that are accessed is kept small as dimensionality increases, at the cost of a large storage overhead (100%) and significant edge-effects. Bitmap variants have been proposed [13,14].

As dimensionality increases, the distances from the query object to the nearest and the farthest objects are harder to distinguish [15]. In such conditions, a query region that includes the nearest neighbor will overlap most of the other partitions, dramatically decreasing selectivity. With low selectivity, the search methods end up accessing all the nodes of their structures, which means pseudo-random access to all the objects in the dataset. Sequential scan therefore becomes a robust competitor. Recent works in this area have established concepts like the meaningfulness [15,16] and the distinctiveness [17] of the retrieved objects in order to characterize the retrieval process in high-dimensional spaces.
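The pruning step of the data approximation approach (minimum/maximum distance of a grid cell to the query point) can be sketched as follows. This is an illustration rather than the VA-File implementation of [11]: it assumes Euclidean distance and a nearest-neighbor query, and `cell_bounds` and `filter_candidates` are hypothetical names.

```python
import math

def cell_bounds(q, cell, bounds):
    """Lower and upper bound of the distance from q to any point in a grid cell.

    cell[D]   : index of the range occupied in dimension D
    bounds[D] : sorted list of k_D + 1 range boundaries for dimension D
    """
    lo_sq = hi_sq = 0.0
    for D, r in enumerate(cell):
        l, u = bounds[D][r], bounds[D][r + 1]
        if q[D] < l:
            near = l - q[D]
        elif q[D] > u:
            near = q[D] - u
        else:
            near = 0.0                      # q lies inside the range
        far = max(abs(q[D] - l), abs(q[D] - u))
        lo_sq += near ** 2
        hi_sq += far ** 2
    return math.sqrt(lo_sq), math.sqrt(hi_sq)

def filter_candidates(q, cells, bounds):
    """Keep the approximations whose lower bound does not exceed the
    smallest upper bound; the others cannot contain the nearest neighbor."""
    lb_ub = [cell_bounds(q, c, bounds) for c in cells]
    best_ub = min(ub for _, ub in lb_ub)
    return [i for i, (lb, _) in enumerate(lb_ub) if lb <= best_ub]

bounds = [[0, 1, 2, 3], [0, 1, 2, 3]]
cells = [(0, 0), (2, 2), (1, 0)]
print(filter_candidates((0.5, 0.5), cells, bounds))
```

Only the exact vectors behind the surviving approximations have to be read afterwards.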

4 The BitMatrix-based methods

Given the high cost of random disk access as compared to sequential access, the idea is to construct a collection of signatures that can be sequentially analyzed and used to effectively prune the search space. The BitMatrix method follows a data approximation approach in the spirit of VA-File [11] and IGrid [12], partitioning each of the $N$ dimensions in $k$ ranges. A partition of a dimension $D$ is a set of ranges $\pi_D = \{r_i = [l_i, u_i],\ i = 1 \ldots k_D\}$, where $l_i$, $u_i$ are the lower and upper bounds of range $i$. The partitioning scheme $(k_1, k_2, \ldots, k_N)$ is used to obtain bitmap signatures for all the objects in a dataset $O$, arranged as lines in a matrix.

4.1 Building the BitMatrix

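Fig. 1. BitMatrix. [Figure: objects o1, ..., o10 and the query q plotted over dimensions D1 and D2, each divided into ranges 0, 1, 2, with the expansion thresholds et1 and et2 marked; next to the plot, the BitMatrix holding the signatures of the objects and of the queries qNaive and qExp, followed by the cardinality columns.]

Cardinalities for o1, ..., o10:
Naive: 2 1 2 1 1 0 1 1 0 1
Exp:   2 2 2 1 1 1 1 1 1 2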

The first step in the construction of the BitMatrix is to choose a partitioning scheme such as equi-width, equi-depth or k-means partitioning. In the case of equi-width partitioning the ranges have the same length, while in the case of equi-depth the ranges contain an equal number of objects. The k-means partitioning requires a previous k-means clustering step, where $k$ clusters are identified and their centroids calculated and sorted. The bounds of the $k$ ranges are obtained from the centroids: the lower bound for range $i$ is $l_i = (C_{i-1} + C_i)/2$ and its upper bound is $u_i = (C_i + C_{i+1})/2$.
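The three partitioning schemes can be sketched for a single dimension as below. Each function returns the $k + 1$ boundaries of the $k$ ranges. Two points are assumptions, since the text does not specify them: the outer bounds of the first and last range are taken from the data (or passed in as `lo` and `hi`), and the centroids are assumed to come from a k-means step that is not shown.

```python
def equi_width_bounds(values, k):
    """k ranges of equal length between the smallest and largest value."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k)] + [hi]

def equi_depth_bounds(values, k):
    """k ranges holding (about) the same number of values each."""
    s = sorted(values)
    n = len(s)
    return [s[0]] + [s[(i * n) // k] for i in range(1, k)] + [s[-1]]

def centroid_bounds(centroids, lo, hi):
    """Ranges around sorted centroids: inner bounds are the midpoints
    (C_i + C_{i+1}) / 2; lo and hi close the first and last range."""
    c = sorted(centroids)
    mids = [(c[i] + c[i + 1]) / 2 for i in range(len(c) - 1)]
    return [lo] + mids + [hi]

print(equi_width_bounds([0.0, 9.0], 3))
print(equi_depth_bounds(list(range(9)), 3))
print(centroid_bounds([5.0, 1.0, 9.0], 0.0, 10.0))
```

With the midpoint rule the upper bound of range $i$ coincides with the lower bound of range $i + 1$, so a single sorted list of boundaries describes the whole partition.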

Definition 1 (Signature). Given a partitioning scheme $(k_1, k_2, \ldots, k_N)$, an object's signature is a bitmap of length $\sum_{D=1}^{N} k_D$. For each dimension the signature contains 1 for the range where the object belongs and 0 for the other ranges.

The cardinality of a signature is the number of bits set to 1. The example in Figure 1 has 2 dimensions ($N = 2$) and 3 ranges per dimension ($k_1 = k_2 = 3$). Object $o_2$ has signature 001010 as the object lies in range 2 for dimension 1, and in range 1 for dimension 2. Arranging each signature as a line in a matrix we obtain the BitMatrix with $|O|$ lines and $\sum_{D=1}^{N} k_D$ columns.
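Definition 1 translates into a few lines of code. The sketch below assumes that each dimension is described by its sorted list of $k_D + 1$ boundaries and that a value on an inner boundary belongs to the upper range; the coordinates used for the example object are hypothetical, chosen only so that it reproduces the signature 001010.

```python
from bisect import bisect_right

def range_index(x, bounds):
    """Index of the range containing x, given k + 1 sorted boundaries."""
    k = len(bounds) - 1
    return min(max(bisect_right(bounds, x) - 1, 0), k - 1)

def signature(obj, bounds_per_dim):
    """Bitmap with exactly one 1 per dimension, at the object's range."""
    bits = []
    for x, bounds in zip(obj, bounds_per_dim):
        k = len(bounds) - 1
        r = range_index(x, bounds)
        bits.extend(1 if i == r else 0 for i in range(k))
    return bits

def build_bitmatrix(objects, bounds_per_dim):
    """One signature per object, arranged as the lines of a matrix."""
    return [signature(o, bounds_per_dim) for o in objects]

bounds = [[0, 1, 2, 3], [0, 1, 2, 3]]   # N = 2, k_1 = k_2 = 3
o2 = (2.5, 1.5)                         # hypothetical coordinates
print("".join(map(str, signature(o2, bounds))))
```

The matrix built this way has one line per object and one column per range, and the cardinality of every stored signature equals the number of dimensions.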

4.2 Searching with the BitMatrix

We now propose two algorithms for approximate nearest neighbor, exploring the sequential access to the BitMatrix. The naïve approach selects objects based on the cardinality of the bitwise AND between object and query signatures, as follows.

The naïve approach.

Step 1: Given a query object $q = [q_1, \ldots, q_N]$, obtain its signature, i.e. find for each dimension $D$ the range in which the query coordinate $q_D$ lies.

Step 2: Iterate through the objects, performing bitwise AND between their signatures and the query object's signature. If the cardinality of the resulting bitmap reaches a predefined cardinality threshold ($ct$), the object is retained for the next phase.

Step 3: Access the full vector values of the remaining objects, compute their exact distance to the query object and rank them.

We will use cardinality of an object in the sequel to refer to the cardinality of the bitmap resulting from the bitwise AND between the signatures for the object and the query. In Figure 1, the signature $q_{naive} = 010010$ of the query object is AND'ed with the signatures of all the other objects $o_1, o_2, \ldots, o_{10}$. With the cardinality threshold set to 2, only objects $o_1$ and $o_3$ remain for Step 3. The example above shows that the naïve approach prunes object $o_2$, which happens to be the nearest neighbor. This effect, known as the edge-effect [12], appears because for all dimensions $D$, only the objects in the same range as $q_D$ are considered.

Fig. 2. Subspace selection

The range expansion heuristic is a modification of the naïve approach affecting Step 1: for a dimension in which the query object is close enough to one of the edges, the query's signature is set to 1 for both the query's range and the range next to it. Assuming that $q_D$ lies in range $i$ for dimension $D$, the expansion takes place to the left if $\|q_D - l_i\| < et_i \|u_i - l_i\|$ or to the right if $\|q_D - u_i\| < et_i \|u_i - l_i\|$, where $et_i$, the expansion threshold, takes values in $[0, 0.5]$. The cardinalities column (with expansion) in Figure 1 shows that with the query signature $q_{exp} = 011010$ and the same cardinality threshold, objects $o_2$ and $o_{10}$ are not pruned, as their cardinalities are now 2.
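The naïve approach and the range expansion heuristic can be sketched together, since they differ only in how the query signature is built. The code assumes Euclidean distance for Step 3, a single expansion threshold `et` shared by all dimensions, and retention of objects whose cardinality is at least `ct`, as in the Figure 1 example. The coordinates are hypothetical; they are chosen so that the query signatures match $q_{naive} = 010010$ and $q_{exp} = 011010$.

```python
import math
from bisect import bisect_right

def range_index(x, bounds):
    """Index of the range holding x; bounds has k + 1 sorted boundaries."""
    k = len(bounds) - 1
    return min(max(bisect_right(bounds, x) - 1, 0), k - 1)

def signature(point, bounds_per_dim, et=0.0):
    """Signature of a point; with et > 0 the neighboring range is also set
    when the coordinate is close enough to an edge (range expansion)."""
    bits = []
    for x, bounds in zip(point, bounds_per_dim):
        k = len(bounds) - 1
        i = range_index(x, bounds)
        l, u = bounds[i], bounds[i + 1]
        on = {i}
        if i > 0 and abs(x - l) < et * (u - l):
            on.add(i - 1)                      # expand to the left
        if i < k - 1 and abs(x - u) < et * (u - l):
            on.add(i + 1)                      # expand to the right
        bits.extend(1 if r in on else 0 for r in range(k))
    return bits

def cardinality(sig_a, sig_b):
    """Number of 1 bits in the bitwise AND of two signatures."""
    return sum(a & b for a, b in zip(sig_a, sig_b))

def search(q, objects, matrix, bounds_per_dim, ct, et=0.0):
    q_sig = signature(q, bounds_per_dim, et)                  # Step 1
    kept = [o for o, sig in zip(objects, matrix)
            if cardinality(sig, q_sig) >= ct]                  # Step 2
    return sorted(kept, key=lambda o: math.dist(q, o))        # Step 3

bounds = [[0, 1, 2, 3], [0, 1, 2, 3]]
objects = [(1.5, 1.2), (2.1, 1.4), (0.5, 2.5)]    # hypothetical coordinates
matrix = [signature(o, bounds) for o in objects]  # object signatures, no expansion
q = (1.9, 1.5)
print(search(q, objects, matrix, bounds, ct=2))
print(search(q, objects, matrix, bounds, ct=2, et=0.2))
```

Without expansion the true nearest neighbor (2.1, 1.4) is pruned because it lies just across a range boundary; with `et=0.2` it is retained and ranked first.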

Subspace selection. The increase of dimensionality makes the task of finding the nearest neighbor harder because the distances between objects become very similar. For the majority of the high-dimensional search methods, the nearest neighbor becomes indistinguishable from the rest of the objects. In order to improve the quality of the nearest neighbor selection, we have considered the subspace selection approach, where subsets with smaller dimensionality are successively explored using the BitMatrix algorithms. Let $s$ be the number of dimensions of the subspace to be processed in the current iteration:

Step 1: Apply Step 1 and 2 of the naïve or expansion approaches on the selected $s$ dimensions ($\sum_{D=1}^{s} k_D$ columns of the BitMatrix).

Step 2: Combine the partial result set obtained in this iteration with the previous result set (intersection, union). If the stop condition is false, repeat Step 1 on the next subspace.

Step 3: Same as Step 3 of the naïve approach.

The stop condition becomes true if enough dimensions have been processed or enough objects have been pruned. Figure 2 illustrates a space with 256 dimensions processed with $s = 86$. In the left part, intersection between the partial result sets is performed, using a low cardinality threshold in each subspace, and in the right part union is performed with a high cardinality threshold.
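Steps 1 and 2 of the subspace selection loop can be sketched as follows. Several details are assumptions made for illustration: subspaces are consecutive blocks of `s` dimensions, the cardinality threshold `ct` is an absolute count applied in every subspace, and the stop condition is expressed through the hypothetical parameters `max_dims` and `max_candidates`. Step 3 (exact ranking of the survivors) is omitted.

```python
def subspace_search(matrix, q_sig, ranges_per_dim, s, ct,
                    combine="intersection", max_dims=None, max_candidates=None):
    """Return the indices of the objects that survive subspace-wise pruning."""
    n_dims = len(ranges_per_dim)
    max_dims = n_dims if max_dims is None else max_dims
    # offsets[D] is the first signature column that belongs to dimension D.
    offsets = [0]
    for k in ranges_per_dim:
        offsets.append(offsets[-1] + k)

    result = None
    first_dim = 0
    while first_dim < n_dims:
        last_dim = min(first_dim + s, n_dims)
        lo, hi = offsets[first_dim], offsets[last_dim]
        # Step 1: cardinality test restricted to the columns of this subspace.
        partial = {i for i, sig in enumerate(matrix)
                   if sum(a & b for a, b in zip(sig[lo:hi], q_sig[lo:hi])) >= ct}
        # Step 2: combine with the result of the previous subspaces.
        if result is None:
            result = partial
        elif combine == "intersection":
            result &= partial
        else:
            result |= partial
        first_dim = last_dim
        # Stop condition: enough dimensions processed or enough objects pruned.
        if first_dim >= max_dims:
            break
        if max_candidates is not None and len(result) <= max_candidates:
            break
    return sorted(result) if result is not None else []

q_sig = [1, 0, 1, 0, 1, 0, 1, 0]            # 4 dimensions, 2 ranges each
matrix = [[1, 0, 1, 0, 1, 0, 1, 0],
          [1, 0, 1, 0, 0, 1, 0, 1],
          [0, 1, 0, 1, 1, 0, 1, 0],
          [0, 1, 0, 1, 0, 1, 0, 1]]
print(subspace_search(matrix, q_sig, [2, 2, 2, 2], s=2, ct=2))
print(subspace_search(matrix, q_sig, [2, 2, 2, 2], s=2, ct=2, combine="union"))
```

Intersection keeps only the objects that pass the test in every processed subspace, while union keeps those that pass in at least one, which is why the text pairs intersection with a low threshold and union with a high one.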

4.3 Insert, Update, Delete

The insertion of a new object in the BitMatrix amounts to computing its signature and adding it as a new line in the matrix. The size of the BitMatrix grows linearly with the number of objects and with the dimensionality. The precise size of the BitMatrix is $(\sum_{D=1}^{N} k_D) \cdot |O|$ bits. To update an existing object, its signature has to be modified through bitwise operations. To delete an object, the corresponding line is removed from the matrix.
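The maintenance operations can be sketched with signatures stored as integer bitmaps. The class below is an illustration under assumptions, not the authors' implementation: objects are addressed by an identifier, a signature is given by the range index occupied in each dimension, and the leftmost bit corresponds to range 0 of the first dimension.

```python
class BitMatrix:
    def __init__(self, ranges_per_dim):
        self.ranges_per_dim = list(ranges_per_dim)
        self.width = sum(self.ranges_per_dim)   # columns: sum of k_D
        self.rows = {}                          # object id -> int bitmap

    def _to_bitmap(self, range_indices):
        bitmap = 0
        for k, r in zip(self.ranges_per_dim, range_indices):
            bitmap = (bitmap << k) | (1 << (k - 1 - r))
        return bitmap

    def insert(self, obj_id, range_indices):
        """Add the signature of a new object as a new line."""
        self.rows[obj_id] = self._to_bitmap(range_indices)

    def update(self, obj_id, dim, new_range):
        """Move the 1 bit of one dimension using bitwise operations."""
        k = self.ranges_per_dim[dim]
        shift = sum(self.ranges_per_dim[dim + 1:])   # columns right of dim
        mask = ((1 << k) - 1) << shift
        cleared = self.rows[obj_id] & ~mask
        self.rows[obj_id] = cleared | (1 << (shift + k - 1 - new_range))

    def delete(self, obj_id):
        """Remove the line of an object."""
        del self.rows[obj_id]

    def size_in_bits(self):
        return self.width * len(self.rows)

m = BitMatrix([3, 3])
m.insert("o2", (2, 1))
print(format(m.rows["o2"], "06b"))
m.update("o2", dim=0, new_range=1)
print(format(m.rows["o2"], "06b"))
print(m.size_in_bits())
```

Applying the size formula to Dataset 1 of the experiments (256 dimensions, 7 ranges each, 9908 objects) gives $(7 \cdot 256) \cdot 9908 = 17{,}755{,}136$ bits, roughly 2.2 MB.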

Fig. 3. Time Comparison

5 Experimental Results

One of the difficulties encountered in the evaluation of the various high-dimensional indexing methods was the lack of a common platform on which they can be objectively tested. Indexing methods depend on parameters and storage models, which make them better suited for some domains. This makes them difficult to compare, and the majority of the proposed high-dimensional retrieval methods are only compared to the sequential scan. High-dimensional indexing methods, however, follow common steps such as preprocessing the object data, partitioning the search space, index construction, query processing, searching the index, accessing the remaining objects. A first step in the experimentation work has been to develop a Java framework for the integration and benchmark of the various indexing methods [18]. We have included Sequential Scan, Bond [10], VA-File [11], GridBitmap [13], and the proposed BitMatrix.

The first experiment has been designed to test the time performance of the various methods as memory-based indexing methods. The time columns in Figure 3 have two components: the time to build the index in memory (build time) and the time to search it (engine time). The engine time for the BitMatrix is clearly smaller than the values for Sequential Scan and Bond and is in the same range as that of the GridBitmap. The time for the VA-file does not do justice to the method as it is tested using the same partitioning scheme as GridBitmap and BitMatrix; with 7 ranges in each of the 256 dimensions, there are $7^{256}$ cells and the non-empty ones have at most one object.

The second set of experiments was geared toward finding a good parametrization for the BitMatrix. The quality of the parametrization is evaluated comparing the approximate results with the k-nearest neighbors. If $R$ is the set $NN_k(q)$ and $A$ is the approximate result ($|A|$ varies with the query object, $ct$, $et$, partitioning scheme, subspace), the quality is computed as $|A \cap R| / |R|$. This measure can be regarded as a formal recall rate, taking $R$ as the relevant set and $A$ as the answer set. The percentage of objects that remain after pruning is also recorded.

Fig. 4. Histogram of cardinalities: (a) No Expansion, (b) Exp threshold=0.01

Two datasets have been used: a dataset of 9908 image histograms with $N = 256$ dimensions obtained from real images (Dataset 1) and a synthetic dataset of 10000 objects with $N = 80$ dimensions IID (Independent Identical Distributed) uniform distribution (Dataset 2). The cardinality threshold was set as a percentage of the maximum cardinality encountered up to that moment for the subspace. The histograms of cardinalities in Fig. 4 use dark bins for the cardinalities of the $NN_{10}$. On a subspace of 86 dimensions from the original 256, after expansion, all the 10 nearest neighbors have cardinalities larger than 40.
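The two measures recorded in these experiments are simple set computations. The sketch below assumes that the answer set $A$ and the relevant set $R$ are given as collections of object identifiers; the function names and the sample identifiers are illustrative.

```python
def formal_recall(answer, relevant):
    """|A ∩ R| / |R| with A the answer set and R the relevant set NN_k(q)."""
    relevant = set(relevant)
    return len(set(answer) & relevant) / len(relevant)

def percent_accessed(answer, dataset_size):
    """Share of the dataset that remains after pruning, in percent."""
    return 100.0 * len(set(answer)) / dataset_size

relevant = {1, 2, 3, 4}          # hypothetical NN_4(q)
answer = {2, 3, 4, 9, 10}        # hypothetical approximate result
print(formal_recall(answer, relevant))
print(percent_accessed(answer, dataset_size=100))
```

A recall of 1.0 at a low percentage accessed is the ideal outcome: all true neighbors survive the pruning while few exact vectors have to be read.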

Table 1 shows average values of the two measures (formal recall rate and % of objects accessed) with respect to $NN(q)$ and $NN_{10}(q)$ across random sets of 100 queries. With a cardinality threshold of 0.55, less than 3% of Dataset 1 is accessed, and the average recall rate is 0.93 (relative to $NN$) and 0.83 (relative to $NN_{10}$). The experiments have shown that the tradeoff between quality of retrieval and speed can be tuned with the expansion mechanism. For example, with the cardinality threshold 0.47, about 6% of Dataset 1 is accessed, and the average recall rate relative to $NN_{10}$ is 0.91. If expansion is performed the recall is 0.94 at 7.5% accessed, while for a smaller cardinality threshold ($ct = 0.4$) the recall is 0.93 at 10.9% accessed. Thus, expansion should be preferred in this case.

The results for the synthetic IID uniformly distributed Dataset 2 (second half of Table 1) show worse performance. Much larger amounts of Dataset 2 have to be analyzed in order to obtain acceptable recall rates. The expansion mechanism clearly improves the recall rate in this case as well.

Table 1. Testing the BitMatrix on two datasets

Dataset 1: N = 256, k_D = 7, D = 1 ... 256 (k-means partitioning)

        NN(q)                                   NN10(q)
        Naïve (et=0)        et=0.01             Naïve (et=0)          et=0.01
ct      NN Rate  accessed   NN Rate  accessed   NN10 Rate  accessed   NN10 Rate  accessed
0.73    0.6      0.2%       0.7      0.4%       0.39       0.2%       0.48       0.4%
0.67    0.78     0.8%       0.87     1.0%       0.61       0.8%       0.68       1.0%
0.55    0.93     2.9%       0.95     3.8%       0.86       2.9%       0.9        3.8%
0.47    0.97     6.0%       0.99     7.4%       0.91       6.0%       0.94       7.4%
0.40    1.0      10.9%      1.0      13.5%      0.93       10.9%      0.96       13.5%

Dataset 2: N = 80, k_D = 7, D = 1 ... 80 (k-means partitioning)

        NN(q)                                   NN10(q)
        Naïve (et=0)        et=0.01             Naïve (et=0)          et=0.01
ct      NN Rate  accessed   NN Rate  accessed   NN10 Rate  accessed   NN10 Rate  accessed
0.60    0.24     0.19%      0.25     0.21%      0.08       0.19%      0.09       0.21%
0.50    0.55     1.99%      0.57     2.31%      0.33       1.99%      0.35       2.31%
0.40    0.78     8.16%      0.84     9.32%      0.57       8.16%      0.62       9.32%
0.35    0.90     17.0%      0.93     19.2%      0.77       17.0%      0.80       19.2%
0.30    0.90     32.2%      0.94     34.21%     0.85       32.2%      0.87       34.21%

6 Conclusions

The purpose of this work has been to study the BitMatrix with as few assumptions as possible on the underlying structure of the descriptors. The BitMatrix is highly parametrizable, offering a large space for experimentation: cardinality threshold, expansion threshold, number of dimensions processed in each step, dimension processing order for the case of weighted dimensions. While the majority of the high-dimensional indexing approaches are only compared to Sequential Scan, the current experiments were run on top of a prototype framework for integration of high-dimensional indexing methods, and include a set of 5 methods: Sequential Scan, Bond, VA-File, GridBitmap, and BitMatrix.

The experiments revealed that the BitMatrix retains most of sequential scan's flexibility with good quality of the approximations and a much better time performance. It can be conveniently arranged for efficient sequential access and optimized bitwise operations. It can further be broken into segments for distributed or parallel processing. It supports weighted queries and accommodates query feedback mechanisms. Relevant feature selection (dimensionality reduction) and multiple clustering techniques can be used with the BitMatrix as long as the metric is fixed. Future work includes research on the expansion mechanism, extensive testing with larger collections and integration into a multimedia retrieval system.

References

1. Mojsilovic, A.: Semantic Metric for Image Library Exploration. IEEE Transactions on Multimedia 6 (2004) 828-838
2. MPEG-7 Requirements Group (Editor José M. Martínez): MPEG-7 Overview v.10. ISO/IEC JTC1/SC29/WG11 N6828 (2004)
3. Calistru, C., Ribeiro, C., David, G.: A flexible model for multimedia content structure and description: MetaMedia and its applications. In preparation (2006)
4. Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33 (2001) 322-373
5. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33 (2001) 273-321
6. Digout, C., Nascimento, M.A.: High-dimensional similarity searches using a metric pseudo-grid. In: ICDE Workshops 1174. (2005)
7. Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 30 (2005) 364-397
8. Carey, M.J., Haas, L.M., Schwarz, P.M., Arya, M., Cody, W.F., Fagin, R., Flickner, M., Luniewski, A.W., Niblack, W., Petkovic, D., Thomas, J., Williams, J.H., Wimmers, E.L.: Towards heterogeneous multimedia information systems: the Garlic approach. In: RIDE '95: Proceedings of the 5th International Workshop on Research Issues in Data Engineering-Distributed Object Management (RIDE-DOM'95), IEEE Computer Society (1995) 124-131
9. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, ACM Press (2001) 102-113
10. de Vries, A.P., Mamoulis, N., Nes, N., Kersten, M.: Efficient k-NN search on vertically decomposed data. In: SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data, ACM Press (2002) 322-333
11. Weber, R., Schek, H.J., Blott, S.: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In: Proc. 24th Int. Conf. Very Large Data Bases, VLDB. (1998) 194-205
12. Aggarwal, C.C., Yu, P.S.: The IGrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, ACM Press (2000) 119-129
13. Cha, G.H.: Bitmap indexing method for complex similarity queries with relevance feedback. In: MMDB '03: Proceedings of the 1st ACM international workshop on Multimedia databases, New York, NY, USA, ACM Press (2003) 55-62
14. Goldstein, J., Platt, J.C., Burges, C.J.C.: Redundant Bit Vectors for Quickly Searching High-Dimensional Regions. In: Deterministic and Statistical Methods in Machine Learning. (2004) 137-158
15. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When Is Nearest Neighbor Meaningful? Lecture Notes in Computer Science 1540 (1999) 217-235
16. Aggarwal, C.C.: Towards meaningful high-dimensional nearest neighbor search by human-computer interaction. In: Data Engineering, 2002. Proceedings. 18th International Conference on. (2002) 593-604
17. Katayama, N., Satoh, S.: Distinctiveness-Sensitive Nearest Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. In: Proceedings of the 17th International Conference on Data Engineering, Washington, DC, USA, IEEE Computer Society (2001) 493-502
18. Gonçalves, B., Calistru, C., Ribeiro, C., David, G.: Experimental results for multidimensional multimedia descriptor indexing. Submitted for publication (2006)
