© ACM, 2008. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM SIGMOD 2008. http://doi.acm.org/10.1145/1376616.1376639

Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction

Marc Wichterich, Ira Assent, Philipp Kranen, Thomas Seidl
Data Management and Data Exploration Group, RWTH Aachen University, Germany
{wichterich, assent, kranen, seidl}@cs.rwth-aachen.de

ABSTRACT

The Earth Mover's Distance (EMD) was developed in computer vision as a flexible similarity model that utilizes similarities in feature space to define a high quality similarity measure in feature representation space. It has been successfully adopted in a multitude of applications with low to medium dimensionality. However, multimedia applications commonly exhibit high-dimensional feature representations for which the computational complexity of the EMD hinders its adoption. An efficient query processing approach that mitigates and overcomes this effect is crucial. We propose novel dimensionality reduction techniques for the EMD in a filter-and-refine architecture for efficient lossless retrieval. Thorough experimental evaluation on real world data sets demonstrates a substantial reduction of the number of expensive high-dimensional EMD computations and thus remarkably faster response times. Our techniques are fully flexible in the number of reduced dimensions, which is a novel feature in approximation techniques for the EMD.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.2.4 [Database Management]: Systems—multimedia databases

General Terms
Algorithms, Performance

Keywords
Earth Mover's Distance, Dimensionality Reduction, Multimedia Databases, Lower Bound, Filter Distance

1. INTRODUCTION

Multimedia databases are prevalent in many scientific applications and entertainment, ranging from magnetic resonance imaging to music recommendation systems. Similarity search provides users with desired objects from the database if the underlying similarity model reflects their sense of similarity. The Earth Mover's Distance (EMD), developed in computer vision, is a highly adaptable similarity model that incorporates a ground distance on the feature space [17]. It has been successfully used in a number of application domains such as vector fields [11], music retrieval [19], image retrieval [20], phishing detection [4] and shape classification [2]. In an empirical study on dissimilarity measures for images [15], the authors conclude that "EMD is especially attractive since it allows superior classification and retrieval performance with a much more compact representation, but at a higher computational cost".

The EMD is defined as the minimal amount of changes required to transfer one feature representation into the other. Changes are weighted with respect to a ground distance. To compute the EMD, a linear program has to be solved. This can be achieved using e.g. the simplex method for transportation problems [5]. While the exponential theoretical complexity is rarely observed in practice, computation is at least quadratic in the feature dimensionality. This is clearly infeasible for high-dimensional feature spaces.

For high-dimensional databases, efficient query processing is crucial for EMD-based similarity search. Existing multistep approaches rely on specialized lower-bounding filter functions for speedup [17, 14, 1]. These filters are employed to derive a set of candidates which is refined using the Earth Mover's Distance to find the exact results without loss of effectivity. All of these filter approaches have their merits but are not fully flexible in their application. The approaches in [17] and [1] devise filter distance functions other than the EMD and are limited by the dimensionality of the features (e.g. 64 in 64-dimensional color histograms) and by the feature space (e.g. 3 in a 3-dimensional color space like HSV), respectively. The filter proposed in [14] allows for variability regarding the dimensionality of the feature representations in fixed hierarchical steps of factor 4 and is limited to its grid-based image tiling application domain.

Another approach for fast EMD retrieval consists of deriving distance functions that serve as upper bounds to the EMD and thus allow for approximate similarity search but do not guarantee completeness of the retrieval process. In [6, 7] the EMD is embedded into a high-dimensional L1 space with an upper bound for the distortion. In [9] upper bounds for the minimum of the EMD over families of transforms on the input are presented.

Dimensionality reduction allows for efficiency improvement by means of the EMD alone and offers flexibility through the choice of the target dimensionality.



By transforming the features and the ground distance of the Earth Mover's Distance, the linear program is substantially reduced in size. For smaller linear programs, average solution times are much lower than for the original problem as the complexity is clearly superlinear. To ensure that dimensionality reduction is a complete filter in the filter-and-refine architecture, the reduced EMD must be a lower bound [10, 18].

In this work, we propose new techniques for dimensionality reduction of the Earth Mover's Distance which significantly increase efficiency in query processing. In a filter-and-refine architecture, the reduced dimensionality EMD is used as a filtering step. The resulting set of candidates is refined using the Earth Mover's Distance of the original dimensionality to determine the output. Completeness of the filter-and-refine query processing is proven. We formalize the EMD reduction and prove that the presented transformation of the ground distance is optimal for merging of dimensions. Two possible classes of dimensionality reduction are proposed: a flexible data-independent reduction as a generalization of [14] and a novel data-dependent method, which is unique in that it incorporates an analysis of EMD assignments between different dimensions to generate high quality reductions. We demonstrate that our novel approaches improve reduction quality by producing fewer candidates for the expensive refinement step. Advantages of our techniques include the reduced computational complexity (efficiency), the arbitrary number of reduced dimensions (flexibility), the possibility of combining with other EMD lower bounds (chaining) and the absence of false dismissals (completeness).

2. THE EARTH MOVER'S DISTANCE

The Earth Mover's Distance (EMD) is an adaptable distance function that was developed in computer vision as a perceptually meaningful dissimilarity measure [17]. It has been successfully used in a number of application areas as diverse as physics and musicology [11, 19].

Histograms, denoted as feature vectors x = (x_1, ..., x_d), are a widely used representation of the feature distribution of an object. Bin-by-bin distance measures like L_p norms (L_p(x, y) = (Σ_{i=1}^{d} |x_i - y_i|^p)^{1/p}) compute histogram distances by comparing one histogram bin at a time. The Manhattan distance L_1, for example, is the sum of absolute values of the differences in corresponding bins. Neighboring bins are ignored. In a color feature space, this means that small changes in color lead to large distances. This contrasts with human perception, where the overall color distribution outweighs small color changes. An example is given in Figure 1 (top): a slight shift in color might lead to the depicted color histograms x and y for otherwise identical images. An unrelated image might lead to a very different histogram z. According to the Manhattan distance, x would be more similar to z than to y (1.0 < 2.0). This is in stark contrast to human perception.

The Earth Mover's Distance is based on a cross-bin approach, taking the overall distribution of the histogram entries into account. It measures the minimal amount of work necessary to transform ("move") one histogram distribution ("earth" or "mass") into the other. The movement of the mass is referred to as "flow". To determine the EMD, each flow is multiplied, i.e. weighted, by the corresponding ground distance in the feature space. The total amount of work is the sum of these weighted flows. Finally, the EMD is the minimum over all cost-weighted flows that solve the problem.

Figure 1: Earth Mover's Distance. Top: the Manhattan distance rates x = (0.5, 0.0, 0.2, 0.0, 0.3, 0.0) closer to the unrelated z = (1.0, 0.0, 0.0, 0.0, 0.0, 0.0) than to the shifted y = (0.0, 0.5, 0.0, 0.2, 0.0, 0.3) (1.0 < 2.0). Bottom: the EMD with cost matrix C = [c_ij], c_ij = |i - j|, uses the flows f_12 = 0.5, f_34 = 0.2, f_56 = 0.3 for (x, y) and f_11 = 0.5, f_31 = 0.2, f_51 = 0.3 for (x, z), yielding 1.0 < 1.6.

We formally define the Earth Mover's Distance for non-negative vectors of normalized total mass:

Definition 1. Earth Mover's Distance. For two d-dimensional vectors x = (x_1, ..., x_d) and y = (y_1, ..., y_d), ∀ 1 ≤ i ≤ d: x_i, y_i ≥ 0, of normalized total mass Σ_{i=1}^{d} x_i = Σ_{i=1}^{d} y_i = 1 and a cost matrix C = [c_ij] ∈ R^{d×d}, the EMD is defined as a minimization over all possible flows F = [f_ij] under positivity constraints CPos, source constraints CSource and target constraints CTarget:

    EMD_C(x, y) = min_F { Σ_{i=1}^{d} Σ_{j=1}^{d} c_ij f_ij | Constraints }        (1)

with Constraints = CPos ∧ CSource ∧ CTarget:

    CPos:    ∀ 1 ≤ i, j ≤ d:  f_ij ≥ 0                        (2)
    CSource: ∀ 1 ≤ i ≤ d:     Σ_{j=1}^{d} f_ij = x_i          (3)
    CTarget: ∀ 1 ≤ j ≤ d:     Σ_{i=1}^{d} f_ij = y_j          (4)

The ground distance in feature space is formalized in the cost matrix C, where c_ij denotes the cost of moving one unit of mass from bin i to bin j. The constraint CPos ensures that only non-negative flows are considered. CSource restricts flows from bin i to the amount of mass available in the "source" bin x_i. Likewise, CTarget requires flows to bin j to equal the mass in the "target" bin y_j.

Referring back to Figure 1, the Manhattan ground distance yields the cost matrix C with c_ij = |i - j|. The optimal EMD flows are depicted by the arrows. The EMD between x and y is c_12 · f_12 + c_34 · f_34 + c_56 · f_56 = 1 · 0.5 + 1 · 0.2 + 1 · 0.3 = 1, for x and z it is c_11 · f_11 + c_31 · f_31 + c_51 · f_51 = 0 · 0.5 + 2 · 0.2 + 4 · 0.3 = 1.6. Thus, cross-bin computation of the EMD finds the perceived rating for the dissimilarities of the image pairs (x, y) and (x, z).

The Earth Mover's Distance is a special linear program that can be solved using e.g. the simplex method from operations research [5]. The theoretical worst case complexity of the simplex method is exponential in the dimensionality. In practical applications, however, the typical runtime is about cubic [16]. As the optimization is over quadratic matrices in the size of the histograms, the complexity is at least quadratic. For high-dimensional histograms this is clearly infeasible. In [16], the authors state that coarse histograms are typically inadequate as bins are too large to benefit from cross-bin dissimilarity assessment, while for fine histograms runtimes are too slow.
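To make Definition 1 and the Figure 1 numbers concrete, the following sketch (ours, not the authors' implementation) solves the transportation problem with SciPy's linear programming routine; it assumes numpy and scipy with the HiGHS solver are available.

    import numpy as np
    from scipy.optimize import linprog

    def emd(x, y, C):
        # Definition 1: minimize sum_ij c_ij * f_ij subject to
        # f_ij >= 0, sum_j f_ij = x_i (CSource), sum_i f_ij = y_j (CTarget)
        d = len(x)
        A_src = np.kron(np.eye(d), np.ones((1, d)))  # rows: sum_j f_ij
        A_tgt = np.kron(np.ones((1, d)), np.eye(d))  # rows: sum_i f_ij
        res = linprog(C.reshape(-1),
                      A_eq=np.vstack([A_src, A_tgt]),
                      b_eq=np.concatenate([x, y]),
                      bounds=(0, None), method="highs")
        return res.fun

    d = 6
    C = np.abs(np.subtract.outer(np.arange(d), np.arange(d))).astype(float)  # c_ij = |i - j|
    x = np.array([0.5, 0.0, 0.2, 0.0, 0.3, 0.0])
    y = np.array([0.0, 0.5, 0.0, 0.2, 0.0, 0.3])
    z = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    print(emd(x, y, C))  # ~1.0: the shifted histogram stays close
    print(emd(x, z, C))  # ~1.6: the unrelated histogram is farther away

The number of flow variables is d^2, which is why solution times grow quickly with the histogram dimensionality and why the filter-and-refine techniques discussed below pay off.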

Figure 2: Multistep query processing (query object → filter → candidates → refinement → result, applied to the original data).

2.1 Multistep query processing

From a database perspective, efficiency can be improved by multistep filter-and-refine architectures. A filter which is (1) efficient, (2) lower-bounding and (3) selective generates a set of candidates which is refined using the original distance function (cf. Figure 2). The following benefits are obtained from such a setting: (1) Efficient filter computation is necessary to obtain a speedup compared to original distance computations. (2) A lower-bounding filter guarantees that the original distance is never overestimated. For multistep algorithms such as GEMINI or K Nearest neighbor Optimal Processing (KNOP), underestimating asserts completeness, i.e. no false dismissals (for proofs see [10, 18]). (3) Selectivity of a filter refers to the tightness of a lower bound. The tighter the filter approximates the original distance, the smaller the set of candidates, and hence the fewer refinements and the lower the runtimes.

For the Earth Mover's Distance, specialized lower bounds have been suggested [17, 1, 14]. These include averages in feature space, weighted L_p norms and dimension-wise optimization of EMD components. The filter presented in [17] is limited by the dimensionality of the underlying feature space, the ones from [1] by the original dimensionality, and [14] only allows for limited flexibility regarding the reduced dimensionality. None of them makes use of information about the database at hand to improve filter selectivity.

Dimensionality reduction does not rely on separate classes of filter functions. Instead, the original type of distance function is used, yet on a smaller representation of the features. This is especially helpful for distance functions with superlinear complexity, like the Earth Mover's Distance. For smaller dimensionalities, distance calculations can be performed in reasonable time. Moreover, dimensionality reduction can be chained with existing filter functions for the EMD, i.e. an existing filter can additionally be applied to the reduced data, as the result of the dimensionality reduction is again an Earth Mover's Distance.

For most bin-by-bin distances like L_p norms, simple dimensionality reduction is straightforward: by discarding histogram dimensions, only non-negative addends are dropped, thus the resulting distance is guaranteed to be a lower bound. For the Earth Mover's Distance, discarding dimensions can result in larger distances, as the resulting match might be worse than the original one. Consider the example in Figure 3: removing two dimensions results in a larger EMD. Consequently, to avoid false dismissals, simply discarding dimensions is not a valid option.

Figure 3: Discarding dimensions (removing two dimensions from x = (0.5, 0.0, 0.2, 0.3, 0.0, 0.5) and y = (0.0, 0.5, 0.2, 0.3, 0.5, 0.0), yielding x' = (0.5, 0.2, 0.3, 0.0) and y' = (0.0, 0.2, 0.3, 0.5), increases the EMD from 1.0 to 2.0).

A special case of lower bounds is discussed in [14]. Focusing on bioinformatics image data, twelve separate MPEG-7 color layout descriptor measures are computed for a 12x8 tiling of the images. Each image is associated with 12 independent 96-dimensional feature vectors. For these features, the authors derive a hierarchy of filters, constructed by merging "neighboring" histogram bins, where "neighboring" means adjacent with respect to the tiling of the images. We will see in this paper that this merging constitutes a special case of dimensionality reduction which we generalize in section 3.1 to derive reductions in sections 3.3 and 3.4 that are fully flexible in the number of resulting dimensions and have a substantially wider application domain.

3. EMD DIMENSIONALITY REDUCTION

As formalized in Definition 1, the Earth Mover's Distance of two d-dimensional vectors is defined using a cost matrix C which reflects the feature space ground distance. Any dimensionality reduction technique thus has to specify two transformations: first, a rule on how the dimensions of the vectors are reduced, and second, a corresponding reduction for the cost matrix. In this section we formalize dimensionality reduction of the Earth Mover's Distance. The optimal cost matrix reduction is given and formally proven in section 3.2.1. In section 3.2.2 we then concentrate on the choice of the reduction matrices.

3.1 Dimensionality reduction

Formally, a linear dimensionality reduction of vectors can be described via a reduction matrix.

Definition 2. General linear dimensionality reduction. A general linear dimensionality reduction from dimensionality d to d' is characterized by a reduction matrix R = [r_ij'] ∈ R^{d×d'}. The reduction of a d-dimensional vector x = (x_1, ..., x_d) to a d'-dimensional vector x' = (x'_1, ..., x'_d') is defined as:

    x' = x · R        (5)

A subtype of linear dimensionality reductions especially useful for the reduction of the EMD are those reductions that combine original dimensions to form one reduced dimension.

Definition 3. Combining dimensionality reduction. The set ℛ_{d,d'} ⊂ R^{d×d'} of linear dimensionality reduction matrices that reduce the data dimensionality from d to d' by combining original dimensions to form reduced dimensions is defined by:

    R ∈ ℛ_{d,d'}  ⇔    ∀ 1 ≤ i ≤ d, 1 ≤ j' ≤ d':  r_ij' ∈ {0, 1}        (6)
                     ∧ ∀ 1 ≤ i ≤ d:    Σ_{j'=1}^{d'} r_ij' = 1          (7)
                     ∧ ∀ 1 ≤ j' ≤ d':  Σ_{i=1}^{d}  r_ij' ≥ 1           (8)

Restrictions (6) and (7) together assert that each original dimension is assigned to exactly one reduced dimension, i.e. ⋃_{i'=1}^{d'} {i | r_ii' = 1} = {1, ..., d} and (i' ≠ j') ⇒ ({i | r_ii' = 1} ∩ {j | r_jj' = 1} = ∅). The set {i | r_ii' = 1} represents the dimensions i that are combined to reduced dimension i'. Additionally, restriction (7) induces the reduced vector to be of equal total mass as the original vector, which complies with Definition 1. Restriction (8) ensures that every reduced dimension is assigned at least one original dimension.

A reduced Earth Mover's Distance is defined via a reduction for the query vector and a reduction for the database vectors. Both reductions are used to compute a reduced cost matrix (cf. Figure 4).

Definition 4. Reduced Earth Mover's Distance. For two d-dimensional vectors x, y and a cost matrix C according to Definition 1 and for two reduction matrices R1 ∈ ℛ_{d,d1} and R2 ∈ ℛ_{d,d2}, the lower-bounding reduced EMD is defined as:

    EMD_C^{R1,R2}(x, y) = EMD_{C'}(x · R1, y · R2)        (9)

where C' ∈ R^{d1×d2} is a lower-bounding reduced cost matrix.

The lower-bounding reduced cost matrix C' is formally introduced in Definition 5. This reduced cost matrix is based on a worst-case assumption to guarantee the lower-bounding property for the filter step. The sparse combining reduction matrices according to Definition 3 limit the worst cases that can occur (cf. section 3.2.1) when compared to dimensionality reduction techniques such as PCA, ICA and Random Projection where r_ij ∈ R. Our tests with PCA (amended by an extra dimension to preserve the total mass) resulted in very poor retrieval efficiency due to the concessions that had to be made for the reduced cost matrix in order to guarantee the lower-bounding property. The two possibly differing reduction matrices R1, R2 of differing dimensionality applied to the EMD operands (requiring only a minor extension of Definition 1 to support two differing vector dimensionalities) allow for handling the feature vectors in the database separately from the feature vectors of the queries. In particular, a database reduction to a low dimensionality for indexing in multidimensional structures and, at the same time, only slight or no reduction of the query for high approximation quality is possible in our approach.

Figure 4: Derivation of the reduced cost matrix C' from histogram reduction matrices R1 and R2 (original feature space: x, y, C with EMD_original; dimensionality reduction via R1, R2; reduced feature space: x', y', C' with EMD_reduced).

3.2 Optimal dimensionality reduction

We define optimality of dimensionality reduction with respect to the efficiency of similarity search. During multistep query processing, dimensionality reduction is used to generate a set of candidates which is refined using the original dimensionality (cf. Figure 2). Smaller candidate sets induce fewer refinement computations and thus result in less computation time in the refinement step. For given target dimensionalities d1 and d2, the optimal dimensionality reduction is therefore the reduction which yields the smallest candidate sets during query processing.

3.2.1 Optimal cost reduction

Any reduction of the dimensionality of the Earth Mover's Distance requires specification of a corresponding reduced cost matrix. This cost matrix provides the ground distance in the new reduced feature space. Consequently, the reduced cost matrix depends on the reduction matrices of Definition 3. The optimal cost matrix with respect to given reduction matrices is the one which provides the largest lower bound to the EMD in the original dimensionality. As we will prove, the tightest possible reduced cost matrix consists of minima over the original cost entries. To illustrate why those minima have to be chosen, we give an example of the worst case that leads to this condition: to ensure the lower bound property, underestimating the true distance means assuming the worst case, i.e. that the original mass was transferred at minimum cost. Consider x = (0, 1, 0, 0) and y = (0, 0, 1, 0) and Manhattan ground distance (compare the cost matrix in Figure 1). Their Earth Mover's Distance is then simply 1 (moving one unit of mass from dimension x_2 to y_3 at ground distance 1: 1 · 1). Combining the first two and the last two dimensions, the reduced features are x' = (1, 0) and y' = (0, 1). The minimum cost entry from the original dimensions x_1 and x_2 to dimensions y_3 or y_4 is the cost from x_2 to y_3, which is indeed the 1 that was used in the original EMD. If this value were to be exceeded, the lower bound property would be lost. We formalize this definition and prove optimality.

Definition 5. Optimal Reduced Cost Matrix. For two d-dimensional vectors x, y and a cost matrix C according to Definition 1 and for a reduced EMD_C^{R1,R2} according to Definition 4, the optimal reduced cost matrix C' = [c'_i'j'] is defined by:

    c'_i'j' = min{ c_ij | r1_ii' = 1 ∧ r2_jj' = 1 }        (10)

In case of R1 = R2 ∈ ℛ_{d,d'}, the reduced cost matrix C' defined by (10) is equivalent to the lower bounding cost matrix of [14]. We show that C' is lower bounding for R1 ≠ R2, too. Furthermore, we prove that the reduced cost matrix results in a greatest lower bound for given reduction matrices R1 and R2.
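Before the formal proof, a small numeric sketch (ours; it assumes the emd helper from the Section 2 example is in scope) illustrates Definitions 3 to 5 on the Figure 1 histograms: the six dimensions are merged pairwise, the reduced cost matrix takes minima as in Equation (10), and the reduced distances never exceed the original ones.

    import numpy as np

    # combining reduction matrix (Definition 3): dims {1,2} -> 1', {3,4} -> 2', {5,6} -> 3'
    R = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 1]], dtype=float)

    def reduce_cost(C, R1, R2):
        # optimal reduced cost matrix (Definition 5): minimum cost over the merged bins
        d1, d2 = R1.shape[1], R2.shape[1]
        C_red = np.empty((d1, d2))
        for i_r in range(d1):
            for j_r in range(d2):
                rows = np.where(R1[:, i_r] == 1)[0]
                cols = np.where(R2[:, j_r] == 1)[0]
                C_red[i_r, j_r] = C[np.ix_(rows, cols)].min()
        return C_red

    C = np.abs(np.subtract.outer(np.arange(6), np.arange(6))).astype(float)
    x = np.array([0.5, 0.0, 0.2, 0.0, 0.3, 0.0])
    y = np.array([0.0, 0.5, 0.0, 0.2, 0.0, 0.3])
    z = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

    C_red = reduce_cost(C, R, R)                    # [[0,1,3],[1,0,1],[3,1,0]]
    print(emd(x @ R, y @ R, C_red), emd(x, y, C))   # 0.0 <= 1.0 (reduced EMD is a lower bound)
    print(emd(x @ R, z @ R, C_red), emd(x, z, C))   # 1.1 <= 1.6

Note how coarse the bound can become: x and y collapse onto the same reduced histogram. This looseness is exactly what the data-dependent reduction of section 3.4 tries to limit; in contrast, simply discarding dimensions as in Figure 3 can overestimate the distance and is therefore not admissible at all.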

Theorem 1. Lower bound. Given two reduction matrices R1 ∈ ℛ_{d,d1} and R2 ∈ ℛ_{d,d2} and a cost matrix C ∈ R^{d×d}, the reduced cost matrix C' according to (10) provides a lower bound:

    ∀ x, y ∈ R^d:  EMD_{C'}(x · R1, y · R2) ≤ EMD_C(x, y)

Proof. We denote the optimal flow matrix for the original EMD_C by F = [f_ij] and the combined flows that satisfy constraints (2) to (4) in Definition 1 for the reduced EMD_{C'} by

    f'_i'j' = Σ_{i: r1_ii'=1} Σ_{j: r2_jj'=1} f_ij

and get

    EMD_C(x, y) = Σ_{i=1}^{d} Σ_{j=1}^{d} f_ij c_ij
      (i)   =  Σ_{i'=1}^{d1} Σ_{j'=1}^{d2} ( Σ_{i: r1_ii'=1} Σ_{j: r2_jj'=1} f_ij c_ij )
      (ii)  ≥  Σ_{i'=1}^{d1} Σ_{j'=1}^{d2} ( ( Σ_{i: r1_ii'=1} Σ_{j: r2_jj'=1} f_ij ) · min{ c_ij | r1_ii' = 1 ∧ r2_jj' = 1 } )
      (iii) =  Σ_{i'=1}^{d1} Σ_{j'=1}^{d2} ( ( Σ_{i: r1_ii'=1} Σ_{j: r2_jj'=1} f_ij ) · c'_i'j' )
      (iv)  =  Σ_{i'=1}^{d1} Σ_{j'=1}^{d2} f'_i'j' c'_i'j'
      (v)   ≥  EMD_{C'}(x · R1, y · R2)

In step (i) we simply split the summation to let it sum over the reduced dimensions, using ⋃_{i'=1}^{d1} {i | r1_ii' = 1} = {1, ..., d} and ⋃_{j'=1}^{d2} {j | r2_jj' = 1} = {1, ..., d}. Step (ii) replaces the individual costs within the brackets by the minimum over all these costs, hence the sum can only decrease. Since this minimum complies with Equation (10), it is replaced by c'_i'j' in step (iii). Step (iv) substitutes the summed flows in the brackets by the combined flows introduced above. (v) holds since the left side of the inequality is a feasible solution to the reduced transportation problem, albeit not necessarily a minimal one, which is given by EMD_{C'}.

Before we prove the optimality of C' according to (10), we introduce a further property of the EMD that we call monotony. The monotony states that the quality of a lower bound by dimensionality reduction increases, i.e. the reduced EMDs are tighter with respect to the original EMDs, if the values in the cost matrix increase.

Theorem 2. Monotony of the EMD. Given two cost matrices C1, C2 ∈ R^{d×d} it holds:

    C1 ≤ C2  ⇔  ∀ x, y:  EMD_{C1}(x, y) ≤ EMD_{C2}(x, y)

where C1 ≤ C2 ⇔ (C1 = C2) ∨ (∀ i, j ∈ {1, ..., d}: c1_ij ≤ c2_ij ∧ ∃ i, j ∈ {1, ..., d}: c1_ij < c2_ij).

Proof. (Sketch) "⇒": Since the EMD is a sum of terms f_ij · c_ij with f_ij ≥ 0, it can only decrease when the c_ij are less or equal (C1 ≤ C2). "⇐": The proof is given by contradiction: assume the right side holds for all x and y and ∃ c1_îĵ > c2_îĵ, i.e. there is at least one entry in C1 that is bigger than the corresponding entry in C2. Then a counter example suffices for the proof. This can be achieved by two histograms x^0 and y^0 with

    x^0_i = 1 if i = î, 0 otherwise        and        y^0_j = 1 if j = ĵ, 0 otherwise

which lead to a contradiction of the assumption:

    EMD_{C1}(x^0, y^0) = 1 · c1_îĵ > 1 · c2_îĵ = EMD_{C2}(x^0, y^0)

After proving the lower bounding property and the monotony of the EMD, we now prove that there is no better reduced cost matrix for given reduction matrices R1 and R2, i.e. C' according to Definition 5 is optimal.

Theorem 3. Optimality. Given a cost matrix C ∈ R^{d×d} and two reduction matrices R1 ∈ ℛ_{d,d1} and R2 ∈ ℛ_{d,d2}, there is no greater lower bound than the one provided by C' according to (10):

    ¬∃ C'' ∈ R^{d1×d2} ∀ x, y ∈ R^d:  Tighter ∧ LB ∧ (C'' ≠ C')

where

    LB:       EMD_{C''}(x · R1, y · R2) ≤ EMD_C(x, y)
    Tighter:  EMD_{C'}(x · R1, y · R2) ≤ EMD_{C''}(x · R1, y · R2)

Proof. For the proof we assume the negation

    ∃ C'' ≠ C' ∀ x, y ∈ R^d:  Tighter ∧ LB

and show a contradiction. To comply with the Tighter constraint, the monotony of the EMD requires ∀ i' ∈ {1, ..., d1}, j' ∈ {1, ..., d2}: c'_i'j' ≤ c''_i'j'. Since we have C'' ≠ C', it must hold that ∃ c'_îĵ < c''_îĵ. As C' was computed according to Definition 5, we know that c'_îĵ = min{ c_ij | r1_iî = 1 ∧ r2_jĵ = 1 }. We now construct two vectors x^0 and y^0 that contradict the LB constraint. We choose two original dimensions i_0 ∈ {i | r1_iî = 1} and j_0 ∈ {j | r2_jĵ = 1} that result in c_i0j0 = c'_îĵ and set

    x^0_i = 1 if i = i_0, 0 otherwise        and        y^0_j = 1 if j = j_0, 0 otherwise

As the only flow in the original EMD is between x^0_i0 and y^0_j0, we have

    EMD_C(x^0, y^0) = Σ_{i=1}^{d} Σ_{j=1}^{d} f_ij c_ij = f_i0j0 · c_i0j0 = 1 · c'_îĵ

and due to i_0 ∈ {i | r1_iî = 1} and j_0 ∈ {j | r2_jĵ = 1}

    EMD_{C''}(x^0 · R1, y^0 · R2) = Σ_{i'=1}^{d1} Σ_{j'=1}^{d2} f'_i'j' c''_i'j' = f'_îĵ · c''_îĵ = 1 · c''_îĵ

which together with c'_îĵ < c''_îĵ contradicts LB:

    EMD_C(x^0, y^0) = c'_îĵ < c''_îĵ = EMD_{C''}(x^0 · R1, y^0 · R2)


Theorem 3 proves the reduction of a cost matrix C to C' according to Definition 5 to be an optimal lower bound for given reduction matrices R1 and R2. Therefore, we now focus on how to find good reduction matrices. To simplify the discussion, we assume R1 = R2 and write EMD_C^R(x, y) := EMD_C^{R,R}(x, y). However, our methods can be extended to different simultaneous reductions in a straightforward manner.

3.2.2 Optimal flow reduction

As discussed above, the optimal reduction of the cost matrix in the Earth Mover's Distance depends entirely on the reduction matrices. Consequently, the efficiency of any EMD reduction according to Definition 4 depends solely on the choice of R. We define what would constitute an optimal choice of R. As that R is not attainable in practice, we then introduce approximations that are found to result in efficient reductions as shown in our experiments. Given a d-dimensional query point x and a query distance ε, the optimal reduction R_ε^x ∈ ℛ_{d,d'} to dimensionality d' can be defined in terms of the number of refinements required to answer an ε range query for a database DB:

    R_ε^x = arg min_{R' ∈ ℛ_{d,d'}} |{ y ∈ DB | EMD_C^{R'}(x, y) ≤ ε }|

Due to the lower-bounding property, only elements in the above set can potentially still have a refined distance below ε. Since this optimality is only concerned with one single query x, one typically chooses a workload w representative of the expected queries and defines optimality with respect to said workload.

Definition 6. Optimal EMD reduction. Given a workload w = {(x_1, ε_1), ..., (x_t, ε_t)}, where x_i is a query vector and ε_i the corresponding range threshold, the optimal reduction R_w ∈ ℛ_{d,d'} for w is:

    R_w = arg min_{R' ∈ ℛ_{d,d'}} Σ_{(x,ε) ∈ w} |{ y ∈ DB | EMD_C^{R'}(x, y) ≤ ε }|

While this equation describes the desired optimal reduction, the search space for the optimization is immense even for small databases and small dimensionalities. Due to the size of the combining reduction matrix, a (d · d')-variable 0-1 integer optimization problem with restrictions according to Definition 3 has to be solved. Summing over the workload, the objective function consists of |w| · |DB| individual (d · d')-variable linear optimization problems. Exhaustive enumeration of all possible reductions requires the computation of a total of d'^(d-d') · |w| · |DB| reduced EMDs. Even for a reduction from 16 to 8 dimensions of a database of size 1000 and a workload of size 100, this requires over 1.67 · 10^12 EMD computations. As this is practically infeasible, we discuss heuristics that result in efficient reductions as seen in section 5.

3.3 Clustering-based reduction

The first approach we propose is a data-independent dimensionality reduction which generalizes the approach of [14] as stated in section 2.1. It is based on clustering algorithms and is motivated by the monotony of the Earth Mover's Distance (cf. Theorem 2). Monotony means that higher cost matrix entries produce tighter dimensionality reductions. Therefore, we propose a method that combines the original dimensions in such a way that the distances between the resulting reduced dimensions are as great as possible. At the same time, the distance information that is lost shall be as small as possible. These two demands correspond to the well-known goals of maximum inter-class dissimilarity and minimum intra-class dissimilarity aimed for by clustering algorithms.

Figure 5: Clustering based on the ground distance (d = 4, d' = 2: combining dimensions d1, d2 to d1' and d3, d4 to d2' reduces C = (0 1 3 4; 1 0 2 3; 3 2 0 1; 4 3 1 0) to C' = (0 2; 2 0) via the minima of Definition 5; the intra-class entries are the "lost" information).

Figure 5 gives an example for d = 4 and d' = 2, where the original dimensions d1, d2 are combined to the reduced dimension d1' and d3, d4 to d2'. The lost distance information within d1' is c_12 = c_21 = 1 and within d2' it is c_34 = c_43 = 1. The distance that is preserved between d1' and d2' is c_23 = c_32 = 2, which is the minimum according to Definition 5.

A further postulate is flexibility in terms of the number of reduced dimensions d'. This flexibility allows users full control of the trade-off between the quality of the approximation and the efficiency of the filter step computation. This is a unique advantage of our method that none of the existing approximation techniques for the EMD provide. We meet these demands for flexible data-independent dimensionality reduction by clustering on the ground distance between dimensions in the Earth Mover's Distance. Specification of d' is possible with partitioning clustering algorithms such as k-means or k-medoids [8]. Both of these algorithms start with an initial random partition of the data into k groups, where k is the user-specified number of clusters. Working in an iterative manner, these algorithms assign points to the nearest cluster center and re-compute these centers for the new partitioning until the clustering is stable with respect to a quality criterion. k-means uses the arithmetic mean as center of clusters, whereas k-medoids chooses a central point from the data set as representative. The points in our case refer to the original dimensions of the feature space. In this work we opt for the k-medoids algorithm because, unlike k-means, it does not require an explicit distance function for the feature space. Thus we can handle any data set if the two inputs for the EMD calculation are provided (histograms and cost matrix). This holds even if the ground distance function is not explicitly known.

We sketch our k-medoids algorithm on the EMD dimensions in the following; for details refer to [8]. The algorithm starts by randomly choosing k representatives (medoids) from the set of original dimensions and assigns the remaining ones to their nearest medoid according to the cost matrix. The quality of the clustering is determined as the total distance defined as:

    TD = Σ_{i'=1}^{d'} Σ_{i: r_ii'=1} c_{i m_i'}

where m_i' is the representing medoid of the cluster that corresponds to the reduced dimension i'. The total distance thus reflects the degree of dissimilarity within the clusters, i.e. the objective function that the algorithm tries to minimize. In the next step, the algorithm aims at improving the clustering. It determines the total distance that results when swapping a non-medoid with a medoid. In a greedy manner, the configuration with the lowest total distance is chosen and the corresponding pair is swapped. The algorithm terminates if no swapping leads to further improvement of the total distance. The result is a clustering into k partitions. In our case, each of the k clusters corresponds to one of the d' reduced dimensions. By setting the parameter k, the reduced dimensionality can thus be flexibly chosen. The elements contained in cluster i' are the original dimensions that were combined to i'. Since there is no knowledge about the underlying data set incorporated in this approach, it is likely that one sacrifices great potential of improving the choice of the reduction matrix. We bridge this gap by introducing a second method for dimensionality reduction in the next section.
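A minimal version of this clustering step can be sketched as follows (our illustration, not the authors' implementation; it applies improving swaps as soon as they are found rather than always picking the single best swap per iteration, and all names are ours).

    import numpy as np

    def kmedoids_reduction(C, k, seed=0):
        # cluster the d original dimensions into k reduced dimensions,
        # using the ground-distance matrix C as pairwise dissimilarity
        rng = np.random.default_rng(seed)
        d = C.shape[0]
        medoids = list(rng.choice(d, size=k, replace=False))

        def total_distance(meds):
            # TD: every original dimension contributes its cost to the nearest medoid
            return sum(min(C[i, m] for m in meds) for i in range(d))

        best = total_distance(medoids)
        improved = True
        while improved:
            improved = False
            for mi in range(k):                 # try replacing each medoid ...
                for cand in range(d):           # ... by each non-medoid dimension
                    if cand in medoids:
                        continue
                    trial = medoids.copy()
                    trial[mi] = cand
                    td = total_distance(trial)
                    if td < best:
                        best, medoids, improved = td, trial, True
        # build the combining reduction matrix R (Definition 3)
        R = np.zeros((d, k))
        for i in range(d):
            R[i, int(np.argmin([C[i, m] for m in medoids]))] = 1.0
        return R, medoids

    C = np.abs(np.subtract.outer(np.arange(6), np.arange(6))).astype(float)
    R, medoids = kmedoids_reduction(C, k=3)
    print(medoids)   # chosen representative dimensions
    print(R)         # 6x3 combining reduction matrix

The resulting R can be used directly as the reduction matrix of Definition 4, or as the KMed initial solution for the flow-based optimization of the next section.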

3.4 Flow-based reduction

We call our second, data-dependent method for dimensionality reduction flow-based reduction (FB reduction). Our algorithm incorporates knowledge on the underlying data set to generate a tighter reduction. We collect information about the flows of unreduced EMD computations to guide the process of generating tighter reduction matrices. At first, computing unreduced EMDs in order to later approximate them might sound like a paradox, as this preprocessing step requires additional effort. However, this investment is more than justified through faster search times during query processing, i.e. the preprocessing is done once and does not affect the response times.

The unreduced EMD is a sum of terms c_ij · f_ij. For a tight lower bound, we want to achieve largest possible terms c'_i'j' · f'_i'j'. Since we can derive an optimal reduced cost matrix [c'_i'j'] by applying Theorem 3, we have to increase the reduced flows with respect to c'_i'j'. This way, the reduced EMD increases and with it the quality of the lower bound. The information we incorporate is the average flow matrix F^S = [f^S_ij] with f^S_ij = (1/|S|^2) Σ_{x,y ∈ S} f_ij(x, y) over a sample S of the database. We approximate the flows occurring in a reduced EMD by the average original flows aggregated according to the respective reduction matrix:

    aggrFlow(F, R, i', j') = Σ_{i: r_ii'=1} Σ_{j: r_jj'=1} f_ij        (11)

The aggregated flow from the reduced dimension i' to j' is based on the original flows f_ij between dimensions i and j that the reduction matrix R combined to i' and j', respectively. We measure the expected tightness of a reduction R as the sum of the aggregated flows weighted by the cost matrix C' optimally reduced according to R:

    Σ_{i'=1}^{d'} Σ_{j'=1}^{d'} aggrFlow(F, R, i', j') · c'_i'j'        (12)

The global optimization of this term requires computing all possible reductions, which is clearly infeasible (cf. Section 3.2.2). Therefore, in our proposed algorithms we reassign one original dimension at a time to iteratively improve the reduction matrix. Figure 6 illustrates the steps we take to create a reduction matrix with our flow-based heuristic. In the first step we draw a sample S from the database. In the second step the EMD (original dimensionality) is calculated on the sample, i.e. we calculate the distances for each pair of histograms in S. While doing this, we sum up the EMD flow matrices F of the histogram pairs to obtain the average flows F^S. Starting from an initial reduction matrix, the third and main step of the approach finds a local maximum of the expected lower bound tightness (Equation 12) by utilizing the aggregated flow information.

Figure 6: Flow-based reduction (step 1: sample data S from the original data; step 2: calculate EMDs on the sample and collect the flows F^S; step 3: optimize the reduction matrix R starting from an initial solution).

We propose two variants of an algorithm that solve step 3. Figure 8 shows pseudo code for the first variant, which we named FB-Mod (flow-based reduction - modulo). The algorithm takes the current reduction matrix, starts at the first original dimension and changes its assignment. To this end, it iteratively assesses the assignment of the original dimension to each reduced dimension. If the quality of the resulting reduction matrix is better than the current solution, the change is made persistent and the algorithm continues with the next original dimension. Once it reaches the last original dimension, it starts over at the first one until it visits the same original dimension twice without any changes in assignments. The expected tightness of a reduction matrix is calculated using the calcTight method displayed in Figure 7. It consists of three steps:

1. Change the assignment of the given original dimension from its current assignment to the given reduced dimension.
2. Reduce the original cost matrix according to the resulting reduction matrix.
3. Sum up the products of the reduced costs and the aggregated flows according to (12).

    double calcTight(R, F, C, origDim, newRedDim, d') {
      result = 0.0;
      // copy the reduction matrix and temporarily reassign the dimension
      R' = R.copy();
      R'.reassign(origDim, newRedDim);
      // reduce the original cost matrix according to R'
      C' = reduceCostMatrix(C, R');
      // sum up the reduced costs * aggregated flows
      for (i' = 0; i' < d'; i'++) {
        for (j' = 0; j' < d'; j'++) {
          result += aggrFlow(F, R', i', j') * C'[i'][j'];
        }
      }
      return result;
    }

Figure 7: Tightness measure for a reduction
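Steps 1 and 2 of Figure 6 (sampling and flow collection) can be sketched as follows (ours, not the authors' implementation; it re-solves the transportation LP with SciPy and reads the optimal flow matrix from the solution vector).

    import numpy as np
    from scipy.optimize import linprog

    def emd_with_flows(x, y, C):
        # solve the EMD transportation LP and return (distance, optimal flow matrix)
        d = len(x)
        A_src = np.kron(np.eye(d), np.ones((1, d)))   # sum_j f_ij = x_i
        A_tgt = np.kron(np.ones((1, d)), np.eye(d))   # sum_i f_ij = y_j
        res = linprog(C.reshape(-1), A_eq=np.vstack([A_src, A_tgt]),
                      b_eq=np.concatenate([x, y]), bounds=(0, None), method="highs")
        return res.fun, res.x.reshape(d, d)

    def average_flow_matrix(sample, C):
        # F^S: average optimal flow matrix over all pairs of the sample
        d = sample.shape[1]
        F_S = np.zeros((d, d))
        for x in sample:
            for y in sample:
                _, F = emd_with_flows(x, y, C)
                F_S += F
        return F_S / (len(sample) ** 2)

    # tiny demo with the Figure 1 histograms standing in for a database sample S
    C = np.abs(np.subtract.outer(np.arange(6), np.arange(6))).astype(float)
    S = np.array([[0.5, 0.0, 0.2, 0.0, 0.3, 0.0],
                  [0.0, 0.5, 0.0, 0.2, 0.0, 0.3],
                  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
    print(average_flow_matrix(S, C).round(3))

The entries of F^S are what aggrFlow sums per pair of reduced dimensions, so calcTight can evaluate Equation (12) repeatedly without touching the database again.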


    ReductionMatrix optimizeFB_MOD(R, F, C, d, d') {
      origDim = 0;
      lastOrigDimChanged = 0;
      // get the expected tightness of R without any changes
      currentTightness = calcTight(R, F, C, 0, R.getAssignment(0), d');
      // iterate over original dimensions and change their assignment
      repeat {
        threshold = currentTightness * THRESH;    // improvement threshold
        // try each assignment to a reduced dimension
        for (redDim = 0; redDim < d'; redDim++) {
          // calculate expected tightness when changing the assignment
          swapTightness = calcTight(R, F, C, origDim, redDim, d');
          if (swapTightness - currentTightness > threshold) {
            R.reassign(origDim, redDim);          // change assignment
            lastOrigDimChanged = origDim;         // track last change
            currentTightness = swapTightness;
            break;                                // found improvement
          }
        }
        origDim = (origDim + 1) % d;              // start over at dim 0
      } until (origDim == lastOrigDimChanged);    // stop if no improvement
      return R;
    }

Figure 8: Computing a reduction using FB-Mod

    ReductionMatrix optimizeFB_ALL(R, F, C, d, d') {
      bestOrigDim = -1; bestRedDim = -1;          // track best change
      improved = true;
      // get the expected tightness of R without any changes
      currentTightness = calcTight(R, F, C, 0, R.getAssignment(0), d');
      while (improved) {
        improved = false;
        threshold = currentTightness * THRESH;    // improvement threshold
        // iterate over all original dimensions
        for (origDim = 0; origDim < d; origDim++) {
          // try each assignment to a reduced dimension
          for (redDim = 0; redDim < d'; redDim++) {
            // calculate expected tightness when changing the assignment
            swapTightness = calcTight(R, F, C, origDim, redDim, d');
            // track values if the change was better
            if (swapTightness - currentTightness > threshold) {
              currentTightness = swapTightness;
              bestOrigDim = origDim;
              bestRedDim = redDim;
              improved = true;
            }
          }
        }
        // use the best assignment from this iteration
        if (improved) R.reassign(bestOrigDim, bestRedDim);
      }
      return R;
    }

Figure 9: Computing a reduction using FB-All

The second variant of our algorithm does not necessarily apply the first reassignment that yields a better solution (Fig. 9). Instead, it evaluates all possibilities before choosing the one single reassignment which results in the best reduction matrix. It then starts the next iteration until no further improvement is achieved. We therefore call it FB-All. We propose using one of two differing initial reduction matrices. For a baseline solution, all original dimensions are assigned to the first reduced dimension. Alternatively, the result from the clustering-based dimensionality reduction (section 3.3) is used. In this case, our algorithms start from a solution that reflects the ground distances in the feature space. In our experiments we refer to these two initial reductions as Base and KMed, respectively.

4. QUERY PROCESSING ALGORITHM

In this section, we describe query processing for k nearest neighbor queries. As discussed previously, this extends to range queries in a straightforward manner. While in knn-queries the value for ε is not known a priori, the distance of the kth nearest neighbor corresponds to an ε value for a range query with the same result set. As described in Section 2.1, complete multistep query processing in the GEMINI or KNOP framework requires lower-bounding filter functions [10, 18]. We have shown that our dimensionality reduction techniques provide such lower bounds. Consequently, we use the reduced EMD in such an algorithm, following the optimal (with respect to the number of refinements) KNOP framework [18].

The resulting algorithm for k nearest neighbor queries on a ranking with respect to a lower bounding filter function is illustrated in Figure 11. For a specified parameter k and a query object, k initial results are retrieved from the getNext method of the filter ranking. They are refined and inserted into the kNeighbors result set, sorted with respect to their actual distance from the query. Next, the getNext method of the base ranking is queried for the next best object with respect to the filter distance. If the filter distance is smaller than the current kth nearest neighbor in the kNeighbors set, the object is refined and compared against the current kth nearest neighbor with respect to the actual distance. If smaller, it is sorted into the result set, displacing the furthest one from the set. This is repeated until the filter distance is larger than the current kth result. As soon as the filter distance is larger, none of the remaining objects have a smaller filter distance. And since the filter distance is a lower bound of the actual distance, their actual distance is also larger. The kNeighbors set now contains the actual k nearest neighbors.

Figure 11: k nearest neighbor query processing on a filter ranking (pseudo code: PairList getNeighbors(q, k, refinementDistF, baseRanking)).

The dimensionality reduction techniques presented can be flexibly combined. As the reduced distance function again is an EMD computation, we can use existing filters for the EMD on the reduced dimensionality. This chaining of lower-bounding filters, widely used in multistep query processing, allows for efficient query processing as our experiments show. The LB_IM technique from [1] is such a lower bound with respect to the Earth Mover's Distance. The chained multistep setup we use in our work, i.e. a combination of three different distance functions, is illustrated in Figure 10: LB_IM on dimensionality reduced features (Red-IM, filter 1) is followed by the reduced EMD (Red-EMD, filter 2) before refinement using the original dimensionality EMD computes the final result set. Each of the filter functions in the chain is a lower bound to the next one, which guarantees completeness in multistep query processing as proven in [10, 18]. As indicated, different reduction matrices for query and database, denoted as R1 and R2 respectively, may be used.

Figure 10: Multi-step setup for query processing (the query object and the original data are reduced via R1 and R2; filter 1 (Red-IM) produces candidates for filter 2 (Red-EMD), whose candidates are refined with the EMD to obtain the result).

Using all three distance functions requires computation of a Red-EMD ranking based on the previous Red-IM filter ranking. Pseudo code for this algorithm is given in Figure 12. A getNext query is provided given a base filter ranking. Initially, a first candidate is retrieved from the base ranking. While the next distance with respect to the
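The k nearest neighbor loop described for Figure 11 can be sketched as follows (ours; the class and method names of the original pseudo code are replaced by plain functions, and filter_dist / exact_dist are assumed callables, e.g. the reduced EMD and the full-dimensional EMD).

    import heapq

    def knn_multistep(query, k, database, filter_dist, exact_dist):
        # KNOP-style processing: scan objects in ascending filter-distance order,
        # refine with the exact distance, and stop once the next filter distance
        # exceeds the exact distance of the current k-th neighbor
        ranked = sorted((filter_dist(query, obj), i) for i, obj in enumerate(database))
        heap = []  # max-heap via negated exact distances: (-dist, index)
        for lb, i in ranked:
            if len(heap) == k and lb >= -heap[0][0]:
                break  # lower-bounding filter: no remaining object can be closer
            dist = exact_dist(query, database[i])   # expensive refinement (full EMD)
            if len(heap) < k:
                heapq.heappush(heap, (-dist, i))
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, (-dist, i))
        return sorted((-neg_d, i) for neg_d, i in heap)

In the chained setup of Figure 10, the role of filter_dist would be played by a ranking produced from Red-IM followed by Red-EMD, while exact_dist is the original-dimensionality EMD; the break condition is what guarantees completeness without refining the whole database.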