IEICE TRANS. INF. & SYST., VOL.E86–D, NO.6 JUNE 2003


PAPER

A Dimensionality Reduction Method for Efficient Search of High-Dimensional Databases Zaher AGHBARI† , Nonmember, Kunihiko KANEKO† , and Akifumi MAKINOUCHI† , Regular Members

SUMMARY In this paper, we present a novel approach for efficient search of high-dimensional databases, such as databases of video shots. The idea is to map feature vectors from the high-dimensional feature space into points in a low-dimensional distance space. Then, a spatial access method, such as an R-tree, is used to cluster these points based on their distances in the low-dimensional space. Our mapping method, called topological mapping, guarantees no false dismissals in the result of a query. However, the result of a query might contain some false alarms; hence, two refinement steps are performed to remove them. Comparative experiments on a database of video shots show the superior efficiency of the topological mapping method over other known methods.
key words: multidimensional indexing, video indexing and retrieval, video modeling

1. Introduction

Tree-based spatial indexing methods (the R-tree [9] and its variants [2][16]) are known to be inefficient when indexing high-dimensional feature vectors; however, they are very efficient when indexing low-dimensional data [5][17]. This calls for techniques that reduce the dimensionality of feature vectors so that they can be efficiently indexed by a spatial access method (SAM), such as an R-tree. In this paper, we propose a dimensionality reduction method, called topological mapping, that maps high-dimensional feature vectors of objects, such as color histograms (points in a high-dimensional color space), into points in a low-dimensional distance space and, at the same time, guarantees the correctness of the query results. Consequently, these mapped points can be efficiently indexed by a SAM, such as an R-tree. Using this framework, a query (in the original high-dimensional feature space) of the form "Find objects similar to the object Q" becomes the query (in the low-dimensional distance space) "Find points that are close to the query point q". That is, a query becomes a range query or a nearest-neighbor query in a dimensionally reduced space [12]. Range queries are of the form "Find points that are within distance ε from the query point q", and nearest-neighbor queries are of the form "Find the first K closest points to the query point q".

Our approach addresses two main problems:
1. Dimensionality "curse": Most SAMs scale exponentially for high-dimensional data, eventually degrading to the performance of a sequential scan [4]. To solve this problem, the topological mapping method reduces the dimensionality of the feature vectors.
2. Correctness of results: The topological mapping method guarantees no "false dismissals", although the result might contain a few "false alarms".

Definition 1: false dismissals are points that satisfy the user query but are not returned in the result of the query.
Definition 2: false alarms are points that do not satisfy the user query but are returned in the result of the query.

The R-tree clusters the mapped points based on their distances in the low-dimensional space. When a query is issued, the cluster (a small subset) of points that is most similar to the query is retrieved. The retrieved cluster contains all the qualifying points; however, it might contain a few false alarms. Thus, a two-step refinement process is performed to remove those false alarms. The first refinement step quickly removes points whose minimum distance from the query is greater than 0.5. The second refinement step uses a complete distance function to remove the remaining false alarms by sequentially matching the query points with the points remaining after the first refinement step.

To show the applicability of the proposed method to high-dimensional databases, we use a database of video shots in this paper, although the topological mapping method could be applied to many other high-dimensional databases. One of the characteristics of video data is its high dimensionality. For example, in [1], the RGB color space is quantized into 64 equally spaced colors; thus the color feature has 64 dimensions.

Manuscript received November 29, 2001. Manuscript revised September 9, 2002.
† The authors are with the Graduate School of Information Science and Electrical Engineering, Department of Intelligent Systems, Kyushu University, Fukuoka-shi, 812–8581 Japan.
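The filter-and-refine idea described above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: `low_d` stands for any lower-bounding distance in the reduced space (here, a simple 1-D projection of 2-D points), and all names are hypothetical.

```python
# Illustrative filter-and-refine range search. Because low_d(p, q) is a lower
# bound on the true distance true_d(p, q), filtering with it may admit false
# alarms but can never cause false dismissals.

def range_search(points, query, eps, low_d, true_d):
    """Return all points whose true distance to `query` is <= eps."""
    # Filter step: cheap lower-bound test in the reduced space.
    candidates = [p for p in points if low_d(p, query) <= eps]
    # Refine step: the exact distance removes the false alarms.
    return [p for p in candidates if true_d(p, query) <= eps]

# Toy example: true distance in 2-D; lower bound = distance on the x-axis only.
pts = [(0.0, 0.0), (0.5, 0.0), (0.5, 2.0), (3.0, 0.0)]
true_d = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
low_d = lambda p, q: abs(p[0] - q[0])   # contractive projection
print(range_search(pts, (0.0, 0.0), 1.0, low_d, true_d))
# (0.5, 2.0) survives the filter but is removed as a false alarm in refinement
```

The point (0.5, 2.0) is a false alarm: its projection is within range, but its true distance is not; the refine step removes it, and nothing within range is ever lost.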
In this paper, we show that the topological mapping method can index and retrieve shots efficiently and effectively based on the colors of salient objects that exist in the shot. The idea is to extract keyframes that best represent the content of a shot and then from these

keyframes salient objects are segmented. The colors of each object are represented by a color histogram. Using the topological mapping method, each color histogram is mapped into a point in a low-dimensional distance space. Then, these mapped points are indexed by an R-tree to allow for their efficient retrieval. Although our method indexes and retrieves shots based on the color feature of objects, it can easily be extended to include other features (texture, motion, etc.). The conducted experiments show that our method is efficient and effective compared to both the sequential scanning method and the bounding method of QBIC [7]. The rest of this paper is organized as follows: in Section 2, we provide a survey of related work. Section 3 discusses the preprocessing steps that extract the color histograms of the objects that exist in a shot. Section 4 discusses shot indexing and provides a solution to the dimensionality curse and correctness-of-results problems. In Section 5, we explain how to formulate queries and present the two-step refinement process. Section 6 discusses the conducted experiments. Finally, we conclude this paper in Section 7.

2. Related Work

Most of the work on indexing and retrieving video data has concentrated on extracting features of shots, representing the objects that exist in the shots by the extracted features, and retrieving those shots. However, the problems of the dimensionality curse and the correctness of results for video data have remained largely uninvestigated. Below, we introduce some works that tackled these two problems. In [11], MPEG-encoded video frames are mapped into a trail of points in a low-dimensional distance space using the FastMap method [4], where each frame is mapped into a single point. This work is intended to analyze transitions between consecutive shots and to cluster the frames of shots in order to visualize the underlying structure of a video. However, representing a frame by one point underestimates the content of the frame, which may contain several objects. Therefore, with this method,


effective content-based retrieval is questionable. In [14], a camera-tracking technique is proposed that detects the type of camera operation by utilizing only two values. However, indexing a shot by only two values underestimates the content of a shot, which may contain several objects; still, this method could be used as a rough filtering step in the retrieval process. The QBIC system [7] of IBM uses a bounding method to approximate the distance of a color histogram by computing the average color components of the RGB color space. These average colors are pre-computed and stored in the database. Then, an image, or an object, is represented by a 3-dimensional value (Ravg, Gavg, Bavg). Briefly, the high-dimensional color feature is bounded and then inserted into a 3-dimensional R*-tree. The disadvantage of this approach is that, for each object or image, a color histogram as well as the average color components must be computed; thus, two different distance functions are needed to find the distances between objects in the two spaces (high-dimensional and low-dimensional). Furthermore, the average colors must be pre-computed and stored in the database, which adds storage overhead. Our proposed method is similar in spirit to the method in the QBIC system; however, we use a different mapping mechanism, called topological mapping, to reduce dimensionality. The conducted comparative experiments between our method and the method in the QBIC system show that our method leads to a better filtering of the data and thus a faster search time.

3. Shot Structure

Our video data are shots collected from the Internet. Therefore, in this paper, we do not address the problem of shot segmentation. The shot structure consists of a shot layer, an event layer, a keyframe layer, and an object layer.
Definition 3: a shot, S, is a consecutive sequence of frames that constitute one camera operation.
To generate the shot structure, a shot undergoes several preprocessing steps (event detection, keyframe selection, and object segmentation). We will use the soccer shot example (see Fig. 1) to explain these concepts.

Fig. 1  A soccer shot example that contains 4 events and 8 keyframes.

This shot contains 5 objects: O1 is the red-shirt player (the player at the bottom of 'frame 1'), O2 is the white-shirt player (the player at the top of 'frame 1'), O3 is the ball, O4 is the goal keeper (the player at the right of 'frame 65'), and O5 is the goal post.
Definition 4: an event, Ei, is a subsequence of consecutive frames that express a particular activity and contain a fixed number of semantically meaningful objects.
(1) Event Detection: Each shot constitutes one or more events (E1, E2, ..., Eh). The start and end of an event are detected by the appearance of a new object in the scene or the disappearance of an existing object from the scene. For example, the soccer shot in Fig. 1 contains 4 events: E2 is detected by the disappearance of O1, E3 is detected by the disappearance of O2 and the appearance of O4, and E4 is detected by the disappearance of O4 and the appearance of O5. The main goal of dividing a shot into events is to identify the frames that are most similar to each other. Then, by extracting keyframes from each event, those keyframes will best represent the content of the shot.
(2) Keyframe Selection: Due to the close similarity between the frames of an event Ei, at least two keyframes (the first and last frames of the event) are extracted to represent the event; if the event is longer than two seconds (60 frames), one keyframe per second is extracted. For example, in the soccer shot (see Fig. 1), since the length of each event is less than 60 frames, each event is represented by two keyframes (the first and last frames of the event).
(3) Object Segmentation: From each keyframe, a set of semantically meaningful objects (O1, O2, ..., Oq) is extracted. For example, in the soccer shot (see Fig. 1), the numbers of extracted objects, which are bounded by minimum bounding rectangles (MBRs), are as shown in Table 1.
Definition 5: an object Oi is a semantically meaningful physical entity within a frame.
There are many methods, e.g.
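The keyframe-selection rule above can be sketched as follows. This is a hedged sketch under the paper's stated convention that two seconds equals 60 frames (i.e., an assumed 30 fps); the function name and signature are illustrative.

```python
# Sketch of the keyframe-selection rule: every event contributes its first and
# last frames; events longer than 60 frames (two seconds at an assumed 30 fps)
# contribute roughly one keyframe per second instead.

def select_keyframes(event_start, event_end, fps=30):
    """Return the frame numbers chosen as keyframes for one event."""
    length = event_end - event_start + 1
    if length <= 2 * fps:                      # short event: first and last
        return [event_start, event_end]
    keyframes = list(range(event_start, event_end, fps))  # one per second
    if keyframes[-1] != event_end:
        keyframes.append(event_end)            # always keep the last frame
    return keyframes

print(select_keyframes(1, 58))    # event E1 of the soccer shot -> [1, 58]
print(select_keyframes(1, 103))   # a longer event -> one keyframe per second
```

For event E1 of the soccer shot (frames 1 to 58) this yields the two keyframes 1 and 58, matching Table 1.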
[3][6], for segmenting objects from videos coded by non-object-based encoders, such as MPEG-1 or MPEG-2. The system in [3] segments image regions based on a fusion of color, edge, and motion information, while the system in [6] uses a contour-based method to segment objects. In this paper, we assume that shots are coded by an object-based video encoder such as MPEG-4. In MPEG-4, objects with arbitrary shapes are encoded apart from their background; therefore, segmentation information of objects is provided in the video input stream.

Table 1  Number of extracted objects from each keyframe of a soccer shot.

  Frame Number | 1 | 58 | 59 | 64 | 65 | 73 | 74 | 103
  # Objects    | 3 |  3 |  2 |  2 |  2 |  2 |  2 |   2

Color Histogram Extraction: For each object, a color histogram is computed. In this system, the RGB color space is quantized into 64 equally spaced colors, where each color consists of 3 components: red (r), green (g), and blue (b). Thus, the color histogram C of object Oi at keyframe κj is defined by:

C = {(r1, g1, b1, p1), (r2, g2, b2, p2), ..., (rl, gl, bl, pl), κj}    (1)

where pi is the percentage of color ci, which is represented by the color components ri, gi, and bi. We set l = 10, the maximum number of major colors in the color histogram of an object. By examining the characteristics of the color histograms that represent the objects in our experiments, we found that the color feature of an object is largely defined by the 10 major colors of its color histogram; the values of the remaining colors are usually very small and thus not important.
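Keeping only the l major colors amounts to selecting the l bins with the largest percentages. The sketch below illustrates this with a toy histogram, not the paper's 64-bin data; the entry layout follows Equation 1.

```python
# Sketch: keep only the l major colors of a quantized color histogram.
# Each entry is (r, g, b, percentage), with percentages assumed to sum to 1.

def major_colors(histogram, l=10):
    """Return the l entries with the largest percentage, sorted descending."""
    return sorted(histogram, key=lambda e: e[3], reverse=True)[:l]

# Toy histogram with 3 dominant colors and two small noise bins.
hist = [(255, 0, 0, 0.50), (0, 255, 0, 0.30), (0, 0, 255, 0.15),
        (64, 64, 64, 0.03), (128, 128, 128, 0.02)]
top = major_colors(hist, l=3)
print([c[:3] for c in top])   # the three dominant colors survive
```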

4. Shot Indexing

The goal of this paper is to index the color features (color histograms) of the objects that exist in a shot by a SAM (i.e., an R-tree) to allow for efficient search of shots. However, color histograms are highly dimensional and thus cannot be efficiently indexed by an R-tree, or any SAM [4]. According to [4][12][17], an R-tree and its variants are very efficient when used to index low-dimensional data. Therefore, the apparent solution is to reduce the dimensionality of the color histograms and then insert them into an R-tree. Dimensionality reduction is achieved by mapping the color histograms into points in a low-dimensional distance space using the topological mapping method (explained in Section 4.2). Then, to cluster these points, they are inserted into an R-tree. The topological mapping method not only maps color histograms into points in a low-dimensional distance space but also guarantees the correctness of the retrieval results.

4.1 Distance Function

Many image/video search engines use the Euclidean distance function to determine the dissimilarity (distance) between two vectors, say q and c. However, the Euclidean distance assumes that each dimension of the Euclidean space is independent of all other dimensions; thus, correlations among features cannot be represented. The squared Euclidean distance of two N-vectors q and c is defined as:

D^2_euclid(q, c) = (q − c) · (q − c)^T = Σ_{i=1}^{N} (q_i − c_i)^2    (2)

AGHBARI et al.: DIMENSIONALITY REDUCTION METHOD

1035

Euclidean distance functions are good for query-by-example types of queries; however, if a user is to specify the color(s) of a wanted object, it is very difficult for the user to remember the exact shade of those colors. Thus, if the user specifies a different shade of a color, e.g. red instead of light red, the wanted object will not be retrieved. To allow a margin of error in the user's specification of color(s), in this paper we use the generalized quadratic form distance function, which represents the correlations (similarities) among features (dimensions). The generalized quadratic form distance function more faithfully represents user intentions [10]; as a result, search results have high precision. The quadratic form distance (Equation 3) of two N-vectors q and c is based on an N × N similarity matrix A = [a_ij], where a_ij denotes the similarity between components i and j of the vectors. To reflect the human perception of similarity between colors, we set the values of A based on our personal perception of the similarity between colors; for example, we set the similarity between red and orange to 0.8 (80% similar). Since there are 64 quantized colors, we manually set the values of A (64 × 64). These values were further refined during the experimentation with the system.

D^2(q, c) = (q − c) · A · (q − c)^T = Σ_{i=1}^{N} Σ_{j=1}^{N} a_ij (q_i − c_i)(q_j − c_j)    (3)
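Equation 3 can be computed directly from its double sum. The sketch below uses a toy 3 × 3 similarity matrix, not the paper's hand-tuned 64 × 64 matrix, and checks that the quadratic form reduces to the squared Euclidean distance of Equation 2 when A is the identity.

```python
# Sketch of the (squared) quadratic form distance of Equation 3.
# A is a toy 3x3 similarity matrix where components 0 and 1 are 80% similar.

def quad_dist2(q, c, A):
    """Squared quadratic form distance (q - c) . A . (q - c)^T."""
    d = [qi - ci for qi, ci in zip(q, c)]
    return sum(A[i][j] * d[i] * d[j]
               for i in range(len(d)) for j in range(len(d)))

q = [0.6, 0.4, 0.0]
c = [0.5, 0.0, 0.5]
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                     # identity matrix
A = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]   # correlated bins

euclid2 = sum((qi - ci) ** 2 for qi, ci in zip(q, c))
assert abs(quad_dist2(q, c, I) - euclid2) < 1e-12   # A = I gives Equation 2
print(quad_dist2(q, c, A))
```

With the correlated matrix A the distance changes, because mismatches in similar components (here bins 0 and 1) interact through the cross terms a_ij.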

Fig. 2  (a) Euclidean space, (b) weighted Euclidean space, and (c) quadratic form distance space.

The definition of the generalized (squared) quadratic form distance function also includes the (squared) Euclidean distance, when A is equal to the identity matrix, and the (squared) weighted Euclidean distance, when A is equal to a diagonal matrix A = diag(w1, ..., wN), where wi is the weight of dimension i. As shown in Fig. 2, the isosurface of a query range in Euclidean space is a circle; in weighted Euclidean space, it is an ellipse whose major axis is aligned with a coordinate axis; and in quadratic form distance space, it is an ellipse that is not necessarily aligned with the coordinate axes. The goal of our method is to map the color histograms (points in a high-dimensional color space) of objects into points in a low-dimensional distance space and, at the same time, guarantee the correctness of the query results.

4.2 Topological Mapping Method

The topological mapping method maps a high-dimensional feature vector into a point in a 3-dimensional distance space. These 3 dimensions need to be orthogonal so that the mapped points will be as sparse as possible in the 3-dimensional distance space, which leads to a better filtering of the data and thus a faster search time. In the RGB color space, 'red', 'green', and 'blue' are orthogonal colors (dimensions). Thus, the 3-dimensional distance space (or topological space) of a color histogram constitutes the 'red' (Rd), 'green' (Gd), and 'blue' (Bd) dimensions. As shown in Fig. 3, mapping a color histogram into the RdGdBd distance space is achieved by first computing the distance D(C, R) between the color histogram C and a virtual red object R, where R is at the origin of the 'red' topological axis since we assume R to be 100% red. Then, a point representing the color histogram is placed on the 'red' topological axis based on the computed distance D(C, R). The algorithm for computing the distance between C and R is explained in Section 5.2, and the distance D(C, R) is computed using Equation 3.
Similarly, the distance D(C, G) between C and a virtual green object G (100% green) is computed, and also the distance D(C, B) between C and a virtual blue object B (100% blue) is computed. Accordingly, the

Fig. 3  Mapping a color histogram into a point in the 3-dimensional topological space ('red', 'green', and 'blue').


D(C, G) and D(C, B) points are placed on the 'green' topological axis and the 'blue' topological axis, respectively. As shown in Fig. 3, these 3 computed distances between the high-dimensional color histogram and the 3 virtual objects represent a new feature vector of C in a 3-dimensional distance space. We have chosen to map the color histograms of objects into a 3-dimensional (RdGdBd) topological space for the following reasons: (i) R-trees are more efficient when they are used to cluster 3-dimensional data, (ii) 3 dimensions or fewer are easier to visualize, and (iii) the R, G, and B colors are orthogonal; therefore, the mapped points will be sparse, which leads to a lower query cost. Although we simply chose the RGB color space to compute the color feature of an object, any other color space, e.g. HSV, CMY, etc., could be used. Our emphasis in this paper is the proposed dimensionality reduction method, which would work with any color space, or any other feature (motion, texture, etc.), provided that we can find 3 orthogonal components in the selected feature space. The condition of orthogonality simply makes the mapped points as sparse as possible to improve the filtering of the data when a query is issued and compared with the points in the R-tree; as a consequence, we achieve a faster search time. Our choice of the RGB color space is only to show the applicability of the topological mapping method. After mapping the color histograms of objects into points in the 3-dimensional distance space, these mapped points are inserted into a 3-dimensional R-tree. Basically, the R-tree clusters these points based on their computed distances in the RdGdBd topological space. The topological mapping method guarantees the correctness of the result of queries; that is, the result contains no "false dismissals", although it might contain a few "false alarms". To prove the correctness of the results, we need to prove that the mapping is contractive, that is, D_64(q, c) ≥ D_RGB(q', c'), where D_64(q, c) is the distance between two color histograms q and c in the 64-dimensional color space, and D_RGB(q', c') is the distance between their mapped points q' and c' in the 3-dimensional distance space.

Lemma 1: The correctness of the result of a range query is achieved if D_64(q, c) ≥ D_RGB(q', c').

Proof: We provide a proof for the 2-dimensional case; a general proof then follows smoothly. As shown in Fig. 4(a), we want to prove that the distance between q and c in a 2-dimensional space is greater than, or equal to, the distance between their projections onto a 1-dimensional space (the R-axis). Based on the topological mapping method, we formally state the problem as D_2(q, c) ≥ D_1(q', c'), which corresponds to:

(q − c) · A · (q − c)^T ≥ |(q − R) · A · (q − R)^T − (c − R) · A · (c − R)^T|    (4)

We require that A is positive definite (i.e., D^2(q, c) > 0); thus, A can always be decomposed into M^T M, as stated in [13]. That is, A is a real symmetric similarity matrix (i.e., D^2(q, c) = D^2(c, q)), which has two orthogonal eigenvectors (e1 and e2) with two eigenvalues (λ1 and λ2), respectively. These eigenvectors span R^2; thus (q − c) can be written as a linear combination of e1 and e2. Let (q − c) = r e1 cosθ + r e2 sinθ. Therefore,

(q − c) · A · (q − c)^T = (r e1 cosθ + r e2 sinθ) · A · (r e1 cosθ + r e2 sinθ)^T
    = (r λ1 e1 cosθ + r λ2 e2 sinθ) · (r e1^T cosθ + r e2^T sinθ)
    = r^2 λ1 cos^2θ + r^2 λ2 sin^2θ    (5)

Fig. 4  (a) Ellipsoid query D(q, c) in a 2-dimensional space; (b) for θ = 0, D(q, c) is equal to the absolute difference between the ellipsoid query D(q, R) and the ellipsoid query D(c, R); (c) for θ = π/2, D(q, c) is equal to the absolute difference between the ellipsoid query D(q, R) and the ellipsoid query D(c, R).

Using the derivation of Equation 5, we can translate Equation 4 into:

r^2 λ1 cos^2θ + r^2 λ2 sin^2θ ≥ |(rq^2 λq1 cos^2θ + rq^2 λq2 sin^2θ) − (rc^2 λc1 cos^2θ + rc^2 λc2 sin^2θ)|    (6)

where rq^2 λq1 cos^2θ + rq^2 λq2 sin^2θ is the squared distance between q and the virtual red object R, and rc^2 λc1 cos^2θ + rc^2 λc2 sin^2θ is the squared distance between c and R. It suffices to prove that Equation 6 holds when the object (in this case, c) lies on a principal axis of the ellipsoid query, that is, when θ = 0, π/2, ..., etc., since intermediate values of θ lead to values of Equation 6 that fall within the range of values resulting from θ = 0, π/2, ..., etc. When θ = 0 (see Fig. 4(b)), Equation 6 becomes:

r^2 λ2 ≥ |rq^2 λq2 − rc^2 λc2|    (7)

where r^2 λ2, rq^2 λq2, and rc^2 λc2 are the distances D^2(q, c), D^2(q, R), and D^2(c, R), respectively. Geometrically (see Fig. 4(b)), line qR represents D(q, R) and line cR represents D(c, R). Assume qR = hR; then ch represents |D(q, R) − D(c, R)|. We notice in the right triangle qhc that qc > ch. Thus, Equation 7 holds for θ = 0. When θ = π/2 (see Fig. 4(c)), Equation 6 becomes:

r^2 λ1 ≥ |rq^2 λq1 − rc^2 λc1|    (8)

where r^2 λ1, rq^2 λq1, and rc^2 λc1 are the distances D^2(q, c), D^2(q, R), and D^2(c, R), respectively. Geometrically (see Fig. 4(c)), line qR represents D(q, R) and line cR represents D(c, R). Assume qR = hR; then ch represents |D(q, R) − D(c, R)|. We notice in the right triangle qhc that qc > ch. Thus, Equation 8 holds for θ = π/2. When θ is between 0 and π/2, that is, when the position of point c lies between its two positions in Figs. 4(b) and 4(c), we can similarly prove in triangle cqR that D(q, c) ≥ |D(q, R) − D(c, R)|. Since Equations 7 and 8 have both been proven to hold, Equation 4 holds. Using the same derivation, the geometry of the 2-dimensional case can easily be extended to prove the 3-dimensional case D_3(q, c) ≥ D_2(q', c'). It follows that the distance between two points in some dimensional space is greater than, or equal to, the distance between the mapped positions of the two points in a lower-dimensional space. Thus, D_64(q, c) ≥ D_RGB(q', c'), and the proof is complete. Hence, the topological mapping method is a contractive mapping and thus guarantees the correctness of the result of a query.
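The mapping and its per-axis lower-bound property can be illustrated in a few lines. This is a hedged sketch: A is taken as the identity for simplicity (the paper uses a hand-tuned 64 × 64 similarity matrix), only N = 4 toy color bins are used, and all names are illustrative.

```python
# Sketch of the topological mapping: a histogram over N quantized colors is
# mapped to the 3-D point (D(C,R), D(C,G), D(C,B)), its quadratic form
# distances to virtual 100%-red, 100%-green, and 100%-blue objects.
import random

def qdist(q, c, A):
    d = [qi - ci for qi, ci in zip(q, c)]
    return sum(A[i][j] * d[i] * d[j]
               for i in range(len(d)) for j in range(len(d))) ** 0.5

N = 4                                   # toy bins: red, green, blue, gray
A = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
R = [1.0, 0.0, 0.0, 0.0]                # virtual red object
G = [0.0, 1.0, 0.0, 0.0]                # virtual green object
B = [0.0, 0.0, 1.0, 0.0]                # virtual blue object

def topo_map(hist):
    return (qdist(hist, R, A), qdist(hist, G, A), qdist(hist, B, A))

# Empirical check of the per-axis lower bound established by Equations 7-8:
# |D(q, V) - D(c, V)| <= D(q, c) for each virtual object V.
random.seed(0)
for _ in range(100):
    q = [random.random() for _ in range(N)]
    c = [random.random() for _ in range(N)]
    mq, mc = topo_map(q), topo_map(c)
    for axis in range(3):
        assert abs(mq[axis] - mc[axis]) <= qdist(q, c, A) + 1e-12
print("per-axis lower bound held on 100 random pairs")
```

A pure-red histogram maps to the origin of the red axis (distance 0 to R), as the text describes.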

5. Shot Retrieval

Generally, exact-match queries are impractical because it is difficult for users to remember the exact features of objects. Specifically, in video databases, when formulating a query that investigates the colors of the objects that exist in the wanted shot, it is difficult for users to know the exact shades of the colors of the objects and the exact number of colors of these objects. Therefore, in this method, we support similarity retrieval. To support similarity retrieval, we assigned metric values to the degree of perceptual similarity between the different quantized colors. As we mentioned in Section 3, the quantized RGB color space, which is used to express the color(s) of an object, consists of 64 colors; therefore, we created a 64 × 64 matrix (A = [a_ij]). These values are based on our perception of the similarity between colors, but were further refined during the experimentation with the system.

Shot Representation: Each shot Si in the database is represented by several keyframes κ1, κ2, ..., κNk, where Nk is the number of keyframes extracted from Si. Each of these keyframes contains at least one object; however, the number of objects No(κj) may differ from keyframe to keyframe. Thus, each shot is represented by No(Si) points in the low-dimensional distance space, where:

No(Si) = Σ_{j=1}^{Nk} No(κj)    (9)
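Equation 9 is a simple sum over keyframes; a one-line check using the per-keyframe object counts of the soccer shot (Table 1):

```python
# Equation 9 for the soccer shot: the number of index points contributed by a
# shot is the total object count over its keyframes (counts from Table 1).
objects_per_keyframe = [3, 3, 2, 2, 2, 2, 2, 2]
no_si = sum(objects_per_keyframe)
print(no_si)   # No(Si) = 18 points
```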

For example, No(Si) for the soccer shot (see Fig. 1) is equal to the sum of the salient objects in each of its eight keyframes (No(Si) = 3 + 3 + 2 + 2 + 2 + 2 + 2 + 2 = 18 points). Each of these points represents the color histogram of an object at keyframe κj.

Query Interface: Figure 6 shows our query interface (a color palette, which is popular in visual databases for specifying the colors of objects), which allows users to formulate the color histograms of the objects that exist in the wanted shot(s). Users pick the colors of an object and specify the percentage of each color with reference to the area of the object. Users can also visualize the specified color histograms of objects by clicking on the "DrawColorHists" button. Next, as shown in Fig. 5, feature vectors representing the color histograms of the specified objects are generated. Then, these feature vectors are mapped, using the topological mapping method, into points in the RdGdBd topological space in the same way as discussed in Section 4. Each specified color histogram in a query Q is represented by one query point qpi. Therefore, Q is represented by m points, which is equal to the number of specified color histograms in Q:

Q = qp1, qp2, ..., qpm    (10)

Fig. 5  Processing steps of a query.
Fig. 6  A query interface for formulating queries that investigate colors of objects in the wanted shot(s).

Each query point qpi is matched with the points in the R-tree (Fig. 5), and the cluster (a small subset of all points) that is most similar to qpi is retrieved. As a consequence, m clusters (Equation 10) are retrieved. These clusters are merged and a list of shots is generated. The list also contains, for each shot, the number of retrieved points Np that belong to that shot, where 1 ≤ Np ≤ m.

5.1 First-Step Refinement

The number of retrieved points Np is used to remove false alarms. The shots whose Np < m/2 are removed from the list because their minimum distance Dmin from Q, according to Equation 11, is greater than 0.5. The distance Dmin between each shot and Q is computed by the following equation:

Dmin = (m − Np) / m    (11)

Here, we assume that there are no false alarms among the points of a shot when computing Dmin; if there are some false alarms, then Dmin will only increase. The remaining retrieved shots (whose Np ≥ m/2) may still contain false alarms; therefore, a second refinement step is necessary.

5.2 Second-Step Refinement

We provide the sequential match algorithm (shown below) to match Q with each of the shots remaining after the first-step refinement. Since these remaining shots are only a very small subset of all the shots in the database, this step does not degrade performance much; however, it is necessary to remove the remaining false alarms.

Algorithm: sequential match
Input: Q and a list of shots
Output: ranked list of shots
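The first refinement step can be sketched as a simple threshold on Np. This is an illustrative sketch with hypothetical names; `shot_hits` stands for the merged cluster results described above.

```python
# Sketch of the first refinement step: shots for which fewer than m/2 of the
# m query points were retrieved are dropped, since their minimum distance
# Dmin = (m - Np)/m would exceed 0.5 (Equation 11).

def first_refinement(shot_hits, m):
    """shot_hits: dict shot_id -> Np (query points retrieved for that shot).
    Returns the surviving shots with their Dmin = (m - Np) / m."""
    return {sid: (m - np) / m
            for sid, np in shot_hits.items() if np >= m / 2}

hits = {"shot_a": 3, "shot_b": 1, "shot_c": 2}
print(first_refinement(hits, m=3))   # shot_b dropped: Np < m/2
```

With m = 3, shot_b (Np = 1) is removed because its Dmin would be 2/3 > 0.5; shot_a and shot_c survive with Dmin = 0 and 1/3.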

1.  for each shot S in the list of shots do
2.      sum = 0
3.      for each color hist. q(x), x = 1..#Objs in Q do
4.          for each color hist. c(y), y = 1..#Objs in S do
5.              compute Dy(q, c)    /* Equation 3 */
6.          find the smallest Dy(q, c) and store it in smallestDist[x]
7.          sum = sum + smallestDist[x]
8.      aveDist = sum / NumOfObjs in Q
9.      insert S into a ranked list ordered by aveDist
10. return the ranked list

The sequential match algorithm works as follows. In lines 3-5, the distance D(q, c) between the color histogram q of each object in Q and the color histogram c of every object that exists in the current shot S is computed according to Equation 3. In line 6, for each q, the smallest Dy(q, c) is found and stored in the smallestDist array. In line 7, the distances between all color histograms of objects in Q and their best-matching color histograms in the current shot S are summed. In line 8, the average aveDist of the distances computed in line 7 is found. This aveDist value becomes the distance between Q and the current shot S, and S is inserted into a ranked list (an ascending-order list) based on the value of aveDist in line 9. This process is repeated for all the remaining shots. Then, the keyframes of the first K shots at the top of the ranked list are displayed to the user for manual browsing. In our experiments, we set K = 12, which is small enough for manual browsing; however, this value can easily be changed according to the user's preference.
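The pseudocode above translates directly into Python. This is a sketch, not the paper's code: `dist` stands in for the quadratic form distance of Equation 3, and the toy 2-bin histograms are illustrative.

```python
# Straightforward rendering of the sequential match pseudocode: each query
# histogram is matched to its closest object in the shot, and shots are ranked
# by the average of these best-match distances (aveDist, ascending).

def sequential_match(query_hists, shots, dist):
    """shots: dict shot_id -> list of object histograms.
    Returns shot ids ranked by average best-match distance to the query."""
    ranked = []
    for sid, shot_hists in shots.items():
        total = 0.0
        for q in query_hists:
            total += min(dist(q, c) for c in shot_hists)  # best-matching object
        ranked.append((total / len(query_hists), sid))
    ranked.sort()                                          # ascending aveDist
    return [sid for _, sid in ranked]

dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
query = [[1.0, 0.0], [0.0, 1.0]]
shots = {"s1": [[0.9, 0.1], [0.1, 0.9]], "s2": [[0.5, 0.5]]}
print(sequential_match(query, shots, dist))   # s1 matches the query better
```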

6. Experiments

We conducted two types of experiments to measure the effectiveness and efficiency of our method: the first measures the precision and recall of the method, and the second measures how efficient our method is in locating a wanted shot. These experiments were performed on a collection of 190 shots categorized into sports (such as bike racing, car racing, skiing, soccer, and football), movie, and animation.

6.1 Effectiveness Evaluation

Supporting similarity retrieval in video databases implies that the returned results of a user query might contain false alarms and/or false dismissals (see Definitions 1 and 2). Generally, these two errors impose a requirement to measure the effectiveness of information retrieval systems; specifically, the precision and recall metrics are used. Precision is the ability of a system to reject false alarms, and recall is the ability of a system to retrieve all relevant shots. Formally:

Precision = N_RSR / N_TSR    (12)

Recall = N_RSR / N_TRS    (13)

where N_RSR is the number of relevant shots returned, N_TSR is the total number of shots returned, and N_TRS is the total number of shots in the database relevant to Q. To measure the precision and recall of our method, we issued 10 sample queries to search for 10 randomly selected shots. Then, we compared the returned list of shots of each sample query with its ground truth. The ground truth of each sample query was established manually by watching all the shots in the database and selecting the ones relevant to that query. For each of the 10 sample queries, precision and recall were calculated according to Eqs. (12) and (13). By varying the number of returned shots from 1 to 10, a curve was generated. The average precision and average recall curves over the 10 queries are shown in Fig. 7.

Fig. 7 Precision and recall curves averaged over the results of 10 sample queries.

We notice from Fig. 7 that the average precision is quite high when the number of returned shots is small, and that the curve starts to slope down as the number of returned shots increases. That is because the chance of having some false alarms, which were not detected and removed by the second filtering process, increases as the number of returned shots increases. The average recall curve shows a smooth increase of the recall measure as the number of returned shots increases.

6.2 Efficiency Evaluation

To evaluate the efficiency of our method, we compared it with a sequential scanning method in terms of the average response time to various types of queries, where the response time to a query is the elapsed time from the start of its execution until the receipt of the result (i.e., until the end of the second filtering process). To perform this experiment, we formulated queries that investigate the colors of objects that exist in the wanted shot; the number of objects in these queries is varied from 1 to 5. The response time of each query was then measured for both our method and the sequential scanning method. The average results over 5 sample queries, where the number of objects in each sample query is varied from 1 to 5, are shown in Fig. 8 (a). Clearly, as the number of objects in a query increases, the response time also increases; that is because the database of shots is scanned once per object in the sequential method, and the R-tree is accessed once per object in our method. We notice from Fig. 8 (a) that our method outperforms the sequential scanning method in terms of the response time to queries. Specifically, our method reduces the response time by about 70% compared to the sequential scanning method, which is a considerable saving in the search time for shots. In Fig. 8 (b), we compared the time required by the first and second refinement steps (using our method), measured from the start of execution of a query until the end of each refinement step.
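For concreteness, the precision and recall of Eqs. (12) and (13) can be computed directly from a returned list and its ground-truth set, as in this short sketch (the function name and example data are illustrative, not the paper's code):

```python
# Precision and recall as in Eqs. (12) and (13):
#   precision = N_RSR / N_TSR,  recall = N_RSR / N_TRS
def precision_recall(returned, relevant):
    """returned: ids of shots returned for a query;
    relevant: ground-truth ids of shots relevant to that query."""
    n_rsr = len(set(returned) & set(relevant))  # relevant shots returned
    precision = n_rsr / len(returned) if returned else 0.0
    recall = n_rsr / len(relevant) if relevant else 0.0
    return precision, recall

# e.g., 10 returned shots, 8 of which are among the 12 relevant shots
p, r = precision_recall(returned=list(range(10)), relevant=list(range(2, 14)))
print(p, r)  # -> 0.8 0.666...
```

Varying the number of returned shots from 1 to 10 and averaging these two values over the sample queries yields curves like those in Fig. 7.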
The average time is measured



Fig. 8 (a) Comparison between our method and a sequential scanning method in terms of the average response time; (b) comparison between the first-step and second-step refinements in terms of the average time required for each step.


Fig. 9 Topological mapping method vs. bounding method: (a) retrieved fraction of the DB from the R-tree (ratio before the refinement steps); (b) total CPU time.

versus the number of objects in a query. We notice that the second refinement step is the most time-consuming step; that is because it is basically a sequential matching process over the shots remaining after the first refinement step. However, the overall performance of the system (including both refinement steps) is still improved by our method as compared with the sequential scanning method, see Fig. 8 (a).

Comparative Study: We compared the topological mapping method against the bounding method (QBIC system [7]) in terms of: (i) the ratio of the database retrieved as candidate objects after comparing a query with the indexed data (fraction of the DB before the refinement steps); and (ii) the total CPU time. Figure 9 (a) shows that mapping the data using the topological mapping method leads to the retrieval of a smaller fraction of the database than the bounding method. That is because the data mapped by the topological mapping method are more sparse than those of the bounding method. In turn, this results in a

considerable savings, since the costly refinement steps, especially the second step (the linear computation of the complete D(Q, C), Eq. (3)), are applied to only a small fraction of the database. This is clearly shown in Fig. 9 (b), where the topological mapping method has a smaller total CPU time than the bounding method. The above results are the average over 10 queries.
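The response-time measurements above (elapsed time from the start of a query until the end of the second filtering step) can be sketched as a simple timing harness. The two query functions below are placeholders standing in for sequential scanning and the R-tree-based search, not the paper's implementation; the sleep durations are arbitrary.

```python
# Timing harness for comparing average query response times.
import time

def measure_response_time(query_fn, query, repeats=5):
    """Average elapsed wall-clock time of query_fn over several runs."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        query_fn(query)  # runs the full query, through the second filtering step
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

# Placeholder query functions (illustrative only).
def sequential_scan(query):
    time.sleep(0.01)    # stands in for scanning the whole shot database

def topological_search(query):
    time.sleep(0.003)   # stands in for R-tree search plus refinement steps

t_seq = measure_response_time(sequential_scan, query=None)
t_top = measure_response_time(topological_search, query=None)
print(f"saving: {100 * (1 - t_top / t_seq):.0f}%")
```

Averaging such measurements over the sample queries (with 1 to 5 objects each) gives curves like those in Fig. 8 (a).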

7. Conclusion

We presented a novel approach for efficient and effective indexing and retrieval of high-dimensional databases. The main contribution of this paper is the topological mapping method, which reduces the dimensionality of high-dimensional feature vectors to 3 dimensions so that the resulting low-dimensional feature vectors can be efficiently indexed by a SAM, such as an R-tree. In this paper, we used a database of video shots to show the applicability of the topological mapping method. Furthermore, we utilized the RGB color


space to represent the color features of objects. Although we used a database of video shots, the method could be applied to any high-dimensional database. Similarly, although we used the RGB color space, the topological mapping method can be used with any other color space, or any other visual feature, provided that we can find 3 orthogonal components (dimensions) in the selected feature space. The condition of orthogonality simply makes the mapped points (in the 3-dimensional distance space) as sparse as possible, which improves the filtering of data when a point query is issued and compared with the points in the R-tree, and consequently speeds up the search. The conducted experiments show that the proposed method is very effective, as indicated by the average values of recall and precision. Moreover, the experiments show that the topological mapping method outperforms both the sequential scanning method and the bounding method of QBIC in terms of search time.

References

[1] Z. Aghbari, K. Kaneko, and A. Makinouchi, "Towards semantical queries: Integrating visual and spatio-temporal video features," IEICE Trans. Inf. & Syst., vol.E83-D, no.12, pp.2075–2087, Dec. 2000.
[2] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An efficient and robust access method for points and rectangles," ACM SIGMOD, pp.322–331, May 1990.
[3] S.F. Chang, W. Chen, H.J. Meng, H. Sundaram, and D. Zhong, "VideoQ: An automated content based video search system using visual cues," ACM Multimedia Conf., Seattle, WA, Nov. 1997.
[4] C. Faloutsos, Searching Multimedia Databases by Content, Kluwer Academic Publishers, Boston, USA, 1996.
[5] C. Faloutsos and K. Lin, "A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets," ACM SIGMOD, pp.163–174, May 1995.
[6] A.M. Ferman, A.M. Tekalp, and R. Mehrotra, "Effective content representation for video," IEEE Int'l Conf. on Image Processing, pp.521–525, Chicago, IL, Oct. 1998.
[7] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, "Query by image and video content: The QBIC system," IEEE Computer, Sept. 1995.
[8] B. Furht, S.W. Smoliar, and H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, Norwell, MA, USA, second printing, 1996.
[9] A. Guttman, "R-trees: A dynamic index structure for spatial searching," ACM SIGMOD, pp.47–57, June 1984.
[10] Y. Ishikawa, R. Subramanya, and C. Faloutsos, "MindReader: Querying databases through multiple examples," VLDB, New York, 1998.
[11] V. Kobla, D.S. Doermann, and C. Faloutsos, "Developing high-level representations of video clips using videotrails," Proc. SPIE Conf. on Storage and Retrieval for Image and Video Databases VI, pp.81–92, Jan. 1998.
[12] P. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas, "Fast and effective retrieval of medical tumor shapes," IEEE Trans. Knowl. Data Eng., vol.10, no.6, pp.889–904, Nov./Dec. 1998.
[13] R. Kurniawati, J.S. Jin, and J.A. Shepherd, "An efficient nearest-neighbor search while varying Euclidean metrics,"

ACM Multimedia, 1998.
[14] J.H. Oh and K.A. Hua, "Efficient and cost-effective techniques for browsing and indexing large video databases," ACM SIGMOD, Dallas, TX, USA, 2000.
[15] "Overview of the MPEG-4 standard," MPEG-4 subgroups: requirements, audio, delivery, SYNC, systems, video, ISO/MPEG, March 1999.
[16] T. Sellis, N. Roussopoulos, and C. Faloutsos, "The R+-tree: A dynamic index for multi-dimensional objects," 13th VLDB Conf., pp.507–518, England, Sept. 1987.
[17] B. Yi, H.V. Jagadish, and C. Faloutsos, "Efficient retrieval of similar time sequences under time warping," ICDE'98, pp.201–208, Orlando, FL, Feb. 1998.

Zaher Aghbari received the B.Sc. degree in Computer Engineering from the Florida Institute of Technology, USA, in 1987, and the M.Sc. and Ph.D. degrees in Computer Science from Kyushu University, Japan, in 1998 and 2001, respectively. Since 2001, he has been a research associate at the Graduate School of Information Science and Electrical Engineering, Kyushu University. His research interests include multimedia databases, image/video semantic representation and classification, multi-dimensional indexing and searching, and e-Learning. He is a member of IEEE and IPSJ.

Kunihiko Kaneko received his Ph.D. degree from Kyushu University in 1995. Since 1996, he has been with the Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan, where he is an associate professor. His research interests include object databases, multimedia databases, and bioinformatics databases. He is a member of IPSJ and IEEE.

Akifumi Makinouchi received his B.E. degree from Kyoto University, Japan, in 1967, the Docteur-Ingénieur degree from the Université de Grenoble, France, in 1970, and the D.E. degree from Kyoto University, Japan, in 1985. Since 1989, he has been with the Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan, where he is a professor. His research interests include high-performance database systems, spatial databases, and high-dimensional indexes and their applications. He is a member of IPSJ, ACM, and IEEE.