Fast Visual Retrieval Using Accelerated Sequence Matching


Mei-Chen Yeh, Member, IEEE, and Kwang-Ting Cheng, Fellow, IEEE

Abstract—We present an approach to represent, match, and index various types of visual data, with the primary goal of enabling effective and computationally efficient searches. In this approach, an image/video is represented by an ordered list of feature descriptors. Similarities between such representations are then measured by the approximate string matching technique. This approach unifies visual appearance and ordering information in a holistic manner, with joint consideration of visual-order consistency between the query and the reference instances, and can be used to automatically identify local alignments between two pieces of visual data. This capability is essential for tasks such as video copy detection, where only small portions of the query and the reference videos are similar. To deal with large volumes of data, we further show that this approach can be significantly accelerated with a dedicated indexing structure. Extensive experiments on various visual retrieval and classification tasks demonstrate the superior performance of the proposed techniques compared to existing solutions.

Index Terms—Similarity measure, string matching, video retrieval, image classification.

Manuscript received February 8, 2010. This work was supported in part by the National Science Council of the Republic of China, under Grant NSC 99-2218-E-003. Mei-Chen Yeh is with the Computer Science and Information Engineering Department, National Taiwan Normal University, Taipei, Taiwan (Phone: 886-2-7734-6694; Fax: 886-2-2932-2378; E-mail: [email protected]). Kwang-Ting Cheng is with the Electrical and Computer Engineering Department, University of California, Santa Barbara, CA 93106 USA (Phone: 1-805-893-7294; Fax: 805-893-3262; E-mail: [email protected]).

I. INTRODUCTION

With the digital image/video production and distribution industries continuing to grow, multimedia data are available everywhere in our daily lives. For example, consumers can now easily build a sizable personal photo collection, as low-cost cameras and storage have become increasingly affordable. Furthermore, the popularity of social media has boosted the number of photos/videos shared via the Internet [32][46]. It is essential to develop techniques that enable users to easily access and organize large volumes of visual data.

The fundamental problems in visual retrieval and organization are the design of good data representations and the definition of a quantitative metric that efficiently measures the similarities between pairs of visual data. A good representation is sensitive to data that represent different concepts and invariant to data that are perceptually alike.

A good similarity measure appropriately quantifies the similarity and is robust to imperfect features, such as the presence of noise. Both tasks are challenging because of the well-known sensory and semantic gaps [38][10].

With the recent development of robust local features [3][28][13][34], representations that consist of parts described by local descriptors have demonstrated impressive performance in various domains. For example, the bag-of-features method [9], which represents visual data as histograms over so-called visual words, has shown promising results for recognizing object, natural scene, and texture categories [48]. Another successful example is the use of shape context descriptors for contour matching [3][4]. In these methods, the set of multi-dimensional features is unordered and does not have a fixed cardinality (i.e., the number of features is not fixed). For such representations, partial matching, which optimally determines corresponding points of two feature sets, is widely used to measure similarity [35][3][15][16][4]. However, the order of parts carries useful information, as it preserves some of the spatial layout of the features [24]. In fact, many multimedia objects might be better represented by a sequence of features: a contour image can be described by ordered local descriptors gathered along the contour, and a video can be naturally represented as a sequence of features in which each feature represents a frame.

In our previous work, we showed that the descriptive ability of representations based on sets of features can be improved if their order is considered [42]. Moreover, a metric that takes into account both the similarities and dissimilarities between feature correspondences achieves better performance than one that considers similarities only [42]. The idea is to use approximate string matching—a method for comparing two strings of symbols that can tolerate errors in the strings under comparison—to compare two pieces of visual data that are represented by ordered features. The key characteristic of this method is that if one of the two sequences is likely an erroneous variant of the other, the distance between them is small. Each type of error can be assigned a different cost, making the measure flexible and application-dependent.

The derived similarity measure between visual data is an important component of many visual retrieval and recognition tasks. For example, the goal of query-by-example search is to use the visual content of a query to identify the most relevant instances within a database, under some defined notion of similarity.


The approximate sequence matching technique can serve as a measurement that determines how examples are situated relative to one another in some feature or semantic space. Query by keywords is an alternative paradigm for visual retrieval; in such a scenario, visual data are analyzed and processed to generate textual tags. The approximate sequence matching technique can also serve as a conduit to machine learning methods such as Support Vector Machines (SVMs) [11] for visual classification.

The preliminary version of this work [42] focused on the formulation of matching two ordered sets of features, along with an approach for solving the problem. For very large databases, where a naïve linear scan of all items is infeasible, we showed how approximate-sequence-matching-based approaches can be significantly accelerated with the construction and use of a dedicated indexing structure [43]. The indexing structure not only provides a rich vocabulary for representing visual instances, but can also be used to eliminate unnecessary comparisons between dissimilar descriptors. Unlike the approach in [43], where the visual vocabulary was used to enable efficient computation of a heuristic method for deriving sequence similarities, here we apply the vocabulary to construct a bioinformatics-inspired approach and transform the matching problem between two visual instances into a longest simple path problem on a sparse, directed acyclic graph. This new acceleration approach unifies visual appearance and ordering information in a holistic manner, and can be used to accelerate any sequence-matching-based technique.

We evaluate the proposed method in four applications: video copy detection, video retrieval, shape matching, and scene recognition. We empirically demonstrate significant computational improvement without sacrificing accuracy. The results differ from those reported in [42][43] because the acceleration approach is new; moreover, additional experiments were conducted to validate the use of the technique for visual retrieval and classification. The large computational savings are crucial for designing a practical, large-scale system.

In the remainder of the paper, we first review the formulation of the representation and the matching method in Section II. Section III describes the acceleration method for efficient computation. Section IV presents three case studies, and, finally, we conclude the paper in Section V.

II. APPROACH

A. Formulation

We begin by describing the formulation of matching two ordered sets of features, along with an approach for solving the similarity problem. As long as the feature descriptors are arranged as an ordered set, the formulation is independent of the specifics of the features; it can therefore be used in different applications with different feature descriptors. We demonstrate its detailed usage in several applications in the experimental section.

Let X = [x_1, x_2, …, x_m] and Y = [y_1, y_2, …, y_n] denote two sequences of feature vectors of sizes m and n, respectively. The distance between X and Y is defined as the minimal cost of a sequence of operations that transforms X into Y:

$$d(X, Y) = \min \sum_i c_i,$$

where c_i is the cost of an operation, denoted in the form δ(·, ·), which transforms a feature of X so that the resulting feature sequence is closer to Y; the total cost is the sum of the costs of all operations that complete the transformation from X to Y. In this paper we use, in particular, the Levenshtein distance, in which the set of operations is restricted to (1) insertion δ(ε, a), i.e., inserting a feature vector a into X, (2) deletion δ(a, ε), i.e., deleting a feature vector a from X, and (3) substitution δ(a, b), i.e., substituting a in X with b. Fig. 1 shows a simple example of matching two sequences that consist of three feature types.

The cost of each operation is application-dependent; however, a logical solution is to integrate ground distances into this framework. The ground distance function returns a non-negative real number that defines the dissimilarity between two feature vectors. For example, one may define the substitution cost as the ground distance between the two features and assign a constant cost to the deletion and insertion operations.

The Levenshtein distance can be computed by dynamic programming with the recurrence

$$D(i, j) = \min\{\, D(i-1, j-1) + \delta(x_i, y_j),\; D(i-1, j) + \delta(x_i, \varepsilon),\; D(i, j-1) + \delta(\varepsilon, y_j) \,\},$$

where D(i, j) is the minimum cost of the operations needed to match x_1, …, x_i to y_1, …, y_j, and D(0, 0) = 0. We return D(m, n) as the Levenshtein distance. Fig. 1(b) shows the graph for an example with m = 5 and n = 6. The algorithm finds the minimum-cost path from node s to node d. In the graph, vertical and horizontal edges are assigned the insertion and deletion costs, while diagonal edges are assigned the substitution cost. Note that an insertion in one feature sequence is equivalent to a deletion in the other. Moreover, multiple paths might achieve the minimal cost. The computational complexity is O(mn), and the space required is O(min(m, n)), since we can choose either row-wise or column-wise processing based on the sequence sizes. The sequence of operations performed to transform X into Y can easily be recovered from the matrix, but doing so requires storing the entire matrix.
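To make the recurrence concrete, here is a minimal Python sketch of this dynamic program (our illustration, not the paper's MATLAB implementation): `ground_dist` is a caller-supplied ground distance, insertions and deletions are charged a constant `indel_cost`, and only two rows of the cost matrix are kept, which gives the O(min(m, n)) space bound mentioned above.

```python
def sequence_edit_distance(X, Y, ground_dist, indel_cost=0.5):
    """Levenshtein-style distance between two sequences of feature
    vectors, keeping only two rows of the cost matrix D at a time."""
    if len(X) < len(Y):          # keep the shorter sequence on the column axis
        X, Y = Y, X
    m, n = len(X), len(Y)
    prev = [j * indel_cost for j in range(n + 1)]   # row D(0, .)
    for i in range(1, m + 1):
        curr = [i * indel_cost] + [0.0] * n          # D(i, 0)
        for j in range(1, n + 1):
            curr[j] = min(
                prev[j - 1] + ground_dist(X[i - 1], Y[j - 1]),  # substitution
                prev[j] + indel_cost,                           # delete x_i
                curr[j - 1] + indel_cost,                       # insert y_j
            )
        prev = curr
    return prev[n]
```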


Fig. 1. A simple example of approximate string matching and its underlying graph structure. There are three types of feature vectors, indicated by stars, diamonds, and triangles. The best match is obtained by deleting the 2nd feature in the top sequence and the 4th and 6th features in the bottom sequence. Equivalently, the matching result is derived by finding a shortest path on the graph. Each node records the total cost D(i, j) for comparing two feature subsets, and each edge represents the cost of an operation: horizontal and vertical edges carry insertion and deletion costs, and diagonal edges carry substitution costs.


The Levenshtein distance has been used to compare two video clips in [1][19][5]. These studies suggested a quantization step that maps feature vectors into symbols; in that case, all operations have equal costs, and the Levenshtein distance is calculated as the number of operations needed to make two sequences of symbols equal. In [19], a heuristic method was proposed to determine the best step size for quantization; however, it is not clear how one should choose the number of symbols for generic videos. Similar to the idea in [5], we avoid this quantization step and rely directly on the ground distance between two feature vectors to determine the operation costs.

Another view of the Levenshtein distance is that it measures how two sequences are globally aligned; by global we mean that the two sequences are aligned across their entire lengths. However, in many visual retrieval applications such as video copy detection, the query video might not be a single clip but the concatenation of a collection of clips [22]. One is therefore often interested in finding the most similar segments within two sequences, aligned pairwise at the segment level, rather than the best alignment over the entire lengths of the two sequences. Local alignment methods can return more than one match between segments of the two sequences under comparison, because there may exist multiple-to-one, one-to-multiple, or multiple-to-multiple matches of segments. Therefore, a metric based on local alignment is desirable.

B. Extension to Local Alignment

This approach can be extended to search for local alignments between two feature sequences. The algorithm is extended in two aspects. First, we derive a score v(x_i, y_j) between two feature vectors x_i and y_j based on their distance, under the principle that v(x_i, y_j) is positive if x_i and y_j are similar, and negative otherwise. The value v(x_i, y_j) is then treated as the substitution "score." Moreover, we assign negative scores to insertions, denoted v(x_i, ε), and to deletions, denoted v(ε, y_j). The optimal local alignment can be computed using the recurrence

$$S(i, j) = \max\{\, 0,\; S(i-1, j) + v(x_i, \varepsilon),\; S(i, j-1) + v(\varepsilon, y_j),\; S(i-1, j-1) + v(x_i, y_j) \,\}.$$

This is known as the Smith-Waterman algorithm [39]. As in the basic form, the local alignment is obtained by searching for the maximal score in the dynamic programming graph and tracing back the optimal path until a score of zero is reached. We use a simple linear model v(x_i, y_j) = c - g(x_i, y_j) to derive the substitution score, where c is a constant and g(x_i, y_j) is the ground distance between the two feature vectors.

The computational cost of the Levenshtein distance is still high for a large-scale corpus of data. Next, we address the scalability problem and present a means of accelerating this approach for large-scale visual retrieval applications.
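The local-alignment recurrence admits an equally small sketch (again ours, not the authors' code): the substitution score follows the linear model v(x, y) = c - g(x, y) described above, insertions and deletions receive the negative score `gap`, and the traceback walks back from the maximal cell until a zero score is reached.

```python
import numpy as np

def smith_waterman(X, Y, ground_dist, c=0.25, gap=-0.25):
    """Local alignment of two feature sequences X and Y. Substitution
    scores v(x, y) = c - g(x, y); insertions and deletions receive the
    (negative) score `gap`."""
    m, n = len(X), len(Y)
    S = np.zeros((m + 1, n + 1))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i, j] = max(0.0,
                          S[i - 1, j] + gap,      # delete x_i
                          S[i, j - 1] + gap,      # insert y_j
                          S[i - 1, j - 1] + c - ground_dist(X[i - 1], Y[j - 1]))
    # trace back from the maximum score until a zero cell is reached
    i, j = map(int, np.unravel_index(np.argmax(S), S.shape))
    end = (i, j)
    while S[i, j] > 0:
        if np.isclose(S[i, j], S[i - 1, j - 1] + c - ground_dist(X[i - 1], Y[j - 1])):
            i, j = i - 1, j - 1                   # substitution step
        elif np.isclose(S[i, j], S[i - 1, j] + gap):
            i -= 1                                # deletion step
        else:
            j -= 1                                # insertion step
    # score, matched index ranges of X and Y (in DP coordinates)
    return S[end], ((i, end[0]), (j, end[1]))
```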

III. ACCELERATION OF APPROXIMATE SEQUENCE MATCHING

We now present a highly efficient framework for matching a query against a database, based on the approximate sequence matching method together with a dedicated indexing structure. The method decouples the design of the representation from that of the matching, so any string-of-descriptors representation can easily be incorporated into the framework. Filtration is a widely used technique for speeding up information retrieval tasks [29][37]; we apply it to examine only a small fraction of the descriptor pairs and to select good starting points for aligning two sequences.

A. Indexing with a Vocabulary Tree

We propose to use a vocabulary tree [30] to index all the feature vectors extracted from the database. The vocabulary tree was initially proposed for efficient image retrieval: each visual descriptor is quantized by hierarchical K-means clustering, where K defines the branch factor of the tree rather than the final number of clusters. The vocabulary tree allows a large and discriminative vocabulary to be used efficiently, and it was shown experimentally in [30] that this indexing structure leads to a dramatic improvement in retrieval quality. Similar to the implementation in [30], we keep an inverted file associated with each leaf node (a representative descriptor, i.e., a visual word) in the vocabulary tree. However, we record not only the visual instances that contain the word, but also the descriptor IDs (positions) at which the word occurs in each instance. Intuitively, instances that have descriptors similar to the query are potentially of interest; moreover, those similar descriptor pairs are candidate starting points for an alignment.
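As a rough illustration of this bookkeeping, the sketch below uses a flat k-means vocabulary as a stand-in for the hierarchical tree of [30] (a simplification for brevity; the class and method names are ours, not from the paper). The essential point is that each visual word's inverted file stores (instance ID, descriptor position) postings, and the positions are what later seed the dot plots.

```python
from collections import defaultdict
from scipy.cluster.vq import kmeans2, vq

class InvertedIndex:
    """Flat-vocabulary stand-in for the vocabulary tree of [30]: each
    visual word keeps an inverted file of (instance_id, position) postings."""

    def __init__(self, training_descriptors, n_words=200):
        # training_descriptors: (N, d) numpy array of sample descriptors
        self.words, _ = kmeans2(training_descriptors, n_words, minit='points')
        self.postings = defaultdict(list)

    def add(self, instance_id, descriptors):
        """Index all descriptors of one visual instance, keeping positions."""
        word_ids, _ = vq(descriptors, self.words)
        for pos, w in enumerate(word_ids):
            self.postings[int(w)].append((instance_id, pos))

    def lookup(self, descriptor):
        """Return the postings of the visual word nearest to `descriptor`."""
        w, _ = vq(descriptor[None, :], self.words)
        return self.postings[int(w[0])]
```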


Fig. 2. The number of levels of an adaptive vocabulary tree grows sub-linearly with the number of frames in the database.

To adapt the vocabulary tree to tasks where instances can be added to or removed from an active set over time, we implemented the adaptive vocabulary tree [45]. As its name implies, it adapts as instances are added to or removed from the database. One merit of the adaptive vocabulary tree is that it need not be rebuilt when the database changes slightly. Moreover, the tree grows based on a measure that encourages splitting nodes that become too ambiguous and pruning nodes that are not active for the current set of tasks. Thus, the distribution of descriptors determines the structure of the tree, which is an important factor in the vocabulary's quality.


As shown in [45], the retrieval performance is less sensitive to the parameters—the number of branches and the capacity of a node. Fig. 2 shows an example of indexing 20,000 video frames. The adaptive vocabulary tree enables the retrieval time to grow sub-linearly with the number of frames in the database.

B. Fast Matching Algorithm

The goal of acceleration is to filter out unnecessary alignments that could not possibly lead to successful matches. Suppose two sequences are unrelated; then the best local alignment is no better than no alignment at all. Inspired by FASTA [31], a fast algorithm used in bioinformatics for finding similar DNA and protein sequences, we apply a visualization device called a dot plot. A dot plot puts a dot at (i, j) if descriptor i and descriptor j are similar; Fig. 3 shows an example (considering the circles alone). The plot can easily be constructed from the inverted files built in the previous step. Note that the dot plot is in general very sparse, especially when the two sequences under comparison are completely or partially unrelated. The problem of local alignment is then transformed into identifying long diagonals by connecting dots on the plot.

We first construct a weighted directed graph G with k nodes, where k is the number of dots in the dot plot. Two types of directed edges are then established. First, two dots are connected by a diagonal link (i, j) if dot j is positioned at the bottom-right corner of dot i; the frame similarity v(x_i, y_j) is used as the edge weight. Diagonal links represent contiguous matched descriptor pairs. Next, a gap link (i, j) is established if dot j is positioned to the bottom-right of dot i and the city-block distance between the dots is within a threshold τ. Such links receive a negative weight as a penalty, with magnitude proportional to the distance between the two dots. Gap links model the insertion and deletion operations, with the goal of extending the overall length of the alignment. Both kinds of links can easily be constructed by examining the dot coordinates. Finally, we derive the best alignment by searching for the longest simple path on G. Since G is acyclic, the optimal-substructure property—sub-paths of longest simple paths in G are themselves longest simple paths—holds. Therefore, the problem can be solved efficiently by topologically pre-sorting the vertices and applying dynamic programming; the procedure is summarized in Algorithm 1. To handle multiple local alignments, the backtracking step is performed repeatedly on the d and p arrays, with previously detected segments removed from the d array after each iteration.

While the dot plot approach appears simple, it unifies visual appearance and ordering information in a holistic way, with joint consideration of visual-order consistency between the query and the reference instances. The alignment is explicitly carried out by solving an optimization problem, unlike methods such as voting schemes [21][33] that solve the alignment problem heuristically. Moreover, multiple-to-one, one-to-multiple, and multiple-to-multiple local alignments are handled seamlessly.

C. Complexity Analysis

The Smith-Waterman algorithm compares each descriptor of the query to every descriptor in the database. Suppose the length of a query is m and the size of the database (i.e., the total number of descriptors) is N; the time complexity of a query is then O(mN). In the fast method, we first construct dot plots by retrieving, for each query descriptor, the corresponding visual word and its instance and descriptor IDs. This step takes O(mL) using the vocabulary tree, where L is the tree depth. The complexity of deriving the local alignments from the dot plot is O(V+E), where V and E are the numbers of nodes and edges, respectively. The upper bound for V is O(mN), while the upper bound for E is O(τmN). In practice, dots are distributed sparsely and the dynamic programming graph is sparse, so the overall runtime is generally linear, rather than quadratic, in the sequence lengths.

Fig. 3. An example of a dot plot. Each dot represents a descriptor match between the two visual sequences under comparison. The two sequences are locally aligned where the diagonals in the boxes indicate the regions of alignment. We set τ = 3 in this example. See the text for the definition of τ.

Algorithm 1 GetBestAlignment(G)
1: Topologically sort the vertices of G
2: Initialization:
     for every vertex v do d_v ← 0, p_v ← Nil
3: Propagation:
     for each vertex v taken in topological order do
       for each vertex u adjacent to v do
         if d_u < d_v + w(v, u) then d_u ← d_v + w(v, u), p_u ← v
4: Termination:
     d_max ← max_v(d_v)
     Find the alignment by backtracking p_i, i = argmax_v(d_v), …
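Under the assumptions of the earlier sketches, the whole procedure (dot collection from the inverted index, diagonal and gap links, and the longest-path dynamic program of Algorithm 1) might be rendered in Python as follows. This is illustrative only: `index.lookup` is the hypothetical inverted-file query from the Section III-A sketch, `v(i, j)` is the frame similarity, the gap penalty factor is ours, and a practical implementation would restrict the inner scan to dots within the τ-neighborhood instead of scanning all predecessors.

```python
def best_local_alignment(query_descs, index, ref_id, v, tau=3):
    """Best alignment between the query and the reference instance
    `ref_id`: build the dot plot, then run the longest-path DP."""
    # 1. Dot plot from inverted-file lookups; lexicographic order is a
    #    topological order, since every link points to a larger (i, j).
    dots = sorted((i, j) for i, d in enumerate(query_descs)
                  for inst, j in index.lookup(d) if inst == ref_id)
    score = [0.0] * len(dots)   # best path score ending at each dot (d array)
    pred = [None] * len(dots)   # predecessor on that path (p array)
    for b, (ib, jb) in enumerate(dots):
        score[b] = v(ib, jb)    # a path may also start at this dot
        for a in range(b):
            ia, ja = dots[a]
            di, dj = ib - ia, jb - ja
            if di <= 0 or dj <= 0:
                continue
            if di == 1 and dj == 1:                  # diagonal link
                w = v(ib, jb)
            elif di + dj <= tau:                     # gap link, penalized by length
                w = v(ib, jb) - 0.1 * (di + dj)      # 0.1: illustrative penalty factor
            else:
                continue
            if score[a] + w > score[b]:
                score[b], pred[b] = score[a] + w, a
    # 2. Trace back the best-scoring path (repeating after removing the
    #    detected segment yields multiple local alignments, as in the text).
    b = max(range(len(dots)), key=score.__getitem__, default=None)
    path = []
    while b is not None:
        path.append(dots[b])
        b = pred[b]
    return list(reversed(path))
```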

IV. CASE STUDIES

A. Video Copy Detection

Content-based copy detection (CBCD) has been actively studied for a wide range of applications [18][47][19][21][8][17][41][33]. Based on content alone, CBCD attempts to identify segments in a query video that are copies from a reference video database. A copy is not an exact duplicate but, in general, a transformed or modified version of the original video that remains recognizable [17]. Transformations of digital content such as resizing and inserting logos are frequently performed, and the resulting near-duplicates can differ from the source not only in format, but also in content [41].


Since a video can be naturally represented as a sequence of frames, the approximate sequence matching method is well suited to comparing two videos. Moreover, one video could consist of one or several segments copied from different videos, so copies may appear locally; in such a case, only a small portion of the query or the reference video is a copy. Identifying a copy segment locally fits well into the formulation that the proposed method addresses.

We demonstrate the effectiveness of detecting and localizing partial near-duplicate videos in two experiments. The first experiment was conducted on the MUSCLE VCD benchmark [22], the evaluation set for the video copy detection task at CIVR 2007. The second experiment evaluates the localization ability of the approach and is designed to link partial segments crawled from media sharing websites to a 20-episode television series; the detected links can be used to propagate social tags or other online resources to the corresponding segments. The experimental setups are described in detail in the following subsections.

1) Representation

The first step is to partition a video into a sequence of frames. To avoid unnecessary and duplicate comparisons across all frames, we employ a fixed sampling strategy and sample one frame per second. Although a video can be viewed as a list of shots represented by keyframes, and shot detection methods are quite robust for videos with the same format, different keyframe sequences might still be generated when these techniques are applied to near-duplicate videos [41]. Since our matching approach is fast, we tolerate some redundancy (identical or very similar adjacent keyframes) and obtain consistent keyframe sequences.

In the second step, the content of a frame is summarized by feature descriptors. We use the global frame descriptor that we recently proposed in [44]. The descriptor is based on the spectral properties of a graph built from partitioned blocks of a frame; more precisely, it encodes pairwise correlations of geometrically pre-indexed blocks within a frame. The descriptor is compact: a 16-d feature vector per frame. Although an alternative choice of features, local statistics such as interest points with PCA-SIFT, has been quite popular for this task [37][18][47][8], we did not use local features because matching between local features is too costly. As indicated in [21], interest point detection alone is one of the computational bottlenecks in these methods; performing it on every extracted frame of a query video is simply infeasible due to the unacceptably high computational cost.

To compare two frame descriptors, we use the χ² distance as the ground distance function, and a constant value of 0.5 (the χ² distance between a normalized descriptor and a zero vector) as the cost of the insertion and deletion operations. We use the same ground distance function in all experiments.
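For reference, this ground distance is a one-liner; with L1-normalized descriptors, the χ² distance to the zero vector is exactly 0.5, which is where the insertion/deletion constant comes from (a sketch; the eps term merely guards against division by zero).

```python
import numpy as np

def chi2(p, q, eps=1e-10):
    """Chi-squared distance between two L1-normalized descriptors.
    Equals 0.5 when one argument is the all-zero vector."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

INDEL_COST = 0.5   # cost of inserting or deleting a frame descriptor
```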

2) Results on the MUSCLE VCD dataset

The MUSCLE VCD benchmark [22] is a publicly available benchmark consisting of 101 videos with a total length of 80 hours. It provides ground-truth data for evaluating a system's detection accuracy on two tasks: finding copies (ST1) and finding extracts (ST2). The first task evaluates a system's ability to find full copies in the database, which corresponds to the global alignment problem addressed in this work; the second task is to detect partial copies, which is clearly a local alignment problem. Both tasks are challenging because the transformations applied to this benchmark are very diverse.

Table I summarizes the detection performance of the best official results of the teams that participated in the evaluation, of Poullot's approach [33], and of the proposed method. Our recipe, which combines global features with the approximate sequence matching method, achieves promising performance in comparison with the other methods. More importantly, our method has a low computational cost. We report the runtime as t1/t2, where t1 is the time for frame sampling and feature extraction and t2 is the time for localizing copies. The matching process is faster than in the other approaches; indeed, it is more than three times faster than the feature extraction step. The runtime is reported for non-optimized MATLAB code running on a machine with an Intel Core 2 Duo 2.8 GHz CPU and 3 GB of RAM; all experiments were performed on the same machine.

Table II shows the number of frame pairs under comparison and the search time for the original and the accelerated approximate sequence matching (ASM) methods. The dot density of the dot plots between each query and the reference videos, defined as the number of dots divided by the total plot area, is very sparse: it ranges from 0 to 5.4% for both tasks, with an average below 0.5%. The dot density shows that the dot plot approach generates a compact dynamic programming graph compared to that of the original approximate sequence matching approach. Furthermore, the search time is clearly proportional to the dot density. For ST2, since each copy represents only a small portion (3.5%-12.6%) of the query, the speedup (37.83x) is more significant than that for ST1 (18.13x). The proposed method, with a runtime of about 1/35 of the query length, is a viable solution for applications that require real-time processing. Fig. 4 shows an example of an alignment, where white pixels denote original dots and green crosses denote the detected alignment.

Fig. 4. An example of local alignment in ST2.


TABLE I
ACCURACY ON THE MUSCLE VCD BENCHMARK

Method              | ST1 Score | ST1 Runtime (min) | ST2 Score | ST2 Runtime (min)
ADV (CIVR07)        | 0.86      | 64                | 0.33      | 33
IBM (CIVR07)        | 0.86      | 44                | 0.86      | 35
CITYU (CIVR07)      | 0.66      | 45                | N/A       | N/A
CAS (CIVR07)        | 0.53      | 14                | N/A       | N/A
Poullot et al. [33] | 0.93      | N/A               | 0.86      | N/A
Ours                | 0.93      | 17.13 / 5.41 (feature/search) | 0.86 | 4.07 / 1.29 (feature/search)

TABLE II
TOTAL RUNTIME (IN SECONDS)

Task | Query length | Total frame pairs | # frame pairs under comparison | Average dot density | Search time (ASM) | Search time (Accelerated ASM) | Speedup
ST1  | 11,236       | 2.24x10^9         | 8.87x10^6                      | 0.0043              | 5,887.47          | 324.77                        | 18.13x
ST2  | 2,690        | 5.37x10^8         | 1.82x10^6                      | 0.0035              | 2,923.98          | 77.29                         | 37.83x

TABLE III
SUMMARY OF THE KISS DATASET

                   | Query (web) videos  | Reference videos
# Videos           | 100                 | 20
# Extracted frames | 46,260              | 85,108
Lengths            | 00:00:06 ~ 00:14:19 | ~1:20:00 per episode

TABLE IV
RESULTS ON THE KISS DATASET (IN SECONDS)

Method          | # Frame-pair comparisons | Search time          | Feature extraction time | # Positive samples | Precision | Recall
ASM             | 3.94x10^9                | 21,894.42 (06:04:54) | 3,380.95 (00:56:20)     | 85                 | 1.00      | 1.00
Accelerated ASM | 429,787                  | 186.44 (00:03:06)    | 3,380.95 (00:56:20)     | 85                 | 1.00      | 1.00

3) Results on linking web videos to TV series

Broadcast videos are especially rich with partial duplicate content distributed across multiple media sharing websites, as users extract and share their favorite clips online. In this experiment we tested on a popular Taiwanese drama, "It Started with a Kiss." The TV show has 20 episodes, and the runtime of each episode is about 80 minutes. For the query videos, we crawled 100 videos from YouTube [46] using the title, and manually labeled the time stamps of the alignments and the corresponding episode for each query. Table III summarizes the dataset.

Linking web videos to a TV series is not trivial, because the reference videos feature a small number of recurring characters and share similar scenes. Moreover, as users often reformat videos to smaller sizes before uploading them to the web, we observe a significant disparity between the quality of the query and the reference videos. Finally, timely response to user queries is an important factor that fuels the popularity of web applications; a practical method must meet both speed and accuracy requirements.

Table IV shows the detection accuracy, in terms of precision and recall, and the processing time of the proposed approach. It delivers the best possible result (precision 1.00 and recall 1.00) with a search time of only 3.11 minutes for 100 query videos. The acceleration approach achieves a 117.44x speedup. This experiment validates that the accelerated ASM is effective for detecting copy segments between web videos and TV series. In this scenario, the query is extremely short compared to the video database, and thus our approach is very fast. Because the approach jointly considers the spatial and the temporal coherence during the matching process, the localization result is very accurate.

B. Shape Retrieval

We next demonstrate our approach on shape retrieval. We conducted experiments on the MPEG-7 shape database, Core Experiment CE-Shape-1 part B, which measures the performance of similarity-based shape retrieval [20]. The database consists of 1,400 shapes in 70 categories, each of which contains 20 shapes. Fig. 5 shows some examples from the dataset, where the images in the same row belong to the same class.

Fig. 5. Exemplar shapes in the MPEG-7 shape database for four different categories. Variations caused by scale, rotation, change of viewpoints, appearance of noise, and non-rigid motion are present.


TABLE V
PERFORMANCE OF DIFFERENT METHODS ON MPEG-7 CE-SHAPE-1

Method               | Score
Fourier descriptors  | 73.51%
SC+bipartite         | 71.81%
SC+TPS+bipartite [3] | 76.51%
SC+chamfer           | 79.57%
SC+DP [36]           | 82.46%
IDSC+DP [26]         | 85.40%
SC+ASM (Ours)        | 84.29%

TABLE VI
AVERAGE RUNTIME FOR COMPARING A SEQUENCE PAIR

Method    | ASM   | Accelerated ASM
Time (ms) | 62.50 | 6.37

TABLE VII
CLASSIFICATION RESULTS FOR THE SCENE DATASET

Method                  | Sparse features | Dense features
4x4 partition (L2) [23] | 66.42±0.62      | 79.72±0.63
Pyramid (L0+L1+L2) [23] | 67.95±0.50      | 80.65±0.65
Bipartite (L2)          | 63.30±0.42      | 78.17±0.47
ASM (L2)                | 69.21±0.66      | 80.93±0.64

TABLE VIII
CATEGORY-LEVEL CLASSIFICATION ACCURACY (%) OF [23] AND OUR APPROACH (STANDARD DEVIATIONS IN PARENTHESES)

Category      | [23]         | Ours
suburb        | 97.52 (1.22) | 98.09 (1.42)
coast         | 83.92 (2.65) | 82.62 (3.56)
forest        | 94.74 (1.27) | 94.87 (1.55)
highway       | 84.50 (2.06) | 84.94 (2.09)
inside city   | 80.43 (3.23) | 82.02 (2.09)
mountain      | 83.72 (2.37) | 85.91 (2.30)
open country  | 73.77 (2.81) | 74.45 (2.94)
street        | 87.45 (1.56) | 88.23 (1.46)
tall building | 88.55 (1.17) | 88.83 (1.50)
office        | 87.74 (2.57) | 88.96 (1.84)
bedroom       | 59.74 (3.83) | 64.14 (4.74)
industrial    | 57.87 (2.56) | 66.02 (3.16)
kitchen       | 66.82 (3.30) | 67.18 (4.42)
living room   | 58.15 (3.04) | 60.74 (2.83)
store         | 82.37 (2.23) | 80.14 (2.35)

1) Representation

From the contour of each shape, 100 points are uniformly sampled. Starting from the top point of the contour, the order is determined by traversing the contour clockwise, and the rotationally invariant shape context descriptor [3] is extracted at each sampled point. The shape context descriptor at a point is a 60-dimensional feature vector, resulting from the combination of five distance bins and twelve orientation bins (as proposed in [3]), which encodes the relative coordinates of the remaining points in log-polar space. As a result, a shape is represented by 100 ordered, 60-dimensional shape context descriptors.

As illustrated by the examples in Fig. 5, two contours, X and Y, may not be rotationally aligned at the top point. To address this issue, we duplicate one of the sequences, say Y, to form a longer sequence, Y-Y, of size 200. Applying our algorithm to match X and Y-Y automatically finds an optimal alignment fairly independent of the starting points of the sequences.

2) Results

The performance is measured by the so-called bull's eye test, in which each shape is compared to every other shape in the database, and the number of correct matches among the top 40 retrieved shapes is counted. The retrieval rate is the total number of correct matches divided by the total number of possible hits (20 x 1,400 = 28,000).
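In terms of the earlier local-alignment sketch, the duplication trick is a one-liner: matching X against the doubled sequence Y-Y lets the alignment begin at any rotational offset of Y (illustrative code; `smith_waterman` refers to the sketch in Section II).

```python
import numpy as np

def rotation_invariant_match(X, Y, ground_dist):
    """Align contour descriptors X against contour Y regardless of Y's
    (unknown) starting point: doubling Y lets the local alignment wrap
    around the closed contour."""
    YY = np.concatenate([Y, Y], axis=0)   # the Y-Y sequence, of size 2n
    return smith_waterman(X, YY, ground_dist)
```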

Table V shows the reported results of state-of-the-art algorithms and those of our implementation of several matching methods. We implemented Fourier descriptors [14] to represent global features. For local features, the shape context descriptor (SC) was used in [3][40][36]; the inner-distance shape context descriptor (IDSC), an extension of the shape context descriptor, was used in [26]. Table V shows that, with a proper similarity measure, representations based on local features are more effective than those based on global features.

In [3], bipartite graph matching is used to find point correspondences. The shape similarity is measured with metrics based on both appearance and deformation energy, after using the thin-plate-spline (TPS) model to estimate the transformation between the point correspondences; the point ordering is not considered in the matching. The method in [40] measures shape similarity based on chamfer matching [2], with figural continuity further incorporated to improve the correspondence estimation. However, chamfer matching is not invariant to translation, rotation, or scale, and the incorporation of figural continuity does not solve the rotational alignment issue. Furthermore, chamfer matching does not always establish a one-to-one point correspondence. The complexity of the global search phase in chamfer matching with point ordering is O(m²n) in [40], while our method runs in O(mn), and even faster after acceleration, where m and n are the sequence sizes. In the experiments we implemented a bipartite graph matching and a chamfer matching method, both with shape context descriptors, for direct comparison.


In comparison with the methods in Table V, the proposed method is more robust to rotation variations and does not require a transformation model to align two shapes. Transformations based on point correspondences found by bipartite graph matching in [3] might result in a bad alignment, since spatial information about the points is not considered. In [40][26][36], dynamic programming is also used to preserve the order of local descriptors. To address the rotational alignment issue, the authors of [26] proposed to search for a good alignment by trying a number of points as the starting point and selecting the one that offers the best performance; this strategy is likely to lead to a sub-optimal solution. Furthermore, those methods allow a small fraction of points to be unmatched, with a small penalty for a point with no match. The results of our experiments indicate that assigning a larger non-matching penalty is a better choice: matching of dissimilar shapes can be achieved by a set of insertion/deletion operations, so appropriately penalizing the insertion/deletion operations and reflecting these factors in the total cost is important for accurately measuring similarity.

As shown in Table V, IDSC+DP [26] achieves 3% higher accuracy than SC+DP [36] because of its elaborate feature design. The same improvement (3%) in a setting where the only difference was the descriptor used was also reported in [26]. By replacing SC with IDSC in ASM, it seems reasonable to expect that this approach would yield even better results.

Table VI lists the average runtime for comparing one pair of shapes represented by sets of shape context descriptors. The methods were implemented in MATLAB. Since the accelerated ASM approach examines only a subset of descriptor pairs, it is around 10 times faster than ASM.

Recently, two methods were proposed that achieve high accuracy on the MPEG-7 shape database. Felzenszwalb et al. [12] proposed a hierarchical matching method and achieved an 87.7% recognition rate. In [27], the authors used the Earth Mover's Distance as the ground distance for comparing a pair of local descriptors and achieved 86.56% accuracy. Compared with those methods, our approach has a considerably lower computational complexity. For example, given two sequences of sizes m and n, the algorithm proposed in [12] runs in O(mn³), while our method runs in O(mn) before acceleration and even faster after acceleration.

C. Scene Recognition

Finally, we demonstrate the application of the proposed approach to scene recognition. The dataset used in the experiment consists of 15 scene categories and was provided by Lazebnik et al. [23]. Each category has 200 to 400 images.

1) Representation

A scene image is represented by the spatial pyramid representation [23]. First, we extracted a set of local features from each image, each described by SIFT [28]. We identified the local features in two ways: with the Harris-Hessian-Laplace detector (sparse features) [48] and on a dense regular grid (dense features) [23].

We report classification accuracy rates for both representations. Next, we followed the setup in [23] and performed K-means clustering on a random subset of local features from the training set to form a visual vocabulary of 200 visual words. Each local feature is then quantized to a visual word, and an image is represented by a distribution of visual-word frequencies; this gives us a standard bag-of-features representation. We then obtained the spatial pyramid representation by repeatedly subdividing an image and computing the bag-of-features representation over the resulting sub-regions. In [23][6], the authors suggest that a level-2 partitioning (L2), which corresponds to 16 bags of features, achieves the best performance among all levels. Following their suggestion, we used the L2 representation in the experiment: each image is represented by 16 ordered bags of features, each of which is 200-dimensional, following the raster-scan ordering. Since matching images in this dataset does not involve the alignment problem, the basic algorithm was used.

2) Results

We followed the setup in [23] to evaluate the proposed method. One hundred images per class were used for training; the rest were used for testing. We used the edit kernel in [25] to derive kernel values between examples. As suggested in [25], we set the scaling parameter γ by five-fold cross-validation to make the Gram matrix positive semi-definite. Multi-class classification is then performed using a Support Vector Machine (SVM) [7]. The experiment was repeated 10 times with randomly selected training images. The mean and the standard deviation of the classification accuracy are shown in Table VII.

We compared the proposed method with our implementation of [23] and with the bipartite graph matching method. Bipartite graph matching treats the bags as unordered and minimizes the cost of matching, subject to the constraint that the matching is one-to-one. As shown in Table VII, introducing the ordering constraint into the matching process improves the performance. The bipartite graph matching method does not consider the locations of bags and performs the worst of all the methods tested. On the other hand, for the case of L2 alone, the approximate sequence matching approach achieves 7.21% and 9.34% improvements in comparison with [23] and the bipartite graph matching method, respectively, when sparse features are used. The performance gain is less significant (1.5% and 3.53%) with dense features. Our method is also better than the spatial pyramid kernel, which uses all levels. All methods are based on the same ground distance (χ²).

One merit of our approach is that matching across adjacent bags is allowed. In our framework, the kernel value derived in [23] is equal to the sum of all diagonal elements in the directed graph; this assumes that bag-by-bag matching is performed, i.e., a bag in the first image is only compared to the bag at the same location in the second image. However, the diagonal axis might not be the optimal path for minimizing the matching cost. Thus, we can interpret the method in [23] as a special case of our method in which only the substitution operation is allowed.
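The resulting classification pipeline can be sketched as follows: the ASM distance d(A, B) is turned into a kernel value exp(-γ·d(A, B)) following [25], the Gram matrix is precomputed, and a precomputed-kernel SVM is trained. scikit-learn is shown purely for illustration; `asm_distance` stands for the sequence distance described above, and γ would be chosen by cross-validation as in the text.

```python
import numpy as np
from sklearn.svm import SVC

def edit_kernel_gram(seqs_a, seqs_b, seq_distance, gamma):
    """Gram matrix of the edit kernel k(A, B) = exp(-gamma * d(A, B))."""
    return np.exp(-gamma * np.array(
        [[seq_distance(a, b) for b in seqs_b] for a in seqs_a]))

# Illustrative usage with a hypothetical `asm_distance`:
# K_train = edit_kernel_gram(train_seqs, train_seqs, asm_distance, gamma)
# clf = SVC(kernel='precomputed').fit(K_train, train_labels)
# K_test = edit_kernel_gram(test_seqs, train_seqs, asm_distance, gamma)
# predictions = clf.predict(K_test)
```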


Table VIII shows the category-level classification accuracy of the method in [23] and of our approach. Although the two approaches have similar performance in most categories, ours offers a large improvement in the categories "bedroom" and "industrial." In these categories, discriminative parts certainly exist, but they may not have a fixed geometric arrangement. Fig. 6 shows some test samples in the industrial category for which our approach works but [23] fails. In these examples, chimneys are considered the discriminative object for the category; however, parts of chimneys may not appear in any one particular bag. The reduction of 2D image data to a 1D sequence representation may remove important spatial information. In comparison with sophisticated approaches in computer vision that build 2D or higher-dimensional models for object recognition [4][12][49], the approximate sequence matching approach has a considerably lower computational complexity. Considering that the volume of visual data is growing exponentially, a practical solution for visual retrieval and recognition should balance the tradeoff between accuracy and speed.

Fig. 6. Test samples in the industrial category for which our approach works but [23] fails.

V. CONCLUSION

In this paper we have developed techniques for effective and computationally efficient visual search. We formulate a representation based on a sequence of features, and apply the approximate sequence matching method to measure similarity under this representation. We have also presented a framework that achieves a significant speedup of the matching process. The proposed techniques have been demonstrated in several visual retrieval applications and have obtained promising performance in comparison with state-of-the-art methods.

One future research direction we are pursuing is to extend the technique to align multiple sequences of visual data and to find essential content within multiple relevant visual streams. For example, in the context of video retrieval, this could be a useful tool for creating a summary from the huge volumes of near-duplicate videos on video sharing websites.

REFERENCES

[1] D. A. Adjeroh, M.-C. Lee, and I. King. A distance measure for video sequence similarity matching. In Proc. Int. Workshop Multi-Media Database Management Systems, pp. 72-79, 1998.
[2] H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, and H. C. Wolf. Parametric correspondence and chamfer matching: Two new techniques for image matching. In Proc. Int. Joint Conf. Artificial Intelligence, pp. 659-663, 1977.
[3] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(4):509-522, 2002.
[4] A. Berg, T. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondences. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 26-33, 2005.
[5] M. Bertini, A. D. Bimbo, and W. Nunziati. Video clip matching using MPEG-7 descriptors and edit distance. In Proc. ACM Int. Conf. Image and Video Retrieval, pp. 133-142, 2006.
[6] A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatial pyramid kernel. In Proc. ACM Int. Conf. Image and Video Retrieval, pp. 401-408, 2007.
[7] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
[8] O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection. In Proc. ACM Int. Conf. Image and Video Retrieval, pp. 549-556, 2007.
[9] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, in conjunction with ECCV, pp. 1-22, 2004.
[10] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2):1-60, 2008.
[11] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd Edition. John Wiley and Sons, 2000.
[12] P. F. Felzenszwalb and J. D. Schwartz. Hierarchical matching of deformable shapes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[13] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In Proc. IEEE Int. Conf. Computer Vision, pp. 1816-1823, 2005.
[14] R. C. Gonzalez and R. E. Woods. Digital Image Processing, 2nd Edition. Prentice Hall, 2002.
[15] K. Grauman and T. Darrell. Fast contour matching using approximate earth mover's distance. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 220-227, 2004.
[16] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification with sets of image features. In Proc. IEEE Int. Conf. Computer Vision, pp. 1458-1465, 2005.
[17] A. Joly, O. Buisson, and C. Frelicot. Content-based copy retrieval using distortion-based probabilistic similarity search. IEEE Trans. Multimedia, 9(2):293-306, 2007.
[18] Y. Ke, R. Sukthankar, and L. Houston. Efficient near-duplicate detection and sub-image retrieval. In Proc. ACM Int. Conf. Multimedia, pp. 1150-1157, 2004.
[19] Y. Kim and T.-S. Chua. Retrieval of news video using video sequence matching. In Proc. Int. Multimedia Modeling Conf., pp. 68-75, 2005.
[20] L. J. Latecki, R. Lakamper, and U. Eckhardt. Shape descriptors for non-rigid shapes with a single closed contour. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 424-429, 2000.
[21] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. Robust voting algorithm based on labels of behavior for video copy detection. In Proc. ACM Int. Conf. Multimedia, pp. 835-844, 2006.
[22] J. Law-To, A. Joly, and N. Boujemaa. Muscle-VCD-2007: a live benchmark for video copy detection, 2007. http://www-rocq.inria.fr/imedia/civr-bench/.
[23] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2169-2178, 2006.
[24] B. Leibe, A. Leonardis, and B. Schiele. An implicit shape model for combined object categorization and segmentation. Toward Category-Level Object Recognition. Springer, 2006.
[25] H. Li and T. Jiang. A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs. Journal of Computational Biology, 16(2):702-718, 2004.
[26] H. Ling and D. W. Jacobs. Using the inner-distance for classification of articulated shapes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 719-726, 2005.
[27] H. Ling and K. Okada. EMD-L1: an efficient and robust algorithm for comparing histogram-based descriptors. In Proc. European Conf. Computer Vision, pp. 330-343, 2006.
[28] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision, 60(2):91-110, 2004.
[29] G. Mori, S. Belongie, and J. Malik. Shape contexts enable efficient retrieval of similar shapes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 723-730, 2001.
[30] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2161-2168, 2006.
[31] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc. National Academy of Sciences of the United States of America, 85(8):2444-2448, 1988.
[32] Picasa. http://picasaweb.google.com
[33] S. Poullot, M. Crucianu, and O. Buisson. Scalable mining of large video databases using copy detection. In Proc. ACM Int. Conf. Multimedia, pp. 61-70, 2008.
[34] P. Quelhas, F. Monay, J. Odobez, D. Gatica-Perez, and T. Tuytelaars. A thousand words in a scene. IEEE Trans. Pattern Analysis and Machine Intelligence, 29(9):1575-1589, 2007.
[35] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. Int. Journal of Computer Vision, 40(2):99-121, 2000.
[36] C. Scott and R. Nowak. Robust contour matching via the order-preserving assignment problem. IEEE Trans. Image Processing, 15(7):1831-1838, 2006.
[37] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proc. IEEE Int. Conf. Computer Vision, pp. 1470-1477, 2003.
[38] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(12):1349-1380, 2000.
[39] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981.
[40] A. Thayananthan, B. Stenger, P. H. S. Torr, and R. Cipolla. Shape context and chamfer matching in cluttered scenes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 127-133, 2003.
[41] X. Wu, A. G. Hauptmann, and C.-W. Ngo. Practical elimination of near-duplicates from web video search. In Proc. ACM Int. Conf. Multimedia, pp. 218-227, 2007.
[42] M. Yeh and K.-T. Cheng. A string matching approach for visual retrieval and classification. In Proc. ACM Int. Conf. Multimedia Information Retrieval, pp. 52-58, 2008.
[43] M. Yeh and K.-T. Cheng. Video copy detection by fast sequence matching. In Proc. ACM Int. Conf. Image and Video Retrieval, 2009.
[44] M. Yeh and K.-T. Cheng. A compact, effective descriptor for video copy detection. In Proc. ACM Int. Conf. Multimedia, pp. 633-636, 2009.
[45] T. Yeh, J. Lee, and T. Darrell. Adaptive vocabulary forests for dynamic indexing and category learning. In Proc. IEEE Int. Conf. Computer Vision, pp. 1-8, 2007.
[46] YouTube. http://www.youtube.com/
[47] D.-Q. Zhang and S.-F. Chang. Detecting image near-duplicate by stochastic attributed relational graph matching with learning. In Proc. ACM Int. Conf. Multimedia, pp. 877-884, 2004.
[48] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: a comprehensive study. Int. Journal of Computer Vision, 73:213-238, 2007.
[49] X. Zhou, N. Cui, Z. Li, F. Liang, and T. S. Huang. Hierarchical Gaussianization for image classification. In Proc. IEEE Int. Conf. Computer Vision, 2009.
