A Fast Shot Matching Strategy for Detecting Duplicate Sequences in a Television Stream Xavier Naturel

IRISA-INRIA Rennes Campus de Beaulieu Rennes, France

ABSTRACT

This article presents a method for detecting duplicate sequences in a continuous television stream. This is of interest to many applications, including commercial monitoring and video indexing. Repetitions can also be used as a way of structuring television streams, by detecting inter-program breaks as sets of duplicate sequences. In this context, we present a shot-based method for detecting repeated sequences efficiently. Experiments show that this fast shot matching strategy allows us to retrieve duplicated shots between a 1-hour query and a 24-hour database in only 10 ms.

1. INTRODUCTION

Television has experienced drastic changes recently, with an ever increasing number of channels and new ways of delivering content to the user, e.g. TV over ADSL, mobile devices, the internet, terrestrial digital TV. However, managing this video content is still difficult, in particular extracting small segments of interest from the continuous stream of digital television. When extracting content from a television stream, one desirable feature is to retrieve an entire program with its exact boundaries, i.e. from its first to its last frame. However, because of time-shifting in the broadcast, the precise broadcast time is not known, so annoying commercials or channel events may be recorded instead of the desired program. It would be convenient to automatically identify the program segments in a recorded television stream, and to build an updated and enhanced program guide with the exact durations and locations of the programs and the various inter-program clips. This is of interest for many applications, including an end-user VCR, but also for institutions that store material coming from television, regulation authorities, or the channels themselves, which may want to store or monitor precisely what has been broadcast. All these tasks are usually done manually, which is very tedious work, especially if precise timestamps are needed.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CVDB ’05, Baltimore, MD, USA Copyright 2005 ACM 1-59593-151-1/05/06 ...$5.00.

Patrick Gros

IRISA-CNRS Campus de Beaulieu Rennes, France

In this study, inter-program breaks are considered as sets of duplicate sequences. The various components of an inter-program break (commercials, channel events, sponsoring, trailers...) are indeed repeated many times within a limited time span and exhibit very different rerun frequencies from plain programs. This property has already been used by [5], combined with a classification based on low-level features, to isolate commercials from broadcast news. We believe that repetitions can provide an easy classification and structuring process for television streams, as long as the process is run over a large dataset.

This paper focuses on detecting duplicate sequences in a television stream. The problem is thus the following: given a query video, find all the sequences in this video that are "identical" to some sequences in a video database. A sequence is considered as a simple set of frames. Two sequences are considered identical if they originate from the same source, possibly up to some transformations. The relevant transformations in our case are those which happen during broadcast and recording: additive Gaussian noise, color shift, digitization and compression artifacts, and small temporal variations. The notion of identical sequences is fairly intuitive: it is not a bit-to-bit equality, but the fact that two sequences are visually the same.

Detecting duplicate sequences is the first step of a video structuring process. For this process to make sense, the recognition must be run on large video queries (e.g. 24 hours) and on large video datasets (e.g. several days, weeks, or months). The algorithms for retrieving and matching sequences must thus be able to deal with large amounts of video data efficiently. Based on these requirements, we propose a complete method for detecting duplicate sequences, from descriptor definition to database organisation and sequence retrieval.
The first step is to segment the video into shots and to compute a signature on each frame of each shot. Since the robustness requirement is quite weak and low search complexity is important, the signature is designed to be as simple as possible. The image signature is thus defined as an integer descriptor computed from the low-frequency coefficients of the DCT. The fact that the signature is an integer is the most important point of the method, since it drives the database organisation and the retrieval method. The database is the combination of a hash table and a list of shots. The hash table is used to store the signatures and to efficiently retrieve a shot from a single signature. Matching methods are then proposed to decide whether this retrieved shot is identical to the query shot.

The paper is organised as follows. Related work on detecting duplicate video sequences is presented in Section 2. Section 3 explains the core of our method: the signature definition in subsection 3.2, and the retrieval and matching method in subsection 3.3. Section 4 presents results on the method's effectiveness and efficiency, as well as its limitations.

2. RELATED WORK

A lot of research has been done on identifying similar video clips in a video database. Most of this research has been conducted for two main purposes: video copy detection, and video similarity search for content-based retrieval. The first application aims at retrieving videos that originate from the same source, possibly with strong tampering between them. A solution is to build from the content a robust fingerprint able to resist certain defined modifications [10]. Because of the need to capture the essential geometric properties of the image and to resist tampering, fingerprints are usually based on interest points or computed from the frequency domain of some transform (wavelet, DCT). The second application is interested in retrieving from a database the video clips that are similar to a query video clip. Since retrieval is based on similarity, the features used are far less discriminative, resulting in a wide use of global features such as color, texture and shape descriptors. The common point of these methods is that their descriptors are vectors in a high-dimensional space in which the search is performed. Numerous smart indexing schemes and search methods have been proposed to efficiently search these high-dimensional spaces [13] [3]. These methods are complex, and dealing with large datasets in this context is still a difficult issue. Work by Joly et al. [7] is a notable exception. Joly used a 20-dimensional descriptor computed from an interest point detector and proposed a so-called statistical query to perform approximate search in this high-dimensional space. The effectiveness of this method is demonstrated by efficiently querying a database of 50000 hours of video. The methods used in these two applications are however not suited to matching identical sequences, because they allow for important differences between the matched sequences. Detection and recognition of commercials is much closer to our topic. Lienhart et al. [9] used a fingerprint based on color coherence vectors and compared the fingerprints using a standard approximate substring matching algorithm. Sanchez et al. [12] built a subspace by computing a principal component analysis on key-frames of the spots to recognize. Matching is then done by computing the minimum Euclidean distance between the representations of the query commercial's key-frames in this subspace. These methods are unfortunately restricted to commercials and do not seriously consider large datasets and efficiency issues. Very few works have considered indexing mechanisms based on mono-dimensional descriptors. Pua et al. [11] proposed a hashing mechanism based on color moment vectors to efficiently retrieve repeated video clips. Oostveen et al. [10] used a lookup table to store fingerprints based on the mean luminance of image blocks. Both of these methods use exact matching of image fingerprints to locate a candidate sequence, rather than exhaustive search in the fingerprint database.

3. RECOGNITION METHOD

3.1 Shot Segmentation

The first step is shot segmentation. A shot is a set of frames continuously recorded by a camera; it is therefore a small homogeneous set of frames, which is used as the basic unit for recognition. The algorithm used for shot segmentation is a simple one based on the frame-to-frame difference of luminance histograms. We do not require a perfect temporal segmentation; however, the algorithm must not miss important semantic boundaries (program changes, channel lead-ins, lead-outs) and should have the property of repeatability, i.e. it should produce the same shot segmentation on two identical video sequences. A simple and fast algorithm is preferred over a more complex one, which would yield a better shot segmentation but would not really improve the recognition results.
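The paper does not spell out the detector, but a frame-to-frame luminance histogram difference can be sketched as follows. This is a minimal illustration, assuming grayscale frames in [0, 255]; the bin count and threshold are illustrative values of ours, not the paper's.

```python
import numpy as np

def shot_boundaries(frames, bins=64, threshold=0.4):
    """Detect cuts by thresholding the L1 distance between consecutive
    normalised luminance histograms. Returns the indices of frames that
    start a new shot."""
    cuts = []
    prev_hist = None
    for idx, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()                  # normalise per frame
        if prev_hist is not None:
            # large histogram change => shot boundary
            if np.abs(hist - prev_hist).sum() > threshold:
                cuts.append(idx)
        prev_hist = hist
    return cuts
```

Such a detector is cheap and, being deterministic on the pixel data, tends to satisfy the repeatability property discussed above.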

3.2 Signature definition

The next step is to compute a signature for each image. As stated in the introduction, the signature has to be robust to broadcast noise (additive Gaussian noise, color shift, compression artifacts). It also has to allow exact retrieval, i.e. two identical images should have the same signature. One of the most common ways to reduce dimensionality in image processing is to work in the Discrete Cosine Transform (DCT) domain. This transform is widely used because, for smooth natural images, it approximates the Karhunen-Loève transform (KLT), which provides optimal energy compaction and coefficient decorrelation [6]. It has been shown that the DCT is one of the best approximations of the KLT, without the computational burden of the KLT, which is data-dependent. For an image I of size (N, M), the coefficient DCT(u, v) is given by:

DCT(u, v) = α(u, v) · Σ_{x=0}^{N−1} Σ_{y=0}^{M−1} cos[πu(2x + 1)/(2N)] · cos[πv(2y + 1)/(2M)] · I(x, y)

with α(u, v) = (2/√(NM)) · C(u)C(v), where C(u) = 1/√2 for u = 0 and C(u) = 1 otherwise.

Note that in the formula above the transformation is applied to the whole image, thus capturing its global properties. The next step is to use the dimensionality reduction property of the DCT in a rather extreme way, by extracting the upper-left n×n sub-matrix of the DCT matrix and quantizing it aggressively:

σ(i) = 1 if DCT(⌊i/n⌋, i − ⌊i/n⌋·n) ≥ m, and σ(i) = 0 otherwise, for i ∈ [1, n²]
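The formula above can be turned into a short computation. The sketch below is ours (function names and the bit-packing detail are assumptions, not the paper's); it computes only the n×n low-frequency block directly, so the full transform is never formed, and packs the median-quantized bits into a Python integer.

```python
import numpy as np

def dct_lowfreq(img, n):
    """Upper-left n x n block of the 2-D DCT of a grayscale image,
    following the formula of Section 3.2."""
    N, M = img.shape
    k = np.arange(n)[:, None]
    Cx = np.cos(np.pi * k * (2 * np.arange(N) + 1) / (2 * N))   # (n, N)
    Cy = np.cos(np.pi * k * (2 * np.arange(M) + 1) / (2 * M))   # (n, M)
    c = np.ones(n)
    c[0] = 1.0 / np.sqrt(2.0)                                   # C(u)
    alpha = (2.0 / np.sqrt(N * M)) * np.outer(c, c)             # alpha(u, v)
    return alpha * (Cx @ img @ Cy.T)

def frame_signature(img, n=8):
    """Median-quantized signature: raster-scan the n x n block, drop the DC
    coefficient (always above the median), pack the bits into an integer.
    The paper appends DCT(n, 0) to reach exactly 64 bits; this sketch simply
    keeps the remaining n*n - 1 bits."""
    flat = dct_lowfreq(img, n).ravel()        # raster order; flat[0] is DC
    m = np.median(flat)
    bits = (flat >= m)[1:]                    # drop DCT(0, 0)
    sig = 0
    for b in bits:
        sig = (sig << 1) | int(b)
    return sig
```

With n = 8 this yields a 63-bit integer per frame, cheap to compare for equality or Hamming distance.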

where m is the median value of the first n² coefficients. These first coefficients in raster order are the most important ones, especially for natural images, because they carry the low-frequency information. For our purpose, however, we remove the DC coefficient because it is useless in its quantized form: the DC coefficient, i.e. DCT(0,0), is the image mean and is always the greatest coefficient, so it is always quantized to 1 and does not bring any information. The following coefficient, DCT(n,0), is appended to σ to make a 64-bit signature. A similar quantization scheme has already been proposed by Coskun et al. [4] and Barr et al. [2] for copy detection purposes, and has been shown to be robust to severe transformations. Our emphasis here, however, is more on size than on robustness. This construction results in a binary vector of size n², which we use as the signature for one frame. We choose n between 5 and 8, allowing the binary vector σ to be mapped to a 64-bit integer.

Figure 1: Original Lena image

Figure 2: Reconstructed images of Lena based on its signature, for n = 8, 7, 6 and 5

Figure 2 shows a visual interpretation of the signature of the Lena image. These images are obtained by applying an inverse DCT to the quantized DCT matrix, in which all coefficients are set to zero except for the upper-left n×n sub-matrix, quantized as explained above. It is interesting to see the information kept in the signature, especially for n = 8, where the shape of Lena can be guessed although only 64 bits of data remain.

3.3 Shot Matching Method

3.3.1 Database organisation

A desired property of the signature was that two identical images should have the same signature. This is not true in the general case because of noise. However, there is a very high probability that two identical shots will share at least one common signature. This is the main assumption of the method, and the results in Section 4 show that this assumption holds. The nature of the signature (an integer) allows the use of fast retrieval structures, e.g. hash tables or lookup tables. In our case, the signatures are extracted for each frame and stored in a hash table with a reference to the shot they belong to.

We define two methods of database organisation:

• Single Reference Scheme: the signature is stored only if it is not already in the hash table. There is thus only one candidate shot reference per signature.

• Multiple Reference Scheme: several shot references are stored for a single signature. Though it has higher memory demands, we expect this scheme to lead to better results. This is also the approach used by [10] and [11], and it is very close to the inverted file technique used in text retrieval.

To sum up, a single signature is used to retrieve a reference to a shot (or a list of shot references in the multiple reference scheme) by a hash table lookup. This reference is then used to retrieve the complete shot information.

3.3.2 Hash function definition

A good hash function for a hash table is one that produces uniformly distributed hash values: choosing a hash function that minimizes collisions is essential for the performance of the hash table. An easy way to achieve this is to use the signature itself as the hash value, at the expense of mapping the 64-bit signature to a 32-bit value. To limit collisions, we must look at the initial distribution of the signatures and find a way of transforming it into a uniform distribution in the reduced space. The observation is the following: in a group of homogeneous shots, the signatures are very similar, with small variations on the last DCT coefficients, i.e. the least significant bits show some variation while the most significant bits remain quite stable. This remark leads to a very simple hash function: we define the hash value h(σ) as the 32 least significant bits of σ. Note that in the case n = 5 we always have σ ≤ 2³² − 1, leading to the even simpler hash function h(σ) = σ.
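The two organisation schemes and the hash function can be sketched together. This is an illustrative sketch of ours, assuming shots are given as lists of per-frame 64-bit signatures; in the single reference scheme we keep only the first reference per hash bucket, which slightly simplifies the per-signature rule described above.

```python
from collections import defaultdict

def h(sig):
    # 32 least significant bits of the 64-bit signature (Section 3.3.2)
    return sig & 0xFFFFFFFF

class ShotDatabase:
    def __init__(self, multiple=True):
        self.multiple = multiple
        self.table = defaultdict(list)       # hash value -> shot references

    def add_shot(self, shot_id, signatures):
        for pos, sig in enumerate(signatures):
            refs = self.table[h(sig)]
            if self.multiple or not refs:
                # keep the frame position too: it is what the
                # "search by positions" of Section 3.3.5.2 relies on
                refs.append((shot_id, pos))

    def lookup(self, sig):
        return self.table.get(h(sig), [])
```

A lookup is O(1) per query frame, which is what makes the overall matching independent of the database size.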

3.3.3 Retrieval Method

Suppose now that a database of shots, as defined in 3.3.1, has been constructed, with the signatures stored in a hash table Th. A query video stream is transformed into a sequence of query shots with their associated signatures. For a specific query shot Sq, the general algorithm for finding whether Sq is present in the database is the following. For each frame of Sq, we test whether its signature belongs to Th; if so, it returns a candidate shot Sc. To decide whether Sq and Sc are the same, a distance between Sq and Sc is computed and thresholded. The process stops as soon as a candidate has been found, i.e. a shot Sc that verifies D(Sq, Sc) ≤ α. The value of the threshold α is discussed in the next section.
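The retrieval loop above can be sketched as follows. To stay self-contained, this sketch of ours uses an exact-match dict in place of the 32-bit hash table and a placeholder shot distance; all names are assumptions, not the paper's.

```python
def shot_distance(q, c):
    # placeholder: mean Hamming distance over the common prefix
    L = min(len(q), len(c))
    return sum(bin(a ^ b).count("1") for a, b in zip(q[:L], c[:L])) / L

def find_duplicate(query_sigs, table, shots, alpha=3.0):
    """table: signature -> shot id; shots: shot id -> list of signatures.
    Probe the table frame by frame, accept the first candidate within alpha."""
    for sig in query_sigs:
        cand = table.get(sig)
        if cand is not None and shot_distance(query_sigs, shots[cand]) <= alpha:
            return cand          # stop at the first candidate under threshold
    return None
```

Because the loop stops at the first acceptable candidate, the cost per query shot is dominated by the (rare) distance computations, not by table lookups.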

3.3.4 Shot Distance definition

Since the signatures can be compared with a simple equality test, a voting procedure, i.e. counting the number of equal signatures in two shots, could be envisioned. However, because small variations in the frame may cause small variations in the signature, the number of frames with identical signatures may be low. The Hamming distance, on the contrary, is well suited to measuring small variations between signature vectors. Consider two shots Sq = {σq1, ..., σqN} and Sc = {σc1, ..., σcM}, and suppose the ideal case where N = M. The naïve distance DNa(Sq, Sc) between the two shots is defined as:

DNa(Sq, Sc) = (1/N) · Σ_{i=1}^{N} dh(σqi, σci)

where dh(σqi, σci) is the Hamming distance between the binary vectors σqi and σci.

Figure 3: Naïve distances between a 100-frame commercial lead-in and a 24-hour video

Figure 3 shows the effectiveness of the Hamming distance between the signatures. The figure represents the naïve distances between a hand-picked shot of 100 frames and every possible position of a 100-frame shot in a 24-hour video. The test shot belongs to the query (near frame 250000). Two other instances of this test shot are easily identified in the 24-hour video. The figure also helps to choose the threshold α, which represents the mean number of bit errors allowed per signature. We use α = 3 in our tests.

3.3.5 Temporal alignment of shots

3.3.5.1 Problems and existing methods.

Shots produced by the shot segmentation algorithm are not always directly comparable. Because of complex and varied transition effects, and because the repeatability property desired for the shot segmentation algorithm is not always achieved, two identical video sequences may have very different temporal segmentations. For example, consider a channel lead-in that is broadcast many times over one day but is introduced and ended with different transition effects. The shot segmentation algorithm will produce shots of different sizes, with a possibly large frame shift. One shot may also be split into two or three shots because of false alarms in the shot segmentation and, conversely, shots may be concatenated into one because of missed transitions. Many authors have proposed to view a sequence of images as a string and to use string matching algorithms to solve this alignment problem [9], [8]. However, as some works have pointed out [1], traditional sequence matching algorithms are not well suited to the video domain. The edit distance, which is often used for measuring string similarity, is not always relevant and has a rather high complexity, even with dynamic programming. It is useful when looking for similarities, especially when complex video editing must be handled, as in [1], but it is often important to keep the strict temporal frame order.

3.3.5.2 Alignment Method.

Only identical shots should be recognized, and broadcasting does not produce fancy editing effects. The temporal order of the strings should therefore be kept (no edits). The first method presented is a simple exhaustive search. Consider a query shot Sq and a candidate shot Sc of respective lengths N and M, with N < M. We want to find the position in the candidate shot that minimizes the naïve distance between Sq and a subset of Sc of length N. Formally, we look for the position kmin:

kmin = Arg min_k Σ_{i=1}^{N} dh(σqi, σc(i+k)) with 0 ≤ k ≤ M − N

While this is a very basic search strategy, the exhaustive search is not intractable even for very large databases, because we are only comparing shots between themselves; M and N are therefore quite small in general. Note that what we call the exhaustive search can be interpreted as an edit distance where insertion is forbidden (infinite cost), deletion is only authorised at the beginning of the string, and the cost associated with substitution is the Hamming distance between the symbols. This search can easily be sped up by stopping as soon as the distance falls below the threshold.

Another search strategy, the search by positions, is not an improvement of the exhaustive search but a rather different way of searching. Finding a candidate shot Sc is based on the fact that a signature σci has matched a signature σqj through the hash table, i.e. h(σci) = h(σqj). Suppose now that the relative positions of these signatures in their original shots, i.e. i and j, have also been stored. The search by positions uses i and j to align the shots and to compute the naïve distance on the common subset of Sq and Sc.

Figure 4: Shot alignment with search by positions

Figure 4 gives an example of a search by positions. Note that several positions may be tried, since every signature of the query shot is tested; the same candidate shot may thus be found several times, with different alignments.
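The two alignment strategies can be sketched compactly. This is an illustrative sketch of ours (names are assumptions), with shots represented as lists of integer signatures; the exhaustive search includes the early-termination variant described above.

```python
def dh(a, b):
    # Hamming distance between two integer signatures
    return bin(a ^ b).count("1")

def naive_distance(q, c):
    # D_Na: mean per-frame Hamming distance, defined for equal-length shots
    assert len(q) == len(c)
    return sum(dh(x, y) for x, y in zip(q, c)) / len(q)

def exhaustive_search(q, c, alpha=3.0):
    """Slide the query shot q over the longer candidate c; return the offset
    minimizing the naive distance, stopping early once it falls below alpha."""
    N, M = len(q), len(c)
    best_k, best_d = None, float("inf")
    for k in range(M - N + 1):
        d = naive_distance(q, c[k:k + N])
        if d < best_d:
            best_k, best_d = k, d
        if d <= alpha:                       # early termination
            break
    return best_k, best_d

def distance_by_positions(q, c, i, j):
    """Search by positions: align frame j of q with frame i of c and compute
    the naive distance on the overlapping subset only."""
    start_q, start_c = max(0, j - i), max(0, i - j)
    L = min(len(q) - start_q, len(c) - start_c)
    return naive_distance(q[start_q:start_q + L], c[start_c:start_c + L])
```

The search by positions replaces the O(M − N) sliding loop with a single O(N) distance computation per hash hit, which is why its cost stays essentially constant in the experiments of Section 4.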

4. RESULTS

4.1 Experiment

Table 1: Signature retrieval capacity

        Simple reference      Multiple reference
 n    Precision   Recall    Precision   Recall
 8      98.7       79.3       96.5       98.4
 7      98.3       85.2       97.0       99.2
 6      94.1       90.3       95.7       99.6
 5      58.9       94.1       70.0       99.9

The tests were made on two video files recorded from the same French channel on two consecutive days. The file used as the database is 24 hours long (2180727 frames, 19046 shots) and the file used as the query is 1 hour long (85601 frames, 734 shots). The videos are encoded in MPEG-2, in PAL resolution (720x576) at 25 frames/s. Four signature sizes, n = 5, 6, 7, 8, were tested, producing binary vectors of sizes 25, 36, 49 and 64.

4.2 Retrieval Method

Retrieval of a candidate shot relies solely on the ability of a query signature to have at least one exact correspondence in the candidate shot. The retrieval capacity of the signature is tested with the following experiment: for each signature of the query file, based on ground truth, we record a good hit or a false hit if a candidate shot is returned by the hash table, and a miss if no candidate shot is returned although an identical shot is present in the database. The results are expressed in terms of precision and recall in Table 1 for the simple and multiple reference schemes. It is important to understand that the recall does not need to be very high, because a single signature per shot is enough to retrieve that shot. Precision, on the contrary, is important, since false alarms lead to useless computations. We cannot therefore conclude yet on the superiority of the simple or multiple reference scheme over the other. However, it is obvious that n = 5 produces too weak a signature (only 25 bits per image!): it leads to a considerable number of false alarms and thus cannot be trusted.

4.3 Shot Matching

The results of the shot matching strategies are presented in Table 2 for the exhaustive search and in Table 3 for the search by positions. The most obvious fact is that the multiple reference scheme does not bring any improvement; since it is also more demanding in memory and processing power, it is of no interest. It is however very interesting to compare the results of the exhaustive search and of the search by positions. It is quite surprising to observe that the search by positions is actually slightly better than the exhaustive search. Two main reasons explain this result. First, as we hoped, the position of the signature in the shot is quite accurate, which explains why the search by positions can be as good as the exhaustive one. Second, the search by positions may consider only a subset of the shots, thus finding alignments that the exhaustive search does not explore. This is especially helpful when identical shots begin and end with different progressive transitions. Let us now focus on the results of the search by positions. It is no surprise that precision decreases with n, but let us examine these false alarms more closely.

Table 2: Exhaustive search

        Simple reference      Multiple reference
 n    Precision   Recall    Precision   Recall
 8      98.6       96.5       98.6       96.5
 7      96.6       97.2       96.6       97.2
 6      92.6       96.5       92.7       97.2
 5      70.3       97.9       68.1       98.6

Table 3: Search by positions

        Simple reference      Multiple reference
 n    Precision   Recall    Precision   Recall
 8      98.6       97.9       98.6       97.9
 7      95.9       98.6       95.9       98.6
 6      92.2       98.6       92.2       98.6
 5      66.3       97.9       62.9       97.2

Figure 5 shows an example of such a false alarm. The result is not a pair of identical shots but an almost identical pair, often from the same program, i.e. the notion of identity has drifted towards very strong similarity. Monochrome shots are also an issue, since they may appear anywhere, thus linking together two sequences that are not the same. These kinds of false alarms can easily be filtered out. Precision is therefore not a critical parameter, except in the case n = 5, where 'true' false alarms appear because of the weakness of the signature. Recall is much more important for the structuring process. In our experiments, the missed shots were caused by artifacts due to the interlaced mode of television and by incoherent shot segmentations. Static shots, in which the signature is constant, are also a problem because retrieval becomes all or nothing. In general, shots that contain natural images are very well identified, but shots containing very little information, or synthetic images, are far less well represented by the DCT and may be missed or falsely recognized.

4.4 Time issues and scaling

Speed was a key factor in designing the algorithms. Two processes are distinguished: metadata extraction and shot matching, which may run together or independently. Metadata extraction means shot segmentation and signature computation. Since video decoding is needed, the decoding speed is also given as a reference (130 frames/s). Metadata extraction runs at 65 frames/s, twice as slow as decoding but still much faster than real time. To evaluate the shot matching complexity, two measures are computed: the total time, which includes database loading and memory management, and the search time, which is only the time spent in the matching process. The reason

Figure 5: False alarm example

Figure 6: Efficiency comparison between exhaustive search and search by positions for query sizes ranging from 1 to 24 hours, with a 24-hour database

of this distinction lies in the fact that for a monitoring application, only the search time is relevant. A comparison between the search by positions and the modified version of the exhaustive search, where we stop as soon as the threshold is crossed, is given in Figure 6. Only the search time is reported in this figure; indeed, for the exhaustive search, the time spent in memory management becomes negligible as the query grows. The figure shows that the exhaustive search cannot be used for large queries, while the search by positions is almost unaffected by the query size (always below 1 s). A more comprehensive study of the influence of the query size on the search by positions is given below. It remains to be verified whether this solution scales to large video databases. Figure 7 reports the influence of the database size on the retrieval time: it shows the timings of a 24-hour query on databases ranging from 24 to 144 hours, with n = 8 and using the search by positions. The figure also shows the size of the hash table (in millions of entries) and the number of detections between the query and the database (in hundreds). One can easily see that, as expected, the search time is constant. The total time is linear in the size of the hash table and database, because of the loading of the hash table into main memory. We also consider how our method scales with the size of the query. Figure 8 shows the performance of the matching process for query sizes ranging from 1 to 24 hours with a 24-hour database. What can be seen in this figure, and also in Figure 7, is that the search time is proportional to the number of detections. The method is thus relatively independent of the database and query sizes. Unfortunately, these performances cannot be compared with other approaches, either because of missing data or because the goals are too different. Most of the works close to our own focus on video copy detection or similarity search and are concerned with retrieving a single shot or a short video clip from a large database. In our case, we want to identify common elements between a very large query (24 hours) and a large database, thus focusing on quite different aspects.

Figure 7: Influence of the database size on the retrieval time with a 24-hour query

Figure 8: Influence of the query size on the retrieval time within a 24-hour database

5. CONCLUSION

A very fast shot matching strategy for retrieving identical shots from a television broadcast has been presented. The algorithm computes a frame signature that can be mapped to a 64-bit integer, thus allowing exact retrieval and the use of a hash table. We also presented a very effective strategy for aligning two candidate shots, based on the signature positions. Results show the effectiveness of our method on real data recorded from French television. This method could be used in many applications: commercial monitoring, intelligent digital VCRs. Our first motivation was, however, to detect duplicate sequences in order to structure television streams into programs. The next step in our research is therefore to find efficient ways of using this repetition information for video structuring.

6. REFERENCES

[1] D. A. Adjeroh, M. C. Lee, and I. King. A distance measure for video sequences. Comput. Vis. Image Underst., 75(1-2):25–45, 1999.

[2] J. Barr, B. Bradley, and B. Hannigan. Using digital watermarks with image signatures to mitigate the threat of the copy attack. In ICASSP, 2003.

[3] S.-A. Berrani, L. Amsaleg, and P. Gros. Approximate searches: k-neighbors + precision. In Proc. of the 12th ACM International Conference on Information and Knowledge Management, pages 24–31, New Orleans, Louisiana, USA, 2003.

[4] B. Coskun and B. Sankur. Robust video hash extraction. In EUSIPCO: European Conference on Signal Processing, Vienna, 2004.

[5] P. Duygulu, M.-Y. Chen, and A. Hauptmann. Comparison and combination of two novel commercial detection methods. In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME 2004), June 2004.

[6] A. K. Jain. Fundamentals of Digital Image Processing. Prentice Hall Information and System Sciences Series, 1989.

[7] A. Joly, O. Buisson, and C. Frélicot. Robust content-based video copy identification in a large reference database. In Proceedings of the International Conference on Image and Video Retrieval, 2003.

[8] Y.-T. Kim and T.-S. Chua. Retrieval of news video using video sequence matching. In International Multimedia Modelling Conference, 2005.

[9] R. Lienhart, C. Kuhmunch, and W. Effelsberg. On the detection and recognition of television commercials. In International Conference on Multimedia Computing and Systems, pages 509–516, 1997.

[10] J. Oostveen, T. Kalker, and J. Haitsma. Feature extraction and a database strategy for video fingerprinting. In VISUAL '02: Proceedings of the 5th International Conference on Recent Advances in Visual Information Systems, pages 117–128. Springer-Verlag, 2002.

[11] K. M. Pua, J. M. Gauch, S. E. Gauch, and J. Z. Miadowicz. Real time repeated video sequence identification. Comput. Vis. Image Underst., 93(3):310–327, 2004.

[12] J. M. Sánchez, X. Binefa, and J. Vitrià. Shot partitioning based recognition of TV commercials. Multimedia Tools Appl., 18(3):233–247, 2002.

[13] J. Yuan, L.-Y. Duan, Q. Tian, and C. Xu. Fast and robust short video clip search using an index structure. In MIR '04: Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 61–68. ACM Press, 2004.
