Scene Duplicate Detection Based on the Pattern of Discontinuities in Feature Point Trajectories

Xiaomeng Wu
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

Masao Takimoto
The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8654, Japan

Shin'ichi Satoh
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

Jun Adachi
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

ABSTRACT

This paper aims to detect and retrieve videos of the same scene (scene duplicates) from broadcast video archives. A scene duplicate is composed of different pieces of footage of the same scene and the same event, at the same time, but from different viewpoints. Scene duplicate detection would be particularly useful for identifying the same event reported in different programs from different broadcast stations. The issue here is that the approach should be invariant to viewpoint changes. We focused on object motion in videos and devised a video matching approach based on the temporal pattern of discontinuities obtained from feature point trajectories. We developed an acceleration method based on the discontinuity pattern, which is more robust to variations in camerawork and editing than conventional features, to dramatically reduce the computational burden. We compared our approach with an existing video matching method based on local features of keyframes. The spatial registration strategy of this method was also combined with the proposed approach to cope with visually different, unrelated video pairs. The performance and effectiveness of our approach were demonstrated on actual broadcast videos.

Keywords

Video Matching, Scene Duplicate Detection, Time Series Analysis, Feature Point Tracking

1. INTRODUCTION

Recent advances in broadband networks, storage devices, and digital video broadcasting have created a demand for large-scale video databases and intelligent access to them. To achieve this, video semantic analysis is indispensable. Despite the efforts of many researchers, including the experimental projects on high-level feature extraction conducted by TRECVID, video semantic analysis still performs insufficiently. Recently, near-duplicate shot detection has attracted the attention of researchers. This form of detection does not require semantic analysis of videos (e.g., it does not need to extract information such as what the shown object is or what the scenery is), yet it may enable semantic relations between shots to be evaluated without analyzing the semantic content of each shot.

1.1 Applications of Near Duplicate Detection

There are several promising applications of near-duplicate detection to broadcast video streams.

1) Commercial film detection and identification [4]. This application is of particular interest to the sponsoring companies of commercial films.

2) News topic tracking and threading [7, 8, 23]. Near-duplicate shots shared by several news topics imply latent semantic relations between these topics. This information can be used to define the similarity between news topics; thus, it is useful for news topic tracking and threading. For instance, by detecting common topics among news videos, near duplicates can be used to provide supplementary information to text similarity evaluation.

3) News topic ranking [26]. Near-duplicate detection can be used to identify the same topic among news programs broadcast by different channels. A topic mentioned by many broadcast stations can be regarded as a highly ranked topic [26].

4) Novelty detection [22]. The earliest broadcast topic may be a novel occurrence among the same topics detected in news video archives covering a certain period from several channels. For this purpose, near-duplicate detection can be combined with text similarity based on language models [22].

5) Filler shot detection [17]. Within a certain period of news video archives, repeatedly used shots can be regarded as redundant shots (or filler shots), e.g., opening CG shots, anchorperson shots, and weather charts, and they are especially useful for news video parsing. Near-duplicate detection can be used to detect such repeatedly used shots [17].

1.2 Classification of Near Duplicates

Near duplicates can be classified as follows:

a) Strict near duplicate. The same video material (footage) is used several times in different programs. In this case, various editing, cropping, video captioning, etc. may have been applied to the footage, but no viewpoint change occurs. Figure 1 shows examples of strict near duplicates. Typical cases include commercials and file footage used in news.

Figure 1: Example of strict near duplicates. File footage of the same video material.

b) The same objects or background (Object duplicate). This type of near duplicate is composed of footage taken of the same object or the same background, but taken at different times (and/or in different places). Figure 2 shows an example of object duplicates. This example shows two pieces of footage of a government spokesman speaking about two different topics on different days.

Figure 2: Example of object duplicates. File footage of the same government spokesman mentioning a different topic on different days.

c) The same scene (Scene duplicate). This type of near duplicate is composed of different pieces of footage of the same scene and the same event, at the same time, but from different viewpoints, e.g., taken by different cameras and possibly with temporal offsets. Figure 3 shows examples of scene duplicates. This is a special case of b).

Figure 3: Examples of scene duplicates. File footage of the same scene at the same time but from different viewpoints.

1.3 Related Work

There are a number of near-duplicate detection approaches. Type a) can be detected by almost all approaches. Some approaches focus on detecting a) and pay more attention to computational efficiency [2, 5, 6, 9, 15, 24, 25]. Interest point approaches are typically used to detect types b) and c) [11, 12, 13, 16]. These approaches first extract interest points and compute features, such as SIFT [14], which are invariant to shifts, changes in scale, rotations, etc. If two keyframes share a certain fraction of interest points having almost the same feature values, the shots corresponding to the keyframes are regarded as near duplicates. These approaches are invariant to viewpoint changes, camerawork, etc., to some extent, and therefore they can detect types b) and c) as well as type a). However, not much attention has been paid to distinguishing types b) and c).

Computer vision researchers have devised a method to match two videos of the same scene taken from different viewpoints [1]. However, this method estimates the temporal offset and the relative viewpoint difference (such as a homography), and is computationally extremely expensive. The method described in reference [3] matches videos using spatio-temporal segmentation and motion-based signatures, but cannot handle scene duplicate detection properly. The authors of [10] mentioned that detection of type c) is challenging, but they did not present any method to distinguish c) from b). Reference [20] matches the temporal patterns of camera flashes to detect scene duplicates, but it cannot detect scenes without flash lights.

1.4 Research Purpose and Issue

If we could detect scene duplicates while excluding object duplicates, they would be particularly useful for identifying the same event reported in different programs from different broadcast stations. Since type b) may include shots of the same anchorperson reporting on different topics, or shots of the same object or the same person in a totally different context, type b) may have to be eliminated to identify shots corresponding to the same event. Figures 2 and 3 clearly show the difference: the object duplicate in Fig. 2 does not correspond to the same event, while the scene duplicates in Fig. 3 obviously relate to the same event.

In this paper, we propose an approach that detects scene duplicates while eliminating object duplicates. The issue here is that the approach should be invariant to viewpoint changes. This issue can be treated by using invariant features such as SIFT; however, it is difficult to distinguish b) from c) by using interest points and invariant features. Instead, we employ the temporal pattern of discontinuities obtained from trajectories of feature points (Fig. 4). If two shots are of the same scene but from different viewpoints, a certain fraction of the temporal patterns of trajectories between these shots will match with a common temporal offset.

Figure 4: Motion of object in scene duplicates. Scene duplicate shots have the same object with the same motion, and thus feature points on the object and their trajectories have the same motion pattern.

The approach presented in this paper extends [18], where we used inconsistency [19] to detect discontinuities and to match trajectories. Since application of the method in [18] to large-scale video archives is computationally expensive, we use a filtering approach based on temporal discontinuity patterns to accelerate the matching. Based on this idea, we developed a new near-duplicate detection method which can detect scene duplicates but not object duplicates. We demonstrated its performance on actual broadcast videos from five TV channels, and compared it with a state-of-the-art reference technique under the same conditions. This reference technique is keyframe-based and uses local spatial features to detect near-duplicate keyframes in a large data set. We furthermore exploit this keyframe-based technique in our approach to cope with visually different, unrelated video pairs, and thereby improve the speed and accuracy of scene duplicate detection.

2. SCENE DUPLICATE DETECTION

2.1 Overview

Our approach works on a video archive composed of video programs, and it outputs scene duplicates within the archive. The approach is composed of an offline process and an online process (Fig. 5). The offline process first decomposes given videos into shots with an arbitrary shot boundary detection method and applies interest point detection and tracking to obtain trajectories from each shot. We use the KLT tracker [21] for interest point detection and tracking, but basically any interest point detector and tracker can be used. From each trajectory, the approach extracts an inconsistency sequence and a discontinuity sequence (these terms are explained later). These are stored as sufficient information for the online process to match shots. Note that a shot may correspond to several trajectories (in our case up to 200), and each trajectory is then associated with an inconsistency sequence and a discontinuity sequence.

Figure 5: Framework overview.

The online process then matches all pairs of shots in the archive to obtain scene duplicates. Given a pair of shots, the approach evaluates the similarity between the shots assuming all possible temporal shifts (frame by frame). The similarity between a pair of shots is evaluated by matching all possible combinations of pairs of trajectories between the two shots. If a certain fraction of the trajectory pairs are similar to some extent, the similarity between the shots will become close to one (a match). The inconsistency sequences and discontinuity sequences generated in the offline process are referred to when evaluating the similarity between trajectories. In addition, the discontinuity sequences are used to accelerate the similarity evaluation.
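As a concrete illustration of this offline/online split, the following Python sketch shows one possible data layout and the online matching loop. The class and function names (Trajectory, Shot, online_matching, shot_similarity) are ours and not from the paper; the actual similarity function is the one defined in Sec. 2.3 and is passed in here as a callback.

```python
# A minimal sketch of the offline/online data model described above.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Trajectory:
    start_frame: int           # first frame at which the feature point is tracked
    inconsistency: np.ndarray  # c(t; T): one value per tracked frame (offline)
    discontinuity: np.ndarray  # d(t; T): binary, 1 at local maxima (offline)

@dataclass
class Shot:
    shot_id: str
    trajectories: List[Trajectory] = field(default_factory=list)  # up to 200 per shot

def online_matching(shots: List[Shot], shot_similarity, threshold: float):
    """Online process: score every shot pair and keep those above a threshold."""
    duplicates = []
    for i in range(len(shots)):
        for j in range(i + 1, len(shots)):
            # shot_similarity is assumed to try all temporal offsets internally.
            score = shot_similarity(shots[i], shots[j])
            if score >= threshold:
                duplicates.append((shots[i].shot_id, shots[j].shot_id, score))
    return duplicates
```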

2.2 Feature Calculation

Before we explain how the similarity between shots S1 and S2 is evaluated, we explain the features used to evaluate similarity during the offline process. Let us assume that shot S_i is composed of n_i = |S_i| trajectories T_i^j, j = 1 ... n_i. The issue is how to evaluate the similarity between two trajectories T_1^j and T_2^k.

To detect motion discontinuities, we use inconsistency [19], which is known to work well. An inconsistency is defined for a small spatio-temporal patch in the spatio-temporal intensity space. A typical patch size is 7 [pixels] × 7 [pixels] × 3 [frames], so we use this size. To obtain the inconsistency, the following matrix is first computed within a patch:

M = \begin{pmatrix} \sum P_x^2 & \sum P_x P_y & \sum P_x P_t \\ \sum P_y P_x & \sum P_y^2 & \sum P_y P_t \\ \sum P_t P_x & \sum P_t P_y & \sum P_t^2 \end{pmatrix}   (1)

where P(x, y, t) is the intensity in the spatio-temporal space, and ∇P = (P_x, P_y, P_t) is its gradient. If the corresponding patch contains a single uniform motion, M is rank-deficient, i.e., rank(M) < 3. In addition, the following matrix is also defined without temporal information:

M^\diamond = \begin{pmatrix} \sum P_x^2 & \sum P_x P_y \\ \sum P_y P_x & \sum P_y^2 \end{pmatrix}   (2)

We need to handle the situation in which the corresponding image patch contains degenerate image structures (such as edges, not corners), i.e., rank(M^\diamond) < 2. The inconsistency is defined as follows:

Discrete measure:   \Delta r = \mathrm{rank}(M) - \mathrm{rank}(M^\diamond)   (3)

Continuous measure: \Delta r = \frac{\lambda_2 \cdot \lambda_3}{\lambda_1^\diamond \cdot \lambda_2^\diamond}   (4)

where λ1 ≥ λ2 ≥ λ3 and λ1^♦ ≥ λ2^♦ are the eigenvalues of M and M^♦. If a patch contains uniform motion, the inconsistency will be 0, while if it contains inconsistent motion, the inconsistency will be 1 (discrete measure) or close to 1 (continuous measure). We use the continuous measure for our purpose.

Inconsistency was first used to compare small (in terms of both image size and duration) template videos with target videos, to see if the target contained a similar object with similar motion to that of the template. To do so, the study [19] decomposes the template video as well as the target video into patches and then evaluates the correlation between all combinations of pairs of patches between the videos. This requires a computation of inconsistencies for all patches, as well as a comparison of all combinations of patch pairs, which is computationally very demanding.

We borrow the idea of inconsistency, but use it in a different way. We first apply the KLT tracker to a shot, limiting the maximum number of trajectories to 200. Then, for each trajectory, we extract the inconsistency values of the patches on the trajectory (Fig. 6), resulting in a sequence of values for each trajectory. We call such a sequence an inconsistency sequence, c(t; T_i^j), of trajectory T_i^j, expressing the inconsistency value at time t. Figure 7 shows an example of inconsistency sequences. The computation of inconsistency values is limited to trajectories only, and, as explained later, the comparison of inconsistency values is also limited to pairs of trajectories. After smoothing an inconsistency sequence with a Gaussian, we detect the local maxima and regard them as discontinuities. We generate a discontinuity sequence d(t; T_i^j) of T_i^j as a binary sequence where 1 corresponds to a maximum (discontinuity) and 0 otherwise.
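The sketch below illustrates, under our own naming, how the continuous inconsistency measure of Equ. 1-4 can be evaluated on a 7 × 7 × 3 patch and how an inconsistency sequence can be turned into a binary discontinuity sequence. The Gaussian sigma is an assumption, and the paper detects local maxima within an 11-frame window, whereas this sketch simply compares each smoothed value with its immediate neighbours.

```python
# A sketch of the continuous inconsistency measure (Equ. 1-4) and of turning an
# inconsistency sequence into a binary discontinuity sequence.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def inconsistency(patch):
    """patch: intensity volume of shape (3, 7, 7) = (frames, y, x)."""
    pt, py, px = np.gradient(patch.astype(np.float64))    # temporal and spatial gradients
    g = np.stack([px.ravel(), py.ravel(), pt.ravel()])    # 3 x n gradient matrix
    M = g @ g.T                                           # Equ. 1 (sums of products)
    M_sp = M[:2, :2]                                      # Equ. 2: spatial part only
    lam = np.sort(np.linalg.eigvalsh(M))[::-1]            # lambda1 >= lambda2 >= lambda3
    lam_sp = np.sort(np.linalg.eigvalsh(M_sp))[::-1]      # lambda1' >= lambda2'
    denom = lam_sp[0] * lam_sp[1]
    return (lam[1] * lam[2]) / denom if denom > 0 else 0.0  # Equ. 4, continuous measure

def inconsistency_sequence(video, track, half=3):
    """video: (T, H, W) grayscale frames; track: list of (frame, x, y) points.
    Interior frames and image positions are assumed for simplicity."""
    seq = []
    for t, x, y in track:
        patch = video[t - 1:t + 2,                          # 3 consecutive frames
                      int(y) - half:int(y) + half + 1,      # 7 pixels vertically
                      int(x) - half:int(x) + half + 1]      # 7 pixels horizontally
        seq.append(inconsistency(patch))
    return np.array(seq)

def discontinuity_sequence(c, sigma=2.0):
    """Smooth c(t) with a Gaussian and mark local maxima with 1, giving d(t)."""
    s = gaussian_filter1d(c, sigma)
    d = np.zeros_like(s, dtype=int)
    d[1:-1] = (s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])
    return d
```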

2.3 Shot Similarity

By using inconsistency sequences and discontinuity sequences, we can then define the similarity between trajectories. The basic idea is to check whether the motion discontinuities of each trajectory occur at almost the same timing. To evaluate the similarity of trajectories T1 and T2, we compute the local normalized cross correlation centered at all discontinuities within a certain window width w, and the average of these values is used as the similarity of the trajectories. The similarity of trajectories T1 and T2 given a temporal offset τ is then defined as follows:

\mathrm{Sim}(T_1 \| T_2; \tau) = \frac{\sum_t \left( d(t;T_1) + d(t-\tau;T_2) \right) NCC(c(t;T_1), c(t-\tau;T_2); t-w, t+w)}{\sum_t \left( d(t;T_1) + d(t-\tau;T_2) \right)}   (5)

NCC(c(t;T_1), c(t;T_2); t_1, t_2) = \frac{\sum_{t=t_1}^{t_2} (c(t;T_1) - \bar{c}(T_1))(c(t;T_2) - \bar{c}(T_2))}{\sqrt{\sum_{t=t_1}^{t_2} (c(t;T_1) - \bar{c}(T_1))^2 \sum_{t=t_1}^{t_2} (c(t;T_2) - \bar{c}(T_2))^2}}   (6)

where NCC(c_1, c_2; t_1, t_2) is the normalized cross correlation between inconsistency sequences for t_1 ≤ t ≤ t_2, and \bar{c}(T_i) is the average of c(t; T_i) for t_1 ≤ t ≤ t_2. NCC is computed at frames where d(t; T_1) = 1, d(t − τ; T_2) = 1, or both, and averaged.

Figure 6: Calculation of features for trajectories in a shot.

Figure 7: Inconsistency of corresponding feature points. The graph shows two inconsistency sequences corresponding to the feature points of two scene duplicate shots. Since the feature points are in correspondence, the two inconsistency sequences have almost identical temporal patterns with a slight temporal offset (5-10 frames).

On the basis of the similarity between trajectories, we then define the similarity between a trajectory and a shot, i.e., a set of trajectories. From here onwards, the temporal offset τ is omitted for the sake of readability. The similarity between the j-th trajectory in S1 (T_1^j) and another shot (S2) is defined as the similarity between the trajectory T_1^j and the most similar trajectory among T_2^k:

\mathrm{Sim}(T_1^j \| S_2) = \max_k \mathrm{Sim}(T_1^j \| T_2^k)   (7)

The similarity between shots S1 and S2 is defined as follows:

\mathrm{Sim}(S_1 \| S_2) = \underset{\text{top } \rho\%,\, j}{\mathrm{avg}}\; \mathrm{Sim}(T_1^j \| S_2)   (8)

\mathrm{Sim}(S_1, S_2) = \frac{1}{2}\left( \mathrm{Sim}(S_1 \| S_2) + \mathrm{Sim}(S_2 \| S_1) \right)   (9)

Among Sim(T_1^j || S_2), the top ρ% are used for the average. Between shots of scene duplicates, some of the trajectories in one shot will match those of the other shot, but other trajectories will not match (see Fig. 8 for examples of outlier trajectories). The parameter ρ should be small enough to eliminate trajectories which cannot match, yet at the same time, ρ should be large enough to ensure that a sufficient fraction of trajectories match between the detected scene duplicate shots.

Figure 8: Example of scene duplicates including feature points which are occluded in one shot. The two shots are scene duplicates. Feature points on the faces match each other; i.e., they are inliers. However, feature points on the occluded background, as well as feature points around the tie which are occluded by the video caption in the other shot, are outlier trajectories.
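A compact sketch of Equ. 5-9 follows. It assumes each trajectory is represented by two equal-length arrays c (inconsistency) and d (binary discontinuities) on a common frame axis, with one shared offset τ per shot pair; the function names are ours, and the defaults w = 14 and ρ = 50% follow Sec. 3.1.

```python
# A sketch of the trajectory and shot similarities (Equ. 5-9).
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equal-length windows (Equ. 6)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def traj_similarity(c1, d1, c2, d2, tau, w=14):
    """Sim(T1||T2; tau): local NCC centered at discontinuities of either trajectory (Equ. 5)."""
    num, den = 0.0, 0.0
    for t in range(w, len(c1) - w):
        t2 = t - tau
        if t2 < w or t2 >= len(c2) - w:
            continue
        weight = d1[t] + d2[t2]
        if weight == 0:
            continue
        num += weight * ncc(c1[t - w:t + w + 1], c2[t2 - w:t2 + w + 1])
        den += weight
    return num / den if den > 0 else 0.0

def shot_similarity(shot1, shot2, tau, w=14, rho=0.5):
    """Equ. 7-9 for one temporal offset tau.
    shot1/shot2: non-empty lists of (c, d) pairs, one per trajectory."""
    def one_way(a, b):
        best = [max(traj_similarity(c1, d1, c2, d2, tau, w) for (c2, d2) in b)  # Equ. 7
                for (c1, d1) in a]
        best.sort(reverse=True)
        k = max(1, int(len(best) * rho))
        return sum(best[:k]) / k                                                # Equ. 8
    return 0.5 * (one_way(shot1, shot2) + one_way(shot2, shot1))                # Equ. 9
```

In the online process, shot_similarity would be evaluated for every candidate offset τ (frame by frame) and the maximum score kept for the shot pair.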

2.4 Filtering using Discontinuity Sequences

The defined shot similarity can be applied to all combinations of shot pairs in video archives, and pairs having larger similarities can then be detected as scene duplicates. However, an evaluation of shot similarity requires the computationally costly normalized cross correlation, even though only a piecewise evaluation around discontinuities (d(t; T1) = 1, d(t − τ; T2) = 1, or both) is performed. We tested our approach by using 50 video shots captured from TV news programs as the dataset for the experiment. The total length of these shots is around 8 minutes, and Table 1 shows the processing time of this experiment. The system took more than 4.5 hours to detect all scene duplicates in the 8-minute archive, which is far from practical.

Table 1: Processing time of scene duplicate detection.
  Processing                             Time (h:mm:ss)
  Similarity calculation (Trajectory)    3:59:43
  Similarity calculation (Shot)          0:02:20
  Result file output                     0:09:08
  Others                                 0:28:38
  Total                                  4:39:49
  Number of shots: 50; number of shot pairs: 1,225; number of trajectory pairs: 8,883,931; number of trajectory similarity calculations: 2,998,411,843.

The trajectory-level similarity calculation took the most time (around 4 hours). This is mainly because we estimate the temporal offset τ by sliding the two trajectories and calculating their NCC-based similarity at all possible offsets. Moreover, this evaluation is made for all possible combinations of trajectory and shot pairs within the video archive (in total, 2,998,411,843 possible offsets were tested in our experiment). In other words, the required computational amount is quadratic in the size of the video archive, and the computational burden of this process is excessive. Further acceleration is thus required.

We propose an acceleration method using discontinuity sequences. Since discontinuity sequences are binary sequences, their comparison is computationally much lighter than the inconsistency-based trajectory similarity. In this study, we make two assumptions:

- Assumption 1: if two trajectories match in terms of trajectory similarity, they have to share at least a certain number of matching discontinuities, namely θd.

- Assumption 2: if two trajectories match in terms of trajectory similarity, they have to share at least a certain ratio of matching discontinuities, namely θRoD.

Consider two different videos containing no moving object, or both containing large background regions. Many trajectories with few discontinuities and similar inconsistency patterns will be extracted from these two videos. In this case, Equ. 5 and Equ. 6 will fail to distinguish them from each other and wrongly detect them as scene duplicates. Here, we use Assumption 1 to solve this problem. On the other hand, consider two different videos both containing a moving object with large or fast motion, so that the trajectories extracted from these two videos have too many discontinuities. The more discontinuities these trajectories have, the more matching discontinuities they may share (e.g., N ≥ θd). Thus, Assumption 1 by itself is not enough to cope with this situation. We use Assumption 2 to ensure that the detection of matching discontinuities is not due to large motion or noise.

With these assumptions, we can discard all trajectory pairs that do not satisfy the following conditions.

- Condition 1: N ≥ θd with N = N1 ∩ N2, where N1 and N2 are the numbers of discontinuities within the overlapping section between the two trajectories T1 and T2, and N is the number of matching discontinuities between N1 and N2.

- Condition 2:
  Strict mode: N/N1 ≥ θRoD and N/N2 ≥ θRoD
  Normal mode: 2N/(N1 + N2) ≥ θRoD
  Loose mode: N/N1 ≥ θRoD or N/N2 ≥ θRoD

Figure 9: Detection of matching discontinuities.

For Condition 1, we can discard all trajectories having fewer than θd discontinuities. This reduces the number of candidate trajectories, and it might reduce the number of candidate shots, which is very helpful for acceleration. After that, we can easily filter out a pair of trajectories either one of which has fewer than θd discontinuities within the overlapping section of the two trajectories. This check can be performed quite efficiently by using a cumulative discontinuity sequence: each value of the sequence is the total number of discontinuities from the start of the sequence to the location of the value. Next, we count the number of matching discontinuities between the two trajectories and filter out pairs having fewer than θd matching discontinuities. With trajectory T1 fixed, trajectory T2 is gradually slid by changing the temporal offset τ (Fig. 9). Here, we allow a one-frame difference to compensate for the variance of the location of discontinuities. This process is also very efficient.

For Condition 2, we propose three modes corresponding to different extents of the condition restriction. Tightening the restriction reduces the number of noisy trajectory pairs but also eliminates many helpful ones. Relaxing the restriction tends to ensure more complete detection but decreases the efficiency of noise filtering. We tested all three modes in our experiments to evaluate the performance of our filtering method (Sec. 3.5).
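The following sketch shows one way to implement this filter, under our own function names and with θd and θRoD as in the text. The cumulative sequence gives constant-time counts within the overlap, and matches are counted with the one-frame tolerance described above.

```python
# A sketch of the discontinuity-based filter (Conditions 1 and 2).
import numpy as np

def cumulative(d):
    """Cumulative discontinuity sequence: value at t = #discontinuities in d[0..t]."""
    return np.cumsum(d)

def count_in_range(cum, lo, hi):
    """Number of discontinuities in frames [lo, hi] using the cumulative sequence."""
    return int(cum[hi] - (cum[lo - 1] if lo > 0 else 0))

def matching_discontinuities(d1, d2, tau):
    """Count discontinuities of T1 that meet one of T2 shifted by tau,
    allowing a one-frame difference."""
    n = 0
    for t in np.flatnonzero(d1):
        t2 = t - tau
        if 0 <= t2 < len(d2) and d2[max(0, t2 - 1):t2 + 2].any():
            n += 1
    return n

def passes_filter(d1, d2, tau, theta_d=3, theta_rod=0.5, mode="loose"):
    """Keep a trajectory pair for offset tau only if Conditions 1 and 2 hold."""
    lo, hi = max(0, tau), min(len(d1), len(d2) + tau)   # overlap, on T1's time axis
    if hi - lo <= 0:
        return False
    cum1, cum2 = cumulative(d1), cumulative(d2)
    n1 = count_in_range(cum1, lo, hi - 1)               # discontinuities of T1 in overlap
    n2 = count_in_range(cum2, lo - tau, hi - tau - 1)   # discontinuities of T2 in overlap
    if n1 < theta_d or n2 < theta_d:                    # cheap pre-check described above
        return False
    n = matching_discontinuities(d1, d2, tau)
    if n < theta_d:                                     # Condition 1
        return False
    if mode == "strict":                                # Condition 2
        return n / n1 >= theta_rod and n / n2 >= theta_rod
    if mode == "normal":
        return 2 * n / (n1 + n2) >= theta_rod
    return n / n1 >= theta_rod or n / n2 >= theta_rod   # loose
```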

3. EXPERIMENTS

3.1 Setup

We tested our approach with actual broadcast videos. 50 video shots were captured from TV news programs, and they composed the dataset for the experiment. The total length of these shots was around 8 minutes. The dataset generated 1,225 shot pairs, and it included 20 scene duplicate pairs (29 shots), which were selected manually. The other shots included 49 object duplicate pairs. These videos were stored in MPEG-1 format at 30 frames per second and had an image size of 352 × 240 pixels. All experiments were performed on a PC (Dell Precision T3400, 2.99 GHz, 3.25 GB RAM, Windows XP). All code was written in C++ and compiled with GCC.

The experiment required certain parameters to be set first. The KLT tracker needs the number of feature points to be tracked; this value was set to 200, as mentioned before. The approach requires two shots to overlap by more than a certain number of frames θo. If the overlap is less than θo = 120 frames, the similarity between the pair of shots is defined to be zero (no match). A shot was considered to lack enough motion information and to be unsuitable for our trajectory-based approach if it contained θt = 20 or fewer trajectories with more than θDoT = 5 discontinuities. The window size used to calculate NCC was set to w = 14 (2w + 1 = 29 frames in total). The window size used to calculate the local maxima of inconsistency sequences, which yield the discontinuity sequences, was set to w = 5 (2w + 1 = 11 frames in total). The parameter ρ used to calculate the similarity between shots was set to ρ% = 50%. The minimum number of agreed discontinuities θd was set to 3. The minimum matching discontinuity ratio θRoD was varied from 0 to 1. All parameters, including the ones mentioned above, were determined empirically.

The feature extraction phase rejected 7 shots because they had θt or fewer trajectories with more than θDoT discontinuities. One of the 29 shots that made up the 20 scene duplicates was rejected in this phase, so one true scene duplicate was rejected as false. Therefore, among the 43C2 = 946 remaining pairs of shots, 19 pairs were scene duplicates and remained to be detected by our approach.

3.2 Reference Technique

As mentioned in Sec. 1.3, little attention has been paid to distinguishing object duplicates from scene duplicates. In this paper, as a reference to compare with the proposed approach, we used an interest-point-based technique with a local description. The algorithm, named LIP-IS+OOS, was proposed by Ngo et al. [16], and it was applied to the same 50-video dataset described above. LIP-IS+OOS was originally used to detect general near duplicates, including both object duplicates and scene duplicates. Videos are segmented into keyframes, and a DoG detector is used to detect interest points in these keyframes. A one-to-one symmetric (OOS) interest point matching strategy is used to match interest points across frames. The matching patterns of keyframes are captured with two histograms of matching orientations of interest points. The histograms are constructed by aligning two frames horizontally and vertically. Depending on the alignment, a histogram is composed of the quantized angles formed by the matching lines of interest points and the horizontal or vertical axis. The homogeneity of the histogram patterns is measured as the keyframe similarity by evaluating the mutual information between the two histograms. Entropy is used to reveal the mutual information. The similarity between a pair of shots is then evaluated based on keyframes extracted from the two shots.

3.3 Reference Technique Implementation

In [16], LIP-IS+OOS was tested using a keyframe database instead of a video archive. By observing our experimental results, we found that how keyframes are selected has a significant impact on the accuracy of near-duplicate "shot" detection. Therefore, we performed the experiment while changing the density of keyframe selection. We extracted multiple keyframes from each video shot in the 50-video dataset. To change the density, the shot length was equally divided, and the frames at the points of division were selected as keyframes. In equation terms, given the shot length L, the (i·L/(N+1))-th frames are extracted as keyframes, with i = 1 ... N and N = 1 ... 5. The similarity of each shot pair was evaluated by calculating the maximum similarity among all possible keyframe pairs of the two shots:

\mathrm{Sim}(S_1, S_2) = \max_{i,j} \mathrm{Sim}(K_1^i, K_2^j)   (10)
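A small sketch of this keyframe-density scheme and of Equ. 10 is given below; the keyframe similarity itself (LIP-IS+OOS) is treated as a black box, and all names and the stand-in similarity are ours, for illustration only.

```python
# A sketch of equally spaced keyframe selection and the shot-pair similarity of Equ. 10.
from itertools import product

def keyframe_indices(shot_length, n):
    """Frame indices at the N equally spaced division points of a shot."""
    return [i * shot_length // (n + 1) for i in range(1, n + 1)]

def shot_pair_similarity(keyframes1, keyframes2, keyframe_similarity):
    """Equ. 10: maximum similarity over all keyframe pairs of the two shots."""
    return max(keyframe_similarity(k1, k2) for k1, k2 in product(keyframes1, keyframes2))

if __name__ == "__main__":
    print(keyframe_indices(300, 3))                       # [75, 150, 225]
    dummy = lambda a, b: 1.0 - abs(a - b) / 300.0         # stand-in for LIP-IS+OOS
    print(shot_pair_similarity(keyframe_indices(300, 3),
                               keyframe_indices(290, 3), dummy))
```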

3.4 Accuracy Analysis

Figure 10 shows the performance of our scene duplicate detection as well as that of LIP-IS+OOS. The comparison shown in the figure does not involve filtering by discontinuity sequences. An evaluation using average precision (AP) was also performed (with Equ. 11), and Table 2 shows the results.

\mathrm{AveP} = \frac{\sum_{r=1}^{N} P(r) \cdot rel(r)}{\text{number of relevant documents}}   (11)

where r is the rank, N the number of retrieved items, rel() a binary function on the relevance of a given rank, and P() the precision at a given cut-off rank.
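A minimal sketch of Equ. 11 over a ranked result list, using our own function name, where relevance holds rel(r) for each rank in retrieval order:

```python
# A sketch of the average precision measure (Equ. 11).
def average_precision(relevance, num_relevant):
    score, hits = 0.0, 0
    for r, rel in enumerate(relevance, start=1):   # r is the rank
        if rel:
            hits += 1
            score += hits / r                      # P(r) * rel(r)
    return score / num_relevant if num_relevant else 0.0

# e.g. average_precision([1, 0, 1, 0], num_relevant=2) == (1/1 + 2/3) / 2
```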

Table 2: Average precision comparison.
  Methods                  AveP (%)
  LIP-IS+OOS (N=1)         31.30
  LIP-IS+OOS (N=2)         22.91
  LIP-IS+OOS (N=3)         32.84
  LIP-IS+OOS (N=4)         36.10
  LIP-IS+OOS (N=5)         29.09
  PROPOSED METHOD (SDD)    59.58

Figure 10: Precision-recall curve (without acceleration).

From Fig. 10, we can see that the proposed approach achieves a high precision rate (78.57%) with the recall fixed at 55%. From Fig. 10 and Table 2, it can be said that the proposed approach enables more accurate and more complete scene duplicate detection than LIP-IS+OOS. Even without the filtering method, our approach outperforms LIP-IS+OOS at almost all fixed recalls. One thing that we should note here is that LIP-IS+OOS was originally proposed for detecting general near duplicates, not only the scene duplicates taken of the same event, which is the main purpose of our study. For this reason, it stands to reason that LIP-IS+OOS underperforms the proposed approach in terms of scene duplicate detection. This is also considered to be the reason why the precision-recall curve of LIP-IS+OOS shows such a random shape in Fig. 10 instead of a standard trade-off relationship.

Figure 11: Examples of detected scene duplicates.

Examples of detected scene duplicates are shown in Fig. 11. They show that our approach is sufficiently invariant to camera location, color information, and object scale. On the other hand, one current problem is that the recall rate is not so high. In the upper example shown in Fig. 12, because the viewing angle of the object is too different, the discontinuity sequences of most of the detected trajectories do not match. We have to note that scene duplicates in which too large a viewpoint change occurs cannot be detected by our approach. In the lower example in Fig. 12, although the most informative motion is in the head part, most of the feature points are detected in the body part, the background region, and the video captions. Therefore, the similarity between these two shots becomes low. This problem should be solved by applying a further treatment to the trajectory detection phase, which is a future issue.

Figure 12: Examples of false negatives.

3.5 Acceleration Evaluation

We evaluated the effects of our accelerations. The results are shown in Fig. 13 and Fig. 14. The former illustrates the relationship between processing time and average precision for the three filtering modes and is generated by shifting the threshold θRoD. The latter illustrates the precision-recall curves of scene duplicate detection with and without filtering, with θRoD set to 0.5. SMF, NMF, and LMF respectively indicate strict, normal, and loose mode filtering.

Figure 13: Comparison of three filtering modes. The graph is generated by shifting the threshold θRoD and evaluating the average precision and processing time of scene duplicate detection. The closer the curve is to the top-left corner, the better the performance it indicates (less processing time and higher average precision).

Figure 14: Accuracies with and without filtering. θRoD is set to 0.5 in all cases.

From Fig. 13, it is obvious that the filtering method in all three modes can dramatically accelerate scene duplicate detection without decreasing detection accuracy. Regarding loose mode filtering using discontinuity sequences, when θRoD is set to 0.5, the approach achieves a 10.58 times speedup. In this case, the processing time is reduced from more than 4.5 hours (Table 1) to around 26 minutes. The number of possible offsets evaluated in the trajectory similarity calculation phase drops from 2,998,411,843 to 164,531,932 (5.49%).

Also, in Fig. 13 and Fig. 14, loose mode (SDD+LMF) showed the best accuracy among the three filtering modes. In Fig. 13, given a certain fixed processing time (e.g. 1 hour), LMF gives higher average precision than SMF and NMF. On the other hand, LMF takes the shortest time to reach a certain fixed average precision (e.g. 60%). The reason that LMF outperforms the other two modes is that the loose mode is more invariant to changes in scale and motion intensity. As defined in Sec. 1.2, scene duplicates are videos taken by different cameramen. This leads to possibly different camerawork, e.g. close-up, mid-range, and far-range, which causes variations in the scales of moving objects in the video. The upper shot pair in Fig. 3 and the 4th pair in Fig. 11 serve as a prime example.

Since the motion of video objects in the real world is uniform, a close-up view tends to show larger or faster motion than mid-range and far-range views. By observing the experimental results, we found that the inconsistency and discontinuity features are sensitive to large motions, such that more noise discontinuities are generated from a trajectory with larger motion. The resulting difference between the discontinuity numbers N1 and N2 in Condition 2 (Sec. 2.4) makes a true scene duplicate more likely to be rejected by the strict mode and normal mode filtering. Therefore, we chose to use the loose mode in further experiments.

Moreover, the acceleration treatments improved the detection accuracy. In the case of loose mode filtering, when θRoD is set to 0.5, the approach achieves a better average precision (61.87%) than without filtering (59.58%). The best average precision (62.73%) is obtained when θRoD = 0.35.

4. FUSION OF LOCAL-FEATURE-BASED REGISTRATION WITH THE PROPOSED APPROACH

From the experiments described in Sec. 3, we found that LIP-IS+OOS cannot distinguish object duplicates, e.g. the example in Fig. 2, from scene duplicates. On the other hand, the proposed trajectory-based matching approach detects duplicates without depending on visual or spatial similarity. It can reject object duplicates of different events even if the two shots are very similar in terms of image appearance.

However, there are still limitations to an approach that depends only on trajectory motion. For instance, as illustrated in Fig. 15, the two videos are completely unrelated but are falsely recognized as scene duplicates only because the persons' faces have similar motions. Conversely, LIP-IS+OOS can easily avoid such false positives because it depends not on motion information but on local-feature-based visual similarity. In particular, since it uses spatial coherence as the video matching criterion, LIP-IS+OOS is much more sensitive to the difference between completely unrelated videos. In other words, although LIP-IS+OOS cannot distinguish object duplicates from scene duplicates, its local-feature-based registration strategy enables it to effectively distinguish general near duplicates from noisy video pairs that are entirely visually different.

Figure 15: Object duplicates with similar motion pattern.

We therefore tried to use the local-feature-based method to remove noise and improve the precision of scene duplicate detection by filtering out entirely visually different video pairs. LIP-IS+OOS was used as preprocessing, and the same experiments as described in Sec. 3 were performed. To ensure completeness of detection, we loosened the threshold setting of LIP-IS+OOS so that it generated as many near-duplicate candidates as possible. The proposed approach was then applied with only these candidates as the target, and the performance of scene duplicate detection was evaluated.
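Conceptually, this fusion is a two-stage cascade. The sketch below, with stand-in scoring functions and names of our own, shows only the control flow: a loosely thresholded LIP-IS+OOS pass proposes candidate pairs, and SDD with loose-mode filtering is run only on those candidates.

```python
# A sketch of the fusion described above (LIP-IS+OOS as prefilter, then SDD+LMF).
def fused_detection(shots, lip_is_oos_score, sdd_lmf_score,
                    candidate_threshold, duplicate_threshold):
    results = []
    for i in range(len(shots)):
        for j in range(i + 1, len(shots)):
            # Stage 1: cheap keyframe-based check with a loose threshold,
            # so that as many genuine candidates as possible survive.
            if lip_is_oos_score(shots[i], shots[j]) < candidate_threshold:
                continue
            # Stage 2: trajectory-based scene duplicate detection on survivors only.
            score = sdd_lmf_score(shots[i], shots[j])
            if score >= duplicate_threshold:
                results.append((i, j, score))
    return results
```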

The results are illustrated in Fig. 16 and Table 3.

Figure 16: Fusion of local-feature-based registration with the proposed approach. θRoD is set to 0.5 in all cases.

Table 3: Time and average precision of the fusion of local-feature-based registration with the proposed approach. θRoD is set to 0.5 in all cases.
  Methods                       Time (h:mm:ss)   AveP (%)
  LIP-IS+OOS (N=1)              0:01:14          31.30
  LIP-IS+OOS (N=2)              0:04:58          22.91
  LIP-IS+OOS (N=3)              0:11:11          32.84
  LIP-IS+OOS (N=4)              0:19:30          36.10
  LIP-IS+OOS (N=5)              0:30:50          29.09
  PROPOSED METHOD (SDD)         4:39:49          59.58
  LIP-IS+OOS+SDD+LMF (N=1)      0:02:01          53.63
  LIP-IS+OOS+SDD+LMF (N=2)      0:05:58          57.91
  LIP-IS+OOS+SDD+LMF (N=3)      0:12:23          71.63
  LIP-IS+OOS+SDD+LMF (N=4)      0:20:45          70.42
  LIP-IS+OOS+SDD+LMF (N=5)      0:32:22          74.34
  SDD+LMF                       0:26:27          61.88

From Fig. 16, we can see the performance improvement of the proposed approach with LIP-IS+OOS as preprocessing. Object duplicates whose inconsistency patterns are very similar, e.g. the examples in Fig. 15, are successfully filtered out. When extracting three keyframes per shot (N = 3) and setting θRoD to 0.5, we obtain very accurate scene duplicate detection with 100% precision and 60% recall. Moreover, as illustrated in Table 3, the detection speed is also increased by removing noise and thereby reducing the number of candidate shot pairs. The minimum processing time is obtained for N = 1 keyframe per shot, but the accuracy is lower. In the case of N = 3, the processing time falls from around 26 minutes to less than 13 minutes, which is more than a two-fold reduction. Moreover, the average precision goes up from 61.88% to 71.63%. Therefore, we recommend N = 3 for the LIP-IS+OOS preprocessing.

5. APPLICATION TO BROADCAST VIDEO ARCHIVE

We then applied the proposed scene duplicate detection to videos obtained from an actual broadcast video archive. Table 4 describes the videos used in this experiment. We employed news programs from five channels, three hours each, 15 hours in total. They were aired on the same day, so some news topics were expected to be shared among different channels. After shot decomposition, we manually excluded anchorperson shots to reduce the number of candidate shots and save computation. Although we excluded these shots manually, this process can be automated if needed, since anchorperson shot detection as well as camera motion analysis has been widely studied and many good algorithms have already been proposed. The number of shots extracted and used is shown in Table 4.

Table 4: Dataset obtained from video archive.
  Stations      NHK    NTV    TBS    FTV    TVA
  Duration      3 hours per station
  Aired time    06:00-08:00, 06:00-09:00, or 19:00-20:00, depending on the station
  #Shots        80     83     75     89     97

We then applied scene duplicate detection to these videos. We chose to use the loose mode in this experiment (θRoD = 0.5) and N = 3 for the LIP-IS+OOS preprocessing. The processing time is shown in Table 5. LIP-IS+OOS took the most time (more than 19 hours). After that, scene duplicate detection took about 42 minutes. The estimated processing time of the original system (SDD) on the same 15-hour archive would be more than 341 hours (more than 14 days). With the filtering using discontinuity sequences and the local-feature-based registration, the processing time fell from more than 14 days to around 20 hours, which is around a 17 times speedup.

Table 5: Processing time of scene duplicate detection on the 15-hour broadcast video archive (LIP-IS+OOS+SDD+LMF, N=3; θRoD is set to 0.5). Note that the estimated processing time of the original system (SDD) on the same 15-hour archive would be more than 341 hours (more than 14 days).
  Processing                             Time (hh:mm:ss)
  LIP-IS+OOS (N=3)                       19:24:34
  Matching discontinuity detection       00:02:15
  Similarity calculation (Trajectory)    00:15:36
  Similarity calculation (Shot)          00:05:10
  Result file output                     00:16:59
  Others                                 00:02:28
  Total                                  20:07:02

6. CONCLUSIONS

We proposed an approach to detect scene duplicates as a variant of near duplicates. The approach is invariant to camera angle differences, background complexity, and video captions. Experiments show that our approach can successfully detect scene duplicates and exclude object duplicates. The current trajectory-based matching technique can handle frequent discontinuous motion, and so far it works well mostly for videos showing faces. Adaptation to other types of video is a future issue.

Another issue is the application to regular broadcast streams. When we assume a broadcast video archive of 5 channels containing three-hour programs per day, the total amount is 15 hours for one day (if we focus only on news programs, which are more likely than other types of programs to contain topics shared among different channels). It is also reasonable to assume that news programs from more than one week before the current day are out of date and do not share scene duplicates with today's 15-hour videos. Then, the issue is to detect scene duplicates, firstly within today's videos (1 day), and secondly between today's videos and the videos of the last week (7 days). The total processing time is thus 1 + 7 = 8 times the time needed to perform detection on 15 hours of video, which is around 8 × 20 = 160 hours (N = 3 and θRoD = 0.5). Since the shot similarity evaluations of different shot pairs are independent calculations, scene duplicate detection may be performed in parallel. By using 7 CPUs, the current system would take 160/7 < 23 hours to detect all duplicates for a one-day archive, which is faster than real time.

7. ACKNOWLEDGMENTS

Our thanks go out to Professor Chong-Wah Ngo at City University of Hong Kong for providing us with the binary of the LIP-IS+OOS method.

8. REFERENCES

[1] Y. Caspi and M. Irani. Spatio-temporal alignment of sequences. IEEE Trans. Pattern Anal. Mach. Intell., 24(11):1409-1424, 2002.
[2] S.-C. S. Cheung and A. Zakhor. Efficient video similarity measurement and search. In ICIP, 2000.
[3] D. DeMenthon and D. S. Doermann. Video retrieval using spatio-temporal descriptors. In ACM Multimedia, pages 508-517, 2003.
[4] L.-Y. Duan, J. Wang, Y. Zheng, J. S. Jin, H. Lu, and C. Xu. Segmentation, categorization, and identification of commercial clips from TV streams using multimodal analysis. In ACM Multimedia, pages 201-210, 2006.
[5] A. Hampapur and R. M. Bolle. Comparison of distance measures for video copy detection. In ICME, 2001.
[6] T. C. Hoad and J. Zobel. Fast video matching with signature alignment. In Multimedia Information Retrieval, pages 262-269, 2003.
[7] W. H. Hsu and S.-F. Chang. Topic tracking across broadcast news videos with visual duplicates and semantic concepts. In ICIP, pages 141-144, 2006.
[8] I. Ide, H. Mo, N. Katayama, and S. Satoh. Topic threading for structuring a large-scale news video archive. In CIVR, pages 123-131, 2004.
[9] K. Iwamoto, E. Kasutani, and A. Yamada. Image signature robust to caption superimposition for video sequence identification. In ICIP, pages 3185-3188, 2006.
[10] A. Jaimes, S.-F. Chang, and A. C. Loui. Duplicate detection in consumer photography and news video. In ACM Multimedia, pages 423-424, 2002.
[11] A. Joly, O. Buisson, and C. Frelicot. Content-based copy retrieval using distortion-based probabilistic similarity search. IEEE Transactions on Multimedia, 9(2):293-306, February 2007.
[12] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. Robust voting algorithm based on labels of behavior for video copy detection. In ACM Multimedia, pages 835-844, 2006.
[13] J. Lejsek, F. H. Asmundsson, B. T. Jonsson, and L. Amsaleg. Scalability of local image descriptors: a comparative study. In ACM Multimedia, pages 589-598, 2006.
[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[15] A. Nagasaka and T. Miyatake. A quick scene classification method based on compact encoding of video feature sequence. Systems and Computers in Japan, 31(1):102-108, 2000.
[16] C.-W. Ngo, W. Zhao, and Y.-G. Jiang. Fast tracking of near-duplicate keyframes in broadcast domain with transitivity propagation. In ACM Multimedia, pages 845-854, 2006.
[17] S. Satoh. News video analysis based on identical shot detection. In IEEE International Conference on Multimedia and Expo, pages 69-72, 2002.
[18] S. Satoh, M. Takimoto, and J. Adachi. Scene duplicate detection from videos based on trajectories of feature points. In Multimedia Information Retrieval, pages 237-244, 2007.
[19] E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR, pages 405-412, 2005.
[20] M. Takimoto, S. Satoh, and M. Sakauchi. Identification and detection of the same scene based on flash light patterns. In ICME, pages 9-12, 2006.
[21] C. Tomasi and T. Kanade. Detection and tracking of point features. Carnegie Mellon University Technical Report CMU-CS-91-132, April 1991.
[22] X. Wu, A. G. Hauptmann, and C.-W. Ngo. Novelty and redundancy detection with multimodalities in cross-lingual broadcast domain. Computer Vision and Image Understanding, to appear.
[23] X. Wu, C.-W. Ngo, and Q. Li. Threading and autodocumenting news videos: a promising solution to rapidly browse news topics. IEEE Signal Processing Magazine, 23(2):59-68, March 2006.
[24] F. Yamagishi, S. Satoh, and M. Sakauchi. A news video browser using identical video segment detection. In PCM, pages 205-212, 2004.
[25] J. Yuan, L.-Y. Duan, Q. Tian, and C. Xu. Fast and robust short video clip search using an index structure. In Multimedia Information Retrieval, pages 61-68, 2004.
[26] Y. Zhai and M. Shah. Tracking news stories across different sources. In ACM Multimedia, pages 2-10, 2005.
