An Evaluation of Color Histogram Based Methods in Video Indexing

R. Kasturi, S.H. Strayer, U. Gargi, S. Antani

Department of Computer Science and Engineering Technical Report CSE-96-053

September 15, 1996

Contents

List of Tables
List of Figures
Abstract
1 Introduction
2 Objectives
  2.1 Color spaces
    2.1.1 RGB color space
    2.1.2 HSV color space
    2.1.3 YIQ color space
    2.1.4 XYZ color space
    2.1.5 L*a*b* color space
    2.1.6 L*u*v* color space
    2.1.7 Munsell color space
    2.1.8 Opponent color axes color space
  2.2 Frame difference measures
3 Approach
  3.1 Installation and configuration of equipment
  3.2 Database generation and experimental protocol design
    3.2.1 Pre-evaluation
  3.3 Dataset
    3.3.1 Evaluation criteria
  3.4 Mapping ground truth to detected cuts
  3.5 Frame difference measures
    3.5.1 Bin to bin difference
    3.5.2 Chi-square test histogram difference
    3.5.3 Histogram intersection
    3.5.4 Average color
    3.5.5 Reference color table
    3.5.6 Dimensionality
    3.5.7 Baseline comparisons
  3.6 Threshold selection
    3.6.1 Significance clustering
    3.6.2 Optimal threshold
    3.6.3 Storyboard creation
  3.7 Study of object motion-based and compressed video based indexing
4 Results/Progress
  4.1 Frame difference method evaluation
  4.2 Color space evaluation
  4.3 Timing considerations
  4.4 Thresholding
  4.5 Final indexed outputs
  4.6 Image object characterization using color
5 Directions
  5.1 Database and experimental protocol modifications
  5.2 Object-motion based and compressed-video indexing
Appendices
A Color conversions
  A.1 HSV color space
  A.2 YIQ color space
  A.3 XYZ color space
  A.4 L*a*b* color space
  A.5 L*u*v* color space
  A.6 Munsell color space
  A.7 Opponent color axes color space
B Handout to volunteers
C Storyboard representations
D Additional tables

List of Tables

1 Description of the original sequences used to generate the dataset
2 False Alarms and Missed Detections using Automatically selected Thresholds for Generated Sequence
3 Average values of weighted false alarm-missed detection error for different methods
4 False Alarms and Missed Detections using Per-Sequence Optimal Thresholds for Generated Sequence
5 Average values of separability metric for different methods
6 Average values of weighted false alarm-missed detection error for different color spaces
7 Running times for a 1000 frame sequence

List of Figures

1 False Alarms and Missed Detections for generated sequence
2 Best 20 color space-method combinations
3 False alarms and missed detections using optimal thresholding for generated sequence
4 Average Efamd using automatic thresholding for different color spaces
5 Processing time for 1000 frame sequence for different color spaces
6 Automatic thresholds obtained for nine original sequences
7 Optimal thresholds obtained for nine original sequences

Abstract

A number of automated methods for indexing a video sequence to facilitate browsing and rapid retrieval of subsequences have been developed in recent years. Many of these methods are based on features computed from color histograms. The methods differ in their choice of discriminating features and in the color space in which those features are computed. Each of the methods reported in the literature draws conclusions from its own limited experiments; there has been no formal evaluation of their performance and no rigorous comparison of all methods on a common database. In this report we describe the results of our effort to perform such an evaluation. We implemented 11 color histogram based algorithms defined over 8 color spaces. The performance of these algorithms was evaluated by comparing the segmentation results obtained by the algorithms to those marked by several human volunteers. A number of short video sequences were used for this phase of testing. Performance is measured using a metric defined as a weighted combination of false alarms and missed detections. In addition to this quantitative measure, for each algorithm we also output a storyboard comprising representative frames from each subsequence marked by the algorithm. These storyboards are useful for subjective appraisal of the ability of the various algorithms to capture the essence of a video sequence. To reduce possible bias introduced by humans' use of contextual information when marking scene cuts, we also tested the algorithms on a long sequence assembled by randomly concatenating a number of short but homogeneous subsequences. Complete results on this sequence are also presented in this report.


1 Introduction

Video data is becoming increasingly important in many commercial and scientific arenas with the advent of applications such as digital broadcasting, interactive-TV, video-on-demand, computer-based training, video-conferencing and multimedia processing tools, and with the development of the hardware and communications infrastructure necessary to support visual applications. The existence of bandwidth to access vast amounts of video multimedia data will lead to the need for video database management techniques that allow browsing and search of video in digital libraries, much as current text databases allow online browsing and keyword search. Finding methodologies to handle the temporal segmentation, storage, retrieval, searching and browsing of digitized video data has been an active area of recent research. Most schemes for video management rely on temporal segmentation, or indexing, of video sequences to impose a temporal semantics and to allow representation by key frames that facilitate browsing and search for image objects. Several techniques for video indexing have been proposed in the literature. This research project aims to evaluate these techniques and characterize their usefulness for a video database system.

Before proceeding with our evaluation report, we define some relevant terminology. A video frame is a single two-dimensional (2D) image. A video sequence is a set of continuous video frames that makes up an episode on a particular topic. In any episode, the human vision system can identify and interpret the various scene changes, also called video scene cuts. A set of frames between two scene cuts is called a subsequence. Identifying the scene cuts and indexing the video sequence by a representative frame for each subsequence facilitates the retrieval of subsequences of interest to the user.

Previous researchers have employed histogram differences to temporally segment video sequences. The input data is converted to one of a number of different color space representations such as RGB, HSV, YIQ, L*a*b*, L*u*v* and Munsell, and a histogram in one or more dimensions is computed from the resulting data. Difference measures are applied to a uniformly subsampled sequence of the computed frame histograms to measure the corresponding change. Large frame differences correspond to a possible scene change. The video sequence is then segmented into subsequences based on "sufficiently" large difference values. The scene cuts generated by an algorithm are compared against cuts marked by humans, which we consider the ground truth. The comparison is based on two quantities: missed detections (MDs), which are cuts that were detected by humans but missed by the algorithm, and false alarms (FAs), which are cuts detected by the algorithm that do not correspond to ground truth cuts.

In the first phase of this project, we perform an evaluation of these color histogram-based techniques.

2 Objectives

The following were the objectives of this phase:

1. Implementation of color histogram-based frame differencing methods.
2. Evaluation of different histogram difference measures.
3. Evaluation of different color spaces as providing useful metrics for frame differences.
4. Use of color to characterize video images and image objects.
5. Research into object-motion based and compressed video based methods for temporal video segmentation in phases two and three.

2.1 Color spaces

The following color spaces were used to represent the color information of video frames. Histograms were computed in the color space after conversion from the RGB data in each video frame. Equations for color space conversions are presented in Appendix A.

2.1.1 RGB color space

These are the original NTSC Red, Green, Blue triples provided by the video frame grabber. This is not to be confused with the CIE RGB space, which has different tristimulus values. A natural model for the RGB space is a uniform cube with the three primaries forming the axes.

2.1.2 HSV color space

This is the Hue, Saturation, Value space often used in computer graphics. It is convenient in that it corresponds to the intuitive notion of color as being composed of a hue component (yellow vs. turquoise) and a saturation component (light pink vs. deep red) as well as the value component representing luminance. The color model can be represented by a single inverted hexagonal cone with the tip at the bottom representing black and the center of the hexagon at the top representing white. The axis of the cone then represents Value while angle around the hexagon represents hue and distance from the axis represents saturation.

2.1.3 YIQ color space

This is the NTSC transmission standard for encoding television pictures. Y represents the luminance, while I and Q represent the in-phase and quadrature-phase components of the chrominance signal.

2.1.4 XYZ color space

This is a CIE artificial primary system in which all three tristimulus values are always positive. Y represents the luminance of the color and is the same as the Y in the YIQ model (which was defined to match the XYZ Y primary).

2.1.5 L*a*b* color space

This is a perceptually uniform color space developed by the CIE to provide a computationally simple measure of color in agreement with the Munsell color system [1]. L* approximately represents the luminance component, while a* correlates with redness-greenness and b* correlates with yellowness-blueness.

2.1.6 L*u*v* color space

This is another uniform color space that evolved from the L*a*b* system and became a CIE standard in 1976.

2.1.7 Munsell color space

The Munsell Renotation System [2, 3] also uses a conceptual breakdown of color into Hue, Value and Chroma (HVC) components. Hue can be expressed not only as a number but also as a name. The Munsell Book of Color realizes this space as a set of color chips laid out in rows and columns for different hue values. For a given constant hue, the chips are intended to be perceived with constant brightness in one dimension and constant chroma in another dimension.

2.1.8 Opponent color axes color space

The RGB space may be converted into an opponent color axis space that approximately separates the luminance and chrominance components. This space was presented in [4] as a useful space for performing video indexing.

2.2 Frame Difference Measures

The following frame difference methods were implemented and evaluated:

1. Bin to bin difference
2. Chi-square test histogram difference
3. Histogram intersection
4. Average color
5. Multiple dimensionalities: where applicable, the above methods were implemented in 1-D, 2-D and 3-D versions
6. Pixel to pixel square difference
7. Likelihood ratio change detection

The last two methods were implemented for comparison purposes. These methods are described in detail in section 3.5.

3 Approach

The following tasks were identified as being necessary to fulfill the objectives defined above.

1. Installation and configuration of equipment for video capture.
2. Development of a ground-truthed database and establishment of the experimental protocols required to create the database and analyze the processing results.
3. Implementation of algorithms including:
   (a) Conversion between color spaces.
   (b) Computation of histogram frame difference measures.
   (c) Automatic threshold selection to interpret the computed frame difference values.
   (d) Color clustering to characterize video frames and frame objects.
4. Evaluation of histogram-based algorithms against the ground-truthed database.
5. Literature survey into object-motion based methods.

We discuss our approach to completing each task and our current progress in the following sections. Our evaluation results and analysis are given in section 4.

3.1 Installation and configuration of equipment

Four 200 MHz SGI Indy workstations with R4400 MIPS CPUs, from 64 MB (three machines) to 128 MB (one machine) of main memory and 1 GB disks were received in the first week of January 1996. We created user and administration environments, installed the requisite software and integrated the new machines into our existing computing environment. Large, fast external disk drives were attached and configured to use the XFS filesystem product from SGI. An Indy Video video input board and Cosmo compression board were received and installed. Use of this product allows real-time video capture. Our current system capacity for storing video data is 8 GB. We can now capture video at the full frame rate with little or no frame dropping and process the data using the suite of SGI processors. All of the current video sequences were captured from VHS tape. We now have a cable TV feed into our lab which offers a cleaner and more convenient source for future sequence capture.

3.2 Database Generation and Experimental Protocol Design

To evaluate the performance of the scene cut detection algorithms, we generated a database of video sequences and the corresponding "ground-truth" information showing the "true" scene cuts. The data we obtained is NTSC video captured from VHS tape at (nearly) 30 frames per second through a JPEG compression board. These video sequences, which range from one minute to five minutes in length, were ground truthed with respect to temporal segmentation by human volunteers who viewed the sequences and marked the frames at which they considered a scene cut to have occurred. We have developed tools which can then automatically generate ground-truthed sequences of arbitrary length and complexity using this set of short, manually annotated sequences.

Creation of the ground-truth data is an extremely time-consuming and challenging process. It is time-consuming because of the size of the data: even a short, one-minute sequence contains nearly two thousand frames. It is challenging because of the subjective nature of the task. Different human subjects may mark the same sequence very differently. The same subjects may mark the same sequence slightly differently on successive viewings. Also, having access to semantic content, humans will tend to mark many more scene cuts on a given sequence than will be detected by any algorithm using primarily structural information. In an earlier experiment [5] a ground-truth data set was obtained using a ground-glass screen placed in front of the display to blur the image and block some portion of its semantic content; volunteers then tended to mark scene changes based on events such as dramatic color differences. We chose not to take this approach as we felt that the resulting ground-truth was biased toward methods such as color histogram-based approaches. Our goal was to produce a data set and corresponding ground-truth which would be usable for evaluating algorithms during all phases of the project and would be as "consistent" as possible. In order to establish the optimum experimental conditions under which human volunteers perform their best, a pre-evaluation was carried out.

3.2.1 Pre-evaluation

The purpose of this pre-evaluation was to determine the best way for human subjects to establish a ground truth for temporally segmenting video sequences. The parameters of the experiment and the associated questions to be answered were:

Playback speed: Was full speed or some slower speed the best at which the subject could be most accurate?

Number of previewings: Many subjects might find it useful to view the sequence one or more times before actually performing the segmentation, especially for very active or fast-moving sequences. How many previewings would be best?

Manner of evaluation: Was it feasible to ask people to rank frame differences in order of significance, to compare with the ranking that an algorithm might produce?

Response time: What response delay do humans exhibit? This would be useful in mapping the marked ground truth scene cuts to the scene cuts found by an algorithm.

Nine video sequences were shown to each of nine volunteers, who were asked to mark scene cuts in them. Each volunteer saw each sequence at a different combination of playback speed and number of previewings. The playback speed was either 1.0 (full speed), 0.5 (half speed) or 0.25 (quarter speed). The number of previewings was either 0, 1 or 2 (the previewings themselves were at full speed). The handout given to the volunteers is in Appendix B. The interface for the volunteers consisted of a program running on an SGI that played the sequence at the desired speed in a window on the monitor and allowed mouse clicks to record scene cuts. These nine sequences were then separately ground truthed by three of the researchers viewing the sequences frame by frame. This was taken as the reference against which to evaluate the volunteers' performance.

The results of this experiment clearly showed that we achieved the best quality and most consistent results with the following:

Playback speed: Allow the subject to mark the scene changes at half speed.

Number of previewings: Allow subjects to view the sequence once at full speed before marking scene cuts.

Manner of evaluation: Frame difference ranking was too subjective and too time consuming; the control group produced highly inconsistent results, so this was not pursued further.

Response time: For half-speed marking the average delay time was about 10 frames. It ranged from about 7 to 20 frames and was fairly consistent for a given subject. We therefore factored this delay into the process of associating ground truth frame cuts with the corresponding frame difference values.

3.3 Dataset

Our database consisted of two sets of data. The nine sequences used in the pre-evaluation described in section 3.2.1 are listed in Table 1.

    Code  Sequence Name       Length (s)  Description
    a     savannah-birdcage   70          trailer
    b     seinfeld-jacket     70          sitcom
    c     deep space nine     140         drama
    d     robit-rategate      50          advertisement
    e     cnncompany          161         news clip
    f     cnn-space           120         news clip
    g     predators           120         outdoor wildlife
    h     mtvnews-a           107         news show
    i     mtvnews-b           90          news clip

    Table 1: Description of the original sequences used to generate the dataset

These sequences were exhaustively ground truthed by frame-by-frame inspection to create the raw sequences from which our actual evaluation dataset was constructed. This dataset, a new frame sequence, was generated by a random sequence generator program written to generate test sequences from the individual ground truthed sequences. It works by splicing together extracted clips from each of the nine sequences. At each splice point, one of the nine sequences was randomly chosen. A starting point within the sequence was also randomly chosen. A random length from that starting point was then chosen, subject to the constraint that the extracted clip not cross any scene cuts as given in the ground truth file for the chosen sequence. The extracted clip was then attached to the sequence being generated. The generated sequence thus contained scene cuts only at the points where splices were performed. Of course, there is a slight possibility that two consecutive extractions would select from the same portion of the same sequence, and thus the frame content would be similar across the splice. Apart from this, the generated sequence was clearly a sequence with well-defined scene cuts, well suited for our purposes. The advantage of using this long sequence generated from short, accurately ground-truthed sequences was that the scene cuts present were all clear and well-defined; having such a long sequence ground-truthed by humans would have led to cuts marked on the basis of contextual information. Using this program, a 21000 frame sequence with 200 ground truth cuts was generated.

3.3.1 Evaluation Criteria

Another goal of the pre-evaluation process was the determination of our evaluation criteria. Two basic types of performance factors are considered:

1. Correctness of results: we compute the number of missed detections, the number of false alarms, and other related metrics.
2. Computational efficiency: we compute the CPU time taken.

For histogram-based methods, a scene change is determined by computing the difference between the histograms of selected frames and comparing the difference value to a threshold. If the difference exceeds the threshold, then a scene change has occurred between the given frames. A missed detection is thus defined as a ground truth cut whose corresponding frame difference lies below the scene cut threshold. A false alarm is a frame whose difference value exceeds the scene cut threshold and for which there is no corresponding ground-truth cut. The final criterion is a weighted sum of the number of false alarms, FA, and the number of missed detections, MD:

    Efamd = FA + w * MD        (1)

Since missed detections are in general a more expensive error than false alarms, w should be chosen to be greater than 1. A value of 3.0 was used in this evaluation.

In addition to the above, we also compute a separability metric, defined as

    Msep = (Cfd_avg - NCfd_avg) / sigma_avg        (2)

where Cfd_avg is the average cut frame difference, NCfd_avg is the average non-cut frame difference, and sigma_avg is the mean standard deviation of the cut and non-cut frame differences. This metric measures how well the method being evaluated separates cut frame differences from non-cut frame differences. A good measure of frame difference produces large values of this metric.
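Both metrics reduce to a few lines of code once the frame differences have been split into cut and non-cut populations. The following is a minimal sketch in Python (not the implementation used in this evaluation; the reading of the "mean standard deviation" as the average of the two per-class standard deviations is an assumption):

    # Sketch of the two evaluation metrics; names and details are illustrative only.
    import statistics

    def e_famd(fa, md, w=3.0):
        # Weighted false-alarm / missed-detection error, Equation (1).
        return fa + w * md

    def m_sep(cut_diffs, noncut_diffs):
        # Separability metric, Equation (2): separation of the mean cut and
        # non-cut frame differences, measured in units of the average of the
        # two class standard deviations (assumed interpretation).
        sigma_avg = (statistics.pstdev(cut_diffs) + statistics.pstdev(noncut_diffs)) / 2.0
        return (statistics.mean(cut_diffs) - statistics.mean(noncut_diffs)) / sigma_avg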

3.4 Mapping ground truth to detected cuts

When a set of ground truth cuts for a sequence is available and a set of detected cuts needs to be evaluated (by whatever metric), a mapping from the one to the other needs to take place, because the detected cut frame number will rarely match the ground truth cut frame number exactly. This occurs either because of differences in interpretation of when a scene cut occurs, when multiple choices over a (small) range are possible, or because of the temporal resolution used in processing sequences (every 10th frame), or because of the response time for a human to hit the mouse button upon recognizing a scene cut. The algorithm we used to perform the mapping is described below.

The ground truth file contains the frame numbers at which the scene changes occur. The data file contains the difference values generated by an algorithm using a particular color space, a particular method and a coarse frame comparison offset. For example, a data file could contain the difference values generated by comparing frames at an interval of every 10 frames using the RGB color space and the bin-to-bin comparison in one dimension (B2B1D). Due to the frame interval used in computing frame differences, a detected cut could possibly be located at a different frame number than the one at which the actual scene cut occurred. In addition, human response time would further delay the location of a ground truth cut when ground truth is generated by viewing a sequence. This was not the case for this phase, since the ground truth was obtained during the pre-evaluation stage by a frame-accurate examination of sequences, but it would be true in general. It is thus necessary to provide the matching algorithm with a range, used as the maximum possible interval from the ground truth cut within which a detected cut may be mapped. The range must be specified in both directions, since we need to match the frame closest to the ground truth. These ranges are termed the positive and negative ranges: the positive range defines the maximum amount by which a matched cut frame number may exceed the reference ground truth frame number, while the negative range applies in the opposite direction. When mapping a ground truth cut to a set of frame differences, the frame whose frame difference value is the largest is chosen.
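A compact sketch of this mapping, assuming the frame-difference data are available as a dictionary from frame number to difference value (the function and variable names are illustrative, not taken from the report's software):

    def map_ground_truth(gt_cuts, diffs, neg_range, pos_range):
        # Map each ground truth cut to the detected frame with the largest
        # frame difference inside [gt - neg_range, gt + pos_range].
        # diffs: dict mapping frame number -> frame difference value.
        mapping = {}
        for gt in gt_cuts:
            window = [f for f in diffs if gt - neg_range <= f <= gt + pos_range]
            mapping[gt] = max(window, key=lambda f: diffs[f]) if window else None
        return mapping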

3.5 Frame Difference Measures

Several color histogram based frame difference measures, some more intuitive than others, were implemented based on descriptions given in the original papers. As a baseline comparison, direct frame difference measures were also implemented. These are described in detail below. The frame difference measures were computed every kth frame, with k = 10.

3.5.1 Bin to bin difference (B2B)

The sum of the absolute differences of values in corresponding bins of the histograms is used. Given two histograms h1 and h2,

    fd_b2b(h1, h2) = (1 / (2N)) * sum_i |h1[i] - h2[i]|        (3)

where N is the number of pixels in a frame; the factor of 2 ensures that even for completely non-intersecting histograms the frame difference is less than or equal to 1.0.
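As a sketch (assuming the two histograms are equal-length lists of bin counts):

    def fd_b2b(h1, h2, n_pixels):
        # Bin-to-bin difference, Equation (3); the result lies in [0, 1].
        return sum(abs(a - b) for a, b in zip(h1, h2)) / (2.0 * n_pixels)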

3.5.2 Chi-square test histogram difference (CHI)

The histogram bin differences are normalized as defined in [6] to sharpen the frame differences being computed and "make the evaluation value more strongly reflect the difference of two frames" [6]:

    fd_chi(h1, h2) = (1 / N^2) * sum_i d[i]

where

    d[i] = (h1[i] - h2[i])^2 / h2[i]    if h2[i] != 0                  (4)
    d[i] = (h1[i] - h2[i])^2 / h1[i]    if h2[i] = 0 and h1[i] != 0    (5)
    d[i] = 0                            if h1[i] = h2[i] = 0           (6)

The normalization by N^2 is necessary for singular cases, as when one histogram consists of exactly one bin and the other histogram has exactly one pixel in the same bin.
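A direct transcription of Equations (4)-(6), again assuming plain lists of bin counts (a sketch, not the report's implementation):

    def fd_chi(h1, h2, n_pixels):
        # Chi-square histogram difference, Equations (4)-(6).
        total = 0.0
        for a, b in zip(h1, h2):
            if b != 0:
                total += (a - b) ** 2 / b
            elif a != 0:
                total += (a - b) ** 2 / a
            # bins that are empty in both histograms contribute nothing
        return total / (float(n_pixels) ** 2)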

3.5.3 Histogram intersection (INT)

As described in [7], the intersection between two color histograms is

    Intersection(h1, h2) = sum_i MIN(h1[i], h2[i]) / N        (7)

We used a slightly different definition:

    Intersection(h1, h2) = sum_i MIN(h1[i], h2[i]) / sum_i MAX(h1[i], h2[i])        (8)

The frame difference is then

    fd_int(h1, h2) = 1 - Intersection(h1, h2)        (9)

The intersection of two histograms is a measure of their similarity: for two identical histograms the intersection would be 1.0 and the difference would be 0.0; for two frames which did not share even a single pixel of the same color (bin), the difference according to this measure would be 1.0. Again, histogram intersection may be performed between 1-, 2- and 3-dimensional histograms.
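A sketch of the modified intersection measure, reading Equation (8) as the summed per-bin minima normalized by the summed per-bin maxima (which gives 1.0 for identical histograms and 0.0 for disjoint ones, as stated above):

    def fd_int(h1, h2):
        # Histogram intersection difference, Equations (8)-(9).
        num = sum(min(a, b) for a, b in zip(h1, h2))
        den = sum(max(a, b) for a, b in zip(h1, h2))
        return 1.0 - (num / den if den else 1.0)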

3.5.4 Average Color (AVG)

In [8] it is shown that the difference of the average colors of two histograms, where the average color is the 3 x 1 vector

    h_avg[j] = sum_i h[i] * c[j, i],   j = 0, 1, 2        (10)

with j the color axis (e.g. R, G or B) and c[j, i] the value of color tristimulus j at bin i (for RGB colors, c[j, i] = i), is a lower bound on the quadratic-form histogram distance using a color similarity matrix. The frame difference was computed as the squared magnitude of h1_avg - h2_avg.
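A sketch of the average-color difference; the centers[j][i] table holding the tristimulus value of axis j at bin i is an assumed representation, not the report's data structure:

    def fd_avg(h1, h2, centers):
        # Average-color difference: squared distance between the two
        # histogram-weighted mean color vectors of Equation (10).
        def avg(h):
            return [sum(h[i] * centers[j][i] for i in range(len(h))) for j in range(3)]
        a1, a2 = avg(h1), avg(h2)
        return sum((x - y) ** 2 for x, y in zip(a1, a2))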

3.5.5 Reference Color Table (REFC)

As described in [9], a reference color table is essentially a coarse quantization of RGB color space into 27 different color triples which form the bins h[i] of a 3-D histogram. The difference measure is then

    fd_refc = sqrt( sum_{i=1..27} w_i * (h1[i] - h2[i])^2 )        (11)

where

    w_i = h2[i]   if h1[i] > 0 and h2[i] > 0
    w_i = 1       otherwise
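A sketch of Equation (11) over the 27 reference-color bins:

    def fd_refc(h1, h2):
        # Reference color table difference, Equation (11).
        total = 0.0
        for a, b in zip(h1, h2):          # 27 coarse RGB bins
            w = b if (a > 0 and b > 0) else 1.0
            total += w * (a - b) ** 2
        return total ** 0.5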

3.5.6 Dimensionality

Each of the above histogram difference measures (except for the average color, which is applied only to 1-dimensional histograms, and the reference color method, which is used only for 3-dimensional RGB data) can be applied to 1-dimensional, 2-dimensional or 3-dimensional histograms. That is, for a 3-dimensional color space, the color distribution in a frame may be represented as three independent 1-dimensional distributions in each of the primaries; as a distribution in two dimensions over the two axes that are not correlated with luminance; or as a distribution over all three dimensions simultaneously. A 1D representation has three separate histograms, one for each tristimulus color component. A 2D representation neglects the luminance component (if one exists) and uses only the chrominance components simultaneously (the RGB color space has no 2D representation). A 3D histogram has bins representing cells in color space. The idea behind a two-dimensional histogram is that changes in the brightness of the frame are often not correlated with real content change. For example, flash photography may momentarily increase the brightness of the recording of a news conference. The quantization levels of color space were as follows: for 1-dimensional histograms, 256 bins; for 2-dimensional and 3-dimensional histograms, each axis was divided into 32 bins. We abbreviate the method and dimensionality into a single code such as B2B1D, INT3D, etc.
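The bin layouts described above (256 bins per 1-D histogram, 32 bins per axis for 3-D histograms) can be sketched as follows; the pixel format is an assumption, and the 2-D case is analogous using only the two chrominance axes:

    def histograms_1d(pixels, bins=256):
        # Three independent per-channel histograms; each channel value in 0..255.
        hs = [[0] * bins for _ in range(3)]
        for pixel in pixels:
            for j, c in enumerate(pixel):
                hs[j][c * bins // 256] += 1
        return hs

    def histogram_3d(pixels, bins=32):
        # One joint histogram with bins x bins x bins cells in color space.
        h = {}
        for c0, c1, c2 in pixels:
            key = (c0 * bins // 256, c1 * bins // 256, c2 * bins // 256)
            h[key] = h.get(key, 0) + 1
        return h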

3.5.7 Baseline comparisons

To have a baseline against which to compare the color histogram based methods, we also implemented two other methods.

Pixel to pixel (P2P): The sum of the squares of pixel-to-pixel differences for the R, G and B components, normalized by the number of pixels in the frame.

Likelihood ratio change detection (LHR): As described in [10], corresponding blocks of the two frames are compared as follows. For corresponding areas of the two frames the value

    lambda = [ (sigma1^2 + sigma2^2) / 2 + ((mu1 - mu2) / 2)^2 ]^2 / (sigma1^2 * sigma2^2)        (12)

is computed, where mu denotes the mean gray value and sigma the square root of the variance of the sample areas in the two frames. The block differences are thresholded:

    bd_j = 1   if lambda_j > t        (13)
    bd_j = 0   otherwise              (14)

Then the likelihood ratio frame difference is the normalized count of differing blocks:

    fd_lhr = (1 / nB) * sum_{j=1..nB} bd_j        (16)

where nB is the number of blocks in a frame. A block size of 40 x 40 was used to capture large-scale frame content changes.
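A sketch of the likelihood ratio baseline; blocks are assumed to be flat lists of gray values, and blocks with zero variance would need special handling:

    import statistics

    def likelihood_ratio(b1, b2):
        # Block likelihood ratio, Equation (12).
        m1, m2 = statistics.mean(b1), statistics.mean(b2)
        v1, v2 = statistics.pvariance(b1), statistics.pvariance(b2)
        return ((v1 + v2) / 2.0 + ((m1 - m2) / 2.0) ** 2) ** 2 / (v1 * v2)

    def fd_lhr(blocks1, blocks2, t):
        # Fraction of corresponding blocks whose likelihood ratio exceeds t.
        flags = [likelihood_ratio(b1, b2) > t for b1, b2 in zip(blocks1, blocks2)]
        return sum(flags) / float(len(flags))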

3.6 Threshold Selection

Given the frame differences computed by the above methods, thresholding is performed to identify the significant frame differences marking scene cuts. The following algorithms were used to select this threshold.

3.6.1 Significance clustering

The frame differences computed over the entire sequence are clustered into two groups using the k-means clustering algorithm [11]. This method assumes that at least one scene cut exists in the sequence. The smallest and largest frame difference values are used as the initial centroids of an insignificant and a significant cluster, respectively. Each frame difference is then added to the cluster to whose centroid it is closest. After each iteration, the cluster centroids are recomputed as the mean of the cluster members. This is repeated until the change in the cluster centroids between iterations falls below a threshold, i.e., until the clustering process has converged. The frame differences in the significant cluster correspond to scene cuts. A "sensitivity" parameter is available to make the process more sensitive to scene cuts: by weighting distances to the insignificant cluster highly, more frame differences may be assigned to the significant cluster. A sensitivity value of 0.5 corresponds to no bias in either direction. A sensitivity value of 0.75 was used in this project to match the w = 3.0 weighting of missed detections to false alarms.
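The clustering step can be sketched as below. The exact form of the sensitivity weighting is not spelled out in the report, so the weighting used here (scaling the distance to the insignificant centroid by the sensitivity value) is an assumption:

    def significance_cluster(diffs, sensitivity=0.75, tol=1e-6):
        # Two-cluster k-means over the frame differences; returns the values
        # assigned to the significant (scene cut) cluster.
        c_insig, c_sig = min(diffs), max(diffs)
        while True:
            sig, insig = [], []
            for d in diffs:
                # sensitivity > 0.5 inflates the distance to the insignificant
                # centroid, pushing borderline differences into the cut cluster
                if sensitivity * abs(d - c_insig) >= (1.0 - sensitivity) * abs(d - c_sig):
                    sig.append(d)
                else:
                    insig.append(d)
            new_insig = sum(insig) / len(insig) if insig else c_insig
            new_sig = sum(sig) / len(sig) if sig else c_sig
            if abs(new_insig - c_insig) < tol and abs(new_sig - c_sig) < tol:
                return sig
            c_insig, c_sig = new_insig, new_sig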

3.6.2 Optimal Threshold

In evaluating different color spaces and histogram difference measures, the effect of the threshold selection algorithm should be minimized. Hence, for evaluation purposes, an optimal thresholding algorithm that minimizes Efamd given the ground truth was implemented. The performance of the various methods using this algorithm is then the best possible. Because this evaluation method tends to mask deficiencies in the frame difference measure itself and requires ground truth data, it is of limited usefulness, but it can serve as a broad indicator that supports the results obtained from Efamd computed using significance clustering.
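A brute-force sketch of such an optimal threshold search, which simply evaluates every observed difference value as a candidate threshold against the ground truth labels (the detection rule diff >= threshold is an assumption about tie handling):

    def optimal_threshold(diffs, is_cut, w=3.0):
        # diffs[i] is the frame difference at comparison i; is_cut[i] is True
        # if the ground truth places a scene cut there.
        best_t, best_e = None, float("inf")
        for t in sorted(set(diffs)):
            fa = sum(1 for d, c in zip(diffs, is_cut) if d >= t and not c)
            md = sum(1 for d, c in zip(diffs, is_cut) if d < t and c)
            if fa + w * md < best_e:
                best_t, best_e = t, fa + w * md
        return best_t, best_e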

3.6.3 Storyboard Creation

After thresholding has been performed, the number of cuts may still be very large for a very "active" video sequence. For example, in a sequence with a lot of camera panning, a cut frame may be detected every 10th frame. To condense this information into a more semantic set of cuts suitable for creating storyboards, very short subsequences, those whose starting (cut) frames are separated by less than some parameterized number of frames (10 was used for the storyboards in this report), are merged to create just one subsequence. If this merged subsequence consisting of multiple scene cuts (an "active" subsequence) satisfies a minimum length criterion (say 2 seconds), it is included in the final storyboard; otherwise it is discarded as containing too many representative frames over too short an interval to be useful. This technique has been found to work well at creating more representative storyboards without including a distractingly large number of cut frames. For general video sequences (ones where very rapid frame changes are not part of the lexicon), very short and very active subsequences just add clutter that would decrease the usefulness of the storyboard representation for browsing.
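One way to sketch the merging rule just described (the grouping logic is an interpretation of the text, not code from the report):

    def condense_cuts(cut_frames, merge_gap=10, min_span=60):
        # Group cut frames that start within `merge_gap` frames of each other;
        # a merged ("active") group is kept only if it spans at least
        # `min_span` frames (e.g. 2 seconds at 30 frames/s). Assumes a sorted,
        # non-empty list; returns one representative cut per surviving group.
        groups, group = [], [cut_frames[0]]
        for f in cut_frames[1:]:
            if f - group[-1] < merge_gap:
                group.append(f)
            else:
                groups.append(group)
                group = [f]
        groups.append(group)
        return [g[0] for g in groups if len(g) == 1 or g[-1] - g[0] >= min_span]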

3.7 Study of Object Motion-based and Compressed Video based Indexing

In phase II, we will evaluate algorithms which detect scene changes based on object motion, such as changes in the relative positions of objects. In phase III, we will investigate the extraction of useful image and video features directly from the compressed data sequence without prior decompression. Specifically, we will explore their use to detect scene changes and object appearances/disappearances. Research articles in both these areas are being studied in preparation for the next phases of the project.

One of the key technical challenges in using object motion to segment video sequences is the detection of objects of interest. Due to the current state of object recognition and the wide variety of images within video sequences, object identification is done semi-automatically in many cases. Both Lee and Kao [12] and Swanberg, Shu and Jain [13] use operator interaction to identify objects of interest either for indexing or querying. These objects are then tracked by motion analysis techniques, such as using the pattern of intensity values in the object region to locate its position in the next frame. We will also investigate the use of standard optical flow algorithms such as the Lucas-Kanade algorithm [14] to compute scene changes. We will compute optical flow across the entire frame and compare the magnitudes and the variances of the pixel displacement vectors of the flow fields. A large number of large but consistent displacement vectors would indicate significant object or camera motion which might constitute a cut.

The MPEG [15, 16] compression scheme ties in closely with motion-based methods due to its use of motion vectors to capitalize on temporal correlation between frames. Thus, motion-based temporal segmentation can be achieved using motion vectors in an MPEG bit-stream previously computed using block motion estimation [17]. A displacement vector will then be available for a block of pixels and can be processed as above for the single-pixel displacement case. Also, in the case of an MPEG sequence, the existence of a large number of backward or bidirectionally predicted blocks would indicate the occlusion, disappearance or appearance of an object. More direct use of MPEG coefficients [18, 19, 20, 21] is also being studied.

4 Results/Progress

This section presents the results of applying the frame difference methods described above, in each of the color spaces described earlier (where applicable), to pairs of frames taken at a temporal resolution of one out of every ten frames.

A note on normalization: the normalized values of Efamd in the tables are the FA and MD numbers divided by the number of true non-cut frames and the number of true cut frames, respectively. This normalization was carried out to avoid excessive dependence of the final numbers on one very long and/or very active sequence in the dataset. It also explains why the optimal threshold selection algorithm sometimes appears to do worse than the automatic threshold selection algorithm: the former attempts to minimize the (weighted) sum of FA and MD over just one sequence and therefore does not need to normalize, whereas for purposes of combining results from multiple sequences a normalization is required, which effectively changes the relative weighting of FA and MD.

Color Space  Method   FA   MD   Norm FA   Norm MD   Efamd
HSV          AVG1D     2   74   0.001050  0.368159  1.105527
HSV          B2B1D    36    6   0.018908  0.030000  0.108908
HSV          B2B2D    54    7   0.028361  0.035000  0.133361
HSV          B2B3D    52    5   0.027311  0.025000  0.102311
HSV          CHI1D     0  152   0.000000  0.756219  2.268657
HSV          CHI2D     0  188   0.000000  0.935323  2.805969
HSV          CHI3D     1   53   0.000525  0.265000  0.795525
HSV          INT1D    85    5   0.044643  0.025000  0.119643
HSV          INT2D   392    5   0.205882  0.025000  0.280882
HSV          INT3D   522    4   0.274160  0.020000  0.334160
LAB          AVG1D     0   96   0.000000  0.480000  1.440000
LAB          B2B1D    39    6   0.020483  0.030000  0.110483
LAB          B2B2D    40    6   0.021008  0.030000  0.111008
LAB          B2B3D    39    5   0.020483  0.025000  0.095483
LAB          CHI1D     0  142   0.000000  0.706468  2.119404
LAB          CHI2D     0  187   0.000000  0.930348  2.791044
LAB          CHI3D     0  140   0.000000  0.696517  2.089551
LAB          INT1D   122    5   0.064076  0.025000  0.139076
LAB          INT2D   343    5   0.180147  0.025000  0.255147
LAB          INT3D   447    5   0.234769  0.025000  0.309769
LUV          AVG1D     0  112   0.000000  0.560000  1.680000
LUV          B2B1D    20    7   0.010504  0.035000  0.115504
LUV          B2B2D    41    8   0.021534  0.040000  0.141534
LUV          B2B3D    36    5   0.018908  0.025000  0.093908
LUV          CHI1D     0  152   0.000000  0.756219  2.268657
LUV          CHI2D     0  192   0.000000  0.955224  2.865672
LUV          CHI3D     0  150   0.000000  0.746269  2.238807
LUV          INT1D   101    5   0.053046  0.025000  0.128046
LUV          INT2D   257    5   0.134979  0.025000  0.209979
LUV          INT3D   404    5   0.212185  0.025000  0.287185
MTM          AVG1D     2   63   0.001050  0.315000  0.946050
MTM          B2B1D    32    6   0.016807  0.030000  0.106807
MTM          B2B2D    38    6   0.019958  0.030000  0.109958
MTM          B2B3D    38    5   0.019958  0.025000  0.094958
MTM          CHI1D     0  133   0.000000  0.661692  1.985076
MTM          CHI2D     0  191   0.000000  0.950249  2.850747
MTM          CHI3D     1  128   0.000525  0.636816  1.910973
MTM          INT1D    96    5   0.050420  0.025000  0.125420
MTM          INT2D   291    5   0.152836  0.025000  0.227836
MTM          INT3D   338    5   0.177521  0.025000  0.252521

Table 2: False Alarms and Missed Detections using Automatically selected Thresholds for Generated Sequence (continued)

Color Space  Method   FA   MD   Norm FA   Norm MD   Efamd
OPP          AVG1D     6   79   0.003151  0.395000  1.188151
OPP          B2B1D    28    6   0.014706  0.030000  0.104706
OPP          B2B2D    57    6   0.029937  0.030000  0.119937
OPP          B2B3D    44    5   0.023109  0.025000  0.098109
OPP          CHI1D     0  151   0.000000  0.751244  2.253732
OPP          CHI2D     0  170   0.000000  0.845771  2.537313
OPP          CHI3D     0  137   0.000000  0.681592  2.044776
OPP          INT1D    72    5   0.037815  0.025000  0.112815
OPP          INT2D   430    4   0.225840  0.020000  0.285840
OPP          INT3D   520    4   0.273109  0.020000  0.333109
RGB          AVG1D     6   81   0.003151  0.405000  1.218151
RGB          REFC     37   26   0.019433  0.130000  0.409433
RGB          P2P     550    7   0.288866  0.034314  0.391808
RGB          LHR       0  200   0.000000  0.995025  2.985075
RGB          B2B1D    80    5   0.042017  0.025000  0.117017
RGB          B2B3D    68    5   0.035714  0.025000  0.110714
RGB          CHI1D     0  148   0.000000  0.736318  2.208954
RGB          CHI3D     1  108   0.000525  0.537313  1.612464
RGB          INT1D   151    5   0.079307  0.025000  0.154307
RGB          INT3D   577    3   0.303046  0.015000  0.348046
XYZ          AVG1D     5   93   0.002626  0.465000  1.397626
XYZ          B2B1D    81    5   0.042542  0.025000  0.117542
XYZ          B2B2D    47    5   0.024685  0.025000  0.099685
XYZ          B2B3D    45    5   0.023634  0.025000  0.098634
XYZ          CHI1D     0  147   0.000000  0.731343  2.194029
XYZ          CHI2D     0  152   0.000000  0.756219  2.268657
XYZ          CHI3D     0  137   0.000000  0.681592  2.044776
XYZ          INT1D   147    5   0.077206  0.025000  0.152206
XYZ          INT2D   314    5   0.164916  0.025000  0.239916
XYZ          INT3D   467    5   0.245273  0.025000  0.320273
YIQ          AVG1D     5   91   0.002626  0.452736  1.360834
YIQ          B2B1D    26    6   0.013655  0.030000  0.103655
YIQ          B2B2D    65    6   0.034139  0.030000  0.124139
YIQ          B2B3D    60    5   0.031513  0.025000  0.106513
YIQ          CHI1D     0  141   0.000000  0.701493  2.104479
YIQ          CHI2D     0  175   0.000000  0.870647  2.611941
YIQ          CHI3D     0  141   0.000000  0.701493  2.104479
YIQ          INT1D   124    5   0.065126  0.025000  0.140126
YIQ          INT2D   433    4   0.227416  0.020000  0.287416
YIQ          INT3D   502    4   0.263655  0.020000  0.323655
YYY          AVG1D     5  103   0.002626  0.512438  1.539940
YYY          B2B1D    96    5   0.050420  0.025000  0.125420
YYY          CHI1D     0  160   0.000000  0.796020  2.388060
YYY          INT1D   241    5   0.126576  0.025000  0.201576

Table 2: False Alarms and Missed Detections using Automatically selected Thresholds (concluded)

Method  Efamd
B2B1D   0.186740
B2B2D   0.231656
B2B3D   0.211146
CHI1D   0.723525
CHI2D   0.891969
CHI3D   0.632793
INT1D   0.374678
INT2D   0.661413
INT3D   0.720397
AVG1D   0.457307

Table 3: Average values of weighted false alarm-missed detection error for different methods

4.1 Frame Difference Method Evaluation

The values of Efamd using automatic thresholding for the generated sequence dataset are given in Table 2. The number of ground truth cuts for this sequence was 200. Figure 1 plots Efamd values for selected color spaces for the generated sequence. The average values (over the seven color spaces that could be used with any method) of Efamd are shown in Table 3. It can be seen that the bin-to-bin methods outperform the others regardless of dimensionality, closely followed by the histogram intersection class of methods. The chi-square class of methods does relatively poorly, as do the average color methods. The top 20 performing method-color space combinations are shown in Figure 2. In comparison, the pixel-to-pixel square error and likelihood ratio change detection methods had Efamd values of 2.243148 and 0.995025, which are substantially poorer than the color histogram based methods.

The values of Efamd using optimal thresholding are listed in Table 4 and shown in Figure 3. The trend here is not as clear, due to the compensating effect of the optimal thresholding algorithm, but it is consistent with the evaluation above. The values of Efamd using automatic and optimal thresholding for the nine original sequences are presented for reference in Appendix D. Note that for the generated sequence only one threshold was used, not different thresholds for the different constituent clips. Averaging the performance of the different frame difference methods over 7 color spaces using the Efamd metric values obtained for automatic thresholding, we get the chart in Figure 4. Again, the bin-to-bin methods appear to be the best. Using the separability metric Msep defined in section 3.3.1 and averaging the values over the seven color spaces that could be used in all methods, we get the values in Table 5. Here the 3-D versions of the methods appear to do better, and histogram intersection appears to have the edge.

Color Space  Method   FA   MD   Norm FA   Norm MD   FA + w*MD
HSV          AVG1D    18   24   0.008555  0.120000  0.368555
HSV          B2B1D     3   13   0.001426  0.065000  0.196426
HSV          B2B2D     4   15   0.001901  0.075000  0.226901
HSV          B2B3D     4   10   0.001901  0.050000  0.151901
HSV          CHI1D     4   20   0.001901  0.100000  0.301901
HSV          CHI2D    19   18   0.009030  0.090000  0.279030
HSV          CHI3D    12   10   0.005703  0.050000  0.155703
HSV          INT1D     3   13   0.001426  0.065000  0.196426
HSV          INT2D     4   15   0.001901  0.075000  0.226901
HSV          INT3D     4   10   0.001901  0.050000  0.151901
LAB          AVG1D    24   35   0.011407  0.175000  0.536407
LAB          B2B1D     4   12   0.001901  0.060000  0.181901
LAB          B2B2D     6   12   0.002852  0.060000  0.182852
LAB          B2B3D     3    9   0.001426  0.045000  0.136426
LAB          CHI1D    10   14   0.004753  0.070000  0.214753
LAB          CHI2D    32   20   0.015209  0.100000  0.315209
LAB          CHI3D    13   26   0.006179  0.130000  0.396179
LAB          INT1D     4   12   0.001901  0.060000  0.181901
LAB          INT2D     6   12   0.002852  0.060000  0.182852
LAB          INT3D     3    9   0.001426  0.045000  0.136426
LUV          AVG1D    24   41   0.011407  0.205000  0.626407
LUV          B2B1D     2   16   0.000951  0.080000  0.240951
LUV          B2B2D     6   14   0.002852  0.070000  0.212852
LUV          B2B3D     3    8   0.001426  0.040000  0.121426
LUV          CHI1D    13   15   0.006179  0.075000  0.231179
LUV          CHI2D    27   33   0.012833  0.165000  0.507833
LUV          CHI3D    10   31   0.004753  0.155000  0.469753
LUV          INT1D     2   16   0.000951  0.080000  0.240951
LUV          INT2D     6   14   0.002852  0.070000  0.212852
LUV          INT3D     3    8   0.001426  0.040000  0.121426
MTM          AVG1D    20   35   0.009506  0.175000  0.534506
MTM          B2B1D     3   11   0.001426  0.055000  0.166426
MTM          B2B2D     5   10   0.002376  0.050000  0.152376
MTM          B2B3D     6    6   0.002852  0.030000  0.092852
MTM          CHI1D     5   18   0.002376  0.090000  0.272376
MTM          CHI2D    23   25   0.010932  0.125000  0.385932
MTM          CHI3D    18   13   0.008555  0.065000  0.203555
MTM          INT1D     3   11   0.001426  0.055000  0.166426
MTM          INT2D     5   10   0.002376  0.050000  0.152376
MTM          INT3D     6    6   0.002852  0.030000  0.092852

Table 4: False Alarms and Missed Detections using Per-Sequence Optimal Thresholds for Generated Sequence (continued)

Color Space  Method   FA   MD   Norm FA   Norm MD   FA + w*MD
OPP          AVG1D    33   25   0.015684  0.125000  0.390684
OPP          B2B1D     4    9   0.001901  0.045000  0.136901
OPP          B2B2D     8   11   0.003802  0.055000  0.168802
OPP          B2B3D     5    9   0.002376  0.045000  0.137376
OPP          CHI1D    14   17   0.006654  0.085000  0.261654
OPP          CHI2D    30   51   0.014259  0.255000  0.779259
OPP          CHI3D    25   22   0.011882  0.110000  0.341882
OPP          INT1D     4    9   0.001901  0.045000  0.136901
OPP          INT2D     8   11   0.003802  0.055000  0.168802
OPP          INT3D     5    9   0.002376  0.045000  0.137376
RGB          AVG1D    33   27   0.015684  0.135000  0.420684
RGB          REFC     10   40   0.004753  0.200000  0.604753
RGB          P2P      63   26   0.029943  0.130000  0.419943
RGB          LHR      64   65   0.030418  0.325000  1.005418
RGB          B2B1D     8   14   0.003802  0.070000  0.213802
RGB          B2B3D     0   14   0.000000  0.070000  0.210000
RGB          CHI1D    11   20   0.005228  0.100000  0.305228
RGB          CHI3D    13   20   0.006179  0.100000  0.306179
RGB          INT1D     8   14   0.003802  0.070000  0.213802
RGB          INT3D     0   14   0.000000  0.070000  0.210000
XYZ          AVG1D    40   27   0.019011  0.135000  0.424011
XYZ          B2B1D     9   15   0.004278  0.075000  0.229278
XYZ          B2B2D     5   12   0.002376  0.060000  0.182376
XYZ          B2B3D     5   10   0.002376  0.050000  0.152376
XYZ          CHI1D    13   20   0.006179  0.100000  0.306179
XYZ          CHI2D    13   28   0.006179  0.140000  0.426179
XYZ          CHI3D    11   31   0.005228  0.155000  0.470228
XYZ          INT1D     9   15   0.004278  0.075000  0.229278
XYZ          INT2D     5   12   0.002376  0.060000  0.182376
XYZ          INT3D     5   10   0.002376  0.050000  0.152376
YIQ          AVG1D    35   25   0.016635  0.125000  0.391635
YIQ          B2B1D     7    6   0.003327  0.030000  0.093327
YIQ          B2B2D     8   14   0.003802  0.070000  0.213802
YIQ          B2B3D     1   13   0.000475  0.065000  0.195475
YIQ          CHI1D    14   18   0.006654  0.090000  0.276654
YIQ          CHI2D    29   51   0.013783  0.255000  0.778783
YIQ          CHI3D    12   36   0.005703  0.180000  0.545703
YIQ          INT1D     7    6   0.003327  0.030000  0.093327
YIQ          INT2D     8   14   0.003802  0.070000  0.213802
YIQ          INT3D     1   13   0.000475  0.065000  0.195475
YYY          AVG1D    34   47   0.016160  0.235000  0.721160
YYY          B2B1D     7   18   0.003327  0.090000  0.273327
YYY          CHI1D    12   24   0.005703  0.120000  0.365703
YYY          INT1D     7   18   0.003327  0.090000  0.273327

Table 4: False Alarms and Missed Detections using Per-Sequence Optimal Thresholds (concluded)

Method  Msep
B2B1D   3.681348
B2B2D   3.332162
B2B3D   4.162935
CHI1D   0.931090
CHI2D   0.574293
CHI3D   1.037919
INT1D   4.001511
INT2D   3.598642
INT3D   4.063465
AVG1D   1.539882

Table 5: Average values of separability metric for different methods

Color Space  Efamd
HSV          0.374410
LAB          0.380042
LUV          0.343853
MTM          0.343432
OPP          0.351537
RGB          0.455340
XYZ          0.429060
YIQ          0.390347
YYY          0.466325

Table 6: Average values of weighted false alarm-missed detection error for different color spaces

4.2 Color Space Evaluation

The values of Efamd using w = 1, averaged over the best methods (B2B1D, B2B3D, INT1D and INT3D; except for YYY (luminance alone), which was averaged only over B2B1D and INT1D since the other methods cannot use a single-dimensional space), are given for each color space in Table 6. The figures show that the color spaces can be divided into two performance groups. The best performing group consists of the L*u*v*, MTM, OPP, HSV, L*a*b* and YIQ spaces. The RGB, XYZ and YYY (luminance alone) spaces have high values for this error metric.

4.3 Timing Considerations

The main computational component of the video indexing process is the color conversion. A sample run on a 1000 frame sequence using an interval of 10 frames for the different color spaces yielded the times (the 'user' component of the time returned by the UNIX /bin/time command) shown in Table 7 and plotted in Figure 5. From this data, it can be seen that the algorithms that are more floating-point intensive take the longest time. For example, the difference between a conversion from RGB to the L*a*b* space and a conversion from RGB to the L*u*v* space is just one floating point "pow()" (exponentiation) call, but the former takes more than twice as long as the latter. Thus, the evaluation results indicate that the best choices are:

Color space: L*u*v* for best performance, or OPP for speed (only integer computations).

Method: Bin-to-bin 1-dimensional histogram differencing.

Color Space  Time (s)
RGB          99
YIQ          118
OPP          119
HSV          122
XYZ          130
LUV          235
LAB          519
MTM          718

Table 7: Running times for a 1000 frame sequence

4.4 Thresholding

The optimal threshold values and the automatically identified (effective) threshold values for the nine sequences are shown in Appendix D and plotted in Figures 6 and 7. The following conclusions can be drawn.

1. Automatic thresholding yields threshold values that are comparable to the optimal ones. Since the optimal threshold algorithm returns one single value where a range of values would probably be optimal (the range around the optimal threshold within which FA and MD do not change because no frame differences fall in that interval), an exact match should not be expected.

2. The threshold values vary across different sequences. This precludes the use of a fixed static threshold for indexing purposes.

4.5 Final Indexed Outputs

The storyboard representations of two of the nine sequences (a and f) in our raw dataset are shown in Appendix C.

4.6 Image Object Characterization Using Color

Preliminary studies are under way for using color histograms for image object characterization in a content-based image retrieval context. Color histograms are robust characterizations of video frames but they lack spatial information. The essential idea is to add information on the spatial distribution of color to allow images to be matched by similarity. Currently clustering in both the color and image dimension spaces is being used. It has been found that with a small number of clusters, a representation of a video frame that is perceptually very close to the original is possible. Work on matching such representations is planned.


5 Directions

5.1 Database and Experimental Protocol Modifications

This report describes the implementation of color histogram based video indexing methods scheduled for phase I of the project, as well as the evaluation carried out so far. Further evaluation is scheduled for phase IV of the project. Some issues in the area of video indexing that will be investigated during the final phase are:

1. Effect of color quantization (for color histogram based methods only). A preliminary investigation has been performed indicating that histogram based methods are robust to quantization effects in 1-D. Full results will be presented with the evaluation slated for phase IV.

2. Scalability to higher/lower frame rates and coarser/finer image resolutions.

3. Controllability: the existence of intuitive algorithm tuning parameters which may be varied under user control, e.g., sensitivity to small scene changes.

4. Domain of applicability: whether knowledge of the type of sequence (news, music video, sports, etc.) can be used to choose a model that allows better indexing.

5.2 Future Phases: Object-motion based and Compressed-video indexing

Based on our research, we plan to implement the following algorithms:

1. An operator-driven object selection method: the user selects a rough boundary of an object, and the algorithm performs boundary tracking to cleanly enclose the region. We then perform feature extraction on the region and track it through the sequence to obtain cut points at object appearance/disappearance points.

2. Use of optical flow for temporal segmentation.

3. The direct method [19] for using JPEG coefficients for video sequence segmentation.

4. Modification of the above for MPEG data [18].

5. The Akutsu et al. [17] algorithm for video indexing using motion vectors.

References

[1] W.K. Pratt. Digital Image Processing, 2nd Ed. John Wiley & Sons, 1991.
[2] R.L. Kuehn and H.R. Luxenberg. Visual Experience and Colorimetry. Computer Graphics Systems Development Corporation, Mountain View, CA, USA.
[3] M. Miyahara and Y. Yoshida. Mathematical Transform of (R, G, B) Color Data to Munsell (H, V, C) Color Data. Proceedings of SPIE Visual Communications and Image Processing, 1001:650-657, 1988.
[4] B. Furht, S.W. Smoliar, and H.J. Zhang. Video and Image Processing in Multimedia Systems. Kluwer Academic Publishers, 1995.
[5] U. Gargi, S. Oswald, D. Kosiba, S. Devadiga, and R. Kasturi. Evaluation of Video Sequence Indexing and Hierarchical Video Indexing. In Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases III, Vol. SPIE 2420, 1995.
[6] A. Nagasaka and Y. Tanaka. Automatic Video Indexing and Full-Video Search for Object Appearances. Proceedings of IFIP 2nd Working Conference on Visual Database Systems, pages 113-127, 1992.
[7] M.J. Swain and D.H. Ballard. Color Indexing. International Journal of Computer Vision, 26(4):461-470, 1993.
[8] J. Hafner, H.S. Sawhney, W. Equitz, M. Flickner, and W. Niblack. Efficient Color Histogram Indexing for Quadratic Form Distance Functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7, 1995.
[9] B.M. Mehtre, M.S. Kankanhalli, A.D. Narasimhalu, and G.C. Man. Color Matching for Image Retrieval. Pattern Recognition Letters, 16:325-331, 1994.
[10] R. Kasturi and R.C. Jain. Computer Vision: Principles, pages 469-480. IEEE Computer Society Press, 1991.
[11] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, 1993.
[12] S.-Y. Lee and H.-M. Kao. Video Indexing - An Approach Based on Moving Object and Track. Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases I, Vol. SPIE 1908, pages 25-36, 1993.
[13] D. Swanberg, C.-F. Shu, and R. Jain. Knowledge Guided Parsing in Video Databases. Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases I, Vol. SPIE 1908, pages 13-24, 1993.
[14] B.D. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. DARPA Image Understanding Workshop, pages 121-130, 1981.
[15] D.J. Le Gall. The MPEG Video Compression Algorithm: A Review. In SPIE Conference on Image Processing Algorithms and Techniques II, volume 1452, 1991.
[16] N. Haddadi and C.-C. Jay Kuo. Fast Computation of Motion Vectors for MPEG. Volume 2094, 1993.
[17] A. Akutsu et al. Video Indexing using Motion Vectors. In Proceedings of SPIE Visual Communications and Image Processing, volume 1818, pages 1522-1530. SPIE, 1992.
[18] H.J. Zhang et al. Video Parsing using Compressed Data. In SPIE Symposium on Electronic Imaging Science and Technology: Image and Video Processing II, pages 142-149, 1994.
[19] F. Arman, A. Hsu, and M.-Y. Chiu. Feature Management for Large Video Databases. Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases I, Vol. SPIE 1908, pages 2-12, 1993.
[20] V.M. Bove, Jr., B.D. Granger, and J.A. Watlington. Real-Time Decoding and Display of Structured Video. In IEEE Conference on Multimedia, pages 456-462, 1994.
[21] G. Iyengar and S. Panchanathan. Codec Designs for Image Browsing. In IEEE Conference on Multimedia, pages 473-481, 1994.
[22] J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes. Computer Graphics: Principles and Practice, 2nd Ed. Addison Wesley, 1990.
[23] Computer Graphics Systems Development Corporation, Mountain View, CA, USA. Color Science Library Reference Manual, Version 1.2, 1994.


A Color Conversions

The equations for conversion between the original RGB frame data and the color spaces used in this evaluation are presented here for reference.

A.1 HSV color space

The color model can be represented by a single inverted hexagonal cone with the tip at the bottom representing black and the center of the hexagon at the top representing white. The axis of the cone represents value, while the angle around the hexagon represents hue and the distance from the axis represents saturation. The conversion from the RGB space involves a straightforward algorithm rather than conversion matrices and is given in [22]. Let max be the maximum of R, G and B, and min the minimum. If max is zero, the saturation and hue are 0.0. Otherwise,

saturation = (max - min) / max                            (17)

If max = R:   hue = (G - B) / (max - min)                 (18)
If max = G:   hue = (B - R) / (max - min) + 1/3           (19)
If max = B:   hue = (R - G) / (max - min) + 2/3           (20)

Also,

value = max / 255                                         (21)
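For concreteness, a minimal Python sketch of this conversion is given below, assuming 8-bit R, G, B inputs. The function name is our own, and we scale the difference terms of Eqs. (18)-(20) by 1/6, as in the hexcone algorithm of [22], so that each branch spans one sixth of the hue circle and hue falls in [0, 1).

```python
def rgb_to_hsv(r, g, b):
    """Convert 8-bit R, G, B to (hue, saturation, value), each in [0, 1],
    following Eqs. (17)-(21) with the 1/6 hue scaling of [22]."""
    mx, mn = max(r, g, b), min(r, g, b)
    value = mx / 255.0                                  # Eq. (21)
    if mx == 0:
        return 0.0, 0.0, value                          # black: hue and saturation set to 0.0
    saturation = (mx - mn) / mx                         # Eq. (17)
    if mx == mn:
        return 0.0, saturation, value                   # achromatic: hue undefined, set to 0.0
    if mx == r:
        hue = (g - b) / (6.0 * (mx - mn))               # Eq. (18)
    elif mx == g:
        hue = (b - r) / (6.0 * (mx - mn)) + 1.0 / 3     # Eq. (19)
    else:
        hue = (r - g) / (6.0 * (mx - mn)) + 2.0 / 3     # Eq. (20)
    return hue % 1.0, saturation, value                 # wrap negative hues into [0, 1)
```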

A.2 YIQ color space

The conversion from the NTSC RGB primaries is [1]:

Y = 0.299 R + 0.587 G + 0.114 B                           (22)
I = 0.596 R - 0.274 G - 0.322 B                           (23)
Q = 0.211 R - 0.523 G + 0.312 B                           (24)
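Since the transform is linear, Eqs. (22)-(24) reduce to a single matrix product. A small sketch (array and function names are our own):

```python
import numpy as np

# Rows hold the Y, I and Q coefficients of Eqs. (22)-(24).
RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523,  0.312]])

def rgb_to_yiq(rgb):
    """rgb: array of shape (..., 3) of NTSC RGB values; returns YIQ of the same shape."""
    return np.asarray(rgb, dtype=float) @ RGB_TO_YIQ.T
```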

A.3 XYZ color space

The conversion matrix depends on the alignment white value chosen. The possible reference whites are [23]:

Equal Energy White.
CIE Illuminant A: tungsten lamp at 2856 K.
CIE Illuminant B: noon sunlight.
CIE Illuminant C: overcast sky at noon.
CIE Illuminant D: black body at 6504 K.

Using Illuminant C, the conversion equations are:

X = 0.607 R + 0.174 G + 0.200 B                           (25)
Y = 0.299 R + 0.587 G + 0.114 B                           (26)
Z = 0.000 R + 0.066 G + 1.116 B                           (27)
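A corresponding sketch for Eqs. (25)-(27); the helper defined here (names our own) is reused by the L*a*b*, L*u*v* and Munsell sketches that follow.

```python
import numpy as np

# Rows hold the X, Y and Z coefficients of Eqs. (25)-(27), CIE Illuminant C.
RGB_TO_XYZ_C = np.array([[0.607, 0.174, 0.200],
                         [0.299, 0.587, 0.114],
                         [0.000, 0.066, 1.116]])

def rgb_to_xyz(rgb):
    """rgb: array of shape (..., 3); returns XYZ under CIE Illuminant C."""
    return np.asarray(rgb, dtype=float) @ RGB_TO_XYZ_C.T
```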

A.4 L*a*b* color space

The XYZ space may be conveniently used as an intermediate in the conversion of RGB to the L*a*b* and L*u*v* spaces:

L* = 25 (100 Y / Y_0)^{1/3} - 16                          (28)
a* = 500 [ (X / X_0)^{1/3} - (Y / Y_0)^{1/3} ]            (29)
b* = 200 [ (Y / Y_0)^{1/3} - (Z / Z_0)^{1/3} ]            (30)
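A sketch of Eqs. (28)-(30). The function name is our own; one self-consistent choice of reference white is simply rgb_to_xyz applied to full-scale white, as noted in the docstring.

```python
import numpy as np

def xyz_to_lab(xyz, white):
    """Convert XYZ to L*a*b* following Eqs. (28)-(30).

    xyz   : array of shape (..., 3)
    white : (X0, Y0, Z0) of the reference white (Illuminant C in this report);
            e.g. white = rgb_to_xyz([255.0, 255.0, 255.0]) from the sketch above.
    """
    r = np.asarray(xyz, dtype=float) / np.asarray(white, dtype=float)  # X/X0, Y/Y0, Z/Z0
    cube = np.cbrt                                                     # elementwise cube root
    L = 25.0 * cube(100.0 * r[..., 1]) - 16.0                          # Eq. (28)
    a = 500.0 * (cube(r[..., 0]) - cube(r[..., 1]))                    # Eq. (29)
    b = 200.0 * (cube(r[..., 1]) - cube(r[..., 2]))                    # Eq. (30)
    return np.stack([L, a, b], axis=-1)
```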

A.5 L*u*v* color space

The conversion formulae are:

L* = 25 (100 Y / Y_0)^{1/3} - 16     for Y / Y_0 >= 0.008856     (31)
L* = 903.3 (Y / Y_0)                 for Y / Y_0 < 0.008856      (32)

u* = 13 L* (u' - u'_0)                                           (33)
v* = 13 L* (v' - v'_0)                                           (34)

where

u' = 4X / (X + 15Y + 3Z)                                         (35)
v' = 9Y / (X + 15Y + 3Z)                                         (36)

In the above equations X_0, Y_0, Z_0, u'_0 and v'_0 correspond to the values for the reference white, which was again chosen as CIE Illuminant C.
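A sketch of Eqs. (31)-(36), mirroring the L*a*b* helper above; the guard against a zero denominator for pure black pixels is our own addition.

```python
import numpy as np

def xyz_to_luv(xyz, white):
    """Convert XYZ to L*u*v* following Eqs. (31)-(36).

    xyz   : array of shape (..., 3)
    white : (X0, Y0, Z0) of the reference white (CIE Illuminant C in this report).
    """
    xyz = np.asarray(xyz, dtype=float)
    X, Y, Z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    X0, Y0, Z0 = np.asarray(white, dtype=float)

    yr = Y / Y0
    L = np.where(yr >= 0.008856,
                 25.0 * np.cbrt(100.0 * yr) - 16.0,   # Eq. (31)
                 903.3 * yr)                          # Eq. (32)

    def uv(x, y, z):
        d = x + 15.0 * y + 3.0 * z
        d = np.where(d == 0.0, 1.0, d)                # guard against pure black
        return 4.0 * x / d, 9.0 * y / d               # Eqs. (35), (36)

    up, vp = uv(X, Y, Z)
    up0, vp0 = uv(X0, Y0, Z0)
    u = 13.0 * L * (up - up0)                         # Eq. (33)
    v = 13.0 * L * (vp - vp0)                         # Eq. (34)
    return np.stack([L, u, v], axis=-1)
```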

A.6 Munsell color space

The mathematical method for transformation from RGB data to the Munsell system (Mathematical Transform to Munsell, or MTM) presented in [3] was used. The RGB color data is first converted to the XYZ space as before. Then the following steps of computation are followed. Let

X_c = 1.020 X                                             (37)
Z_c = 0.8472 Z                                            (38)
H_1 = V(X_c) - V(Y)                                       (39)
H_2 = V(Z_c) - V(Y)                                       (40)

where

V(x) = 11.6 x^{1/3} - 1.6                                 (42)

Let

M_1 = H_1                                                 (43)
M_2 = 0.4 H_2                                             (44)
M_3 = 0.23 H_3                                            (45)
theta = tan^{-1}(M_1 / M_2)                               (46)

and

S_1 = (a + b cos(theta)) M_1                              (48)
S_2 = (c + d sin(theta)) M_2                              (49)

where

a = 8.880, b = 0.966                                      (51)
c = 8.025, d = 2.558                                      (52)

Then,

Hue = tan^{-1}(S_1 / S_2)                                 (54)
Value = M_3                                               (55)
Chroma = (S_1^2 + S_2^2)^{1/2}                            (56)
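A sketch of the MTM computation (helper names our own). The printed appendix does not preserve a definition of H_3; following the usual description of the MTM we assume H_3 = V(Y), and that assumption is flagged in the code.

```python
import numpy as np

def munsell_value(t):
    """V(x) of Eq. (42); expects tristimulus values normalized to [0, 1]."""
    return 11.6 * np.cbrt(t) - 1.6

def xyz_to_munsell_hvc(xyz):
    """MTM conversion following Eqs. (37)-(56).

    xyz : array of shape (..., 3) with X, Y, Z normalized to [0, 1]
          (e.g. rgb_to_xyz(rgb / 255.0) with the earlier sketch).
    """
    xyz = np.asarray(xyz, dtype=float)
    X, Y, Z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    Xc, Zc = 1.020 * X, 0.8472 * Z                    # Eqs. (37), (38)
    H1 = munsell_value(Xc) - munsell_value(Y)         # Eq. (39)
    H2 = munsell_value(Zc) - munsell_value(Y)         # Eq. (40)
    H3 = munsell_value(Y)                             # ASSUMPTION: H3 taken as V(Y)
    M1, M2, M3 = H1, 0.4 * H2, 0.23 * H3              # Eqs. (43)-(45)
    theta = np.arctan2(M1, M2)                        # Eq. (46), quadrant-aware arctangent
    a, b, c, d = 8.880, 0.966, 8.025, 2.558           # Eqs. (51), (52)
    S1 = (a + b * np.cos(theta)) * M1                 # Eq. (48)
    S2 = (c + d * np.sin(theta)) * M2                 # Eq. (49)
    hue = np.arctan2(S1, S2)                          # Eq. (54)
    value = M3                                        # Eq. (55)
    chroma = np.sqrt(S1 ** 2 + S2 ** 2)               # Eq. (56)
    return np.stack([hue, value, chroma], axis=-1)
```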

A.7 Opponent color axes color space

The RGB space may be converted into an opponent color axes space defined as:

rg = R - G                                                (58)
by = 2B - R - G
wb = R + G + B

This space is presented in [4] as a useful space for performing video indexing.
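A one-function sketch of these axes (function name our own):

```python
import numpy as np

def rgb_to_opponent(rgb):
    """Opponent color axes of Eq. (58): (rg, by, wb)."""
    rgb = np.asarray(rgb, dtype=float)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([R - G, 2.0 * B - R - G, R + G + B], axis=-1)
```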

B Handout to Volunteers

This is the handout to the human evaluators providing the ground truth for our database.

Generating Ground Truth

We are evaluating various image recognition algorithms for detecting changes in video sequences. We need your input to evaluate their performance against human judgment in detecting a change in a scene. You will be shown video clips of a few (from 1 to 15) minutes duration at different speeds. These clips may be a mix of news broadcasts, commercials, music videos or sitcoms. When seeing these clips you should watch for significant changes in the composition of the video frame. These changes may vary for each type of program that you are viewing. Examples of scene changes are:

- In a news sequence, a change in the anchor person.
- Any significant change in the image, such as the addition of graphics (possibly in a music video).
- Movement from an indoor location to an outdoor location.
- A sudden change in camera viewpoint causing a significant change, for example, a sudden change from a wide-angle scene to the close-up of an actor.
- A change in the color or appearance of the frame or of dominant objects in the frame.

When you detect a change in the scene, press the mouse button. This will record the point in the sequence at which you marked a scene change. We will use these marked points as data for our comparison. To aid your judgment, you may be asked to view the sequence zero, one or two times before you mark the scene changes. After you have marked a sequence you will be asked to rate your confidence in your marking on a scale of 1 to 5, where

1 = not at all confident
2 = not very confident
3 = somewhat confident
4 = quite confident
5 = very confident.

Give your response depending on whether you feel you missed any scene cuts (as might happen if the sequence was a fast-moving music video, for instance) or whether you marked a cut by mistake.


C Storyboard Representations

The storyboard representations for two sequences (a and f, see Table 1), generated by the storyboard creation algorithm described in Section 3.6.3, appear on the following sheets.


D Additional Tables

The following pages contain tables not shown in the results section.

