Boosting Multiple Experts by Joint Optimization of Decision Thresholds¹

J. Kittler, Y. Yusoff, W. Christmas, T. Windeatt, and D. Windridge

Center for Vision, Speech and Signal Processing, School of Electronics, Computing, and Mathematics, University of Surrey, Guildford GU2 7XH, United Kingdom
e-mail: [email protected]

Received March 18, 2001

Abstract—We consider a multiple classifier system which combines the hard decisions of experts by voting. We argue that the individual experts should not set their own decision thresholds; the respective thresholds should be selected jointly, as this allows the weaknesses of some experts to be compensated for by the relative strengths of the others. We perform the joint optimization of decision thresholds for a multiple expert system by a systematic sampling of the multidimensional decision threshold space. We show the effectiveness of this approach on the important practical application of video shot cut detection.

¹ This paper was submitted by the authors in English.

1. INTRODUCTION

Among the many combination rules suggested in the literature [1–25, 30], voting is very popular. It operates on the class labels assigned to each pattern by the respective experts, obtained by hardening their soft decision outputs with the maximum value selector. The vote rule output is a function of the votes received for each class in terms of these single expert class labels. Many versions of the vote combination rule exist, such as the unanimous vote, threshold voting, weighted voting, and simple majority voting [14, 20]. In addition to these basic rules, the authors in [20] propose two voting methods claimed to outperform majority voting. The first method assigns a pattern to a class by a unanimity vote; otherwise, the sample is rejected. In the second method, the authors propose that the winning class be the one with the highest vote, such that its vote count exceeds the second largest by a particular threshold. Lam and Suen [14] give a comprehensive analysis of the behavior of the majority vote under the assumption of conditional independence of the experts. They show that majority voting with an odd number of experts produces the highest recognition rate, while voting with an even number of experts produces a better result only when errors are more costly than rejections.

In this paper, we argue that individual experts should not be allowed to set their own decision thresholds. These should be selected jointly, as in this manner the respective weaknesses of some experts may be compensated for by the relative strengths of the others. We perform the joint optimization of decision thresholds for a multiple expert system by a systematic sampling of the multidimensional decision threshold space.


We show the effectiveness of this approach on the important practical application of video shot cut detection. In implementing it, five different experts are deployed to express opinions as to whether the visual content of video material remains the same or has changed from one frame to another.

The paper is organized as follows. In the next section, we introduce the necessary formalism and develop the basic theory of classifier combination by joint optimization of the thresholds of the voting experts. The application of the proposed methodology to the problem of video shot cut detection is discussed in Section 3. In Section 4, we draw the paper to a conclusion.

2. THEORETICAL FRAMEWORK

Consider a pattern recognition problem where pattern Z is to be assigned to one of the m possible classes {ω_1, …, ω_m}. Let us assume that we make R vector observations x_i, i = 1, …, R, on the given pattern and that the ith measurement vector is the input to the ith expert modality. We shall assume that these observations are provided by different logical sensors. Logical sensors could, of course, also generate features that are correlated; however, for the sake of simplicity, we shall assume that the vectors of measurements extracted by different logical sensors are conditionally statistically independent. Although this assumption need not hold fully in practice, the effectiveness of the method depends on the expert outputs exhibiting diversity.

In the measurement space, each class ω_k is modeled by the probability density function p(x_i | ω_k), and its a priori probability of occurrence is denoted by P(ω_k). We shall consider the models to be mutually exclusive, meaning that only one model can be associated with each pattern. Now, according to Bayesian theory, given the measurements x_i, i = 1, …, R, the pattern Z should be assigned to class ω_j (i.e., its label θ should assume the value θ = ω_j) on the condition that the a posteriori probability of that interpretation is maximum; i.e.,

$$\text{assign } \theta \to \omega_j \quad \text{if} \quad P(\theta = \omega_j \mid x_1, \ldots, x_R) = \max_{k} P(\theta = \omega_k \mid x_1, \ldots, x_R). \tag{1}$$

It has been shown elsewhere [13] that, under the assumption of independence, the decision rule (1) can be expressed or approximated as

$$\text{assign } \theta \to \omega_j \quad \text{if} \quad P(\theta = \omega_j) \prod_{i=1}^{R} \frac{P(\theta = \omega_j \mid x_i)}{P(\theta = \omega_j)} = \max_{k=1}^{m} P(\theta = \omega_k) \prod_{i=1}^{R} \frac{P(\theta = \omega_k \mid x_i)}{P(\theta = \omega_k)}, \tag{2}$$

which combines the individual classifier outputs in terms of a product. Under the additional assumption of the sensor measurement information content being low, the decision rule becomes

$$\text{assign } \theta \to \omega_j \quad \text{if} \quad (1 - R)\,P(\theta = \omega_j) + \sum_{i=1}^{R} P(\theta = \omega_j \mid x_i) = \max_{k=1}^{m} \Big[ (1 - R)\,P(\theta = \omega_k) + \sum_{i=1}^{R} P(\theta = \omega_k \mid x_i) \Big], \tag{3}$$

which combines the expert outputs in terms of a sum. In (2) and (3), P(θ = ω_k | x_i) is the kth class a posteriori probability computed by each of the R classifiers. In the following discussion, we shall focus on the benevolent fusion strategy represented by (3). It is referred to as benevolent because it is less sensitive to estimation errors than the product rule in (2), as shown in [13]. The sum rule computation is captured schematically in Fig. 1.

Fig. 1. Sum fusion rule (each expert i computes P(ω | x_i); the outputs are summed, and the maximum selector yields the decision).

Although this fusion rule operates directly on the soft decision outputs of the experts, it has recently been shown in [26] that, for heavy-tailed error distributions, the vote fusion rule, which combines multiple expert outputs hardened by the max operation, gives better performance. This combination rule, shown in Fig. 2, also has the advantage that it can be utilized by multiple classifiers that do not compute or provide access to the a posteriori class probabilities.

Fig. 2. Vote fusion rule (each expert's output P(ω | x_i) is hardened into a vote; the votes are summed, and the maximum selector yields the decision).

Let us consider the vote fusion rule in the two-class case. The ith expert will cast vote v(ω_j | x_i) for class ω_j according to

$$v(\omega_j \mid x_i) = \begin{cases} 1 & \text{if } P(\theta = \omega_j \mid x_i) = \max_{k=1}^{2} P(\theta = \omega_k \mid x_i) \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

The vote decision rule is then given as

$$\text{assign } \theta \to \omega_j \quad \text{if} \quad \sum_{i=1}^{R} v(\omega_j \mid x_i) = \max_{k=1}^{m} \sum_{i=1}^{R} v(\omega_k \mid x_i). \tag{5}$$
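To make the two fusion strategies concrete, the following sketch implements the sum rule (3) and the hardened vote rule (4)–(5) for a small bank of experts. It is our minimal illustration, not code from the paper; the function names `sum_rule` and `vote_rule` are ours.

```python
import numpy as np

def sum_rule(posteriors, priors):
    """Sum fusion, Eq. (3): score class k by
    (1 - R) * P(w_k) + sum_i P(w_k | x_i), then take the maximum."""
    R = posteriors.shape[0]                  # posteriors: R experts x m classes
    scores = (1 - R) * priors + posteriors.sum(axis=0)
    return int(np.argmax(scores))

def vote_rule(posteriors):
    """Vote fusion, Eqs. (4)-(5): harden each expert output with the max
    selector, then assign the class that collects the most votes."""
    hard_labels = np.argmax(posteriors, axis=1)          # one label per expert
    votes = np.bincount(hard_labels, minlength=posteriors.shape[1])
    return int(np.argmax(votes))

# Three experts, two classes; rows are per-expert posteriors for (w1, w2).
P = np.array([[0.6, 0.4],
              [0.8, 0.2],
              [0.3, 0.7]])
priors = np.array([0.5, 0.5])
print(sum_rule(P, priors))   # 0: the summed evidence favours w1
print(vote_rule(P))          # 0: two of the three experts vote for w1
```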

Note that in Eq. (4) we implicitly use a fixed decision threshold for hardening each expert output before fusion. It is therefore conceivable that the fused system performance could be enhanced by allowing these thresholds to be optimized. In the absence of knowledge of the a posteriori probability distributions, this can be done empirically by sampling the multidimensional parametric space of thresholds and evaluating the performance at each sampled point using an independent set of data. Once these thresholds t_i, i = 1, …, R, have been determined, the fusion rule, schematically represented in Fig. 3, derives the optimal decision using (5) with v(ω_1 | x_i) defined as

$$v(\omega_1 \mid x_i) = \begin{cases} 1 & \text{if } P(\theta = \omega_1 \mid x_i) \ge t_i \\ 0 & \text{otherwise;} \end{cases} \tag{6}$$

v(ω_2 | x_i) is then simply computed as

$$v(\omega_2 \mid x_i) = 1 - v(\omega_1 \mid x_i). \tag{7}$$

Fig. 3. Vote fusion rule with jointly optimised thresholds.

In the practical application of this approach, described in detail in the next section, we evaluate a dissimilarity measure d(θ = ω_j | x_i), j = 1, 2, instead of the a posteriori class probabilities. Thus, Eq. (6) is replaced by

$$v(\omega_1 \mid x_i) = \begin{cases} 1 & \text{if } d(\theta = \omega_1 \mid x_i) \le t_i \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$
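The thresholded vote of Fig. 3, in the dissimilarity form of Eq. (8), then reduces to a few lines. The sketch below is again our illustration, not the authors' code; the dissimilarity and threshold values are placeholders.

```python
def harden(dissimilarities, thresholds):
    """Eq. (8): expert i votes for class w1 iff d_i <= t_i;
    Eq. (7) gives the complementary vote for w2."""
    return [1 if d <= t else 0 for d, t in zip(dissimilarities, thresholds)]

def fuse(dissimilarities, thresholds):
    """Eq. (5) in the two-class case: the majority of hardened votes wins."""
    v1 = sum(harden(dissimilarities, thresholds))   # votes for w1
    v2 = len(thresholds) - v1                       # votes for w2
    return "w1" if v1 > v2 else "w2"

# Five experts with jointly selected thresholds t_1, ..., t_5 (placeholders):
d = [0.12, 0.30, 0.05, 0.40, 0.22]
t = [0.20, 0.25, 0.10, 0.50, 0.18]
print(fuse(d, t))   # "w1": three of the five experts vote for w1
```

The point of the joint optimization is that the vector (t_1, …, t_R) is chosen as a whole for the best fused performance, rather than each t_i being tuned for its own expert in isolation.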

In the following section, we apply the proposed method to the problem of video shot cut detection to demonstrate its effectiveness. 3. VIDEO SHOT CUT DETECTION 3.1. Introduction The partitioning of video sequences into shots is considered an integral part of indexing and annotating video [31–40]. Shot boundary detection is often the starting point in constructing a content-based video

indexing system. Its primary aim is to remove the temporal redundancy of frames recording the content of a scene from a relatively stable viewpoint. Shot cut detection is a prerequisite for any attempts to exploit the hierarchical nature of digital video—this hierarchy consists of the whole video sequence at the top which can be broken down into segments, then, scenes followed by shots and, finally, the individual frames. Apart from the individual frames which make up the video sequence, the shot is the lowest denominator within the hierarchical structure. In [27], the authors described the shot as the fundamental film component, and Picard [28] described the shot as an unbroken sequence of frames from one camera. The moment of change from one shot to another, the shot boundary, can be created in several ways. The simplest of these is the camera cut. Figure 4 shows an example of this. Frames (a) and (b) belong to the same shot and frames (c) and (d) to another. The content of the frames (b) and (c) are unrelated. There are other effects used for the demarcation of a shot boundary such as the cross-fade (dissolve) or zoom. While the change for a shot cut takes place between two frames, other shot boundary effects take place over a number of frames. The number of frames in which the shot change occurs depends on the producer of the video sequence. In this paper, we restrict ourselves to the study of shot cuts only. There are several approaches to the problem as discussed in [29]. We can define a shot detection method as a process or system that employs a dissimilarity measure over some feature of the video sequence. Shot cuts are deemed to be detected if an adopted dissimilarity measure computed between two consecutive frames exceeds a specified threshold. For a shot detection method to be successful, it needs to be as accurate as possible. This accuracy is normally measured in terms of the percentage of true shot changes that it is able to detect, as well as the number of false positives. Needless to say, the choice of a threshold directly affects these performance measures. In the work reported in [29], we studied a selection of these methods and evaluated their individual strengths and weaknesses. While each method had the capability of performing quite well, there was still

Fig. 4. Example of a camera cut/break. PATTERN RECOGNITION AND IMAGE ANALYSIS

531

Vol. 11

No. 3

2001

532

KITTLER et al.

scope for improvement. In particular, it became clear that different methods performed well in diverse circumstances. In other words, no simple approach outperformed another in all situations. Rather, the superiority of a particular method was data dependent. This immediately suggested the potential benefit of using these various methods together; to fuse them in such a way that the strengths of each method are consolidated and the weaknesses muted.

3.2. Video Shot Cut Experts

Five separate algorithms are used to detect shot changes. These algorithms calculate different features of the video data and can by themselves be used as stand-alone shot boundary detection systems. The methods are as follows.

1. Average Intensity Measurement [AIM]. We implemented this method based on that suggested in [37]. The algorithm computes the average of the intensity values for each component (YUV, RGB, etc.) in the current frame and compares it with that of the following frame. This is then divided by the value of the comparison of the current frame with that of the previous frame.

2. Euclidean Distancing [ED]. For this method, we divide the frames into blocks and perform the discrete cosine transform (DCT) on each block. In [38], the authors observed that a Euclidean distance measure can be used to calculate the similarity between two images by comparing the mean of the DC values over all the blocks in a frame. Thus, we use the DC coefficients from the DCT calculations for each component (luminance and chrominance) of successive frames for the operation.

3. Histogram Comparison [HC]. The histogram of a frame gives the distribution of the intensities within the frame. By comparing the histograms of successive frames, we obtain a measure of their similarity. A number of shot cut detection techniques proposed in the literature are based on histogram comparison [39–41]. Histogram comparison methods are quite popular because they are fast; in addition, some researchers prefer such methods because they are motion insensitive. Our implementation is similar to that detailed in [40], extended to include the color components as well (a sketch is given at the end of this subsection).

4. Likelihood Ratio [LH]. This algorithm [42] generates a measure of the likelihood that two corresponding regions are similar. Each region is represented by its second-order statistics under the assumption that this property remains constant over the region. We divide the frames into blocks and carry out the likelihood ratio calculation over the blocks.

5. Motion Estimation / Prediction Error [ME]. In this method, we estimate the next frame in a video sequence from the motion information in the current frame and then reconstruct the next frame using the motion estimation vectors. Essentially, we predict what the next frame should look like given the information that we have; the prediction error of the reconstructed frame gives us a measure of how far off our prediction is. The motion estimation is similar to that used in current video coding standards: for our implementation, we used the block-based n-step search algorithm with a ±2^n search window, as described in [43]. After the motion estimation is done, the motion vectors are used to construct the next frame in the sequence. To obtain the prediction error, the absolute differences between the reconstructed frame and the original frame are calculated and summed (a sketch is given at the end of this subsection).

As mentioned earlier, these methods calculate a dissimilarity measure which we can use to make shot cut decisions; the dissimilarity measure is the response of the system when given two consecutive frames as input. We can then plot a response graph of the values generated by each method over a whole video sequence (Fig. 5). The rationale behind using a dissimilarity measure is that we expect two consecutive frames belonging to the same shot to produce a low value; conversely, for two consecutive frames from different shots, the dissimilarity value should be large. Figure 5 shows an example: the peaks in the graph are points where the dissimilarity values are high, corresponding to a high probability that these points are shot change boundaries.

Fig. 5. Example of a response graph (likelihood ratio against frame number) using the Likelihood Ratio method.

Each of these methods employs a global threshold at the final stage of the processing to make the shot change boundary decision. The choice of the thresholds is not easily specified. In view of this, we constructed a receiver operating characteristic (ROC) curve for each method, obtained by setting the thresholds to various possible values.
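To make the experts concrete, minimal sketches of two of them follow. First, the histogram comparison (HC) expert: the paper states only that its implementation is similar to [40] with the color components added, so the bin-wise absolute-difference form, the bin count, and the normalization below are our assumptions.

```python
import numpy as np

def hc_dissimilarity(frame_a, frame_b, bins=64):
    """Histogram comparison (HC) expert: the summed absolute bin-wise
    difference between the histograms of two consecutive frames,
    accumulated over the color components (frames are H x W x C arrays
    with values in 0..255). The bin-wise |h_a - h_b| form is the common
    choice in [40]-style detectors; the details are our assumption."""
    score = 0.0
    for c in range(frame_a.shape[2]):
        h_a, _ = np.histogram(frame_a[..., c], bins=bins, range=(0, 255))
        h_b, _ = np.histogram(frame_b[..., c], bins=bins, range=(0, 255))
        score += float(np.abs(h_a - h_b).sum())
    # Normalizing by the frame area makes the threshold resolution independent.
    return score / float(frame_a.shape[0] * frame_a.shape[1])
```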

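Second, the motion estimation (ME) expert. The paper uses the block-based n-step search of [43]; this sketch substitutes a plain exhaustive block search for brevity (our simplification) and returns the summed absolute prediction error as the dissimilarity response.

```python
import numpy as np

def me_prediction_error(cur, nxt, block=8, search=4):
    """Motion estimation (ME) expert: predict each block of the next frame
    by the best-matching block of the current frame within a +/-search
    window, then return the summed absolute prediction error. Exhaustive
    search is used here for brevity; the paper uses the n-step search [43]."""
    H, W = cur.shape                      # grayscale frames, H x W
    cur = cur.astype(np.float64)
    nxt = nxt.astype(np.float64)
    pred = np.zeros_like(nxt)
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            target = nxt[y:y + block, x:x + block]
            best_err, best_pred = np.inf, None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= H - block and 0 <= xx <= W - block:
                        cand = cur[yy:yy + block, xx:xx + block]
                        err = np.abs(cand - target).sum()
                        if err < best_err:
                            best_err, best_pred = err, cand
            pred[y:y + block, x:x + block] = best_pred
    return float(np.abs(nxt - pred).sum())
```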


Fig. 6. The ROC curves for the video sequences: (a) Cartoon, (b) CHF, (c) DV, (d) Rugby, (e) SKY, (f) Superman. Each panel plots undetected true positives P_u against false positives P_f for the AIM, ED, HC, LH, and ME experts, with the equal error line marked.

Table 1 shows the data set used for the experiments. The CARTOON sequence is a collection of cartoon animations. The CHF sequence is a children's program. The DV sequence is a collection of daytime soap operas and documentaries. The RUGBY sequence is a rugby union match. The SKY sequence is a news program, and the SUPERMAN sequence is a clip from the TV show "Superman." The sequences CHF, DV, SKY, and SUPERMAN are what we term "real-world" sequences. We exclude CARTOON from this grouping for obvious reasons, and RUGBY because it is a high-speed, rapidly shot-changing sports program. We will explain our rationale for this division in later sections.


Fig. 7. Plots of the operating points produced by a single fixed threshold per expert on the different sequences (axes P_f vs. P_u): (a) AIM, threshold = 2.80; (b) ED, threshold = 1.31; (c) HC, threshold = 0.170; (d) LH, threshold = 1.260; (e) ME, threshold = 33.0. Points are marked for the CARTOON, CHF, DV, RUGBY, SKY, and SUPERMAN sequences.

To construct the ROC curves, we calculate the proportion of undetected true shot boundaries p_u and the proportion of incorrectly identified shot boundaries p_f:

$$p_u = \frac{S_u}{S_a}, \tag{9}$$

$$p_f = \frac{S_f}{S_a}, \tag{10}$$

where S_u is the number of undetected true shot boundaries, S_f is the number of falsely identified ones, and S_a is the number of actual shot boundaries. We then set the thresholds to different values and plot p_u against p_f. Note that none of the experts requires any parameters to be set apart from the threshold, and, therefore, none needs any training. We plot the ROC curves of the five algorithms for each video sequence (Fig. 6).

With the help of the ROC curves, we can decide on a particular operating point and obtain the corresponding threshold for each expert. For example, in terms of equal error performance, we would choose the threshold corresponding to the point on the curve nearest to the origin, since we would ideally like to minimize both p_u and p_f. We can observe that there is a variation in the performance of the algorithms over the different sequences. We can expose these variations further by plotting the operating points of all the algorithms for one fixed threshold value each, as demonstrated in Fig. 7.
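The construction of (9)–(10) amounts to a simple threshold sweep. The sketch below is ours, with assumed data structures (a list of per-transition responses and a set of ground-truth cut positions); it computes the sampled ROC points and picks the equal-error point nearest the origin.

```python
def roc_points(responses, true_cuts, thresholds):
    """responses[k] is the expert's dissimilarity between frames k and k+1;
    true_cuts is the set of positions k with a real shot cut.
    Returns one (p_f, p_u) pair per threshold, per Eqs. (9)-(10)."""
    S_a = len(true_cuts)
    points = []
    for t in thresholds:
        detected = {k for k, r in enumerate(responses) if r > t}
        S_u = len(true_cuts - detected)        # missed true boundaries
        S_f = len(detected - true_cuts)        # false alarms
        points.append((S_f / S_a, S_u / S_a))
    return points

def equal_error_point(points):
    """The sampled operating point nearest the origin of the (p_f, p_u) plane."""
    return min(points, key=lambda p: p[0] ** 2 + p[1] ** 2)
```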


Fig. 8. Example of response graphs for the LH and ME algorithms, I: frames (a)–(f) and the responses over frames 3600–5200, (g) LH with threshold = 1.28 and (h) ME with threshold = 32; shot boundary peaks A, B, and C are marked.

For the four "real-world" sequences, the best individual expert would be either the HC method or the LH method. The observation of this inconsistent behavior has led other authors to exploit multiple experts in tandem. For example, two methods were used in [44] to create a generalized sequence trace of the video sequence. The two features used were the luminance histogram difference and the standard deviation difference; the trace is defined as the sum of the square roots of the differences between two frames for each feature. Then, using a technique based on mathematical morphology, the authors constructed what they termed a morphological Laplacian graph for the sequence. The morphological Laplacian essentially calculates the difference between the gradient of dilation and the gradient of erosion of the generalized sequence trace, which corresponds to an approximation of the second derivative of that trace. The zero crossings of the graph indicate shot boundaries; a threshold is applied to distinguish zero crossings due to true shot boundaries from those due to noise.

In [45], the authors proposed a two-step shot detection strategy whereby a histogram comparison method was used in the first step and a likelihood ratio method was selectively used in the second step. In their implementation, the histogram comparison results were subject to two thresholds, TH and TL, TH being the higher. If the comparison result was higher than TH, a cut was declared immediately. If, on the other hand, it was lower than TH but higher than TL, a likelihood ratio operation was carried out; if the likelihood ratio was above a set threshold TR, a cut was declared (a sketch of this cascade follows Table 1).

In another work [46], a shot detection scheme employing two algorithms was described. Using histogram comparison as well as a pixelwise differencing algorithm (similar to AIM), the authors employed a K-means clustering algorithm to classify the results into two clusters. Following this, an elimination step based on a heuristic observation was employed to reduce the number of false positives.

Table 1. Video sequences used in the experiments (format: QCIF, 176 × 144, YUV 4:2:0; frame rate: 25 fps)

Name        No. of frames   Time, min   No. of shot cuts
CARTOON     41750           27.8        256
CHF         36007           24          199
DV          36000           24          268
RUGBY       40490           27          257
SKY         45000           30          289
SUPERMAN    44921           29.9        404
Total                                   1673

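For illustration, the two-step strategy of [45] described above can be written as a short cascade. This is our sketch: `hist_diff` and `likelihood_ratio` stand for the respective dissimilarity functions, and all threshold values are placeholders, not values from [45].

```python
def two_step_cut(frame_a, frame_b, hist_diff, likelihood_ratio,
                 TH=0.5, TL=0.2, TR=1.5):
    """Cascade of [45]: declare a cut immediately if the histogram
    difference exceeds the high threshold TH; if it falls between TL and
    TH, defer to the likelihood ratio and declare a cut when it exceeds
    TR. Threshold values here are placeholders."""
    h = hist_diff(frame_a, frame_b)
    if h > TH:
        return True                  # confident cut
    if h > TL:                       # ambiguous: run the second expert
        return likelihood_ratio(frame_a, frame_b) > TR
    return False                     # confidently not a cut
```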


Fig. 9. Example of response graphs for the LH and ME algorithms, II: frames (a)–(f) and the responses over frames 30900–31300, (g) LH with threshold = 1.28 and (h) ME with threshold = 32; shot boundary peaks A, B, and C are marked.

The three works mentioned above [44–46] thus do not generalize to the combination of arbitrary experts.

3.3. Experimental Results

The graphs in Figs. 8 and 9 show the responses of the LH and ME algorithms over different sections of the same video sequence. The pairwise frames (a) and (b), (c) and (d), and (e) and (f) correspond to the respective shot boundary peaks A, B, and C on the graphs. The horizontal line shows our optimum equal error threshold for each algorithm. In Fig. 8, we can see that there are two peaks (B and C) above the threshold in LH that are below the threshold in ME. Correspondingly, in Fig. 9, the opposite holds: there are two peaks (again B and C) above the threshold in ME that are under the threshold in LH.

It is also worth noting that in both examples there are peaks (marked as A) that are below the thresholds of both algorithms. Our manual inspection shows that these peaks are at bona fide shot boundaries. The figures are representative of the situations we would like to improve upon by using experts in a cooperative manner: detecting shot boundaries such as B and C, where not all of the experts are in agreement, as well as exploiting suboptimal threshold positions to detect shot boundaries of type A.

This requires the joint optimization of the thresholds (operating points) of the cooperating experts. Since the space of all operating points is n-dimensional and therefore difficult to explore exhaustively, we optimize the CME (Cooperating Multiple Experts) system by sampling the space quite coarsely. To this end, we select threshold values at five points on the ROC curve of each algorithm. These threshold values are taken at the points p1, p2, p3, p4, and p5, as shown in Fig. 10.

Fig. 10. Threshold values at the points specified by the lines p_n, n = 1, …, 5, marked along the ROC curve (DV sequence).


Fig. 11. Close-up (the region near the origin, P_f, P_u ≤ 0.40) of the CME combinations (CMB) plotted against the ROC curves of the individual experts AIM, ED, HC, LH, and ME for (a) Cartoon, (b) CHF, (c) DV, (d) Rugby, (e) SKY, (f) Superman.

We use a base-N codeword designation to identify our threshold combinations. Since there are n experts, each with N possible thresholds, there are N^n possible combinations of threshold values; for our N = 5 and n = 5, this gives 3125 possible combinations. For example, from Table 2, the value 04322 would mean that we use the results from the AIM method using a threshold value of 1.5, ED using 5.91, HC using 0.170, LH using 1.240, and ME using 33.0.

Each of the individual methods signals a shot change at response values above its given threshold. The CME method signals a shot change only when the majority of the algorithms signal one (the search over combinations is sketched below). As we wish to find an optimal setting of expert operating points that yields the best CME performance, we require training data. We used the DV sequence as our training set; i.e., we took the values of the thresholds from the ROC curves generated from this set.
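A sketch of the coarse search, under the assumption that per-expert response series are available: each codeword digit selects one of the five candidate thresholds (n = 0–4) of the corresponding expert from Table 2, and the CME fires at a frame transition when a majority of the experts exceed their selected thresholds. The function and variable names are ours.

```python
from itertools import product

EXPERTS = ["AIM", "ED", "HC", "LH", "ME"]
# Rows of Table 2: five candidate thresholds (n = 0..4) per expert.
TABLE2 = {
    "AIM": [1.5, 1.9, 2.8, 5.15, 13.2],
    "ED":  [0.78, 0.88, 1.31, 2.11, 5.91],
    "HC":  [0.095, 0.111, 0.170, 0.285, 0.452],
    "LH":  [1.105, 1.160, 1.240, 1.560, 2.220],
    "ME":  [25.5, 29.5, 33.0, 39.5, 55.0],
}

def cme_detect(responses, codeword):
    """responses[e][k]: dissimilarity of expert e between frames k and k+1.
    A cut is signalled at k when a majority of the experts exceed their
    codeword-selected thresholds."""
    thresholds = {e: TABLE2[e][digit] for e, digit in zip(EXPERTS, codeword)}
    n_transitions = len(next(iter(responses.values())))
    cuts = []
    for k in range(n_transitions):
        votes = sum(responses[e][k] > thresholds[e] for e in EXPERTS)
        if votes > len(EXPERTS) // 2:        # majority vote
            cuts.append(k)
    return cuts

# All 5**5 = 3125 codewords, e.g. (0, 4, 3, 2, 2) for "04322":
all_codewords = list(product(range(5), repeat=5))
```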


Fig. 12. Operating points over a range of threshold values on the (P_f, P_u) axes for (a) AIM and (b) HC, marked for the Cartoon, CHF, DV, Rugby, SKY, and Superman sequences.

Fig. 13. Combined results of the "real-world" sequences (CHF + DV + SKY + Superman): CME combinations (CMB) against the ROC curves of the AIM, ED, HC, LH, and ME experts.

The CME algorithm was then applied to the training set as well as to the remaining data sets. For each of the combinations, we calculate the values of p_u and p_f. Figure 11 is a plot of the CME algorithm against the ROC curves that we had constructed earlier, showing the region near the origin. Each point on the graph corresponds to a different CME combination.

Table 2. Threshold values used in the CME algorithm

Expert   n = 0    n = 1    n = 2    n = 3    n = 4
AIM      1.5      1.9      2.8      5.15     13.2
ED       0.78     0.88     1.31     2.11     5.91
HC       0.095    0.111    0.170    0.285    0.452
LH       1.105    1.160    1.240    1.560    2.220
ME       25.5     29.5     33.0     39.5     55.0

From Fig. 7, noting that the operating point for a given threshold varies from one sequence to another, we would expect the performance of the CME algorithm to display the same behavior. This is demonstrated in Fig. 11; for example, in the CHF sequence, the CME cluster is shifted up and to the left of the graph compared to DV and SKY. For the CARTOON and RUGBY sequences, the variation is even more pronounced.

We can also see in Fig. 11 that, in the case of the real-world sequences, a significant number of points from the CME algorithm are nearer to the origin than those of any of the individual experts. Since the aim is to reduce the equal error rate by multiple expert fusion, it is clear that the proposed scheme achieves that objective for these sequences.


Fig. 14. The optimal CME operating point (combination CME-34100) plotted against the individual experts for (a) CHF, (b) DV, (c) SKY, (d) Superman on the (P_f, P_u) axes.

Looking more closely at the CARTOON sequence, it is observed that the threshold values for a given operating point of each algorithm are higher in comparison to those of the other sequences; this is further illustrated in Fig. 12. As such, the CME algorithm is less effective in this case, with only a few points demonstrating any advantage over the individual experts. In the case of the RUGBY sequence, the performance of the HC algorithm far outstripped the other experts; in addition, the LH and ME algorithms performed especially badly on this particular sequence. Consequently, our experiments demonstrated that the CME algorithm is unable to improve on the performance displayed by HC. In view of this, we can conclude that the CME algorithm is capable of achieving a greater performance gain in detecting shot changes, compared to the individual experts, when applied to "real-world" sequences.

Since each operating point of the CME algorithm represents a combination of operating points of the individual experts, we still need to find an optimal combination. To do this, we combined the results of the CME algorithm and of the individual experts over the four real-world sequences and plotted the CME against the ROC curves again (Fig. 13). Having found the optimal operating point, we applied the combination to the individual sequences again.

From Fig. 14, it is evident that all the points produce better results than any of the individual algorithms. Even though the improvement is not equally dramatic for all the sequences (Fig. 14d being a case in point), we have demonstrated that no "tuning" is required to obtain good performance. Thus, the CME algorithm is also robust to slight shifts in operating points.

4. CONCLUSIONS

We considered a multiple classifier system which combines the hard decisions of experts by voting. We argued that the individual experts should not set their own decision thresholds; the respective thresholds should be selected jointly, as this allows compensation of the weaknesses of some experts by the relative strengths of the others. We performed the joint optimization of decision thresholds for a multiple expert system by a systematic sampling of the multidimensional decision threshold space. We showed the effectiveness of this approach on the important practical application of video shot cut detection.


In this application, five different experts were deployed to express opinions as to whether the visual content of the video material had remained the same or had changed from one frame to another. We demonstrated that the proposed approach significantly increased the true positive shot cut detection rate while reducing the false positive rate (this being the criterion of performance improvement).

ACKNOWLEDGMENTS

The support via EPSRC Grants GR/L61095 and GR/M61320 and via EU Framework V Project Assavid is gratefully acknowledged.

REFERENCES

1. Alexandre, L., Campilho, A., and Kamel, M., Combining Independent and Unbiased Classifiers Using Weighted Average, Proc. ICPR15, IEEE, 2000, vol. 2, pp. 495–498.
2. Ali, K. and Pazzani, M., On the Link between Error Correlation and Error Reduction in Decision Tree Ensembles, Technical Report 95-38, University of California at Irvine, 1995.
3. Alkoot, F.M. and Kittler, J., Multiple Expert System Design by Combined Feature Selection and Probability Level Fusion, Proc. Fusion 2000 Conf., Paris, 2000, vol. 7.
4. Bauer, E. and Kohavi, R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants, Machine Learning, 1998, pp. 1–38.
5. Breiman, L., Bagging Predictors, Machine Learning, 1996, vol. 24, pp. 123–140.
6. Dietterich, T., An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization, Machine Learning, 1998, pp. 1–22.
7. Duin, R.P.W. and Tax, D.M.J., Experiments with Classifier Combining Rules, in Multiple Classifier Systems, Kittler, J. and Roli, F., Eds., Springer, 2000, pp. 16–29.
8. Friedrich, C.M., Ensembles of Evolutionary Created Artificial Neural Networks and Nearest Neighbour Classifiers, in Advances in Soft Computing, Roy, R., Furuhashi, T., and Chawdhry, P.K., Eds., Springer, 1998, vol. 6.
9. Hansen, L.K. and Salamon, P., Neural Network Ensembles, IEEE Trans. Pattern Analysis and Machine Intelligence, 1990, vol. 12, no. 10, pp. 993–1001.
10. Hashem, S. and Schmeiser, B., Improving Model Accuracy Using Optimal Linear Combination of Trained Neural Networks, IEEE Trans. Neural Networks, 1995, vol. 6, no. 3, pp. 792–794.
11. Ho, T.K., Hull, J.J., and Srihari, S.N., Decision Combination in Multiple Classifier Systems, IEEE Trans. Pattern Analysis and Machine Intelligence, 1994, vol. 16, no. 1, pp. 66–75.
12. Kittler, J., Combining Classifiers: A Theoretical Framework, Pattern Analysis and Applications, 1998, vol. 1, pp. 18–27.
13. Kittler, J., Hatef, M., Duin, R., and Matas, J., On Combining Classifiers, IEEE Trans. Pattern Analysis and Machine Intelligence, 1998, vol. 20, no. 3, pp. 226–239.
14. Lam, L. and Suen, C., Application of Majority Voting to Pattern Recognition: An Analysis of Its Behaviour and Performance, IEEE Trans. Systems, Man, and Cybernetics, Part A: Systems and Humans, 1997, vol. 27, no. 5, pp. 553–568.

15. Quinlan, J., Bagging, Boosting and C4.5, Proc. 13th National Conf. on Artificial Intelligence, Portland, OR, AAAI, Menlo Park, CA, 1996, vol. 1, pp. 725–730.
16. Rahman, A.F.R. and Fairhurst, M.C., Enhancing Multiple Expert Decision Combination Strategies through Exploitation of a priori Information Sources, IEE Proc. Vision, Image, and Signal Processing, 1999, vol. 146, no. 1, pp. 40–49.
17. Sharkey, A.J.C., On Combining Artificial Neural Nets, Connection Science, 1996, vol. 8, no. 3, pp. 299–314.
18. Skalak, D.B., Prototype Selection for Composite Nearest Neighbor Classifiers, PhD Thesis, Department of Computer Science, University of Massachusetts at Amherst, 1997.
19. Suen, C., Legault, R., Nadal, C., Cheriet, M., and Lam, L., Building a New Generation of Handwriting Recognition Systems, Pattern Recognition Letters, 1993, vol. 14, pp. 303–315.
20. Xu, L., Krzyzak, A., and Suen, C.Y., Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition, IEEE Trans. Systems, Man, and Cybernetics, 1992, vol. 22, no. 3, pp. 418–435.
21. Yu, K., Jiang, X., and Bunke, H., Lipreading: A Classifier Combination Approach, Pattern Recognition Letters, 1997, vol. 18, no. 11–13, pp. 1421–1426.
22. Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., Classification and Regression Trees, Wadsworth, California, 1984.
23. Kittler, J., Combining Classifiers: A Theoretical Framework, Pattern Analysis and Applications, 1998, vol. 1, pp. 18–27.
24. Wolpert, D.H., Stacked Generalization, Neural Networks, 1992, vol. 5, pp. 241–260.
25. Woods, K.S., Bowyer, K., and Kegelmeyer, W.P., Combination of Multiple Classifiers Using Local Accuracy Estimates, Proc. CVPR96, 1996, pp. 391–396.
26. Kittler, J. and Alkoot, F., Relationship of Sum and Vote Fusion Strategies, Proc. Workshop on Multiple Classifier Systems, 2001 (in press).
27. Davenport, G., Smith, T.A., and Pincever, N., Cinematic Primitives for Multimedia, IEEE Computer Graphics and Applications, 1991, pp. 67–74.
28. Picard, R.W., Light-years from Lena: Video and Image Libraries of the Future, Proc. IEEE Int. Conf. Image Processing, 1995, vol. I, pp. 310–317.
29. Yusoff, Y., Christmas, W., and Kittler, J., A Study on Automatic Shot Change Detection, Proc. 3rd European Conf. Multimedia Applications, Services, and Techniques (ECMAST), Hutchison, D. and Schafer, R., Eds., LNCS, no. 1425, Springer, May 1998, pp. 177–189.
30. Huang, T.S. and Suen, C.Y., Combination of Multiple Experts for the Recognition of Unconstrained Handwritten Numerals, IEEE Trans. Pattern Analysis and Machine Intelligence, 1995, vol. 17, pp. 90–94.
31. Zabih, R., Miller, J., and Mai, K., A Feature-Based Algorithm for Detecting and Classifying Scene Breaks, Proc. ACM Multimedia'95, 1995, pp. 189–200.
32. Zabih, R., Miller, J., and Mai, K., A Feature-Based Algorithm for Detecting and Classifying Production Effects, ACM Multimedia Systems, 1999, vol. 7, no. 2, pp. 119–128.
33. Hanjalic, A. and Zhang, H., Optimal Shot Boundary Detection Based on Robust Statistical Models, Proc. 6th Int. Conf. on Multimedia Computing and Systems (ICMCS), Florence, IEEE, 1999, vol. 2, pp. 710–714.
34. Sethi, I.K. and Patel, N., A Statistical Approach to Scene Change Detection, Proc. IS&T/SPIE Storage and Retrieval for Image and Video Databases III, 1995, vol. 2420, pp. 329–337.
35. Ngo, C.W., Pong, T.C., and Chin, R.T., Camera Break Detection by Partitioning of 2D Spatiotemporal Images in MPEG Domain, Proc. 6th Int. Conf. on Multimedia Computing and Systems (ICMCS), Florence, IEEE, 1999, vol. 2, pp. 750–755.
36. Kim, H., Park, S.-J., Kim, W.M., and Song, S.M.-H., Processing of Partial Video Data for Detection of Wipes, Proc. IS&T/SPIE Conf. on Storage and Retrieval for Image and Video Databases VII, San Jose, CA, 1999, vol. 3656, pp. 280–289.
37. Hampapur, A., Jain, R., and Weymouth, T., Digital Video Segmentation, Proc. ACM Multimedia'94, ACM Press, 1994, pp. 357–364.
38. Vellaikal, A. and Kuo, C.-C.J., Joint Spatial-Spectral Indexing for Image Retrieval, Proc. IEEE Int. Conf. Image Processing, 1996, pp. 867–870.
39. Nagasaka, A. and Tanaka, Y., Automatic Video Indexing and Full-Video Search for Object Appearances, Visual Database Systems II, 1992, pp. 113–127.
40. Zhang, H., Kankanhalli, A., and Smoliar, S.W., Automatic Partitioning of Full-Motion Video, Multimedia Systems, Springer, 1993, vol. 1, pp. 10–28.
41. Lienhart, R., Comparison of Automatic Shot Boundary Detection Algorithms, Proc. IS&T/SPIE Conf. on Storage and Retrieval for Image and Video Databases VII, San Jose, CA, 1999, vol. 3656, pp. 290–301.
42. Kasturi, R. and Jain, R., Eds., Computer Vision: Principles, IEEE Computer Society, 1991.
43. Tekalp, A.M., Digital Video Processing, Prentice-Hall, 1995.
44. Taskiran, C. and Delp, E.J., Video Scene Change Detection Using the Generalized Sequence Trace, Proc. IEEE Int. Conf. Image Processing, 1998, pp. 2961–2964.
45. Dugad, R., Ratakonda, K., and Ahuja, N., Robust Video Shot Change Detection, IEEE Workshop on Multimedia Signal Processing, 1998.
46. Naphade, M.R., Mehrotra, R., Ferman, A.M., Warnick, J., Huang, T.S., and Tekalp, A.M., A High-Performance Shot Boundary Detection Algorithm Using Multiple Cues, Proc. IEEE Int. Conf. Image Processing, 1998, vol. 2, pp. 884–887.

Josef Kittler. Graduated from the University of Cambridge in Electrical Engineering in 1971. Obtained his PhD in Pattern Recognition in 1974 and the ScD degree in 1991, both from the University of Cambridge. Professor at the Department of Electronic and Electrical Engineering of Surrey University, in charge of the Center for Vision, Speech and Signal Processing. His current research interests include pattern recognition, image processing, and computer vision. Author of more than 400 papers and coauthor of a book. Member of the editorial boards of Pattern Recognition Journal; Image and Vision Computing; Pattern Recognition Letters; Pattern Recognition and Artificial Intelligence; Pattern Analysis and Applications; and Machine Vision and Applications.


Bill Christmas. Obtained his PhD in Mathematics from the University of Surrey. Holds a University Fellowship in Technology Transfer in the Center for Vision, Speech, and Signal Processing at the University of Surrey. After studying Engineering Science at the University of Oxford, he spent some years with the British Broadcasting Corporation as a Research Engineer. He then moved to BP Research International as a Senior Research Engineer, working on research topics that included hardware aspects of parallel processing, real-time image processing, and computer vision. Other scientific interests include the integration of machine vision algorithms to create complete applications. Author of more than 20 papers. Currently he is working on projects concerned with region-based video coding and automated, content-based annotation of video and multimedia material.

Yusseri Yusoff. Graduated from the University of Essex in 1997. He is currently a PhD student at the Center for Vision, Speech, and Signal Processing of the University of Surrey, working on video processing.

David Windridge. Obtained his BSc degree in Physics from the University of Durham in 1993 and his PhD in Astronomy from the University of Bristol in 1999. He is now a Research Fellow at the Center for Vision, Speech, and Signal Processing, University of Surrey, where he is working on problems in multiple classifier fusion.

Terry Windeatt. Received the BSc degree in Applied Science from the University of Sussex, the MSc degree in Electronic Engineering from the University of California, and the PhD degree from the University of Surrey. After lecturing in Control Engineering at Kingston University, UK, he worked in the USA on intelligent systems at the Research and Development Departments of General Motors and Xerox Corporation in Rochester, NY (1976–1984). His industrial R&D experience is in modeling/simulation for intelligent automotive and office-copying applications. He now lectures in Machine Intelligence at the Department of Electrical and Electronic Engineering at the University of Surrey. He has worked on various research projects in the Center for Vision, Speech, and Signal Processing, and his current research interests include neural nets, pattern recognition, and computer vision.

