Event Detection and Summarization in Sports Video

Baoxin Li and M. Ibrahim Sezan
Sharp Laboratories of America
5750 N.W. Pacific Rim Blvd., Camas, Washington 98607, USA
{bli,sezan}@sharplabs.com

Abstract

We propose a general framework for event detection and summary generation in broadcast sports video. Under this framework, important events in a class of sports are modeled by "plays", defined according to the semantics of the particular sport and conventional broadcasting patterns. We propose both deterministic and probabilistic approaches for the detection of plays. The detected plays are concatenated to generate a compact, time-compressed summary of the original video. Such a summary is complete in the sense that it contains every meaningful action of the underlying game, and it also serves as a much better starting point for higher-level summarization and/or analysis than the original video does. We provide experimental results on American football, baseball, and sumo wrestling.
1. Introduction

With the increasing amount of audio-visual information that is broadcast or available on prerecorded media, there is an emerging need for efficient information management, including browsing, filtering, indexing, and retrieval. All of these tasks can benefit from video summarization. The goal of summarization is to obtain a compact representation of the original video, which usually contains a large volume of data and is not amenable to the aforementioned information processing tasks. Notice that, although summarization often achieves compression at the same time, it is different from video coding, which aims at representing the original video with less data. In fact, summarization is about a compact representation of the "content" of the video, whereas video coding is about representing the video signal itself as accurately and as bandwidth-efficiently as possible. A video summary can have a varying amount of detail, depending on the requirements of a specific problem. For example, a very high-level summary could contain only a few key frames which highlight the most important events in a video, so that some coarse information about the video can be inferred from the key frames. On the other hand, a low-level summary may consist of many video segments, each of which is a continuous portion of the
original video, allowing some detailed information in the video to be viewed by the user. Generally speaking, high-level summarization is useful for tasks such as navigation, indexing, and retrieval, while low-level summarization makes it possible to quickly consume the content of a video by browsing only the summary. There has been some prior work focusing on sports video (e.g., basketball [14], soccer [17], baseball [5][13]). However, most of this work is specific to one type of sport, and some of it addresses specific tasks such as annotation, highlights extraction, or indexing. More recently, unified approaches have also been proposed. For example, in [18], a framework using domain-specific knowledge for structure analysis of sports video was proposed, with experimental results on tennis and baseball. In this paper, we focus on the problem of video summarization for a class of sports. In our definition, summarization is not intended to obtain highlights or key frames, which are often subjective and hence not always well defined. Instead, we focus on objectively modeling and detecting every action that is essential to a game. We then concatenate the detected segments of video, with proper post-processing at the junctions, to form a summary of the original program. We attempt to use a unified framework, built on the definition of a "play", to address the summarization problem in sports video, and propose approaches to event (play) detection under this framework. In Section 2, we model a class of sports video using the concept of a "play", and discuss how a compact summary is generated in terms of plays. In Section 3, we explain, with examples, how plays can be characterized by low-level visual features, and propose using rule-based inference for detecting the plays. We also consider two alternative approaches based on probabilistic inference (Section 4). As case studies, we present experimental results in Section 5 for American football, baseball, and sumo wrestling.
We conclude with discussion in Section 6.
2. Modeling Sports Video Using "Plays"

In many types of sports broadcasting, one can make the following interesting observation: although a typical game lasts, say, a few hours, only part of that time is of importance
in terms of understanding, following, or even appreciating the game. These important parts occur semi-periodically but sparsely during the game; they contain the moments of intense action and are the essence of the game. The remaining time is typically less important (e.g., idle time when the ball is not in play, changes of players, pre-game performances or ceremonies, commercials, time-outs, etc.). Therefore, we can model the video as a sequence of "plays" interleaved with non-plays, with a "play" defined as the basic segment of time during which an important action occurs. American football, baseball, and wrestling are among the sports that belong to this class. A play can be, for example, a pitch in a baseball game, an attempt at offense (i.e., a play) in football, or a bout in wrestling. This model is illustrated in Figure 1. Note that in the "frame → shot → event → video" hierarchy, a play is at the same level as an event, since a play is a complete action and can contain multiple shots.
Figure 1. A simple model of a class of sports video in terms of "plays", a play being defined as the most basic segment of time during which an important action occurs in the game. The inner loop (in dashed lines) indicates the possibility that two plays can occur consecutively.

While it is indeed fun to sit in a stadium for a number of hours to watch a game live, most people watch most games on TV, and they would find it difficult, if not impossible, to watch all the games they are interested in, even if they are loyal fans. A compact summary, containing only the plays, may be appealing to many people in this situation. Compared with highlights or key frames, such a summary is consumable by itself, since it still retains every important event in the game. Further, the summarized video provides a better starting point for other high-level tasks, such as extracting the most exciting segments of the video.
[Figure 2 flowchart: Input Video → Start-of-play Detection → End-of-Play Detection → (repeat until End of Video) → Summary Description]
Figure 2. The most basic procedures of a video summarization algorithm based on the detection of plays.
Figure 2 shows the most basic procedure needed for obtaining such a summary, where the summary description contains only the start and end points of all the plays (in more complex situations, it may contain other information, such as the names of the players). The summarization so defined is primarily a low-level one, and it has to be, since it would be hard to appreciate a game from the summary unless the summary contains sufficient detail. Obviously, the problem defined above is one of "event-based summarization", and thus requires the detection of an event (a play in this case). In contrast to a more generic summarization scheme which uses, for example, color (histogram) as the cue for key-frame detection or scene classification, here the frames in one play may sweep a large range of color (in terms of histogram), yet all the frames belong to the same event and form an uninterrupted video clip.
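The control flow of Figure 2 can be sketched as follows. This is a minimal sketch: the two detector callbacks (`detect_start_of_play`, `detect_end_of_play`) are hypothetical stand-ins for the low-level detectors developed in the following sections, and the frame representation is left abstract.

```python
# Sketch of the summarization loop of Figure 2. The detectors are
# placeholders (assumptions of this sketch, not part of the paper);
# only the control flow is taken from the figure.

def summarize(frames, detect_start_of_play, detect_end_of_play):
    """Scan the video once, recording (start, end) frame indices of plays.

    detect_start_of_play(frames, i) / detect_end_of_play(frames, i)
    return True when frame i starts / ends a play.
    """
    summary_description = []  # list of (start, end) pairs
    i, n = 0, len(frames)
    while i < n:
        if detect_start_of_play(frames, i):
            start = i
            while i < n and not detect_end_of_play(frames, i):
                i += 1
            summary_description.append((start, i))
        i += 1
    return summary_description
```

The resulting summary description is exactly the list of start/end points; concatenating the corresponding segments yields the play summary.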
3. Detection of the Plays From Figure 2, the major task of summarization is the detection of the start and the end points of a play. We now discuss how a play is characterized by a set of low-level visual features, which can be used to detect a sequence of shots that is potentially a play (Section 3.1). Rule-based inference is proposed for integrating the shot detection results (Section 3.2).
3.1 Characterizing Plays By Low-level Features The problem of play detection is one of event detection. In general, event detection in video is a difficult problem. Here, for a class of sports video, by exploiting the general video capturing and production patterns that have been adopted by almost all the broadcast companies, we show how the event “play” can be characterized by low-level visual features that are relatively invariant. To shed some light on how the event “play” can be detected, we now use baseball as an example to show how low-level features characterize a play. A baseball game is usually captured by cameras positioned at fixed locations of the field, although the cameras can pan, tilt, and zoom. A play typically starts with a pitch. A pitching scene (in which the pitcher is about to throw the ball) is usually captured from behind the pitcher. This is because it is much easier to follow the movements of all the parties involved (the pitcher, the batter, the catcher, and the umpire) from this viewpoint than from any other angle. Thus, a play typically starts with a frame like those shown in Figure 3(a)-(c) (an example of special cases, base-stealing, is illustrated in Figure 3(d)). One can immediately identify some features
for detecting such frames (for example, the fixed pattern of the field colors). Since different sports usually have different start scenes, for clarity, we will discuss features used in our work in Section 5 where we present the experiments for the specific sports.
Figure 3. (a)-(c) A typical start of a regular play – a pitching scene. A rarer type of start is a base-stealing scene (d), which is also captured from a fixed camera angle.

How the current play ends depends on the pitching result (e.g., how many times the pitcher has thrown, the batter's action, etc.). For example, if the first throw of the pitcher is valid but the batter does not swing, the pitcher will prepare for the second pitch. If the time until the next pitch is too long, there will usually be a scene cut, and the camera may shoot some less important scene (such as players sitting on the sidelines) until the next pitch. In this case, we can effectively consider the play to stop at the scene cut. If, however, the batter hits the ball, the scene is switched to a camera following the flying ball (almost always resulting in a frame containing the field). Since a fixed camera may not be able to capture a ball that flies far, there may be several camera switches until the ball is caught or hits the ground. After that, the current play ends, and another camera break usually occurs. These observations lead to a simple model of the play in baseball: 1) a play usually starts with a pitching scene; 2) after the play starts, if the camera is shooting the field after a scene cut, then the current play continues; otherwise, the current play ends. In this model, a play starts when the pitcher poses to pitch. In reality, the play can end in different ways depending on the pitching result (e.g., a play can terminate with the batter being struck out, or with a home run). Our modeling avoids the estimation of the pitching result (which is very difficult) by using camera breaks.

We now show how this simple model works in practice, using Figure 4 as an example. The curve in Figure 4 shows the color histogram differences of a 1000-frame video clip, in which the peaks correspond to scene cuts. A pitcher threw the ball at around frame 170, which was detected as a pitching scene. The batter did not swing, and after the catcher caught the ball, there was a scene cut at frame 322. After the pitcher was ready for another throw, the camera was switched back (resulting in a scene cut at frame 428). A new pitching scene was detected at frame 520. This time the batter hit the ball, and the camera was switched to follow the flying ball (resulting in scene cut 2). In this case, since the frame contains the field, the play continues until another scene cut (scene cut 3), when the current play ends and another camera break happens. The time between the last camera break of the current play and the start of the next play is usually not exciting and thus should not be included in a summary. Note that after a hit there may be multiple camera breaks, so we must analyze the scene image after a scene cut to decide whether the current play should end.
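The two-rule baseball model above can be sketched as a small state machine. The three predicates (`is_pitching`, `is_scene_cut`, `shows_field`) stand in for the low-level detectors and are assumptions of this sketch:

```python
# Minimal sketch of the two-rule baseball play model:
# 1) a play starts at a pitching scene;
# 2) at a scene cut, the play continues if the new shot shows the
#    field, otherwise it ends.
# The three per-frame predicates are hypothetical stand-ins for the
# low-level detectors described in the text.

def detect_plays(n_frames, is_pitching, is_scene_cut, shows_field):
    plays, start = [], None
    for i in range(n_frames):
        if start is None:
            if is_pitching(i):          # rule 1: pitching scene opens a play
                start = i
        elif is_scene_cut(i):
            if not shows_field(i):      # rule 2: camera left the field
                plays.append((start, i))
                start = None
    return plays
```

Note how a scene cut back to the field (e.g., following a hit ball) leaves the current play open, matching the Figure 4 walkthrough.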
Figure 4. How baseball plays are characterized by a pitching scene and scene cuts.

The above characterization of a play (a relatively fixed start scene plus a certain type of scene transition) appears to be quite general across many types of sports. For example, a football game is also typically captured by fixed cameras, resulting in start scenes such as those illustrated in Figure 5; after the current play is finished (the ball is dead), a camera break typically follows. One may notice that our characterization of a play depends on two assumptions: fixed camera angles at the start of a play, and a scene transition at the end of a play. However, these two assumptions have been validated on extensive data (including historical data). In fact, for this class of sports, the use of a fixed camera angle is almost a necessity, since only one camera angle is capable of capturing the action of all parties involved. On the other hand, the camera break (resulting in a scene transition) is an established convention of professional video production: a natural time for a camera break is right after an event has finished.
Figure 5. A typical start of a regular play (left), and a typical start of a kick in football (right).
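The scene cuts (camera breaks) relied upon above are found as peaks in frame-to-frame color histogram differences. A minimal sketch, with a hypothetical fixed threshold in place of the paper's isolated-peak logic:

```python
# Sketch of scene-cut detection from color histogram differences.
# The paper detects *isolated peaks*; here a simple fixed threshold
# (an assumption of this sketch) is used for brevity.

def histogram_difference(h1, h2):
    """L1 distance between two (normalized) color histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_scene_cuts(histograms, threshold=0.5):
    """Return frame indices whose histogram difference from the
    previous frame exceeds `threshold` (a hypothetical value)."""
    cuts = []
    for i in range(1, len(histograms)):
        if histogram_difference(histograms[i - 1], histograms[i]) > threshold:
            cuts.append(i)
    return cuts
```

In practice the threshold would be tuned, or replaced by the isolated-peak test, since gradual lighting changes can otherwise trigger false cuts.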
3.2 Rule-based Inference

Although a play can be characterized by low-level features such as the field colors and their spatial patterns, scene cuts, etc., there are uncertainties and inaccuracies in extracting these features. Thus a high-level inference stage is typically needed to make the final decision based on the low-level features. A deterministic way of inference is to establish a set of rules for a specific game. For example, in football and baseball, a real start scene should not have much camera motion. Also, to compensate for a possible miss in detecting the end of a play (due, for example, to a failure in detecting the scene transition), one can put a constraint on the length of a play so that a play does not run too long. The length limit could be, for example, 30 seconds in the football case. In general, these rules can be established according to the specific game, and we will discuss some of the issues with case studies in Section 5.
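The two example rules above (a nearly still start scene, and a length cap to recover from a missed end-of-play) can be sketched as a post-filter over candidate plays. The threshold values and the `camera_motion` callback are assumptions of this sketch; only the 30-second football limit comes from the text:

```python
# Sketch of the deterministic rules of Section 3.2, applied as a
# filter over candidate (start, end) frame pairs.

FPS = 30                        # assumed broadcast frame rate
MAX_PLAY_SECONDS = 30           # length limit from the football example
MOTION_THRESHOLD = 2.0          # hypothetical global-motion threshold

def apply_rules(candidate_plays, camera_motion):
    """camera_motion(i) returns the global motion magnitude at frame i."""
    accepted = []
    for start, end in candidate_plays:
        if camera_motion(start) > MOTION_THRESHOLD:
            continue            # a real start scene is nearly still
        # cap the play length so a missed end-of-play cannot run on
        end = min(end, start + MAX_PLAY_SECONDS * FPS)
        accepted.append((start, end))
    return accepted
```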
4. Methods of Probabilistic Inference

As will be demonstrated with experiments in the next section, the approach proposed in Section 3 is very successful in practice; it is also easy to implement and computationally efficient. In this section, we discuss two possible alternative methods of probabilistic inference that might be of interest in some situations. For example, when it is difficult to specify a set of rules for inference, a probabilistic approach that is capable of learning may handle the inference problem better. Also, a probabilistic approach may avoid the difficulty of choosing hard thresholds by absorbing the uncertainties into the model itself. Assuming that shot segments have been obtained by detecting start scenes and scene transitions, one may use various probabilistic inference methods, such as Bayesian networks (BN). Here we consider methods using Hidden Markov Models (HMMs) (the more general idea of graphical models would unify BNs and HMMs [6][15]), since previous work using HMMs for video parsing or segmentation suggests good potential [1][2][16]. A straightforward way of using an HMM for inference is to assume that shots have been detected and that each shot is generated, with some probability, by a certain underlying state. We can consider, for example, the four-state HMM shown in Figure 6, where arrowed lines indicate possible transitions between the states. Training sequences of shots with pre-specified play/non-play segmentation are used to estimate the model parameters. To detect plays in an input video, one first obtains a sequence of shots using the method of Section 3.1. Then the most likely sequence of states is found by the Viterbi algorithm [12]. Plays are detected by identifying sequences of states "1-2-3". This way of using an HMM is similar to the dialogue-sequence parsing approach in [16].
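For reference, a generic Viterbi decoder over a discrete-observation HMM might look as follows. This sketch uses the common state-emission form for brevity, whereas the transition-oriented model of Figure 6 attaches observations to arcs; the decoding principle is the same:

```python
import math

def _log(x):
    """Safe log for probabilities that may be zero."""
    return math.log(x) if x > 0 else float("-inf")

def viterbi(obs, pi, A, B):
    """Most likely state path for observation sequence `obs`.

    pi[s]: initial probability; A[s][t]: transition probability;
    B[s][o]: emission probability. Log-probabilities avoid underflow
    on long sequences.
    """
    S = len(pi)
    delta = [_log(pi[s]) + _log(B[s][obs[0]]) for s in range(S)]
    back = []
    for o in obs[1:]:
        prev = delta
        delta, ptr = [], []
        for t in range(S):
            best = max(range(S), key=lambda s: prev[s] + _log(A[s][t]))
            delta.append(prev[best] + _log(A[best][t]) + _log(B[t][o]))
            ptr.append(best)
        back.append(ptr)
    # backtrack from the best final state
    path = [max(range(S), key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

Plays would then be read off the decoded path by locating runs matching the "1-2-3" state pattern.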
Figure 6. A four-state HMM for modeling a class of sports video in terms of "plays", where arcs represent possible transitions, and the blocks attached to each arc represent observation vectors (a transition-oriented model).

The above approach still relies on a detection stage to obtain the shots, and then uses an HMM-based module to do the inference. We now propose another way of using an HMM, which addresses shot detection and high-level inference simultaneously. We still use the four-state model in Figure 6, assuming that each arc is associated with an observation vector. The algorithm works as follows. For parameter estimation, a feature vector is computed for each frame in the training sequences, and each frame is labeled with one of the four states. Parameter estimation for the HMM is done using the Baum-Welch algorithm [12]. With the ground truth (state labeling) for each frame given, we compute an initial model from the training sequences, instead of using a random or ad hoc hand-picked initial model, as follows:
\pi_i^{(0)} = \text{expected frequency in state } S_i \text{ at time } t = 1

a_{ij}^{(0)} = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i}

b_{ij}^{(0)}(k) = \frac{\text{expected number of transitions from state } S_i \text{ to } S_j \text{ while observing symbol } V_k}{\text{expected number of transitions from state } S_i \text{ to } S_j}
where {a_ij}, {b_ij(k)}, and {π_i} are the transition, emission, and initial state probabilities of the HMM, respectively. This initial estimate ensures that the Baum-Welch algorithm converges to a better critical point than a random or ad hoc initialization does. To detect plays in an input video using the trained model, the same feature vector is computed for each frame, and the Viterbi algorithm is then applied to find the most likely sequence of states. A sequence of "1s-2s-3" signifies a play. The key to the success of the above integrated approach is a good choice of the features that constitute the observation vector V_k, since both training and testing depend heavily on the observation probability P(V_k | T, Λ) (the probability of observing the vector V_k given an HMM Λ and state transition T).
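Because every training frame is labeled with its state, the "expected" counts in the initialization above reduce to simple observed frequencies. The following sketch computes such count-based initial estimates; for brevity it uses state-emission counts, whereas the paper's model attaches observations to transitions:

```python
from collections import Counter

# Count-based initial HMM estimates from state-labeled training
# sequences (state-emission form for simplicity; the paper's model
# is transition-oriented). Each training sequence is a list of
# (state, symbol) pairs.

def initial_estimates(labeled_seqs, n_states, n_symbols):
    pi = Counter(seq[0][0] for seq in labeled_seqs)   # first-frame states
    trans, emit = Counter(), Counter()
    for seq in labeled_seqs:
        for s, v in seq:
            emit[(s, v)] += 1                         # emission counts
        for (s, _), (t, _) in zip(seq, seq[1:]):
            trans[(s, t)] += 1                        # transition counts

    def normalize(counts, rows, cols):
        out = []
        for i in range(rows):
            row = [counts[(i, j)] for j in range(cols)]
            total = sum(row)
            # unseen states fall back to a uniform row
            out.append([c / total for c in row] if total else [1 / cols] * cols)
        return out

    total = len(labeled_seqs)
    return ([pi[i] / total for i in range(n_states)],
            normalize(trans, n_states, n_states),
            normalize(emit, n_states, n_symbols))
```

These estimates then seed Baum-Welch, which refines them on the same training data.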
5. Case Studies and Experimental Results

In this section, we use three types of sports as case studies and present experimental results: baseball, football, and sumo wrestling. Our goal is to demonstrate experimentally that the proposed framework works for a class of sports that contains football, baseball, and sumo wrestling. We have also incorporated the results into an MPEG-7 compliant prototype system featuring novel user interface paradigms for viewing sports summaries, which we plan to demonstrate during the conference. All the data used in our experiments are MPEG-encoded streams of 160x120 frame resolution, captured by a $199 TV-tuner PC card; the proposed algorithms do not require very high quality input. In each of the three case studies, the scene cut detection (which signals a potential end of play) is based on detecting isolated peaks in color histogram differences. The detection of a start scene differs slightly for each type of sport, as explained below, although in all cases color and motion are the primary cues. The ground truth in all cases was obtained by hand. The rule-based approach of Section 3 ran faster than 30 frames/second for all three sports on a Pentium III-800MHz PC.

Case Study 1: Baseball

In the case of baseball, the features used to detect a pitching scene include the field colors and their spatial distribution, and the spatial geometric structure induced by the pitcher and the cluster of the batter, the catcher, and the umpire. A base-stealing scene is handled similarly. We only consider plays that start with either a pitching scene (Figure 3(a)-(c)) or a base-stealing scene (Figure 3(d)). As described earlier, a play ends after a scene transition occurs and the camera is no longer viewing the field. To check whether the camera is viewing the field, we examine whether the frames right after the scene cut contain large green and brown areas. With the shots defined by the detected start and end points, inference based on the semantics of the game determines whether a shot or a sequence of shots forms a play. Using the rules (for baseball) described in the previous section, we obtained the performance statistics in Table 1.

Table 1. Baseball play detection performance. Test data include seven 20-minute sequences, one 60-minute sequence, and one 84-minute sequence, from 6 different games broadcast by three different TV networks. We observed no significant performance differences between individual sequences (the same is true for Table 2).

Total number of plays: 450
Perfectly detected plays: 403
Imperfect detections [1]: 38
Number of missed plays: 9 (2.0%)
Compaction ratio [2]: 3:1
Detection rate: 98.0%
Number of false alarms: 20

[1] Imperfect detection refers to plays that are detected to be either a little longer or a little shorter than desired.
[2] Compaction ratio is defined as the ratio between the lengths of the input video and the generated play summary.

Case Study 2: American Football

For detecting a start scene, as illustrated in Figure 5, we use color, motion, and shape information. Specifically, a frame is detected as a start scene if it has a dominant green color with scattered non-green blobs, has little or no motion, and contains parallel lines on a green background (parallelism is tested by assuming a perspective camera model). The distinction between a regular start and a kick-off is obtained by analyzing the orientation of the detected lines and the line-up of the non-green blobs. As in baseball, a play may contain multiple shots if it starts with a kick-off captured by a camera at either end of the field. In Table 2, we list the results obtained with the rule-based approach.

Table 2. Football play detection performance. Test data include four 80-minute, one 60-minute, and one 40-minute sequences, totaling 420 minutes (six different games in different stadiums broadcast by three different TV networks).

Total plays: 448
Perfect detections: 379
Imperfect detections: 50
Number of missed plays: 19 (4.2%)
Compaction ratio: 3.6:1
Detection rate: 95.8%
Number of false positives [3]: 14

[3] Some of the false alarms are from other games shown in the half-time report (and thus are plays themselves by definition).

Case Study 3: Japanese Sumo Wrestling

Japanese sumo wrestling falls into the class of sports that can be successfully handled by the proposed approaches. To detect a start scene (as illustrated in Figure 7), we examine whether a frame contains two symmetrically distributed blobs of skin color on a relatively uniform stage (whose color is relatively fixed, since the stage is built according to strict regulations). Notice that, in a sumo match, there are many pre-game ceremonies that result in a scene extremely similar to the one shown in Figure 7. To distinguish a real start of play from other pre-game performances, the two blobs are tracked to see if they converge; a start of play is declared only when the two detected skin-colored blobs are converging. We applied the rule-based approach to two sumo sequences, one 60 minutes and one 52 minutes in duration. We obtained a 100% detection rate in both cases, with no false alarms for the second sequence and two false alarms for the first (the false alarms are in fact parts of historical flashbacks of other sumo matches, and thus are recognized as plays by the algorithm). A compaction ratio of 20:1 was achieved in both cases.

Figure 7. A typical start scene of a sumo match.

5.1 Preliminary Experiments on HMM-based Inference

Preliminary experiments have also been done with the HMM-based inference methods proposed in Section 4. Here we briefly describe the experiments and results for baseball using the approach that integrates shot detection into the HMM. With this approach, one first has to define the features to be used. For testing the idea, we have used a very simple 3-dimensional discrete feature. Specifically, for each frame we compute an observation vector V = [v1 v2 v3]^T, where v1 is the similarity with respect to a model image (computed by first taking the scalar product of the two images downsampled to 16x12, then quantizing into four levels), v2 is the average motion magnitude (computed by first averaging the motion vectors of all 20x20 blocks, then taking the magnitude and quantizing into two levels), and v3 is the scene transition type (0 for no scene cut, 1 for a scene cut with the camera viewing the field, and 2 for other types of scene cuts). The 2-level v2 indicates whether there is significant global (camera) motion in the current frame, since a start of play typically results in a nearly still scene. The motion vectors are obtained using a block-matching method. For simplicity, independence between the components of V has been assumed; thus we have 24 distinct observation symbols. One can imagine that such coarsely quantized features are not sufficient for characterizing a play, and that better features could be adopted. Nevertheless, we have found that even with such an insufficient set of features, we were able to obtain an 89% detection rate for baseball with the proposed method, although the false alarm rate increased significantly. Considering that we used only such a simple feature vector, the result is very encouraging. We attribute the increased false alarm rate to the fact that v1 as defined above does not provide sufficient information for distinguishing a start scene from a non-start scene (especially after the coarse quantization).

6. Summary and Discussion
We have proposed a framework for event detection and summarization in sports video. Under this framework, the most important parts of a sports video are modeled by plays, which are detected and concatenated to form a compact summary of the original video. We have proposed different approaches for the detection of the plays, one based on rule-based inference and the others using HMMs for high-level inference. Case studies with football, baseball, and sumo wrestling have shown that the framework provides a unified approach to the sports video summarization problem for a class of sports, and that the proposed algorithms work very well in practice on consumer-grade platforms. In our experiments, we have used only visual cues. Although audio can potentially also be used, we do not
believe audio features alone are sufficient for accurate segmentation in sports video, since there is an almost constant level of background noise, and scene changes frequently do not coincide with any change in the audio domain. However, we have found that, by using characteristics of the audio tracks associated with the detected play segments to rank-order the detected plays, we can successfully obtain a hierarchy of summaries in which shorter summaries contain the plays that are potentially more exciting. In our experiments, the hierarchical summaries generated by ranking the detected plays turned out to be meaningful (as judged by human subjects who are sports fans). The proposed approach will include a slow-motion replay in the summary if the replay (of a play) is captured from the same camera angle; ideally, replays should be excluded from the summary for the sake of compactness. We have successfully experimented with the method proposed in [11] to detect and exclude the replays from the summaries, pushing the compaction ratio higher (for example, for the football data, a compaction ratio of 4:1 was achieved). With the additional replay-detection module, one can also form an (at least) three-layer hierarchy of summaries, with the base layer containing all the plays plus replays, the second layer containing only the plays, and the third layer containing only the replays (which are supposed to be the highlights of a game). The preliminary results using the HMM-based approach suggest good potential. Currently, we are experimenting with more complex features to improve the performance.
Acknowledgement:
We greatly appreciate the permission granted to us by MLB and NFL for using images of single frames from their TV broadcast content.
References [1] J.S. Boreczky and L.D. Wilcox, “A Hidden Markov Model Framework for Video Segmentation Using Audio and Image Features”, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, Seattle, WA. [2] S. Eickeler and S. Muller, “Content-based Video Indexing of TV Broadcast News Using Hidden Markov Models”, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, Phoenix, AZ. [3] U. Gargi, R. Kasturi, and S.H. Strayer, “Performance Characterization of Video-Shot-Change Detection Methods”, IEEE Trans. on Circuit and Systems for Video Technology, Vol. 10, pp. 1-13, 2000. [4] S.J. Golin, “New Metric to Detect Wipes and Other Gradual Transitions in Video”, Proc. IS&T/SPIE Conference on
Visual Communications and Image Processing, 1999, San Jose, CA. [5] T. Kawashima, K. Tateyama, T. Iijima, and Y. Aoki, “Indexing of Baseball Telecast for Content-based Video Retrieval”, Proc. IEEE International Conference on Image Processing, 1998, Chicago, IL. [6] F.R. Kschischang, B.J. Frey, H-A. Loeliger, “Factor Graphs and the Sum-Product Algorithm”, IEEE Trans. on Information Theory, Vol. 47, pp.498-519, 2001. [7] R. Lienhart, “Comparison of Automatic Shot Boundary Detection Algorithms”, Proc. IS&T/SPIE Conference on Visual Communications and Image Processing, 1999, San Jose, CA. [8] Z. Liu and Q. Huang, “Detecting news reporting using AV information,” Proc. IEEE International Conference on Image Processing, 1999, Kobe, Japan. [9] H.B. Lu, Y.J. Zhang, and Y.R. Yao, “Robust Gradual Scene Change Detection”, Proc. IEEE International Conference on Image Processing, 1999, Kobe, Japan. [10] M.R. Naphade, R. Mehrotra, A.M. Ferman, J. Warnick, T.S. Huang, and A.M. Tekalp, “A High-Performance Shot Boundary Detection Algorithm Using Multiple Cues”, Proc. IEEE International Conference on Image Processing, 1998, Chicago, IL [11] H. Pan, P. van Beek, and M.I. Sezan, “Detection of Slowmotion Replay Segments in Sports Video for Highlights Generation”, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, Salt Lake City, UT. [12] L.R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, Vol.77, No. 2, pp.257-285, 1989. [13] Y. Rui, A. Gupta, and A. Acero, “Automatically Extracting Highlights for TV Baseball Programs”, Proc. ACM Multimedia 2000, Los Angeles, CA. [14] D.D. Saur, Y-P. Tan, S.R. Kulkarni, and P. Ramadge, “Automatic Analysis and Annotation of Basketball Video”, Proceedings of SPIE, Vol. 3022, pp. 176-187. [15] P. Smyth, “Belief Networks, Hidden Markov Models, and Markov Random Fields: A Unifying View”, Pattern Recognition Letters, Vol. 18, pp.1261-1268, 1997. 
[16] W. Wolf, “Hidden Markov Model Parsing of Video Programs”, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, Munich, Germany. [17] D. Yow, B-L. Yeo, M. Yeung, and B. Liu, “Analysis and Presentation of Soccer Highlights From Digital Video”, Proc. 2nd Asian Conference on Computer Vision, 1995, Singapore. [18] D. Zhong and S-F. Chang, “Structure Analysis of Sports Video Using Domain Models”, Proc. IEEE International Conference on Multimedia and Expo, August 2001, Tokyo, Japan.