Automatic Generation of Personalized Music Sports Video

Jinjun Wang (2,1), Changsheng Xu (1), Engsiong Chng (2), Lingyu Duan (1), Kongwah Wan (1), Qi Tian (1)

(1) Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
{xucs, duan, kongwah, tian}@i2r.a-star.edu.sg
(2) CeMNet, SCE, Nanyang Technological University, Singapore 639798
[email protected], [email protected]

ABSTRACT
In this paper, we propose a novel automatic approach for personalized music sports video generation. Two research challenges, semantic sports video content selection and automatic video composition, are addressed. For the first challenge, we propose to use multi-modal (audio, video and text) feature analysis and alignment to detect the semantics of events in sports video. For the second challenge, we propose video-centric and music-centric composition schemes to automatically generate personalized music sports video based on the user's preference. The experimental results and user evaluations are promising and show that the music sports videos generated by our system are comparable to manually generated ones. The proposed approach greatly facilitates automatic music sports video generation for both professionals and amateurs.
Categories and Subject Descriptors I.5.5 [Pattern Recognition]: Implementation—Interactive systems; H.3.1 [Information Storage And Retrieval]: Content Analysis and Indexing—Abstracting methods, Indexing methods
General Terms Algorithms, Design, Experimentation
Keywords Event detection, Video content selection, Automatic video editing, Sports video analysis, Personalized music sports video
1. INTRODUCTION
The sports broadcasting industry is becoming one of the most profitable businesses because of the growing appetite for sporting excellence and patriotic passion at both the international level and the domestic club level. The media is kept abuzz, feeding the latest sports events to the hungry masses.
At the same time, advances in digital TV open new perspectives in the distribution of Audio/Visual (A/V) content to the public. One major advantage of digital broadcast is the possibility of delivering customized and interactive TV programs. For example, a future sports fan could choose to view contents from a game series or a just-completed game, with clips selected by events and/or presented in an artistic sports MTV style. However, current production of such music video (MV) is very labor-intensive and inflexible. The selected clips must be found manually from huge volumes of sports video documents by a professional producer, who then cuts and edits each selected segment to match a given piece of music. Tools that automate this process can improve production efficiency for professionals. Another limitation of current sports MV production is that it is driven by professionally chosen highlights, and hence may not meet the appetites of users who want to watch a sports MV about a certain player or team. Amateurs, however, do not have the skills or tools to generate such personalized sports MVs easily. Therefore, the availability of tools for automatic generation of personalized sports MV will benefit not only professionals but also amateur users. This possibility is similar to that of muvee [1], which has revolutionized home video editing.

In this paper, we focus on a system for automatic generation of personalized sports MV. Specifically, we call the generated MV "music sports video" (MSV). Our system extends previous work by including video semantic and music tempo matching as one criterion. We use soccer video as an example because it is not only a globally popular game, it also presents many difficult challenges for video analysis due to its dynamic structure. However, our proposed approach can be extended to other sports domains.

There are two research challenges for automatic generation of personalized MSV: semantic sports video content selection and automatic video composition. The following subsections discuss these two issues.
1.1 Semantic Sports Video Content Selection
An ideal automatic personalized MSV generator should be able to generate MVs containing sports contents that are interesting to different audiences with different preferences. The selection of contents may be by (1) a certain sports event, (2) a certain player/team, or (3) a certain topic. Details of these three selection possibilities are discussed below:

1) Sports video content selection by event is similar to sports event detection, which has been extensively studied [2]. Many researchers have proposed to use either rule-based methods [3] or statistical classifiers [4] to distinguish cinematic features [5] and/or object-based features [6] for event detection. These previous works focus on proposing robust and accurate methods to recognize as many types of sports event as possible.

2) Sports video content selection by player/team requires the system to detect events related to certain player(s) or a team. For example, the audience may customize the content to include "Zidane" or "Beckham" only. This task is harder because identifying a player or team in sports video is difficult. Some researchers have examined textual features for this task. For example, Babaguchi [7] analyzed the Closed Caption (CC) to recognize the players in American Football games. However, CC is only available in certain countries, and the alignment of video with text still remains challenging.

3) Sports video content selection by topic is the most difficult of the three possibilities: the term "topic" is not only a subjective concept, it is also extremely difficult to represent concepts using A/V features. Here we give a few examples of sports topics: the happiness of teams/players when winning a game, the sadness of an injured player who is unable to compete, players in dispute with referees, etc.

In this paper, we only discuss sports video content selection by the first two possibilities, i.e. selection by event and by player/team. Our research on selection by topic is ongoing.
1.2 Automatic Video Composition
Digital video editing and composition are common in film or broadcast post-production. The latest developments in computer science have extended these technologies from professionals to home users. However, video editing remains a time-consuming and labor-intensive task, and good editing comes with skill, experience and artistic talent. For these reasons, automatic video editing has attracted increasing research attention in recent years. For instance, automatic or semi-automatic home video content selection and composition is studied in [1, 8, 9]. However, these approaches are not applicable to personalized MSV generation for two reasons. Firstly, these techniques use un-edited home video as input and do not attempt to extract semantic information because of its difficulty. Sports video, in contrast, is a well-structured, post-edited video that includes multi-modality streams, hence identifying the semantics of sports video is both possible and necessary. Secondly, these approaches have not addressed the challenging issue of music/video matching from a semantic perspective. In our research, we include the semantic perspective as one criterion for video composition. That is, the generated videos will not only be aligned to the selected music piece, they should also be semantically matched with the music such that the viewing comfort and enjoyment of the video are maximized.
1.3 The Contributions of the Paper

Compared with the existing work on automatic video content selection and editing, the main contributions of our proposed approach include:

1. We propose a novel semantic sports video content selection method using multi-modal (audio, video and text) analysis techniques. In particular, the use of text information greatly improves event detection performance compared with traditional methods using A/V features only [2], and the text analysis also enables sports video content selection by player(s) and team;

2. We propose a robust algorithm to align the detected text event with the video to identify the event boundaries. This algorithm does not demand accurate time information, in contrast to previous work where an accurate time-stamp is required [7, 10];

3. We propose video-centric and music-centric schemes using semantic video and music analysis and matching to automatically generate MSV. Existing work [8, 9] has not emphasized semantics and personalization in video editing and composition.

The rest of the paper is organized as follows: Section 2 describes the framework of this research; Section 3 discusses the audio, visual and text analysis and the alignment of A/V features and text for video content selection; Section 4 introduces two MSV composition schemes, the video-centric and the music-centric scheme; in Section 5 the experimental results of audio/visual/text analysis, event detection and the evaluations of the generated MSVs are reported and discussed. Finally, Section 6 draws the conclusions and raises some future work.

2. FRAMEWORK
Automatic generation of personalized MSV is a challenging task due to the two major issues discussed in the previous section, i.e. semantic sports video content selection and automatic video composition. We propose to utilize multi-modal (audio, visual and text) feature analysis and alignment to select semantic video content, and to use video and music semantic analysis for MSV generation. Fig. 1 illustrates our proposed framework. Specifically, in the "semantic sports video content selection" block, the "audio analysis" and "video analysis" modules extract A/V features from the input video. The "text analysis" module detects the event and identifies the player/team information from the text web source. The "A/V & text alignment" module aligns the detected text event with the A/V features to recognize the event boundary in the video. The "automatic video composition" block automatically generates video-centric and music-centric MSVs based on semantic matching of video content and music. In the following sections, the details of the semantic sports video content selection and automatic video composition are presented.
3. VIDEO CONTENT SELECTION

3.1 Visual/Audio Analysis
We focus on broadcast soccer video for our A/V analysis because it contains rich post-production information that is favorable for video summarization and that provides robust A/V features. Details of the A/V features used in our system are listed in Table 1, and the associated analysis is described in the following subsections.
Figure 1: Framework of automatic personalized music sports video generation
Table 1: A/V Analysis Description
ID   Description                    Analysis
F1   Shot boundary detection        Visual
F2   Semantic shot classification   Visual
F3   Replay detection               Visual
F4   Camera motion                  Visual
F5   Audio keyword                  Audio
3.1.1 Shot boundary detection (F1)
In broadcast video the shot can be regarded as a basic analysis unit. Currently we perform shot boundary detection (SBD) using M2-Edit Pro [11] software. The obtained shot boundary feature is denoted as ID F1 (Table 1) which is a sequence of frame numbers. Each number indexes a boundary between two successive shots.
3.1.2 Semantic shot classification (F2)
The shot transitions in soccer video reveal the state of the game, hence semantic shot classification provides a robust feature for soccer video analysis. Our previous work [3] proposed a method that classifies each frame into one of three shot classes: far view, medium view and close-up view. In this paper we additionally identify the in-field/out-field state of the medium and close-up view shots to obtain 5 semantic shot classes: far view, in-field medium view, in-field close-up view, out-field medium view and out-field close-up view. The generated shot classification feature is denoted as ID F2 (Table 1). F2 is a sequence with each element indicating the shot class label of the corresponding frame.
3.1.3 Replay detection (F3)

In a typical broadcast soccer game the director normally launches a replay for interesting events, hence replay detection greatly facilitates soccer video analysis [3]. Existing techniques for replay detection can be categorized into two classes: (1) detecting the editing effect, e.g. the flying-logo [12], and (2) detecting the slow-motion [13, 14]. Based on our observation, over 95% of broadcast sports videos nowadays use a flying-logo to launch replays. Hence in our system, we detect replays using a flying-logo template matching technique in the R, G, B channels. The detected replay/non-replay state of each frame is denoted by the value 1 or 0 respectively and is collected as a sequence in feature ID F3 (Table 1). Since the proposed replay detection approach requires the presence of the flying-logo, it may not be applicable to replays without such an editing effect. Although such replays are a small portion (~5%) compared with the replays with flying-logo, we are investigating more generic replay detection approaches to incorporate replays without a flying-logo.
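To make the flying-logo detection concrete, the sketch below shows one plausible realization of per-channel template matching with OpenCV. It is only an illustration of the idea described above: the normalized cross-correlation measure, the 0.7 threshold and the post-processing remark are our own assumptions, not the exact implementation used in the system.

```python
import cv2
import numpy as np

def frame_has_logo(frame_bgr: np.ndarray, logo_bgr: np.ndarray,
                   threshold: float = 0.7) -> bool:
    """True if the flying-logo template is found anywhere in the frame.

    Template matching is run separately on the B, G and R channels and the
    peak correlation scores are averaged; the 0.7 threshold is illustrative.
    """
    scores = []
    for c in range(3):                      # B, G, R channels
        res = cv2.matchTemplate(frame_bgr[:, :, c], logo_bgr[:, :, c],
                                cv2.TM_CCOEFF_NORMED)
        scores.append(float(res.max()))
    return float(np.mean(scores)) >= threshold

def logo_hits(video_path: str, logo_bgr: np.ndarray) -> list:
    """Frame indices where the flying-logo appears.

    In a post-processing step (not shown), frames lying between two paired
    logo transitions would be marked with value 1 to form the F3 sequence.
    """
    cap = cv2.VideoCapture(video_path)
    hits, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_has_logo(frame, logo_bgr):
            hits.append(idx)
        idx += 1
    cap.release()
    return hits
```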
3.1.4 Camera motion (F4)

As the camera always follows the movement of the players, the camera motion provides a useful cue to represent the activity of the game. In our system, 6 camera motion features are extracted for each frame, specifically "average motion magnitude", "motion entropy", "dominant motion direction", "camera pan factor", "camera tilt factor" and "camera zoom factor". The method to compute these motion features can be found in our previous work [15], and the generated camera motion feature is denoted as ID F4 (Table 1), an R6 vector sequence.

3.1.5 Audio keyword (F5)

There are some significant game-specific sounds that have strong relationships to the actions of players, referees, commentators and audience in sports videos. Hence the creation of suitable audio keywords helps the high-level semantic analysis. In our previous work [15] we proposed to use a Support Vector Machine to classify Mel Frequency Cepstral Coefficient and Linear Prediction Cepstral Coefficient features into three audio keywords for soccer audio, namely "whistle", "acclaim" and "noise". In this work we use the same algorithm. The generated audio keyword feature is denoted as ID F5 (Table 1). F5 is a sequence with each element indicating the audio keyword label of the corresponding frame.
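A minimal sketch of the audio keyword idea is given below, assuming librosa and scikit-learn are available. It classifies windowed MFCC features with an SVM into the three keywords; the window length, the omission of the LPCC features and all parameter values are illustrative choices, not those of [15].

```python
import numpy as np
import librosa
from sklearn.svm import SVC

KEYWORDS = ["whistle", "acclaim", "noise"]

def audio_features(wav: np.ndarray, sr: int, win_s: float = 1.0) -> np.ndarray:
    """Mean MFCC vector per non-overlapping window (LPCC omitted for brevity)."""
    hop = int(win_s * sr)
    feats = []
    for start in range(0, len(wav) - hop + 1, hop):
        mfcc = librosa.feature.mfcc(y=wav[start:start + hop], sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
    return np.vstack(feats)

def train_keyword_svm(clips, labels, sr=16000) -> SVC:
    """clips: labelled 1-D waveforms; labels: index into KEYWORDS per clip."""
    xs, ys = [], []
    for clip, label in zip(clips, labels):
        f = audio_features(clip, sr)
        xs.append(f)
        ys.append(np.full(len(f), label))
    return SVC(kernel="rbf").fit(np.vstack(xs), np.concatenate(ys))

def keyword_feature_f5(model: SVC, wav: np.ndarray, sr: int = 16000) -> list:
    """Per-window audio keyword label sequence (the F5 feature)."""
    return [KEYWORDS[int(i)] for i in model.predict(audio_features(wav, sr))]
```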
3.2 Text Analysis
Analyzing the sports video using only A/V features has the following limitations: (1) it is difficult to identify the semantics of the detected events; for example, in soccer video, although the "goal" event can be detected using A/V features, details such as who scored the goal cannot be extracted easily; (2) some events are almost impossible to detect using only A/V features, e.g. "yellow/red card" events. This motivates us to use text related to the sports video to detect the semantics of the events. Text analysis plays an important role in sports video content selection because it can greatly improve event detection performance thanks to the exact text keywords of the events, and it can furthermore extract semantics such as the player/team of the detected event.

In the sports domain, textual information is commonly available as caption text [6], Automatic Speech Recognition (ASR) transcripts, CC [16, 17, 18], game logs [10], and Text Web Broadcasting (TWB). Existing work using textual features includes the following: Babaguchi et al. took advantage of CC to extract baseball highlights [16], to index plays and related players [17] and to identify the semantic structure of sports videos [18]; Assfalg et al. annotated sports documents using caption text and other visual features [6]; Xu utilized game log information to refine event detection results from A/V analysis [10]. Compared with caption text and ASR, which are heavily affected by the video quality, or CC and game logs, which are not easily obtainable, TWB information is freely available as live text commentary [19] or match reports from the Internet. In addition, TWB carries very detailed information about the event, the related players and the approximate time (Fig.2). Because of these advantages, our system uses TWB as the source of textual information.

Our "Text analysis" module is required to recognize simple content such as event type, player name, etc. As most TWB entries are concise descriptions of events (Fig.2), our text analysis approach uses a keyword matching technique instead of other text analysis techniques such as Natural Language Processing [20]. The following subsections discuss the major functions of the Text analysis module (Fig.1).
Figure 2: Text web broadcast example [18]
3.2.1 Keyword definition

Each type of event features one or several unique nouns, such as "yellow card" and "red card" for the card event. These nouns are defined as keywords, and by detecting the keywords in the TWB, the relevant event can be identified. To achieve accurate event detection performance, the following conditions should be satisfied:

• A keyword might have different appearances. For example, "g-o-a-l", "gooooaaaaal", etc. are all forms of "goal". So during keyword search, the stemming, phonic and fuzzy options [21] should be selected;

• A keyword inside a phrase might mean something different. For example, the meaning of "goal kick" is far from "goal". Hence phrases with a different meaning should be removed;

• The presence of a keyword does not guarantee the occurrence of the related event, which causes false alarms. Babaguchi [7] suggested searching for both the keyword and the accompanying verbs to rectify this problem, and this idea is applied in our text analysis.

Table 2 shows the keyword definitions using the dtSearch grammar [21] for 8 soccer events. These events are chosen because they are either important or difficult to detect with traditional A/V analysis techniques. The popular event "shoot" is not selected because the "shoot" event overlaps the combination of the "goal" and "save" events that are already listed in Table 2. The keyword definition is extendable.

Table 2: Keyword Definition
event          keyword
card           "yellow card" or "red card" or "yellowcard" or "redcard" or "yellow-card" or "red-card"
foul           (commits or by or booked or ruled or yellow) w/5 foul
goal           g-o-a-l or scores or goal or equalize - kick
offside        (flag or adjudge or rule) w/4 (offside or "off side" or "off-side")
freekick       (take or save or concede or deliver or fire or curl) w/6 ("free-kick" or "free kick" or freekick)
save           (make or produce or bring or dash or "pull off") w/5 save
injury         injury and not "injury time"
substitution   substitution

3.2.2 Text event detection
Once proper keywords are defined, a soccer event can be detected by finding sentences that contain the relevant keyword. The software dtSearch [21], which supports stemming, phonic, fuzzy and boolean searching, is used for our event detection task. As these events are detected from the text commentary only, we denote them as "text events" in the following discussion.
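For illustration, the snippet below shows a much simplified, regex-based stand-in for this keyword search. It approximates two Table 2 entries with plain regular expressions (dtSearch operators such as w/5 and the stemming/phonic/fuzzy options are not reproduced), and the commentary lines are invented.

```python
import re

# Simplified stand-in for two Table 2 entries; the real system uses dtSearch syntax.
EVENT_KEYWORDS = {
    "card": [r"yellow[- ]?card", r"red[- ]?card"],
    "goal": [r"\bg-o-a-l\b", r"\bscores\b", r"\bgoal\b(?!\s+kick)", r"\bequalize"],
}

def detect_text_events(twb_sentences):
    """Return (sentence_index, event_type) pairs found by keyword matching."""
    hits = []
    for i, sentence in enumerate(twb_sentences):
        text = sentence.lower()
        for event, patterns in EVENT_KEYWORDS.items():
            if any(re.search(p, text) for p in patterns):
                hits.append((i, event))
    return hits

# Invented TWB lines for illustration:
commentary = [
    "23' Beckham is shown a yellow card for a late tackle.",
    "45' GOAL! Ronaldo scores from the edge of the box.",
    "58' Goal kick taken quickly by the keeper.",
]
print(detect_text_events(commentary))
# [(0, 'card'), (1, 'goal')]   -- the "goal kick" sentence is correctly ignored
```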
3.2.3 Player/team extraction
In addition to detecting the event, the players involved in the event are also identified. For this purpose, during the initialization phase, a player/team database is built by analyzing the start-up line information from the TWB. During detection, when a text event is recognized, we further match every word in the text event with the name entries in the database to extract all the names of players and teams relevant to the text event.
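A toy illustration of this matching step is given below; the line-up entries and the sentence are invented, and real TWB text would need more careful tokenization and handling of multi-word names.

```python
# Hypothetical line-up databases built during initialization from the TWB.
PLAYER_DB = {"ronaldo", "beckham", "zidane"}
TEAM_DB = {"portugal", "england", "france", "germany"}

def extract_players_teams(event_sentence: str):
    """Match every word of a detected text event against the name databases."""
    words = {w.strip(".,!?'\"").lower() for w in event_sentence.split()}
    players = sorted(w.title() for w in words if w in PLAYER_DB)
    teams = sorted(w.title() for w in words if w in TEAM_DB)
    return players, teams

print(extract_players_teams("GOAL! Ronaldo scores for Portugal."))
# (['Ronaldo'], ['Portugal'])
```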
3.3 Aligning the text event with A/V stream
Upon detection of a text event, our system performs a second pass in the A/V domain to align the detected text event with the A/V streams. Some previous research has attempted the similar task of A/V and text alignment. For example, Babaguchi et al. [7] matched the shot sequence located in the time interval given by text (CC) analysis against example sequences to identify a sports highlight and its boundary. Xu [10] utilized the time-stamp in the game log to synchronize the text event with the event from A/V analysis. Both approaches used text sources (CC and game logs) with accurate time information and assumed that the time tag in the text is well synchronized with the time in the video. However, in practice, even with accurate time information in the text, the starting point of the time tags, i.e. the actual start of the game, is seldom synchronized with the start of the video due to non-game scenes such as player/team introductions, ceremonies, the half-time break, etc., which makes A/V and text synchronization necessary. On the other hand, the text sources used in previous work are not always available; for example, CC is only available in certain countries. In our approach, we chose to use TWB as the text source since it is widely available. However, the inaccuracy of the time-stamp in TWB can be as great as 2-3 minutes, while the duration of a typical soccer event is around 30 seconds. For instance, we have observed that the time-stamp of an "injury" text event is "39 minute" while the recorded game clock in the A/V stream shows that the event happens from 37 minute 40 second to 38 minute 19 second.

To align the A/V content with the text, we introduce a robust algorithm that aligns the text analysis result with the A/V features without requiring very accurate time information and that detects the event boundary in the video. Since the time information from TWB is not accurate, the time-stamp in a detected text event only suggests a time duration during which the stated event might happen in the A/V domain. Hence the problem is to identify the text event in the A/V stream by finding the correct event boundary within the suggested duration. In our system, the search duration is empirically set to 3 minutes (Fig.3). Searching for the desired segment in a specified duration is similar to the problem of event detection using A/V analysis, where researchers search for events in full-length game videos instead. As the text analysis has reduced the search range, more accurate event boundary detection can be achieved compared to a full-length video search.

Figure 3: A/V stream and text stream
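For illustration, the helper below converts a TWB minute tag into such a frame search window. The frame rate, the game-clock offset handling and the choice to centre the window on the nominal time are assumptions made for this sketch only.

```python
def search_window(twb_minute: int, kickoff_frame: int, fps: float = 25.0,
                  window_min: float = 3.0):
    """Frame range in which to search for the event logged at `twb_minute`.

    `kickoff_frame` is the frame where the game clock starts; the video usually
    begins earlier (ceremony, line-ups), so the TWB time-stamp cannot index the
    video directly. The window spans `window_min` minutes around the nominal
    event time, mirroring the empirical 3-minute setting above.
    """
    centre = kickoff_frame + int(twb_minute * 60 * fps)
    half = int(window_min * 60 * fps / 2)
    return max(kickoff_frame, centre - half), centre + half

# An "injury" event logged at the 39th minute, game clock starting at frame 4500:
print(search_window(39, kickoff_frame=4500))   # (60750, 65250)
```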
Because identical events feature similar temporal patterns in the A/V stream, a statistical approach is applicable for the alignment search. In our system a Hidden Markov Model (HMM) based classifier is used. The HMM works in the following manner: we build a search model with a "noise-event-noise" structure in the A/V stream (Fig.3). Hence three HMM models are built, one for the event (Event HMM), one for the beginning noise (Noise HMM1) and one for the ending noise (Noise HMM2). Different events use different subsets of the features {F2, F3, F4, F5} (Table 1), which are described in Section 5. Two alignment search experiments using HMMs concatenated with the grammars [22] in Fig.4 were conducted. The results using these two grammars are, however, not satisfactory, because some of the detected event boundaries are either too short or too long. The reason is that the HMM search did not use the shot count feature from SBD, i.e. F1 in Table 1. Theoretically, the shot count represents the duration of an event, therefore duration modeling techniques [23] could be used to solve this problem. But as discussed in subsection 3.1, all the A/V feature vectors are frame-based except that the SBD is shot-based, hence the duration model could not easily be used for our purpose. To include the shot count feature, we propose to use the three HMMs as three separate probability classifiers instead. As illustrated in Fig.5.(a), the selected duration of A/V input from the text analysis is broken into 3 non-overlapping shot-segments, Sn1, Se and Sn2. Each segment can contain several shots, and each segment acts as input to one HMM model. The three probability scores obtained from the HMMs are combined with different weights, wn and we. By assigning a different weight value G(M) according to the shot count M in the Se segment, the event shot duration is taken into consideration. We evaluate all possible partitions of Sn1, Se and Sn2 to find the one that yields the highest combined score, which gives the recognized event boundaries.

(a) ( noise1 [event] noise2 )
(b) ( < noise1 | noise2 > [event] < noise2 | noise1 > )
Figure 4: HMM grammars [20]

Figure 5: Probability score combination scheme

The details of computing the HMM probability score are as follows. Let S denote the search range of the A/V features indicated by text analysis, and let Θn1, Θe and Θn2 be the parameters of the beginning noise HMM, the event HMM and the ending noise HMM respectively. The segment S is decomposed into 3 possible shot-segments, i.e. S = [Sn1, Se, Sn2], and let Xn1, Xe and Xn2 denote the features within each of the segments respectively. To compute the probability of our model (Fig.5.(a)), we evaluate

p(X|Θ) = wn · p(Xn1|Θn1) + we · G(M) · p(Xe|Θe) + wn · p(Xn2|Θn2)    (1)

where X = [Xn1, Xe, Xn2] and Θ = {Θn1, Θe, Θn2}. If the weights are wn = 1 and weG(M) = 1, and the 3 segments used to compute Eq.(1) are the same as those found by Viterbi decoding using the model of Fig.4.(a), then the probability computed would be the same. In practice, the noise segments Xn1 and Xn2 do not present distinguishing temporal patterns, hence a higher weight is given to p(Xe|Θe). To incorporate the shot count information in the probability evaluation, we introduce G(M) to model the shot count feature, specifically

G(M) = (1 / (σe√(2π))) · exp(−(M − M̄)² / (2σe²))    (2)

where M̄ and σe represent the mean and standard deviation of the shot count. The values of M̄ and σe are found during training and differ between events.

Similar to Fig.4.(b), we change the scheme in Fig.5.(a) to allow self-jumps between the noise HMMs, as illustrated in Fig.5.(b). Our experimental results show that introducing the self-jump structure improves performance. Fig.6 gives an example of the probability scores with respect to all possible partitions of S. In this example of a card event, the suggested search range by text analysis is from frame 141241 to 144991. The x/y axes represent the start/end frame of Se, and the z axis is the normalized probability p(X|Θ). The optimum partition is highlighted, and the recognized event boundary is from frame 143366 to 143915.

Figure 6: Probability scores example for a "card" event
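The partition search of Eq.(1)-(2) can be sketched as follows, assuming the search range is already decomposed into shots and that three callables return p(X|Θ) for a shot-feature sequence under the respective HMM (for example, the exponential of an hmmlearn model's log-likelihood). The handling of empty noise segments and the exhaustive search over all partitions are illustrative simplifications, not the authors' exact procedure.

```python
import math

def shot_count_prior(m: int, m_mean: float, m_sigma: float) -> float:
    """G(M) of Eq.(2): Gaussian prior on the number of shots in the event."""
    return math.exp(-(m - m_mean) ** 2 / (2 * m_sigma ** 2)) / (m_sigma * math.sqrt(2 * math.pi))

def best_event_boundary(shots, p_n1, p_e, p_n2, m_mean, m_sigma,
                        w_n: float = 0.2, w_e: float = 0.8):
    """Evaluate Eq.(1) for every partition [S_n1 | S_e | S_n2] of the shot list
    and return the (start, end) shot indices of S_e with the highest score.

    `shots` is a list of per-shot feature sequences; `p_n1`, `p_e`, `p_n2`
    return p(X|Theta) of a list of shots under the corresponding HMM. The
    weights follow the values reported in the experiments (w_n=0.2, w_e=0.8).
    """
    n = len(shots)
    best, best_score = None, float("-inf")
    for start in range(n):                      # first shot of the event segment
        for end in range(start + 1, n + 1):     # one past the last event shot
            x_n1, x_e, x_n2 = shots[:start], shots[start:end], shots[end:]
            score = (w_n * (p_n1(x_n1) if x_n1 else 0.0)
                     + w_e * shot_count_prior(len(x_e), m_mean, m_sigma) * p_e(x_e)
                     + w_n * (p_n2(x_n2) if x_n2 else 0.0))
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score
```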
After the alignment of the A/V and text analysis, the semantic sports video content selection task is completed and a pool of video contents is generated. In the next section, the user can specify certain contents to be picked out from the pool to generate the desired MSVs.

4. AUTOMATIC VIDEO COMPOSITION

With the semantic video content selection result, the original broadcast video can be re-edited to generate new video materials. In the following subsections, we introduce two possible schemes to generate MSVs: the video-centric scheme and the music-centric scheme.

4.1 Video-Centric MSV

The video-centric scheme generates an MV by matching music clips to the selected video contents. Since the video is the dominant content in a video-centric MV and users are more interested in the selected video content, the music clips only play a minor role in enriching the video content. This scheme generates MSVs also commonly known as sports summaries. We first examine the current practice of making soccer summaries by professionals. A quantitative study of 1 hour of broadcast soccer summary was conducted. The rules found by the study, together with their associated implementations in our system, are listed below:

1) Content: Current soccer summaries mainly select soccer events as the content, such as "goal", "foul", "injury", etc. Our semantic video content selection approach enables the system to select video content not only by event, but also by player(s) and team. Hence our system satisfies the current content selection rule and additionally provides the audience with more choices to customize their favored content.

2) Sort: Current soccer summaries group the selected events in chronological order. In our system, event detection is required for video content selection both by event and by player(s)/team, hence the selected video content can always be sorted chronologically, and thus the rule of sorting selected video contents is satisfied.

3) Music: Current soccer summaries are normally accompanied by background music. For the video-centric case, there is no need to align the video shot boundaries with music structure boundaries as in the music-centric MSV of subsection 4.2. Hence this rule can be easily satisfied by multiplexing user-specified music clips with the generated video. Our study reveals that applying different background music gives the generated music soccer summary a different feeling. For example, fast music makes the summary jolly and lively, while slow music makes the summary heavy and serious. Both styles exist in professional MSVs, hence users can customize the feeling of the automatically generated summary by specifying different music clips.

In our implementation of this scheme, personalized video contents are first selected from the prepared video content selection pool in chronological order and then multiplexed with the music clips. The automatically generated video-centric MSV is expected to be comparable with that produced by professionals, which is confirmed by our later experimental results.

4.2 Music-Centric MSV
The music-centric scheme generates an MV by matching video contents to a selected music clip. Generating a music-centric MSV is a difficult task for the following reasons: (1) a music-centric MV is driven by both the video and the music, hence understanding the music content is also necessary; (2) the artistic style in a music-centric MSV plays an important role, therefore both content and tempo matching between the video and the music are required. In order to generate a satisfactory music-centric MSV, a semantic music content analysis method and a novel video/music matching scheme are introduced in the following subsections.
4.2.1 Analyzing the semantic music structure
As proposed in our previous work [24], the music content can be described using a hierarchical structure as in Fig.7. Three types of boundary information are used for our music-centric MSV generation task, namely the "beat boundary", which is the boundary between music beats, the "lyric boundary", which marks the start of a lyric sentence, and the "semantic music structure boundary", which is the boundary between successive semantic music structures such as Introduction (Intro), Verse, Chorus, Bridge, Instrumental and Ending (Outro) [24]. We assume the music time signature to be 4/4, this being the most frequent meter of popular songs. Our previous work [24] is used to extract the "beat boundary" and "semantic structure boundary". The "lyric boundary" is obtained by collecting lyrics with time-stamps from the Internet. Once the boundary information is extracted, the music is ready for the video/music matching in the next subsection.
Figure 7: Example of music structure
4.2.2 Semantic video/music matching
The music-centric MV is a short film meant to present a visual representation of a popular music song, therefore the quality of the visual representation determines the quality of the MV. A good visual representation should present comprehensive semantic information. However, existing work does not address the video/music matching problem from the semantic perspective [1, 8, 9]. An enjoyable music-centric MV with excellent visual representation requires the video and music to be well matched in the following two aspects:

1) Content matching: It is observed from 20 professional music soccer videos that different portions of the music songs present different video contents. For example, professionals sometimes use landscape scenes for the "Intro" and excited player, coach and audience scenes for the "Chorus" of the music. Hence our semantic matching module should pick the most suitable content, such as certain events, players or teams, for the different semantic music structures of the song, e.g. far view for the "Intro" and close-up view in a "goal" event for the "Chorus". To achieve this, a user-defined rule set such as the one in Fig.8 is used. The syntax is as follows: the first field is the name of the music structure, currently including Intro, Verse, Chorus, Bridge and Outro; the second field is the type of visual content, its value being either the name of an event, a player, a team, or "all"; and the third field is the type of shot as described in subsection 3.1. Such rules can be customized to produce personalized music-centric MSVs.

Figure 8: Example of semantic matching rules
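Since the original Figure 8 is not reproducible here, the snippet below gives a hypothetical reconstruction of such a rule table and of the binary check it implies. The concrete structure/content/shot combinations are our own examples of the three-field syntax, not the rules actually used by the authors.

```python
# Hypothetical rules following the three-field syntax described above:
# (music structure, visual content: event/player/team name or "all", shot type)
RULES = [
    ("Intro",  "all",  "far view"),
    ("Verse",  "all",  "in-field medium view"),
    ("Chorus", "goal", "in-field close-up view"),
    ("Bridge", "save", "in-field medium view"),
    ("Outro",  "all",  "far view"),
]

def rule_satisfied(structure: str, content: str, shot_class: str, rules=RULES) -> int:
    """1 if some rule allows this content and shot type for the music structure
    (this is the binary factor used in the matching score below), else 0."""
    return int(any(s == structure and c in ("all", content) and t == shot_class
                   for s, c, t in rules))

print(rule_satisfied("Chorus", "goal", "in-field close-up view"))   # 1
```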
Once the rule is defined, another consideration for presenting comprehensive semantic information in the MV is to preserve as much of the semantics of the video as possible after aligning the video contents with the music tempo.

2) Tempo matching: This task is necessary because a good music-centric MV presents shot changes at suitable time points. It is observed from 20 professional music soccer videos that such time points include the "semantic music structure boundary", the "lyric boundary", and sometimes the "beat boundary". Hence the tempo matching module performs the alignment between shot boundaries and music structure boundaries. Previous researchers have studied this task, e.g. [8] and [9], where the shots in home video were cut short to match the music clips. However, in broadcast soccer video each shot is produced by professional directors and as a whole carries complete semantic content, hence cutting shots short is not preferred in our system. To match the shots with the music boundaries while preserving as much semantic information as possible, we propose to use the event, containing a series of shots, as the basic video/music tempo matching unit instead of any single shot as in [1, 8]. To select the most suitable matching unit, a scoring scheme is defined to find an event whose shot durations best match the lengths of the respective music structures so as to reduce cutting. In addition, the motion of each shot is also computed and matched with that of the music such that the speed of the video and the music are similar. This idea is explained as follows. Assume the video content selection block generates a video content pool E = {E1, E2, ..., EL}, where Ei is the i-th soccer event, i = 1..L. Let K denote the number of shots in Ei, specifically Ei = [Si1, Si2, ..., SiK]. To select a suitable event, a matching score vi for event Ei is computed as

vi = (1/K) · Σ_{k=1}^{K} p̄(X|Θ) · rik · vik    (3)

where X is the A/V features extracted from the search range used to find event Ei (Section 3.3) and p̄(X|Θ) is the normalized probability score (Eq.(1)), which is used as a confidence for Ei. The variable rik is a binary value used to reflect the matching of the personalized semantic rules (Fig.8); its value is set to 1 if the current event Ei and current shot Sik satisfy a rule, and 0 otherwise. The variable vik measures the distance between a single shot Sik in the event Ei and the music boundary. Specifically, vik is computed by

vik = exp(−‖Tik − Tm‖² / (2Σm²))    (4)

where Tik = [Dik, Mik] ∈ R² represents the characteristics of the shot Sik, and Tm = [Dm, Mm] ∈ R² represents the music boundary characteristics. The variable Dik is the duration (in seconds) of the shot Sik and Mik is the average motion intensity computed for the shot. The variable Dm is the duration (in seconds) of the music boundary segment, and Mm is a measure of the music intensity. Although Mik and Mm are measures of different features, we use them in this form to correlate visual features with the music tempo. The variance Σm² ∈ R²ˣ² is a diagonal matrix with its diagonal values empirically set to 10% of Tm. This scoring scheme is applied to all the events in the video content pool. Each time, the event that obtains the highest score is picked out. This process is repeated until all the music boundaries have been matched. Then the selected events are integrated with the music to produce the music-centric soccer MV.
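A compact sketch of the scoring of Eqs.(3)-(4) is given below. It assumes each shot is summarized by its duration and average motion intensity, and each music boundary segment by its duration and an intensity measure; the diagonal 10% variances follow the text above, while the numeric example values are invented.

```python
import math

def shot_music_distance(shot_dur, shot_motion, music_dur, music_intensity):
    """v_ik of Eq.(4): Gaussian similarity between a shot and a music boundary.

    The diagonal variances are set to 10% of the music characteristics,
    as stated in the text above.
    """
    sigma_d, sigma_m = 0.1 * music_dur, 0.1 * music_intensity
    d2 = ((shot_dur - music_dur) / sigma_d) ** 2 + ((shot_motion - music_intensity) / sigma_m) ** 2
    return math.exp(-d2 / 2.0)

def event_matching_score(event_shots, confidence, music_dur, music_intensity,
                         rule_flags):
    """v_i of Eq.(3): average over the K shots of confidence * r_ik * v_ik.

    event_shots: list of (duration_s, avg_motion) per shot;
    confidence:  normalized probability of the event from Eq.(1);
    rule_flags:  r_ik values (0/1) from the semantic matching rules.
    """
    k = len(event_shots)
    return sum(confidence * r * shot_music_distance(d, m, music_dur, music_intensity)
               for (d, m), r in zip(event_shots, rule_flags)) / k

# Example: a 3-shot "goal" event matched against a 6-second Chorus segment.
shots = [(5.8, 0.7), (6.5, 0.9), (2.0, 0.2)]
print(round(event_matching_score(shots, confidence=0.85,
                                 music_dur=6.0, music_intensity=0.8,
                                 rule_flags=[1, 1, 0]), 3))   # ~0.214
```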
5. EXPERIMENTAL RESULTS
In the following experiments, a soccer video data set containing 7 World-Cup 2002 games and 4 Euro-Cup 2004 games, with all corresponding TWB files collected from the Internet, is used. The total video database is 16 hours long.
5.1 Accuracy of A/V Analysis
The accuracies of the 5 features extracted for A/V analysis are reported in the following subsections.
5.1.1 Shot boundary detection (F1)

The SBD precision of M2-Edit Pro [11] reaches 81.2% on the whole data set. Since SBD is not a key part of this research, we manually refined the F1 feature by deleting any obviously unnecessary boundaries to combine the neighboring shots.
5.1.2 Semantic shot classification (F2)

A total of 60 minutes of video from 2 matches in the data set is manually labeled to train our semantic shot classification module. The test results over the rest of the data set are listed in Table 3. Errors are mainly due to game noise such as unbalanced luminance, shadows, captions, etc.
Table 3: Precision of Shot Classification
Class       F       IM      OM      IC      OC
Precision   94.5%   89.7%   87.7%   76.7%   92.7%
F: Far view; M: Medium view; C: Close-up view; I: In-field; O: Out-field
5.1.3 Replay detection (F3)

The template matching technique described in subsection 3.1.3 is applicable to our soccer video data because these videos all use the flying-logo effect before and after the replay scenes. 91% of the replay scenes are correctly detected over the whole database, and the errors are due to the absence of the flying-logo in the original broadcast.
5.1.4 Camera motion (F4)

Since the motion features are directly extracted from the MPEG-2 motion vector field, they are objective and no further experiment is carried out on them.
5.1.5 Audio keyword (F5)

The experiment has been described in [15], and the audio keyword accuracy is listed in Table 4.
Table 4: Accuracy of Audio Keyword
Keyword    Acclaim   Whistle   Noise
Accuracy   93.8%     94.4%     96.3%
5.2 Accuracy of Text Analysis
We use the collected text commentaries to conduct the text analysis, and the performance is listed in Table 5, columns "Precision" and "Recall". We have noticed that events are sometimes missing from our collected TWBs (especially those from web-sites whose TWBs are tagged by sports fans). To tackle this problem, we propose two possible solutions: (1) we collect the TWBs from sports web-sites like ESPN and BBC [19], where the TWBs are generated by professionals; these TWBs ensure that all the events are logged and can be used as ground truth. (2) In case a TWB is not available from professional sources, we collect multiple TWBs for the same game from different web-sites such that they complement each other and more sports events are recorded.
5.3 Accuracy of A/V and Text Alignment
In this experiment all 8 types of text event detected in the previous subsection are sent for alignment with the A/V stream. The A/V features used for boundary detection of the different events are empirically selected and listed in the "Feature" column of Table 5. To measure the performance of the A/V and text alignment search, the "Boundary detection accuracy" (BDA) value [15] is computed; a higher BDA value represents better accuracy. For the experiment, the data set is split 66%/34% for training/testing. Fig.9 compares the performance of A/V and text alignment for the different ways of using the HMMs described in subsection 3.3.
Figure 9: Event boundary detection performance using HMMs in different manners

It is observed from Fig.9 that the HMM scoring scheme of Fig.5.(b) gives the best performance, with wn and we (Eq.(1)) empirically set to 0.2 and 0.8 respectively. The detailed boundary detection accuracy for each event is listed in the "BDA" column of Table 5.

Table 5: Performance of Text Event Detection and Event Boundary Detection
Text event     Precision   Recall   Feature        BDA
card           97.6%       95.2%    F2, F3         88.3%
foul           96.9%       93.9%    F2, F3, F5     75.7%
goal           81.8%       93.1%    F2, F3, F5     84.3%
offside        100%        93.1%    F2, F3         70%
freekick       96.7%       100%     F2, F3, F4     58.7%
save           97.5%       79.6%    F2, F3, F5     66.2%
injury         100%        100%     F2, F3         92.5%
substitution   86.7%       100%     F2, F4         80.7%

The major source of inaccuracy in the A/V and text alignment is error in the A/V feature extraction. The A/V features are sometimes incorrect due to mistaken A/V analysis and/or interference from other events in the search range, which produces undesired probability peaks (Fig.6). Another reason for the inaccuracy is that the event boundary sometimes does not exist in the search range given by the text analysis, either because of inaccuracy in the TWB time-stamp or because the broadcast A/V data does not contain the stated event at all.
5.4 Performance of Video Composition
Since there is no objective measure available today to evaluate the quality of an MV, we employed a subjective user study [25] to evaluate the performance of our automatic MSV generation method. There are various attributes for evaluating an MV, including: Clarity, which pertains to the clearness and comprehensibility of the MV; in our automatic personalized MSV generation system, the clarity measure is reflected by the sports event detection precision and event boundary detection accuracy. Conciseness, which pertains to the terseness of the MV and how much of the MV captures the essence of the original video; for our MSV generation task, a concise MSV requires high precision in event boundary detection. Coherence, which pertains to the consistency and natural flow of the segments in the MV; this is reflected by the suitability of the video/music matching in the MSVs generated by our system. And Overall Quality, which pertains to the general perception of, or reaction to, the MV by the users. All these criteria are employed in the following subjective test on video quality.

We first use the proposed video-centric method to generate 3 MSV summaries: by event (the "card" events from Euro-Cup 2004), by player ("Ronaldo" in Euro-Cup 2004), and by team ("Germany" in World-Cup 2002). The video content selection results listed in Table 5 are used to provide the required video content for the summaries. The length of the generated MSVs ranges from 178 to 305 seconds. 3 music clips are used to match the video contents. We next use the proposed music-centric method to generate 3 MSVs. The selected music songs are "Forca" by Nelly Furtado, 220 seconds long, "Do I have to cry for you" by Nick Carter, 217 seconds long, and "When you say nothing at all" by Boyzone, 256 seconds long. We use the semantic sports video content selection results from the Euro-Cup 2004 game videos to match the music songs.

8 people took part in the study, 7 male and 1 female, aged between 24 and 38. Of the 8 people, 2 are sports fans and another 1 is a music fan. All 8 people have a basic idea of MVs but, except for the 2 sports fans, are not familiar with MSVs. The participants were asked to view the generated videos first and then give each clip a score in the 4 categories, i.e. Clarity, Conciseness, Coherence and Overall Quality, on a 5-point scale corresponding to strongly accept (5), accept (4), margin (3), reject (2) and strongly reject (1). The average score of an MSV over all subjects is the final score of this MSV. For comparison, we also asked the subjects to rate manually generated music soccer summaries and the music soccer videos generated using the muvee [1] software based on the same 3 selected music songs and video data set. To avoid biased evaluation results, we presented the MVs generated by the different methods in a random order, and the participants were not aware of the technique used to generate each video clip.

Table 6 shows the average scores of the user evaluation for the video-centric scheme as compared to a manually generated soccer summary (last column of Table 6). We also compare our automatic music-centric MSVs with those produced by muvee in Table 7.

Table 6: Video-Centric Music Soccer Video Quality
Criteria        SE     SP     ST     MS
Clarity         4.6    4.1    4.4    4.8
Conciseness     4.8    4.15   4.3    5
Coherence       4.5    4.5    4.3    4.3
Overall Qlty.   4.7    4.3    4.5    4.7
SE: Summary by event; SP: Summary by player(s); ST: Summary by team; MS: Manual summary (by event)

Table 7: Music-Centric Music Soccer Video Quality
Method        Criteria        MSV1   MSV2   MSV3
Our           Clarity         3.8    3.75   4.17
method        Conciseness     4.25   3.5    3.8
              Coherence       3.75   4.25   4.25
              Overall Qlty.   4.25   4.05   4.5
muvee         Clarity         3.33   3.25   3.15
              Conciseness     3.65   3.75   3.5
              Coherence       3.75   3.5    3.5
              Overall Qlty.   3.8    3.75   3
MSV1: "Forca"; MSV2: "Do I have to cry for you"; MSV3: "When you say nothing at all"

We observe from Table 6 that the "Clarity" of the soccer summary by player or by team gets a lower score compared with the soccer summary by event (either manual or automatic generation). This is mainly because our current system is unable to identify whether every single shot in an event is related to the required player/team or not; hence the generated video sometimes contains shots of unexpected players or teams, which affects the "Clarity" of the summary. The inability of our system to identify the content of any single shot also affects the "Conciseness" score of the automatically generated soccer summary. The "Conciseness" of the manual soccer summary gets a higher score because the professional producer can recognize and delete the undesired shots from the selection to make the production more concise. However, as the unexpected video shots only take up a small portion of the generated video, the "Overall Qlty." of our automatic generation is still comparable to the manually generated soccer summary.

Another result from the test is that the participants in the second test indicated that the generated music-centric MSVs show good tempo matching between video and music, but both the MSVs generated by our system and those generated by the muvee software are a bit difficult to understand, hence the "Clarity" and "Conciseness" scores are both low in Table 7. This phenomenon is especially notable in the MSVs generated by muvee because the video shots selected by that software have no semantic relationship with their neighboring shots. The possible reasons for our low "Clarity" and "Conciseness" scores are: (1) our music-centric MSV contains several event types, which makes it difficult to understand and thus lowers the "Clarity" score; to improve the system, fewer types of events should be used to generate one MSV, and we observe that professionals also follow this rule. (2) Because of the requirement to match the music boundaries, shots within an event are sometimes discarded. This results in incomplete events being used for the production, hence the "Clarity" and "Conciseness" performance is affected. With a more suitable video/music matching scheme and larger video data sets, the quality of the generated music-centric MSV will improve. The generated music video examples can be viewed at http://www.ntu.edu.sg/home5/Y020002.
6. CONCLUSION AND FUTURE WORK
This paper presents a novel framework to select soccer video content and to automatically generate music soccer video. This is important for reducing manual processing and, furthermore, it enables the generation of personalized music soccer video by event, by player(s) or by team. The generated video material can be used for customized soccer highlight generation, soccer video summarization, sports MTV production, etc. We have built a demo system for the framework. Although the system currently requires some human intervention, we envisage that it can be made fully automatic. For example, the human intervention required by our current system is to delete the unnecessary shot boundaries produced by M2-Edit Pro (subsection 5.1.1); without this manual refinement, our system is still capable of automatically generating the MSVs.

Our system can be extended to other sports domains for the following four reasons: (1) the required A/V analysis can be extended to other sports domains by applying sport-specific game rules and feature extraction. (2) The presented text analysis is applicable to other sports domains such as basketball and tennis because TWBs for these sports games are also available. (3) The proposed semantic video content selection scheme by event/player/team is generic across different sports games; by applying specific game rules, e.g. keyword definitions (Table 2), video contents of interest can be identified from various sports games. And (4) our automatic video composition scheme is not limited to any single sports domain.

We have begun investigating the next stage of the proposed system. The future work includes several parts: firstly, to improve the event detection accuracy and applicability by utilizing more advanced text analysis techniques; secondly, to increase the performance of event boundary detection by examining more robust A/V feature extraction and alignment methods with text; thirdly, to discover more MSV generation rules to make the automatic generation by our system more comparable to that of professionals; and fourthly, to explore new application areas for personalized video content by taking advantage of new digital broadcasting and transmission techniques.
7. REFERENCES

[1] MuVee Technologies Pte. Ltd, "Muvee," 2000.
[2] N. Adami, R. Leonardi, and P. Migliorati, "An overview of multi-modal techniques for the characterization of sport programmes," Proc. of SPIE-VCIP'03, pp. 1296-1306, July 2003.
[3] J. Wang, E. Chng, and C. Xu, "Soccer replay detection using scene transition structure analysis," Proc. of IEEE ICASSP'05, March 2005.
[4] J. Wang, et al., "Event detection based on non-broadcast sports video," Proc. of IEEE ICIP'04, Nov. 2004.
[5] A. Ekin, A. Tekalp, and R. Mehrotra, "Automatic soccer video analysis and summarization," IEEE Trans. on Image Processing, vol. 12, no. 7, pp. 796-807, 2003.
[6] J. Assfalg, et al., "Semantic annotation of soccer videos: automatic highlights identification," Computer Vision and Image Understanding (CVIU), vol. 92, pp. 285-305, Nov. 2003.
[7] N. Babaguchi and N. Nitta, "Intermodal collaboration: A strategy for semantic content analysis for broadcasted sports video," Proc. of IEEE ICIP'03, vol. 1, pp. 13-16, Sept. 2003.
[8] X. Hua, L. Lu, and H. Zhang, "Automatic music video generation based on temporal pattern analysis," Proc. of ACM MultiMedia'04, pp. 472-475, Oct. 2004.
[9] J. Foote, M. Cooper, and A. Girgensohn, "Creating music videos using automatic media analysis," Proc. of ACM MultiMedia'02, pp. 553-560, Dec. 2002.
[10] H. Xu and T. Chua, "The fusion of audio-visual features and external knowledge for event detection in team sports video," Workshop on Multimedia Information Retrieval (MIR'04), Oct. 2004.
[11] MediaWare Solutions Pte. Ltd (USA), "M2-Edit Pro," 2002.
[12] H. Pan, B. Li, and M. Sezan, "Automatic detection of replay segments in broadcast sports programs by detection of logos in scene transitions," Proc. of IEEE ICASSP'02, May 2002.
[13] V. Kobla, D. DeMenthon, and D. Doermann, "Detection of slow-motion replay sequences for identifying sports videos," Proc. IEEE Workshop on Multimedia Signal Processing, 1999.
[14] H. Pan, B. Li, and M. Sezan, "Detection of slow-motion replay segments in sports video for highlights generation," Proc. of IEEE ICASSP'01, May 2001.
[15] J. Wang, et al., "Automatic replay generation for soccer video broadcasting," Proc. of ACM MultiMedia'04, pp. 31-38, Oct. 2004.
[16] N. Babaguchi, Y. Kawai, and T. Kitahashi, "Event based indexing of broadcasted sports video by intermodal collaboration," IEEE Trans. on Multimedia, vol. 4, pp. 68-75, March 2002.
[17] N. Nitta, N. Babaguchi, and T. Kitahashi, "Generating semantic descriptions of broadcasted sports video based on structure of sports game," Multimedia Tools and Applications, vol. 25, pp. 59-83, Jan. 2005.
[18] N. Nitta and N. Babaguchi, "Automatic story segmentation of closed-caption text for semantic content analysis of broadcasted sports video," Proc. of 8th International Workshop on MIS'02, pp. 110-116, 2002.
[19] "http://news.bbc.co.uk/sport1/hi/football/teams/."
[20] C. Manning and H. Schutze, "Foundations of Statistical Natural Language Processing," The MIT Press, Cambridge, Massachusetts, May 1999.
[21] dtSearch Corp, "dtSearch 6.50 (6608)," 1991-2005.
[22] "Hidden Markov Model Toolkit," http://htk.eng.cam.ac.uk/.
[23] J. Pylkkonen and M. Kurimo, "Duration modeling techniques for continuous speech recognition," Proc. of IEEE ICASSP'04, pp. 385-388, May 2004.
[24] N. Maddage, et al., "Content-based music structure analysis with applications to music semantics understanding," Proc. of ACM MultiMedia'04, pp. 112-119, Oct. 2004.
[25] J. Chin, V. Diehl, and K. Norman, "Development of an instrument measuring user satisfaction of the human-computer interface," Proc. of SIGCHI on Human Factors in CS, pp. 213-218, 1998.