Feature-Based Synchronization of Video and Background Music

Jong-Chul Yoon, In-Kwon Lee, and Hyun-Chul Lee
Dept. of Computer Science, Yonsei University, Korea
[email protected],
[email protected],
[email protected]
Abstract. We synchronize background music with a video by changing the timing of the music, an approach that minimizes the damage to the music data. Starting from a MIDI file and video data, feature points are extracted from both sources, paired using dynamic programming, and then synchronized by time-scaling the music. We also introduce the music graph, a directed graph that encapsulates connections between many short music sequences. By traversing a music graph, we can generate large amounts of new background music, in which we expect to find a sequence that matches the video features better than the original music.
1 Introduction
Background music (BGM) enhances the emotional impact of video data. Here, BGM means any kind of music played by one or more musical instruments, and should be distinguished from sound effects, which are usually short sounds made by natural or artificial phenomena. We introduce a method that synchronizes BGM with motion in video data. Well-synchronized BGM helps to immerse the audience in the video, and can also emphasize the features of the scenes.

In most cases of film production, the picture comes first, and music and sound effects are added once the picture is completed [1]. One way to obtain music that synchronizes with a particular video is to hire a composer. Since this approach is expensive, it is more common, especially in a small production or home video, to fit existing recorded music to the video after it has been produced. However, it is not simple to find a piece of music that matches the video scene. One may have to go through several scores and listen to many selections in order to find a suitable portion for a given scene. Furthermore, it is still hard to match the selected music with every significant feature of the video.

Our goal is the automatic generation of synchronized video by choosing and modifying the music sequence, while avoiding drastic changes that would damage the music. Our system analyzes MIDI and video data to find the optimal matches between features of the music and the video using DP (Dynamic Programming). This is followed by modification of the time domain, so as to match the musical features while preventing noticeable damage. We also exploit the music graph [2] as a music rearrangement method. Similar to a motion graph [3,4,5], a music graph encapsulates connections between
several music sequences. Music can be generated by traversing the graph and then smoothing the resulting melody transitions. The music graph can be utilized in our synchronization system to generate new tunes that match the video better than the original music. The contributions of our research can be summarized as follows:

– We introduce a feature-based matching method that extracts the optimal matching sequence between the background music and the video.
– We introduce a stable time-warping method that modifies the original music while preventing noticeable damage to it.
– Using the music graph, we can generate novel background music that has better coherence with the video than the original music.
2 Related Work
There has been a lot of work on synchronizing music (or sounds) with video. In essence, there are two classes of approach, depending on whether one modifies a video clip for given music, or vice versa. Foote et al. [6] computed a novelty score for each part of the music and analyzed the movements of the camera in a video; a music video can then be generated by matching an appropriate video clip to each part of the original music. Another segment-based matching method was introduced by Hua et al. [7]. Since home video, which is typically shot by non-professionals, tends to contain low-quality and unnecessary clips, Hua et al. calculated an attention score for each video segment as a way of extracting important shots. Using beat analysis of the video data, they estimated a coherent music tempo and beat, to which the tempo and beat of the given background music were then adjusted. Mulhem et al. [8] introduced aesthetic rules, commonly used by professional video editors, as a method of video composition. In addition to this research on composing video segments, Jehan et al. [9] suggested a method that controls the video time domain and synchronizes the feature points of both video and music: using manually specified temporal data, they time-warped a dance clip to synchronize it with the background music. Our method is similar, but works in the reverse direction: we time-warp the music. Yoo et al. [10] suggested a method to generate long background music sequences from a given music clip using a music texture synthesis technique. Lee et al. [2] introduced the music graph, a tool for synchronizing music with character motion in animation. Since video is more commonly used than animated scenes, we adapt the music graph to video-based BGM generation.
3 Feature Extraction
We will represent a video clip in the time interval $[t_b, t_e]$ as a multidimensional curve, $A(t) = (a_1(t), a_2(t), \ldots, a_n(t))$, $t_b < t < t_e$, which is called a feature curve.
Each of the component functions $a_i(t)$ represents a quantitative or qualitative property of the video clip, such as:

– Shot boundary.
– Camera movement (panning, zoom-in/out).
– The movement of any object in the video clip.
– An arbitrary function specified by the user.
Similar to the video clip, the BGM can also be represented by a multidimensional BGM curve, which we will write $M(s) = (m_1(s), m_2(s), \ldots, m_m(s))$, $s_b < s < s_e$. Each component function $m_i(s)$ represents any quantitative or qualitative property of the music, such as:

– Note pitch, duration, or velocity (volume).
– Inter-onset interval (duration between consecutive notes).
– Register (interval between highest and lowest pitches).
– Fitness to a fixed division (see Equation 3).
– Chord progression.
– Feeling of the music.
– An arbitrary function specified by the user.
Collecting such samples from the BGM is not easy when its source is analogue or digital sound data. A MIDI file makes extraction of the necessary data much easier. In the following subsections, we will look at some examples of how feature points are obtained from the video and BGM curves.
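For concreteness, the sampled feature and BGM curves could be held in a structure like the following. This is only an illustrative sketch; the class and field names (FeatureCurve, times, components) are our own assumptions rather than part of the original system.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FeatureCurve:
    """A multidimensional curve A(t) (or M(s)) sampled at regular times."""
    times: np.ndarray        # shape (T,), sample times in [t_b, t_e]
    components: np.ndarray   # shape (n, T), one row per component a_i(t)
    names: list              # human-readable label for each component

    def component(self, name: str) -> np.ndarray:
        """Return the sampled values of one named component."""
        return self.components[self.names.index(name)]

# Hypothetical example: a 10-second clip sampled at 30 fps with two components.
t = np.linspace(0.0, 10.0, 300)
video_curve = FeatureCurve(
    times=t,
    components=np.vstack([np.zeros_like(t),     # shot-boundary indicator
                          np.sin(0.5 * t)]),    # dummy camera-motion signal
    names=["shot_boundary", "camera_motion"],
)
```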
3.1 Video Feature Extraction
There are several methods for video feature extraction in computer vision and image processing. Ma et al. [11] suggested a feature extraction method using the motion of objects, human faces appearing in the video, and camera movement. Foote [6] used the variance of brightness to compute feature points. In our work, to analyze the camera movement, we use the ITM (Integral Template Matching) method introduced by Lan et al. [12]. Using ITM, we can extract shot boundaries and, at the same time, analyze the dominant motion of the camera and its velocity. The ITM system uses the MAD (Mean Absolute Difference) measure to derive the camera movement. Time instants with a sharp local maximum of the MAD can be considered shot boundaries. We can also determine the DM (Dominant Motion) of the camera as the candidate camera motion with minimum MAD. The MAD is defined by:

$$\mathrm{MAD}(\Delta x) = \frac{1}{N} \sum_{x \in T} |L_i(x) - L_{i+1}(x + \Delta x)|, \qquad (1)$$

where $\Delta x$ is a translation of the image basis and $L_i$ is the $i$th frame of the video. We use three types of $\Delta x$: vertical, horizontal, and zoom in/out.
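As an illustration, the following sketch evaluates Eq. (1) for a pair of grayscale frames and picks the dominant motion among a few candidate translations. This is a simplified NumPy sketch, not the ITM implementation of [12]: only horizontal and vertical shifts are handled, the zoom case is omitted, and the candidate shift set is an arbitrary assumption.

```python
import numpy as np

def mad(frame_a: np.ndarray, frame_b: np.ndarray, dx: int, dy: int) -> float:
    """Mean absolute difference between frame_a and frame_b shifted by (dx, dy).

    frame_a, frame_b: 2-D grayscale arrays of equal shape (Eq. 1, zoom omitted).
    """
    shifted = np.roll(np.roll(frame_b, dy, axis=0), dx, axis=1)
    # Crop the wrapped-around border so only validly overlapping pixels count.
    h, w = frame_a.shape
    ys = slice(max(dy, 0), h + min(dy, 0))
    xs = slice(max(dx, 0), w + min(dx, 0))
    diff = np.abs(frame_a[ys, xs].astype(float) - shifted[ys, xs].astype(float))
    return diff.mean()

def dominant_motion(frame_a, frame_b,
                    candidates=((0, 4), (0, -4), (4, 0), (-4, 0))):
    """Pick the candidate translation with minimum MAD (the dominant motion)."""
    scores = {c: mad(frame_a, frame_b, *c) for c in candidates}
    return min(scores, key=scores.get), scores
```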
After computing the dominant motion, we take the time instants where the DM changes as feature points. The features of a video clip are dominated by shot boundaries and changes in camera movement. In addition, we extract further features within each shot: inside a single shot, we apply the Camshift [13] method to track an object appearing in the shot. Using Camshift, we can analyze the movement of a user-selected object (see Figure 1). By tracking the trajectory of the selected region, we can construct the positional curve $p(t)$ of the selected object.

The component feature curves $a_i(t)$ are converted into feature functions $f_i(t)$ that represent the scores of the candidate feature points. For example, an $f_i(t)$ can be derived from $a_i(t)$ as follows:

$$f_i(t) = \begin{cases} q & \text{if } a_i'(t) = 0 \text{ and } a_i''(t) > 0 \\ 0 & \text{otherwise}, \end{cases} \qquad (2)$$

where $q$ is a predefined score corresponding to the importance of the feature. For example, we use 1.0 as the shot-boundary score and 0.8 as the camera-movement score. The score of the object movement is made proportional to the second derivative of the positional curve. Finally, the video feature function $F(t)$ is computed by merging the component feature functions $f_i(t)$. The user can select either one feature function or merge several functions together to give an overall representation of the video.
Fig. 1. Object Tracking using Camshift
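The conversion of Eq. (2) and the merging step might look as follows in NumPy. The scoring rule for object movement (scores placed at sign changes of $p'(t)$, proportional to $|p''(t)|$) and the use of a per-sample maximum for merging are plausible readings of the text, not necessarily the exact choices made by the authors.

```python
import numpy as np

def object_motion_feature(p: np.ndarray, dt: float) -> np.ndarray:
    """Score candidate feature points of a positional curve p(t) (cf. Eq. 2).

    Scores are placed where the first derivative changes sign (a local
    extremum of p) and are made proportional to |p''(t)| there.
    """
    v = np.gradient(p, dt)          # p'(t)
    a = np.gradient(v, dt)          # p''(t)
    f = np.zeros_like(p)
    sign_change = np.where(np.diff(np.sign(v)) != 0)[0]
    f[sign_change] = np.abs(a[sign_change])
    return f / (f.max() + 1e-9)     # normalize to [0, 1]

def merge_features(*feature_functions: np.ndarray) -> np.ndarray:
    """Merge component feature functions f_i(t); here the per-sample maximum."""
    return np.maximum.reduce(feature_functions)

# Hypothetical use: shot boundaries score 1.0, camera-movement changes 0.8.
# F = merge_features(shot_boundary_scores, 0.8 * camera_change_scores,
#                    object_motion_feature(p, dt=1.0 / 30))
```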
3.2 Music Feature Extraction
In our work, low-level data such as note pitch and note velocity (volume) can be extracted from MIDI files, and these data can be used to analyze higher-level data such as chord progressions [14]. These data are represented in separate BGM curves that can be either continuous or bouncing. The note velocity (volume) $m_1(s)$ in Figure 2(a) is a continuous function that represents the change in note volume through time. By contrast, the fitness function $m_2(s)$, which determines whether a note is played near a quarter note (a note played on the beat), is of the bouncing type. For example, the fitness function $m_2(s)$ can be defined as follows:

$$m_2(s) = \begin{cases} |s - s_k| & \text{if a note exists in } [s_k - \epsilon, s_k + \epsilon] \\ 0 & \text{otherwise}, \end{cases} \qquad (3)$$
$$s_k = k\Delta s, \qquad \Delta s = \frac{s_e - s_b}{4 N_m} \quad (k = 0, 1, 2, \ldots), \qquad (4)$$

where $\epsilon$ is a small tolerance and $N_m$ is the number of bars of the BGM; thus $\Delta s$ is the length of a quarter note. (Note that the time signature of the BGM in Figure 2 is 4/4.) Feature points can be extracted from the BGM curves in various ways, depending on the kind of data we are dealing with. For example, we may consider the local maximum points of the note velocity (volume) curve to be the features of this curve, because these are notes that are played louder than the neighboring notes.
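A small sketch of Eqs. (3)-(4), computing the quarter-note fitness from a list of MIDI note-onset times; sampling the curve only at the grid times $s_k$ and the default tolerance value are illustrative assumptions.

```python
import numpy as np

def quarter_note_fitness(onsets, s_b, s_e, n_bars, eps=0.05):
    """Fitness-to-quarter-note curve m2(s) of Eqs. (3)-(4).

    onsets : note-onset times (seconds) taken from the MIDI file
    n_bars : number of bars N_m; assumes a 4/4 time signature, so the
             quarter-note length is (s_e - s_b) / (4 * n_bars).
    Returns a dict mapping each quarter-note grid time s_k to |s - s_k|
    for the closest onset inside [s_k - eps, s_k + eps], or 0 if none.
    """
    ds = (s_e - s_b) / (4.0 * n_bars)
    grid = s_b + ds * np.arange(4 * n_bars + 1)
    onsets = np.asarray(onsets)
    fitness = {}
    for s_k in grid:
        near = onsets[np.abs(onsets - s_k) <= eps]
        fitness[s_k] = float(np.min(np.abs(near - s_k))) if near.size else 0.0
    return fitness
```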
Fig. 2. An example of BGM curves and feature point detection: (a) note velocity (volume); (b) fitness to quarter note
The BGM curves $m_i(s)$ are converted into the feature functions $g_i(s)$, as shown in Figure 2. In some cases, the fitness function can be used directly as the feature function since it represents discrete data. For example, $m_2(s)$ is inverted in order to represent how well a note fits a quarter note:

$$g_2(s) = \begin{cases} \dfrac{1}{m_2(s)} & \text{if } m_2(s) \neq 0 \\ 0 & \text{otherwise}. \end{cases} \qquad (5)$$

Finally, the BGM feature function $G(s)$ can be computed by merging the normalized component feature functions. The user can select either one feature function or merge several feature functions together to form the final representation of the music.
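The inversion of Eq. (5) and the merging of normalized feature functions could be written as follows; normalizing by the maximum absolute value and merging with a per-sample maximum are assumptions consistent with, but not dictated by, the text.

```python
import numpy as np

def invert_fitness(m2: np.ndarray) -> np.ndarray:
    """Eq. (5): g2(s) = 1 / m2(s) where m2(s) != 0, else 0."""
    g2 = np.zeros_like(m2, dtype=float)
    nonzero = m2 != 0
    g2[nonzero] = 1.0 / m2[nonzero]
    return g2

def music_feature_function(*g: np.ndarray) -> np.ndarray:
    """Merge normalized component feature functions g_i(s) into G(s)."""
    normalized = [gi / (np.abs(gi).max() + 1e-9) for gi in g]
    return np.maximum.reduce(normalized)
```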
4 Synchronization Using DP Matching
DP matching is a well-known method for retrieving similarities in time-series data such as speech or motion. Using DP matching, we can find the partial sequence of the given BGM that best matches the video clip, while also pairing the feature points of the video and music. To synchronize the music and video, we then modify the music using those feature pairs in a way that does not cause severe damage to the music.
4.1 DP Matching
The DP matching method does not require the video and music to be of the same length. However, we will assume that the music sequence is longer than the video, so that there is enough music for the video. Following Section 3, we assume that $F(t)$, $t_b \le t \le t_e$, and $G(s)$, $s_b \le s \le s_e$, are the feature functions for the video and music, respectively. For DP matching, we use $t_i$, $i = 1, \ldots, T$, and $s_j$, $j = 1, \ldots, S$, which consist, respectively, of the $T$ and $S$ sampled feature points of $F(t)$ and $G(s)$, and which satisfy $F(t_i) > 0$ and $G(s_j) > 0$ for all $i$ and $j$. Note that we place default sample feature points at the boundary of each feature function, such that $t_1 = t_b$, $t_T = t_e$, $s_1 = s_b$, and $s_S = s_e$. The distance $d(F(t_i), G(s_j))$ between a video feature point and a BGM feature point is given by:

$$d(F(t_i), G(s_j)) = c_0 (F(t_i) - G(s_j))^2 + c_1 (t_i - s_j)^2, \qquad (6)$$

where $c_0$ and $c_1$ are weight constants that control the relative influence of the score difference and the time distance. The DP matching method accumulates $d(F(t_i), G(s_j))$ in a matching matrix $q(F(t_i), G(s_j))$ of dimension $T \times S$. The matching matrix is calculated as follows:

$$q(F(t_1), G(s_j)) = d(F(t_1), G(s_j)) \quad (j = 1, \ldots, S) \qquad (7)$$

$$q(F(t_i), G(s_1)) = d(F(t_i), G(s_1)) + q(F(t_{i-1}), G(s_1)) \quad (i = 2, \ldots, T) \qquad (8)$$

$$q(F(t_i), G(s_j)) = d(F(t_i), G(s_j)) + \min \begin{pmatrix} q(F(t_{i-1}), G(s_j)) \\ q(F(t_{i-1}), G(s_{j-1})) \\ q(F(t_i), G(s_{j-1})) \end{pmatrix} \quad (i = 2, \ldots, T,\; j = 2, \ldots, S) \qquad (9)$$

$$D(F, G) = \min \{ q(F(t_T), G(s_j)) \mid 1 \le j \le S \}. \qquad (10)$$
Here $q(F(t_T), G(s_j))$ is the total distance between the video feature point sequence $F$ and the partial BGM feature point sequence of $G$ when $F(t_T)$ matches $G(s_j)$, and $D(F, G)$ is the total distance between $F$ and the partial sequence of $G$ starting from $s_1$. In order to find the optimal match, we advance the starting point of $G$ from $s_1$ to $s_{S-T}$ and recalculate the matching matrix until we obtain the minimum value of $D(F, G)$. This dynamic programming algorithm establishes the optimal matching pairs of video and music feature points with time complexity $O(N^3)$, where $N = \max(T, S)$.
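A compact sketch of Eqs. (6)-(10), including the outer loop over starting offsets, is given below. The default weights c0 and c1 and the decision to measure times relative to the start of each clip are assumptions; the paper leaves both unspecified.

```python
import numpy as np

def dp_match(F, t, G, s, c0=1.0, c1=0.1):
    """DP matching of Eqs. (6)-(10) for one starting offset of the music.

    F, t : scores and times of the T video feature points
    G, s : scores and times of the S music feature points (S >= T assumed)
    Returns D(F, G) and the T x S matching matrix q.
    """
    F, t, G, s = map(np.asarray, (F, t, G, s))
    t = t - t[0]                 # assumption: times are compared relative to
    s = s - s[0]                 # the start of each clip
    d = (c0 * (F[:, None] - G[None, :]) ** 2 +
         c1 * (t[:, None] - s[None, :]) ** 2)               # Eq. (6)
    T, S = d.shape
    q = np.empty((T, S))
    q[0, :] = d[0, :]                                       # Eq. (7)
    q[:, 0] = np.cumsum(d[:, 0])                            # Eq. (8)
    for i in range(1, T):
        for j in range(1, S):                               # Eq. (9)
            q[i, j] = d[i, j] + min(q[i - 1, j], q[i - 1, j - 1], q[i, j - 1])
    return q[-1, :].min(), q                                # Eq. (10)

def best_start(F, t, G, s):
    """Slide the starting feature point of the music (s_1 ... s_{S-T})
    and keep the offset with the minimum DP matching distance."""
    S, T = len(G), len(F)
    offsets = range(max(1, S - T))
    k = min(offsets, key=lambda k: dp_match(F, t, G[k:], s[k:])[0])
    return k, dp_match(F, t, G[k:], s[k:])[0]
```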
4.2 Music Modification
Now we synchronize the feature points by time-scaling the music to match the feature pairs obtained from DP matching. First we plot the feature pairs and
interpolate the points using a cubic B-spline curve [15]. We use curve interpolation in order to minimize the perceptual tempo change around the feature pairs. Once an interpolation curve $C(u) = (s(u), t(u))$ has been computed, each music event occurring at a time $s^* = s(u^*)$ is moved to $t^* = t(u^*)$ (see Figure 3(a)). Before we apply the scaling to the music, we discard the points that would cause large local deformations and lead to abrupt time scaling of the music. The points to be discarded are those lying further than a user-specified threshold from the least-squares line that approximates all the feature pairs. The red points in Figure 3(a) are removed, producing the new curve illustrated in Figure 3(b). This new curve changes the tempo of the music locally, with natural ritardando and accelerando effects.
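The warping step could be sketched as follows. For simplicity the sketch treats the warp as a monotone function t = w(s) and uses SciPy's CubicSpline through the retained pairs, whereas the paper interpolates the parametric curve C(u) = (s(u), t(u)) with a cubic B-spline; the outlier test here also uses the vertical residual from the least-squares line rather than the perpendicular distance.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def build_time_warp(s_pts, t_pts, threshold=1.0):
    """Build a music-to-video time warp from matched feature pairs.

    Feature pairs lying farther than `threshold` from the least-squares
    line through all pairs are discarded, then the remaining pairs are
    interpolated with a cubic spline.  Assumes the kept pairs have
    strictly increasing music times.
    """
    s_pts, t_pts = np.asarray(s_pts, float), np.asarray(t_pts, float)
    a, b = np.polyfit(s_pts, t_pts, deg=1)        # least-squares line t = a*s + b
    residual = np.abs(t_pts - (a * s_pts + b))
    keep = residual <= threshold
    return CubicSpline(s_pts[keep], t_pts[keep])

# Remap every MIDI event time s* to its new time t* = warp(s*), e.g.
# warp = build_time_warp(music_feature_times, video_feature_times)
# new_times = warp(original_event_times)
```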
Fig. 3. Music time-scaling using B-spline curve interpolation. The red circles on the s-axis indicate the feature points of the BGM, and the red crosses on the t-axis indicate the feature points of the video: (a) all feature pairs used to interpolate the curve; (b) after removal of feature pairs that will damage the music.
5 Music Graph
The music graph [2] encapsulates connections between several music sequences. New sequences of music can be generated by traversing the graph and applying melody blending at the transition points. The goal of the music graph is to retain the natural flow of the original music while generating many new tunes. A traversal of the music graph begins at a vertex selected by the user. Because every edge in the music graph is weighted by a chord distance, the next edge in the traversal can be selected by a random process influenced by this weight, as in a Markov chain [16], which is a standard tool for algorithmic composition in computer music [17,18]. The system traverses the music graph randomly and repeatedly (100 times in our work), retrieving new music sequences. We expect that some of these will synchronize more effectively with the video than the original music clip. We measure the disparity between each new music sequence and the given video using the DP matching distance function, and select the sequence with the minimum distance.
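One way to realize such a traversal is sketched below: edges carry chord distances, lower distances are favored by the random choice (a first-order Markov chain), and the best of the candidate sequences is kept according to a caller-supplied DP-matching distance. The graph representation, the weighting 1/(distance + temperature), and the function names are illustrative assumptions, not the implementation of [2].

```python
import random

def traverse(graph, start, length, temperature=1.0):
    """Random walk on a music graph.

    graph : dict mapping a vertex (music segment id) to a list of
            (next_vertex, chord_distance) edges.
    Edges with smaller chord distance are chosen with higher probability.
    """
    path = [start]
    for _ in range(length - 1):
        edges = graph.get(path[-1], [])
        if not edges:
            break
        weights = [1.0 / (dist + temperature) for _, dist in edges]
        nxt, _ = random.choices(edges, weights=weights, k=1)[0]
        path.append(nxt)
    return path

def best_sequence(graph, start, length, video_distance, n_trials=100):
    """Generate n_trials candidate sequences and keep the one whose
    DP-matching distance to the video (video_distance) is minimal."""
    candidates = [traverse(graph, start, length) for _ in range(n_trials)]
    return min(candidates, key=video_distance)
```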
Fig. 4. Basic concept of the music graph
6 Results
In our experiments, we used two video clips, showing Flybar and scenes of Venice, and music data of various genres, including classical, waltz, and dance. We synchronized the Flybar clip (Figure 5(a)) with the Light Character Polka by Johann Strauss as BGM. The length of the video clip is 43 seconds, and the BGM is 198 seconds long. We used the fitness to a quarter note as the main feature score of the music, and the shot boundaries, camera movement, and object movement as the video feature score. The second clip shows scenes of Venice (Figure 5(b)). We used An der schönen blauen Donau, also composed by Johann Strauss, as BGM. The length of the video clip is 33 seconds, and the BGM is 201 seconds long. We again used the fitness to a quarter note as the feature score of the music, and the shot boundaries as the video feature score. In the case of the Flybar video, we used the movement of objects as the dominant feature of the synchronization, while for the Venice scenes we used the shot boundaries as the dominant term, to generate effects similar to a typical music video. Although there are some non-uniform shot changes, we could create a nicely synchronized video through the music modification.

The next example uses the music graph. Using a single BGM, the G minor Sonata Op. 49 by Beethoven, we constructed a music graph.
Fig. 5. Sample video clips: (a) Flybar; (b) scenes of Venice
The original music consists of two main parts, a piano solo and an orchestral part. As a result of the similarity analysis, we extracted 329 transitions in the resulting music graph. By traversing the music graph, we synchronized the features of the Flybar video with the synthesized music. The resulting BGM is more dynamic and more coherent with the object movement than the original music. Table 1 compares the synchronization obtained with the original BGM and with the BGM synthesized by the music graph.

Table 1. The music graph can generate better-synchronized BGM (with a lower DP matching distance) than the original BGM

BGM                      Length     DP matching distance
G minor Sonata Op. 49    168 sec    8.92
Music graph              44 sec     7.09

7 Conclusion
We have presented a method to synchronize background music and video using DP matching and the music graph. Our method matches the feature points extracted from the music and video by time-scaling the music. By modifying the music only slightly, we minimize the changes to the original data needed to synchronize the feature points. The music graph is a new way to synthesize music from a directed graph of music clips, and it has various applications; in this paper we have shown how it can be used to generate well-synchronized background music for a given video.

There are several factors that could make the music graph more useful. Replacing random search with systematic traversal methods, as used in motion graph research [4,5], is one possibility. Additionally, we could extend the transition distance and melody blending functions to take melodic or rhythmic theory into account. Using the DP matching distance, we can measure the suitability of a BGM; with a music database, we could then retrieve the most suitable BGM by comparing matching scores. However, to match the mood of the BGM and the video, the user may still have to select the BGM candidates. At a higher level, it may be possible to parameterize both music and video in terms of their emotional content [19]. Synchronizing emotions could be a fascinating project.

Acknowledgement. This work was supported by the Ministry of Information & Communications, Korea, under the Information Technology Research Center (ITRC) Support Program.
References

1. Burt, G.: The Art of Film Music. Northeastern University Press (1996)
2. Lee, H.C., Lee, I.K.: Automatic synchronization of background music and motion in computer animation. In: Proceedings of EUROGRAPHICS 2005. (2005) 353–362
3. Kovar, L., Gleicher, M., Pighin, F.: Motion graphs. In: Proceedings of ACM SIGGRAPH. (2002) 473–482
4. Arikan, O., Forsyth, D.: Interactive motion generation from examples. In: Proceedings of ACM SIGGRAPH. (2002) 483–490
5. Lee, J., Chai, J., Reitsma, P., Hodgins, J., Pollard, N.: Interactive control of avatars animated with human motion data. In: Proceedings of ACM SIGGRAPH. (2002) 491–500
6. Foote, J., Cooper, M., Girgensohn, A.: Creating music videos using automatic media analysis. In: Proceedings of ACM Multimedia 2002. (2002) 553–560
7. Hua, X.S., Lu, L., Zhang, H.J.: AVE - automated home video editing. In: Proceedings of ACM Multimedia 2003. (2003) 490–497
8. Mulhem, P., Kankanhalli, M.S., Hassan, H., Yi, J.: Pivot vector space approach for audio-video mixing. In: Proceedings of IEEE Multimedia 2003. (2003) 28–40
9. Jehan, T., Lew, M., Vaucelle, C.: Cati dance: self-edited, self-synchronized music video. In: Proceedings of SIGGRAPH Conference Abstracts and Applications. (2003) 27–31
10. Yoo, M.J., Lee, I.K., Choi, J.J.: Background music generation using music texture synthesis. In: Proceedings of the International Conference on Entertainment Computing. (2004) 565–570
11. Ma, Y.F., Lu, L., Zhang, H.J., Li, M.J.: A user attention model for video summarization. In: Proceedings of ACM Multimedia 2002. (2002) 533–542
12. Lan, D.J., Ma, Y.F., Zhang, H.J.: A novel motion-based representation for video mining. In: Proceedings of the IEEE International Conference on Multimedia and Expo. (2003) 469–472
13. Bradski, G.R.: Computer vision face tracking as a component of a perceptual user interface. In: Proceedings of the Workshop on Applications of Computer Vision. (1998) 214–219
14. Rowe, R.: Machine Musicianship. MIT Press (2004)
15. Hoschek, J., Lasser, D.: Fundamentals of Computer Aided Geometric Design. AK Peters (1993)
16. Trivedi, K.: Probability & Statistics with Reliability, Queuing, and Computer Science Applications. Prentice-Hall (1982)
17. Cambouropoulos, E.: Markov chains as an aid to computer assisted composition. Musical Praxis 1 (1994) 41–52
18. Trivino-Rodriguez, J.L., Morales-Bueno, R.: Using multiattribute prediction suffix graphs to predict and generate music. Computer Music Journal 25 (2001) 62–79
19. Bresin, R., Friberg, A.: Emotional coloring of computer-controlled music performances. Computer Music Journal 24 (2000) 44–63