
BACKGROUND MUSIC RECOMMENDATION FOR VIDEO BASED ON MULTIMODAL LATENT SEMANTIC ANALYSIS

Fang-Fei Kuo1*, Man-Kwan Shan2, Suh-Yin Lee1
Department of Computer Science, National Chiao-Tung University, Hsinchu, Taiwan1
Department of Computer Science, National Chengchi University, Taipei, Taiwan2
{ffkuo, sylee}@cs.nctu.edu.tw1, [email protected]

ABSTRACT
Automatic video editing is receiving increasing attention as digital camera technology advances and social media sites such as YouTube and Flickr become popular. Background music selection is one of the key elements in making the generated video attractive. In this work, we propose a framework for background music recommendation based on multi-modal latent semantic analysis between video and music. Videos and their accompanying background music are collected from YouTube, and videos with low musicality are filtered out by a musicality detection algorithm. The co-occurrence relationships between audiovisual features are derived for multi-modal latent semantic analysis. Then, given a video, a ranked list of recommended music can be derived from the correlation model. In addition, we propose an algorithm for music beat and video shot alignment to calculate the alignability of the recommended music and the video. The final recommendation list is the combined result of both content correlation and alignability. Experiments show that the proposed method achieves promising results.

Index Terms— background music recommendation, content correlation, multi-modal latent semantic analysis

1. INTRODUCTION
Digital video and music have become pervasive in daily life, owing to the advancement of digital media technology. Video is widely used to record and capture the important moments of daily events such as weddings, graduation ceremonies and birthday parties. Moreover, with the increasing quantity of video content uploaded to social media sites such as YouTube and Flickr, users can share their personal videos freely and easily. However, a lengthy, unedited video is boring and uninteresting to watch, so there is an increasing need to make videos more attractive. In general, there are two ways to automatically edit a video to make it more interesting. One is video summarization, which aims to discover the important
___________________________
* The author is now at the Department of Electrical Engineering, University of Washington.

segments from the original video and generate a more concise but still informative summary. Much work has been done on automatic video summarization [5, 17]. The other is to accompany the video with background music. Music has long been considered an effective way to express emotions and feelings, and it plays an important role in TV series, commercials and films. Nevertheless, selecting good background music for a video requires professional skills in both video post-production and music. Furthermore, with the rapid growth of digital music, it is very hard and time-consuming for users to find suitable background music in a large music collection. Therefore, automatic background music recommendation is useful for helping users enrich their videos.

The research areas most relevant to background music recommendation are automatic video editing [1] and music video generation [2, 4, 10-13, 15]. However, most of the previous studies assumed the background music to be specified by users and aimed at generating music videos by selecting video clips suitable for accompanying the background music. In other words, users have to specify the background music, and the video is edited according to the given music. In this work, we address the problem of recommending appropriate background music for a user-specified video.

Nothing sets the mood for a video like music. Different styles of music can convey moods such as sadness, excitement, or encouragement. There are two approaches to the problem of background music recommendation: knowledge-driven and data-driven. The former extracts and encodes domain experts' knowledge to select suitable background music for the video, whereas the latter analyzes videos that are well accompanied with music and builds a recommendation model from them. Because the knowledge-driven approach relies on knowledge specified by humans, we explore the data-driven approach for recommending background music.

There are two main design issues that need to be addressed: 1) content correlation between the video and music streams and 2) temporal structure alignment between the video and music streams. For the first issue, a qualified data source is required for discovering the content correlation between video and music. We use films from social media sites, where the amount of

available videos is very large. However, the quality of user-generated videos varies extremely. To address this problem, we propose to use commercial films. The music in a commercial film is usually written or chosen by a composer according to the product image, story, and emotion of the advertisement. Therefore, commercial films are a good source for modeling video background music. Although music has a big influence on commercial films, there still exist commercial films without any background music or with only a small portion of music. In this work, we propose a video musicality detection method to address this issue.

Moreover, we need to find out how to decide on appropriate background music for a video. Is there any correlation between background music and video? More precisely, can we model the correlation between music features and video features from the training dataset? To discover the correlation between the music and video modalities, we adopt the multiple-type latent semantic analysis (M-LSA) algorithm to capture the underlying interrelated correlations. The discovered latent correlations are used for background music recommendation. Moreover, to recommend better background music for a video, the rhythmic structure needs to be considered. In this work, we propose an alignment algorithm to measure the alignability between video shots and music beats.

2. RELATED WORK
The research areas closest to background music recommendation are automatic video editing and music video generation. Hua et al. [1] proposed a system that extracts highlights from a set of home videos and aligns them with user-provided music based on the video and music content. Several works have addressed music video generation [2, 4, 13, 15]. In [2], temporal patterns were used for matching video and music clips. Liao et al. [4] learned the associations between music and video with a dual-wing harmonium model. In [13], the relationships among music, video, and emotion were learned from an emotion-annotated corpus of music videos, and video and music were matched by measuring their emotion distributions. Wu et al. [15] proposed a system to generate music videos from images. Wang et al. [11] proposed a multimodal approach to generate music sports video from broadcast sports video. Yu et al. [16] used emotion as a medium to generate soundtracks for outdoor videos.

3. PROPOSED SYSTEM
The architecture of the proposed background music recommendation system is shown in Figure 1. The training dataset is collected from commercial films on social media sites such as YouTube. Music plays an important role in commercial films; it can serve several functions in advertisements [3], such as entertainment, structure/continuity, memorability, lyrical language, targeting, and authority establishment. Moreover, music in an advertisement is

usually selected by a professional producer or written by a composer. However, commercial films that have no background music or contain only a small fraction of music are not suitable for learning the relationship between music and visual features. Consequently, the Musicality Detection module calculates the musicality of the commercial films and filters out those with low musicality. Then, music features and video features are extracted by the Audio/Visual Word Extraction module from the audio track and the video track of the commercial films, respectively, and the numerical features are converted into audio and visual words. The Content Correlation Modeling module captures the latent content relationships between the audio and visual words extracted from the commercial films. Given a video as a query, the extracted visual words are used to find audio words with high weights and to generate a ranked list of recommended background music from the music library. To obtain a better rhythmic effect, the temporal structure alignment between music beats and video scene changes is taken into consideration.

Fig. 1. Proposed framework of the video production system (modules: Training Data Collection, Musicality Detection, Audio/Visual Word Extraction, Content Correlation Modeling, Temporal Structure Alignment, Video Summarization, Recommendation, and Video Music Synchronization).

After the recommendation, the last stage is to generate the video with the accompanying music. Since the length of the recommended music may not fit the video, the music needs to be repeated or cut. Music structure analysis techniques, which find meaningful segments of music such as the verse and chorus, are useful for such repetition or shortening. Note that the goal of our work is to recommend a ranked list of background music for a video, so video music synchronization and audio/video mixing are not covered in this paper.

4. MUSICALITY DETECTION
Not all films collected from the social web are suitable for video background music recommendation. Films without any background music, or in which only a small fraction is accompanied by music, are not good sources for training. In this work, we propose a method to obtain high-musicality films. The musicality of a film is defined as the proportion of the number of music segments to that of non-music ones. We

transform the problem of film musicality detection into a music/speech classification problem. Five features widely used for audio classification and speech recognition are employed for musicality detection: Zero-Crossing Rate, Spectral Roll-off, Spectral Centroid, Spectral Flux and Mel-Frequency Cepstral Coefficients (MFCC). For the classification algorithm, we utilize a multi-dimensional Gaussian maximum a posteriori (MAP) estimator for music/speech classification. After applying the classifier, the audio track of a film is segmented into music and speech segments, and videos with low musicality are filtered out.

5. AUDIO/VISUAL WORD EXTRACTION
The audio-visual word representation of films is built in three steps. First, the video shot boundaries of each film are discovered by a shot boundary detection algorithm. Second, video features are extracted from the first frame of each detected shot, while audio features are extracted from the audio track of the film. Finally, each feature is quantized with its own codebook, and each quantized value is regarded as an audio/visual word.

In general, a video can be represented by a hierarchical structure of frames, shots and scenes. A shot is a video piece that consists of one continuous action, and a scene consists of one or more shots that form a semantic unit. Shots are commonly regarded as the basic unit for film photography and editing, so in this work the visual features are extracted at the shot level. The shot boundary detection algorithm used in our work is based on calculating the edge change fraction in the temporal domain [18]. The algorithm first detects edges in two consecutive frames and then calculates the edge change fraction; the peaks of the edge change fraction are regarded as shot boundaries.
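As an illustration of this shot boundary step, the sketch below computes a simplified edge change fraction with OpenCV and thresholds its peaks. The Canny thresholds, dilation radius and peak threshold are our assumptions, and the original algorithm of [18] additionally compensates for global motion.

import cv2
import numpy as np

# Sketch: simplified edge change fraction for shot boundary detection.
# Parameter values are illustrative assumptions, not taken from the paper.
def edge_change_fractions(video_path, dilate_radius=5):
    cap = cv2.VideoCapture(video_path)
    kernel = np.ones((dilate_radius, dilate_radius), np.uint8)
    prev_edges, fractions = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200) > 0
        if prev_edges is not None:
            # Dilate so that slightly displaced edges are not counted as changes.
            dil_prev = cv2.dilate(prev_edges.astype(np.uint8), kernel) > 0
            dil_curr = cv2.dilate(edges.astype(np.uint8), kernel) > 0
            entering = np.sum(edges & ~dil_prev) / max(np.sum(edges), 1)
            exiting = np.sum(prev_edges & ~dil_curr) / max(np.sum(prev_edges), 1)
            fractions.append(max(entering, exiting))
        prev_edges = edges
    cap.release()
    return np.array(fractions)

ecf = edge_change_fractions("commercial.mp4")      # hypothetical input file
shot_boundaries = np.where(ecf > 0.7)[0] + 1       # peaks -> frame indices of cuts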

5.1. Visual Features
A video can be characterized by color, light, texture and motion, and these factors strongly affect emotions. In this work, we extract visual features in these four categories.

Color Feature: People are often subconsciously influenced by color. Although color symbolism varies across cultures, in general cool colors tend to be calm, aloof and serene, whereas warm colors tend to be more aggressive, energetic and stimulating. The color descriptors used in our work are Color Energy and Saturation Proportion, proposed by Wang and Cheong [9]. Color Energy is calculated based on color contrast and the angular distances to blue and red, respectively. Saturation Proportion is based on the proportion of low-saturation pixels.

Light Feature: The light descriptors we adopt are Light Median and Shadow Proportion [9]. Light Median is the median value of brightness; because of its insensitivity to outliers, it is a good measure of the average lightness of the video. Shadow Proportion measures the proportion of dark or shadow area in the frame.

Texture Feature: Texture is an important feature for emotion classification of images and also plays an important role in filmmaking. Texture adds depth to the scene and is an important element of human visual perception. In this work, we extract one of the most widely used texture features, the Gray Level Co-occurrence Matrix (GLCM). The GLCM descriptors are listed in Table 1.

Motion Feature: Motion is a widely used visual feature in affective video content analysis. In film theory, motion is a highly expressive element that can elicit viewers' emotional responses. The motion descriptor Visual Excitement [9] is calculated based on the differences between 20x20 blocks of consecutive keyframes in CIELuv space. We use the average visual excitement over keyframes as the motion feature of the entire video.

Table 1. Descriptors used in background music recommendation.

Modality | Type            | Name (abbr.)                 | Number of Dimensions
---------|-----------------|------------------------------|---------------------
Video    | Color           | Color Energy (CE)            | 1
         |                 | Saturation Proportion (SAP)  | 1
         | Texture         | Angular Second Moment (ASM)  | 1
         |                 | Contrast (CON)               | 1
         |                 | Correlation (COR)            | 1
         |                 | Dissimilarity (DIS)          | 1
         |                 | Entropy (ENT)                | 1
         |                 | Homogeneity (HOM)            | 1
         |                 | GLCM Mean (MEA)              | 1
         |                 | GLCM Variance (VAR)          | 1
         | Light           | Light Median (LM)            | 1
         |                 | Shadow Proportion (SHP)      | 1
         | Motion          | Visual Excitement (VE)       | 1
Music    | Rhythm          | Beat Histogram (BH)          | 8
         | Timbral Texture | MFCC (MFCC)                  | 52
         |                 | Spectral Centroid (SC)       | 4
         |                 | Spectral Rolloff (SR)        | 4
         |                 | Spectral Flux (SF)           | 4
         | High Level      | Danceability (DAN)           | 1
         |                 | Duration (DUR)               | 1
         |                 | Energy (ENE)                 | 1
         |                 | Key (KEY)                    | 1
         |                 | Loudness (LOU)               | 1
         |                 | Mode (MOD)                   | 1
         |                 | Tempo (TEM)                  | 1
         |                 | Time Signature (TS)          | 1
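To make the GLCM entries of Table 1 concrete, the following sketch (assuming scikit-image and OpenCV; the distance, angle and gray-level quantization are our assumptions) computes the eight texture descriptors for a single keyframe:

import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops

# Sketch: GLCM texture descriptors of Table 1 for one grayscale keyframe.
def glcm_descriptors(gray_frame, levels=32):
    # Quantize to a small number of gray levels to keep the matrix compact.
    q = (gray_frame.astype(np.float32) / 256.0 * levels).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0], levels=levels,
                        symmetric=True, normed=True)
    feats = {name: graycoprops(glcm, name)[0, 0]
             for name in ("ASM", "contrast", "correlation",
                          "dissimilarity", "homogeneity")}
    p = glcm[:, :, 0, 0]                      # normalized co-occurrence matrix
    i, _ = np.indices(p.shape)
    feats["entropy"] = -np.sum(p * np.log2(p + 1e-12))
    feats["glcm_mean"] = np.sum(i * p)
    feats["glcm_variance"] = np.sum((i - feats["glcm_mean"]) ** 2 * p)
    return feats

keyframe = cv2.imread("keyframe.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file
print(glcm_descriptors(keyframe))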

5.2. Audio Features
Sound plays an essential role in commercial films, and audience emotion is highly influenced by sound effects and music. The famous director Akira Kurosawa said, "Cinematic sound is that which does not simply add to, but multiplies, two or three times, the effect of the image." In this work, the music features we use are as follows:

Rhythm Feature: The rhythm descriptor used in this work is the beat histogram proposed in [8], which is built from the autocorrelation function of the signal.

Timbral Texture Feature: The timbral texture descriptors we use include MFCC, Spectral Centroid, Spectral Rolloff and Spectral Flux, which were already described in Section 4.
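As a rough illustration of these rhythm and timbral descriptors, the sketch below uses librosa (our choice; the paper does not name its extraction toolkit) to compute MFCC, spectral centroid/rolloff/flux and a crude 8-bin beat histogram from the autocorrelation of the onset strength envelope. Frame sizes and the pooling into means are assumptions.

import librosa
import numpy as np

# Sketch: low-level audio descriptors (rhythm and timbral texture).
y, sr = librosa.load("track.mp3", sr=22050, mono=True)        # hypothetical file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # timbral texture
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)

S = np.abs(librosa.stft(y))
flux = np.sqrt(np.sum(np.diff(S, axis=1).clip(min=0) ** 2, axis=0))  # spectral flux

# Crude beat histogram: autocorrelation of the onset strength envelope,
# folded into 8 lag bins (Table 1 reports an 8-dimensional descriptor).
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
ac = librosa.autocorrelate(onset_env)
beat_hist, _ = np.histogram(np.arange(len(ac)), bins=8, weights=ac)

features = np.hstack([beat_hist, mfcc.mean(axis=1), mfcc.std(axis=1),
                      centroid.mean(), rolloff.mean(), flux.mean()])
print(features.shape)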

High-Level Feature: Besides the low-level audio features described above, we also use the Echo Nest analyze API (http://developer.echonest.com/docs/v4/track.html) to extract high-level features. Echo Nest is a web service that provides a music analysis API for extracting high-level features from songs. The high-level descriptors used in this work are Danceability, Duration, Energy, Key, Loudness, Mode, Tempo and Time Signature.

5.3. Audio and Visual Word
To make the numerical features more suitable for latent semantic analysis of affective concepts, we generate audio/visual words by discretizing the audio and visual features. There are several ways to discretize numerical features, such as k-means clustering, equal-width binning and equal-frequency binning. In this work, we adopt equal-frequency binning for audiovisual word generation.

6. MODELING AND RECOMMENDATION
6.1. Content Correlation Modeling
To discover the latent correlations between background music and video, we employ and modify Multiple-type Latent Semantic Analysis (M-LSA) [14]. While LSA was developed to exploit the co-occurrence relations between two types of entities, M-LSA exploits the relations among multiple types of entities. In our work, the correlations among all types of audio words and visual words are captured based on the latent semantic space constructed by M-LSA.
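The audio and visual words produced by the equal-frequency binning of Section 5.3 are the entities modeled by M-LSA. A minimal sketch of that binning step (the number of bins is our assumption; the paper does not specify it here):

import numpy as np

# Sketch: equal-frequency binning of one numerical feature into "words".
def equal_frequency_edges(values, n_bins=10):
    # Bin edges at quantiles, so each bin holds roughly the same number of films.
    return np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])

def to_word(value, edges, prefix="CE"):
    # Bin(.) in Section 6.1: map a feature value to its bin index / word id.
    return f"{prefix}_{np.searchsorted(edges, value)}"

color_energy = np.random.rand(723)                 # placeholder feature values
edges = equal_frequency_edges(color_energy)
print(to_word(color_energy[0], edges))             # e.g. "CE_3"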

Fig. 2. Multiple-type graph and the unified matrix U.

Given N types of entities {X1, X2, ..., XN} and the pairwise co-occurrence relationships among them, M-LSA represents the entities and correlations by a multiple-type graph G(V, E), an undirected graph in which each vertex corresponds to one type of entities. There exists an edge eij ∈ E connecting the i-th and j-th vertices if Xi and Xj have a co-occurrence relationship. All the pairwise co-occurrence information is represented by a unified matrix U, as shown on the right of Figure 2. U is composed of N×N correlation matrices, where each correlation matrix Mij, 1 ≤ i, j ≤ N, is a |Xi|×|Xj| matrix in which |Xi| (|Xj|) is the number of entities of type i (j), and Mij is an identity matrix if i = j. In our work, each dimension of an audiovisual feature is regarded as one type of entity, giving a total of 93 types of interrelated entities (Table 1). The left of Figure 2 shows the graph, where the vertex F is the film object and all the other vertices


are the audiovisual features. Each audiovisual feature has an edge connected to F, which means there is a direct relationship between the film objects and each feature. To exploit the indirect relationship between a pair of features, for example, the color energy (CE) and the danceability (DAN), the correlation matrix between them is defined as

MCE_DAN = MCE_F · MF_DAN,

where F is the film object. The unified matrix U, shown on the right of Figure 2, is produced to capture all the correlation information of these entities. Moreover, an importance factor αij is associated with each matrix Mij to denote its relative importance.

The basic idea of M-LSA is to capture the most significant concepts hidden in the interrelated entities. The significance of concepts can be identified based on the mutual reinforcement principle, under which the significance of each type of entities Xi, denoted by the significance vector ri, can be expressed as

ri = Σj αij · Mij · rj,

which can be rewritten with the unified matrix U as r = U·r. It can be observed that r converges to the eigenvector of the unified matrix U. The significance vector r can be regarded as the latent semantics behind the correlations. The top k eigenvalues of U correspond to the k most salient concepts. The top k eigenvalues and the corresponding eigenvectors [e1, e2, ..., ek] span a k-dimensional latent space, which can be represented as a matrix Uk = [λ1·e1, λ2·e2, …, λk·ek], where each row of Uk represents an entity.

Moreover, in the audiovisual word extraction step, it is possible that two similar feature values are discretized into different audiovisual words. To remedy this effect, two mechanisms are adopted for correlation matrix generation: non-sliding window discretization (NSWD) and sliding window discretization (SWD). Without loss of generality, let the entity be the color energy (CE). For non-sliding window discretization, the element mwf of the correlation matrix MCE_F, for the w-th audiovisual word and the film f, is calculated as

mwf = 1 if w = Bin(vf), and 0 otherwise,

where Bin(·) is the binning function and vf is the value of the CE feature of film f. For sliding window discretization, the element of the correlation matrix MCE_F is calculated as

mwf = 1 if w = Bin(vf), α if |w − Bin(vf)| = 1, and 0 otherwise,

where 0 < α < 1 is a degrade factor of correlation within the sliding window.
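To make the construction above concrete, the sketch below (assuming numpy, dense matrices, uniform importance factors, and treating the film object only as the linking entity, all of which are simplifications) assembles the unified matrix U from the per-feature word-film matrices and spans the k-dimensional latent space Uk:

import numpy as np

# Sketch: unified matrix U of M-LSA and its top-k latent space.
# M_F[i] is the (#words of feature type i) x (#films) matrix built with NSWD/SWD.
def build_unified_matrix(M_F):
    n = len(M_F)
    blocks = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                blocks[i][j] = np.eye(M_F[i].shape[0])   # M_ii is the identity
            else:
                blocks[i][j] = M_F[i] @ M_F[j].T         # indirect relation via films
    return np.block(blocks)

def latent_space(U, k):
    vals, vecs = np.linalg.eigh(U)                       # U is symmetric here
    top = np.argsort(vals)[::-1][:k]
    return vecs[:, top] * vals[top]                      # Uk = [λ1·e1, ..., λk·ek]

# Toy example: 3 feature types, 5 films, random binary word-film matrices.
rng = np.random.default_rng(0)
M_F = [rng.integers(0, 2, size=(d, 5)).astype(float) for d in (4, 6, 3)]
Uk = latent_space(build_unified_matrix(M_F), k=2)
print(Uk.shape)                                          # one row per audiovisual word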

6.2. Temporal Structure Synchronization
In the previous section, we considered the semantic consistency between the background music and the query video. In basic video production, many videos can be made more interesting just by adding background music that is semantically consistent with the video; however, videos become even more exciting by synchronizing the beats and rhythms of the background music with the actions or scene changes in the video. Consequently, to make a more attractive video, the synchronization of the temporal structure should be considered. In our work, we propose a temporal structure alignment algorithm to measure how concordant and harmonious the rhythms of the video and the background music are. The presented alignment algorithm is a dynamic programming approach that aligns the music beats with the video shots; if the timing of the music beats and the video shots can be well aligned, the alignment score is low. Given the music beat sequence (b1, b2, ..., bx) of a music track t and the video shot boundary sequence (s1, s2, ..., sy) of a query q, the alignment score c(i, j) between the music beat sequence (b1, ..., bi) and the video shot boundary sequence (s1, ..., sj) is defined as

c(i, j) = min{ c(i−1, j−1) + |bi − sj|, c(i−1, j) + pm, c(i, j−1) + pv },

where pm and pv are the penalties for skipping a music beat and a video shot boundary, respectively. The first row and the first column are initialized as c(i, 0) = i·pm and c(0, j) = j·pv. The alignment score of a music track t is therefore defined as scorealign(t) = c(x, y).

6.3. Recommendation
Given a query video, the goal of this work is to find a ranked list of music that is suitable as background music for the query video. The query video is first converted into a query vector q, obtained by visual word extraction. It is then projected onto the reduced space as qc = q×Uk. After the projection, qc is compared with each row of the audio features in the semantic space by the cosine measure. The similarity between an audio feature a and the query vector q is

sim(a, q) = (ac · qc) / (|ac| |qc|),

where ac is the projection of the feature vector a. As a result, the top n relevant music tracks are found by ranking the score

scorecont(t) = Σa∈A ma_t · sim(a, q),

where A is the set of audio features, t is a candidate music track in the music library, and ma_t is 1 if t has the audio word a and 0 otherwise.
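A minimal sketch of the alignment score of Section 6.2, taking beat and shot-boundary times in seconds (the penalty values below are illustrative, not values from the paper):

# Sketch: dynamic-programming alignment score between music beats and shot boundaries.
def alignment_score(beats, shots, pm=1.0, pv=1.0):
    x, y = len(beats), len(shots)
    c = [[0.0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        c[i][0] = i * pm                     # i music beats skipped, no shots consumed
    for j in range(1, y + 1):
        c[0][j] = j * pv                     # j shot boundaries skipped, no beats consumed
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            c[i][j] = min(c[i - 1][j - 1] + abs(beats[i - 1] - shots[j - 1]),
                          c[i - 1][j] + pm,  # skip a music beat
                          c[i][j - 1] + pv)  # skip a video shot boundary
    return c[x][y]                           # score_align(t): lower means better fit

print(alignment_score([0.5, 1.0, 1.5, 2.0], [0.52, 1.48, 2.05]))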

By taking the temporal alignment score into consideration, the final recommendation result is ranked by

rank(t) = γ·rankalign(t) + (1−γ)·rankcont(t),

where γ is the alignment ratio.

7. PERFORMANCE EVALUATION
To evaluate the effectiveness of the proposed home video background music recommendation approach, we performed experiments on an advertising film collection downloaded from YouTube. We collected 1278 commercial films using the YouTube Data API (https://developers.google.com/youtube/2.0/developers_guide_protocol_api_query_parameters) with the following query terms: "commercial", "CF", "advertisement", "advertising", "TVC" and "Ad". The commercial films with low musicality were discarded from the dataset, so the final dataset contains 723 commercial films. A set of music was also collected to form the background music candidate set, which includes soundtracks of films and TV dramas, albums of music used in advertising, and instrumental music. The music is in mp3 or wav format and converted to 22050 Hz, 16-bit, mono audio files. The total number of music tracks is 999.

We use ten-fold cross-validation to evaluate the performance. The commercial film dataset is divided into 10 disjoint subsets of equal size; in each iteration, one of the subsets is used for testing and the others are used for model training. The average dimension of the unified matrix U is 4130. For each test video, the extracted video features and visual words are used to form the query vector. The audio track extracted from the test video is regarded as the ground truth and is ranked along with the background music candidate set. The performance measure is defined as

accuracy = 1 − (rankg − 1) / (|C| − 1),

where rankg is the rank of the ground truth g and |C| is the total number of music tracks in the candidate set. The range of accuracy is from 0 to 1, where 1 means the top-one result is the ground truth.
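A small sketch of the accuracy measure just defined, together with the rank fusion of Section 6.3 (the score dictionaries are placeholders standing in for scorecont and scorealign):

# Sketch: fusing content and alignment ranks and computing the evaluation accuracy.
def ranks_from_scores(scores, descending=True):
    # Convert {track_id: score} into {track_id: rank}, best rank = 1.
    order = sorted(scores, key=scores.get, reverse=descending)
    return {t: r + 1 for r, t in enumerate(order)}

def fused_ranking(score_cont, score_align, gamma=0.3):
    r_cont = ranks_from_scores(score_cont, descending=True)    # higher score is better
    r_align = ranks_from_scores(score_align, descending=False) # lower score is better
    fused = {t: gamma * r_align[t] + (1 - gamma) * r_cont[t] for t in r_cont}
    return sorted(fused, key=fused.get)                        # final ranked list

def accuracy(rank_of_ground_truth, num_candidates):
    # accuracy = 1 - (rank_g - 1) / (|C| - 1); 1.0 means the ground truth is top-1.
    return 1.0 - (rank_of_ground_truth - 1) / (num_candidates - 1)

score_cont = {"t1": 0.9, "t2": 0.4, "t3": 0.7}
score_align = {"t1": 2.0, "t2": 0.5, "t3": 1.0}
ranking = fused_ranking(score_cont, score_align)
print(ranking, accuracy(ranking.index("t1") + 1, len(ranking)))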


To compare the proposed sliding window discretization (SWD) with non-sliding window discretization (NSWD) and to evaluate the influence of the number of eigenvalues k, we performed experiments with k ranging from 50 to 500 for both SWD and NSWD, where the degrade factor α of correlation in SWD is set to 0.5. Figure 3 shows the experimental results. As shown in Figure 3, the proposed SWD significantly outperforms NSWD except for k ≤ 100, which may be because the dimensionality is reduced too much and too much information is lost. SWD achieves the highest accuracy (0.85) with k = 300, while the accuracy of NSWD does not change much with different k.

For SWD, the comparison of different correlation degrade factors α with k = 300 is shown in Figure 4 (left). The accuracy drops significantly when α is less than 0.4, which matches the expectation that when α is small, the sliding window discretization has only a minor effect on the performance improvement. Meanwhile, the accuracy is similar when α is larger than 0.4, and the highest accuracy is achieved when α = 0.5. The experimental result for the temporal structure synchronization is shown in Figure 4 (right). The performance with an alignment ratio γ from 0.1 to 0.4 is slightly better than with γ = 0, which means that considering the temporal structure is helpful for the recommendation. However, the accuracy drops when γ is larger than 0.5. This meets our expectation that the semantic correlation between music and video is more important than the temporal structure.

Fig. 3. SWD and NSWD with various numbers of eigenvalues (accuracy vs. number of eigenvalues k, from 50 to 500).

Fig. 4. Comparison of different degrade factors α in SWD (left) and recommendation with beat/shot alignment under varying alignment ratio (right); both plots report accuracy.

8. CONCLUSION
In this paper, we investigated automatic background music recommendation for a given home video based on multi-modal latent semantic analysis of video and background music features. We exploited thirteen visual features from four classes (color, texture, light and motion) and thirteen music features from three classes (rhythm, timbral texture and high-level features). The co-occurrence relationships between features are derived from the high-musicality video dataset and used for multi-modal latent semantic analysis. We also proposed an algorithm for music beat and video shot alignment to calculate the alignability of the recommended music and the video. The final recommendation list is the combined result of both content correlation and alignability. Experiments show that the proposed method achieves promising results.

9. REFERENCES
[1] X.-S. Hua, L. Lu, and H.-J. Zhang, "AVE - Automated Home Video Editing," in ACM Intl. Conf. on Multimedia, 2003.
[2] X.-S. Hua, L. Lu, and H.-J. Zhang, "Automatic music video generation based on temporal pattern analysis," in ACM Intl. Conf. on Multimedia, 2004.
[3] D. Huron, "Music in Advertising: An Analytic Paradigm," The Musical Quarterly, 73(4), pp. 557-574, 1989.
[4] C. Liao, P. P. Wang, and Y. Zhang, "Mining Association Patterns between Music and Video Clips in Professional MTV," in Intl. Multimedia Modeling Conf. on Advances in Multimedia Modeling, 2009.
[5] R. W. Lienhart, "Dynamic video summarization of home video," in Proc. SPIE 3972, Storage and Retrieval for Media Databases, 1999.
[6] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1997.
[7] G. Tzanetakis and P. Cook, "MARSYAS: A framework for audio analysis," Organised Sound, 4(3), pp. 169-175, 2000.
[8] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, 10(5), pp. 293-302, 2002.
[9] H. L. Wang and L.-F. Cheong, "Affective understanding in film," IEEE Transactions on Circuits and Systems for Video Technology, 16(6), pp. 689-704, 2006.
[10] J. Wang, E. Chng, and C. Xu, "Fully and Semi-Automatic Music Sports Video Composition," in IEEE Intl. Conf. on Multimedia and Expo, 2006.
[11] J. Wang, E. Chng, C. Xu, L. Hanqing, and Q. Tian, "Generation of Personalized Music Sports Video Using Multimodal Cues," IEEE Transactions on Multimedia, 9(3), pp. 576-588, 2007.
[12] J. Wang, C. Xu, E. Chng, L. Duan, K. Wan, and Q. Tian, "Automatic generation of personalized music sports video," in ACM Intl. Conf. on Multimedia, 2005.
[13] J.-C. Wang, Y.-H. Yang, I. H. Jhuo, Y.-Y. Lin, and H.-M. Wang, "The acousticvisual emotion Gaussians model for automatic generation of music video," in ACM Intl. Conf. on Multimedia, 2012.
[14] X. Wang, J.-T. Sun, Z. Chen, and C. Zhai, "Latent semantic analysis for multiple-type interrelated data objects," in Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2006.
[15] X. Wu, B. Xu, Y. Qiao, and X. Tang, "Automatic music video generation: cross matching of music and image," in ACM Intl. Conf. on Multimedia, 2012.
[16] Y. Yu, Z. Shen, and R. Zimmermann, "Automatic music soundtrack generation for outdoor videos from contextual sensor information," in ACM Intl. Conf. on Multimedia, 2012.
[17] Y.-F. Ma, X.-S. Hua, L. Lu, and H.-J. Zhang, "A generic framework of user attention model and its application in video summarization," IEEE Transactions on Multimedia, 7(5), pp. 907-919, 2005.
[18] R. Zabih, J. Miller, and K. Mai, "A feature-based algorithm for detecting and classifying scene breaks," in ACM Intl. Conf. on Multimedia, San Francisco, 1995.