A Model for Valence Using a Color Component in Affective Video Content Analysis
Iwan de Kok
[email protected]
ABSTRACT
This paper proposes a model that detects differences in the valence level of the content of video data. The valence level is the degree of pleasantness of a scene. Detection is done by analyzing the colors of the frames and linking these colors to emotional states. The model is derived from a literature study and evaluated by analyzing video sequences in which color is used to enhance the emotional feelings aroused in the viewer. The evaluation shows that the model gives a fairly accurate prediction of very pleasant and very unpleasant scenes.
Keywords
Affective video content analysis, video content modeling, color emotion
1. INTRODUCTION
Nowadays more and more digital video content is becoming available. TV shows can be viewed on the internet after broadcast, movies are digitized for DVD, and even online movie rental is available. As with all things, when you have a lot of it, finding the specific video you need becomes much harder unless you organize it in some sophisticated way. But how do you organize video content? You can watch every video clip yourself and attach keywords to it, but that involves a lot of work. It would be much easier if a computer could do this for you. This is exactly what research in video content analysis is trying to achieve: it tries to link the low-level features of a video stream to the semantic meaning the video stream is trying to express. In this field there are two types of video content analysis:
• Cognitive video content analysis
• Affective video content analysis
Cognitive video content analysis focuses on analyzing the facts of the video and labeling the video accordingly. Such algorithms can, for instance, recognize whether a video clip is an excerpt of a movie or of a soccer match, or more specifically recognize faces or city skylines. Even more detailed descriptions of a scene can be given, in terms like “a dialog between person A and person B”, “a car chase involving 5 cars of different types” or “a goal is scored by number 9”. Over the past years there has been more research in this field of video content analysis than in its affective counterpart. Overviews of this field can be found in [LSG01] and [Del99].
Affective video content analysis, on the other hand, focuses on analyzing the moods of the content. It adds an emotional classification to the description, like “dramatic development”, “romantic scene” or “soccer highlight”. Note that this emotional classification is mostly not the emotion the viewer actually feels, but the expected emotion: every human being reacts differently to stimuli. One person might react with tears of joy when the two protagonists finally fall into each other's arms kissing, while another might be put off by the overstated sweetness and corniness of the scene; likewise, some people laugh during scary scenes in a horror movie.
In this paper a model is proposed which labels video clips with an emotional curve computed by analyzing the color characteristics of the imagery. This curve depicts the changes in the level of pleasantness throughout the video.
In section 2 of this paper further background information is provided to support the model, followed by section 3, in which the research questions of the paper are explained, as well as the methods used to answer them. Next, in section 4, I elaborate on the model developed in this research. After that, section 5 presents the evaluation of the model and discusses the results. The last section, section 6, presents the conclusion, together with recommendations for future research.
2. BACKGROUND
Hanjalic and Xu [HX01] were among the first researchers to address the topic of affective video content analysis. They argued the need for this type of research as follows. To ensure the success of consumer-oriented multimedia databases, the development of such systems should meet the demands of the user. Since the average user becomes more and more technologically aware, their demands on such systems increase. In the case of video storage systems this could mean that users expect personalized video delivery. The content of the video is of course a good criterion for personalization and is available nowadays, but as always the user constantly wants more. He might want a very funny movie, and based on content alone a video storage system won't be able to meet that demand. When affective labels are available, the system can deliver the video the user asks for. Another application where affective video content analysis may be helpful is the automatic generation of movie trailers, or of a summary of a sports match, by extracting the highlights of the original movie or sports event.
Before elaborating further on the results of their research, I should first clarify the term affect. The definition of the word is “a psychological term for an emotion or subjectively experienced feeling”. In the field of psychology there are several models and theories regarding emotions. One of them, and the one Hanjalic and Xu used in their research ([HX01] and [HX05]), is the three-dimensional approach described by Russell and Mehrabian [RM77]. According to this research, affect has the following three basic dimensions:
• Valence (Pleasure)
• Arousal
• Control (Dominance)
Valence characterizes the type of emotion felt. It ranges from pleasant to unpleasant; in some papers this dimension is called pleasure. The second dimension, arousal, characterizes the intensity of the emotion and ranges from calm to excited. The final dimension, control or dominance, distinguishes emotional states with similar valence and arousal levels, ranging from no control to full control. Using these dimensions, all emotions can be placed in the 3-dimensional space shown in Figure 1. For instance, joy has a high valence level, a high arousal level and a low control level, while fear has a low valence level, a high arousal level and low control. Not every combination of these dimensions is possible; e.g. there is no emotion with a neutral valence level and a high arousal level, hence the gray area in Figure 1.
Figure 1. Illustration of the 3-D emotion space (from Dietz and Lang [DL99])
Now, going back to the research of Hanjalic and Xu: they found connections in the literature between some low-level features of video data streams and dimensions of the emotion space, and made algorithmic models of them. The relations they found for the arousal dimension were a motion component, a rhythm component and a sound energy component. The motion component measures the amount of movement in the imagery: the more movement in the video, the more exciting that part is. The rhythm component is obtained by measuring the shot lengths along the video: the shorter the shots, the more arousing that part is. Finally, the total energy of the sound track is measured: again, the more energy in the sound track, the more arousing that part of the video is. They used only one component for the valence part of the model, namely the pitch-average component. A high pitch average indicates a positive valence level and a low pitch average indicates a negative valence level. One component to calculate the valence level isn't very reliable, so that leaves room for improvement.
Other researchers have also investigated affective video content. Color is one of the low-level features not investigated by Hanjalic and Xu. Wei et al. [WDC04] made two types of color histograms of each frame of a movie: the first was simply a histogram of the colors in the frame, the other a histogram of the transitions between frames. They fed these histograms in different ways to a machine learning algorithm to analyze them for mood classification. They used the relations
found by [Mah96] between colors and moods as the basis for their analysis of color in relation to mood-tones. They analyzed only the hue of the color for their system. With this system they could classify whole movies with prevailing mood-types at an accuracy between 78% and 85%. Kang [Kan03] also included the brightness and saturation levels of the color in his research, as well as motion and the shot-cut rate. He used Hidden Markov Models, which are probabilistic models for time series data, to model these features and map them to an emotional classification, in which a sequence is labelled as fear, sadness, joy or normal. Just like the model of Wei et al., this model is parameterized by means of machine learning. These emotional classifications can be detected with an accuracy between 76% and 81%. Taking these results into account, a color component seems like a good addition to the model of Hanjalic and Xu. Instead of labelling scenes or whole movies with emotions, as [WDC04] and [Kan03] do with their models, this model will produce a valence time curve. This curve also clearly illustrates the way the emotions change during the video.
The fact that color is used to influence the viewer's emotional state is also supported by film theory [Zet99], [Smi03]. It states that color perception and lighting are important contributors to the mood tones filmmakers would like to add to their scenes. How this works in practice is nicely illustrated by the following quote from Sven Nykvist, a famous and celebrated Swedish cinematographer.
“We felt that color raw stock was technically too perfect; it was difficult not to make it look beautiful. When we came to shoot The Passion of Anna, we wanted to control the color palette so it would reflect the mood of the story. We managed to find a location with very little color, and everything in the film's design was intended to simplify the colors in front of the camera. I wanted to avoid warm skin tones, and after exhaustive testing, I learned to use very little make-up and to drain some of the color in the labs. (I used the same technique on The Sacrifice for Tarkovsky where we took out the red and the blue, giving us a look that was neither black-and-white nor colour – it was ‘mono colour’.) Cries and Whispers marked another important step in my appreciation of how to use colored light for dramatic effect. We developed a color scheme for the interiors based on the color red – every room was a different shade of red. Audiences watching the film might not realise it consciously, but they feel it.” [Ett98]
3. RESEARCH
The main research question this paper proposes an answer to is the following:
• What model, comparable to that of Hanjalic and Xu [HX05], can be found to discover the usage of color to achieve or enhance emotional effect in video data streams by automatic, computational means?
To ensure the model makes sense, a couple of other basic questions need to be answered first: whether color can be used in video data streams to achieve or enhance emotional effect, and whether this is actually done in films. I have shown above that this effect is indeed used in films, e.g. by filtering colors in post-production, choosing specific locations and adjusting the make-up. Wei et al. [WDC04] and Kang [Kan03] have also shown that it is possible to discover the usage of color to this end by automatic, computational means.
To be able to make a model I need to find the relation between a low-level feature of the video data stream, in this case color, and the emotion intended by that color. Through literature research I found some papers regarding this issue. Before going into them in detail I will first give a short explanation of the color spaces used in this research, since there are several of them, each with its own characteristics. Probably the most widely known color space is the Red, Green, Blue color space (RGB). This color space is used in the digital world to store and display colors; for instance, the screen of a television set works this way, which can be clearly seen by looking very closely at the screen. Since the researched relations between colors and emotions aren't based on the amount of red, green or blue in a color, this color space isn't suited for the model.
Therefore I chose to use the HSV color space. HSV stands for Hue, Saturation and Value and is commonly used in graphics applications. This color space can be represented by the cone in Figure 2. The hue dimension determines the wavelength of the color and ranges from 0° to 360°; in other words, it determines whether the color is blue, yellow, green or any other color of the rainbow. Saturation determines the vibrancy of the color and ranges from 0% to 100%. With a saturation of 0% the color becomes true white, gray or black, depending on the value level, which determines the brightness of the color and ranges from 0% to 100%.
Figure 2. The HSV color space
Valdez and Mehrabian [VM94] researched how each of the characteristics of the HSV color space relates to the three dimensions of the VAC emotion space. They discovered that the saturation and brightness characteristics had strong and consistent effects on all three dimensions. In the case of the valence dimension, both saturation and brightness had a positive effect on the valence level. In other words: the brighter and more saturated the color, the more positive the perceived emotion. These results are as expected, and the mathematical equation they discovered is suitable for implementation in the model. They also give an equation for the relation between the hue characteristic and the valence level. Remarkable about this relation is the fact that yellow is the most negative color, as shown in Figure 3. This was counterintuitive to my own perception, and when I dug deeper I found literature contradicting these results. In a study by Kaya and Epps [KE04], in which college students were asked about their emotional response to 13 different colors, 93.6% associated positive emotions with the color yellow. Similar results are reported in [Mah96] and [Hår04]. Since the literature study doesn't provide a solid answer on the relation between the hue of a color and the expressed emotion, another way of developing a model for this relation needs to be applied. I intended to use machine learning like Kang [Kan03] and Wei et al. [WDC04], but due to lack of time I never came to actually implementing this, so those results, and thus a hue model, aren't included in this paper.
Figure 3. Actual and predicted valence levels as functions of hue [VM94]
4. MODEL
Since the model should be able to serve as an extension of the model proposed by Hanjalic and Xu [HX05], it should satisfy the same criteria they introduced. As valence is a psychological category, the model needs to be psychologically justifiable. To achieve this they introduced the following three criteria:
• Comparability
• Compatibility
• Smoothness
Comparability ensures that the valence levels calculated from the brightness and saturation of different video data streams aren't influenced by differences in brightness and saturation levels introduced when converting the data stream from the original source to DVD or another medium. Compatibility was introduced as a criterion to ensure that the resulting valence and arousal levels of the same moments in the data stream don't conflict with the 3-D emotion space; e.g. a high arousal result combined with a neutral valence result would imply a highly exciting but neutral feeling, and no such emotion exists. Since I only focus on a valence model, this criterion doesn't apply to my model. The third and last criterion is smoothness. In real life you don't instantly forget the images you just saw, and the emotions they arouse linger for a while. The model should provide some way to imitate this behavior.
To calculate the valence level based on the brightness and the saturation of the colors, we must compute the average brightness b(k) and saturation s(k) of each frame k. Before we can do this we must meet the comparability requirement. Using the equation found in Valdez' and Mehrabian's research [VM94], we get the following formula for c(k):
c(k) = 0.69 b(k) + 0.22 s(k)    (1)
As can be seen in Figure 5, the brightness and saturation time curve obtained this way clearly doesn't meet the smoothness criterion: while watching the fragment you won't experience such abrupt emotional changes. The fluctuations are caused by scene changes, which result in great variations in brightness and saturation levels. To smooth them out, c(k) is convolved with a sufficiently long smoothing window. Like Hanjalic and Xu [HX05] I use a Kaiser window K(l, β) of length l with shape parameter β for this purpose. The values chosen for l and β are 750 and 5 respectively, the same values Hanjalic and Xu used for their smoothing. This window is shown in Figure 4. After convolving c(k) with the Kaiser window, the resulting brightness and saturation time curve looks like the one depicted in Figure 6. It shows a much more natural development of the viewer's emotional response.
The scene represented by this time curve is taken from the CGI-animated comedy Valiant. In this scene the protagonist pigeon Valiant and his new buddies are taken to the training facility to become homing pigeons carrying messages during World War II. During the first three quarters of the scene they meet their drill instructor for the first time and joke around. The mood is very light, which shows in the figures. After this we cut to another homing pigeon that has been taken captive by German falcons at the front. The mood in this scene is much creepier, since the pigeon is tortured by being forced to listen to yodel music. It takes time before the viewer fully acquires this new emotional state, so the changes are a little delayed in the second time curve.
The smoothed time curve C(k) is the end result of the model. It is defined as follows:
C(k) = ( max_k c(k) / max_k c̄(k) ) · c̄(k)    (2)
Here, c̄(k) is the result of the convolution of the curve c(k) with the smoothing window. In formulaic form:
c̄(k) = c(k) ∗ K(l, β)    (3)
In (2) the results of the convolution are scaled in order to bring the values back inside the original value range (−100% to 100%).
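To make the computation concrete, the following is a minimal sketch of equations (1)-(3) in Python, assuming NumPy; the function name and the use of numpy.kaiser and numpy.convolve are my own choices, not part of the original implementation.

import numpy as np

def valence_curve(brightness, saturation, l=750, beta=5.0):
    # brightness and saturation: per-frame averages b(k) and s(k).
    # Assumes the clip is longer than the smoothing window of l frames.
    b = np.asarray(brightness, dtype=float)
    s = np.asarray(saturation, dtype=float)
    c = 0.69 * b + 0.22 * s                                   # eq. (1)
    c_bar = np.convolve(c, np.kaiser(l, beta), mode="same")   # eq. (3)
    return c.max() / c_bar.max() * c_bar                      # eq. (2)

Note that the rescaling in the last line makes an explicit normalization of the window unnecessary: the division by the maximum of c̄(k) already brings the values back to the original range, exactly as (2) intends.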
Figure 4. Kaiser Window with length 750 and shape parameter 5
Figure 5. Brightness and saturation time curve without smoothing
Figure 6. Brightness and saturation time curve after smoothing
5. EVALUATION
To evaluate my model I implemented it and analyzed some video sequences. The implemented algorithm accesses a video data stream frame by frame. In order to calculate the average brightness and saturation values, the colors need to be converted from the RGB color space to the HSV color space. This conversion works as follows. First, each of the color values of the RGB code, which usually range from 0 to 255 on a computer, must be scaled to the range 0.0 to 1.0. Let MAX be the maximum of the (R, G, B) values and MIN the minimum of those values. The resulting (H, S, V) values can then be obtained with the following functions:
H = 60 · (G − B) / (MAX − MIN) + 0,    if MAX = R
H = 60 · (B − R) / (MAX − MIN) + 120,  if MAX = G
H = 60 · (R − G) / (MAX − MIN) + 240,  if MAX = B
S = (MAX − MIN) / MAX
V = MAX
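In Python these formulas translate into the following sketch. The handling of the achromatic case MAX = MIN (where the hue is undefined) and of MAX = 0 is my own addition, since the formulas above would divide by zero there; Python's standard colorsys module offers an equivalent conversion (with H scaled to [0, 1) instead of degrees).

def rgb_to_hsv(r, g, b):
    # Convert 8-bit RGB values to (H, S, V): H in degrees [0, 360),
    # S and V as fractions in [0.0, 1.0].
    r, g, b = r / 255.0, g / 255.0, b / 255.0   # scale to 0.0-1.0
    mx, mn = max(r, g, b), min(r, g, b)
    if mx == mn:                # achromatic (gray): hue undefined, use 0
        h = 0.0
    elif mx == r:
        h = (60 * (g - b) / (mx - mn)) % 360
    elif mx == g:
        h = 60 * (b - r) / (mx - mn) + 120
    else:                       # mx == b
        h = 60 * (r - g) / (mx - mn) + 240
    s = 0.0 if mx == 0.0 else (mx - mn) / mx
    v = mx
    return h, s, v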
Now that we have converted the colors of a frame into a usable color space, we can calculate its average brightness and saturation, and thus the valence level, by putting those values into the model. By repeating this for every frame, the resulting valence time curve can be computed and displayed.
The video sequences I analyzed came from 10 different movies: Cold Mountain, The Color Purple, Dude, Where's My Car?, Guess Who, Hotel Rwanda, The Incredibles, Mean Creek, Monsters, Inc., Saving Private Ryan and Valiant. All of them are American movies, and most have clearly manipulated colors to enhance emotional effect. From each movie a positive and a negative scene was selected and analyzed by the implemented algorithm. This selection was done more or less at random: skipping through the movies, I picked scenes I thought were sad or happy, trying to select two scenes per movie that take place in more or less similar environments. The length of the scenes varies from 4 to 19 minutes. Some of the results are displayed below, after a short sketch of the frame loop just described.
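The fragment below reads a video frame by frame and collects the per-frame averages. OpenCV is my own choice of library here, not necessarily what the original implementation used; its output can be fed directly into the valence_curve sketch from section 4. The file name is hypothetical.

import cv2
import numpy as np

def frame_brightness_saturation(path):
    # Returns per-frame average brightness (value) and saturation in percent.
    cap = cv2.VideoCapture(path)
    val, sat = [], []
    while True:
        ok, frame = cap.read()                        # frame is BGR, uint8
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)  # S and V are 0-255 here
        sat.append(hsv[:, :, 1].mean() / 255.0 * 100.0)
        val.append(hsv[:, :, 2].mean() / 255.0 * 100.0)
    cap.release()
    return np.array(val), np.array(sat)

# Example (hypothetical file name):
# b, s = frame_brightness_saturation("scene.avi")
# C = valence_curve(b, s)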
Figure 7. Valence curve for a scene from Mean Creek
The scene from Mean Creek (Figure 7) features some youngsters on a boat trip across a river. They are joking around a bit, but one kid starts to get annoying. The kid gets pushed and falls into the water. During the fall he bumps his head on the camera he is holding, which leaves him unconscious, and he eventually drowns. This is where the curve (Figure 7) goes from a positive mood to a negative mood. In the following minutes they drag the body ashore, mourn his death, despair about the consequences this may have for their lives, and debate what to do with the body.
Figure 8. Valence curve for a neutral scene from Hotel Rwanda
Figure 9. Valence curve for a sad scene from Hotel Rwanda
The two scenes I analysed from Hotel Rwanda both take place in the same hotel lobby and the square in front of the hotel. The neutral scene (Figure 8) takes place in broad daylight, while in the sad scene it is raining, presumably to enhance the emotional effect, since it is one of the saddest scenes of the movie and the only one in the whole movie in which it is raining. In the neutral scene hotel guests arrive and the hotel manager has a conversation with a reporter in the hotel bar. In the sad scene (Figure 9) all the foreigners get to leave the hotel and the country after the rebellion has started. The cruel part of the scene is that all the Rwandan people have to stay behind, even little orphans brought to the hotel by priests and nuns. This results in some sad departures and separations of loved ones.
Figure 10. Valence curve for a scene from Cold Mountain
The scene from Cold Mountain is taken from the beginning of the movie and consists of two settings. The first part of the scene takes place in a nice village in the hills on a bright and sunny day. Nicole Kidman arrives in this village where Jude Law and the other men of the village are building a new house. After a little talk with one of the women, she brings the men some refreshing drinks. Here the two meet for the first time, and it is obvious they fancy each other.
Then we have a flash forward to a battlefield near the end of the Civil War, in which Jude Law fights alongside the Confederate army as it massacres a platoon of Union soldiers who have gotten themselves stuck in a crater. Later in the scene he also attempts to rescue a friend who falls into the crater among the Union soldiers. This transition lies about halfway in Figure 10, where the valence level drops below 0, never to return above it except for a brief moment.
The results of the discussed analyses show that this model can be useful in detecting positive and negative scenes in a video segment, though it should be used with caution. The results for Hotel Rwanda are pretty consistent, but the results for Mean Creek and Cold Mountain show a lot of peaks. In the fragment of Mean Creek those peaks are caused by the amount of sky that is visible in the scene: the sky consists of brighter colors than the rest of the frame, so the more sky is visible, the higher the brightness and saturation levels of that frame and thus the higher the valence level. This is also the case in the first part of the sequence taken from Cold Mountain, but the peaks in the second half of that sequence are caused by white smoke, which looks orange/red due to the filter placed over the entire scene.
Apart from these three movies, the fragments from the three CGI-animated movies, Valiant (shown in Figure 6), The Incredibles and Monsters, Inc., also show results that prove the model useful. Since the model only makes sense when the makers of the movie tried to enhance emotional effect by manipulating the colors, it is to be expected that the CGI animations show consistent results: no color in such a movie is there by accident, since everything is created from scratch. The other four movies don't show curves in which negative or positive scenes are recognizable, but in those movies there also aren't any clear color differences detectable with the human eye. Only in the sequence from The Color Purple is a subtle color adjustment applied to the more negative parts of the segment, but you have to look carefully to see it; variations in the colors of the background make this difference undetectable by this model. The model will definitely not be foolproof. For instance, night scenes will always be labelled negative, since their colors are far less bright than those of daytime scenes, while there will also be positive night scenes. And since filmmaking is not an exact science, a scene is not required to have the appropriate lighting. As cinematographer Robby Müller said in [Ett98]: “A scary scene doesn't require film noir lighting to be effective, nor should a love scene always need obviously romantic lighting. By counter pointing the mood of a scene, you can sometimes give it added emotional realism.”
6. CONCLUSION
In this paper I proposed a model for affective video content analysis using a color component. The results of the evaluation show that the model performs quite well once the video content offered to it meets some prerequisites: the colors of the content have to have been manipulated in some way to enhance emotional effect. Television series and movies typically apply these types of techniques, but you probably won't find any differentiation in the valence curve when analyzing an ordinary soccer match, something Hanjalic and Xu [HX05] are able to do with their model. Since this model meets the same criteria as theirs, and the input and output of the models are the same, it is possible to integrate it with their model. Whether this will result in more accurate affect curves is something
that can be investigated in the future. The relation between hue and affect is also a topic for further research. In general, other relations between low-level features of video content and the affect dimensions should be investigated, and known relations should be sharpened to obtain better results from the existing models.
REFERENCES
[LSG01] M.S. Lew, N. Sebe, and P. Gardner, Principles of Visual Information Retrieval, Springer, New York, 2001.
[Del99] A. Del Bimbo, Visual Information Retrieval, Morgan Kaufmann, New York, 1999.
[HX01] A. Hanjalic and L.Q. Xu, User-oriented Affective Video Content Analysis. Proceedings IEEE CBAIBL 2001, Kauai, HI, 50-57.
[HX05] A. Hanjalic and L.Q. Xu, Affective Video Content Representation and Modeling. IEEE Transactions on Multimedia, Vol. 7(1), 2005, 143-154.
[RM77] J. Russell and A. Mehrabian, Evidence for a three-dimensional theory of emotions. Journal of Research in Personality, Vol. 11, 1977, 273-294.
[DL99] R. Dietz and A. Lang, Affective Agents: Effects of agent affect on arousal, attention, liking and learning. Proceedings of the Cognitive Technology Conference, San Francisco, CA, 1999.
[WDC04] C.Y. Wei, N. Dimitrova, and S.F. Chang, Color-Mood Analysis of Films Based on Syntactic and Psychological Models. Multimedia and Expo, 2004 (ICME '04), Vol. 2, 831-834.
[Mah96] F.H. Mahnke, Color, Environment, and Human Response, Van Nostrand Reinhold, New York, 1996, Chapter 3.
[Kan03] H.B. Kang, Affective Content Detection using HMMs. Proceedings of the 11th ACM Multimedia 2003, Berkeley, CA, 259-262.
[Zet99] H. Zettl, Sight Sound Motion: Applied Media Aesthetics, Wadsworth Publishing, 1999.
[Smi03] G.M. Smith, Film Structure and the Emotion System, Cambridge University Press, 2003.
[Ett98] P. Ettedgui, Cinematography, RotoVision SA, 1998.
[VM94] P. Valdez and A. Mehrabian, Effects of color on emotions. Journal of Experimental Psychology: General, Vol. 123(4), Dec. 1994, 394-409.
[KE04] N. Kaya and H.H. Epps, Color-emotion associations: Past experience and personal preference. Proceedings of the Interim Meeting of the International Color Association, Porto Alegre, Brazil, November 3-5, 2004, 31-34.
[Hår04] M. Hårleman, Colour emotion in full-scale rooms. Proceedings of the Interim Meeting of the International Color Association, Porto Alegre, Brazil, November 3-5, 2004, 223-226.