Learning Video Browsing Behavior and Its Application in the Generation of Video Previews

Tanveer Syeda-Mahmood*
IBM Almaden Research Center, K57/B2, 650 Harry Road, San Jose, CA 95120
[email protected]

Dulce Ponceleon
IBM Almaden Research Center, K57/B2, 650 Harry Road, San Jose, CA 95120
[email protected]
With such prevalence of videos and users, streaming media servers are presented with new opportunities and a challenge: can they track the browsing behavior of users, both within and across videos, to determine what interests users? Learning this information is potentially valuable not only for improved customer tracking and context-sensitive e-commerce, but also for the generation of fast previews of videos for easy pre-downloads. For example, a simple observation such as "most people abandoned this video after a few seconds of watching" can give important clues to the effectiveness of the video as an advertisement. Customer tracking agencies can provide this information as important feedback to video ad production agencies. Similarly, by tracking interest at the single-user level, context-sensitive feedback can be provided to users. Thus, for example, if it can be deduced that users preferred a section of an e-commerce video depicting, say, a microwave, further information about such devices can be presented to the user to augment the currently presented information, as a type of active hot-spotting. Finally, if we can learn users' interest in video segments by watching their interaction with videos, the corresponding clips are bound to satisfy the "interestingness" constraint in assembling video previews. Such trailer previews can serve as suitable pre-downloads for the actual video and can help users find and select appropriate content from the vast amount of video data now available.

Can machines learn the browsing behavior of users, though? Simple tracking of video usage is possible by examining the server logs and noting the number of times a video was "touched". In a sense, this is already being done in digital video recorders and set-top boxes such as TiVo, wherein users' preferences for TV programs are derived by analyzing their past viewing behavior as recorded in server logs.
In fact, information about the particular video segments watched by users can also be obtained from server logs and used in summary generation, as done in [5]. These simple indicators, however, do not necessarily reveal the interest of users. For example, in trying to find certain information, a user may have skimmed different portions of the video before settling on a region of interest to watch. All such video regions will be marked as "touched" in the server logs and are indistinguishable, as can be seen from Figure 3e. Thus there is a need for more detailed tracking of video browsing behavior to determine the interest of users. We first present a theoretically well-founded methodology for learning video browsing behavior through the use
ABSTRACT

With more and more streaming media servers becoming commonplace, streaming video has become a popular medium of instruction, advertisement, and entertainment. With such prevalence comes a new challenge for the servers: can they track the browsing behavior of users to determine what interests users? Learning this information is potentially valuable not only for improved customer tracking and context-sensitive e-commerce, but also for the generation of fast previews of videos for easy pre-downloads. In this paper, we present a formal learning mechanism to track the video browsing behavior of users. This information is then used to generate fast video previews. Specifically, we model the states a user transitions through while browsing videos as the hidden states of a Hidden Markov Model (HMM). We estimate the parameters of the HMM using maximum likelihood estimation for each sample observation sequence of user interaction with videos. Video previews are then formed from interesting segments of the video, automatically inferred from an analysis of the browsing states of viewers. Audio coherence in the previews is maintained by selecting clips spanning complete clauses containing topically significant spoken phrases. The utility of learning video browsing behavior is demonstrated through user studies and experiments.
Keywords: Learning, browsing behavior, interesting content, video previews, audio, topics.
1. INTRODUCTION

With more and more streaming media servers becoming commonplace, streaming video has become a popular medium of instruction, advertisement, and entertainment. With thousands of hours of content becoming available on demand, one can expect that such videos will be seen by a large number of users.

*Author for correspondence.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'01, Sept. 30-Oct. 5, 2001, Ottawa, Canada. Copyright 2001 ACM 1-58113-394-4/01/0009...$5.00
of Hidden Markov Models (HMMs). Specifically, we analyze the various potential browsing behavioral states that users can be in while watching videos, and model them as the hidden states of an HMM. We then observe their interaction with the video (through a media player) and model a sequence of these interactions as an observation sequence generated by the HMM. The parameters of the HMM are estimated using a combination of supervised learning (through positive examples) and unsupervised learning (maximum likelihood estimation via the EM algorithm). The most likely state sequence that can account for a given observation is then derived using the Viterbi algorithm [13]. The browsing state labels and the data for training the HMM were obtained through user studies.

Next, we show an application of such learning of browsing behavior in the automatic generation of video previews. Specifically, we assemble short previews by selecting candidate clips from interesting video segments. The interesting video segments are defined as those segments watched by a user in an interested state. Detection of interesting segments is done by collecting data about browsing states across all viewers of the same video and recording peaks in the resulting visit-count distribution during the interested states. Since the interesting segments can themselves be long, we further select clips within these segments for preview generation. The clips to retain are determined by analyzing the audio track for important topical phrases. A rough duration of each clip is determined by noting the time during which a sentence or clause containing the topical phrases was spoken. The trailer summary or preview assembled this way is guaranteed to capture interesting video segments and is also piecewise coherent, with each clip spanning a spoken sentence or clause.

The rest of the paper is organized as follows. In Section 2, we present the HMM-based modeling of browsing behavior.
In Section 3, we describe user studies performed to obtain data for training and testing the learning of browsing behavior. An application of learning browsing behavior in the generation of video previews is discussed in Section 4. Finally, in Section 5, we present an evaluation of the utility of learning browsing behavior in the context of generating video previews.
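The interesting-segment selection described above can be sketched roughly as follows. The visit counts and the simple local-maximum peak rule here are hypothetical stand-ins for the paper's actual data and criterion, which are derived from inferred browsing states of real viewing sessions:

```python
# Sketch: find "interesting" video segments as peaks in the per-segment
# visit-count distribution, aggregated over viewers in the interested state.
# The counts and the peak rule below are illustrative, not the paper's.

def find_interesting_segments(visit_counts, min_visits=2):
    """Return indices of segments whose visit count is a local maximum."""
    peaks = []
    for i, c in enumerate(visit_counts):
        left = visit_counts[i - 1] if i > 0 else 0
        right = visit_counts[i + 1] if i + 1 < len(visit_counts) else 0
        if c >= min_visits and c >= left and c > right:
            peaks.append(i)
    return peaks

# 10 equal-length segments of a video; counts of interested-state visits.
counts = [0, 1, 5, 2, 0, 0, 7, 3, 1, 0]
print(find_interesting_segments(counts))  # -> [2, 6]
```

Segments 2 and 6 stand out as candidates from which preview clips would then be selected using the audio-track analysis described in Section 4.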
2. MODELING VIDEO BROWSING BEHAVIOR

The central thesis we propose is that it is possible to learn the browsing behavior of viewers by observing their interaction patterns while watching a video. The rationale behind this idea is as follows. Viewers watch digital videos by operating the controls offered by a media player. If the video is short, it is possible that viewers watch it from beginning to end. This behavior is revealed by the viewer's choice of controls, i.e., by the fact that the play button was pressed once, and pause was used only once at the end. When the video is long, however, people look for ways of quickly grasping the content. Thus, if they can get a peek into the content by visual fast-forwarding (as is available in DVDs), then a series of viewer interactions of the form "fast forward -> play -> pause" can be interpreted as reflecting a browsing behavior in which the viewer was initially looking for something and, having found it, watched it for a while before pausing. Thus, while the media-playing controls have obvious meanings from the media-playing perspective, they also acquire additional meanings when seen from a user's browsing perspective.

If such browsing behavior can be characterized by a set of states, called the browsing states, then the learning of browsing behavior can be transformed into the problem of predicting the browsing states of the user by observing the interaction sequence generated during a viewing session. A simplistic approach would be to use a rule-based system to do the prediction. But such a system cannot effectively deal with the uncertainties associated with multiple interpretations of the same viewer interaction. A more formal way to address these uncertainties is through the use of HMMs. To motivate the appropriateness of HMMs for this problem, observe that, even though there may not be a one-to-one correspondence between user actions and likely browsing states, it is reasonable to assume that the action elicited by a user is a reflection of his or her current behavioral state. Further, we can also assume that the current behavioral state at time t is a function of the immediate past behavioral state at time t-1. This implies that the influence of the past behavioral states on the current behavioral state forms a Markov chain. Such time-varying behavior and these two assumptions imply that the browsing states can be modeled through the mechanism of an HMM [13].

An HMM is a stochastic signal model describing a system that can be in one of a set of N possible states {S_1, S_2, ..., S_N} at any time instant t, and produces one of a set of M possible observation symbols {v_1, v_2, ..., v_M} at t. The transitions between states, as well as the probability of seeing a specific observation symbol while in a state, are described through probability distributions. Specifically, an HMM is characterized by λ = (A, B, π), where A = {a_ij | 1 <= i, j <= N} is the state transition probability distribution, B = {b_j(k) | 1 <= j <= N, 1 <= k <= M} is the observation symbol distribution, and π = {π_i | 1 <= i <= N} is the initial state distribution [13], with

  a_ij = P(q_t = S_j | q_{t-1} = S_i),    (1)
  b_j(k) = P(O_t = v_k | q_t = S_j),      (2)
  π_i = P(q_1 = S_i),                     (3)

where q_t is the state at time t and O_t is the observation symbol seen in the observation sequence at time t. We assume a first-order Markov chain, so that P(q_t | q_{t-1}, q_{t-2}, ..., q_1) = P(q_t | q_{t-1}) for a T-length observation sequence O = {O_1, O_2, ..., O_T}.

HMMs are a widely and successfully used tool in statistical modeling, statistical pattern recognition, and speech recognition [13]. They have not been used as often for modeling user behavior, with a few exceptions [11, 1]. In our approach, the browsing states of a viewer are the hidden states, while the sequence of user interactions corresponds to an observation sequence. But to make this work, we need to design the HMM almost from a clean slate. That is, we do not know how many browsing states there are, nor what their labels are. Even if we could derive them by some process, we need techniques for estimating the model parameters (λ). Learning the browsing state at any time then becomes equivalent to using the HMM to predict the most likely browsing state sequence that can give rise to the observation.
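To make the model components concrete, a minimal sketch of λ = (A, B, π) as defined in equations (1)-(3) is shown below. The state and symbol names are hypothetical (the paper derives its actual browsing states, ten of them, and observation symbols from a user study, Section 3.1), and the probabilities are made up for illustration:

```python
import numpy as np

# Illustrative HMM lambda = (A, B, pi) for browsing behavior.
# State and symbol names are hypothetical -- placeholders for the
# states and media-player interactions ascertained in Section 3.1.
states = ["seeking", "interested", "abandoning"]      # hidden states S_i
symbols = ["play", "pause", "fast_forward", "stop"]   # observation symbols v_k

rng = np.random.default_rng(0)

# A[i, j] = P(q_t = S_j | q_{t-1} = S_i): state transition distribution.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.1, 0.8]])
# B[j, k] = P(O_t = v_k | q_t = S_j): observation symbol distribution.
B = np.array([[0.2, 0.1, 0.6, 0.1],
              [0.6, 0.3, 0.05, 0.05],
              [0.1, 0.2, 0.1, 0.6]])
# pi[i] = P(q_1 = S_i): initial state distribution.
pi = np.array([0.7, 0.2, 0.1])

# Each row of A and B, and pi itself, must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)

def sample_session(T):
    """Generate a T-step interaction sequence from the model; by the
    first-order Markov assumption, q_t depends only on q_{t-1}."""
    q = rng.choice(len(states), p=pi)
    obs = []
    for _ in range(T):
        obs.append(symbols[rng.choice(len(symbols), p=B[q])])
        q = rng.choice(len(states), p=A[q])
    return obs

print(sample_session(5))
```

In the paper's setting the arrow of inference runs the other way: such interaction sequences are observed from real media-player sessions, and the hidden browsing states are inferred.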
2.1 Estimating the HMM through unsupervised learning

In this section, we assume that the number of browsing states and their labels have already been ascertained. Similarly, we assume that the observation symbols have been suitably chosen. A user study to ascertain these is described in Section 3.1. For each observation sequence O, then, we would like to choose the HMM that maximizes the probability of seeing this sequence. The most popular way to address this problem is to employ iterative maximum likelihood estimation using the Baum-Welch or EM algorithm [8]. Specifically, the likelihood of the parameters λ given the observation O, L(λ|O) = P(O|λ), is estimated by taking the average (expectation) of the logarithm of the likelihood function P(O, Q|λ) over all possible state sequences Q that can give rise to the observation, and choosing the parameters that maximize this function. The typical way to do this is to start with an initial estimate of the model parameters λ^{i-1} and revise the estimate by forming the log-likelihood function

  Q(λ, λ^{i-1}) = Σ_Q log P(O, Q|λ) P(O, Q|λ^{i-1})    (4)

and choosing the new estimate λ^i = argmax_λ Q(λ, λ^{i-1}). It has been proved by Baum and several others [8] that maximization of Q(λ, λ^{i-1}) leads to increased likelihood, i.e., max_λ [Q(λ, λ^{i-1})] implies P(O|λ^i) >= P(O|λ^{i-1}), with the likelihood function converging to a critical point. For a particular state sequence Q = {q_1, q_2, ..., q_T}, it is easy to show that

  P(O, Q|λ) = π_{q_1} Π_{t=1}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t)    (5)

so that the likelihood function becomes

  Q(λ, λ^{i-1}) = Σ_Q log π_{q_1} P(O, Q|λ^{i-1})
                + Σ_Q (Σ_{t=1}^{T} log a_{q_{t-1} q_t}) P(O, Q|λ^{i-1})
                + Σ_Q (Σ_{t=1}^{T} log b_{q_t}(O_t)) P(O, Q|λ^{i-1}).    (6)

Using the above equation, the model parameters can now be estimated independently by optimizing each of the three terms individually, to get

  π_j = P(O, q_1 = S_j | λ^{i-1}) / P(O | λ^{i-1}) = P(q_1 = S_j | O, λ^{i-1}).    (7)

The right-hand-side term is simply the expected number of times the HMM is in state S_j at the initial time instant, and can be computed from the values of the model parameters from the previous iteration. The detailed formulas are available in the literature and standard textbooks on HMMs [13]. Similarly, a_ij is the ratio of the expected number of transitions from state S_i to state S_j to the expected number of transitions from state S_i, and can be estimated as

  a_ij = Σ_{t=1}^{T} P(O, q_{t-1} = S_i, q_t = S_j | λ^{i-1}) / Σ_{t=1}^{T} P(O, q_{t-1} = S_i | λ^{i-1}).    (8)

Similarly, b_j(k) is the ratio of the expected number of times in state S_j while observing the symbol v_k to the expected number of times in state S_j. It is given by

  b_j(k) = Σ_{t=1}^{T} P(O, q_t = S_j | λ^{i-1}) δ_{O_t = v_k} / Σ_{t=1}^{T} P(O, q_t = S_j | λ^{i-1}).    (9)

The right-hand sides in both equations (8) and (9) can be computed from the previous estimates of the model parameters as described in [13]. As can be seen from the above equations, the HMM parameters are continuously updated for each observation sequence. Although the number of states is fixed, each user will have his or her own HMM, which will be identical to another user's HMM only if both elicited the same observation sequence and started in the same initial state. For the number of states we considered (10 in our case), having a separate HMM for each user is not a major issue, since the EM algorithm converges within 5 iterations.

2.2 Inferring the browsing state sequence

Once the HMM parameters have been estimated, the most likely browsing state sequence that can account for a given observation sequence is derived using the Viterbi algorithm [13]. Defining

  δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P(q_1, q_2, ..., q_t = i, O_1, O_2, ..., O_t | λ),    (10)

the most likely state at time t is given by

  q_t* = argmax_{1 <= i <= N} [δ_t(i)].    (11)

The details of the Viterbi algorithm are available in the literature [13] and several textbooks.

2.3 Training the HMM

As we saw above, the EM or Baum-Welch algorithm can iteratively estimate the model parameters starting from initial estimates, and is an example of unsupervised learning. Such maximization, however, leads only to local maxima, with the nature of the maximum reached critically dependent on the choice of the initial estimates. To enable effective learning, we augmented this unsupervised learning with a training or supervised learning stage, in which manually chosen pairs of observation sequences and state labels were used to generate the initial estimate in two steps. In the first step, we initialized the HMM parameters by assigning zero probability to obviously impossible state and symbol transitions; the remaining parameters were drawn from uniform distributions. In the second step, we let users specify their state at random time instants during their video browsing sessions. The experimental setup and the focused tasks that elicited the various browsing behaviors are described in detail in Section 3.1. The observation symbols of each session were recorded continuously, giving rise to observation sequences labeled with browsing states at random instants.
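A minimal log-space sketch of the Viterbi recursion behind equations (10) and (11) is shown below. The toy two-state model and its matrices are hypothetical, not the paper's learned parameters:

```python
import numpy as np

# Sketch: Viterbi decoding of the most likely hidden-state sequence
# (the delta recursion of equations (10)-(11)), computed in log space
# for numerical stability.

def viterbi(obs, A, B, pi):
    """obs: sequence of observation-symbol indices.
    Returns the most likely hidden-state sequence q_1..q_T."""
    N, T = len(pi), len(obs)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]   # delta_1(i)
    psi = np.zeros((T, N), dtype=int)      # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + logA     # scores[i, j]: best path i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    # Backtrack from argmax_i delta_T(i).
    q = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))
    return q[::-1]

# Toy model: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(viterbi([0, 0, 1, 1, 1], A, B, pi))  # -> [0, 0, 1, 1, 1]
```

With self-transitions this likely, the decoded path switches state exactly where the emitted symbol changes, which is the behavior one would want when mapping a run of media-player actions to a run of browsing states.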