VIDEO CLASSIFICATION BASED ON HMM USING TEXT AND FACES

Nevenka Dimitrova 1), Lalitha Agnihotri 1) and Gang Wei 2)

1) Philips Research, 345 Scarborough Road, Briarcliff Manor, NY 10510
e-mail: (nevenka.dimitrova, lalitha.agnihotri)@philips.com

2) Computer Science Department, Wayne State University, Detroit, MI 48202
e-mail: [email protected]

ABSTRACT

Video content classification and retrieval is a necessary tool in the current merging of entertainment and information media. With the advent of broadband networking, every consumer will have video programs available on-line as well as through the traditional distribution channels. Systems that help in content management have to discern between different categories of video in order to provide fast retrieval. In this paper we present a novel method for video classification based on face and text trajectories. It builds on the observation that different TV categories exhibit different face and text trajectory patterns. Face and text tracking is applied to arbitrary video clips to extract face and text trajectories. We use Hidden Markov Models (HMMs) to classify a given video clip into predefined categories, e.g., commercial, news, sitcom and soap. Our preliminary experimental results show a classification accuracy of over 80% for the HMM method on short video clips. This paper describes the continuity-based face and text detection and tracking in video that underlies this HMM classification method.

1. INTRODUCTION

Content-based searching is a crucial element of any media content management system. For TV content, consumers can find the category or the title of a program from the paper or electronic program guide. However, it is difficult to categorize a stored video segment. Once a video segment is categorized, different content analysis algorithms can be applied to extract additional features that are more particular to that domain. For example, if a program is categorized as a sporting event, then filters to find meaningful events such as a "home run" can be applied.

Different methods have been proposed in the literature to classify video programs into predefined categories or to detect specific categories of programs, like the commercial detection systems presented in [5] and [8]. We explored the use of face and text tracking for video classification. This stems from the fact that most TV programs focus on human activities, in which human faces play a very important role. On most occasions, video content can be satisfactorily characterized by capturing face trajectories. On the other hand, text is a helpful cue in identifying certain types of programs, such as news, commercials and sports. In our experiments, we integrate information about face and text, including their size, movement and duration of presence, and classify a given video clip (in MPEG format) into one of four categories: news, commercials, sitcoms or soaps. Encouraging results have been obtained so far, and better performance in terms of accuracy and efficiency is expected with optimization. We used the methods described in [2] and [17] for text and face detection, respectively. After detection, text and face features are extracted and classified with a Hidden Markov Model (HMM).

The remainder of the paper is organized as follows. Section 2 presents our text detection and tracking system. Section 3 describes our continuity-based face tracking process. Video classification based on the HMM approach using face and text is discussed in Section 4. Experimental results are given in Section 5, and in Section 6 we present conclusions and future work.

2. TEXT DETECTION AND TRACKING

In our system, we used the text detection algorithm developed in [2]. Videotext is detected based on the assumptions that the text has high contrast and that there are limits to the text box height and width. The algorithm consists of the following stages (a code sketch follows the list):
1) Channel separation: capture input frames from a video and separate the luminance frame from the input. The following steps are performed on the luminance frame.
2) Image enhancement: enhance the edges with a 3x3 mask.
3) Edge detection: find the edges using a 3x3 mask.
4) Edge filtering: remove areas that are unlikely to contain text.
5) Character detection: perform connected component (CC) analysis and filter each CC by area, height and width criteria.
6) Text box detection: merge text boxes that are likely to belong to the same line of text.
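Below is a minimal sketch of this six-stage pipeline, using OpenCV as a stand-in for the implementation of [2]; the masks, thresholds and line-merge heuristic are illustrative assumptions rather than the tuned values of the original system.

```python
import cv2
import numpy as np

def detect_text_boxes(frame_bgr, min_area=50, min_h=8, max_h=60, min_w=8):
    # 1) Channel separation: keep only the luminance plane.
    luma = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV)[:, :, 0]

    # 2) Image enhancement: sharpen edges with a 3x3 mask.
    sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    enhanced = cv2.filter2D(luma, -1, sharpen)

    # 3) Edge detection with a 3x3 mask (Laplacian used here as a stand-in).
    edges = cv2.convertScaleAbs(cv2.Laplacian(enhanced, cv2.CV_16S, ksize=3))

    # 4) Edge filtering: binarize and drop weak responses unlikely to be text.
    _, binary = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 5) Character detection: connected components filtered by
    #    area, height and width criteria.
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    chars = [stats[i] for i in range(1, n)
             if stats[i, cv2.CC_STAT_AREA] >= min_area
             and min_h <= stats[i, cv2.CC_STAT_HEIGHT] <= max_h
             and stats[i, cv2.CC_STAT_WIDTH] >= min_w]

    # 6) Text box detection: greedily merge components whose vertical
    #    positions overlap, i.e. that likely share one line of text.
    boxes = []
    for s in sorted(chars, key=lambda s: s[cv2.CC_STAT_LEFT]):
        x, y = s[cv2.CC_STAT_LEFT], s[cv2.CC_STAT_TOP]
        w, h = s[cv2.CC_STAT_WIDTH], s[cv2.CC_STAT_HEIGHT]
        for b in boxes:
            if abs(b["y"] - y) < max(b["h"], h):      # same text line?
                b["w"] = max(b["x"] + b["w"], x + w) - min(b["x"], x)
                b["x"] = min(b["x"], x)
                b["h"] = max(b["h"], h)
                break
        else:
            boxes.append({"x": x, "y": y, "w": w, "h": h})
    return boxes
```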

A more detailed description of the whole procedure is presented in [2]. An example of a text detection result is shown in Fig. 1.

Text tracking is performed as follows. In TV programs, superimposed text may survive cuts, and therefore tracking cannot be restricted to within each shot. To solve this problem, we accomplish text tracking by two independent processes: frame-by-frame text detection and trajectory extraction. Since the algorithm in [2] is very efficient, we perform text detection on a frame-by-frame basis and then construct a model for each detected text box. The model includes the center location, height and width of the text. We then extract trajectories by comparing models over adjacent frames, so that each trajectory contains a series of text models over consecutive frames. The trajectory extraction process consists of two steps, namely initialization and matching (a code sketch follows the list):
1) Initialization: Models on the first frame that contains text are considered the "leading" models of the initial set of text trajectories.
2) Matching: Each model on the following frames is compared with the models on the previous frame. If a good match (similar in terms of center position, height and width) is found, the model is appended to the trajectory of the matching previous model. Otherwise a new trajectory is created, with this model as its "leading" model. This step loops until the end of the video clip.
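The sketch below illustrates this two-step trajectory extraction; a text model is assumed to be a (center-x, center-y, width, height) tuple, and the relative matching tolerance is an illustrative assumption, not the threshold of the actual system.

```python
def extract_trajectories(models_per_frame, tol=0.2):
    def matches(a, b):
        # Similar center position, height and width (relative tolerance).
        return (abs(a[0] - b[0]) < tol * b[2] and abs(a[1] - b[1]) < tol * b[3]
                and abs(a[2] - b[2]) < tol * b[2] and abs(a[3] - b[3]) < tol * b[3])

    trajectories = []   # each is a list of (frame_no, model) pairs
    active = []         # trajectories extended on the previous frame
    for frame_no, models in enumerate(models_per_frame):
        next_active = []
        for m in models:
            # Matching: compare against the last model of each trajectory
            # that was extended on the previous frame.
            for traj in active:
                if traj[-1][0] == frame_no - 1 and matches(m, traj[-1][1]):
                    traj.append((frame_no, m))
                    next_active.append(traj)
                    break
            else:
                # Initialization: no match, so this model starts (and
                # "leads") a new trajectory.
                traj = [(frame_no, m)]
                trajectories.append(traj)
                next_active.append(traj)
        active = next_active
    return trajectories
```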

3. FACE DETECTION AND TRACKING

We first perform face detection in order to identify the faces present in the scene, along with their positions. The accuracy of the face detector is a critical factor affecting the performance of the tracking system.

Various methods have been proposed for face detection; a survey can be found in [3]. These approaches can be assigned to one of two broad categories: (i) feature-based methods, which operate by searching for different facial features and using their spatial relationship to locate faces [1], [19], [7], [6], [15], [18]; and (ii) classification-based methods [12], [14], [9], [4], [13], which instead try to detect the presence of a face as a whole by classifying all possible subimages of a given image as face or non-face subimages.

3.1 Face Detection

We used the face detection method described in [17]. The system employs a feature-based, top-down scheme. It consists of four stages (stage 1 is sketched in code after this list):
1) Skin-tone region extraction. By manually labeling the skin-tone pixels of a large set of color images, a distribution graph of skin-tone pixels in the YIQ color space is generated. A half-ellipse model is used to approximate the distribution and to filter the skin-tone pixels of a given color image.
2) Pre-processing. Morphological operations are applied to the skin-tone regions to smooth each region, break narrow isthmuses, and remove thin protrusions and small isolated regions.
3) Face candidate selection. Shape analysis is applied to each region, and those with elliptical shapes are accepted as candidates. An iterative partition process based on k-means clustering is applied to the rejected regions to decompose them into smaller convex regions and see if more candidates can be found.
4) Decision making. Possible facial features are extracted and their spatial configuration is checked to decide whether the candidate is truly a face.
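As an illustration of stage 1, the sketch below filters skin-tone pixels with a half-ellipse model in the YIQ space; the ellipse parameters are placeholder assumptions, not the distribution actually fitted in [17].

```python
import numpy as np

def skin_tone_mask(rgb, y_center=0.5, y_axis=0.45, i_axis=0.18):
    rgb = rgb.astype(np.float32) / 255.0
    # RGB -> YIQ (standard NTSC transform); only Y and I are used here.
    y = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    i = 0.596 * rgb[..., 0] - 0.274 * rgb[..., 1] - 0.322 * rgb[..., 2]
    # Half ellipse in the (Y, I) plane: candidate skin tones fall inside
    # the ellipse and on the positive-I (reddish) side.
    inside = ((y - y_center) / y_axis) ** 2 + (i / i_axis) ** 2 <= 1.0
    return inside & (i > 0)
```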

3.2 Face Tracking

Applying face detection to each frame is inefficient due to the computational complexity. To improve performance, we take advantage of the content continuity between consecutive frames by considering the detection of faces along with trajectory extraction. The assumption is that the variation of faces within a continuous shot is usually small. Compressed-domain cut detection is performed to segment the video into shots [10]. Face detection is then applied to the first few frames of each shot. For each detected face, the mean and standard deviation of the color, height, width and center position are computed. All these features constitute a face model. The face model is used for tracking the face in subsequent frames until the next cut is detected. In the subsequent frames, instead of searching the whole frame, the tracking system searches only a reduced area in the neighborhood of the position given by the model created from the previous frame for the corresponding face, as sketched below. The models are updated on a frame-by-frame basis to reflect the latest changes until the end of the shot. Our experiments showed that by reducing the search space we gain over a ten-fold speed-up compared with applying face detection to each frame in isolation.
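A sketch of this reduced-search-area tracking is given below; detect_faces and search stand in for the detector of [17] applied to a whole frame and to a window, respectively, and the window margin is an illustrative assumption.

```python
def track_faces_in_shot(frames, detect_faces, search, margin=0.5):
    # detect_faces(frame) -> list of (cx, cy, w, h) boxes (full-frame pass
    # on the first frames of the shot).
    tracks = [[box] for box in detect_faces(frames[0])]
    for frame in frames[1:]:
        for track in tracks:
            cx, cy, w, h = track[-1]       # model from the previous frame
            # Reduced search area: previous box grown by `margin` per side,
            # instead of scanning the whole frame.
            window = (cx - (0.5 + margin) * w, cy - (0.5 + margin) * h,
                      (1.0 + 2.0 * margin) * w, (1.0 + 2.0 * margin) * h)
            found = search(frame, window)  # windowed re-detection/matching
            if found is not None:
                track.append(found)        # frame-by-frame model update
    return tracks
```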

4. VIDEO CLASSIFICATION BASED ON HMM

In our system, face and text tracking extracts trajectories from the video, giving a hierarchy of information consisting of three layers (a data-structure sketch follows the list):
• Video layer. This is the top level, which contains the general information. A video clip contains a number of face trajectories, and at this level we have the number of face trajectories, their average duration, the cut rate, the number of frames and other information.



• Trajectory layer. This relates to each unique face or text trajectory in the video clip, such as its duration (starting and ending frame) and movement type (still, linear or other).
• Model layer. A face trajectory contains a series of face models in a sequence of successive frames. One model corresponds to a single face or text detected in one frame. It includes the color, location and size information of that face and is thus the lowest-level information, with the most detail.
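As an illustration only, the three layers can be pictured as the following data structures; the field names are our own choices, not the system's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ObjectModel:
    """Model layer: one face or text box detected in a single frame."""
    frame: int
    center: Tuple[int, int]
    size: Tuple[int, int]                               # (width, height)
    color: Optional[Tuple[float, float, float]] = None  # faces only

@dataclass
class Trajectory:
    """Trajectory layer: one unique face or text tracked over frames."""
    kind: str                              # "face" or "text"
    movement: str = "still"                # still, linear or other
    models: List[ObjectModel] = field(default_factory=list)

    def duration(self) -> Tuple[int, int]:
        """(starting frame, ending frame) of the trajectory."""
        return (self.models[0].frame, self.models[-1].frame)

@dataclass
class VideoClip:
    """Video layer: clip-level summary information."""
    num_frames: int
    cut_rate: float
    trajectories: List[Trajectory] = field(default_factory=list)

    def avg_face_duration(self) -> float:
        faces = [t for t in self.trajectories if t.kind == "face"]
        if not faces:
            return 0.0
        return sum(e - s + 1 for s, e in (t.duration() for t in faces)) / len(faces)
```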

With the different layers of information, we can obtain various resolutions of the video content description. In our experiments, the system classifies a given video into one of four categories of programs: news, commercials, sitcoms and soaps. We explored two methods for TV program classification based on face and text trajectories, namely domain-knowledge based [16] and Hidden Markov Model based. In this paper we describe the details of the HMM video classification.

Video classification can be performed using a domain-knowledge-based method relying on nearest-neighbor clustering [16]. The positive aspect of this classification method is its simplicity: each decision made in the process corresponds to a particular human visual perception, and the rules are straightforward to understand. However, this method does not take advantage of the temporal relationships of face and text trajectories, which are a very powerful cue in understanding video content. Although new rules could be added for this, doing so would make the system extremely complicated and hard to maintain and extend. Therefore we explored an alternative idea, the Hidden Markov Model, which circumvents these difficulties elegantly.

The Hidden Markov Model (HMM) is a popular technique widely used in signal processing. The essence of an HMM is to construct a model that explains the occurrence of observations (symbols) and to use it to identify other observation sequences. The fundamentals of HMMs and their applications are presented in [11]. Up to now, their applications have focused on cryptanalysis and speech recognition; in our research, we extend them to video analysis and classification.

An HMM has a finite number of states and is always in one of those states. At each clock time, it enters a new state based on a transition probability distribution that depends on the previous state. After a transition is made, an output symbol is generated based on a probability distribution that depends on the current state. Formally, the states are denoted Q = {q1, q2, ..., qN}, where N is the number of states, and the observation symbols are denoted V = {v1, v2, ..., vM}, where M is the number of observation symbols. The transition probability distribution between states is represented by a matrix A = {aij}, where aij = Pr{qj at t+1 | qi at t}, and the observation symbol probability distribution is represented by a matrix B = {bj(k)}, where bj(k) is the probability of generating observation vk when the current state is qj.

The system consists of two phases, namely training and classification. We constructed four HMMs, corresponding to news, commercials, sitcoms and soaps, respectively, each trained with a collection of video clips of its type. The process of training the HMM for news is illustrated in Fig. 2 (a); the other three HMMs are trained in the same way. HMM training essentially adjusts the parameters λ = (A, B, π) to maximize the probability Pr(O|λ) of the observation sequences. Here π stands for the initial state distribution and is defined as π = {πi}, where πi is the probability of state qi being the initial state (the state at t = 1), and O is the observation sequence.

In the classification phase, the observation sequence of a given video clip is first extracted by frame classification based on face and text tracking. The sequence is then fed to the four HMMs as input, and the clip is classified as the category of the HMM that generates the highest response (probability of the observation). Fig. 2 (b) shows the classification process for a given video clip.

For efficient utilization of the temporal relationships of detected faces and text, we first apply classification to the frames. Based on the results of face and text tracking, each frame is assigned a label. For example, if a frame contains a face in close shot together with text, it is labeled "Anchor person with text"; if it is the first frame of a shot, it is labeled "Shot start frame". In our system we have 15 labels, which means there are 15 observation symbols in the HMM; Table 1 lists their descriptions. Through this frame-by-frame classification, each video segment is represented by a sequence of symbols. The four HMMs are trained with symbol (observation) sequences extracted from video clips of the corresponding types. Then, for a given video clip, its frame label sequence (denoted O) is used as input to the four HMMs, and the probabilities Pr(O|λi) of the observation sequence under the four HMMs are computed. The video is classified as belonging to the category for which this probability is highest.
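To make the classification phase concrete, the sketch below scores a frame-label (observation) sequence against each category's model λ = (A, B, π) with the standard scaled forward algorithm and picks the most probable category. The parameters would come from training, e.g., with the Baum-Welch procedure covered in [11]; the code assumes the labels have already been mapped to integer symbol indices, and is an illustration, not our original implementation.

```python
import numpy as np

def forward_log_prob(obs, pi, A, B):
    """Log Pr(O | lambda) via the forward algorithm with rescaling."""
    alpha = pi * B[:, obs[0]]              # initialization: alpha_1(i)
    log_prob = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction: alpha_t -> alpha_{t+1}
        scale = alpha.sum()                # rescale to avoid underflow
        log_prob += np.log(scale)
        alpha /= scale
    return log_prob + np.log(alpha.sum())

def classify(obs, hmms):
    """hmms maps a category name to its (pi, A, B) parameters."""
    return max(hmms, key=lambda cat: forward_log_prob(obs, *hmms[cat]))
```

The per-step rescaling is needed because the raw product of probabilities underflows for observation sequences of any useful length.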

5. IMPLEMENTATION AND EXPERIMENTAL RESULTS

The system was developed with Microsoft Visual C++, running on PCs with Windows NT. It has two interfaces: the console version processes a batch of video clips, while the GUI version allows the user to interactively select the video to be classified and displays the video and the faces or text that are detected. Fig. 3 is a snapshot of the GUI version. We used 26 video clips as the training set: 5 clips of news, 6 commercials, 11 sitcoms and 4 soaps. 35 video clips were used as the testing set. We assume that the input videos always belong to one of the four categories of programs. For the HMM method, the results obtained so far are very encouraging and the approach is quite promising. We used the 26 short (one-minute) video clips in the training set to construct the four HMMs and applied them to both the training and testing sets, which yielded 4 and 5 errors on the training and testing sets, respectively, and an overall accuracy of 85.2%.

6. CONCLUSIONS AND FUTURE WORK

We developed a classification system for TV programs based on text/face tracking. The HMM method represents the video clip as a series of frame labels, and uses the label strings as observation sequences to train HMMs and evaluate the probabilities of the given clip being one of the four categories of TV programs. The main contributions of our research are:
• Face and text trajectory extraction is important in video content description. Faces are people's identities and are very important in capturing human activities in video, while text can directly provide a semantic description of the video content. The combined use of face and text trajectories is an efficient way of video understanding.
• HMMs have been successfully used for video analysis. HMMs have long been a basic tool for time-series signal processing, and in our work we explored their use in video classification, which proved quite effective. This solved the problem of how to efficiently take advantage of the temporal relationships of detected objects over frames without complicating the procedure with a large volume of rules.
• High-level video segmentation. By applying TV program classification, we obtain the ability of high-level video segmentation, which separates different programs. In practical applications, it can search among TV channels to find the types of programs that consumers prefer, or skip the content that they dislike.

As mentioned before, we have applied two different video classification methods: domain-knowledge based and HMM based. The former applies classification based on face and text trajectory patterns within different categories of programs, which are extracted through observation and statistical approaches. The accuracy of that method was 75%, as opposed to 85% for the HMM method.

Our future plans for the video classification system include:
• Classification of other categories of TV programs, including sports, movies, talk shows and financial programs, among others.
• Long video clips. In practical applications of video classification, it is usually required that the system scout TV programs 24 hours a day, giving reports of the categories of the programs at various time resolutions, e.g., one report per minute or per hour. Therefore, we will enable the system to work on longer (one-hour or 24-hour) video clips.

REFERENCES
[1] M. Abdel-Mottaleb and A. Elgammal, "Face Detection in Complex Environments from Color Images," Proc. IEEE Intl. Conference on Image Processing (ICIP-99), Kobe, Japan, October 1999.
[2] L. Agnihotri and N. Dimitrova, "Text Detection in Video Segments," Proc. Workshop on Content-Based Access to Image and Video Libraries (in conjunction with CVPR), Colorado, pp. 109-113, June 1999.
[3] R. Chellappa, C.L. Wilson and S. Sirohey, "Human and Machine Recognition of Faces: A Survey," Proceedings of the IEEE, Vol. 83, No. 5, pp. 705-740, 1995.
[4] Y. Dai and Y. Nakano, "Face-Texture Model Based on SGLD and its Application in Face Detection in a Color Scene," Pattern Recognition, Vol. 29, No. 6, pp. 1007-1017, 1996.
[5] A.G. Hauptmann and M.J. Witbrock, "Story Segmentation and Detection of Commercials in Broadcast News Video," Proc. Advances in Digital Libraries Conference, Santa Barbara, CA, April 22-24, 1998.
[6] S.H. Jeng, H.Y.M. Liao, C.C. Han, M.Y. Chen and Y.T. Liu, "Facial Feature Detection Using a Geometrical Face Model: An Efficient Approach," Pattern Recognition, Vol. 31, No. 3, pp. 273-282, 1998.
[7] T.K. Leung, M.C. Burl and P. Perona, "Finding Faces in Cluttered Scenes Using Random Labeled Graph Matching," Proc. Fifth Intl. Conference on Computer Vision, Cambridge, MA, pp. 637-644, June 1995.
[8] T. McGee, N. Dimitrova and L. Agnihotri, "ADVISO Demonstration - A Video Content Management System," Proc. ACM Multimedia 99, pp. 197-198, 1999.
[9] B. Moghaddam and A. Pentland, "A Subspace Method for Maximum Likelihood Target Detection," Proc. IEEE Intl. Conference on Image Processing, Washington, DC, October 1995.
[10] N.V. Patel and I.K. Sethi, "Compressed Video Processing for Cut Detection," IEE Proceedings: Vision, Image and Signal Processing, Vol. 143, pp. 315-323, October 1996.
[11] L.R. Rabiner and B.H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, pp. 4-15, January 1986.
[12] H.A. Rowley, S. Baluja and T. Kanade, "Neural Network-Based Face Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 1, pp. 23-37, January 1998.
[13] E. Saber and A.M. Tekalp, "Frontal-View Face Detection and Facial Feature Extraction Using Color, Shape and Symmetry Based Cost Functions," Pattern Recognition Letters, Vol. 19, No. 8, pp. 669-680, June 1998.
[14] K.K. Sung and T. Poggio, "Example-Based Learning for View-Based Human Face Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 1, January 1998.

[15] A. Tankus, Y. Yeshurun and N. Intrator, "Face Detection by Convexity Estimation," Pattern Recognition Letters, Vol. 18, pp. 913-922, 1997.
[16] G. Wei, L. Agnihotri and N. Dimitrova, "TV Program Classification Based on Face and Text Processing," Proc. IEEE Intl. Conference on Multimedia and Expo 2000, New York, July 2000.
[17] G. Wei and I.K. Sethi, "Face Detection for Image Annotation," Proc. Pattern Recognition in Practice VI, Vlieland, The Netherlands, 1999.
[18] G. Yang and T.S. Huang, "Human Face Detection in a Complex Background," Pattern Recognition, Vol. 27, No. 1, pp. 53-63, 1994.
[19] K.C. Yow and R. Cipolla, "Feature-Based Human Face Detection," Image and Vision Computing, Vol. 15, pp. 713-735, 1997.

[Fig. 2 a) Training phase of the HMM for news: training news clips undergo frame classification based on text/face tracking, producing observation sequences that are used for HMM training to yield the HMM for news.]

Table 1: Frame classification labels, sorted in descending order of priority. If a frame satisfies the conditions of two classes, the label of higher priority is assigned.

Anchor person text: a person of a shot closer than medium shot, with one or more lines of text underneath
Face-text: person(s) of long shot, with text
Wide close up: one person in wide close-up, without text
Close shot: at least one person of close shot, without text
Many-face: three or more persons of long shot, without text
Two-face: two persons of long-shot size, without text
Medium close face: at least one person of medium close shot
Many-text-line: contains more than four lines of text
Few-text-line: contains 2 to 4 lines of text (without face)
One-text-line: contains one line of text (without face)
Uniform frame: black or white screen with little variation
Shot-start frame: the first frame of a shot
Face only: a person of long shot, without text
No-face-text: neither face nor text detected
Undefined: exception handler

[Fig. 2 b) Classification of a given video clip using HMMs: the clip's observation sequence is fed to the news, commercial, sitcom and soap HMMs, and the category whose HMM gives the maximum response is chosen as the program category.]


Fig. 1 Result of face and text detection.
Fig. 3 Snapshot of the HMM video classification tool.