A Knowledge-Based Approach to Video Content Classification∗

Y. Chen and E. K. Wong
Department of Computer and Information Science
Polytechnic University
5 Metrotech Center
Brooklyn, NY 11201

Abstract

A framework for video content classification using a knowledge-based approach is proposed. This approach is motivated by the fact that videos are rich in semantic content, which can best be interpreted and analyzed by human experts. We demonstrate the concept by implementing a prototype video classification system in the rule-based programming language CLIPS 6.05. Knowledge for video classification is encoded as a set of rules in the rule base. The left-hand sides of rules contain high-level and low-level features, while the right-hand sides contain intermediate results or conclusions. Our current implementation includes features computed from motion, color, and text extracted from video frames. Our current rule set allows us to classify an input video into one of five classes: news, weather reporting, commercial, basketball, and football. We use MYCIN's inexact reasoning method to combine evidence and to handle the uncertainties in the features and in the classification results. We obtained good results in a preliminary experiment, which demonstrates the validity of the proposed approach.

Keywords: video content classification, knowledge-based, rule-based, certainty factors, semantic representation

1. Introduction

The use of low-level features, which is the approach taken by most video retrieval and analysis systems today, has reached a limit in its capabilities. For many videos, humans have a priori knowledge or expectations about their contents. This knowledge is acquired over time through experience in viewing many similar videos of the same type. For example, when viewing the video of a basketball game, the expectations include, among others, fast motion, an orange (or other color) basketball, a wood-colored floor, and periodic displays of game scores on the screen. When properly represented and encoded, this a priori knowledge can be a powerful tool for classifying videos.

A knowledge-based or expert system employs human knowledge to solve problems that ordinarily require human intelligence [1-3]. In most knowledge-based systems, the model of problem solving in the application domain is explicitly represented as a separate entity or knowledge base rather than appearing implicitly as part of the coding of the program [2]. We use a rule-based implementation for the knowledge-based video classification system. One advantage of a rule-based system is the ease of inserting, deleting, and modifying rules when the nature of the video classes changes, or when we add or modify the features used in the rules. This is important if we want to make continuous updates or refinements to the system.



∗ This work was supported by the National Science Foundation under the STIMULATE program (Grant #IIS-9619114).

Further author information – Y. C.: [email protected] E. K. W. (correspondence): [email protected]

The inputs to our proposed system are individual frames from a video. In our current implementation, the knowledge-based system tries to classify an input "frame" into one of five classes: news, basketball, football, advertisement, and weather reporting. A certainty (or confidence) value is assigned to each of the output classes. In the future, we will modify the system to include classification of video "shots," in addition to the classification of video frames.

The knowledge base in our system is a set of rules that contain low- and high-level features computed or inferred from the video data. The left-hand side (LHS) of a rule consists of a combination of different features, and the right-hand side (RHS) represents the conclusion of the rule, which could be an intermediate result or a final video classification. Rather than video features, the LHS could also contain intermediate conclusions or results generated during the inference process.

In the decision-theoretic approach to classification, a set of training data is used to train the parameters of a chosen classifier. For example, in the case of a linear classifier, the parameters are the linear coefficients. This approach has the disadvantage that if the number of output classes changes, the classifier must be re-trained using a modified set of training data. Using the rule-based approach, we only have to insert new rules for the new class, and perhaps also modify some of the existing rules. Updates to the system can therefore be made in a more rapid and efficient manner.

In Section 2, we describe related prior work. In Section 3, we describe feature uncertainty handling and evidence combination. In Section 4, we describe the set of features used in our current implementation. In Section 5, we describe our current rule base and its implementation using the rule-based programming language CLIPS [3]. Experimental results are presented in Section 6. Finally, conclusions and future work are presented in Section 7.

2. Prior Work

In [4, 5, 6], knowledge-based or rule-based approaches were applied to the understanding and interpretation of still images. In [7], a knowledge engineering approach based on probabilistic reasoning is used for image classification. Chen and Chang [8] generated semantic visual templates (SVTs) for video databases, where SVTs represent a user's view of a concept in terms of 2-D visual attributes. In video classification, Fischer et al. [9] investigated automatic recognition of four types of TV programs from temporal patterns of audio and visual cues. Nam et al. [10] exploited multiple audio-visual signatures to create a perceptual relation for conceptually meaningful violent-scene identification. In [11], multimodal features including visual and audio features were integrated for HMM-based video classification. In [12], an HMM was used to classify TV programs based on human faces and text trajectories detected from the video. These prior works on video classification did not use an expert-system approach as we present in this paper, where the knowledge or rules for video classification are explicitly represented as a separate entity.

3. Uncertainty Handling and Evidence Combination

Due to the inherent difficulty of feature extraction from videos, it is not always possible to extract features with 100% reliability. For example, features computed from MPEG motion vectors carry uncertainty.
One source of uncertainty is the inaccuracy of the motion vectors introduced in the estimation process.

In the traditional decision-theoretic approach to classification, a fixed set of features is extracted from the input data, and a classifier is then trained to partition the feature space into separate classes. In this classification process, there is no mechanism to represent and incorporate uncertainties in the extracted features and the classification results. In our approach, a certainty factor (CF), with a value between 0 and 1, is used to represent the uncertainties of the features, the intermediate results, and the final classifications. We use MYCIN's inexact reasoning method [3] to handle the uncertainties in the features and to compute uncertainties in the classification results.

Our proposed approach can be viewed as an evidence-gathering process, where the evidence consists of features (or intermediate results) in the LHSs of rules. Different features (or combinations of features) may lead to the same conclusion; this is represented by rules having the same RHS but different LHSs. Not all of the rules leading to the same conclusion will fire for a given input frame, but each rule that does fire increases the confidence in the conclusion.

As a simplified example, two rules that have football as their RHS could be as follows (these two rules are illustrative examples only and are not in the rule base of our current implementation):

Greenish → Football

High_Motion → Football

CF[Greenish, Football] = 0.5
CF[High_Motion, Football] = 0.6

The first rule says that if the overall color of the frame is greenish with 1.0 certainty, there is a 0.5 certainty that the frame is from a football game. The second rule says that if the global motion is high with 1.0 certainty, there is a 0.6 certainty that the frame is from a football game. When the greenish or high-motion features cannot be extracted with 1.0 certainty (which is often the case), the features are assigned a certainty value less than 1.0 and the certainty in the conclusion is computed as

CF[h] = CF[h, s] × CF[s],

where CF[h] is the certainty in the conclusion h, CF[h, s] is the degree of certainty in h when s is known with 1.0 certainty, and CF[s] is the certainty in the LHS. CF[h, s] is also called the certainty factor of the rule itself.

If either rule fires, the conclusion is football with a computed certainty value. If both rules fire, the certainty in the conclusion will be larger than that from either rule alone, and the following rule for combining evidence, as used in the expert system MYCIN [3], is applied:

Combining evidence. Rules: s1 → h, s2 → h, where s1 and s2 are the LHSs and h is the RHS.
Combining method: CF[h, s1 and s2] = CF[h, s1] + CF[h, s2](1 − CF[h, s1]).

After many cycles of execution, the certainty in a conclusion (e.g., that the input frame belongs to the football class) grows to a large value once many features (or pieces of evidence) supporting football have been gathered, i.e., after many rules with football on the RHS have fired. Using this approach, not all the features supporting an output class need to be present, nor do they need to be extracted with 100% certainty; but the more features that can be extracted, and the more certain those features are, the higher the confidence we have in the conclusion.

The above two rules are for illustrative purposes. Other rules in our rule base are more complicated, often with the LHS containing a combination of features (compound features). When the LHS is a compound feature, the following combining methods for certainty factors are used:

Conjunction rule: s1 ∧ s2 → h. Combining method: CF[s1 ∧ s2] = min(CF[s1], CF[s2]).
Disjunction rule: s1 ∨ s2 → h. Combining method: CF[s1 ∨ s2] = max(CF[s1], CF[s2]).
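For concreteness, the following Python sketch codes the certainty-factor operations described above and applies them to the two illustrative football rules. It is an illustration only, not part of our implementation (which uses CLIPS with external C++ feature routines); the evidence CFs of 0.8 and 0.9 are hypothetical.

```python
def cf_conclusion(cf_rule: float, cf_lhs: float) -> float:
    """CF[h] = CF[h, s] * CF[s]: attenuate the rule's CF by the LHS certainty."""
    return cf_rule * cf_lhs

def cf_combine(cf1: float, cf2: float) -> float:
    """MYCIN combination for two rules supporting the same conclusion (CFs in [0, 1])."""
    return cf1 + cf2 * (1.0 - cf1)

def cf_and(*cfs: float) -> float:
    """Conjunctive LHS: CF[s1 ^ s2] = min(CF[s1], CF[s2])."""
    return min(cfs)

def cf_or(*cfs: float) -> float:
    """Disjunctive LHS: CF[s1 v s2] = max(CF[s1], CF[s2])."""
    return max(cfs)

if __name__ == "__main__":
    # Hypothetical evidence: greenish extracted with CF 0.8, high motion with CF 0.9.
    cf_football_1 = cf_conclusion(0.5, 0.8)   # rule: Greenish -> Football
    cf_football_2 = cf_conclusion(0.6, 0.9)   # rule: High_Motion -> Football
    print(cf_combine(cf_football_1, cf_football_2))  # 0.4 + 0.54 * (1 - 0.4) = 0.724
```

Note that the combination is order-independent and never exceeds 1.0, so additional supporting rules can only raise the confidence toward, but never past, complete certainty.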

4. Features In Our Current Implementation

In our current implementation, the features in the LHS of rules include those computed from the motion, color, text, and cut rate of the video. We plan to include audio features, as well as other visual features, in future implementations. Features used in the LHS of rules can be high-level features with semantic meanings. It is a challenging task to translate high-level concepts into low-level image processing tasks, or into a form that can be quantitatively computed. For example, the concept fast motion (which is more intuitive to humans) is translated into high-motion-magnitude, which can be computed as a large percentage of motion vectors with large magnitude in the motion vector field. For many of the features, we also form symbolic feature descriptors to represent the magnitude, or certain characteristics, of the computed feature. For example, high-motion-activity says that the computed feature motion-activity has a high value, whereas text-upper-right-corner says that a detected text box is located in the upper-right corner of a frame. We found these symbolic descriptions of features easier and more intuitive for a human to reason with and to develop rules from. In the following, we describe each feature type in our current implementation in greater detail. The feature set, as well as the rules in our knowledge base, will expand and evolve as we continue to build our knowledge-based system.

4.1 Motion

Two types of motion features, motion-magnitude and motion-activity, are computed. Motion vectors for a video frame are extracted directly from the MPEG-1 file of the video. A histogram of the motion magnitudes is then obtained, with the magnitude defined as the sum of the absolute values of the x and y components. Instead of using numbers to represent motion magnitude, three levels of motion-magnitude are defined symbolically: low-motion-magnitude, medium-motion-magnitude, and high-motion-magnitude. The histogram is first divided into three non-overlapping intervals: low, medium, and high. The three levels are computed as the percentage of non-zero bins in the corresponding intervals of the magnitude histogram. For example, the feature descriptor low-motion-magnitude is computed as the percentage of non-zero bins in the low interval.

We define motion-activity as the number of non-zero motion vectors over the total number of motion vectors (or macro-blocks) in a frame. It has a high value when most parts of a frame are involved in some kind of motion, and a low value when most parts of the frame are stationary. Three levels of motion activity are also defined symbolically: low-motion-activity, medium-motion-activity, and high-motion-activity.
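The sketch below illustrates one possible computation of these motion descriptors from a decoded motion vector field. It is shown in Python for illustration only; the bin count, magnitude range, and equal three-way split of the histogram are assumptions made for the example and are not values from our implementation.

```python
import numpy as np

def motion_features(mvs: np.ndarray, n_bins: int = 30, max_mag: float = 64.0):
    """Compute motion-magnitude levels and motion-activity from a motion vector field.

    mvs: array of shape (num_macroblocks, 2) holding (dx, dy) per macro-block.
    The bin count, magnitude range, and interval split are illustrative assumptions.
    """
    # Magnitude = sum of the absolute values of the x and y components (Section 4.1).
    mag = np.abs(mvs[:, 0]) + np.abs(mvs[:, 1])

    # Histogram of magnitudes, split into three non-overlapping intervals.
    hist, _ = np.histogram(mag, bins=n_bins, range=(0.0, max_mag))
    low, med, high = np.array_split(hist, 3)

    # Each level = percentage of non-zero bins within its interval.
    def pct_nonzero(h):
        return float(np.count_nonzero(h)) / len(h)

    levels = {
        "low-motion-magnitude": pct_nonzero(low),
        "medium-motion-magnitude": pct_nonzero(med),
        "high-motion-magnitude": pct_nonzero(high),
    }

    # motion-activity = fraction of non-zero motion vectors in the frame.
    activity = float(np.count_nonzero(mag)) / len(mag)
    return levels, activity
```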

4.2 Color

Two types of color features are defined: colorfulness and P-MPC. A frame is first decoded into a 24-bit color image with R, G, and B values for each pixel. The color histogram is then uniformly quantized into 64 bins in the RGB color space. The feature colorfulness is defined as the number of non-zero bins divided by the total number of bins; it represents the number of different colors present in a video frame. The feature P-MPC is defined as the percentage of pixels having the most prominent color (P-MPC) in the frame. This feature has a high value in a video frame where a majority of the pixels belong to a single dominant color. For example, in news broadcast scenes with an anchorperson sitting in front of the camera, a majority of the pixels in the frame belong to the background color, which is usually uniform. A local peak detection algorithm is applied to the quantized histogram to find the most prominent color. Note that we do not use the most prominent color itself directly as a feature, because it is sensitive to lighting conditions and can vary from video shot to video shot, including shots belonging to the same output class. It happens frequently that video shots that visually appear to have the same most prominent color actually have peaks located differently in the color histogram. As with the motion features, we define three symbolic descriptors, representing three levels (low, medium, and high), for each of the two color features.
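One possible computation of the two color features is sketched below. The 4x4x4 RGB quantization follows the 64-bin description above; the peak-detection step is simplified to the global histogram maximum, which is an assumption made for the example rather than the local peak detector we actually use.

```python
import numpy as np

def color_features(frame_rgb: np.ndarray):
    """Compute colorfulness and P-MPC from a decoded 24-bit RGB frame (H x W x 3, uint8)."""
    # Uniformly quantize each channel to 4 levels -> 4 * 4 * 4 = 64 bins.
    q = (frame_rgb // 64).astype(np.int32)                # values 0..3 per channel
    bin_idx = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]  # flat bin index 0..63
    hist = np.bincount(bin_idx.ravel(), minlength=64)

    # colorfulness = fraction of histogram bins that are non-empty.
    colorfulness = float(np.count_nonzero(hist)) / 64.0

    # P-MPC = percentage of pixels falling in the most prominent color bin
    # (the global maximum here; the paper uses a local peak detector).
    p_mpc = float(hist.max()) / bin_idx.size
    return colorfulness, p_mpc
```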

4.3 Text

The text extraction algorithm in [13] is applied to the decoded color image to extract text that is superimposed on the video. The algorithm produces individual text boxes that enclose text strings, along with the x-y coordinates of their upper-left and lower-right corners. Several symbolic descriptors that are useful for classification are derived from the detected text: text-num, no-text, title, caption-at-bottom, caption-sub-window, text-upper-left-corner, text-upper-right-corner, text-lower-left-corner, and text-lower-right-corner. These symbolic descriptors are used in the LHS of rules. The descriptor text-num is defined as the total number of text boxes in a frame; from it, three additional descriptors, low-text-num, medium-text-num, and high-text-num, are used to quantify the number of text boxes in a frame. The descriptor no-text means there is no text in the frame. The descriptor title indicates a text box at the top of a frame satisfying certain constraints on its width-height ratio. The descriptor caption-at-bottom indicates a text box with a certain width-height ratio at the bottom of a frame. The descriptor caption-sub-window represents a small window appearing inside a frame with text serving as the window's caption; it appears quite frequently in news broadcasts. The descriptors text-upper-left-corner, text-upper-right-corner, text-lower-left-corner, and text-lower-right-corner describe the positions of text boxes in the frame. At this time, we use a certainty value of 1.0 for these text features. More realistically, the text features should carry an uncertainty that depends on the performance of the text detection algorithm; we intend to include this in the future.
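The positional descriptors could be derived from the text-box coordinates roughly as in the sketch below. This is an illustration only: the corner regions (one third of the frame on each side) and the 5:1 width-height ratio are assumed thresholds, since the paper does not specify the exact geometric constraints.

```python
def text_position_descriptors(boxes, frame_w, frame_h):
    """Derive symbolic text descriptors from detected text boxes.

    boxes: list of (x1, y1, x2, y2) upper-left / lower-right corners.
    The corner regions and aspect-ratio threshold below are assumptions.
    """
    desc = {
        "text-num": len(boxes), "no-text": len(boxes) == 0,
        "title": False, "caption-at-bottom": False,
        "text-upper-left-corner": False, "text-upper-right-corner": False,
        "text-lower-left-corner": False, "text-lower-right-corner": False,
    }
    for (x1, y1, x2, y2) in boxes:
        w, h = x2 - x1, y2 - y1
        wide = h > 0 and (w / h) >= 5.0          # caption-like width-height ratio
        desc["title"] |= wide and y2 < frame_h / 3
        desc["caption-at-bottom"] |= wide and y1 > 2 * frame_h / 3
        # Corner descriptors: box center falls in one of the four corner regions.
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        left, right = cx < frame_w / 3, cx > 2 * frame_w / 3
        top, bottom = cy < frame_h / 3, cy > 2 * frame_h / 3
        desc["text-upper-left-corner"] |= left and top
        desc["text-upper-right-corner"] |= right and top
        desc["text-lower-left-corner"] |= left and bottom
        desc["text-lower-right-corner"] |= right and bottom
    return desc
```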

4.4 Cut-Rate

The feature cut-rate is defined as the number of shot boundaries per second, computed over a period of time. Three levels of cut-rate are also defined symbolically in our production rules: low, medium, and high. In our experiment, we found cut rate to be a very effective feature for discriminating among the five classes.
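For example, given the timestamps of detected shot boundaries, cut-rate could be computed as a simple count over a trailing window; the 10-second window in this sketch is an assumption, not a value from our implementation.

```python
def cut_rate(boundary_times, t, window_sec=10.0):
    """Number of shot boundaries per second in the window ending at time t (seconds)."""
    n = sum(1 for b in boundary_times if t - window_sec < b <= t)
    return n / window_sec
```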

4.5 Feature Uncertainties

The uncertainties for features or symbolic feature descriptors were obtained in two different ways. Some were assigned based on the experience of the authors after viewing many video sequences. Many were computed by analyzing the reliability of the extracted features using a set of training data. For the latter, we first manually label each frame in the training set with the symbolic feature descriptors that fit. We then compute feature values for each frame in the training set. A distribution function for each symbolic feature descriptor, with frequency on the vertical axis and computed feature values on the horizontal axis, is then obtained and normalized with respect to the function maximum. The normalized distribution function is then used to map a computed feature value into a certainty value (between 0 and 1) for the feature descriptor. For example, for the feature P-MPC (Percentage of Most Prominent Color), three normalized distribution functions were obtained for the symbolic descriptors low-P-MPC, medium-P-MPC, and high-P-MPC (see Figures 1-3), where the horizontal axis represents computed P-MPC values. An input value of 0.4 for feature P-MPC would be mapped to certainty values of 0.1, 0.8, and 0.0 for the feature descriptors low-P-MPC, medium-P-MPC, and high-P-MPC, respectively.
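The mapping from a computed feature value to a descriptor certainty could be implemented as in the sketch below, which follows the histogram-based construction described above. The bin count is an assumption, and the example usage at the end is hypothetical.

```python
import numpy as np

def build_cf_mapper(training_values, n_bins=20, value_range=(0.0, 1.0)):
    """Build a normalized distribution function for one symbolic descriptor.

    training_values: computed feature values of the training frames manually labeled
    with this descriptor. Returns a function mapping a feature value to a CF in [0, 1].
    """
    hist, edges = np.histogram(training_values, bins=n_bins, range=value_range)
    norm = hist / hist.max()  # normalize with respect to the function maximum

    def cf(value: float) -> float:
        # Locate the histogram bin containing the value and return its normalized height.
        idx = int(np.clip(np.searchsorted(edges, value, side="right") - 1, 0, n_bins - 1))
        return float(norm[idx])

    return cf

# Hypothetical usage: map P-MPC = 0.4 to a CF for the descriptor medium-P-MPC.
# medium_cf = build_cf_mapper(medium_pmpc_training_values)
# print(medium_cf(0.4))
```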

5. Our Rule Base and Implementation Using CLIPS

We use a forward-chaining rule-based production system approach for our knowledge-based system [3]. A rule-based production system consists of three basic components: a rule base, an inference engine, and a working memory. The rule base contains the expert knowledge in the form of a set of unordered if-then production rules. The working memory contains the initial facts and the intermediate results produced during execution of the production system. The inference engine matches the facts in the working memory against the rules in the rule base and decides which rule to execute next; it also performs conflict resolution when more than one rule could be executed during an execution cycle. We use the CLIPS (Version 6.05) rule-based programming language for our implementation. CLIPS has forward-chaining inference capability and was designed with the specific goals of high portability, low cost, and easy integration with external systems.

The authors developed the rules in the knowledge base after reviewing many video sequences from the five classes. The general characteristics of each class were first summarized in English, and rules were then developed by analyzing these general characteristics. An example rule for the class news is

low-motion-magnitude & low-colorfulness & high-P-MPC → news

which says that the co-occurrence of low motion magnitude, a low number of colors, and a large percentage of pixels belonging to a single most prominent color implies that the input frame is a news frame. Another rule, for the weather class, is

low-motion-magnitude & low-motion-activity & medium-colorfulness & medium-P-MPC & (no-text | high-text-num) → weather

which says that the co-occurrence of low-motion-magnitude, low-motion-activity, medium-colorfulness, medium-P-MPC, and one of the two text descriptors in parentheses implies a weather frame.
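Our actual rules are written in CLIPS. Purely to illustrate how such rules combine feature certainties into class certainties, the sketch below restates the two example rules as data and applies the certainty-factor operations from Section 3; the rule CFs of 0.7 are hypothetical and are not values from our rule base.

```python
# Illustrative only: a data-driven restatement of the two example rules, not the CLIPS rule base.
RULES = [
    # (AND-ed descriptors, OR-ed descriptors (may be empty), class, rule CF)
    (["low-motion-magnitude", "low-colorfulness", "high-P-MPC"], [], "news", 0.7),
    (["low-motion-magnitude", "low-motion-activity", "medium-colorfulness", "medium-P-MPC"],
     ["no-text", "high-text-num"], "weather", 0.7),
]

def classify(feature_cfs, rules=RULES):
    """feature_cfs: dict mapping symbolic descriptor -> CF in [0, 1]."""
    class_cf = {}
    for conj, disj, label, rule_cf in rules:
        cf_lhs = min(feature_cfs.get(f, 0.0) for f in conj)                       # conjunction: min
        if disj:
            cf_lhs = min(cf_lhs, max(feature_cfs.get(f, 0.0) for f in disj))      # disjunction: max
        cf_new = rule_cf * cf_lhs                                                 # CF[h] = CF[h, s] * CF[s]
        prev = class_cf.get(label, 0.0)
        class_cf[label] = prev + cf_new * (1.0 - prev)                            # MYCIN combination
    return class_cf   # classes with no firing rules keep CF 0.0
```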

6. Experimental Results

We conducted an experiment to test the performance of the prototype rule-based system. The experiment was run on a PC workstation with a Pentium II 400 MHz CPU. We used 10 video sequences for testing; the number of frames per sequence ranges from 924 to 1,790. In the current implementation, feature extraction was done using external C++ routines, and the extracted features are then fed into the CLIPS program. Figure 4 shows an example frame from the news video sequence. The result of the text detection algorithm is shown in Figure 5. Figure 6 shows the forward motion vectors extracted from MPEG-1, and Figure 7 shows the result after filtering inconsistent motion vectors.

The output of the rule-based system consists of, for each frame, a confidence value for each of the five classes. A confidence value of 1.0 indicates complete confidence in a class, and a value of 0.0 indicates a complete lack of confidence. Table 1 shows an example output for a news frame. To calculate the recognition rate, we chose the class with the maximum confidence value for each frame and then computed the overall correct classification rate by comparison with ground truth obtained from visual classification of the frames. The results are shown in Table 2. Except for the football class, which has a classification rate of 62.7%, the results were reasonably good for the other four classes. We are investigating new rules to improve the performance of football classification. We expect the overall results to improve in the future when we include rules for audio and other visual features.
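The evaluation step described above amounts to taking the class with the maximum confidence for each frame and comparing it with the ground-truth label; a minimal sketch of that computation, assuming per-frame class CFs and ground-truth labels are available, is shown below.

```python
def classification_rate(per_frame_cfs, ground_truth):
    """per_frame_cfs: list of dicts mapping class -> CF; ground_truth: list of class labels."""
    correct = sum(
        1 for cfs, truth in zip(per_frame_cfs, ground_truth)
        if max(cfs, key=cfs.get) == truth            # class with the maximum confidence
    )
    return 100.0 * correct / len(ground_truth)
```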

7. Conclusion and Future Work

Our experiment has demonstrated that the proposed approach is a viable method for video content classification. In addition to low-level features, many high-level features with semantic meanings can be incorporated into rules in a natural and intuitive manner. An advantage of the rule-based approach is the flexibility it provides in updating and expanding the knowledge base. For future work, we intend to expand our feature set to include audio features, as well as additional visual features such as text duration. We will investigate rules that integrate features from the visual and audio modalities. Our current implementation handles uncertainty in the input features, with the rules in the rule base assigned a certainty value of 1.0; we intend to investigate methods for estimating the certainty values of the rules themselves. We will also use a larger data set for training and testing. Finally, we will add a mechanism for video shot classification to our implementation, in addition to video frame classification.

References

[1] F. Hayes-Roth, "The knowledge-based expert system: A tutorial," IEEE Computer, September 1984, pp. 11-27.
[2] D. S. Nau, "Expert computer systems," IEEE Computer, February 1983, pp. 63-84.
[3] J. Giarratano and G. Riley, Expert Systems: Principles and Programming, PWS Publishing Company, Boston, 1998.
[4] J. H. Kim, E. H. Park, and C. A. Ntuen, "An approach to object knowledge representation for image understanding expert system model," Proc. 22nd Asilomar Conference on Signals, Systems, and Computers, Vol. 2, 1988, pp. 976-980.
[5] T. Tjahjadi and R. Henson, "A knowledge based system for image understanding," Proc. 3rd International Conference on Image Processing and its Applications, 1989, pp. 88-92.
[6] A. Meisels and S. Bergman, "Find objects on aerial photographs: A rule-based low level system," Proc. Computer Vision and Pattern Recognition, 1988, pp. 118-122.
[7] S. Paek and S.-F. Chang, "A knowledge engineering approach for image classification based on probabilistic reasoning systems," IEEE International Conference on Multimedia and Expo 2000, New York, NY, July 30-Aug. 2, 2000.
[8] W. Chen and S.-F. Chang, "Generating semantic visual templates for video databases," IEEE International Conference on Multimedia and Expo 2000, New York, NY, July 30-Aug. 2, 2000.
[9] S. Fischer, R. Lienhart, and W. Effelsberg, "Automatic recognition of film genres," Proc. 3rd ACM International Conference on Multimedia, San Francisco, CA, Nov. 5-9, 1995, pp. 295-304.

[10] J. Nam, M. Alghoniemy, and A. H. Tewfik, "Audio-visual content-based violent scene characterization," Proc. IEEE International Conference on Image Processing, Chicago, IL, Oct. 4-7, 1998, pp. 353-357.
[11] J. Huang, Z. Liu, Y. Wang, Y. Chen, and E. K. Wong, "Integration of multimodal features for video classification based on HMM," Proc. 1999 IEEE Third Workshop on Multimedia Signal Processing, Copenhagen, Denmark, Sept. 13-15, 1999, pp. 53-58.
[12] G. Wei, L. Agnihotri, and N. Dimitrova, "TV program classification based on face and text processing," IEEE International Conference on Multimedia and Expo 2000, New York, NY, July 30-Aug. 2, 2000.
[13] E. K. Wong and M. Chen, "A robust algorithm for text extraction in color video," Proc. 1st IEEE International Conference on Multimedia and Expo, New York, NY, July 30-Aug. 2, 2000.

Fig. 1 Distribution Function of low-P-MPC

Fig. 2 Distribution Function of medium-P-MPC

Fig. 3 Distribution Function of high-P-MPC

Fig. 4 Original TV Image

Fig. 5 Text Box Detected from Fig. 4

Fig. 6 Forward Motion Vectors for Fig. 4

Fig. 7 Filtered Motion Vectors for Fig. 4

Table 1. Classification Results for a Sample news Frame

Type                 CF*
news                 0.982273
weather reporting    0.024242
commercial           0.024242
basketball           0.018182
football             0.000000

* CF represents the certainty value of an output class

Table 2. Classification Rate for the Five Output Classes

Class                Number of Frames    Classification Rate (%)
news                 5023                89.1
weather reporting    3625                81.5
commercial           3179                85.9
basketball           1552                70.4
football             1663                62.7