Automatic Annotation and Semantic Retrieval of Video Sequences using Multimedia Ontologies Marco Bertini, Alberto Del Bimbo, Carlo Torniai Università di Firenze - Italy
[email protected],
[email protected],
[email protected]
ABSTRACT Effective use of multimedia digital libraries requires efficient tools for content annotation and retrieval. MOM (Multimedia Ontology Manager) is a complete system that allows the creation of multimedia ontologies, supports automatic annotation and the creation of extended text (and audio) commentaries of video sequences, and permits complex queries by reasoning on the ontology.
Categories and Subject Descriptors H.3.7 [Information Storage and Retrieval]: Digital Libraries; H.2.4 [Systems]: Multimedia databases
General Terms Algorithms
Keywords Video databases, Video annotation, Content-based retrieval, Multimedia ontology, RDF, OWL
1. INTRODUCTION AND PREVIOUS WORK
Ontologies are formal and explicit representations of domain knowledge, typically expressed with linguistic terms that include concepts, concept properties, and relationships between concepts. In recent years several standard description languages for the expression of concepts and relationships in domain ontologies have been defined; among these the most important are the Resource Description Framework Schema (RDFS), the Web Ontology Language (OWL) and, for multimedia, the XML Schema in MPEG-7. Using these languages, metadata can be fitted to specific domains and purposes while remaining interoperable and capable of being processed by standard tools and search systems. Ontologies can effectively be used to perform semantic annotation of video content, either by manually associating the terms of the ontology with the individual elements of the video, or by associating the terms of the ontology with appropriate knowledge models that encode the spatio-temporal combination of low- and mid-level visual features. Once these models are checked, video entities are annotated with the concepts of the ontology. Several examples of automatic semantic annotation systems have been presented recently. In [9] MPEG motion vectors, playfield shape and player positions have been used with Hidden Markov Models to annotate clips
of soccer highlights; in [1] Finite State Machines have been employed to detect the principal soccer highlights, such as shot on goal, placed kick, forward launch and turnover, from a few visual cues; Yu et al. [12] have used the ball trajectory to detect the main actions, such as touching and passing. A domain-specific linguistic ontology with multilingual lexicons, supporting cross-document merging and reasoning, has been presented in [10] to automatically create a semantic annotation of soccer video sources. The possibility of extending linguistic ontologies into multimedia ontologies has been suggested in [8], where text information available in videos and visual features are extracted and manually assigned to concepts, properties, or relationships in the ontology. The basic idea behind multimedia ontologies is that the concepts and categories defined in a predefined linguistic ontology are not rich enough to fully describe the diversity of visual events and elements that appear in a video. Although linguistic terms are appropriate to distinguish event and object categories, they are inadequate to describe specific patterns of events or video entities. To this end, high-level concepts, expressed through linguistic terms, and pattern specifications, represented instead through visual or auditory concepts, should both be organized into new extended ontologies that couple linguistic terms with visual/audio information. Other notable experiences have been reported in [2], [11], [5] and [3]. In [2] perceptual knowledge is discovered by grouping annotated images into clusters based on their visual and text features, while semantic knowledge is extracted by disambiguating the senses of words in annotations using WordNet and image clusters. In [11] a Visual Descriptors Ontology and a Multimedia Structure Ontology, based on MPEG-7 Visual Descriptors and MPEG-7 MDS respectively, are used together with a domain ontology to support content annotation; visual prototype instances are manually linked to the domain ontology. In [5] semantic concepts are defined in an RDF(S) ontology, together with qualitative attributes (e.g. color homogeneity), low-level features (e.g. model components distribution), object spatial relations, multimedia processing methods (e.g. color clustering) and rules to detect video objects. In [3] pictorially enriched ontologies have been introduced for the purpose of automatic video annotation; video clips of highlights are regarded as instances of concepts in the ontology and are directly linked to the corresponding concepts, clustered into subclasses according to their perceptual similarity. Visual concepts are defined as the centers of these clusters, so that each of them is assumed to represent a specific pattern in which the concept can manifest. MOM (Multimedia Ontology Manager) is a new system, developed according to the principles and concepts of pictorially enriched ontologies as defined in [3], [4], that supports
dynamic creation and update of multimedia ontologies, provides facilities to automatically perform annotations and create extended text (and audio) commentaries of video sequences, and allows complex queries on video databases based on the ontology itself. The MOM framework has been developed in the VAPEON project, as part of the activities of the DELOS Network of Excellence on Digital Libraries (Contract G038-507618) in the Information Society Technologies (IST) Program of the European Commission. In the following we expound some basic principles and technical details of the MOM implementation and provide some performance figures for the annotation and retrieval of clips of soccer highlights. The paper is organized as follows: the creation of multimedia ontologies with MOM is discussed in Sect. 2; automatic annotation and the creation of commentaries for long video sequences are presented in Sect. 3; Sect. 4 shows how ontology-based reasoning enables effective semantic retrieval by content of video sequences by means of complex queries; conclusions are drawn in Sect. 5.
2. MULTIMEDIA ONTOLOGY CREATION
Multimedia ontologies created with MOM are expressed in the OWL standard. The linguistic part of the ontology is composed of a number of classes that express the main concepts of the domain (e.g. actors, objects, facts and actions, highlights, ...) and their relationships. The extended multimedia ontology is created by linking video sequences as instances of concepts in the linguistic ontology and performing an unsupervised Fuzzy C-Means clustering of the instance clips (Fig. 1). The visual features used for clustering clips that represent soccer highlights are domain-specific descriptors computed from spatio-temporal combinations of low-level features. The centers of the clusters are regarded as visual concepts, each representing a specific pattern in which the highlight can manifest. A special class (Undetected highlight) is also created, which holds all the clips that cannot be classified as highlight instances with a pre-defined confidence. The visual features used to describe the visual concepts of the soccer highlights are: i) the playfield area, ii) the number of players in the upper part of the playfield, iii) the number of players in the lower part of the playfield, iv) the motion intensity, v) the motion direction, vi) the motion acceleration. For each clip a feature vector with six distinct components is created, one for each descriptor used; each component is a vector with as many elements as the number of frames in the clip, holding the value of the descriptor for each frame. New highlight clips can be associated with the existing clusters by considering the similarity between their visual features and those of the highlight visual concepts. In this way the centers of the clusters can change as new clips are considered. Undetected highlight clips are then analyzed to check whether some of them can be associated with the updated clusters. It is worth noting that, due to this mechanism, the ontology has a static linguistic part (concepts and their relations are fixed and reflect the agreed description of the domain) and a visual part that is instead subject to change (the centers of the clusters, i.e. the visual concepts, change as new knowledge is presented to the system). In Fig. 1 part of the ontology and one of the clusters associated with the shot on goal highlight are shown as visualized by MOM. Clips are presented with their keyframes. The visual concept representing the typical highlight pattern is positioned at the center of the cluster. For our experiments, discussed in the following sections, the linguistic part of the soccer highlight ontology was defined manually. The visual part was built from 68 clips of soccer highlights manually annotated as shot on goal (35), forward launch (16) and placed kick (17). After clustering, five visual concepts were obtained for
shot on goal, four for forward launch and three for placed kick, each of which represents a distinct visual pattern of the corresponding highlight. Clip instances were linked to the corresponding highlight concepts in the linguistic part of the ontology. Using the temporal and semantic relations between events defined in the ontology, we have defined special patterns that can be used to refine the annotation and permit the solution of complex temporal queries. For instance the special pattern Video with scored goal has been defined as the occurrence of one of the following cases: a Forward Launch action followed by a Shot on Goal action, followed by a Crowd Cheering event, followed by a Score Change event; a Placed Kick action followed by a Crowd Cheering event, followed by a Score Change event; or a Shot on Goal action followed by a Crowd Cheering event, followed by a Score Change event.
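As an illustration of how the visual part of such an ontology can be built, the following Python sketch clusters per-clip feature vectors with Fuzzy C-Means and keeps the cluster centers as visual concepts. The descriptor layout, the fixed-length resampling of clips, and the confidence threshold for the Undetected highlight class are illustrative assumptions, not details prescribed by MOM.

```python
import numpy as np

DESCRIPTORS = 6   # playfield area, players (upper/lower), motion intensity/direction/acceleration
FRAMES = 50       # assumption: clips resampled to a fixed number of frames

def clip_feature_vector(per_frame_descriptors):
    """Concatenate the six per-frame descriptor series into one flat vector."""
    assert per_frame_descriptors.shape == (DESCRIPTORS, FRAMES)
    return per_frame_descriptors.reshape(-1)

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Minimal unsupervised Fuzzy C-Means; returns cluster centers and memberships."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(n_clusters), size=len(X))      # fuzzy memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]          # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))                         # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Visual concepts for one highlight class (e.g. "shot on goal"): each cluster
# center is kept as a visual concept linked to the linguistic concept.
clips = np.stack([clip_feature_vector(np.random.rand(DESCRIPTORS, FRAMES)) for _ in range(35)])
visual_concepts, memberships = fuzzy_c_means(clips, n_clusters=5)

def annotate(clip_vector, centers, threshold=0.4):
    """Attach a new clip to the closest visual concept, or to 'Undetected highlight'
    (returned as None) when its similarity falls below a hypothetical threshold."""
    d = np.linalg.norm(centers - clip_vector, axis=1) + 1e-9
    score = (1.0 / d) / (1.0 / d).sum()
    best = int(np.argmax(score))
    return best if score[best] >= threshold else None
```

In such a scheme, re-running the clustering (or updating the centers incrementally) after new clips are attached mirrors the behaviour described above, where visual concepts drift as new knowledge is presented to the system.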
3. AUTOMATIC VIDEO ANNOTATION
MOM allows effective automatic semantic annotation of video clips with high-level concepts by checking their similarity with the visual concepts of the ontology. As the similarity with a particular visual concept is assessed, the higher-level concepts linked to it in the ontology are immediately associated with the clip. MOM semantic annotation capability has been extensively tested on MPEG-2 soccer videos from World Championship 2002, European Championship 2000 and 2004, recorded at 25 fps with 720×576 PAL resolution. Annotation is performed in two steps; a detailed description is provided in [4]. In the first step, clips to be annotated are selected from video sequences, by manual or automatic segmentation, and checked to see whether they contain any highlight. In the second, annotation is performed automatically based on the visual and linguistic concepts of the ontology and the relationships defined among them. Based on the initial version of the ontology (containing the training set of concepts described in Sect. 2) we performed automatic video annotation on a set of 242 clips, including 85 shots on goal, 42 forward launches, 43 placed kicks and 72 clips with no highlight. Precision and recall figures of the automatic annotation are reported in Table 1. Most false detections are clips with no highlight that are nonetheless similar to highlights (e.g. slow play close to the goal box erroneously classified as placed kick) or misclassified highlights due to similar detected behaviours (e.g. a forward launch classified as shot on goal because of similar camera motion intensity and direction, or a placed kick classified as shot on goal because the part of the footage where players prepare for the kick is missing).
Figure 1: A visualization of instances of the "Shot on Goal" concept; the visual concept representing the pattern is positioned at the center of the cluster.
Within the multimedia ontology we have defined some "patterns" as temporal sequences of detected actions and events that lead to a more complex event. By reasoning on the ontology, MOM can verify whether a video sequence contains one of the pre-defined special patterns. The RACER [6] description logic reasoner is employed. For each clip the inferred types are evaluated, considering the combinations of detected actions and events (Fig. 2) that are recognized. If a special pattern is recognized, the annotation of the clips and of the video is refined according to the pattern. To this end, the following algorithm (Alg. 1) is used.
Algorithm 1 Refine annotation
  for each clip in the video
    evaluate the inferred type
  for each defined pattern
    for each clip in the video
      evaluate pattern(clip, pattern event)
      if (pattern detected)
        refine annotation of clip
        refine annotation of video
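The Python sketch below illustrates how such pattern-based refinement could work: each pre-defined pattern is a temporally ordered sequence of event types (for instance the Video with scored goal cases defined in Sect. 2), and a video's clip annotations are scanned for an in-order occurrence of the sequence. The data structures and the maximum-gap parameter are illustrative assumptions; in MOM this step is carried out through the RACER reasoner rather than procedural code.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    start: float   # timecode in seconds
    label: str     # inferred type, e.g. "Shot on Goal"

# Hypothetical encoding of the "Video with scored goal" pattern cases.
SCORED_GOAL_PATTERNS = [
    ["Forward Launch", "Shot on Goal", "Crowd Cheering", "Score Change"],
    ["Placed Kick", "Crowd Cheering", "Score Change"],
    ["Shot on Goal", "Crowd Cheering", "Score Change"],
]

def matches_pattern(clips, pattern, max_gap=30.0):
    """Return True if the clips contain the events of `pattern` in temporal order,
    with at most `max_gap` seconds between consecutive events (assumed threshold)."""
    clips = sorted(clips, key=lambda c: c.start)
    idx, last_time = 0, None
    for clip in clips:
        if clip.label == pattern[idx]:
            if last_time is not None and clip.start - last_time > max_gap:
                continue   # too far from the previous event, keep scanning
            last_time = clip.start
            idx += 1
            if idx == len(pattern):
                return True
    return False

def refine_annotation(video_clips):
    """Attach the refined 'Video with scored goal' annotation when a pattern is found."""
    for pattern in SCORED_GOAL_PATTERNS:
        if matches_pattern(video_clips, pattern):
            return "Video with scored goal"
    return None
```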
Figure 3: Automatically generated clip subtitle
Table 1: Precision and recall of highlights classification.
Highlight      Miss   False   Unknown   Precision   Recall
Shot on goal    5%     16%      21%        82%        74%
Placed kick     9%      9%      18%        89%        73%
Fwd. launch    10%     10%      30%        86%        60%
For each clip in a video the inferred types are evaluated according to the possible combinations of detected actions and events (Fig. 2) that are recognized in the clip. In order to recognize a special pattern in a video sequence, the system also evaluates the temporal distance between the events of the series. With MOM, reasoning on patterns and taking into account the visual descriptors of the clips (the playfield zone, the motion intensity of the action and the number of players) permits the automatic construction of extended commentaries of video sequences. For instance, a simple sentence for attack action clips can be represented by an XML template. Taking into account the values of the visual descriptors, a commentary can be obtained such as: "England surprises the defense with a very fast shot toward the goal area". In the case that the clip is recognized as a Scored Goal pattern, the commentary will be "England scores an incredible goal with a fast shot from the mid field area". Commentaries are stored in SRT format and presented as text subtitles (see Fig. 3), or stored as a text file that can be accessed through the web or downloaded to a mobile device.
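As a rough illustration of template-based commentary generation, the sketch below fills a sentence template with the detected pattern and a few visual descriptor values and writes the result as an SRT subtitle entry. The template wording, descriptor thresholds, and field names are hypothetical and are not the XML templates actually used by MOM.

```python
def commentary(team, pattern, motion_intensity, playfield_zone):
    """Fill a (hypothetical) sentence template from the detected pattern and descriptors."""
    speed = "very fast" if motion_intensity > 0.7 else "fast" if motion_intensity > 0.4 else "slow"
    if pattern == "Video with scored goal":
        return f"{team} scores an incredible goal with a {speed} shot from the {playfield_zone}"
    return f"{team} surprises the defense with a {speed} shot toward the goal area"

def srt_entry(index, start_s, end_s, text):
    """Format one SRT subtitle block: index, time range, and text."""
    def ts(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d},{int((t % 1) * 1000):03d}"
    return f"{index}\n{ts(start_s)} --> {ts(end_s)}\n{text}\n"

# Example: a scored-goal clip between 754.0 s and 762.5 s of the video.
text = commentary("England", "Video with scored goal", 0.8, "mid field area")
print(srt_entry(1, 754.0, 762.5, text))
```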
4. SEMANTIC QUERYING
By reasoning over the ontology, MOM allows the expression and solution of complex semantic queries. Queries are expressed in nRQL [7]. As an example, queries such as "find a video sequence with a scored goal after a fault, shortly followed (less than 10 seconds) by a fast attack action" can be easily solved by the MOM reasoner. For each instance of the Clip and Video classes, the reasoner evaluates the inferred types, associates with each instance the proper types considering the actions detected or the patterns contained (for example, a clip is classified as Clip with Attack Action by the reasoner if its has highlight property refers to the Attack Action subclass of the Detected Action class), and finds the video sequences, checking the timecode property of the clips if required, such that the clips of the sequence verify the special pattern of the query. Visual concepts can be used to express more precise queries. With MOM we can retrieve video sequences with some specific highlight pattern and highlights similar to selected visual concepts. An example of such a query is shown in Fig. 4, where it is requested to search for sequences starting with a forward launch and finishing with a shot on goal, and including a placed kick, where all the highlights must be similar to selected examples and the maximum delay between the highlight clips is 600 milliseconds. Using the MOM query interface it is also possible to specify high-level concepts that belong to the linguistic part of the ontology, such as the names of the teams, that are provided by human annotators.
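The sketch below gives an idea of how such a temporal query could be evaluated over the annotated clips once the reasoner has produced their inferred types. It is a hypothetical procedural rendering of the example query ("a scored goal after a fault, shortly followed by a fast attack action"), not actual nRQL syntax; the type names, the motion-intensity threshold for "fast", and the 10-second delay come from the example above or are assumed.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedClip:
    timecode: float              # start time in seconds
    inferred_types: set          # e.g. {"Fault"}, {"Clip with Attack Action"}
    motion_intensity: float = 0.0

def query_scored_goal_after_fault(clips, max_delay=10.0, fast=0.6):
    """Find (fault, attack) pairs where a fast attack action starts within
    `max_delay` seconds of a fault and a scored-goal pattern follows it."""
    clips = sorted(clips, key=lambda c: c.timecode)
    results = []
    for i, fault in enumerate(clips):
        if "Fault" not in fault.inferred_types:
            continue
        for attack in clips[i + 1:]:
            if attack.timecode - fault.timecode > max_delay:
                break
            if ("Clip with Attack Action" in attack.inferred_types
                    and attack.motion_intensity > fast     # hypothetical "fast" threshold
                    and any("Video with scored goal" in later.inferred_types
                            for later in clips if later.timecode >= attack.timecode)):
                results.append((fault, attack))
    return results
```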
Figure 2: A visualization of the detected action subclasses. All the classes on the left of the graph have owl:Thing as ancestor.
Figure 4: MOM query interface and query expression: search for sequences starting with a forward launch and finishing with a shot on goal, and including a placed kick, where all highlights must be similar to selected examples and the maximum delay between the highlight clips is 600 milliseconds.
5. CONCLUSIONS
In this paper we have presented a framework for automatic annotation and retrieval of videos using multimedia ontologies. Annotation of clips is performed by checking their similarity to the visual concepts in the ontology and associating them with the higher-level concepts linked to the most similar one. By reasoning on the ontology, it is also possible to derive extended commentaries and perform complex queries.
Acknowledgments. The authors wish to thank Aldo Claverini, Claudio Orlandi and Davide Silvestre for their contribution to the development of the commentary feature.
6. REFERENCES
[1] J. Assfalg, M. Bertini, C. Colombo, A. Del Bimbo, and W. Nunziati. Semantic annotation of soccer videos: automatic highlights identification. Computer Vision and Image Understanding, 92(2-3):285–305, November-December 2003.
[2] A. Benitez and S.-F. Chang. Automatic multimedia knowledge discovery, summarization and evaluation. IEEE Transactions on Multimedia, Submitted, 2003.
[3] M. Bertini, R. Cucchiara, A. Del Bimbo, and C. Torniai. Video annotation with pictorially enriched ontologies. In Proc. of IEEE Int'l Conference on Multimedia & Expo, 2005.
[4] M. Bertini, A. Del Bimbo, and C. Torniai. Enhanced ontologies for video annotation and retrieval. In Proceedings of ACM MIR, November 2005.
[5] S. Dasiopoulou, V. Mezaris, I. Kompatsiaris, V. K. Papastathis, and M. G. Strintzis. Knowledge-assisted semantic video object detection. IEEE Transactions on Circuits and Systems for Video Technology, 15(10):1210–1224, Oct. 2005.
[6] V. Haarslev and R. Möller. Description of the RACER system and its applications. In Proceedings of the International Workshop on Description Logics (DL-2001), Stanford, USA, August 1–3, pages 131–141, 2001.
[7] V. Haarslev, R. Möller, and M. Wessel. Querying the semantic web with RACER + nRQL. In Proceedings of the KI-2004 International Workshop on Applications of Description Logics (ADL'04), Ulm, Germany, September 24, 2004.
[8] A. Jaimes and J. Smith. Semi-automatic, data-driven construction of multimedia ontologies. In Proc. of IEEE Int'l Conference on Multimedia & Expo, 2003.
[9] R. Leonardi and P. Migliorati. Semantic indexing of multimedia documents. IEEE Multimedia, 9(2):44–51, April-June 2002.
[10] D. Reidsma, J. Kuper, T. Declerck, H. Saggion, and H. Cunningham. Cross document ontology based information extraction for multimedia retrieval. In Supplementary proc. of ICCS03, Dresden, July 2003.
[11] J. Strintzis, S. Bloehdorn, S. Handschuh, S. Staab, N. Simou, V. Tzouvaras, K. Petridis, I. Kompatsiaris, and Y. Avrithis. Knowledge representation for semantic multimedia content analysis and reasoning. In European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, Nov. 2004.
[12] X. Yu, C. Xu, H. Leung, Q. Tian, Q. Tang, and K. W. Wan. Trajectory-based ball detection and tracking with applications to semantic analysis of broadcast soccer video. In ACM Multimedia 2003, volume 3, pages 11–20, Berkeley, CA (USA), Nov. 2003.