Multimedia Ontology Based Computational Framework for Video Annotation and Retrieval

Alberto Del Bimbo and Marco Bertini
Università di Firenze - Italy
Via S. Marta, 3 - 50139 Firenze
[email protected]
Abstract. Ontologies are defined as the representation of the semantics of terms and their relationships. Traditionally, they consist of concepts, concept properties, and relationships between concepts, all expressed in linguistic terms. In order to effectively support video annotation and content-based retrieval, traditional linguistic ontologies should be extended to include structural video information and perceptual elements such as visual data descriptors. These extended ontologies (referred to in the following as multimedia ontologies) should support the definition of visual concepts as representatives of specific patterns of a linguistic concept. While the linguistic part of the ontology embeds permanent and objective items of the domain, the perceptual part includes visual concepts that depend on temporal experience and are subject to change with time and perception. This is why multimedia ontologies have to support the dynamic update of visual concepts, in order to represent the temporal evolution of concepts.
1 Position
The relevance of audio-visual data, and more generally of media other than text, in modern digital libraries has grown in the last few years. Modern digital libraries are expected to include digital content of heterogeneous media, with almost the same percentage of text and visual data, and to a lower extent audio and graphics, with the prospect that visual data will soon play the major role in many specialized contexts. In this scenario, digital video is probably the medium with the highest relevance. Huge amounts of digital video are, in fact, produced daily for news, personal entertainment, educational, and institutional purposes, by broadcasters, media companies, government institutions, and individuals. However, video presents important challenges to digital librarians, mainly due to the size of files, the temporal nature of the medium, and the lack of bibliographic methods that leverage non-textual features. In principle, similarly to images, video can be accessed according to either perceptual or semantic cues of the sequence content. In practice, however, search and retrieval at the syntactic level is impractical. The few examples using the query-by-example (QBE) approach for content-based retrieval at the perceptual level that have been presented for video [1, 2, 3] are not of practical interest: creating dynamic examples is extremely complex and unfeasible for common users.
For this reason, librarians have traditionally indexed video collections with textual data (making explicit the producer's name, the date and time, and a few linguistic concepts that summarize video content at the semantic level), so that they can be accessed almost in the same way as textual documents. Effective examples of retrieval by content of video clips using textual keywords have been presented for the news [4, 5, 6] and sports video domains [7, 8], among many others. In this context, the availability of appropriate metadata schemes and cross-language access techniques [9, 10], and of appropriate tools for the organization of concepts into knowledge structures, also referred to as ontologies [11], is of crucial importance to support effective access to video content for annotation and content-based retrieval [12].

Ontologies are defined as the representation of the semantics of terms and their relationships. Traditionally, they consist of concepts, concept properties, and relationships between concepts, all expressed in linguistic terms. In particular, video ontologies can describe either the video content domain - in this case they are static descriptions of the entities and highlights displayed in the video and of their relationships, as codified by human experience - or the structure of the media - in this case they describe the component elements of the video, the operations allowed on its parts, and the low-level video descriptors that characterize their content. However, traditional domain ontologies are substantially inadequate to support complete annotation and retrieval by content of video documents. Concepts and categories expressed in linguistic terms are not rich enough to fully describe the diversity of the visual events present in a video sequence, and cannot support video annotation and retrieval down to the level of detail of a pattern specification.

A simple example may help to clarify this point. Consider the video of a soccer game, and attack-action highlights. The highlights that can be classified as attack actions exhibit many different patterns. The patterns may differ from each other in the playfield zone where the action takes place, the number of players involved, the players' motion direction, the speed and acceleration of the key player, and so on. Although we are able to distinguish between them, and to cluster attack actions into distinct classes, if we wanted to express each pattern in linguistic terms we would need a complex sentence explaining the way in which the action developed. Such a sentence would have to translate the user's visual memory of the action into a conceptual representation where concepts are concatenated according to spatio-temporal constraints. In this translation, some visual data would be lost (we typically make a synthesis that retains only the presumably most significant elements), some facts would not be reported, and, probably most important, the appropriate words to distinguish one pattern from another might simply not be found.
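To make the point concrete, the following Python sketch shows one possible numeric encoding of such an attack-action pattern. It is purely illustrative: the class, field names, and value conventions are our own assumptions, not part of any system described in this paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AttackActionDescriptor:
    """Hypothetical descriptor of a soccer attack-action highlight.

    A real system would derive these values from playfield detection,
    player tracking, and motion analysis."""
    playfield_zone: int        # index of the playfield zone where the action develops
    num_players: int           # number of players involved in the action
    motion_direction: float    # dominant motion direction of the players, degrees in [0, 360)
    key_player_speed: float    # average speed of the key player (m/s)
    key_player_accel: float    # average acceleration of the key player (m/s^2)

    def to_vector(self) -> List[float]:
        """Flatten the descriptor into a feature vector usable for clustering."""
        return [float(self.playfield_zone), float(self.num_players),
                self.motion_direction, self.key_player_speed, self.key_player_accel]
```

Two attack actions that a commentator would describe with the same linguistic label can yield quite different vectors, which is exactly the level of detail that linguistic terms fail to capture.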
Some early attempts to overcome the inadequacy of traditional linguistic ontologies in modeling domain knowledge down to the level of pattern specification have recognized the need for video domain ontologies to incorporate both conceptual and perceptual elements. The basic idea behind all these works is that, although linguistic terms are appropriate to distinguish between broad event and object categories in generic domains, they are substantially inadequate to describe specific patterns of events or entities, like those represented in a video and, more generally, in any perceptual medium.

In [13], Jaimes et al. suggested categorizing the concepts that relate to perceptual facts into classes, using modal keywords, i.e. keywords that represent perceptual concepts in several categories, such as visual, aural, etc. Classification of the keywords was obtained automatically from speech recognition, queries, or related text. In [14], perceptual knowledge is instead discovered by grouping previously annotated images into clusters, based on their visual and text features, and semantic knowledge is extracted by disambiguating the senses of the words in the annotations with WordNet and the image clusters. Visual prototype instances are then manually linked to the domain ontology. In [15, 16], a linguistic ontology for news videos is used to derive recommendations for the selection of sets of effective visual concepts. A taxonomy of almost 1000 concepts was defined, providing a multimedia controlled vocabulary. The taxonomy was mapped into the Cyc knowledge base [17] and exported to OWL.

Other authors have explored the possibility of extending ontologies so that structural video information and visual data descriptors can also be accounted for. In [18], a Visual Descriptors Ontology and a Multimedia Structure Ontology, based respectively on the MPEG-7 Visual Descriptors and the MPEG-7 Multimedia Description Schema, are used together with a domain ontology to support video content annotation. A similar approach was followed in [19] to describe sport events. In [20], a hierarchy of ontologies was defined for the representation of the results of video segmentation. Concepts were expressed in keywords using an object ontology: MPEG-7 low-level descriptors were mapped to intermediate-level descriptors that identify spatio-temporal objects. In all these solutions, structural and media information are still represented through linguistic terms and fed manually to the ontology. In [21], three separate ontologies, modeling the application domain, the visual data, and the abstract concepts, were used for the interpretation of video scenes. Automatically segmented image regions were modeled through low-level visual descriptors and associated with semantic concepts using manually labeled regions as a training set. In [22], text information available in videos, obtained through automatic speech recognition and manual annotation, and automatically extracted visual features were manually assigned to concepts or properties in the ontology. In [23], qualitative attributes that refer to perceptual properties like color homogeneity, low-level perceptual features like model component distributions, and spatial relations were included in the ontology. Semantic concepts of video objects were derived from color clustering and reasoning. In [24], the authors presented video annotation and retrieval based on high-level concepts derived from machine-learned concept detectors that exploit low-level visual features. Their ontology includes both the semantic descriptions and structure of concepts and their lexical relationships, obtained from WordNet.
Other researchers have proposed integrated multimedia ontologies that incorporate both linguistic terms and visual or auditory data and their descriptors. In [25], a multimedia ontology, referred to as a “pictorially enriched ontology”, was proposed in which concepts with a visual counterpart (like entities, highlights, or events) were modeled with both linguistic terms and perceptual media, like video and images. The perceptual media were associated with descriptors of their structure and appearance. In [26], a taxonomy was defined for video retrieval, where visual concepts were modeled according to MPEG-7 descriptors. Video clips were manually classified according to the taxonomy, and unsupervised clustering was employed to group clips with similar visual content.

However, none of these works has fully exploited the distinguishing aspects, and the consequences, of accounting for both perceptual and conceptual facts in the ontology. Within multimedia ontologies, the linguistic and perceptual parts have substantially different characteristics and behaviors. On the one hand, the linguistic part of the ontology embeds permanent, self-contained, objective items of exact knowledge, which can be taken as concrete and accepted standards by which reality can be considered. On the other hand, the perceptual part includes concepts that are not abstractions, but mere duplicates of things observed in reality, correlated to entities. Differently from the abstract concepts, these concepts depend on temporal experience, and are hence subject to change with time and perception. They are not expressed in linguistic terms but rather by a set of values attributed to the few features agreed to model the perceptual content of the concept. These values have substantially different dynamics than linguistic terms: since they are representatives, or prototypes, of the patterns in which real objects or facts manifest themselves, they may change depending on the patterns observed and their number. In other words, they depend on the “experience” of the context of application. Accordingly, since multimedia ontologies permit linking concepts to their real manifestations, they should adjust the prototypes of the perceptual facts to the changes and modifications that may occur through time, using proper mechanisms for updating the prototypes of perceptual patterns as the result of knowledge evolution. In the contributions reported above, the perceptual elements of the ontology are instead regarded as static entities, similarly to the abstract concepts expressed in linguistic terms: these elements are defined at the time the ontology is created, and no mechanism that supports their modification through time is provided in the ontology. We believe instead that evolutionary multimedia ontology architectures must include both abstract concepts and descriptors of perceptual facts, and must support mechanisms for the temporal evolution and update of the visual concepts, as sketched below.
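At the data level, such an architecture could be pictured with the following minimal Python sketch; the class and field names are hypothetical, and serve only to illustrate how a stable linguistic concept might carry a mutable set of perceptual prototypes alongside its fixed definition.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VisualConcept:
    """Hypothetical ontology node pairing a stable linguistic concept with
    mutable perceptual prototypes (feature vectors produced by clustering)."""
    term: str                   # linguistic label, e.g. "attack action"
    definition: str             # agreed linguistic definition; rarely changes
    prototypes: List[List[float]] = field(default_factory=list)  # evolves with observations

    def add_prototype(self, center: List[float]) -> None:
        """Attach a new pattern prototype (a cluster center) to the concept."""
        self.prototypes.append(center)
```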
We have experimented with a novel ontology architecture for video annotation and search, in which the visual prototype acts as a bridge between the domain ontology and the video structure ontology. While the linguistic terms of the domain ontology are fed manually, according to the agreed view of the context of application, and typically present little or no change through time, the representatives of perceptual facts are fed automatically and may change on the basis of the observations. Unsupervised pattern clustering is the mechanism used to define and update the perceptual part of the ontology. The patterns in which perceptual facts manifest themselves are distinguished from each other by clustering their descriptors, and the centers of the resulting clusters are taken as pattern prototypes in the multimedia ontology, i.e. as representatives of classes of patterns with distinct appearance properties. New patterns analyzed for annotation are regarded as new knowledge of the context: as they are presented to the ontology, they modify the clusters and their centers.

We discuss examples with application to the soccer domain, considering highlight clustering. The effects of adding new knowledge, and particularly the changes induced in the clusters (some patterns can move from one cluster to another, and the cluster centers are redefined accordingly), are discussed and presented with reference to the soccer championship domain. We also discuss how the use of OWL DL for the multimedia ontology description allows a reasoning engine to extend and exploit the knowledge contained in the ontology, in order to perform extended annotation and retrieval by content according to video semantics.
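As a minimal sketch of this mechanism, under the assumption of a simple k-means formulation (the paper does not commit to a specific clustering algorithm, and all names below are hypothetical), the following Python fragment re-clusters the accumulated descriptors whenever new patterns are observed, so that the cluster centers, i.e. the visual prototypes, are redefined and patterns may migrate between clusters:

```python
from typing import List, Optional
import numpy as np
from sklearn.cluster import KMeans

class PrototypeStore:
    """Maintains the perceptual part of the ontology: the observed pattern
    descriptors and the cluster centers used as visual prototypes."""

    def __init__(self, n_prototypes: int = 5):
        self.n_prototypes = n_prototypes
        self.descriptors: List[np.ndarray] = []       # all pattern descriptors seen so far
        self.prototypes: Optional[np.ndarray] = None  # current cluster centers
        self.assignments: Optional[np.ndarray] = None # cluster label of each descriptor

    def observe(self, descriptor: np.ndarray) -> None:
        """Add a newly annotated pattern and re-cluster.

        Re-clustering lets patterns migrate between clusters and redefines
        the cluster centers, i.e. the visual prototypes in the ontology."""
        self.descriptors.append(descriptor)
        if len(self.descriptors) >= self.n_prototypes:
            X = np.vstack(self.descriptors)
            km = KMeans(n_clusters=self.n_prototypes, n_init=10).fit(X)
            self.prototypes = km.cluster_centers_
            self.assignments = km.labels_
```

Each center in prototypes would then be linked in the multimedia ontology, e.g. through a VisualConcept node like the one sketched earlier, to the linguistic concept whose observed patterns it summarizes.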
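To suggest how an OWL encoding supports such extended retrieval, here is a hedged sketch using rdflib; the namespace, class names, and triples are invented for illustration, and whereas a production system would run a full OWL DL reasoner, this fragment only exploits a SPARQL 1.1 property path over rdfs:subClassOf:

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/soccer-ontology#")  # hypothetical namespace

g = Graph()
# A tiny hand-built fragment of a multimedia ontology (illustrative only):
g.add((EX.CounterAttack, RDFS.subClassOf, EX.AttackAction))
g.add((EX.clip42, RDF.type, EX.CounterAttack))

# Retrieve all clips annotated with AttackAction or any of its subclasses.
query = """
PREFIX ex: <http://example.org/soccer-ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?clip WHERE {
    ?clip a ?cls .
    ?cls rdfs:subClassOf* ex:AttackAction .
}
"""
for row in g.query(query):
    print(row.clip)  # -> http://example.org/soccer-ontology#clip42
```

Because the query follows the subclass chain, a clip annotated only with the more specific CounterAttack concept is still retrieved by a query on the broader AttackAction, which is the kind of inference-extended retrieval that the reasoning engine enables.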
References

1. Naphade, M.R., Wang, R., Huang, T.S.: Multimodal pattern matching for audiovisual query and retrieval. In: Proc. of SPIE Storage and Retrieval for Media Databases (2001)
2. Zhang, H., Wang, A., Altunbasak, Y.: Content based video retrieval and compression: A unified solution. In: Proc. of the IEEE International Conference on Image Processing (1997)
3. Chang, S.F., Chen, W., Meng, H.J., Sundaram, H., Zhong, D.: VideoQ: An automated content based video search system using visual cues. In: Proc. of the IEEE International Conference on Image Processing (1997)
4. Eickeler, S., Muller, S.: Content-based video indexing of TV broadcast news using hidden Markov models. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp. 2997–3000 (1999)
5. Sato, T., Kanade, T., Hughes, E.K., Smith, M.A.: Video OCR for digital news archive. In: IEEE International Workshop on Content-Based Access of Image and Video Databases (CAIVD '98), pp. 52–60 (1998)
6. Hauptmann, A., Witbrock, M.: Informedia: News-on-demand multimedia information acquisition and retrieval. Intelligent Multimedia Information Retrieval, 213–239 (1997)
7. Ekin, A., Tekalp, A.M., Mehrotra, R.: Automatic soccer video analysis and summarization. IEEE Transactions on Image Processing 12(7), 796–807 (2003)
8. Yu, X., Xu, C., Leung, H., Tian, Q., Tang, Q., Wan, K.W.: Trajectory-based ball detection and tracking with applications to semantic analysis of broadcast soccer video. In: ACM Multimedia 2003, Berkeley, CA (USA), 4-6 November 2003, vol. 3, pp. 11–20 (2003)
9. Marchionini, G., Geisler, G.: The Open Video digital library. D-Lib Magazine 8(11) (December 2002)
10. European Cultural Heritage Online (ECHO). Technical report (2002), http://echo.mpiwg-berlin.mpg.de/home
11. Gruber, T.: Principles for the design of ontologies used for knowledge sharing. Int. Journal of Human-Computer Studies 43(5-6), 907–928 (1995)
12. Athanasiadis, T., Tzouvaras, V., Petridis, K., Precioso, F., Avrithis, Y., Kompatsiaris, Y.: Using a multimedia ontology infrastructure for semantic annotation of multimedia content. In: Proc. of the 5th International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot '05), Galway, Ireland (November 2005)
13. Jaimes, A., Tseng, B., Smith, J.: Modal keywords, ontologies, and reasoning for video understanding. In: Int'l Conference on Image and Video Retrieval (CIVR) (July 2003)
14. Benitez, A., Chang, S.F.: Automatic multimedia knowledge discovery, summarization and evaluation. IEEE Transactions on Multimedia (submitted, 2003)
15. Kender, J., Naphade, M.: Visual concepts for news story tracking: Analyzing and exploiting the NIST TRECVID video annotation experiment. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 1174–1181 (2005)
16. Naphade, M., Smith, J., Tesic, J., Chang, S., Kennedy, L., Hauptmann, A., Curtis, J.: Large-scale concept ontology for multimedia. IEEE Multimedia 13(3), 86–91 (2006)
17. Lenat, D., Guha, R.: Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Addison-Wesley, Reading (1990)
18. Strintzis, J., Bloehdorn, S., Handschuh, S., Staab, S., Simou, N., Tzouvaras, V., Petridis, K., Kompatsiaris, I., Avrithis, Y.: Knowledge representation for semantic multimedia content analysis and reasoning. In: European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology (November 2004)
19. Vembu, S., Kiesel, M., Sintek, M., Baumann, S.: Towards bridging the semantic gap in multimedia annotation and retrieval. In: Proc. First International Workshop on Semantic Web Annotations for Multimedia (SWAMM), Edinburgh (Scotland) (May 2006)
20. Mezaris, V., Kompatsiaris, I., Boulgouris, N., Strintzis, M.: Real-time compressed-domain spatiotemporal segmentation and ontologies for video indexing and retrieval. IEEE Transactions on Circuits and Systems for Video Technology 14(5), 606–621 (2004)
21. Simou, N., Saathoff, C., Dasiopoulou, S., Spyrou, E., Voisine, N., Tzouvaras, V., Kompatsiaris, I., Avrithis, Y., Staab, S.: An ontology infrastructure for multimedia reasoning. In: Proc. International Workshop VLBV 2005, Sardinia (Italy) (September 2005)
22. Jaimes, A., Smith, J.: Semi-automatic, data-driven construction of multimedia ontologies. In: Proc. of IEEE Int'l Conference on Multimedia & Expo (2003)
23. Dasiopoulou, S., Mezaris, V., Kompatsiaris, I., Papastathis, V.K., Strintzis, M.G.: Knowledge-assisted semantic video object detection. IEEE Transactions on Circuits and Systems for Video Technology 15(10), 1210–1224 (2005)
24. Snoek, C., Huurnink, B., Hollink, L., de Rijke, M., Schreiber, G., Worring, M.: Adding semantics to detectors for video retrieval. IEEE Transactions on Multimedia (pending minor revision, 2007)
25. Bertini, M., Cucchiara, R., Del Bimbo, A., Torniai, C.: Video annotation with pictorially enriched ontologies. In: Proc. of IEEE Int'l Conference on Multimedia & Expo (2005)
26. Grana, C., Bulgarelli, D., Cucchiara, R.: Video clip clustering for assisted creation of MPEG-7 pictorially enriched ontologies. In: Proc. Second International Symposium on Communications, Control and Signal Processing (ISCCSP), Marrakech, Morocco (March 2006)