ICBR – Multimedia management system for Intelligent Content Based Retrieval

Janko Ćalić1, Neill Campbell1, Majid Mirmehdi1, Barry T. Thomas1, Ron Laborde2, Sarah Porter2, Nishan Canagarajah2

1 Department of Computer Science, 2 Department of Electrical & Electronic Engineering,
University of Bristol, Merchant Venturers Building, Woodland Road, Bristol BS8 1UB, UK
{janko.calic, neill.campbell, m.mirmehdi, barry.thomas, ron.laborde, sarah.porter, nishan.canagarajah}@bris.ac.uk

Abstract. This paper presents a system designed for the management of multimedia databases that embarks upon the problem of efficient media processing and representation for automatic semantic classification and modelling. Its objectives are founded on the integration of a large-scale wildlife digital media archive with manually annotated semantic metadata organised in a structured taxonomy and media classification system. Novel techniques will be applied to temporal analysis, intelligent key-frame extraction, animal gait analysis, semantic modelling and audio classification. The system demonstrator will be developed as part of the ICBR project within the 3C Research programme of convergent technology research for digital media processing and communications.

1. Introduction

The overwhelming growth of multimedia information in private and commercial databases, as well as its ubiquity throughout the World Wide Web, presents new research challenges in computing, data storage, retrieval and multimedia communications. Given the user's need to handle this vast body of multimedia information intuitively, the development of content-based multimedia indexing and retrieval has moved into the spotlight of multimedia and computer vision research. However, the evolution of a functional multimedia management system is hindered by the "semantic gap": the discontinuity between the simplicity of content descriptions that can currently be computed automatically and the richness of semantics in the queries users pose for media search and retrieval [1]. In order to bridge that gap, it is essential to focus our research activities on the knowledge behind the links that connect perceptual features of the analysed multimedia and their meaning. Thus, a large-scale system that merges the semantic, text-based retrieval approach to multimedia databases with content-based feature analysis, and investigates the links of signification between them, ought to be the next milestone of research in the multimedia management field.


The ICBR (Intelligent Content-based Retrieval) project aims at developing a large-scale centralised multimedia server that enables large quantities of media to be stored, indexed, searched, retrieved, transformed and transmitted within a framework encompassing issues related to multimedia management, content analysis and semantic multimedia retrieval. The system is designed to be large enough to test the problems associated with real-world demands for such media content with regard to bandwidth, processing power, storage capacity and user requirements. Moreover, ICBR brings a unique opportunity to tackle semantic-gap issues by integrating a large multimedia database with the media's semantic description organised in a structured taxonomy.

2. Related Work

In the first generation of visual retrieval systems, attributes of visual data were extracted manually, entailing a high level of content abstraction. Though operating at a conceptual level, search engines worked only in the textual domain and the cost of annotation was typically very high. Second-generation systems addressed perceptual features such as colour, texture, shape and spatial relationships, obtaining fully automated numeric descriptors from objective measurements of the visual content and supporting retrieval by content based on combinations of these features. A key problem with second-generation retrieval systems was the semantic gap between the system and its users. Query-by-example retrieval is a typical technique that utilises this paradigm [2], [3]. However, retrieval of multimedia is generally meaningful only if performed at high levels of representation based on semantically meaningful categories. Furthermore, human visual cognition is much more concerned with the narrative and discourse structure of media than merely with its perceptual elements [4]. Therefore, considerable research effort has recently been put into developing systems that enable automatic semantic analysis and annotation of multimedia, particularly of video as the most complex medium. Initially, the temporal structure of the media is analysed by tracking spatial-domain sequence features [5] or, more recently, compressed-domain features [6]. A video sequence is summarised by choosing a single key-frame [7] or a structured set of key-frames [8] to represent the content in the best possible way. Low-level features such as colour, texture and shape are extracted and stored in the database as video metadata. Using this information, various methods of media representation have been proposed [9], [10], targeting user-centred retrieval and the problem of the semantic gap. Utilising metadata information, attempts to apply semantic analysis to a limited contextual space have been presented [11]. In order to bring together research and industrial knowledge and develop the large reference content-based retrieval systems essential for real evaluation and experimentation, a number of collaborative projects in the field of content-based multimedia retrieval have been set up: the IST Network of Excellence SCHEMA [12], the reference system COST 211 [13], the digital video library Informedia [14], etc.


3. ICBR Approach

Incorporating a large manually annotated multimedia archive and its structured semantic classification, ICBR has a unique potential to tackle a wide spectrum of research problems in content-based multimedia retrieval. The major guideline of the ICBR development process is the semantic information, considered as the ground truth, upon which the content-analysis algorithms base their internal representations and knowledge. Not only does the semantic annotation of a vast amount of multimedia data enable semantic inference, classification and recognition, but it enhances the feature extraction process as well. This approach imitates human semantic visual cognition, where high-level semantic categories influence the way low-level features are perceived by the visual system. In the initial stage of temporal analysis and content summarisation, production knowledge and ground-truth annotation are utilised in algorithm design to achieve robust results on real-world data. Production information such as camera artefacts, overexposure and camera work, as well as shot composition type and region spatio-temporal relations, is automatically extracted to enable content summarisation and temporal representation at a higher semantic level than conventional summarisation methods. In addition to algorithms developed within the predecessor projects AutoArch and Motion Ripper [15], the ICBR project brings novel methods of motion and gait analysis for content analysis and classification. Objectives set for the semantic modelling module include identification of the overall scene type as well as individual object classification and identification. The approach is to design unsupervised classification algorithms, trained on the rich and structured semantic ground-truth metadata, that exploit specific media representations generated by the content adaptation module and a set of MPEG-7 audio/visual descriptors.

4. System Overview

As depicted in Figure 1, our ICBR system consists of four major units: the Media Server, the Metadata Server, a set of feature extraction and content analysis algorithms, and a user-end Demonstrator comprising a Demonstrator engine and a platform-independent front-end GUI. The Media Server, as a content provider, delivers audio and/or visual media through a QuickTime interface, either as a bandwidth-adaptive HTTP/RTSP stream for presentational purposes or via the QuickTime API's direct file access. Content analysis algorithms access media through a specific QuickTime-based API designed to optimise the various modes of access to audiovisual information. Furthermore, the modules of the content analysis algorithms that focus on semantic analysis are able to access the Metadata Server and read/write, search and retrieve both textual and binary metadata through an MPEG-7 based API. The Metadata Server is designed to fully support the MPEG-7 descriptor set and is based on an Oracle database engine.


Fig. 1. ICBR architecture, showing the main entities of the system and the data flow between them

The Demonstrator engine merges and controls the processes involved during search, browse and retrieval triggered by the front-end GUI. The system will support various types of interaction with the multimedia database: query by example, simple text-based queries, combined text and audio-visual queries, browsing based on both semantic and perceptual similarities, automated searches integrated into the media production tools, etc. The following sections describe the major parts of the ICBR system and the main content analysis modules in more detail.

4.1. Media Server

The Media Server is designed to support both real-time streaming to the various modules of the ICBR system and file-based access. The archive comprises an Apple Xserve content server and streaming servers, including 10 TB of RAID storage and a 100 Mbps fibre channel network. The multimedia content within the ICBR project is established from 12,000 hours of Granada media, digitised and transcoded into the following formats:

§ Master media format, either uncompressed 601 YUV or DV, which can be 525 or 625, or possibly HD formats in the future
§ QuickTime-wrapped MPEG-4 Simple Profile, 1 Mbps at 25 fps, 352x264 pixels, with a key-frame every 250 frames and 16-bit stereo 48 kHz audio (95.8 kb/s), for LAN streaming purposes

The content comprises various wildlife footage and has been professionally digitised from both film tapes and analogue camera sources. The crucial value for the ICBR research is that the whole media archive has been manually annotated by production professionals. Although there are no strict rules for the language used to describe the media's content, the majority of these semantic descriptions use a limited vocabulary for particular description categories such as frame composition, camera work, shooting conditions, location and time, etc. In addition, events and objects are described in natural language. This valuable semantic information brings the research opportunities to a completely new level, enabling the ICBR researchers to embark upon the problem of the semantic gap.

4.2. Metadata Server

The Metadata Server unit stores, handles and facilitates querying and retrieval of both technical (e.g. descriptors, timecode, format, etc.) and semantic metadata. The underlying software and database system brings:

§ A high-level semantic approach to metadata cataloguing and classification, specifically its self-indexing, auto-classification and multiple schema handling technologies
§ Merging of high-level semantic and low-level audio-visual descriptors to enable cataloguing and classification in a novel multimedia database environment
§ Media classification: a concept node structure designed to hold and maintain the basic knowledge schema with its controlled vocabulary (synonym and homonym control) and cross-hierarchic, associative attributes

At the heart of this classification system is a "knowledge map" of some 50,000 concepts, of which nearly 13,000 have been implemented, covering many fields of knowledge: natural history, current affairs, music, etc. Manually generated semantic annotation is transferred from the free-text format into the media classification structure, forming an irreplaceable ground truth and knowledge base.

4.3. Segmenter

The Segmenter module parses digitised video sequences into shots, additionally labelling camera artefacts and tape errors on the fly. It utilises block-based correlation coefficients and histogram differences to measure the visual content similarity between frame pairs. In order to achieve real-time processing capability, a two-pass algorithm is applied. First, shot boundary candidates are labelled by thresholding chi-square global colour histogram frame differences. In the second pass, more detailed analysis is applied to all candidates below a certain predetermined threshold. At this stage, hierarchical motion compensation is employed to ensure the algorithm is robust in the presence of camera and object motion. It has been shown to achieve higher recall and precision than conventional shot detection techniques [16]. In addition, the algorithm detects gradual shot transitions by measuring the difference between the visual content of two distant frames. Motion estimates obtained from the shot cut detection algorithm are used to track regions of interest through the video sequence. This enables the distinction between content changes caused by gradual transitions and those caused by camera and object motions.
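To illustrate the first pass of such a two-pass scheme, the following minimal Python sketch flags shot-boundary candidates from chi-square colour-histogram frame differences using OpenCV. It is not the ICBR implementation: the histogram resolution and the coarse threshold are illustrative assumptions, and the second pass (block-based correlation with hierarchical motion compensation) is only indicated by a comment.

```python
import cv2
import numpy as np

def frame_histogram(frame, bins=16):
    """Global colour histogram of a frame, L1-normalised."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3,
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, None, alpha=1.0, norm_type=cv2.NORM_L1).flatten()

def chi_square_diff(h1, h2, eps=1e-8):
    """Chi-square distance between two normalised histograms."""
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def first_pass_candidates(video_path, coarse_thresh=0.4):
    """First pass: flag frames whose colour change from the previous frame
    exceeds a coarse threshold; these become shot-boundary candidates."""
    cap = cv2.VideoCapture(video_path)
    candidates, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = frame_histogram(frame)
        if prev_hist is not None:
            diff = chi_square_diff(prev_hist, hist)
            if diff > coarse_thresh:
                candidates.append((idx, diff))
        prev_hist, idx = hist, idx + 1
    cap.release()
    # A second pass would re-examine candidates with block-based correlation
    # and hierarchical motion compensation before declaring a shot cut.
    return candidates
```

The two-pass split keeps the expensive, motion-compensated analysis confined to the small set of frames that the cheap histogram test singles out, which is what makes real-time operation plausible.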


4.4. Key-frame Extraction

In order to achieve meaningful content understanding and classification, a robust and complex, yet efficient, system for video representation has to be developed. The crucial stage of that process is the abstraction of a data-intensive video stream into a set of still images, called key-frames, that represent both the perceptual features and the semantic content of a shot in the best possible way.

Fig. 2. Flowchart of the rule-based key-frame extraction module, where the rules applied depend upon the shot type

The current aims of the project are to extract a visual skim consisting of 8 frames to visualise the temporal content of a shot, whilst the majority of the low-level metadata will be extracted from a single key-frame. Furthermore, information extracted in the temporal parsing module is utilised to calibrate the optimal processing load for the key-frame extractor. This information includes shot boundaries, their types and a frame-to-frame difference metric computed from colour histograms. Because video data demands heavy computation and our goal is real-time processing, we reduce the complexity of the input by using features extracted directly from the compressed-domain video stream. In order to achieve subjective, adaptive and intelligent key-frame extraction, we need to analyse the spatio-temporal behaviour of the regions/objects present in the scene, as depicted in Figure 2. For segmentation purposes, eigen-decomposition-based techniques such as normalised-cut (nCut) graph segmentation and non-linear filtering methods such as anisotropic diffusion have been investigated [17]. A set of heuristic rules is to be designed in order to select the most appropriate representative frame.
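As an illustration of how the frame-to-frame difference metric from the temporal parsing stage can drive skim selection, the short Python sketch below places the eight skim frames where equal amounts of cumulative visual change have accumulated within a shot. This equal-activity sampling is only a stand-in for the rule-based, region-driven selection described above; the function name and input format are assumptions.

```python
import numpy as np

def select_skim_frames(frame_diffs, num_frames=8):
    """Pick a visual skim of `num_frames` frame indices for one shot from its
    frame-to-frame colour-histogram difference curve.  Frames are placed where
    equal shares of cumulative visual change have accumulated, so fast-changing
    parts of the shot receive denser sampling."""
    diffs = np.asarray(frame_diffs, dtype=float)
    cumulative = np.concatenate([[0.0], np.cumsum(diffs)])
    if cumulative[-1] == 0.0:
        # Static shot: fall back to uniform temporal sampling.
        return list(np.linspace(0, len(cumulative) - 1, num_frames, dtype=int))
    targets = np.linspace(0.0, cumulative[-1], num_frames + 2)[1:-1]
    return [int(np.searchsorted(cumulative, t)) for t in targets]
```

Under these assumptions, a shot whose visual activity is concentrated in its second half would receive most of its skim frames there, while a near-static shot would be sampled uniformly.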


4.5. Motion and Gait Analysis

The animal gait analysis has focused on developing algorithms to extract quadruped gait patterns from video footage. It is anticipated that this information will be useful in training classifiers for low-level query-by-example searches as well as in semantic modelling. The algorithm generates a sparse set of points describing region trajectories in an image sequence using a KLT tracker [18]. These points are then separated into foreground and background points, and the internal motion of the foreground points is analysed to extract the animal's periodic motion signature.
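A minimal sketch of the two ingredients named above, sparse KLT tracking and periodicity estimation, is given below using OpenCV and NumPy. It is not the ICBR gait algorithm: foreground/background separation, the handling of lost tracks and all parameter values are simplified or assumed.

```python
import cv2
import numpy as np

def track_points(frames):
    """Track sparse corner features through a list of greyscale frames (KLT).
    Lost points are not pruned here, for brevity."""
    pts = cv2.goodFeaturesToTrack(frames[0], maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)
    trajectories = [pts.reshape(-1, 2)]
    for prev, curr in zip(frames, frames[1:]):
        pts, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None)
        trajectories.append(pts.reshape(-1, 2))
    return np.stack(trajectories)               # shape: (n_frames, n_points, 2)

def dominant_period(motion_signal, fps):
    """Dominant period (seconds) of a 1-D internal-motion signal via the FFT,
    e.g. the mean residual speed of foreground points after subtracting the
    median (background/camera) motion."""
    x = np.asarray(motion_signal, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    peak = int(np.argmax(spectrum[1:])) + 1     # skip the DC bin
    return 1.0 / freqs[peak] if freqs[peak] > 0 else None
```

In this simplified view, the spectral peak of the foreground points' residual motion plays the role of the periodic gait signature.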

Fig. 3. Animal gait and quadruped structure from a sparse set of tracked points

A more robust tracker will allow the segmentation of the points' trajectories into many different models. This will facilitate the identification of numerous foreground objects and the separation of individual objects' motion into multiple models.

4.6. Semantic Modelling

The key resource underpinning this research is a very large database of video and audio footage that will hold a variety of data types, from text to long image and audio sequences. Our aim is to automate the process of ground-truth extraction by learning to recognise object class models from unlabelled and unsegmented cluttered scenes. We will build on previous work such as [19]. In addition, identification of the overall scene type is essential because it can help in identifying and modelling specific objects/animals present in the scene. Thus, a set of probabilistic scene analysis techniques that apply naïve Bayesian classifiers to information extracted from images will be used to construct user-defined templates suitable for wildlife archives. The query database thereby becomes a kind of visual thesaurus, linking each semantic concept to the range of primitive image features most likely to retrieve relevant items. We also explore techniques which allow the system to learn associations between semantic concepts and primitive features from user feedback, by annotating selected regions of an image and applying semantic labels to areas with similar characteristics. As a result, the system will be capable of improving its performance with further user feedback.
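A hedged sketch of the scene-type step is given below, using scikit-learn's Gaussian naïve Bayes classifier as a stand-in for the probabilistic machinery eventually adopted. The toy features, their dimensionality and the scene labels are purely illustrative; in the real system the descriptors and labels would come from the feature extraction modules and the manual annotation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy stand-in data: one row per key-frame of low-level descriptors
# (e.g. mean colour, texture energy), labelled with scene types drawn
# from the manual annotation.  Illustrative only.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 12))            # 200 key-frames, 12 descriptors
y_train = rng.choice(["savannah", "underwater", "forest"], size=200)

scene_classifier = GaussianNB()
scene_classifier.fit(X_train, y_train)

# Posterior probabilities for an unseen key-frame give a ranked list of
# likely scene types, which can then prime object/animal-specific models.
x_new = rng.normal(size=(1, 12))
ranked = sorted(zip(scene_classifier.classes_,
                    scene_classifier.predict_proba(x_new)[0]),
                key=lambda pair: -pair[1])
print(ranked)
```

The ranked posteriors are what would feed the "visual thesaurus" view: each semantic concept is linked to the feature regions that make it probable.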


4.7. Audio

A primary aim on the audio side is to provide content production with query-by-example retrieval of similar sounds from the Audio Database, consisting at present of about 200 GB of sound. The major requirement is not to classify sounds per se, but to be able to retrieve sounds similar to a query sound from the database during the creative post-processing sessions when composing and laying down audio tracks on completed (silent) documentaries. Our approach, if the research proves successful, will point towards the provision of an active database. Given an initial, hopefully small, set of models trained from hand-selected sounds, our thesis is that clusters of new sounds with perceptual similarities to them can be identified. Such clusters can then be used to train new models. In this way we can bootstrap the Audio Database into an evolving (active) database, with the sound clusters of new models trained "overnight" being presented to the user of the database for rejection as a non-perceptual clustering or acceptance as a new cluster with certain perceptual "features". Provision of a textual description of the perceptual features of each new cluster would enable us to build up a complementary semantic model space. We will then be able to support both query by example sound and query by semantic (textual) description of a sound. An automatic approach to segmentation appears essential. Our basic unit of audio will be a segment within a track, with the content provider storing tracks as separate files. Each track in the Audio Database, mostly originally identified with a single semantic label on the CD from which it was extracted, is likely, for sound production purposes, to contain different sound segments that relate to similar segments across tracks rather than within a track. Where possible, these must be identified, perceptually clustered and classified. We have gained invaluable experience from using the Matlab implementations of the MPEG-7 audio reference software developed by Michael Casey and Matthew Brand, centred upon spectrogram feature extraction, redundancy reduction and hidden Markov model classification, and will further benefit from making quantitative comparisons between our approach and the MPEG-7 one.
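The query-by-example requirement can be sketched as follows. This minimal Python example uses librosa MFCC statistics and a Euclidean nearest-neighbour ranking as a stand-in for the MPEG-7 spectrogram feature extraction, redundancy reduction and hidden Markov model pipeline mentioned above; the signature layout and the sample rate are assumptions.

```python
import numpy as np
import librosa

def segment_signature(path, sr=48000, n_mfcc=20):
    """Summarise an audio segment as the mean and standard deviation of its
    MFCC frames (a crude perceptual fingerprint)."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def rank_by_similarity(query_path, database_paths):
    """Return database segment paths ordered by distance to the query sound."""
    query = segment_signature(query_path)
    scored = [(np.linalg.norm(query - segment_signature(p)), p)
              for p in database_paths]
    return [p for _, p in sorted(scored)]
```

Clustering the same signatures (for example with k-means) would give the seed groupings from which new "overnight" models could be trained and then accepted or rejected by the user, in the spirit of the active database described above.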

5. Conclusions

This paper describes the major objectives of the ICBR project and the approach taken to exploit the unique opportunity of having both a large real-world multimedia database and a manually annotated semantic description of the media organised in a structured taxonomy. The final demonstrator will be integrated into a media production system, showing the functionalities of a third-generation content-based multimedia retrieval system such as semantic retrieval and browsing, automatic content generation and adaptation, etc., and bringing this novel multimedia management concept into the real world of media production.


Acknowledgements

The work reported in this paper has formed part of the ICBR project within the 3C Research programme of convergent technology research for digital media processing and communications, whose funding and support are gratefully acknowledged. For more information please visit www.3cresearch.co.uk. Many thanks to David Gibson for Figure 3.

6. References

[1] J. Calic, "Highly Efficient Low-Level Feature Extraction for Video Representation and Retrieval", PhD Thesis, Queen Mary, University of London, December 2003.
[2] M. Flickner et al., "Query by image and video content: The QBIC system", IEEE Computer, 28, pp. 23-32, September 1995.
[3] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases", Intern. J. Comput. Vision, 18(3), pp. 233-254, 1996.
[4] A. B. Benitez, J. R. Smith, "New Frontiers for Intelligent Content-Based Retrieval", in Proc. SPIE 2001, Storage and Retrieval for Media Databases, San Jose, CA, January 2001.
[5] H. J. Zhang, A. Kankanhalli, and S. Smoliar, "Automatic Partitioning of Full-Motion Video", Multimedia Systems, Vol. 1, No. 1, pp. 10-28, 1993.
[6] B.-L. Yeo and B. Liu, "Rapid Scene Analysis on Compressed Video", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 6, pp. 533-544, December 1995.
[7] J. Calic, E. Izquierdo, "Efficient Key-Frame Extraction and Video Analysis", International Symposium on Information Technology (ITCC 2002), 8-10 April 2002, Las Vegas, NV, USA, IEEE Computer Society, pp. 28-33, 2002.
[8] A. Girgensohn, J. S. Boreczky, "Time-Constrained Keyframe Selection Technique", Multimedia Tools Appl., 11(3): 347-358, 2000.
[9] M. Davis, "Knowledge Representation for Video", Proc. of 12th National Conference on Artificial Intelligence (AAAI-94), Seattle, USA, AAAI Press, pp. 120-127, 1994.
[10] C. Dorai, S. Venkatesh, "Computational Media Aesthetics: Finding Meaning Beautiful", IEEE MultiMedia, 8(4): 10-12, 2001.
[11] M. Naphade, T. Kristjansson, B. Frey, T. S. Huang, "Probabilistic Multimedia Objects (Multijects): A Novel Approach to Indexing and Retrieval in Multimedia Systems", Proc. IEEE ICIP, Volume 3, pp. 536-540, October 1998, Chicago, IL.
[12] SCHEMA Network of Excellence in Content-Based Semantic Scene Analysis and Information Retrieval, www.schema-ist.org/SCHEMA/
[13] COST 211quat, Redundancy Reduction Techniques and Content Analysis for Multimedia Services, www.iva.cs.tut.fi/COST211/
[14] Informedia Digital Video Library, www.informedia.cs.cmu.edu/
[15] Motion Ripper, http://www.3cresearch.co.uk/3cprojects/motion_ripper
[16] S. V. Porter, M. Mirmehdi, B. T. Thomas, "Temporal video segmentation and classification of edit effects", Image Vision Comput., 21(13-14): 1097-1106, 2003.
[17] J. Calic, B. T. Thomas, "Spatial Analysis in Key-frame Extraction Using Video Segmentation", WIAMIS 2004, April 2004, Instituto Superior Técnico, Lisboa, Portugal.
[18] D. Gibson, N. Campbell, B. Thomas, "Quadruped Gait Analysis Using Sparse Motion Information", Proc. of ICIP, IEEE Computer Society, September 2003.
[19] M. Weber, M. Welling and P. Perona, "Unsupervised learning of models for recognition", Proc. ECCV 2000, pp. 18-32, Lecture Notes in Computer Science, Springer, 2000.