MPEG-7 Requirements and Application for a MultiModal Meeting Information System: Insights and Prospects within the MISTRAL Research Project

Christian Gütl, Victor Manuel García-Barrios
(Institute for Information Systems and Computer Media (IICM), Faculty of Computer Science, Graz University of Technology, Austria
{cguetl, vgarcia}@iicm.edu)
Abstract: The increasing amount of multi-media information in Intranets and on the Internet demands advanced processing, management and retrieval. The MISTRAL research project aims at enhanced semi-automatic procedures for semantic annotations and semantic enrichment of multi-modal data streams. The MISTRAL system is designed to be applicable in a wide range of use cases and consists of so-called 'conceptual units' for 'uni-modal data stream processing', 'multi-modal merging of extracted features', 'semantic enrichment of concepts' and 'semantic applications'. The main implementation field focuses on processing, analysis, management, search and retrieval of meeting information. According to the literature, face-to-face and virtual meetings tend to be increasingly relevant within business processes. Knowledge workers, technicians and managers spend a noteworthy amount of their working time in meetings. Consequently, considerable financial and human effort is invested in order to create new knowledge or to distribute it among meeting participants. Thus, knowledge addressed or generated in meetings represents a valuable resource, which is worth storing and making accessible for reuse. In this paper, we give a brief overview of the MISTRAL research idea and its architecture, followed by a detailed discussion of the meeting application scenario. Based on that, the requirements for and application of MPEG-7 and MPEG-21 are addressed. Finally, intended results and contributions to the MPEG community are depicted.
Keywords: meeting information system, multi-modal data, MPEG-7, MISTRAL
Categories: (H.4, J.4, K.3, I.5)
1 Introduction
Nowadays, our business and private activities are intrinsically tied to the use of modern electronic equipment that provides a variety of multi-media capabilities and produces data streams of different media types. Consequently, we are confronted with a dramatic increase of multi-modal data streams and the need to manage them. According to [Lyman and Varian 2003] and based on an assumed world population of 6.3 billion, about 800 Megabytes of new data are created per year and per capita (which is consistent with the study's overall figure: 6.3 billion × 800 Megabytes is roughly 5 Exabytes). Further, the same study estimates that of the total amount of 5 Exabytes of information available world-wide, about 92% exists in electronic form, with a 170 Terabyte share being available on the Internet. As multi-media data is becoming the predominant form of information within this global asset, semantic information has to be attached in order to retrieve and reuse this data in a useful way. Additionally, there is no doubt that human resources alone cannot keep pace with this situation. Thus, improved and new methods are needed, and this need represents the motivation of the
MISTRAL (Measurable intelligent and reliable semantic extraction and retrieval of multimedia data) project (see also [MISTRAL]). The MISTRAL research aims at enhanced semi-automatic procedures for semantic annotations and semantic enrichment of multi-modal data streams and corpora. Extensibility, interchangeability and flexibility are achieved by applying a distributed component-based approach and Web Service technology. In order to process multi-modal corpora, the system consists of 'conceptual units' applied in sequential order for (1) uni-modal data stream processing, (2) multi-modal merging of extracted features, (3) semantic enrichment of concepts, and (4) semantic applications. Furthermore, a benchmarking framework, UI and visualization services, as well as security and privacy issues complement the conceptual units in an orthogonal way. In general, the system is designed to be applicable to a wide range of application scenarios by means of orchestration and choreography of MISTRAL-internal and external Web Services (a simplified sketch of such a sequential orchestration is given below). However, the main application scenario for detailed research and prototype implementation is the field of meeting information processing, which is discussed in the following section.
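To make the sequential orchestration of the conceptual units more concrete, the following minimal sketch (Python; the endpoint URLs, payload fields and the requests-based interface are purely hypothetical assumptions for illustration and not part of the MISTRAL implementation) shows how an automation engine might chain the four units for one recording:

```python
import requests  # assumption: each conceptual unit is exposed as a REST-style Web Service

# Hypothetical endpoints; the real MISTRAL services and their interfaces are not specified here.
UNITS = [
    "http://mistral.example/unimodal/process",     # (1) uni-modal processing (video, audio, speech-to-text, text)
    "http://mistral.example/multimodal/merge",     # (2) multi-modal merging of extracted features
    "http://mistral.example/semantic/enrich",      # (3) semantic enrichment against the knowledge base
    "http://mistral.example/application/publish",  # (4) semantic application (indexing, retrieval, UI)
]

def process_recording(recording_id: str) -> dict:
    """Pass one meeting recording through the four conceptual units in strict sequence."""
    payload = {"recording_id": recording_id}
    for unit_url in UNITS:
        response = requests.post(unit_url, json=payload, timeout=600)
        response.raise_for_status()
        payload = response.json()  # each unit returns annotations consumed by the next unit
    return payload

if __name__ == "__main__":
    result = process_recording("meeting-2005-06-01")
    print(result.get("annotations", []))
```

The sketch only illustrates the strict sequential hand-over of annotations from one conceptual unit to the next; in a Web-Service-based setting as described above, the units could equally well be orchestrated via SOAP or a dedicated workflow engine.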
2 Meeting Information Processing and Retrieval
Face-to-face and virtual meetings increasingly take place in today's business processes. As stated in [Romano and Nunamaker 2001], managers and knowledge workers spend between 25% and 80% of their working time in meetings. Further, Romano and Nunamaker report that the median number of participants in the analyzed meetings was nine. From our literature investigation (see e.g. [Romano and Nunamaker 2001] and [Whiteside and Wixon 1988]) we identified the following most common meeting purposes: reconciliation of conflicts, facilitating staff communication, group decisions, solving problems, learning and training, knowledge exchange, reaching a common understanding, and exploration of new ideas and concepts. Thus, considerable financial and human effort is invested in order to create new knowledge or to transfer it between meeting participants. Furthermore, it is clear that this knowledge should be preserved and made accessible for all members of a company. Based on the observations stated so far, a considerable potential for savings is identifiable if the effectiveness of meetings is increased. Such effectiveness can be reached by applying improved processing methods and technology support as well as meeting information management and retrieval, for instance electronic meeting systems, group support systems, meeting browsers or meeting systems (see [Antunes et al. 2003], [Lalanne et al. 2005], [WIKIPEDIA]). In this context, multi-modal meeting recording applications and meeting information systems are of emerging interest. Several research projects in these fields have been conducted recently or are still in progress (see e.g. [AMI], [CALO], [CHIL], [IM2], [M4], [NIST], and [Begeman et al. 1986]). Despite the increasing research activity, our extensive literature research has shown that there is still a lack of integration facilities into knowledge management and e-learning systems. This fact motivated us to design and implement the semantic meeting information application within the MISTRAL research project. A brief overview of this application from different viewpoints is given in the following paragraphs and
represents the basis for the remaining chapters (for details see [Guetl et al. 2005]). The overall architecture of the MISTRAL system is depicted in Fig. 1.
Figure 1: MISTRAL's component architecture for the meeting application scenario.

The main components of the MISTRAL system for multi-modal data stream processing are designed in accordance with the conceptual units: (1) the Uni-modal System, encompassing video, audio, speech-to-text and text processing units, (2) the Multi-modal System, (3) Semantic Enrichment, and (4) the Semantic Application. In addition, Data Management is responsible for the storage, retrieval and access of multi-modal data and its metadata as well as for the management of the knowledge base holding meeting-relevant domain knowledge. The Automation Engine, in co-operation with the Semantic Application, handles service orchestration and choreography. From the meeting recording point of view, various meeting scenarios (e.g. a face-to-face meeting at one location or virtual meetings at different locations) and seminar scenarios (like workshops, conferences, presentations, etc.) have to be considered.
Our prototype solution (see Fig. 1) allows the recording of multi-modal meeting data streams, such as audio and video streams, and the collecting of multi-modal sensory input from and about meeting participants as well as non-video and non-audio artifacts (agenda, presentation slides, etc.). Based on these materials, a speech-to-text transformation is performed for the audio streams, and further semantic features are extracted from the multi-modal data streams, annotated and merged. Thereafter, additional semantic enrichment is conducted by exploiting a domain-specific knowledge base. From the semantic applications viewpoint of MISTRAL, the focus is set on the integration of meeting information into knowledge transfer and e-learning processes. Thus, we would like to call attention to the following requirements: (1) personalized support for meeting attendees and absentees to enhance knowledge development and integration, e.g. user-tailored access to meeting information by semantically rich annotations; (2) adaptive knowledge transfer to other members of the company, e.g. depending on position and current job tasks; (3) personalized views on and management of knowledge structures as well as trustworthy feedback to the multi-modal meeting conceptual units; (4) processing and archiving of knowledge created and transferred in meetings as an integral part of the corporate memory; (5) visualization and retrieval of multi-modal semantics; and (6) linking knowledge captured in meetings to learning and training activities, e.g. its application in experiential learning.
3 Multi-Modal Processing Features
The purpose of this section is to introduce a selection of important meeting information processing features based on the main research objectives of the meeting application scenario. The insights gained from these features represent the basis for the metadata requirements discussed in the next section. Interesting groups of features and the processing units involved are described in the following paragraphs. It is also worth mentioning that a confidence level is attached to each extracted or processed feature. The examples for relevant meeting information processing feature groups are: (a) meeting participant localization and recognition, (b) speech-to-text and text processing, (c) object and sound recognition, (d) multi-modal sensor data (presentation and click-data processing), and (e) semantic enrichment.

(a) Meeting Participant Localization and Recognition
This feature group allows recognizing and localizing meeting participants. Both the audio unit and the video unit process these features independently. In addition, the audio unit supplies the system with voice characteristic features, such as stress, sex and age group of participants. The video unit also delivers movement information as well as gesture and facial expression features. The multi-modal merging unit combines corresponding features and is responsible for confidence boosting, e.g. confirming a correct participant localization (see the sketch below).
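As an illustration of such confidence boosting, the following sketch combines independent participant localizations from the audio and video units. It is a hypothetical fusion rule, not the actual MISTRAL merging algorithm; the Detection structure, the distance threshold and the noisy-OR combination are assumptions made only for this example.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """A participant localization hypothesis produced by one uni-modal unit."""
    participant_id: str
    position: tuple      # e.g. (x, y) coordinates in the meeting room
    confidence: float    # 0.0 .. 1.0

def fuse_localizations(audio: Detection, video: Detection, max_distance: float = 1.0) -> Detection:
    """Hypothetical fusion rule: if audio and video agree on participant and approximate
    position, boost the confidence; otherwise keep the stronger uni-modal cue."""
    dx = audio.position[0] - video.position[0]
    dy = audio.position[1] - video.position[1]
    agree = (audio.participant_id == video.participant_id
             and (dx * dx + dy * dy) ** 0.5 <= max_distance)
    if agree:
        # noisy-OR style boost: both cues independently support the same hypothesis
        boosted = 1.0 - (1.0 - audio.confidence) * (1.0 - video.confidence)
        return Detection(audio.participant_id, video.position, boosted)
    # disagreement: fall back to the more confident uni-modal detection
    return max(audio, video, key=lambda d: d.confidence)

# Example: audio localizes participant 'P3' with 0.6, video with 0.7 at a nearby position
print(fuse_localizations(Detection("P3", (2.0, 1.0), 0.6), Detection("P3", (2.2, 1.1), 0.7)).confidence)
```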
(b) Speech-to-Text and Text Processing
The speech-to-text unit extracts as much textual information as possible from the participants' oral contributions, based on a corresponding phoneme dictionary. The text unit processes the speech-to-text output as well as meeting documents, such as the agenda, project documents and background documents. These units deliver high-level features, such as extracted concepts, content summaries, text classifications and semantic content clusters. In this context, the multi-modal merging unit tries to detect and correct speech-to-text extraction errors.

(c) Object and Sound Recognition
The sound unit delivers information about sound events, such as a phone ringing, laughing, clapping hands, etc. The video unit performs the recognition of trained objects, such as a mobile phone or a briefcase. Again, the multi-modal merging unit is responsible for combining sound and video events in order to perform spatio-temporal synchronization as well as for adapting the confidence level of extracted concepts; e.g. the correct spatio-temporal combination of 'phone ring' and 'mobile device' increases the confidence level of both events.

(d) Multi-modal Sensor Data
Within the MISTRAL project, multi-modal sensor data processing is restricted to particular interactions with a presentation device, such as selecting or opening a document, the click-data stream, or visiting URLs. In combination with the feature groups (a) to (c) mentioned above, further useful information can be gained and processed by the multi-modal merging unit. As an example, consider the correction of speech-to-text errors by analyzing the text content of the currently presented slide, as sketched below.
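The slide-based correction mentioned in (d) could, for instance, look like the following sketch. It is illustrative only; the token/confidence representation and the use of difflib for fuzzy matching against the slide vocabulary are assumptions, not the actual method of the MISTRAL text unit.

```python
import difflib

def correct_with_slide_vocabulary(transcript_tokens, slide_text, cutoff=0.8):
    """Illustrative sketch: replace low-confidence transcript tokens with close matches
    from the vocabulary of the slide currently shown (hypothetical correction strategy)."""
    vocabulary = set(slide_text.lower().split())
    corrected = []
    for token, confidence in transcript_tokens:  # (word, speech-to-text confidence) pairs
        if confidence < 0.5:  # only touch tokens the speech-to-text unit is unsure about
            matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
            if matches:
                token = matches[0]
        corrected.append(token)
    return corrected

# Example: the recognizer is unsure about "mistrel" while the current slide contains "MISTRAL"
slide = "MISTRAL architecture: conceptual units and semantic enrichment"
print(correct_with_slide_vocabulary([("the", 0.9), ("mistrel", 0.3), ("architecture", 0.8)], slide))
```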
(e) Semantic Enrichment
Based on the different meeting scenarios modeled in the knowledge base, the semantic enrichment unit can deliver further high-level features and additional semantic annotations. On the one hand, the semantic enrichment unit is responsible for conflict detection, e.g. a person cannot sit in the foreground and stand in the background at the same time. On the other hand, the semantic enrichment unit draws semantic conclusions, e.g. a person who opens the meeting, introduces the other participants, asks the most questions and closes the meeting is the 'moderator' (see the sketch at the end of this section).

The semantic application unit builds on the feature groups discussed so far and on semantic concepts in order to support knowledge management activities and learning/training on the job. The front-end for users and other external services is called the 'semantic application portal'. This portal provides access to personalized functionality and handles authentication and authorization. The back-end of the application is represented by the 'retrieval module', which is responsible for identifying and accessing task-dependent sets of semantic information provided by the feature groups. Furthermore, the retrieval module builds proper data structures to support the semantic application functionality outlined in the previous section. In order to meet these objectives, further modules are needed within the semantic application unit. The 'modeling module' encompasses user modeling and context modeling, and in combination with the 'adaptation module' it is responsible e.g. for personalization and recommendation purposes. Finally, the 'visualization module' provides users with graphical views on extracted features (properly represented by means of similarity techniques as well as hierarchies, graphs and time-line plots).
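The 'moderator' conclusion mentioned under (e) can be illustrated by a simple rule sketch. The event model and the scoring are hypothetical and serve only as an illustration; the actual semantic enrichment unit reasons over the meeting-relevant knowledge base rather than over such ad-hoc structures.

```python
def infer_moderator(participant_events):
    """Hypothetical rule sketch of a semantic conclusion: the participant who opens the
    meeting, introduces the others, asks the most questions and closes the meeting is
    labeled 'moderator'."""
    scores = {}
    question_counts = {}
    for event in participant_events:  # each event: {"participant": ..., "type": ...}
        p, kind = event["participant"], event["type"]
        if kind in ("opens_meeting", "introduces_participants", "closes_meeting"):
            scores[p] = scores.get(p, 0) + 1
        elif kind == "asks_question":
            question_counts[p] = question_counts.get(p, 0) + 1
    if question_counts:
        top_asker = max(question_counts, key=question_counts.get)
        scores[top_asker] = scores.get(top_asker, 0) + 1
    return max(scores, key=scores.get) if scores else None

events = [
    {"participant": "P1", "type": "opens_meeting"},
    {"participant": "P1", "type": "introduces_participants"},
    {"participant": "P2", "type": "asks_question"},
    {"participant": "P1", "type": "asks_question"},
    {"participant": "P1", "type": "asks_question"},
    {"participant": "P1", "type": "closes_meeting"},
]
print(infer_moderator(events))  # -> 'P1'
```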
4 Project-relevant Metadata Requirements
According to the objectives of the MISTRAL research project and its application scenario as discussed before, specific project-relevant metadata requirements emerge. The goal of this section is to highlight the most important requirements, followed by a discussion of a possible application of MPEG standards.

The main requirement is the description of meeting-specific data. This encompasses basic metadata for multi-modal data streams (video, audio, speech-to-text, and sensor data given by the interaction with the presentation computer) and related documents (such as the agenda, project descriptions, reports and background information). In addition, the system requires easy data annotation driven by feature extraction and semantic enrichment. This semi-automatic data enrichment is performed by distributed conceptual units (uni-modal units, the multi-modal unit and the semantic enrichment unit) and by user input and user feedback. It is worth mentioning that the distributed system units need write-access in a coordinated way; analogously, these units also need proper read-access to particular metadata. From the semantic application point of view, metadata need to be properly identified in order to be easily recognized, accessed and processed by generic procedures. This enables a more useful fulfillment of task-specific demands on particular information. Furthermore, for meeting data delivery, the synchronization of data streams, content adaptation and personalization, as well as quality of service (QoS) issues are of particular interest. In order to support benchmarking procedures, metadata should be easy to annotate by humans as well as to apply as training and test data following common or standardized techniques and data exchange formats. Last but not least, security and privacy aspects have to be considered, because numerous chunks of sensitive data and information affecting privacy are involved.

Following the requirements stated so far, we have decided to apply MPEG-7 and MPEG-21 because of their standardization and flexibility as well as their increasing adoption in research and industrial development (see [Martinez et al. 2002], [Martinez 2002] and [Burnett et al. 2003]). From the MPEG-7 application point of view, there are several interesting research and development issues. MISTRAL builds on the 'Detailed Audiovisual Profile (DAVP)', a profile suggested by [Bailer et al. 2005] based on years of experience. Yet, there is a need for adaptation and extension of MPEG-7 in order to also cover related meeting documents, multi-modal sensor data and speech-to-text data. Furthermore, research and development efforts have to be conducted to meet our requirements for read and write access by distributed units and for benchmarking. In particular, research work is conducted in the fields of security and privacy as well as content adaptation and personalization.
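To give an impression of what an MPEG-7-style annotation of a single meeting segment might look like, the following Python sketch emits a strongly simplified description. The element names loosely follow the general structure of MPEG-7 content descriptions, but the fragment is neither complete nor validated against the MPEG-7 schema or the DAVP profile; the confidence attribute, time formats and overall nesting are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sketch only: a simplified, MPEG-7-style description of one annotated
# meeting segment. Namespaces, types and attributes are assumptions, not a validated profile.
MPEG7_NS = "urn:mpeg:mpeg7:schema:2001"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

def describe_segment(start: str, duration: str, annotation: str, confidence: float) -> str:
    ET.register_namespace("", MPEG7_NS)
    ET.register_namespace("xsi", XSI_NS)
    mpeg7 = ET.Element(f"{{{MPEG7_NS}}}Mpeg7")
    description = ET.SubElement(mpeg7, f"{{{MPEG7_NS}}}Description",
                                {f"{{{XSI_NS}}}type": "ContentEntityType"})
    content = ET.SubElement(description, f"{{{MPEG7_NS}}}MultimediaContent",
                            {f"{{{XSI_NS}}}type": "AudioVisualType"})
    segment = ET.SubElement(content, f"{{{MPEG7_NS}}}AudioVisual")
    media_time = ET.SubElement(segment, f"{{{MPEG7_NS}}}MediaTime")
    ET.SubElement(media_time, f"{{{MPEG7_NS}}}MediaTimePoint").text = start
    ET.SubElement(media_time, f"{{{MPEG7_NS}}}MediaDuration").text = duration
    text_annotation = ET.SubElement(segment, f"{{{MPEG7_NS}}}TextAnnotation",
                                    {"confidence": str(confidence)})  # assumed attribute for illustration
    ET.SubElement(text_annotation, f"{{{MPEG7_NS}}}FreeTextAnnotation").text = annotation
    return ET.tostring(mpeg7, encoding="unicode")

print(describe_segment("T00:05:00", "PT2M30S", "Participant P1 opens the meeting", 0.8))
```

Extensions covering related meeting documents, multi-modal sensor data and speech-to-text output, as discussed above, would require additional description tools beyond such a fragment.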
5 Conclusions and Future Work
In this paper we have highlighted the MISTRAL research objectives and its application scenario in the field of meeting information processing, management and retrieval. Based on that, the requirements for specific project-relevant metadata have been discussed. Above all, we have decided to apply MPEG-7 and MPEG-21 because of their standardization and high flexibility. Although first experiences show good prospects, the MISTRAL meeting application requires special attention in order to adapt and extend the MPEG standards. Based on our ongoing research and implementation work, we expect mutual contributions with the MPEG community due to the very specific requirements arising from our meeting information application. Examples of this expected exchange of efforts and results are (a) experiences in applying MPEG-7 (in particular DAVP) and MPEG-21 in our particular application field, (b) insights into the use of multi-modal training techniques and test data, and (c) the release of parts of modules and services as open source.

Acknowledgements
The project results have been developed in the MISTRAL project (http://www.mistral-project.at). MISTRAL is financed by the Austrian Research Promotion Agency (http://www.ffg.at) within the strategic objective FIT-IT under the project contract number 809264/9338.
Special thanks also to Internet Studio-Isser for providing images and graphical support.
6 References
[AMI] AMI Project; official Web Site; last visit 2005-05-18, http://www.amiproject.org
[Antunes et al. 2003] Antunes, P., Costa, C.: 'Perceived Value: A Low-Cost Approach to Evaluate Meetingware'; Lecture Notes in Computer Science, 9th International Workshop CRIWG 2003, Volume 2806, 2003, 109-125.
[Bailer et al. 2005] Bailer, W., Schallauer, P., Hausenblas, M., Thallinger, G.: 'MPEG-7 Based Description Infrastructure for an Audiovisual Content Analysis and Retrieval System'; Conference on Storage and Retrieval Methods and Applications for Multimedia, USA, 2005.
[Begeman et al. 1986] Begeman, M., Cook, P., Ellis, C., Graf, M., Rein, G., Smith, T.: 'Project Nick: meetings augmentation and analysis'; In Proceedings of the 1986 ACM Conference on Computer-Supported Cooperative Work (CSCW '86), ACM Press, NY, USA, 1986, 1-6.
[Burnett et al. 2003] Burnett, I., Van de Walle, R., Hill, K., Bormans, J., Pereira, F.: 'MPEG-21: Goals and Achievements'; In IEEE Multimedia, Vol. 10, Oct-Dec 2003, 60-70.
[CALO] CALO (Cognitive Agent that Learns and Organizes) Project; last visit 2005-05-18, http://www.cse.ogi.edu/CHCC/Projects/CALO/main.html
[CHIL] CHIL Project; Web Site; last visit 2005-05-18, http://chil.server.de
[Guetl et al. 2005] Guetl, C., García-Barrios, V.M.: 'Semantic Meeting Information Application: A Contribution for Enhanced Knowledge Transfer and Learning in Companies'; In Proceedings of ICL 2005, Villach, Austria, 2005.
[IM2] IM2; The Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management; last visit 2005-05-18, http://www.im2.ch
[Lalanne et al. 2005] Lalanne, D., Lisowska, A., et al.: 'The IM2 Multimodal Meeting Browser Family'; Interactive Multimodal Information Management Project, report, 2005.
[M4] M4 Project; official Web Site; last visit 2005-05-18, http://www.m4project.org
[Martinez et al. 2002] Martinez, J.M., Koenen, R., Pereira, F.: 'MPEG-7: The Generic Multimedia Content Description Standard, Part 1'; In IEEE Multimedia, Apr. 2002, 78-87.
[Martinez 2002] Martinez, J.M.: 'MPEG-7: Overview of MPEG-7 Description Tools, Part 2'; In IEEE Multimedia, Jul. 2002, 83-93.
[MISTRAL] MISTRAL Project; Web Site; last visit 2005-06-24, http://www.mistral-project.at
[NIST] NIST Meeting Room Project; Web Site; last visit 2005-06-21, http://www.nist.gov
[Romano and Nunamaker 2001] Romano, N.C., Nunamaker, J.F.: 'Meeting Analysis: Findings from Research and Practice'; In Proceedings of HICSS-2001, Hawaii, 2001.
[Whiteside and Wixon 1988] Whiteside, J., Wixon, D.: 'Contextualism as a world view for the reformation of meetings'; In Proceedings of CSCW 1988, ACM Press, Oregon, USA, 1988.
[WIKIPEDIA] WIKIPEDIA: Meeting system; Wikipedia - The Free Encyclopedia, last visit 2005-05-15, http://en.wikipedia.org/wiki/Meeting_system