Experience Sharing by Retrieving Captured Conversations using Non-Verbal Features

Christof E. Müller
ATR Media Information Science Laboratories
Seika-cho, Soraku-gun, Kyoto 619-0288 JAPAN
[email protected]

Yasuyuki Sumi1
Graduate School of Informatics, Kyoto University
Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501 JAPAN
[email protected]

Kenji Mase1
Information Technology Center, Nagoya University
Furu-cho, Chigusa-ku, Nagoya City 464-8601 JAPAN
[email protected]

Megumu Tsuchikawa
ATR Media Information Science Laboratories
Seika-cho, Soraku-gun, Kyoto 619-0288 JAPAN
[email protected]

ABSTRACT
We present a system that retrieves the voice part of human communications captured by our collaborative experience capturing system. For segmenting, interpreting, and retrieving past conversation scenes from a huge amount of captured data, the system focusses on the non-verbal aspects of the data, i.e. the contextual information captured by ubiquitous sensors, rather than the verbal (semantic) aspects. The retrieved communications are presented to other persons who are in situations similar to those of the original communicators. This experience sharing enables people to gain more information about their situation or surroundings. The system's current domain is a poster exhibition at an academic conference, where it provides a visitor with additional information about the exhibited posters or can be used to re-experience the event after the exhibition is finished.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval Models
General Terms: Algorithms, Human Factors
Keywords: ubiquitous sensors, contextual information, experience sharing, exhibition assistance, non-verbal information

1. INTRODUCTION
In previous work, a capturing system using ubiquitous and wearable sensors was developed to collect behaviours and interactions among humans and artifacts [11]. Besides video and audio, the system records other sensory data, e.g. about the gazing and location of humans, and automatically annotates all collected information. This so-called interaction corpus is created as a source for both researchers and computers to analyze and understand human interactions. It is also meant as a source for sharing experiences among people. The original domain of the capturing system is a poster exhibition, for example at an academic conference, but efforts are being made to port it to other domains like meetings or lectures, and eventually to daily-life usage. Several applications that use the collected sensory data for experience sharing in this domain have been developed or are in development. A video-based story-telling system [4], a video summarization system [11] and a system that projects the captured video data into a 3-dimensional space [13] focus on re-experiencing the event after the actual exhibition. The system called Ambient Sound Shower presented in this paper aims to be used not only for re-experiencing the exhibition, but also for retrieving additional information about the different exhibits at the exhibition site during the actual event. The additional information consists of recorded conversations held in the past by exhibitors or other exhibition visitors. The system focusses on presenting the audio part of the conversations through an earphone or headphones rather than displaying the captured video data on a head-mounted display. This makes it possible to provide the information without disrupting the user, so that he can still focus on the actual exhibits. It seems promising to use automatic speech recognition (ASR) to extract semantic information from the recorded speech signals and to use this information for retrieving the conversations. However, while ASR may work quite well today under laboratory conditions, it is still error prone in a noisy surrounding like a poster exhibition [2]. We therefore forgo semantic information extraction by ASR and instead focus on non-verbal contextual information, in order to explore how useful it can be. If ASR technology becomes capable of working accurately in noisy environments, we will combine our research results on non-verbal contextual information with semantically driven methods in order to realize one of our long-term goals: a conversational agent that operates as a personal virtual presenter.

1 Also affiliated with ATR Media Information Science Laboratories.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CARPE'04, October 15, 2004, New York, New York, USA. Copyright 2004 ACM 1-58113-932-2/04/0010 ...$5.00.

2. CONCEPT
An interactive poster exhibition accompanying an academic conference is a good example of a situation where people with the same interests share their experiences and ideas with each other. Visitors can become highly engaged by discussing the information the posters contain with exhibitors or with each other. The exhibitors can provide additional information or give a presentation in order to make the posters' content clearer. Poster exhibitions usually have scheduled times during which the exhibitors are present and a large number of visitors attend. At other times there might be only a few people around, e.g. during lunch time or while conference talks are scheduled. At these times a visitor has little opportunity to interact with exhibitors or other visitors. This is the situation in which the Ambient Sound Shower provides additional information in the form of recorded conversations. In other words, the system captures a large amount of data, i.e. spoken conversations and the connected contextual information, during times of high interaction between visitors and exhibitors, and presents these captured experiences to other visitors during times of low interaction. The use of wearable and ubiquitous sensors, and therefore the availability of information about the physical context of the user, allows the system to observe and analyze the user's situation and to act proactively, i.e. to provide information without explicit user input while still considering the user's interruptability. In order to accomplish the above concepts the system needs to perform the following tasks:

• capture conversations
• process and mix the captured audio data
• observe and analyze the user's situation
• select conversations
• provide the selected audio data to the user

3. RELATED WORK
A wide range of projects in the area of nomadic information systems relate to our system, some of them using ubiquitous or wearable sensors. In the HIPS project [1] a hand-held electronic museum guide was developed. The Museum Wearable [9] uses a wearable and ubiquitous sensor system similar to ours. In both projects the presented information already exists, in the sense that it is contained in and retrieved from a static database. The sensory data is only used to decide which information should be selected or how it should be arranged; it is not captured and presented to other visitors as additional information about the exhibits. In our approach, instead, we try to enable experience sharing among the visitors, which was also a concept of the c-map system described in [12]. With the Ambient Sound Shower, visitors and exhibitors contribute to a dynamic repository of data which is used for providing additional information about exhibits. With systems like the Locust Swarm [10], the Audio Aura [5] or the system described by Rekimoto et al. [6], people can attach digital messages to objects, and the messages are automatically or manually retrieved and presented when other persons gaze at the object or enter a certain location. In contrast to our system, the retrieval of the messages is not personalized: all messages attached to one object or location are presented to the user regardless of whether they are of interest to him. These systems also present the messages either only when they are selected manually, or automatically but without taking the user's interruptability into account. DyPERS [8] and the Wearable Remembrance Agent [7] concentrate on personal memory aid and use the physical context of the user to retrieve certain information. A more recent project by Kern et al. described in [3] not only uses contextual information to retrieve captured data of meetings, but also uses the context to detect the user's interruptability.

4. CAPTURING SYSTEM
The capturing of the conversations is performed by a ubiquitous/wearable sensor system which was developed for collecting interactions between humans and artifacts in order to create a so-called interaction corpus [11]. In its configuration for the Ambient Sound Shower, this sensor system consists of three parts:

• The wearable sensors (see figure 1) include a video camera, a microphone, a throat microphone, infrared light emitting ID tags and a corresponding tracker for detecting ID tags. They are connected by wires to a subnotebook.
• Similar sensors are attached to objects and ceilings.
• A backbone of database and application servers is connected to the sensors via a wireless network.

Figure 1: The different sensors of the capturing system: a) wearable set, b) ubiquitous sensors attached to posters and ceilings.

sensor                usage
ID tags and tracker   gazing and location information
throat microphone     speech detection
microphone            experience capturing
video camera          experience capturing

Table 1: The different sensor types and their usage in the capturing system.

Table 1 shows the usage of each of the sensors. The collected sensory data is stored in a database and on a file server. After the data is segmented on the time scale, it is scanned for primitive events like looking at an object (LOOK AT) or talking to somebody (TALK TO). Based on the primitive events, so-called composite events are detected, e.g. GROUP DISCUSSION, TOGETHER WITH or JOINT ATTENTION. The event structure is also stored in the database.
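The following sketch illustrates one way such primitive and composite events could be derived from the annotated sensor data. It is only an illustration of the idea, not the actual implementation of the corpus annotation: the record layout, function names and thresholds (e.g. SensorRecord, detect_look_at, the 3-second minimum) are our own assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SensorRecord:
    """One annotated sensor observation (hypothetical schema)."""
    person_id: str
    target_id: str      # ID tag seen by the person's tracker (poster or other person)
    start: float        # seconds since start of capture
    end: float

@dataclass
class Event:
    kind: str           # e.g. "LOOK_AT", "TALK_TO", "JOINT_ATTENTION"
    person_ids: List[str]
    target_id: str
    start: float
    end: float

def detect_look_at(records: List[SensorRecord], min_duration: float = 3.0) -> List[Event]:
    """Primitive event: a person's tracker sees the same ID tag long enough."""
    return [Event("LOOK_AT", [r.person_id], r.target_id, r.start, r.end)
            for r in records if r.end - r.start >= min_duration]

def detect_joint_attention(look_events: List[Event], min_overlap: float = 2.0) -> List[Event]:
    """Composite event: two people look at the same object at overlapping times."""
    composites = []
    for i, a in enumerate(look_events):
        for b in look_events[i + 1:]:
            if a.target_id != b.target_id or a.person_ids == b.person_ids:
                continue
            start, end = max(a.start, b.start), min(a.end, b.end)
            if end - start >= min_overlap:
                composites.append(Event("JOINT_ATTENTION",
                                        a.person_ids + b.person_ids,
                                        a.target_id, start, end))
    return composites
```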

5. RETRIEVAL SYSTEM

5.1 Architecture
The retrieval system extends the distributed architecture of the capturing system with a database and several servers. The application server controls the playback, changes between the different playback modes, selects the appropriate conversations for playback and streams the audio data over a wireless network to the client software running on the subnotebook which controls the user's wearable sensors. The system uses the application's own database to store relevant information. The audio signal processor and mixer enhances the quality of the captured audio signals, mixes the signals, and stores the processed and mixed signals as audio files on the application's file server.

5.2 Playback Modes
While observing and analyzing the user's situation, the system starts the playback of audio data if it detects that the user is interruptable and needs additional information. During playback the system automatically changes between three different playback modes depending on the user's behaviour. Figure 2 depicts the situations in which the modes are chosen. In case the user is at an exhibition that not many people attend at that time, the system first tries to establish a stimulating ambient atmosphere by playing back on the user's earphone a mix of all conversations that have been held at the exhibits by other participants so far. If the user then shows interest in a particular poster by focussing on it, the system switches to playing back a mix of only those conversations that were held about this particular exhibit. The number of conversations the user can hear indicates how popular this exhibit has been. If he keeps the poster in focus, the system assumes that he is still interested in it and starts the playback of only one conversation at a time. The presented conversation is assumed to be of particular interest for the user; it should give him additional information about the focussed exhibit. Table 2 summarizes the different modes.

Figure 2: The system changes automatically between the three playback modes: ambient sound, exhibit overview and specific conversation.

MODE                   SITUATION                                                 SOUND                              FUNCTION
ambient sound          user shows no specific interest; low presence of people   mix of all captured conversations  creates stimulating environment
exhibit overview       user is interested in specific exhibit                    mix of all related conversations   overview; indicates grade of popularity
specific conversation  user is very interested in specific exhibit               one interesting conversation       additional information

Table 2: Situation, sound and function of the playback modes.
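The mode switching described above can be read as a small decision rule driven by how many people are nearby and how long the user has kept an exhibit in focus. The sketch below mirrors Table 2; the Observation fields, thresholds and function names are illustrative assumptions rather than the system's actual parameters.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Mode(Enum):
    AMBIENT_SOUND = auto()
    EXHIBIT_OVERVIEW = auto()
    SPECIFIC_CONVERSATION = auto()

@dataclass
class Observation:
    people_nearby: int                # persons currently detected around the user
    focused_exhibit: Optional[str]    # exhibit currently gazed at, if any
    focus_duration: float             # seconds the current exhibit has been in focus

def choose_mode(obs: Observation,
                overview_after: float = 5.0,
                specific_after: float = 20.0,
                crowd_threshold: int = 3) -> Optional[Mode]:
    """Map the user's observed situation to one of the three playback modes."""
    if obs.people_nearby >= crowd_threshold:
        return None                        # enough live interaction, stay silent
    if obs.focused_exhibit is None:
        return Mode.AMBIENT_SOUND          # no specific interest: mix of everything
    if obs.focus_duration >= specific_after:
        return Mode.SPECIFIC_CONVERSATION  # sustained interest: one conversation
    if obs.focus_duration >= overview_after:
        return Mode.EXHIBIT_OVERVIEW       # emerging interest: mix for this exhibit
    return Mode.AMBIENT_SOUND
```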

5.3 Selection of Captured Conversations
In the specific conversation playback mode the system has to select for playback the one captured conversation which contains the most meaningful information for the user. As the system has no means of semantically analyzing the verbal content of the conversations, it can only use non-verbal contextual information about the user, the conversations and the conversations' participants in the selection process. This contextual information includes time, location, gazing direction and interaction information. Figure 3 gives an overview of the contextual information used. The selection is made by applying heuristics and calculating a score for every conversation. Table 3 shows some of the criteria used to calculate this score. Another important criterion in the selection process is to find conversations whose participants share the same interests as the user. For this purpose we analyze the intensity of the user's and the participants' past interactions with other exhibits and assume that a more intense interaction with an exhibit indicates a stronger interest in the topic of the exhibit. As a measure of intensity we use the time spent looking or staying at the particular exhibit and the duration of the conversations detected as related to the exhibit. The system calculates a score for every participant that expresses the amount of interest shared by participant and user across the different exhibition topics. An average score over all participants of one conversation is calculated and taken into account in the conversation's score.

Figure 3: Matching the context of the user (current time, current location/gazing, conversation partners in the past, conversations already presented by the system, interactions with exhibits), the recorded conversations (location/exhibit, time, duration, participants) and the conversations' participants (current location, interaction with exhibits).

CRITERIA                                                      FUNCTION
participant's gazing direction and location                   detect to which exhibit the conversation's topic might be related
conversation's length                                         exclude too short and too long conversations
type of participant                                           a conversation with an exhibitor might be more exhibit-related
participants are still at the exhibition site                 the user can talk to them later
participants' interaction with the exhibit                    more interest in the exhibit means a more interesting conversation
participants have interacted with the user before             more interesting to hear conversations of known people
participant took part in an already presented conversation    the user knows his voice and gets a broader image of the person
user is a participant                                         the user knows the conversation already
conversation was presented before                             the user knows the conversation already

Table 3: Criteria and their functions for selecting conversations.
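The paper describes the selection as a heuristic score per conversation plus an averaged shared-interest score over its participants, without giving an exact formula. The sketch below is one hypothetical way to combine these ingredients; all weights, limits and names (Conversation, shared_interest, ...) are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Participant:
    # seconds of looking/staying/talking per exhibit, as a proxy for interest
    interaction_time: Dict[str, float] = field(default_factory=dict)

@dataclass
class Conversation:
    exhibit: str
    duration: float               # seconds
    participants: List[Participant]
    with_exhibitor: bool
    presented_before: bool

def shared_interest(user: Participant, other: Participant) -> float:
    """Overlap of interest profiles: minimum interaction time per common exhibit."""
    exhibits = set(user.interaction_time) & set(other.interaction_time)
    return sum(min(user.interaction_time[e], other.interaction_time[e]) for e in exhibits)

def conversation_score(conv: Conversation, user: Participant) -> float:
    """Heuristic score; higher means more worth playing back (assumed weights)."""
    score = 0.0
    if 30.0 <= conv.duration <= 300.0:   # exclude too short and too long conversations
        score += 1.0
    if conv.with_exhibitor:              # exhibitor conversations tend to be exhibit-related
        score += 1.0
    if conv.presented_before:            # the user knows this conversation already
        score -= 2.0
    if conv.participants:                # average shared interest over all participants
        score += sum(shared_interest(user, p) for p in conv.participants) / len(conv.participants)
    return score

def select_conversation(convs: List[Conversation], user: Participant) -> Conversation:
    return max(convs, key=lambda c: conversation_score(c, user))
```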

5.4 Playback Control
A visit to a poster exhibition represents a situation in which the user has no fixed location, but is moving around, and his visual perception is focussed on the exhibits. This makes it difficult to find a non-disruptive and unobtrusive interface for the user to control the system. The Ambient Sound Shower therefore provides the information proactively and tries to keep the required interaction of the user with the system to a minimum. To accomplish this, the system needs to observe and analyze the user's situation and decide when he is in need of information. The visit of a poster exhibition also includes many situations in which the user does not want to be disrupted, so the system needs to take the user's interruptability into account. For the detection of the user's interruptability, Kern et al. [3] propose using the sensory data directly instead of trying to define general situations in which the user is interruptable or not. We follow a similar approach and use only primitive and composite event types, which stay on an abstract level. The contextual information used is divided into four categories, depicted in figure 4:

• conversational status
• social status
• implicit user feedback
• explicit user feedback

The conversational status describes the user's involvement in conversations with other people. It might be very distracting for the user to hear the system's played-back conversations while speaking himself or listening to another person's speech. The social status gives information about the user's social environment, i.e. whether he is surrounded by other people or is walking through the exhibition together with another person. Implicit user feedback expresses how much interest the user showed in exhibits for which the system was presenting information, i.e. whether he kept focussing on the exhibit while the system was providing information about it, and also whether the user is interested in exhibits at all or is perhaps walking through the exhibition for other reasons (e.g. meeting people). As the system cannot always accurately detect the user's needs, it provides at least a minimal user interface consisting of a start, pause and stop button, with which the user can push the system to play back audio data, tell it to pause playback for some system-dependent time interval, or stop the playback completely. This explicit user feedback is used not only for an explicit reaction of the system, but also for an implicit one, as the user's long-term use of the buttons influences the frequency with which the system presents information. A score is calculated for each of these categories, indicating whether the user's situation demands a presentation of additional information or forbids it. An overall score is calculated as a weighted sum of the category scores. By comparing it to a predefined threshold the system decides whether or not to play back audio data.

Figure 4: The system adapts its behaviour to the user's situation: scores for conversational status, social status, implicit user feedback and explicit user feedback are combined, and the overall score is compared with a threshold to decide between playback and no playback.
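As a minimal sketch of this playback decision, assuming per-category scores in [0, 1] and hypothetical weights and threshold (the paper specifies neither), the weighted combination could look as follows.

```python
from dataclasses import dataclass

@dataclass
class SituationScores:
    """Per-category scores in [0, 1]; 1 means the situation invites playback."""
    conversational: float     # low if the user is talking or listening to someone
    social: float             # low if the user is accompanied by other people
    implicit_feedback: float  # did the user keep focussing during past playbacks?
    explicit_feedback: float  # derived from long-term use of start/pause/stop

# Weights and threshold are illustrative assumptions, not values from the paper.
WEIGHTS = {"conversational": 0.4, "social": 0.2,
           "implicit_feedback": 0.2, "explicit_feedback": 0.2}
THRESHOLD = 0.6

def should_play(s: SituationScores) -> bool:
    overall = (WEIGHTS["conversational"] * s.conversational
               + WEIGHTS["social"] * s.social
               + WEIGHTS["implicit_feedback"] * s.implicit_feedback
               + WEIGHTS["explicit_feedback"] * s.explicit_feedback)
    return overall >= THRESHOLD
```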

5.5 Audio Signal Processing and Mixing
The captured audio signals of conversations are digitized (16 bit, 22.05 kHz) and stored unprocessed together with the video data on a file server. As the speech of each participant is recorded with a separate microphone, several audio signals are recorded for one conversation. A microphone signal contains not only the speech of the microphone's wearer, but also environmental noise and the speech of other people. In order to reduce the noise level we mute the audio signal during intervals in which the microphone's wearer is not speaking. The speech is detected by an algorithm that uses the signal of the throat microphone. The muting also avoids echoes of voices in the playback. As each person speaks with a different volume, the microphone signals differ in their amplitudes, and in a later playback it can be difficult to understand the speech of some persons. The signals' volume is therefore normalized by searching for the highest peak of each signal and amplifying the whole signal so that the highest peak equals the maximal acceptable value. For the later playback of a conversation all audio signals belonging to this conversation have to be played back synchronously. In order to reduce the required bandwidth the signals are not streamed separately to the client, but are mixed together before the actual streaming process. This is done by recursively adding the samples of two signals together and normalizing the resulting signal afterwards to avoid clipping. This process is performed right after a conversation has been captured, and the resulting mixed signal is stored as a sound file on the application's file server. In the playback modes ambient sound and exhibit overview the system plays back a mix of several conversations. This mix follows the same procedure, applied to the already mixed signals of the conversations. As conversations are constantly being captured during the exhibition, this process has to be repeated regularly. In order to avoid an abrupt change in the audio playback when the system changes its playback mode, the system performs a crossfade of the starting and ending signals, i.e. it fades each signal's volume in or out before starting or stopping its playback.
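The normalization, pairwise mixing and crossfading steps described above could be sketched as follows for floating-point samples in [-1, 1]. This is an illustrative reading of the procedure, not the system's implementation; the NumPy representation and the function names are our own assumptions.

```python
import numpy as np
from typing import List

def normalize(signal: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale the signal so that its highest absolute peak equals `peak`."""
    max_amp = np.max(np.abs(signal))
    return signal if max_amp == 0 else signal * (peak / max_amp)

def mix_pair(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Add two signals sample by sample, then normalize to avoid clipping."""
    mixed = np.zeros(max(len(a), len(b)))
    mixed[:len(a)] += a
    mixed[:len(b)] += b
    return normalize(mixed)

def mix_conversation(signals: List[np.ndarray]) -> np.ndarray:
    """Recursively mix all participant signals of one conversation into one track."""
    mixed = signals[0]
    for s in signals[1:]:
        mixed = mix_pair(mixed, s)
    return mixed

def crossfade(ending: np.ndarray, starting: np.ndarray, fade_samples: int) -> np.ndarray:
    """Fade out the ending signal while fading in the starting one (mode change)."""
    fade_out = np.linspace(1.0, 0.0, fade_samples)
    fade_in = np.linspace(0.0, 1.0, fade_samples)
    tail = ending[-fade_samples:] * fade_out
    head = starting[:fade_samples] * fade_in
    return np.concatenate([ending[:-fade_samples], tail + head, starting[fade_samples:]])
```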

5.6 Audio Streaming
The wearable sensors of the user are controlled by a small portable personal computer which also runs the client software of the Ambient Sound Shower. In order to save resources we try to perform as many tasks as possible on the server, which is connected over the wireless network, yielding a thin-client structure. The only task of the client is to receive the stream of audio data and play it back via a sound card and a connected earphone. One difficulty in the audio streaming is that the data should be delivered to the client in real time, as a change in the user's current situation may result in a change of the played-back audio; but since the data rate on the network is not predictable, audio data has to be sent to the client in advance or the audio quality will become poor. On top of the natural latency between transmission and playback, buffering the audio data prolongs the delay even more. Therefore an appropriate size of the audio buffer has to be selected in order to ensure good sound quality without a noticeable delay of signal changes.
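The trade-off between robustness to network stalls and reaction delay can be made concrete with a small back-of-the-envelope calculation using the sample format from section 5.5. The buffer lengths below are hypothetical values chosen only to illustrate the trade-off.

```python
# Mono, 16-bit samples at 22.05 kHz as described in section 5.5.
SAMPLE_RATE_HZ = 22_050
BYTES_PER_SAMPLE = 2

def buffer_stats(buffer_seconds: float) -> dict:
    """Size of a client-side audio buffer and the worst-case delay before a
    mode change becomes audible (the whole buffer must drain first)."""
    return {
        "buffer_bytes": int(buffer_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE),
        "worst_case_switch_delay_s": buffer_seconds,
    }

# A larger buffer tolerates longer network stalls but delays mode changes:
print(buffer_stats(0.5))   # {'buffer_bytes': 22050, 'worst_case_switch_delay_s': 0.5}
print(buffer_stats(2.0))   # {'buffer_bytes': 88200, 'worst_case_switch_delay_s': 2.0}
```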

6. SUMMARY AND FUTURE WORK
In this paper we presented a system that automatically captures conversations of participants of a poster exhibition and presents these conversations to a visitor during times when only few or even no other people are present to interact with. The playback of the conversations' audio signals is used to create a stimulating atmosphere or to provide the user with additional information. For selecting the most valuable conversations for the user, the system uses only non-verbal contextual information, as the use of ASR technology is still error prone. The contextual information is also used to analyze the user's current situation and interruptability, so that information can be provided proactively when it is appropriate. The system is at an early prototyping stage. Extensive evaluation now has to be done to verify that our system functions as intended. As both the user's and the system's context are captured in a database during use, this can help to simplify the evaluation process, and by applying common machine learning algorithms it can be used to improve the system's behaviour. Further work will include other output modalities, such as displaying information on a head-mounted display. This displayed data could be information about the participants of the played-back conversation, or it could show the user how to reach the current position of one of the participants. As the capturing system is ported to other domains and even daily-life usage, it will be interesting to see how the Ambient Sound Shower can be applied in these domains.

7. ACKNOWLEDGMENTS
Very valuable contributions and support were given to this project by various of our colleagues at ATR. The project was partly funded by the National Institute of Information and Communications Technology, Japan.

8. REFERENCES
[1] G. Benelli, A. Bianchi, P. Marti, E. Not, and D. Sennati. HIPS: Hyper-Interaction within Physical Space. In Proceedings of the IEEE International Conference on Multimedia Computing, Florence, Italy, 1999.
[2] R. V. Cox, C. A. Kamm, L. R. Rabiner, J. Schroeter, and J. G. Wilpon. Speech and language processing for next-millennium communications services. Proceedings of the IEEE, 88:1314–1337, August 2000.
[3] N. Kern, B. Schiele, H. Junker, P. Lukowicz, G. Tröster, and A. Schmidt. Context Annotation for a Live Life Recording. In Proceedings of the Pervasive 2004 Workshop on Memory and Sharing of Experiences, Vienna, Austria, 2004.
[4] N. Lin, Y. Sumi, and K. Mase. An Object-centric Storytelling Framework Using Ubiquitous Sensor Technology. In Proceedings of the Pervasive 2004 Workshop on Memory and Sharing of Experiences, Vienna, Austria, 2004.
[5] E. D. Mynatt, M. Back, R. Want, M. Baer, and J. B. Ellis. Designing Audio Aura. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Los Angeles, USA, 1998.
[6] J. Rekimoto, Y. Ayatsuka, and K. Hayashi. Augment-able reality: Situated communications through physical and digital spaces. In Proceedings of the 2nd International Symposium on Wearable Computers, pages 68–75, October 1998.
[7] B. J. Rhodes. The Wearable Remembrance Agent: A system for augmented memory. In Proceedings of the First International Symposium on Wearable Computers (ISWC '97), October 1997.
[8] B. Schiele, N. Oliver, T. Jebara, and A. Pentland. DyPERS: Dynamic Personal Enhanced Reality System. In Proceedings of the International Conference on Computer Vision Systems (ICVS '99), Gran Canaria, Spain, 1999.
[9] F. Sparacino. The Museum Wearable: Real-time sensor-driven understanding of visitors' interests for personalized visually-augmented museum experiences. In Proceedings of Museums and the Web, Boston, USA, 2002.
[10] T. Starner, D. Kirsch, and S. Assefa. The Locust Swarm: An environmentally-powered, networkless location and messaging system. In Proceedings of the 1st International Symposium on Wearable Computers, pages 169–170, October 1997.
[11] Y. Sumi, S. Ito, T. Matsuguchi, S. Fels, and K. Mase. Collaborative Capturing and Interpretation of Interactions. In Proceedings of the Pervasive 2004 Workshop on Memory and Sharing of Experiences, Vienna, Austria, 2004.
[12] Y. Sumi and K. Mase. Conference assistant system for supporting knowledge sharing in academic communities. Interacting with Computers, 14(6):713–737, 2002.
[13] Y. Sumi, K. Mase, C. E. Müller, S. Iwasawa, S. Ito, M. Takahashi, K. Kumagai, Y. Otaka, M. Tsuchikawa, Y. Katagiri, and T. Nishida. Collage of Video and Sound for Raising the Awareness of Situated Conversations. In Proceedings of the International Workshop on Intelligent Media Technology for Communicative Intelligence (IMTCI 2004), Warsaw, Poland, September 2004.

