Recognizing Context for Annotating a Live Life Recording
Nicky Kern, Bernt Schiele
Department of Computer Science, Darmstadt University of Technology, Germany
{nicky.kern,schiele}@informatik.tu-darmstadt.de
Albrecht Schmidt
Embedded Interaction Research Group, Universität München, Germany
[email protected]
Abstract
In the near future it will be possible to continuously record and store the entire audio-visual lifetime of a person together with all digital information that person perceives or creates. While the storage of this data will be possible soon, retrieval of and indexing into such large data sets is an unsolved challenge. Since today's retrieval cues seem insufficient, we argue that additional cues, obtained from body-worn sensors, make associative retrieval by humans possible. We present three approaches to create such cues, each along with an experimental evaluation: the user's physical activity from acceleration sensors, his social environment from audio, and his interruptability from multiple sensors.
Keywords: Context-Awareness, Information Retrieval, Sensing Systems, Context Recognition, Wearable Computing
1 Introduction
Technologies for capturing and creation are becoming ubiquitously deployed. For example, smartphones that allow image, video, and audio capture are common; word processing and email are by now standard ways of creating text and of communicating. Many people also read, scan, and perceive huge amounts of data and information every day in the office, at home, or on the move, through various means such as the web, documents, or newspapers. Collecting information is common for personal use as well as in professional environments. For hobbyists, collecting photos and videos is a central activity. In the professional domain the types of documents that are collected vary greatly. In the medical profession, for example, large amounts of information are collected, ranging from written and audio notes, to X-ray and computer tomography (CT) images and movies, to sensor data from Electro-Encephalogram (EEG) and Electrocardiogram (ECG) recordings.
Advances in digital technology enable people to gather and conveniently store massive amounts of data. In particular, four different ways of accumulating information can be distinguished:
• capture - acquiring images, audio, video, and sensor data.
• creation - writing text, drawing images and plans.
• download - obtaining information from online sources such as the WWW and storing the retrieved data locally.
• communication - information created in the process of communication, resulting in email archives, chat logs, and video and audio archives.
The separation is more conceptual than technical. It is interesting to observe that with current technologies many of these processes take place in settings where the user is mobile or in a particular environment. The context in which data is gathered can be an interesting and vital resource for further use of such data [1]. In our view context describes the situation in the real world in which data is acquired. This may include the location, the social environment, the activity, and the physical environment, as suggested in Schmidt et al. [2]. Besides annotating audio-visual data, it becomes apparent that collected sensor data (e.g. ECG) can also benefit from meta-information on activity. Automated annotation of gathered data based on sensor information is the central approach that we describe in this paper.
The overall amount of information stored and recorded is impressive [3]. However, storing this information is technically not the central problem. Even recording an entire audio-visual lifetime of a person is feasible with current technology, and people are experimenting with this [4]. One can estimate that this would result in about 500 TB of data¹. Finding a particular piece of information, in contrast, is in many scenarios an enormous challenge. Our approach is to aid the retrieval process by providing real-world meta-information for the captured data. The additional meta-information also increases the value of the captured data. Creating data in context is a great opportunity for collecting meta-information. Working in context has become the norm as devices become pervasively used in environments beyond the desktop. A lot of information originates from mobile devices, where context information can be very rich. An example is someone writing a text on a PDA or notebook computer at different places, in different social contexts, during different activities. Similarly, taking photos often happens in the context of a particular situation (e.g. a family gathering, the introduction of a new machine, etc.), see [5] and [6]. We explore the use of sensors to acquire contextual information that can increase the value of the data collected. Retrieval traditionally makes use of meta-information (e.g. for querying data and for the creation of indexes) such as the date of creation, the type of a document, or the size of a document. People then use this information when searching for a particular piece of information. Here it is useful to discriminate between explicit meta-information (e.g. searching by the year a document was created, or by a file name) and implicit meta-information (e.g. pictures I stored on my laptop, documents on the computer in the office). For retrieval and filtering it can be seen that people make use of both concepts. We assume that giving users the ability to draw from a larger set of meta-information will improve their ability to find and use the data they have collected. The use of meta-information for using and finding information draws on the fact that many people work and think associatively. People often remember parts of the context in which something took place or - in our case - data was created. By using contextual meta-information we aim to give users means for effectively reducing the search space in large data collections. Beyond this, we expect that contextual information can be used to enhance and aid automated search technologies. We see several approaches for the creation of meta-information and annotations:
• The obvious way is manual annotation of the data that is gathered. However, in most cases this does not seem to be practically feasible. From a small set of informal interviews we could conclude that even for small collections (e.g. a few thousand photos taken by individuals) the burden of annotating them is considered very high, and people therefore do not do this systematically.
• A different approach is post-capture analysis. Here the creation of annotations and meta-information is done (offline) after gathering the information. In this approach the information is analysed after recording and meta-information is extracted (e.g. text analysis for documents, image recognition for movies and pictures, speech recognition for audio, etc.). In many ways this is an important direction and can be seen as complementary to the approach we suggest. The drawback of post-processing is that one can only use the information that is captured within the documents themselves. Therefore the context is often missing; e.g. the person taking the picture is rarely in the picture itself.
• The approach we outline in more detail in this paper is to use multiple sensors and wearable computer technology to create meta-information and annotations in situ. In addition to the information gathered, meta-information is acquired based on sensor devices worn by the user.
The main contribution of this paper is the assessment of wearable technologies and algorithms for creating contextual annotations. We investigate how meta-information for any kind of data gathered in real-world environments can be automatically obtained using a variety of sensors. First we show two low-level context-recognition and annotation examples. In Section 2 we show how to obtain the activity context from acceleration sensors. In Section 3 we outline how to acquire information about the social context from ambient audio. In Section 4 we present an example of a higher-level context: interruptability. In this part we investigate how multiple heterogeneous sensors can be used to create meta-information on the interruptability of a person. Such information can be used for annotation, but as the algorithms introduced work while the user is in the situation, they can also enhance communication applications.
¹ Assuming a lifespan of 100 years, 24 h of recording per day, and 10 MB per minute of recording: 100 × 365 × 24 × 60 min × 10 MB ≈ 5.3 × 10⁸ MB ≈ 500 TB.
2 Activity Context From Acceleration
The user's physical activity provides interesting context for the data gathered. One could, for example, search for all the people one shook hands with, or find the paintings in a museum in front of which the user stood for a while instead of just walking past them (such as in [7]).
There are many ways of recognizing human activity from body-worn sensors. The choice of sensor, the number of sensors and their respective locations, and the necessary sampling rates are hardware-related questions that need to be answered. On the recognition side there is a basic choice between having the system learn which activities are important and explicitly modeling the activities to be recognized. On top of that, the choice of features and appropriate classification mechanisms is important. We present here a platform for recording data from 12 body-worn 3D acceleration sensors. We show experimentally how activities such as sitting, standing, walking, or shaking hands can be recognized, and investigate how many sensors are necessary to recognize these activities.
2.1 Related Work on Activity Recognition using Acceleration Sensors
While the user's physical activity has been used to retrieve information in the special case of meeting scenarios (Kern et al. [8]), most previous work focuses on extracting the physical activity from various sensors. Most work uses acceleration sensors because they are small, cheap, light-weight, and consume only little power. Farringdon et al. [9], Randell and Muller [10], and Mäntyjärvi et al. [11] presented early recognition experiments. Kern et al. built a setup of 12 acceleration sensors and investigated basic, explicitly modeled activities and the minimal required set of sensors. Bao and Intille [12] used 5 2D acceleration sensors and investigated a large set of 20 everyday activities that were explicitly modeled. Van Laerhoven et al. worked on automatically finding contexts that are interesting to the user using Self-Organizing Maps and investigated several different sets of sensors: various sensors mounted on the knee [13], 32 acceleration sensors [14], and ball switches [15]. In medicine, applications of such techniques are being investigated for fall detection [16] or gait changes [17].
2.2 Acceleration Sensor Platform
As the basis for our research we have designed a sensor platform that is robust enough to record long stretches of data from many sensors. While we have limited ourselves to 12 body-worn 3D acceleration sensors, the platform is extensible enough to record data from more and/or other sensors such as gyroscopes and pressure sensors. First results using this platform were presented in [18]. A recording system has to be carefully designed in order to allow for long recordings of real activities in potentially rough environments, e.g. while roller-blading. We have designed our platform to meet requirements along four lines: Robustness, Ease of Use, Sensors, and Length of Recording. The sensors have to be robust enough to be worn for long periods of time, cables must not rip apart from small forces, etc. For the platform to be easy to use it is important that it can be mounted very easily and that it does not hinder the user's movement in any way. From a recognition point of view it is important that there are sufficiently many sensors and that they are sampled at a high enough frequency. This way, sensor data can be recorded for many different sensor locations, and the locations can thus be compared against each other. Finally, to allow for long recordings, it is important that sufficient storage space is available and that the power consumption is low enough that the platform is not weighed down by batteries. The platform we have designed is divided into three parts: a laptop and iPAQ for recording, Smart-Its for A/D conversion of the analog sensor data, and the actual 3D acceleration sensor nodes. A standard laptop provides ample CPU power and disk space for recording the sensor data. Attached (via RS232) to the laptop is an iPAQ that we use to annotate the data online. Figure 1 shows the entire system as worn by a user.
Figure 1. Activity Recognition Platform worn by a user. The acceleration sensors are marked with circles. The data is annotated using the iPAQ the user is holding in his hand.
The A/D conversion of the sensor values is done by Smart-Its, a small embedded microprocessor system with a PIC microcontroller (PIC16F877 or PIC18F252) offering RS232 communication, a wireless link with 19.2 kBit/s, 8K of non-volatile RAM, a power-supply circuit, and an external connector for add-on boards [19, 20]. Figure 2(b) shows a picture of a Smart-It. Since the Smart-It only offers a limited number of analog I/O pins, we have designed an add-on board with analog multiplexers, so that every Smart-It offers 24 analog inputs (see Figure 2(c)). The Smart-It and multiplexer-board configuration is fast enough to sample all 24 inputs at 93 Hz - this is sufficient to capture body motions, most of which are slower than 5 Hz. Our goal was to attach 3D acceleration sensors to all major body joints. We chose three-dimensional over two-dimensional acceleration sensors to capture the maximal information possible. Since the ADXL 311JE sensors we used offer only 2D acceleration, we have packaged two of them together to form a 3D acceleration sensor node (with one axis doubled). These 3D nodes are placed in casings that are very robust yet small (see Figure 2(a)). To cover all major body joints we have attached such sensor nodes to both shoulders, elbows, wrists, hips, knees, and ankles, resulting in a total of 12 3D acceleration sensors. Since we can attach only 6 of them to one Smart-It, our system uses two Smart-Its for A/D conversion.
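The paper does not spell out how the 24 multiplexed channels of one Smart-It map onto its six 3D sensor nodes, so the following is only a minimal sketch of the idea; the channel ordering, the choice of which axis is doubled, and the function name are our assumptions.

```python
import numpy as np

def demultiplex_frame(frame_24):
    """Hypothetical sketch: turn one frame of 24 A/D readings (sampled at 93 Hz)
    from a single Smart-It into six 3D sensor-node readings."""
    frame = np.asarray(frame_24, dtype=float)
    assert frame.shape == (24,)
    # Assume each node occupies 4 consecutive channels: (x, y) from the first
    # ADXL311 chip and (y', z) from the second, with the y axis doubled.
    nodes = frame.reshape(6, 4)
    x, y, y_dup, z = nodes[:, 0], nodes[:, 1], nodes[:, 2], nodes[:, 3]
    # Average the duplicated axis to reduce noise (one possible choice).
    return np.stack([x, (y + y_dup) / 2.0, z], axis=1)   # shape (6, 3)

# Two Smart-Its cover all 12 body-worn nodes:
# all_nodes = np.vstack([demultiplex_frame(frame_a), demultiplex_frame(frame_b)])
```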
2.3 Experiments
In this section we present an experimental evaluation of how well basic activities can be recognized from multiple body-worn acceleration sensors. In particular we investigate the placement and required number of sensors.
Data Set. The goal of this experiment is to recognize everyday postures and activities. We thus included data for sitting, standing, and walking. To capture the user's current activity, we included writing on a whiteboard and typing on a keyboard. We also included shaking hands in order to capture important social interactions of the user. We recorded 18.7 minutes of data for all the above activities using the acceleration sensor platform described in Section 2.2.
Recognition Methodology. The choice of both appropriate features and recognition methodology is crucial for successful recognition. A proper investigation of these topics is beyond the scope of this paper. We have thus restricted ourselves to a single classifier/feature combination. We use a Naïve Bayes classifier, which classifies every feature vector separately, without taking any time series into account. As features we use the mean and variance over the last 50 samples, which, at a sampling rate of 93 Hz, corresponds to roughly 0.5 seconds.
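As a concrete illustration of this feature/classifier combination, the sketch below computes mean and variance over a sliding 50-sample window per sensor channel and trains a Gaussian Naïve Bayes classifier. The windowing details (hop of one sample, per-channel treatment, labeling of each window by its last sample) are our assumptions rather than specifics given in the paper.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

WINDOW = 50  # roughly 0.5 s at 93 Hz

def window_features(acc, window=WINDOW):
    """acc: (n_samples, n_channels) raw acceleration data.
    Returns (n_samples - window + 1, 2 * n_channels) mean/variance features."""
    n, _ = acc.shape
    feats = []
    for start in range(n - window + 1):
        chunk = acc[start:start + window]
        feats.append(np.concatenate([chunk.mean(axis=0), chunk.var(axis=0)]))
    return np.array(feats)

# Hypothetical usage with acc_train of shape (n, 36) for 12 3D sensors:
# X = window_features(acc_train)
# y = labels[WINDOW - 1:]                 # label of the last sample in each window
# clf = GaussianNB().fit(X, y)
# predictions = clf.predict(window_features(acc_test))
```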
2.3.1 Results and Discussion
Figure 3 shows the recognition rates using different subsets of the available sensors. 'All Sensors' recognition rates were obtained using all 12 sensors for recognition. 'Left' and 'Right' use only the left and right sensors respectively (six sensors each). While 'Upper Body' refers to the sensors on both shoulders, elbows, and wrists, 'Lower Body' refers to the sensors on both sides of the hip, both knees, and both ankles. The average recognition rate over all activities (the last set of bars) shows that, in general, the results get better the more sensors are used. For simple activities such as sitting, standing, or walking, good recognition performance can be achieved with a subset of the sensors. Comparing the upper and the lower parts of the body, we note that the recognition rate for the lower body is significantly lower, because the 'other' activities (writing on the whiteboard, shaking hands, and typing on a keyboard) cannot be recognized well. This is natural, since the main part of these activities does not involve the legs. As expected, the 'leg-only' activities (sitting, standing, walking, upstairs, downstairs) are better recognized using the lower part of the body. However, the upper part still performs reasonably well; apparently the overall body motion for these activities can be captured using sensors on the upper part of the body. When comparing the right and left sides of the body, we note that for the leg-only activities both sides are nearly equal in recognition rate. However, the recognition rates for the other activities are quite different. Since shaking hands is a right-handed activity in which the left side of the body plays only a minor role, the right set of sensors obtains the best results. Quite interestingly, writing on the whiteboard cannot be recognized well with the right set of sensors but rather with the left set, which is due to the position of the left arm, which seems to be more discriminative. The low performance of the right side on the keyboard-typing activity is also quite interesting: since the right hand was used to annotate the data on the iPAQ, the right side is not very discriminative.
2.4 Summary
The user's physical activities, such as sitting, standing, typing, or shaking hands, are powerful contextual cues. We have investigated their recognition using body-worn acceleration sensors and found that they can be recognized well with a set of well-distributed sensors. While for general activities more sensors increase the recognition rate, there is a trade-off between activity complexity and the required number of sensors. Simple activities can reliably be recognized with only a few sensors.
Figure 2. The Different Parts of the Acceleration Recording Platform: (a) 3D acceleration sensor nodes and their robust casing, (b) Lancaster Smart-It, (c) multiplexer board to connect six 3D acceleration sensors to one Smart-It.
3 Social Context From Audio
Many different kinds of meta-information can be extracted from audio. The spectrum ranges from 'simple' auditory scene classification to full speech recognition, each kind of meta-information offering different possibilities for retrieval. In this section we discuss some opportunities that arise from the use of audio and present an algorithm and experimental results for analyzing the auditory scene of the user. The range of meta-information that can be extracted from audio can be divided into four categories. In the first category are speech recognition and keyword spotting. This information would make it possible to 'google' within a recording using exactly the same techniques, or to have the system automatically summarize a certain recording. The next category is speaker identification, which makes it possible to find situations involving certain people. Related to this is speaker change detection, which allows one to tell whether the user was alone, interacted with a single person, or was part of a larger group (Kern et al. [8] used this to tell presentations from discussions in meeting recordings). The last category is auditory scene classification, in which the general environment of the user is classified. Classifying lecture settings, for example, would give a direct cue to find all lecture notes the user ever wrote. We are interested in the social context of the user: whether there are people around, whether he is alone, out on the street, or in a car. To this end, we investigated the use of auditory scene recognition. Despite recent advances in speech recognition, recognizing speech from multiple speakers with arbitrary background noise using a microphone that is fixed to only one speaker is still a very hard problem. Speaker identification and speaker change detection can only give clues about the social context of the user if there are people around. However, the social context on a street is considerably different from that in a restaurant, even though there may be few (discriminable) people around.
In this section we present an algorithm for auditory scene classification. We experimentally show its performance and present an investigation of feature selection mechanisms, classification window lengths, and required sampling rates.
3.1 Auditory Scene Classification
Auditory scene classification tries to classify the environment of the user using audio information alone, for example whether the user is on a street or in a restaurant. This can be used as a cue about the user's location and his social environment. With location we refer to a semantic location (in a restaurant) rather than an actual physical location. In auditory scene analysis, three important technical questions need to be answered: the required sampling rate, the required length of audio, and the best features for classification. While for speech recognition or speaker identification the frequency band of the interesting signal is well known, this is not the case for auditory scene classification; engine sounds, for example, lie in a different portion of the spectrum and have to be captured to be classified properly. Peltonen et al. [21] report that humans achieve an average classification score of 70% after on average 20 seconds. Using the best features is of course crucial to successful classification. Substantial work has been done on this problem in the area of speech recognition. However, features that are well suited for speech recognition might not work well for auditory scene classification, because audio signals of very different characteristics are being classified. While it is possible to select features for a given recognition task by hand, it is easier to compute a large number of features and select the best ones for a given recognition task, or to transform the available features into a different space that makes recognition easier. Computing the correlation between features and discarding features that strongly correlate with others is an example of the first approach.
Figure 3. Recognizing User Activity from Body-Worn Acceleration Sensors: recognition rates for different subsets of sensors (All Sensors, Right, Left, Upper Body, Lower Body), shown separately for the leg-only activities (sitting, standing, walking, upstairs, downstairs) and the other activities (shaking hands, writing on the board, keyboard typing), together with the averages.
Examples of the second approach are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). PCA finds the directions of maximum variance in the feature space and thus works without any knowledge of the different classes. It returns vectors of the same dimensionality as the original feature vectors, where the first dimension is the direction with maximal variance, etc. LDA maximizes the inter-class distance and thus has to know to which class every sample belongs. It returns (#classes-1)-dimensional vectors. For both PCA and LDA it is necessary at runtime to compute all features, transform them, and select the required subset. This might be undesirable for systems that are constrained in computation power.
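The two transformations can be sketched as follows with scikit-learn; the number of components and the placeholder data are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: (n_samples, 20) audio feature vectors, y: auditory-scene labels (5 classes).
X = np.random.randn(500, 20)            # placeholder data
y = np.random.randint(0, 5, size=500)   # placeholder labels

# PCA: unsupervised, keeps the directions of maximum variance.
pca = PCA(n_components=15).fit(X)
X_pca = pca.transform(X)                # (500, 15)

# LDA: supervised, returns at most (#classes - 1) = 4 dimensions here.
lda = LinearDiscriminantAnalysis().fit(X, y)
X_lda = lda.transform(X)                # (500, 4)

# Note: at runtime all 20 features must still be computed before either
# transform can be applied, which is the cost mentioned in the text.
```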
3.2 Related Work on Auditory Context Recognition
Previous work uses two distinct approaches: implicit and explicit modeling of the scene. Implicit modeling (see for example Clarkson et al. [22]) tries to find important auditory scenes automatically by clustering the data. In explicit modeling, the models for each scene class are trained in a supervised manner. Peltonen et al. [21] have done substantial work in this area, classifying sounds out of 26 different classes. Büchler [23] investigated algorithms for classification within, and control of, hearing aids. Korpipää et al. [24] also use audio (among other sensors) to classify the context of a person. They use a set of MPEG-7-based features, which are extracted over 1-second segments, and a Naïve Bayes classifier to classify the surrounding audio into Classical Music, Rock Music, Other Sounds, Speech, Water Tap, and Car. Lukowicz et al. [25] use audio to detect activities in a workshop scenario. They used two microphones and multiple accelerometers to classify 8 different activities, using LDA-transformed features and a nearest-neighbor classifier. While this is largely unrelated to the social situation of the user, it is, from a technical point of view, a closely related problem.
3.3 Experiments
In this section we present an investigation of how well five different auditory scenes can be classified, namely Restaurant, Street, Lecture, Conversation, and a garbage class. In particular, we have explored the problems of feature selection, sampling rate, and required length of audio in more depth.
Data Set. We have recorded a total of 2.1 hours of data in five classes: Street (17.6%), Restaurant (10.7%), Lecture (35.6%), Conversation (28.6%), and Other (7.4%). The Other class mainly consists of transitions such as walking from the lecture hall to the restaurant. All data was recorded at 44.1 kHz using a wearable Sony ECM TS-125 clip-on microphone.
Features. We have selected a set of 20 different features. Ten are cepstral coefficients, computed over 30 ms windows and averaged over one second. Ten were selected along the results of Büchler [23]: the spectral center of gravity (CGAV), its temporal fluctuation (CGFS), tonality, mean amplitude onsets (onsetm), common onsets across frequency bands (onsetc), power spectrum histogram width, variance, mean level fluctuation strength (MLFS), zero crossing rate, and total power. All of these are also computed over one-second segments. Thus, a 20-dimensional feature vector is emitted every second.
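A minimal sketch of the cepstral part of this feature set is shown below: 30 ms frames, a log-magnitude spectrum, a DCT, and averaging over one second. The frame hop, the exact cepstrum definition, and the windowing are assumptions on our part; the Büchler features are omitted.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_features(audio, sr=44100, n_coeffs=10, frame_ms=30):
    """Return one n_coeffs-dimensional vector per second of audio."""
    frame_len = int(sr * frame_ms / 1000)
    frames_per_sec = sr // frame_len          # non-overlapping frames (assumption)
    sec_vectors = []
    for s in range(len(audio) // sr):
        second = audio[s * sr:(s + 1) * sr]
        coeffs = []
        for f in range(frames_per_sec):
            frame = second[f * frame_len:(f + 1) * frame_len]
            spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
            cep = dct(np.log(spectrum + 1e-10), norm='ortho')
            coeffs.append(cep[:n_coeffs])
        sec_vectors.append(np.mean(coeffs, axis=0))   # average over the second
    return np.array(sec_vectors)
```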
Figure 4. Recognition Results for Different Feature Sets: (a) All Features, (b) Reduced Feature Set, (c) 10 PCA Coefficients, (d) 15 PCA Coefficients, (e) 20 PCA Coefficients, (f) LDA-Transformed Features. Each plot shows the recognition rate as a function of the sampling rate (5 kHz, 11 kHz, 22 kHz, 44.1 kHz) and the length of the classification segment (1, 5, 10, 15 seconds). The Reduced Feature Set was selected using the inter-feature correlation.
Classification. The classification is done using 2-state fully connected Hidden Markov Models (HMMs). The feature vectors are grouped into classification segments, so that the influence of the length of the available data on classification can be evaluated. All results are 5-fold cross-validated.
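One way to realize this classification step is to train one 2-state HMM per auditory scene and assign each classification segment to the model with the highest log-likelihood. The sketch below uses the hmmlearn package; the covariance type and number of training iterations are our assumptions, not parameters reported in the paper.

```python
import numpy as np
from hmmlearn import hmm

CLASSES = ["Street", "Restaurant", "Lecture", "Conversation", "Other"]

def train_models(segments_by_class):
    """segments_by_class: dict class -> list of (segment_len, n_features) arrays."""
    models = {}
    for cls, segments in segments_by_class.items():
        X = np.vstack(segments)
        lengths = [len(s) for s in segments]
        m = hmm.GaussianHMM(n_components=2, covariance_type="full", n_iter=20)
        m.fit(X, lengths)                 # one 2-state HMM per scene class
        models[cls] = m
    return models

def classify(models, segment):
    """Pick the class whose HMM gives the segment the highest log-likelihood."""
    return max(models, key=lambda cls: models[cls].score(segment))
```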
3.3.1 Results and Discussion
Figure 4 shows the recognition scores for six different feature sets: the first plot for all features, then a reduced set, three sets of PCA coefficients, and one set that has been transformed using LDA. Each plot shows the recognition scores as a function of the sampling rate and the length of the classification segment. The data for the lower sampling rates was obtained by down-sampling the original data. In the Reduced Set the inter-feature correlation was used to discard features that highly correlated with other features (onsetm, MLFS, tonality, variance, total power). The best results of 88.6% for 'raw' features and 94.4% for the LDA-transformed features show that it is indeed possible to classify the auditory scene using this approach. Too many features seem to confuse the classifier: the recognition scores for the reduced feature set are more than 18% better than the scores for all features. There seems to be a trade-off between quick classification and required sampling rates. In general, the higher the sampling rate and the longer the classification segment, the better the classification score. So, if fast classification is needed, high sampling rates seem to be necessary, and if low sampling rates are desirable, e.g. due to power constraints, longer classification segments are needed to obtain the same classification results. Comparing the different sets of PCA coefficients, we note that 10 PCA coefficients do not cover a sufficient portion of the variance: the recognition scores are up to 8.5% lower than for 15 PCA coefficients. Increasing the number of coefficients to 20 again confuses the classifier, and the classification rate drops by up to 5.4%. The recognition scores for the LDA-transformed features are by far the best. With a peak performance of 94.4%, the results are on average 11% better than for the reduced set of features. Furthermore, the results are more stable with lower sampling rates and shorter classification windows. Using LDA, classification rates of more than 90% can be obtained with only a 5 kHz sampling rate and a 5-second classification segment.
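The down-sampling mentioned above can be reproduced with a polyphase resampler; the concrete rate factors below are our assumptions, chosen to match the sampling rates shown in Figure 4.

```python
from scipy.signal import resample_poly

def downsample(audio_44k1, target_rate):
    """Down-sample 44.1 kHz audio to one of the lower rates used in Figure 4."""
    factors = {22050: (1, 2), 11025: (1, 4), 5000: (50, 441)}   # (up, down)
    up, down = factors[target_rate]
    return resample_poly(audio_44k1, up, down)

# e.g. audio_5k = downsample(audio_44k1, 5000)
```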
3.4 Summary
Audio offers many rich opportunities to extract meta-information for later retrieval. The spectrum ranges from speech recognition and keyword spotting, over speaker identification and speaker change detection, to classifying the auditory scene of the user. We have presented an algorithm to classify the user's auditory scene and investigated the questions of features and their selection, required sampling rates, and necessary classification window lengths. In general there is a trade-off between sampling rate and classification window length. Even though the best classification rates are achieved using LDA-transformed features, good classification performance (88.6%) can be achieved with 'raw', i.e. untransformed, features.
4 Interruptability from Multiple Sensors
The contexts investigated in Sections 2 and 3 are quite low-level. We have also investigated the extraction of higher-level information that cannot be inferred directly from a single sensor modality, namely the interruptability of the user. We introduce a model of human interruptability and show a way to infer the interruptability from multiple sensors. While the interruptability as such is of peripheral interest for retrieval, it is a good case study for the extraction of higher-level information. We model the interruptability of the user and the environment separately. Figure 5 shows the interruptability of the user, referred to as Personal Interruptability, and the interruptability of the environment, referred to as Social Interruptability, along the axes of a two-dimensional space. This space allows us to classify the user's interruptability for different situations, as indicated in Figure 5. The space can also be used to select notification modalities by assigning modalities to different regions within the space (see [26] for a more detailed description). Thus, notification modalities that only notify the user but not the environment, such as vibrating devices or head-mounted displays, can be modeled. We have conducted a user study which shows that there is only little correlation between the two axes and that both axes are thus required [27]. After discussing related work, we introduce an algorithm to estimate the user's Social and Personal Interruptability. We present three extensions to that algorithm that make it scalable to large amounts of data, easy to adapt to new sensors and contexts, and capable of online adaptation. We present an evaluation of the algorithm on two data sets consisting of acceleration, audio, and location data with a maximum length of two days.
Figure 5. The User's Social and Personal Interruptability for Different Situations. Both axes range from 'don't disturb' over 'interruption ok' to 'interruption no problem'; example situations include Lecture, Boring Talk, Driving a Car, Having a Coffee, Restaurant, and Bar.
4.1 Related Work on Interruptability
It is well accepted in the literature that interruptions decrease human performance; see e.g. Cutrell et al. [28] for a study on instant messaging interruptions or Hudson et al. [29] for a study on interruptions in the daily lives of research managers. Active management of notifications has been done for very specialized scenarios such as military command and control [30], and first applications for desktop-based interruption management are being published. Currently much work focuses on estimating the interruptability of the user, in desktop settings by Fogarty et al. [31] and in mobile settings by Kern et al. [27]. There are many ways to model interruptability: [31] uses a binary interruptible yes/no scheme, while [27] uses a more complex two-dimensional scheme which distinguishes between interrupting the user and interrupting his environment. McCrickard et al. [32] have introduced a model that is not driven by sensing and classifies interruptions according to the user's interruption, the required reaction, and the comprehension of the underlying event.
4.2 Automatic Recognition from Multiple Sensors
An intuitive approach to estimating the user's interruptability would be to recognize situations such as those depicted in Figure 5 and infer the interruptability from them. However, many situations do not give a precise enough hint about the user's interruptability. For example, the situations Lecture and Boring Talk are actually part of a continuum that covers the entire left third of the space. Similarly, the situations Restaurant, Having a Coffee, and Bar belong to a continuum that covers the diagonal of the space (see Figure 5). Increasing the granularity of the situations would of course help, but introduces so many special cases that the problem becomes intractable. Thus, we estimate the interruptability directly from the output of low-level context sensors. A context sensor is an algorithm that gives information about a basic context, such as the user's physical activity or the auditory scene as presented in Sections 2 and 3. The basic idea of our recognition approach is that for every low-level context sensor we can give a Tendency of where the interruptability is likely to be. For example, for a Lecture Hall context sensor (e.g. based on audio) we know that the interruptability is likely in the lower-left-hand corner of the space, because neither the user nor the environment should be interrupted. Since we can never know such low-level contexts for sure, different context sensors might be inconsistent or even contradictory. Since context sensors are actually recognition sub-systems, we get a recognition score l(s, t) for every context sensor s and reading at time step t. To incorporate all available information, we weight the Tendencies Ts(x, y) ((x, y) being the two directions in the interruptability space) with the recognition scores of the respective sensors l(s, t). We then sum the weighted Tendencies, Intr(x, y, t) = Σs Ts(x, y) · l(s, t), and obtain a likelihood map over the interruptability space, in which we search for the maximum. Figure 6 shows an example of the procedure for four different context sensors. We performed a first experimental evaluation of this approach. We recorded 37 min of data (audio, one 2D accelerometer above the knee, location from the strongest WLAN Access Point). The audio was classified into Street, Restaurant, Conversation, Lecture Hall, and Other, and the acceleration data into Sitting, Standing, Walking, and Up/Downstairs, using the methods presented in Sections 3 and 2 respectively. Audio and acceleration were annotated online on an attached PDA; the interruptability ground truth was annotated directly after the recording. We modeled the Tendencies as uni-modal Gaussians and set their values by hand. Figure 7(a) shows the Tendencies we used. Because we want to use this information for modality selection, we accept an estimate as correct if its error is smaller than half the distance between the grid lines in Figure 5. Using this configuration we achieved a classification rate of 86% for the Social Interruptability and 96.2% for the Personal Interruptability (see Table 1).
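The estimation step can be sketched directly from the formula above: each context sensor contributes a Gaussian Tendency over the interruptability space, the Tendencies are weighted with the recognition scores l(s, t) and summed, and the maximum of the resulting map is taken as the estimate. The grid resolution, the Gaussian parameters, the placement of the example Tendencies, and the axis assignment are our assumptions.

```python
import numpy as np

GRID = 60                                    # resolution of the interruptability space
xs, ys = np.meshgrid(np.linspace(0, 3, GRID), np.linspace(0, 3, GRID))

def gaussian_tendency(mu, sigma=0.5):
    """Uni-modal Gaussian Tendency Ts(x, y) centred at mu = (social, personal)."""
    return np.exp(-((xs - mu[0]) ** 2 + (ys - mu[1]) ** 2) / (2 * sigma ** 2))

# Hand-set Tendencies for some low-level contexts (illustrative positions only).
tendencies = {
    "audio:lecture_hall": gaussian_tendency((0.3, 0.3)),   # disturb neither user nor environment
    "acc:sitting":        gaussian_tendency((1.5, 1.5)),
    "acc:walking":        gaussian_tendency((2.2, 2.2)),
}

def estimate_interruptability(scores):
    """scores: dict context sensor s -> recognition score l(s, t) at time t."""
    intr = sum(tendencies[s] * l for s, l in scores.items())   # Intr(x, y, t)
    iy, ix = np.unravel_index(np.argmax(intr), intr.shape)
    return xs[iy, ix], ys[iy, ix]            # (social, personal) estimate

# social, personal = estimate_interruptability({"audio:lecture_hall": 0.9,
#                                               "acc:sitting": 0.8})
```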
Figure 6. Automatic Recognition of Interruptability from Low-Level Contexts: the Tendencies Ts(x, y) of four example context sensors (Location: Lecture Hall; Activity: Sitting; Activity: Standing; Social Activity: Conversation) are weighted with their sensing scores l(s, t) at time t (e.g. 1.0, 0.1, 0.9, 0.8) and summed to form the interruptability estimate Intr(x, y, t) = Σs Ts(x, y) · l(s, t).
4.3 Scalable Interruptability Estimation
Even though the results presented in the last section are very promising, this approach does not scale well: hand-crafting the Tendencies and training the classifiers for the low-level contexts is very tedious and labor-intensive. This makes adding new context sensors very hard. Furthermore, the Tendencies need not be uni-modal as we have modeled them. These limitations restrict the extensibility of the system considerably; it is, for example, not possible to re-train the system during run-time. To counter these shortcomings we have modified the algorithm in two ways: firstly, we changed the representation of the Tendencies to allow for multi-modality and learning, and secondly, we used clustering to find significant low-level contexts automatically. We divide the Tendencies into bins along a fixed-resolution grid. This allows for a natural representation of multi-modal Tendencies. We use essentially the same formula for estimating the interruptability, but sum all bins independently: Intr(x̂, ŷ, t) = Σs Ts(x̂, ŷ) · l(s, t), where (x̂, ŷ) is the discrete two-dimensional bin number. For training, the values of l(s, t) are known and the Ts(x̂, ŷ) are to be estimated. For every training example and bin we obtain a linear equation. Given a set of training examples we can solve this over-determined problem per bin using least squares. Figure 7(b) shows the Tendencies that were thus obtained for the same low-level contexts as before. The assumption of uni-modality obviously does not hold. The recognition results, at 92.7%/97.75% for the Social and Personal Interruptability respectively, are much better than before (see Table 1, second row). The second modification finds low-level contexts automatically from the data by clustering the data using k-means instead of using classifiers that are trained in a supervised manner. Although this slightly reduces the recognition score to 91%/97.4% for the Social and Personal Interruptability respectively (see Table 1, third row), this small performance penalty is acceptable given that the results were obtained completely unsupervised.
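The following is a sketch of the two extensions under our own assumptions about the data layout: low-level contexts are found by k-means on the raw feature vectors, soft cluster memberships serve as the scores l(s, t), and each bin of the discretised Tendencies is solved for by least squares from continuously annotated training data (using a one-hot target per training example, which is one possible instantiation of the linear equations described above).

```python
import numpy as np
from sklearn.cluster import KMeans

BINS = 6  # resolution of the discretised interruptability space (6x6)

def context_scores(features, n_clusters=20):
    """Unsupervised low-level contexts: one score per cluster and time step."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    d = km.transform(features)                      # distances to cluster centres
    w = np.exp(-d)
    return w / w.sum(axis=1, keepdims=True)         # soft scores l(s, t)

def learn_tendencies(scores, intr_bins):
    """scores: (T, S) matrix of scores l(s, t); intr_bins: (T,) flat bin index of
    the annotated interruptability. Solve Intr(bin, t) = sum_s Ts(bin) * l(s, t)
    per bin with least squares."""
    T, _ = scores.shape
    targets = np.zeros((T, BINS * BINS))
    targets[np.arange(T), intr_bins] = 1.0          # one-hot target (assumption)
    tend, *_ = np.linalg.lstsq(scores, targets, rcond=None)   # (S, BINS*BINS)
    return tend

def estimate(scores_t, tendencies):
    """Return the flat bin index with the highest summed, weighted Tendency."""
    return int(np.argmax(scores_t @ tendencies))
```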
Figure 7. Hand-crafted and automatically learned Tendencies for the low-level contexts derived from acceleration (Sitting, Standing, Walking, Upstairs, Downstairs), audio (Conversation, Restaurant, Lecture, Street, Other), and location (Office, Lab, Lecture Hall, Cafeteria, Outdoor): (a) hand-crafted Gaussian Tendencies (white denotes 0, black 0.2); (b) automatically learned Tendencies on a 30x30 grid (white denotes -0.22, black 0.4).
Table 1. Comparison of Recognition Scores for 3 Algorithmic Variants
Variant | Social Inter. | Personal Inter.
Hand-crafted Tendencies, low-level contexts from supervised training | 86.0% | 96.2%
Learned Tendencies (6x6 bins, 5-fold cross-validated), low-level contexts from supervised training | 92.72% | 97.75%
Learned Tendencies (6x6 bins, 5-fold cross-validated), low-level contexts automatically found (20 clusters) | 91.0% | 97.4%
Figure 8. Wrist-Mounted Display for Data Annotation: (a) the wrist-mounted display used for data annotation; (b) screenshot of the application on the WMD — the interruptability space is used for annotation, the buttons for controlling the recording.
4.4 Experimental Evaluation and Discussion
We have conducted two experiments to validate that our approach works in real-world settings. In the first we recorded 3.5 hours of continuously annotated data. Since continuous annotation can add a systematic bias to the data, we recorded another set of two days in which we obtained ground truth using experience sampling.
Experimental Setup. For recording long stretches of data from acceleration, audio, and location sensors we have modified the platform presented in Section 2.2. We have exchanged the laptop for a Charmed wearable PC, added a microphone for recording the audio, and exchanged the iPAQ for a Wrist-Mounted Display (see Figure 8). This allows for easy on-the-fly annotation of the data.
We used 10 cepstral coefficients as audio features, computed over 30 ms windows and averaged over one second. The average and variance over one second were used as features for every acceleration sensor. The strongest WLAN Access Point was sampled as a location estimate every second. These three vectors were concatenated to obtain a feature vector every second. The Tendencies have a resolution of 6x6 bins. We used 50 automatically found contexts. All results are 5-fold cross-validated.
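The per-second feature vector described above can be assembled as in the following sketch; the one-hot encoding of the WLAN access point is our assumption rather than a detail given in the text.

```python
import numpy as np

def second_feature_vector(cepstral_10, acc_second, ap_id, known_aps):
    """Concatenate audio, acceleration, and location features for one second.

    cepstral_10: 10 averaged cepstral coefficients for this second
    acc_second:  (n_samples, n_acc_channels) raw acceleration of this second
    ap_id:       identifier of the strongest WLAN access point
    known_aps:   ordered list of access point identifiers
    """
    acc_feats = np.concatenate([acc_second.mean(axis=0), acc_second.var(axis=0)])
    location = np.zeros(len(known_aps))
    location[known_aps.index(ap_id)] = 1.0     # one-hot AP encoding (assumption)
    return np.concatenate([cepstral_10, acc_feats, location])
```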
4.4.1 Continuously Annotated Data
For the first experiment we recorded 3.5 h of continuously annotated data. The data set consists of several walks within and between buildings of ETH Zurich, conversations with colleagues, a bike ride, working in an office, and a visit to a restaurant. Figure 9 shows the recognition scores for all single sensor modalities and their combinations; all acceleration sensors count as a single modality. The recognition results are very similar to the ones obtained with the small set from Section 4.2: the score for the Social Interruptability is 4.5% higher (97.2% instead of 92.7%), while the Personal Interruptability score has dropped by about the same amount (90.5% instead of 97.7%). Even though the score for the Personal Interruptability is lower, these results are very good, considering that the data is considerably longer (3.5 h instead of 37 min) and more complex. The recognition of the Personal Interruptability seems harder than that of the Social Interruptability. We believe that the Personal Interruptability changes more often and is thus harder to recognize. Furthermore, many changes in the Personal Interruptability might not be reflected in the output of the sensors we used. Biometric sensors such as galvanic skin response, heart rate, etc. could alleviate this problem. When comparing the different sensor modalities, acceleration clearly performs best, with 96.3%/89.5% for the Social and Personal Interruptability respectively. Audio performs relatively poorly, with recognition scores of only 77.9%/76.1%. Interestingly, location performs well (90.3%) for the Social Interruptability but only about as well as audio for the Personal Interruptability (75.7%). While the combination of all sensors performs best, the combination of acceleration and audio performs nearly as well. Given that the position estimate we used is quite coarse, it would be interesting to see whether better results can be achieved with a finer positioning system.
Figure 9. Comparison of sensor combinations ('acceleration' implies all 12 sensors).
4.4.2 Experience Sampling
For the second experiment we recorded two entire days (8 h each) of data. We sampled the user's interruptability by interrupting him 2-3 times per hour using an audio alarm and asking him to annotate his current interruptability (on the Wrist-Mounted Display shown in Figure 8). We thus obtained 54 interruptability samples. During the course of the two days, 2.5 hours were annotated continuously as training data for the Tendencies. The data set mainly consists of working on a computer, walking to downtown Zurich, and eating in fast-food restaurants. The Social Interruptability was correctly classified in 53 out of 54 samples (98.1%). However, for the Personal Interruptability only 44 samples (81.5%) were correctly classified. 9 of the 10 misclassified samples belonged to the same activity (namely 'having a coffee') and in fact constitute all occurrences of that activity in the data set. This suggests that this activity was under-represented in the training set. Adapting the model online by re-learning low-level contexts and Tendencies could alleviate this problem.
4.5 Summary
With the ever more pervasive use of communication technology, managing interruptions is an important problem. We have introduced an algorithm to estimate the user's Social and Personal Interruptability from multiple sensors. Our experimental evaluation shows that the algorithm achieves 97.2%/90.5% for the Social and Personal Interruptability respectively.
5 Conclusion and Outlook
Creating data in context is a powerful concept for data retrieval. Since humans remember associatively, storing the context in which particular data was created or captured allows the information to be retrieved later. We have presented three possibilities for creating such meta-information: the user's physical activity (e.g. sitting, standing, walking, or shaking hands) from acceleration sensors, his social environment (e.g. in a restaurant or lecture) from audio, and his interruptability from multiple sensors. We have evaluated each of these to validate the approaches. Since no single sensor modality can provide all context information, a multi-sensor approach is necessary in order to capture the richest possible set of information. Obviously many issues remain to be addressed. The practical use of such meta-information on a large scale needs to be investigated. Also, the classes of context information we chose are closely bound to what is possible with current technology. User tests are required to determine classes that are generally useful for retrieval. As the context cues are always uncertain, the presentation of and user interaction with this kind of information should be investigated. In the context of the interruptability estimation, we have automatically found 'interesting' low-level contexts in
the available data. The correspondence of these with the actual context is an interesting research challenge.
[11] J. Mäntyjärvi, J. Himberg, and T. Seppänen. Recognizing human motion with multiple acceleration sensors. In Proc. Systems, Man and Cybernetics, pages 747–752, 2001.
References
[12] L. Bao and S. Intille. Activity recognition from userannotated acceleration data. In Proc. Pervasive, pages 1–17, Vienna, Austria, April 2004. Springer-Verlag Heidelberg: Lecture Notes in Computer Science.
[1] K. Aizawa, D. Tancharoen, S. Kawasaki, and T. Yamasaki. Efficient retrieval of life log based on context and content. In Proc. 1st ACM workshop on continuous archival and retrieval of personal experiences, pages 22–31, 2004.
[13] K. van Laerhoven, K. Aidoo, and S. Lowette. Real-time analysis of data from many sensors with neural networks. In Proc. ISWC, pages 115–123, 2001.
[2] A. Schmidt, M. Beigl, and H-W. Gellersen. There is more to context than location. In Computers & Graphics Journal, volume 23, pages 893–902, 1999.
[14] K. van Laerhoven, A. Schmidt, and H.-W. Gellersen. Multi–sensor context–aware clothing. In Proc. ISWC, pages 49–57, Seattle, USA, October 2002.
[3] P. Lyman, H. R. Varian, K. Swearingen, P. Charles, N. Good, L. Jordan, and J. Pal. How much information 2003? Technical report, UC Berkeley, 2003. http://www.sims.berkeley.edu/research/projects/howmuch-info-2003/.
[15] K. van Laerhoven and H-W. Gellersen. Spine vs. porcupine: a study in distributed wearable activity recognition. In Proc. ISWC, pages 142–149, Arlington, USA, Nov 2004.
[4] S. Mann. Continuous lifelong capture of personal experience with EyeTap. In Proc. 1st ACM workshop on continuous archival and retrieval of personal experiences, 2004.
[16] Th. Degen, H. Jaeckel, M. Rufer, and S. Wyss. Speedy: A fall detector in a wrist watch. In Proc. ISWC, pages 184–188, White Plains, NY, USA, October 2003.
[5] S. N. Patel and G. Abowd. The ContextCam: Automated Point of Capture Video Annotation. In Proc. Ubicomp, volume 3205 of Lecture Notes in Computer Science, pages 301–318, Nottingham, United Kingdom, October 2004. Springer.
[17] M. Akay, M. Sekine, T. Tamura, Y. Higashi, and T. Fujimoto. Unconstrained monitoring of body motion during walking. IEEE Engineering in Medicine and Biology Magazine, 22(3):104–109, 2003.
[18] N. Kern, B. Schiele, and A. Schmidt. Multi-sensor activity context detection for wearable computing. In Proc. EUSAI, LNCS, volume 2875, pages 220–232, Eindhoven, The Netherlands, November 2003.
[6] P. Holleis, M. Kranz, M. Gall, and A. Schmidt. Adding context information to digital photos. In Proc. 5th Intern. Workshop on Smart Appliances and Wearable Computing (IWSAWC), Columbus, Ohio, United States, June 2005.
[19] A. Schmidt. Ubiquitous Computing - Computing in Context. PhD thesis, University of Lancaster, 2002.
[7] B. Schiele, N. Oliver, T. Jebara, and A. Pentland. An interactive computer vision system, DyPERS: Dynamic personal enhanced remembrance system. In Proc. ICVS, pages 51–65, Jan 1999.
[20] M. Beigl, T. Zimmer, A. Krohn, C. Decker, and P. Robinson. Smart-Its - Communication and Sensing Technology for UbiComp Environments. Technical Report ISSN 1432-7864, TeCo, University Karlsruhe, Germany, 2003/2.
[8] N. Kern, B. Schiele, H. Junker, P. Lukowicz, and G. Tröster. Wearable sensing to annotate meeting recordings. In Proc. ISWC, pages 186–193, October 2002.
[21] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa. Computational auditory scene recognition. In Proc. ICASSP, pages 1941–1944, 2002.
[9] J. Farringdon, A. Moore, N. Tilbury, J. Church, and P. Biemond. Wearable sensor badge & sensor jacket for context awareness. In Proc. ISWC, pages 107–113, San Francisco, 1999.
[22] B. Clarkson, N. Sawhney, and A. Pentland. Auditory context awareness via wearable computing. Workshop on Perceptual User Interfaces, November 1998.
[10] C. Randell and H. Muller. Context awareness by analysing accelerometer data. In Proc. ISWC, pages 175–176, Atlanta, GA, USA, Oct 2000.
[23] M. Büchler. Algorithms for Sound Classification in Hearing Instruments. PhD thesis, ETH Zurich. Diss. No. 14498, 2002.
[24] P. Korpipää, M. Koskinen, J. Peltola, S-M. Mäkelä, and T. Seppänen. Bayesian approach to sensor-based context awareness. Personal and Ubiquitous Computing, 7(2):113–124, 2003.
[25] P. Lukowicz, J. Ward, H. Junker, M. Stäger, G. Tröster, A. Atrash, and T. Starner. Recognizing workshop activity using body-worn microphones and accelerometers. In Proc. Pervasive, pages 18–32, Vienna, Austria, April 2004. Springer-Verlag Heidelberg: Lecture Notes in Computer Science.
[26] N. Kern and B. Schiele. Context-aware notification for wearable computing. In Proc. ISWC, pages 223–230, White Plains, NY, USA, October 2003.
[27] N. Kern, S. Antifakos, B. Schiele, and A. Schwaninger. A model of human interruptability: Experimental evaluation and automatic estimation from wearable sensors. In Proc. ISWC, pages 158–165, Washington DC, USA, Nov 2004.
[28] M. Cutrell, M. Czerwinski, and E. Horvitz. Notification, disruption, and memory: Effects of messaging interruptions on memory and performance. In Proc. of Interact, pages 263–269, 2001.
[29] J.M. Hudson, J. Christensen, W.A. Kellogg, and T. Erickson. 'I'd Be Overwhelmed, But It's Just One More Thing to Do': Availability and Interruption in Research Management. In Proc. ACM CHI, pages 97–104. ACM Press, 2002.
[30] R. W. Obermayer and W. A. Nugent. Human-Computer Interaction (HCI) for alert warning and attention allocation systems for the multi-modal watchstation (MMWS). In Proceedings of SPIE, volume 4126, 2000.
[31] J. Fogarty, S. E. Hudson, and J. Lai. Examining the robustness of sensor-based statistical models of human interruptability. In Proc. ACM CHI, pages 207–214, 2004.
[32] D. Scott McCrickard, Richard Catrambone, C.M. Chewar, and J. T. Stasko. Establishing tradeoffs that leverage attention for utility: Empirically evaluating information display in notification systems. International Journal of Human-Computer Studies, 8(5):547–582, May 2003.