Practical Experience Recording and Indexing of Life Log Video

Datchakorn Tancharoen, Dept. of Electronic Engineering, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan, Phone: (+81)-4-7136-3888, [email protected]
Toshihiko Yamasaki, Dept. of Frontier Informatics, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan, Phone: (+81)-4-7136-3888, [email protected]
Kiyoharu Aizawa, Dept. of Frontier Informatics, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan, Phone: (+81)-4-7136-3888, [email protected]
ABSTRACT
This paper presents an experience recording system and proposes practical video retrieval techniques based on content and context analysis of Life Log data. We summarize our effective indexing methods, including content-based talking scene detection and context-based key frame extraction from GPS data. Voice annotation and its detection are proposed as a practical indexing method. Moreover, we apply an additional body sensor to record our lifestyle and analyze physiological data for the Life Log retrieval system. In the experiments, we demonstrate various video indexing results, which provide semantic key frames, and Life Log interfaces that index and retrieve our life experiences effectively.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Storage and Retrieval - Information Search and Retrieval

General Terms
Human Factors

Keywords
Life Log, video retrieval, content, context, wearable computing.

1. INTRODUCTION
Nowadays, many people carry personal digital cameras and video camcorders to record their experiences, since digital imaging devices have become affordable and portable. However, they may still miss interesting experiences in their life because a recording device is not ready all the time. For this reason, we have developed a wearable video system that captures continuous video together with various environmental signals synchronously. The wearable video system records both audio/visual information and environmental data, including location, body movement, and feelings, using a GPS receiver, motion sensors, and a brain wave analyzer [1]. These data were transferred to a personal notebook computer. We have continuously developed not only the recording but also the retrieval side of what we call the Life Log system. Logging our life produces a very large amount of captured data, so efficient retrieval techniques are needed to navigate our experiences. In our previous studies, we applied context information derived from the wearer's brain waves as retrieval keys to extract moments of interest [1]. Audio/visual information was used as content to detect conversation scenes, and GPS data served as context to extract spatiotemporal key frames by time and distance sampling [2]. In [3], a novel concept for retrieving Life Log video by integrating various content and context features was introduced.

There are some related works on capturing and retrieving life experiences. In [4], the user's context, such as location, encounters with other people, and certain activities, was stored and used as retrieval keys. In [5], capture with a wearable camera and PC was developed without consideration of retrieval. In [6], the user's real-time physiological reactions were used as triggers to switch a wearable camera on and off; skin conductivity, heart rate, respiration rate, and muscle activity were also used for key detection. In [7], Life Log video scenes were classified according to events detected by analyzing the data from a wearable camera and a microphone. Moreover, the SenseWear armband produced by BodyMedia [8] provides useful data for analyzing a person's lifestyle, including motion data, heat flux, galvanic skin response (GSR), and skin temperature [9]. It is therefore beneficial to use this device to record our experiences.

In this study, we present our Life Log system in terms of recording and retrieval. We replaced the previous notebook computer with a compact touch-screen computer and associated devices, including a mini wearable camera, a sensitive USB microphone, and body sensors, to record our experiences. The body sensor armband can capture nearly all of our experiences because it is convenient to wear almost all the time; it records both physical activities and physiological data. The Life Log capturing system records the interesting events whose experiences we want to retrieve and view. In this paper, we summarize effective Life Log video retrieval techniques, introduce a practical indexing method based on voice annotation, and demonstrate preliminary analysis and retrieval of Life Log data based on body sensor features.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CARPE’05, November 11, 2005, Singapore. Copyright 2005 ACM 1-59593-246-1/05/0011…$5.00.
The development of the Life Log system is presented in Section 2. Effective video indexing methods, including content- and context-based techniques, are explained in Section 3. In Section 4, the body sensor armband and its features useful for the retrieval system are introduced. Experiments with the practical voice annotation technique and preliminary results from body sensor features are demonstrated in Section 5. The last section concludes the paper.
2. LIFE LOG SYSTEM DEVELOPMENT
Our personal Life Log system was created to record life experiences in the form of multimedia information. The original system used an optical wearable camera, a small line-in microphone, gyro and acceleration sensors, and a GPS receiver connected to a notebook PC, as shown in Figure 1(a). The current Life Log system was redesigned for comfortable wearable use: it contains a compact personal touch-screen computer with a mini wearable camera and USB microphone, a motion sensor, a GPS receiver, and a body sensor armband, as demonstrated in Figure 1(b). The compact touch-screen computer is easy to carry anywhere, anytime, to store our experience data. The mini wearable camera and sensitive USB microphone capture audio/visual information, and the GPS receiver captures the corresponding location while video is being recorded. The body sensor armband was introduced to capture as much experience as possible because it is convenient to wear almost all the time. The armband contains acceleration sensors for analyzing physical activities and replaces the previous motion sensor, which was attached to a cap. Moreover, it records physiological data such as skin conductivity, heat flux, and skin temperature. The various data are transmitted through USB and PCMCIA ports to the compact personal computer and analyzed by the Life Log retrieval system. The block diagram of the Life Log data stream is shown in Figure 2.

Figure 1. Our Life Log system (a) Original system (b) New investigated system.

Figure 2. Block diagram of current Life Log data stream.

3. VIDEO INDEXING METHODOLOGY
There are two fundamental video summarization styles: representing the video with moving pictures or with still images. There are significant differences between the two: a still-image summary can be built faster and displayed as a storyboard, whereas a moving-image summary conveys the video more naturally. We combine both advantages in our Life Log summarization through key frame extraction, in which each key frame represents the moving video at a certain time. We can play the desired video by selecting an extracted key frame and see the corresponding location on a GPS map. The video retrieval dialog is shown in Figure 3. A preferred indexing method can be selected through this interface, and the retrieval system provides multiple retrieval results for the date and time range in which we choose to view our experiences; the representative key frames in the results depend on the selected indexing method. The following techniques serve this video retrieval system.

Figure 3. Video retrieval and indexing interfaces.

3.1 Context-based Extraction
We analyze context from GPS data, including latitude, longitude, speed, direction, and relative time. Key frames can be extracted by time sampling, distance sampling, speed detection, and direction change information [2].
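The time and distance sampling described above can be sketched as follows. This is a minimal illustration: the step sizes and the flat-earth distance approximation are our own choices, not values from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class GpsPoint:
    t: float    # seconds since the start of recording
    lat: float  # latitude in degrees
    lon: float  # longitude in degrees

def sample_key_frames(points, time_step=60.0, dist_step=100.0):
    """Emit a key-frame timestamp whenever at least `time_step` seconds
    or roughly `dist_step` metres have passed since the last key frame."""
    if not points:
        return []
    keys = [points[0].t]
    last = points[0]
    for p in points[1:]:
        # crude equirectangular distance in metres (adequate for short hops)
        dx = (p.lon - last.lon) * 111_320 * math.cos(math.radians(last.lat))
        dy = (p.lat - last.lat) * 110_540
        if p.t - last.t >= time_step or math.hypot(dx, dy) >= dist_step:
            keys.append(p.t)
            last = p
    return keys
```

Speed detection and direction change tests could be added in the same loop by comparing consecutive displacement vectors.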
3.2 Content-based Extraction
Content information is acquired from the audio/visual data recorded by the microphone and the wearable camera. A talking scene is an example where audio/visual content can be applied to video indexing. We applied voice detection, which considers the characteristics of human speech, together with skin-color-based face detection to find faces in view [3]. Face talking scene detection combines the two to increase talking detection accuracy: we assume that a talking scene with a visible face carries a more important conversation, so a detected scene must satisfy both voice detection and face detection. Key frames from face talking scene detection are demonstrated in Figure 4.

3.3 Adaptive Key Frame Extraction
Generally speaking, content-based processing is computationally more expensive than context-based processing. Therefore, context is applied to extract key frames in ordinary traveling scenes, while content is used to extract interesting key frames in specific events and when the GPS signal is unavailable. GPS availability is determined by analyzing the GPS data: if the signal is receivable, we apply context-based extraction from GPS data; otherwise we extract key frames based on audio/visual content. We can apply voice detection, view detection, speed detection, direction change, distance sampling, and time sampling; the extracted key frames are then checked against a time interval to preserve the overall information. In Figure 5, we demonstrate key frames extracted by adaptive sampling using voice detection, speed detection, direction change, and time sampling. The extracted results contain satisfying key frames: speed detection and direction change provide semantic scenes, such as stopping at a crossing or slowing down to look at an interesting place, and voice annotation offers additional hints.

3.4 Voice Annotation
Annotation is very useful in the indexing and retrieval process. We can annotate interesting events and mark desired scenes; during Life Log capture, it is convenient to mark interesting experiences with voice annotations.

Voice annotation differs from background noise in that it occupies a dominant frequency band and has discontinuous signal power, whereas background noise has smooth, continuous power over time. Annotations can be short or long sentences. In our Life Log capture, a small microphone is attached to the user's collar, so the user's voice power is quite high compared with other sounds. The important point is to account for the background noise power, so an adaptive threshold derived from the background noise is applied. With this method, we can detect voice key frames in the Life Log video and synchronize them with the video data and the corresponding GPS signal, as demonstrated in Figure 6. The proposed video indexing method using voice annotation works as follows. The audio signal is processed separately to index the user's voice. First, the signal is low-pass filtered according to the sampling rate (fs) and a cutoff percentage (pc) to retain the low-frequency voice band.
f_c = (f_s / 2) \cdot p_c

where f_c is the cut-off frequency of the low-pass filter. The filtered signal f_i is then divided into equal frames of period T, and the power of each frame is calculated:

P_f = \frac{1}{T} \sum_{i=1}^{T} f_i^2

Figure 4. Face Talking Scene Detection.
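A minimal sketch of the framing and power computation above. The code assumes the signal has already been low-pass filtered with cutoff f_c; the argument values are illustrative.

```python
import numpy as np

def frame_power(filtered, fs, frame_ms=125):
    """Split an already low-pass-filtered signal into equal frames of
    `frame_ms` milliseconds and return P_f, the mean power per frame."""
    n = int(fs * frame_ms / 1000)            # samples per frame (T)
    usable = len(filtered) - len(filtered) % n
    frames = np.asarray(filtered[:usable], dtype=float).reshape(-1, n)
    return (frames ** 2).mean(axis=1)        # P_f = (1/T) * sum_i f_i^2
```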
We apply an adaptive threshold for voice detection that takes the background noise power into account. The threshold TH_n is calculated from the audio signal over the previous duration D:

TH_n = \begin{cases} P_{avg} \cdot W_n, & n \le D \\ P_{avg,n} \cdot W_n, & n > D \end{cases}

P_{avg,n} = \frac{1}{D} \sum_{i=1+n-D}^{n} P_i

where P_{avg,n} is the average power over the previous time duration D.

Figure 5. Adaptive Key Frame Extraction.
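A sketch of the adaptive-threshold voice detection built on these equations, together with heuristic clean-up rules of the kind the method uses (bridging short gaps between detections and dropping short utterances). The warm-up behavior before D frames are available and the exact rule shapes are our own assumptions.

```python
import numpy as np

def detect_voice(P, D, W):
    """S_n = 1 iff P_n >= TH_n, where TH_n is W times the average power
    over the previous D frames (average-so-far during warm-up)."""
    P = np.asarray(P, dtype=float)
    S = np.zeros(len(P), dtype=int)
    for n in range(len(P)):
        if n < D:
            avg = P[: n + 1].mean()            # warm-up: global average so far
        else:
            avg = P[n - D + 1 : n + 1].mean()  # sliding window of length D
        if P[n] >= W * avg:
            S[n] = 1
    return S

def smooth(S, gap_frames, min_frames):
    """Heuristic clean-up: bridge gaps shorter than `gap_frames` frames,
    then remove detections shorter than `min_frames` frames."""
    S = list(S)
    i = 0
    while i < len(S):                          # pass 1: bridge short gaps
        if S[i] == 0:
            j = i
            while j < len(S) and S[j] == 0:
                j += 1
            if 0 < i and j < len(S) and (j - i) < gap_frames:
                S[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    i = 0
    while i < len(S):                          # pass 2: drop short utterances
        if S[i] == 1:
            j = i
            while j < len(S) and S[j] == 1:
                j += 1
            if (j - i) < min_frames:
                S[i:j] = [0] * (j - i)
            i = j
        else:
            i += 1
    return S
```

With 125 ms frames, the paper's 5-second gap and 2-second minimum would correspond to `gap_frames=40` and `min_frames=16`.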
W_n is the weight of the threshold relative to the average power; it is also an adjustable parameter that determines the number of desired key frames (a larger value yields fewer key frames).

S_n = \begin{cases} 0, & P_n < TH_n \\ 1, & P_n \ge TH_n \end{cases}

A voice signal (S_n = 1) is detected when the signal power P_n is at least the adaptive threshold TH_n. The detected signal is then corrected with heuristic rules to complete each voice period and remove short utterances. Voice annotation is a practical, computationally light technique that separates the user's voice from environmental sounds using voice power and an adaptive threshold. In addition, it can be combined with context information, such as GPS data, to extract semantic key frames for more efficient video indexing.

4. INTRODUCTION OF THE BODY SENSOR
In our Life Log system, we introduced a body sensor armband that can be worn almost all the time to examine daily life; the wearable video system is used when we want to capture particular events. The various data from the body sensor and the Life Log content and context can be synchronized by relative time and analyzed in the Life Log retrieval system.

4.1 Body Sensor Armband
The SenseWear body sensor armband, shown in Figure 7(a), is designed to collect and analyze a broad range of data from the body and its movement, allowing us to quantify physical activity and energy expenditure. In addition, it carries heat flux and temperature sensors, so it can measure the heat produced by the body through basic metabolism and all forms of physical activity. This combination of multiple sensors gives the SenseWear armband an advantage over earlier devices and overcomes many of their limitations.

The SenseWear armband contains 11 data collection channels and can record physiological data at up to 32 samples per second, as shown in Figure 7(b). The recorded data comprise motion forces measured by an accelerometer (6 channels in total), heat flux (the rate of heat exchanged between the wearer's arm and the outside environment), galvanic skin response (GSR, a measure of the electrical conductivity between two points on the skin), skin temperature, near-body temperature, and a step counter. These features are useful for analyzing a person's physiological activities.
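The 11 channels described above could be modeled as a simple record type for synchronization with the video stream. The field names and groupings are our own illustrative choices, not BodyMedia's data format.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ArmbandSample:
    """One armband sample (recorded at up to 32 Hz): 6 accelerometer
    channels plus 5 scalar channels, i.e. 11 channels in total."""
    t: float                              # timestamp, seconds
    accel: Tuple[float, float, float,
                 float, float, float]     # 6 accelerometer channels
    heat_flux: float                      # heat exchanged with environment
    gsr: float                            # galvanic skin response
    skin_temp: float                      # skin temperature
    near_body_temp: float                 # near-body temperature
    steps: int                            # cumulative step count
```

Keying such records by `t` allows them to be aligned with video frames and GPS fixes on a common timeline.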
Figure 6. Video indexing based on voice annotation (a) Key frames (b) Events on GPS map.

Figure 7. Body sensors and their features (a) Body sensor armband and the position of various sensors (b) Block diagram of 11 data channels in body sensor armband.
5. EXPERIMENTAL RESULTS

5.1 Experiments on Voice Annotation
In this experiment, we examined Life Log video sequences of traveling scenes in daily life. Along the way, we made voice annotations at remarkable places and interesting objects. Three Life Log videos with voice annotations, recorded on the way from the university to home, were examined. The sequences were 25 to 45 minutes long and captured in open environments containing various sounds; annotations were spoken by the user at interesting events during travel. The parameters were set as follows: fc = 0.125 fs for low-pass filtering; the audio signal was divided into frames of T = 125 ms and the power of each frame was calculated; the adaptive threshold used the average signal power over the previous D = 2 seconds; and heuristic rules connected annotations within 5 seconds and removed short utterances of less than 2 seconds as noise.

A video indexing result based on voice annotation is shown in Figure 6. Most of the key frames had semantic meaning thanks to the annotations, which covered buildings, intersections, crossings, specific locations, and other interesting events. Some detected scenes contained sound from cars and other loud noise, but undesired scenes can be ignored or removed through the user interface. Each key frame can also be played as a moving video with its associated location shown on the GPS map; Figure 6 shows, for example, the key frame of a voice annotation about the university library and its location on the map. Conversely, a location can be selected on the map to play the associated video.

Figure 8 shows the evaluation of video indexing based on voice annotation. The number of desired key frames is controlled by the threshold weight W_n. Table 1 demonstrates that as the threshold weight increases, the number of key frames and the recall rate decrease, so both recall and precision must be considered. Figure 8(a) presents the precision and recall for the various weighting factors: choosing a low threshold for the highest recall yields low precision. To maintain recall while keeping acceptable precision, a suitable weight was determined from empirical experiments. In Figure 8(b), the three Life Log sequences were indexed with the weighting factor W = 4; the recall rates were 0.90, 1.00, and 0.86, respectively, while precision stayed above 0.75 in all three sequences. Furthermore, unacceptable key frames can be removed with our Life Log interface so that only the desired events remain.

Table 1. Retrieval results of variable weighting factor for video indexing based on voice annotation.

  Weighting (W)   Detected   Precision   Recall
  W=2             43         0.47        1.00
  W=3             30         0.60        0.90
  W=4             21         0.86        0.90
  W=5             12         0.92        0.55

Figure 8. Evaluation of video indexing based on voice annotation. (a) Variable weighting factor evaluation (b) Video indexing precision and recall rate.

5.2 Preliminary Experiments on Body Sensor
Various data recorded by the SenseWear armband were analyzed to explore our experiences. Physical activities were detected from the motion forces of the transverse and longitudinal accelerometers. The accelerometer is a 2-axis micro-electro-mechanical sensor (MEMS) device that measures motion, and its mean absolute difference (MAD) measures the amount of movement. We used the physiological data from the body sensor to classify physical activities as active or passive movement and synchronized them with the Life Log video using a physical activity detection algorithm [9]. Video indexing results are shown in Figure 9: active scenes appear while riding a bicycle and walking, whereas passive scenes appear while shopping in a supermarket and selecting goods. In addition, we investigated physical activities from the armband by examining the accelerometer MAD for running, walking, shopping, and sitting. This experiment demonstrated clear differences in MAD across activities, as shown in Figure 10(a): running produces a high MAD; walking and shopping have similar MAD levels but different signal periods; and MAD is low while sitting or not moving. These experiments suggest that the features can be used to classify the user's physical activities.

Furthermore, the SenseWear armband was applied to record our experiences not only during physical activities but also during sleep. We can estimate activities such as lying down, sleep,
and deep sleep duration. These activities were analyzed using the average values of the 2-axis (transverse and longitudinal) accelerometers: while lying down, the transverse accelerometer average was close to gravity and the longitudinal average was around zero. Our sleeping activities are shown in Figure 10(b): the top bar marks the lying-down duration and the lower bar the sleeping period. This experiment shows that the user could sleep well after lying down. The SenseWear armband is thus a practical device for recording our experiences from wake-up until sleep.
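The MAD computation and lying-posture test above can be sketched as follows. This is a toy classifier in the spirit of the analysis; the thresholds are illustrative and are not taken from the paper or from [9].

```python
import numpy as np

def mad(signal):
    """Mean absolute difference of successive accelerometer samples."""
    s = np.asarray(signal, dtype=float)
    return float(np.abs(np.diff(s)).mean())

def classify(transverse, longitudinal, g=9.8, active_thresh=1.0):
    """High MAD on either axis -> 'active'; otherwise 'lying' if the
    transverse average is near gravity and the longitudinal average is
    near zero (as observed above); else 'passive'."""
    if max(mad(transverse), mad(longitudinal)) >= active_thresh:
        return "active"
    if abs(np.mean(transverse) - g) < 1.0 and abs(np.mean(longitudinal)) < 1.0:
        return "lying"
    return "passive"
```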
The armband body sensor also provides other useful data, such as heat flux, GSR, and skin temperature. These features are related to a person's subjective feelings and physiological state; analyzing them for experience retrieval is a challenging direction for our future research.

6. CONCLUSION
We have described our Life Log capturing and retrieval system. Various indexing methods for the Life Log retrieval system were presented to retrieve remarkable events from human experience and to demonstrate the importance of Life Log content and context. The experiments showed that voice annotation and talking scene detection retrieve semantic key frames based on audio/visual content, while GPS data and body sensor features serve as context to detect noticeable key events. An efficient combination of content and context from Life Log data is therefore advantageous for a practical human experience recording and retrieval system.

7. ACKNOWLEDGMENT
The authors would like to thank the Japanese Government for support through the Monbukagakusho scholarship, and many colleagues for the database and their helpful discussions.

8. REFERENCES
[1] Aizawa, K., and Ishijima, K. Summarizing Wearable Video. Proc. ICIP 2001, Vol. 3 (Oct. 2001), 398-401.
[2] Aizawa, K., Tancharoen, D., Kawasaki, S., and Yamasaki, T. Efficient Retrieval of Life Log Based on Context and Content. Proc. ACM Workshop CARPE 2004 (Oct. 2004), 22-31.
[3] Tancharoen, D., and Aizawa, K. Novel Concept for Video Retrieval in Life Log Application. Proc. Pacific-Rim Conference on Multimedia (PCM 2004) (Dec. 2004), 915-923.
[4] Lamming, M., and Flynn, M. "Forget-me-not": Intimate Computing in Support of Human Memory. Proc. FRIEND21 '94 Int. Symp. on Next Generation Human Interface (Feb. 1994), 125-128.
[5] Mann, S. "WearCam" (the wearable camera): Personal Imaging System for Long-Term Use in Wearable Computer Mediated Reality and Personal Photo/Video Graphic Memory Prosthesis. Proc. ISWC '98 (Oct. 1998), 124-131.
[6] Clarkson, B., Mase, K., and Pentland, A. Recognizing User Context via Wearable Sensors. Proc. ISWC '00 (Oct. 2000), 69-75.
[7] Healey, J., and Picard, R. W. StartleCam: A Cybernetic Wearable Camera. Proc. ISWC '98 (Oct. 1998), 42-49.
[8] Liden, C. B., and Wolowicz, M. Benefits of the SenseWear Armband over Other Physical Activity and Energy Expenditure Measurement Techniques. BodyMedia White Paper, http://www.bodymedia.com/research/whitepapers.jsp
[9] Krause, A., and Siewiorek, D. P. Unsupervised, Dynamic Identification of Physiological and Activity Context in Wearable Computing. Proc. ISWC 2004, 88-97.
Figure 9. Video Indexing based on physical activity.

Figure 10. Activities based on body sensor features (a) Physical activities (b) Sleep duration.