Context-based Video Retrieval for Life-Log Applications
Kiyoharu Aizawa
The University of Tokyo, Dept. of Frontier Informatics & Dept. of Electrical Engineering
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
[email protected]
Introduction "We want to keep our entire life by video" is the motivation of this research. Personal experiences are usually maintained by such media as diaries, pictures and movies. For example, we memorize our daily experiences in a diary, and we use photos and movies to keep our special events such as travels etc. However, those existing media, so far, keep records of only a small part of our life. Even though we make use of photos and movies, we always miss the best moments because the camera is not always ready. Significant progress has been being made in digital imaging technologies: cameras, displays, compression, storage etc. Small size integration also advances so that computational devices and imaging devices become intimate as wearable computers and wearable cameras [1] . These small wearable devices will provide fully personal information gathering and processing environments. We believe that by using such wearable devices, an entire personal life will be able to be imaged and reproduced as video. Then, we will never miss the moment that we want to maintain forever. Life log will be useful for business purposes as well as private purposes. Personal experiences captured by video, audio and various sensors will be helpful for managing data of performance, functioning, behavior etc of a person. For example, life-log of a person in security business will be beneficial. Content-based vs Context-based Imagine we wear a single camera and continuously record what we see, how huge would be the amount of images that we could capture during 70 years? (The captured video would be first stored in the wearable device and then occasionally moved to a huge storage.) Video quality depends on the compression. Assuming 16 hours per day is captured for 70 years, the amount of video data is listed below. • TV phone quality (64kbps) 11T bytes • VCR quality (1Mbps ) 183 T bytes • Broadcasting quality (4Mbps ) 736 T bytes HDD storage is developing very fast. 100GB HDD is available and not expensive. Even today, if we had 100 of them, we could store 70 years at TV phone quality. Video data is not physically large. However, such a long-term imaging requires very efficient retrieval. For example, imagine we have a year-long or much longer video recording, how can we retrieve scenes that we want to recollect? Manual handling is nonsense because manual operation takes longer than the length of recording. In the field of image processing, content-based image & video retrieval has been intensively investigated. In the framework of content-based retrieval, visual features such as color histogram etc are utilized to retrieve scene [2]. So far, content-based retrieval has been successfully applied to broadcasting TV programs.
On the other hand, in the life-log application the content-based framework is of little use, because we do not have enough query examples to perform content-based retrieval. Instead, for the life-log application, context is more important for retrieving the desired scenes. Context such as when, where, with whom and how can provide better keys for retrieval. We think there are two kinds of context: objective context, such as the time, location and behavior of the person, and subjective context, such as the person's feelings and interests. Both are important for effective retrieval: objective context is very useful for managing the data, and subjective context is very helpful for summarizing highlights. In our previous studies [3,4], we proposed an approach to automatic summarization that makes use of a physiological signal, namely brain waves. Brain waves clearly indicate the state of arousal, and in our experiments we were able to extract all of the scenes that the person reported as interesting. The person was asked to report events that he found interesting during outdoor experiments, and the reported events were fully extracted. In addition, many other scenes were extracted; these could be noise or could reflect unconscious interest. In the next section, we briefly describe our recent development, which makes use of objective context [5].

Wearable Imaging System and Context-Based Video Retrieval

The system is intended to continuously capture the experiences of a person's life. It captures not only audio and video but also various sensor data such as GPS, gyro and motion-sensor signals. (It can accommodate a brain-wave sensor if necessary.) Using the data from the sensors, it can navigate the user to the scene that he wants to see [5]. We also utilize a town directory database; combined with GPS signal detection, this even lets the user search by keyword for the scene of a specific store or place he visited. We describe the capturing system and the retrieval system below.
* Wearable imaging system
Figure 1. Wearable imaging system.

As shown in Fig. 1, we have developed a wearable imaging system that simultaneously records data from a wearable camera, a microphone, a GPS receiver, an acceleration sensor and a gyro sensor. By processing the data from these sensors, the user's context can be extracted appropriately. The acceleration and gyro sensors are attached to the back of the user's head to capture the motion of the camera. The platform of the system is a notebook PC. All sensors are connected to the PC through the Universal Serial Bus (USB), a serial port and PCMCIA slots, and all sensor data are recorded directly onto the notebook PC. The software is written in Visual C++ and runs on Microsoft Windows XP. Video and audio are encoded into MPEG-4 and MP3, respectively, using DirectShow. To simplify the system, we modified the sensors so that they can be powered from the notebook PC's battery, and we customized the device drivers to recognize the sensors used in our wearable imaging system.
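The capture software itself is written in Visual C++ with DirectShow, as described above. Purely as an illustration of the logging principle, the following Python sketch records every sensor stream against a common time base so that it can later be aligned with the encoded video and audio; the sensor read functions and the CSV layout are hypothetical placeholders, not the system's real interfaces.

```python
import csv
import time

# Hypothetical stand-ins for the real sensor drivers (GPS, gyro and
# acceleration sensors attached over USB / serial / PCMCIA in the actual system).
def read_gps():
    return (35.7126, 139.7621)             # (latitude, longitude), or None when no fix

def read_accel_gyro():
    return (0.0, 0.0, 9.8, 0.0, 0.0, 0.0)  # (ax, ay, az, gx, gy, gz)

def log_sensors(path, rate_hz=30, duration_s=10):
    """Write every sensor sample with a shared timestamp so that it can be
    synchronized afterwards with the separately encoded MPEG-4/MP3 streams."""
    period = 1.0 / rate_hz
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["t", "lat", "lon", "ax", "ay", "az", "gx", "gy", "gz"])
        end = time.time() + duration_s
        while time.time() < end:
            t = time.time()                          # common time base for all streams
            lat, lon = read_gps() or (None, None)
            writer.writerow([t, lat, lon, *read_accel_gyro()])
            time.sleep(max(0.0, period - (time.time() - t)))

if __name__ == "__main__":
    log_sensors("sensors.csv", duration_s=5)
```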
Figure 2. System block diagram of the wearable imaging system.

* Event detection
Currently, events such as the category of the person's behavior are detected. The person's behavior is classified into "walking", "running" and "no movement". Instead of video and audio, the gyro and acceleration sensors are used for this purpose; the sensor data are sampled at 30 Hz. An HMM is used for the detection: the HMM is trained beforehand, and the sensor data are processed on line into tags. HMM-based detection has high accuracy: in our experiments, the detection accuracy for walking, running and no movement was 97%, 88% and 87%, respectively. (An illustrative sketch of this kind of classification is given after the feature list below.) Skin color is also detected, and a "face" tag is generated when the size of the skin-color region exceeds a threshold. Tags can also be labelled manually, and comments on scenes can be added manually later.

* Retrieval system: event-based viewer, location-based viewer, combination with a town directory
Audio and video are tagged with the various sensor data and detected events. GPS data are used very efficiently to browse wearable video collections on a map. The video data can then be navigated in various ways.
• Continuous video sequences: a continuously captured sequence is replayed with a slider control.
• Video shots segmented by events, in chronological order: the first frames of the events are shown in chronological order for efficient browsing. Choosing any frame replays the corresponding shot.
• Video sequences placed on a map (location-based retrieval): because the GPS signal is captured simultaneously with the video, video sequences are projected onto a map. The user's traces are plotted on the map, and choosing any part of a trace displays the corresponding part of the video. The user's location in the window is updated as the video is replayed. We found that this location-based viewer is very helpful for accessing a specific part of a video collection.
• Events displayed on a map: the detected events are also plotted on a map. At present, the three behavior tags and the manual labels are shown on the map along the person's traces.
• Improving the readability of the GPS code: the GPS code consists of latitude and longitude, which are not very readable. By referring to a database, the GPS code is therefore converted into a postal address. The postal address is shown in a hierarchical manner, and the map shown in the window corresponds to the area of the postal address. (A small sketch of this conversion is given after this list.)
• Combination with a town directory (keyword search for video that carries no keywords): in a separate window, a town directory can be shown, in which the addresses of more than a million stores, facilities, companies, restaurants and so on are classified into genres. Our retrieval system associates the GPS location information with the town directory in two ways. In the direct use of the town directory, the system can show in the map window the shops of a given genre (e.g. Japanese traditional restaurants) within a given distance (e.g. 25 m) of the person's trace. In the other use, the retrieval system can search for a video clip by the keyword of a shop name. For example, the user can look for the video clip of when he went into a "McDonald's", and the system shows all the candidate visits. In this application we make use of the status of the GPS signal, because the GPS signal is lost when the user is inside a building. The user types the name "McDonald's" and sets search conditions, for instance that candidates must lie within 25 m of the user's previous locations and that the GPS signal must disappear for more than 1 minute. The system immediately searches for and displays the candidates, and the user can choose one of them to watch the corresponding video clip. Thus the system allows the user to search video clips by the keywords of the places visited. (An illustrative sketch of this search is given below.)
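To make the HMM-based event detection above more concrete, the sketch below (in Python with the hmmlearn library, not the system's Visual C++ code) trains one Gaussian HMM per behavior class on acceleration/gyro features and tags a window of new data with the class whose model gives the highest likelihood. This per-class-likelihood formulation is one common approach and may differ from the authors' exact model; the features, state count and training data are placeholder assumptions.

```python
import numpy as np
from hmmlearn import hmm   # pip install hmmlearn

BEHAVIORS = ["walking", "running", "no movement"]

def train_models(train_data, n_states=3):
    """Train one HMM per behavior.

    train_data maps each behavior name to an array of shape
    (n_samples, n_features) of 30 Hz acceleration/gyro features."""
    models = {}
    for name in BEHAVIORS:
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=20)
        m.fit(train_data[name])
        models[name] = m
    return models

def classify(models, window):
    """Tag one window of features with the behavior whose HMM gives
    the highest log-likelihood."""
    scores = {name: m.score(window) for name, m in models.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder training data: 6-dimensional features
    # (3-axis acceleration + 3-axis gyro), one cluster per behavior.
    train = {name: rng.normal(loc=i, size=(300, 6))
             for i, name in enumerate(BEHAVIORS)}
    models = train_models(train)
    window = rng.normal(loc=1.0, size=(60, 6))   # about 2 s of data at 30 Hz
    print(classify(models, window))              # expected: "running"
```

In the real system such a classifier would run online over the 30 Hz sensor stream and emit the winning label as a tag aligned with the video time code.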
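The conversion of GPS coordinates into a readable, hierarchical postal address (the GPS-readability item above) can be sketched as a nearest-entry lookup against an address database; the tiny database and the matching rule below are invented for illustration, and the actual database and method may differ.

```python
import math

# Tiny invented address database: (latitude, longitude, hierarchical address).
ADDRESS_DB = [
    (35.7126, 139.7621, ("Tokyo", "Bunkyo-ku", "Hongo 7-chome")),
    (35.7100, 139.8107, ("Tokyo", "Taito-ku", "Asakusa 2-chome")),
]

def to_postal_address(lat, lon):
    """Return the hierarchical address of the database entry closest to (lat, lon)."""
    best = min(ADDRESS_DB, key=lambda e: math.hypot(e[0] - lat, e[1] - lon))
    return " > ".join(best[2])

print(to_postal_address(35.7127, 139.7620))   # -> "Tokyo > Bunkyo-ku > Hongo 7-chome"
```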
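Finally, the keyword search over visited places can be sketched as follows: find the intervals in which the GPS signal disappears for longer than a minimum duration, and report those whose last known position lies within a given radius of a town-directory entry matching the keyword. The data structures, thresholds and directory entries below are illustrative assumptions rather than the system's actual interfaces.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class GpsSample:
    t: float          # seconds since the start of the recording
    lat: float
    lon: float
    has_fix: bool     # False while the GPS signal is lost (e.g. indoors)

# Invented town-directory entries: (name, genre, latitude, longitude).
DIRECTORY = [
    ("McDonald's Hongo", "fast food", 35.7070, 139.7600),
    ("Kanda Sushi", "Japanese restaurant", 35.6990, 139.7710),
]

def metres(lat1, lon1, lat2, lon2):
    # Rough planar distance in metres around Tokyo's latitude,
    # adequate for thresholds of a few tens of metres.
    return hypot((lat1 - lat2) * 111_000, (lon1 - lon2) * 91_000)

def find_visits(samples, keyword, radius_m=25, min_gap_s=60):
    """Return (start, end) times of GPS-dropout intervals whose last known
    position is within radius_m of a directory entry matching keyword.
    A gap still open when the recording ends is not reported."""
    hits, last_fix, gap_start = [], None, None
    for s in samples:
        if s.has_fix:
            if gap_start is not None and last_fix is not None \
               and s.t - gap_start >= min_gap_s:
                for name, _genre, lat, lon in DIRECTORY:
                    if keyword.lower() in name.lower() and \
                       metres(last_fix.lat, last_fix.lon, lat, lon) <= radius_m:
                        hits.append((gap_start, s.t))
                        break
            gap_start = None
            last_fix = s
        elif gap_start is None:
            gap_start = s.t       # GPS signal has just disappeared
    return hits

# Example with an invented trace: two fixes near the shop, a 2-minute dropout,
# then the signal returns.
trace = ([GpsSample(0, 35.7071, 139.7601, True),
          GpsSample(5, 35.7070, 139.7600, True)]
         + [GpsSample(5 + i, 0.0, 0.0, False) for i in range(1, 120)]
         + [GpsSample(126, 35.7071, 139.7601, True)])
print(find_visits(trace, "McDonald"))   # -> [(6, 126)]
```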
Figure 3. User interface of the retrieval system: the event-based viewer, the location-based viewer and the town-directory-based search windows are shown.
References
[1] S. Mann, "WearCam (The Wearable Camera)," ISWC, pp. 124-131, 1998.
[2] B. S. Manjunath, P. Salembier, and T. Sikora (eds.), Introduction to MPEG-7, Wiley, 2002.
[3] K. Aizawa et al., "Summarizing Wearable Video," IEEE ICIP 2001, pp. III-398-401, 2001.
[4] K. Aizawa et al., "Summarizing Wearable Video - Indexing Subjective Interest -," Trans. IEICE D-II (in Japanese), Vol. J86-D-II, No. 6, pp. 807-815, May 2003.
[5] Y. Sawahata and K. Aizawa, "Wearable Imaging System for Summarizing Personal Experiences," IEEE ICME 2003, pp. I-45-48, July 2003.