ISWC 2004: Proceedings of the 8th IEEE International Symposium on Wearable Computers, pp. 138–141, Arlington, Oct. 31 – Nov. 3, 2004
Implementation and Evaluation of a Low-Power Sound-Based User Activity Recognition System

Mathias Stäger (1), Paul Lukowicz (1,2), Gerhard Tröster (1)
(1) Wearable Computing Lab, ETH Zürich, Switzerland
(2) Institute for Computer Systems and Networks, UMIT Innsbruck, Austria
{staeger, lukowicz, troester}@ife.ee.ethz.ch

Abstract

The paper presents a prototype of a wearable, sound-analysis based, user activity recognition device. It focuses on a low-power realization suitable for a miniaturized implementation. We describe a tradeoff analysis between recognition performance and computational complexity. Furthermore, we present the hardware prototype and the experimental evaluation of its recognition performance. This includes frame-by-frame recognition, event detection in a continuous data stream and the influence of background noise.
1. Introduction

In many wearable applications user activity recognition is one of the most important context recognition tasks. As a consequence, much research effort has gone into activity recognition. Most of this work has built on one of the following three approaches: computer vision, wearable motion sensors or augmentation of the environment.

The work described in this paper focuses on a different approach: the use of sound analysis. It is motivated by the fact that many activities that are difficult to detect using the above three approaches are associated with a distinct sound. It also builds on the idea [6, 10] that by comparing the intensities recorded by a wrist-worn and a chest-worn microphone, we can identify sounds occurring immediately next to the user's hand rather than just somewhere in his general vicinity. This can be taken as a strong indication that the sound has been directly caused by the user. This indication could be further validated by information from body-worn motion sensors showing that the sound was correlated with a hand motion.

As a simple everyday example consider getting a coffee from a machine. In general, this action is not associated with a single characteristic gesture. Rather, it involves a series of hand actions such as pressing a switch or putting the cup under the outlet. These actions can be performed in a variety of ways and sequences and are thus difficult to
recognize using both vision and wearable motion sensors, in particular from continuous, non-segmented data. On the other hand, most coffee machines tend to make a very distinct sound, which can be reliably recognized with reasonable computational effort. Furthermore, since the user needs to touch the switch and put the cup under the machine with his hand, the intensity difference analysis will very likely register the sound as occurring next to the user's hand.

Related Work and Paper Contribution

The main difference between our work and other auditory scene analysis research (e.g. [3, 8]) stems from our emphasis on low power consumption and suitability for the miniaturized hardware implementation required for a truly wearable system. In previous work [10] we have demonstrated that low-complexity, low-duty-cycle methods can be used to reliably recognize a variety of sounds. Furthermore, we have verified the validity of the intensity difference analysis as a means of identifying sounds occurring next to the user's hand.

This paper describes the next stage of our research. It deals with the design, demonstration, and real-life evaluation of a first hardware prototype. We begin by presenting a detailed analysis of the tradeoffs between recognition rate, computational complexity and communication energy, which are key system design issues. Next, we describe our hardware device and provide the results of power consumption measurements. We then discuss recognition rates achieved with our hardware in a real-life experiment. This includes both frame-by-frame results and event detection using a majority decision over frame sequences. We also demonstrate that the recognition rates achieved by our system are robust with respect to background noise.
2. Low-Power Tradeoff Analysis

2.1. Method Overview

As previously described, two devices – one worn on the wrist, the other on the chest – are required for a sound-based user activity recognition system. Each device samples sound with a relatively low duty cycle (1 to 10 frames per second, each frame sampled at 4.8 kHz for 53 ms [10]), performs some local preprocessing and transmits this information wirelessly to a central wearable computer with higher processing capability. The central wearable computer compares the intensities (which correspond to the average energy over one frame) to decide whether the user is causing the sound with his hand or not. Together with other sensory information (e.g. from body-worn motion sensors), the information of the wrist device is averaged over several frames to obtain an event-based activity recognition.

Since both the chest and the wrist device should operate autonomously for months on a miniature battery, our goal is to minimize their energy consumption. Thus, we have to analyze the complexity of different algorithms running on the devices in order to choose an algorithm that gives a high recognition rate at low computational complexity. Furthermore, we need to find a tradeoff between local preprocessing and wireless result transmission.

Table 1. Instruction count and execution time

Feature  | Domain    | ADD       | MUL       | BRANCH        | other | tnorm
ZCR      | time      | N/4       | N - 1     | N - 1         | 0     | 1339
HISTO    | time      | 0         | 0         | N^2/4 + 50N   | 0     | 29184
FLUC     | time      | 2N - 1    | N + 4     | 0             | 200   | 1751
FLUC-S   | frequency | 2N - 1    | N + 4     | 0             | 200   | 983
FC       | frequency | 2N - 2    | N         | 0             | 36    | 802
BW       | frequency | 5N - 4    | 5N        | 0             | 72    | 3268
SRF      | frequency | 7N/4 - 1  | 7N/4 + 1  | 3N/4          | 0     | 1219
4-BER    | frequency | 2N - 2    | 2N        | 0             | 144   | 1422
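To make the intensity comparison described above concrete, the following minimal Python sketch shows how a frame intensity and the wrist/chest comparison could be computed. The paper does not give the exact decision rule; the dB ratio and the 6 dB threshold used here are illustrative assumptions, not values from the paper.

```python
import numpy as np

RATIO_THRESHOLD_DB = 6.0  # hypothetical threshold; the paper does not specify a value

def frame_intensity(frame):
    """Average energy over one frame (the 'intensity' sent to the central computer)."""
    frame = np.asarray(frame, dtype=np.float64)
    return np.mean(frame ** 2)

def sound_near_hand(wrist_frame, chest_frame):
    """Compare wrist and chest intensities: a much louder wrist signal suggests
    that the sound source is right next to the user's hand."""
    ratio_db = 10.0 * np.log10(frame_intensity(wrist_frame) /
                               (frame_intensity(chest_frame) + 1e-12))
    return ratio_db > RATIO_THRESHOLD_DB
```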
2.2. Algorithm Complexity Estimates

Recognition rates depend on two factors: the acoustical features and the classifier algorithm. We investigated several features and classifiers. From the list of features commonly used in audio and speech recognition [5, 8], only features that can be processed on a frame-wise basis were evaluated. In the temporal domain, those were zero-crossing rate (ZCR), 90%-10% width of the amplitude histogram distribution (HISTO) and fluctuation of amplitude (FLUC). In the frequency domain the features were fluctuation of amplitude spectra (FLUC-S), frequency centroid (FC), bandwidth (BW), spectral roll-off frequency (SRF) and band energy ratio in 4 logarithmically divided subbands (4-BER). To calculate the recognition rates the same database as in [10] was used: the task was to distinguish two groups of 5 sounds each, namely sounds made by the user (hammering, sawing, drilling, grinding, filing) and sounds activated by the user (coffee maker, coffee grinder, microwave, hot water nozzle, water tap).
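To give a flavor of these frame-wise features, the sketch below computes three of them (ZCR, FLUC and 4-BER, the set found to be a good compromise later in this section) for a single frame. The paper does not spell out the exact normalizations or band edges, so the definitions used here are common textbook variants and should be read as assumptions.

```python
import numpy as np

def zero_crossing_rate(frame):
    # ZCR: fraction of sign changes between consecutive samples.
    signs = np.sign(frame)
    return np.count_nonzero(signs[1:] != signs[:-1]) / (len(frame) - 1)

def fluctuation_of_amplitude(frame):
    # FLUC: mean absolute difference between consecutive samples (one common definition).
    return np.mean(np.abs(np.diff(np.asarray(frame, dtype=np.float64))))

def band_energy_ratio_4(frame, fs=4800):
    # 4-BER: energy in 4 logarithmically divided subbands, relative to total frame energy.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    edges = np.geomspace(75.0, fs / 2, 5)   # assumed band edges, logarithmically spaced
    edges[-1] = fs / 2 + 1.0                # make sure the Nyquist bin falls in the last band
    total = spectrum.sum() + 1e-12
    return [spectrum[(freqs >= lo) & (freqs < hi)].sum() / total
            for lo, hi in zip(edges[:-1], edges[1:])]
```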
[Figure 1. Feature-set comparison: normalized execution time (×10^4) versus recognition rate [%] for different feature sets – all features, {ZCR, FLUC, BW, SRF}, {ZCR, FLUC, 4-BER}, {ZCR, FLUC} and {4-BER}.]
Tab. 1 shows the computational complexity associated with those features as a function of the frame length N. With N = 256 for temporal features and N = 128 for the frequency domain (because a FFT with real-valued input produces a conjugate complex output) and assuming that a multiplication takes 4 times longer than an addition or a branch instruction [9], we get the normalized instruction times t_norm = t / t_ADD in the last column of Tab. 1.

Tradeoff: Recognition Rates – Computation Time

The goal of a low-power architecture is to find a set of features that performs well in terms of frame-by-frame recognition rate while keeping the computational complexity low. To find an optimal set of features, three strategies were chosen:

• "brute force" method: comparing recognition rates of different feature sets using different classifiers
• calculating the correlation between features and discarding correlated features
• mutual information method: keeping only those features that provide high information about the class and at the same time are independent of each other

We noted little difference between the classifiers: a k-NN classifier (with k = 3 ... 7) and a tree-based search algorithm (C4.5 decision tree) showed similar results. A nearest class center classifier produced poorer recognition rates; here the recognition rates were improved by clustering the classes with a Linear Discriminant Analysis (LDA) beforehand. The three different strategies produced similar sets of features.

Fig. 1 shows a Pareto front comparing recognition rates (with a 3-NN classifier) and execution time of different feature sets. The execution time, which can be assumed to be proportional to the energy consumption, includes a 128 point complex FFT and takes into account that some calculations can be reused for several features. Using all features results in the best recognition but also in a long execution time. Taking only a few features results in a poor recognition rate. For our task, we found that {ZCR, FLUC, 4-BER} is an optimal set of features.
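As a sanity check, the t_norm column of Tab. 1 can be reproduced directly from the instruction counts under the stated assumptions (N = 256 for temporal and N = 128 for spectral features, a multiplication weighted as 4 additions; weighting branches and 'other' operations as 1 is our assumption):

```python
# Reproduce t_norm from Table 1: t_norm = ADD + 4*MUL + BRANCH + other.
N_T, N_F = 256, 128  # frame length for temporal / frequency-domain features

counts = {  # feature: (ADD, MUL, BRANCH, other)
    "ZCR":    (N_T / 4,     N_T - 1,     N_T - 1,             0),
    "HISTO":  (0,           0,           N_T**2 / 4 + 50*N_T, 0),
    "FLUC":   (2*N_T - 1,   N_T + 4,     0,                   200),
    "FLUC-S": (2*N_F - 1,   N_F + 4,     0,                   200),
    "FC":     (2*N_F - 2,   N_F,         0,                   36),
    "BW":     (5*N_F - 4,   5*N_F,       0,                   72),
    "SRF":    (7*N_F/4 - 1, 7*N_F/4 + 1, 3*N_F/4,             0),
    "4-BER":  (2*N_F - 2,   2*N_F,       0,                   144),
}

for name, (add, mul, branch, other) in counts.items():
    print(f"{name:7s} t_norm = {add + 4*mul + branch + other:.0f}")
# Prints 1339, 29184, 1751, 983, 802, 3268, 1219 and 1422, matching Tab. 1.
```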
[Figure 2. Communication Tradeoffs: total energy E_TX + E_OP [µJ] per classification versus computation energy E_OP [nJ] per instruction, for the strategies 'Transmit features', 'k-NN' and 'C4.5 tree', each plotted for E_TX,max and E_TX,min.]

[Figure 3. Prototype]
Tradeoff: Computation – Communication

Generally, wireless transmission consumes more power than computation. To illustrate this, we assumed a Bluetooth radio as the worst case, with a transmit energy of E_TX,max = 100 nJ/bit [4]. For the best case, we assumed an ultra-low-power transceiver with E_TX,min = 40 nJ/bit [7]. For computing devices, the energy per instruction E_OP ranges from 10 pJ for basic processors [4] and ASICs to 1 nJ for low to medium performance CPUs [1].

Fig. 2 compares two different strategies: A) perform the classification locally and transmit only the classification result ('k-NN' and 'C4.5 tree' curves) or B) transmit the features directly ('Transmit features' curve). Since the goal was to show the computation – communication tradeoff, Fig. 2 does not include any 'overhead energy' that would be needed to keep the whole system running. If either E_OP is small enough or E_TX is large, strategy A wins. Furthermore, we can observe that a decision tree classifier (with a tree depth of 500) requires far less energy than a k-NN classifier (with k = 3 and only 20 samples per class). This is mainly due to the distance calculations of the k-NN classifier.
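A simplified back-of-the-envelope version of this comparison is sketched below. It is not the authors' exact model: the operation counts per classification, the feature payload of 6 values at 16 bit and the 8-bit class label are assumptions chosen only to illustrate how the break-even between the two strategies arises.

```python
# Per-classification energy for the two strategies compared in Fig. 2:
#   A: classify locally, transmit only the class label
#   B: transmit the feature vector, classify on the central wearable computer
E_TX_MAX = 100e-9   # J/bit, Bluetooth-class radio (worst case, [4])
E_TX_MIN = 40e-9    # J/bit, ultra-low-power transceiver (best case, [7])

FEATURE_OPS  = 1339 + 1751 + 1422   # t_norm of ZCR, FLUC, 4-BER from Tab. 1 (FFT cost/reuse neglected)
CLASSIFY_OPS = {"C4.5 tree": 500, "k-NN": 3000}   # rough, assumed operation counts
FEATURE_BITS = 6 * 16                # assumed: 6 feature values at 16 bit each
LABEL_BITS   = 8                     # assumed: one class label byte

def energy_local(e_op, e_tx, classify_ops):
    # Strategy A: features + classification on the device, transmit only the label.
    return e_op * (FEATURE_OPS + classify_ops) + e_tx * LABEL_BITS

def energy_remote(e_op, e_tx):
    # Strategy B: compute features on the device, transmit them for remote classification.
    return e_op * FEATURE_OPS + e_tx * FEATURE_BITS

for e_tx, radio in ((E_TX_MAX, "E_TX,max"), (E_TX_MIN, "E_TX,min")):
    for e_op in (10e-12, 1e-9):          # energy per instruction: 10 pJ and 1 nJ
        a = energy_local(e_op, e_tx, CLASSIFY_OPS["C4.5 tree"])
        b = energy_remote(e_op, e_tx)
        print(f"{radio}, E_OP={e_op:.0e} J: local {a*1e6:.2f} uJ vs. transmit features {b*1e6:.2f} uJ")
```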
3. Hardware

Our long-term vision for the hardware is that of a fully autonomous device containing all special purpose processing circuits, a wireless communication interface and its own power supply in a high-density button-like package [2]. The prototype described in this paper represents an intermediate step towards such a system. It uses a miniature microphone, analog preprocessing circuits, and a microcontroller to compute the features described in the previous section. However, it relies on off-the-shelf components instead of custom designed circuits for processing, communication and power supply, and is implemented on a conventional printed circuit board the size of half a credit card (3.5 cm × 5.5 cm).

The hardware (Fig. 3) contains a MEMS microphone (Knowles Acoustics SP0103NC3-3), a MSP430F149 microcontroller and a RFM DR3001 wireless transceiver working at 868 MHz with a data rate of 115.2 kbps. The output from the microphone (internal gain of 20) is amplified by 2 and low-pass filtered before being sampled at 4.8 kHz and 8 bit with the internal AD-converter of the microcontroller. The output power of the wireless transmitter is −1.25 dBm, which is sufficient to cover the short distance from the wrist to the torso. The hardware is powered by a Fuji NP-40 battery with 3.7 V, 710 mAh. Tab. 2 summarizes the current consumption and the energy needed if one classification per second is performed. With this duty cycle the battery lasts for almost 2 weeks.

The prototype fulfills three functions. Firstly, it provides a platform for conducting real-life experiments. Secondly, it proves that, despite the (low quality) miniature microphone and low accuracy fixed-point calculations, acceptable recognition rates can be achieved. Finally, the power measurements show that even without a custom VLSI implementation our method is appropriate for a variety of applications such as integration in a watch.

[Figure 4. Noise performance: recognition rate [%] versus SNR [dB] (0 to 10 dB) for two interfering signals – white Gaussian noise and a sound from the same group.]

Table 2. HW Current and Energy Consumption (one classification per second)

               | Current   | Time [ms] | Energy
AD-Conversion  | 4.41 mA   |    54.00  | 0.881 mJ
Feature-Calc.  | 4.46 mA   |   104.21  | 1.720 mJ
Transmission   | 11.94 mA  |     1.73  | 0.076 mJ
Standby        | 1.74 mA   |   840.06  | 5.408 mJ
Total          |           |  1000.00  | 8.085 mJ
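The 'almost 2 weeks' battery lifetime follows directly from Tab. 2 and the battery specification; a quick check (optimistically treating the full rated 710 mAh at 3.7 V as usable):

```python
# Battery lifetime estimate from Table 2 (one classification cycle per second).
battery_capacity_j = 0.710 * 3.7 * 3600    # 710 mAh at 3.7 V, roughly 9457 J
energy_per_cycle_j = 8.085e-3              # total energy per 1 s cycle (Tab. 2)

lifetime_days = battery_capacity_j / energy_per_cycle_j / 86400
print(f"Estimated lifetime: {lifetime_days:.1f} days")   # about 13.5 days, i.e. almost 2 weeks
```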
4. Performance Evaluation

Frame-Based Recognition: Previous work demonstrating good recognition performance with our approach relied on data recorded with high quality microphones and on floating-point computations [6, 10]. The first part of the evaluation addresses the influence of the constraints of our low-power hardware implementation (e.g. limited microphone sensitivity, reduced accuracy of fixed-point calculations, system noise) on the recognition accuracy. Thus, a series of recognitions was performed with our system on two sets of sounds that were also used in [10].
Table 3. Frame-Based Recognition Rates

'Kitchenette' sounds   [%]  |  'Workshop' sounds   [%]
Coffee Grinder          99  |  Saw                  69
Coffee Maker            77  |  Drilling             65
Hot Water Nozzle        73  |  Filing               70
Microwave               91  |  Sanding              48
Water tap               83  |  Hammer               83
Overall                 85  |  Overall              67
Overall, 12’000 frames recorded with different hand positions were evaluated by the system. Tab. 3 lists the framewise recognition rates using a C4.5 decision tree classifier (although for the ‘workshop’ sounds a 3-NN classifier resulted in a better overall recognition rate of 73%). Compared with the results achieved using high quality microphones and floating-point calculation in [10], we achieved 10% to 15% lower recognition rates. Influence of Noise: An important issue for real life applications is the system’s behavior in the presence of background noise. A systematic investigation has been performed by adding noise to prerecorded sounds for different signal to noise ratios (SNR). Fig. 4 shows the performance of the aforementioned feature set ZCR, FLUC, 4-BER together with a 3-NN classifier for the ‘kitchenette’ sounds in the left column of Tab. 3. We distinguished two noise sources. In a worst case, a ‘noise’ is one of the sounds on which the algorithms are trained on. Obviously, in this case the performance is worse than in a good-natured case, where the background noise can be seen as Additive White Gaussian Noise. Nevertheless, for a relatively low SNR of 5 dB, the recognition rate is still around 65%. Event Based Continuous Recognition: In real life applications we are in general not interested in checking for each activity several times per second. Instead, we would like to be able to detect each activity as a single discrete event in a continuous data stream. With our system this involves a three step process. Firstly, we pick the sound segments that have occurred next to the user’s hand using the ratio of intensities from two devices – one mounted on the user’s wrist, the other on the user’s chest. Then, we apply a frame by frame recognition to these segments. Finally, we perform segment-wise classification using a majority decision over all frames in a segment. To illustrate the performance of the system for such a real life scenario, a subject wearing our devices was asked to pick 20 activities randomly selected from the list of ‘kitchenette’ tasks in Tab. 3 and perform them at random times within a 10 min period. The results of the first 4 minutes are shown in Fig. 5 We counted 18 correctly labeled events, 2 substitutions (a wrongly classified event), 2 insertions (detection of an event although none was there) and 0 deletions.
[Figure 5. Continuous Recognition: ground truth, frame-based and segment-based classification over time [sec] for the classes water tap, microwave, nozzle, coffee, grinder and garbage.]
5. Conclusions

We have demonstrated that sound-based user activity recognition is feasible with low-power, compact, wearable devices. We have presented a systematic evaluation of the recognition vs. energy tradeoff, described the implemented hardware and shown experimental performance results. The contributions presented in this paper constitute an important step towards our vision of ultra-miniaturized hardware in a button-like form factor [2]. Additionally, the hardware presented in this paper constitutes a versatile tool for experiments with wearable activity recognition.
References

[1] U. Anliker et al. A systematic approach to the design of distributed wearable systems. IEEE Transactions on Computers, 53(3):1017–1033, Aug. 2004.
[2] N. B. Bharatula, S. Ossevoort, M. Stäger, and G. Tröster. Towards wearable autonomous microsystems. In Pervasive Computing: Proc. of the 2nd Int'l Conf., pages 225–237, Apr. 2004.
[3] B. Clarkson and A. Pentland. Extracting context from environmental audio. In ISWC'98, pages 154–155, Oct. 1998.
[4] L. Doherty, B. Warneke, B. Boser, and K. Pister. Energy and performance considerations for smart dust. Int'l Journal of Parallel and Distributed Systems and Networks, 4(3):121–133, Dec. 2001.
[5] D. Li, I. Sethi, N. Dimitrova, and T. McGee. Classification of general audio data for content-based retrieval. Pattern Recognition Letters, 22(5):533–544, 2001.
[6] P. Lukowicz et al. Recognizing workshop activity using body worn microphones and accelerometers. In Pervasive Computing: Proc. of the 2nd Int'l Conf., pages 18–22, 2004.
[7] T. Melly, A.-S. Porret, C. Enz, and E. Vittoz. An ultralow-power UHF transceiver integrated in a standard digital CMOS process: Transmitter. IEEE Journal of Solid-State Circuits, 36(3):467–472, Mar. 2001.
[8] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa. Computational auditory scene recognition. In IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing, volume 2, pages 1941–1944, May 2002.
[9] J. M. Rabaey. Digital Integrated Circuits – A Design Perspective. Prentice Hall, 1996.
[10] M. Stäger et al. SoundButton: Design of a Low Power Wearable Audio Classification System. In ISWC'03, pages 12–17, Oct. 2003.