Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures

Georg Ogris1, Thomas Stiefmeier2, Holger Junker2, Paul Lukowicz1,2, Gerhard Tröster2
1 Institute for Computer Systems and Networks, UMIT Innsbruck, Austria
2 Wearable Computing Lab, ETH Zürich, Switzerland

Abstract

The paper demonstrates how ultrasonic hand tracking can be used to improve the performance of a wearable, accelerometer and gyroscope based activity recognition system. Specifically, we target the recognition of manipulative gestures of the type found in assembly and maintenance tasks. We discuss how relevant information can be extracted from the ultrasonic signal despite problems with low sampling rate, occlusions and reflections that occur in this type of application. We then introduce several methods of fusing the ultrasound and motion sensor information. We evaluate our methods on an experimental data set that contains 21 different actions performed repeatedly by three different subjects during simulated bike repair. Due to the complexity of the recognition task, with many similar and vaguely defined actions and person-independent training, both the ultrasound and the motion sensors perform poorly on their own. However, with our fusion methods recognition rates well over 90% can be achieved for most activities. In extreme cases recognition rates go up from just over 50% for separate classification to nearly 89% with our fusion methods.

1 Introduction

Using one's hands to manipulate objects and devices is a key component of user activity. In a large industrial project (WearIT@Work 1) our group has focused on the recognition of the associated so-called 'manipulative gestures' in conjunction with the tracking of car assembly and aircraft manufacturing tasks. The ultimate aim is to recognize what part of the procedure is executed by the worker at any given point in time and either proactively deliver relevant information (e.g. manual pages) or record the progress of the procedure for later verification or training purposes. In general, manipulative gestures are characterized by two factors: (1) the motion of the hands and (2) the object that is being manipulated. In our case the latter is a specific part of a large stationary machine (e.g. a car body or an aircraft engine). Note that even if a tool is used, the manipulation still targets a specific part of the machinery.

1 Sponsored by the European Union under contract EC IP 004216

The two main approaches to gesture recognition are video analysis and the use of wearable sensors. In our work we focus on the latter. For the motion analysis, acceleration sensors and gyroscopes attached to the hands and arms have been shown to be a promising approach. The identification of the object which is being manipulated can be accomplished with different techniques including RFIDs [10], switches and sensors incorporated in the objects [13], or sound analysis [8]. For the specific application envisioned in the project none of the above methods is really applicable. While some instrumentation is certainly possible, outfitting all parts of an entire aircraft with RFIDs or other sensors is not feasible. As shown in previous work [8], sound analysis can be useful; however, it has a number of limitations. In particular it only provides information about those tasks that actually cause a characteristic sound, and it does not work in noisy environments.

1.1 Paper Contributions and Related Work

Motivated by the above considerations, the work reported in this paper deals with the use of ultrasonic location systems to identify which parts of the machinery are being manipulated during an assembly or maintenance task. The advantage of this approach is that it requires only minimal instrumentation of the environment. All that is needed are at least three ultrasonic beacons placed at predefined locations, a 'listener' placed at the user's arm, and data on the dimensions and layout of the machinery (which today is mostly available in electronic format). To date, ultrasonic location has only been used for tracking of people ([17, 14]). The detailed tracking of dynamic interaction between people and machinery as proposed in this paper has so far not been attempted. Such tracking has to deal with a number of problems that are due to fundamental physical limitations of ultrasonic location, which will be discussed in section 2.1. These include low sampling rates (at most a few Hz), frequent occlusions and signal reflections.

Contributions Our work demonstrates how, despite the above problems, ultrasonic location can be used to improve the accuracy of manipulative gesture recognition. Specifically the paper presents the following contributions:

1. We demonstrate how the inherent errors present in the ultrasonic signal can be handled through plausibility analysis based on physical constraints of the system (human anatomy, basic assumptions about plausible motions).

2. We show that by including ultrasonic listeners not just on one hand but also on the upper arm, the system can provide recognition information that goes beyond mere identification of the part of the machinery that is being manipulated.

3. We describe and contrast different ways of combining the ultrasonic information with motion information from acceleration sensors and gyroscopes placed at the user's arms.

4. We present the results of an experimental validation of our method. It is based on a bicycle repair task that has been repeatedly performed by three volunteers. The task consists of 21 individual actions that were chosen according to two criteria: (1) being typical for the repair task and (2) being challenging in terms of recognition. They include activities such as spinning a wheel, turning a pedal or removing the seat, which allow many degrees of freedom in how they can be performed. We show that while the individual classifiers perform poorly on many tasks, appropriate fusion of ultrasound and motion information dramatically increases the recognition performance. Overall the best fusion method reaches a performance of 91%, rising to 96% if gestures that are not clearly distinguishable (e.g. screwing and unscrewing the same screw) are grouped together.

Related Work The use of ultrasonic sensors for user localization has been investigated by [17], who describe the use of an ultrasonic location system for context aware computing, as does [5]. The performance of ultrasonic indoor tracking using the Cricket system [11, 1] has been investigated by [14]. More general overviews of automatic location sensing techniques in the field of wearable computing are given by [7, 6, 4] and [15]. The use of RFIDs to follow the progress of a maintenance task has been studied by [10]. In [13] it is shown how pressure and tilt sensors integrated in tools and components can be used to track a furniture assembly task. The use of motion sensors (mostly accelerometers) for activity and gesture recognition has also been widely studied (e.g. [9, 12, 16, 3]).

2 Methods Overview

2.1 Ultrasonic Analysis

General Considerations In general, ultrasonic positioning systems rely on time of flight measurements between a mobile device and at least three reference devices fixed at known positions in the environment. Specific implementations of this idea differ in many ways (e.g., time synchronisation, signaling protocol, additional radio frequency reference signal). However, independent of the implementation details, there are three issues that all systems have to deal with.

1. Reflections. Ultrasound is reflected by most materials present in the environment. Thus the location system has to deal with false signals resulting from reflections.

2. Occlusions. Ultrasound essentially requires line of sight between the communicating devices. If the receiver turns away from the transmitter or some person/object comes between the two, then the signal is lost.

3. Temporal resolution. The temporal resolution is limited by the speed of sound, which is about 340 m/s. In general several transmissions are needed to perform 3D location (either one from each of three transmitters in the environment or one from every mobile device that needs to be localized). Unless advanced coding schemes are used, the transmissions need to be spaced far enough apart for the reflections to subside. In a room a couple of meters in diameter this reduces the maximum number of transmissions to 10 to 20 per second. This means that the maximum realistic sampling frequency is a couple of Hz. Often (as is the case with the Hexamite sensors) it is about 1 Hz.

In the indoor location scenario, where ultrasonic devices are mostly used, the above factors can often be neglected. With beacons placed in the ceiling and the personal devices e.g. on the shoulder, occlusions can be minimized. Except for sports related scenarios a temporal resolution in the range of 1 to a few Hz is more than enough. Without occlusions and with sufficient temporal resolution, reflected signals can easily be detected as repetitions of the original signals. In the envisioned maintenance scenario things are much more difficult. As the receivers need to be mounted on the arms, occlusions are a frequent problem. They may occur when the test person (1) is standing behind the maintenance object, (2) occludes the moving devices himself or (3) turns away from the fixed devices. In all these cases two outcomes are possible: either no signal reaches the measuring device in time (no measurement) or a reflected signal is measured (wrong measurement). A reflected signal is mostly easy to detect if the reflection comes from a point far away, e.g., from a wall when the test person is standing in the middle of the room. Thus occlusions are likely to produce wrong, hard to detect measurements in cases where the test person is close to an object, e.g., when maintenance activities are performed. What is more, the resulting coordinates of one moving device depend on distances to at least three fixed devices. The time frame for acquiring the distance to one fixed device is 1/3.3 = 0.3 sec. Consequently the calculation of the position of the moving device depends on measurements with a time delay of at best 0.9 seconds. This means that the error of the resulting position is not so much dependent on the accuracy of the measurement system (approx. 2-3 cm) as on the speed of the moving device.

Tracking The combination of factors described above causes the ultrasonic data to contain a large number of false positions. A common choice of smoothing method for noisy tracking data is the Kalman filter. However, since in our case the sampling rate of the positioning system is much smaller than the typical motion frequency of the hand, Kalman filtering of the final path makes little sense. Instead the smoothing of the ultrasonic data is done in two steps: (1) on the raw distance signals and (2) on the resulting coordinates. Strategies for filtering the raw distance data are based on constraints derived from various assumptions that help to identify obviously reflected signals. Strategies for filtering the error of the positioning system itself, and the error produced by slightly reflected/deflected signals, would additionally require consulting the measured accelerations and rotations. Some anatomical constraints can furthermore filter out wrong positions due to impossible distances between the body worn devices.
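To illustrate this kind of constraint based filtering, the following is a minimal sketch, not the implementation used in our system: the function names and the threshold values (maximum hand speed, maximum listener separation) are assumptions chosen for illustration only.

```python
import numpy as np

MAX_HAND_SPEED = 2.0            # m/s, assumed upper bound on hand velocity
MAX_LISTENER_SEPARATION = 0.45  # m, assumed wrist-to-upper-arm limit

def plausible_distances(times, dists, max_speed=MAX_HAND_SPEED):
    """Mark raw beacon distances whose change implies an impossible hand speed.

    times: (N,) measurement time stamps in seconds
    dists: (N,) measured distances to one fixed beacon in metres
    Returns a boolean mask; rejected readings are most likely reflections.
    """
    ok = np.ones(len(dists), dtype=bool)
    last_t, last_d = times[0], dists[0]
    for i in range(1, len(dists)):
        dt = times[i] - last_t
        if abs(dists[i] - last_d) > max_speed * dt:
            ok[i] = False            # jump too large for a real hand movement
        else:
            last_t, last_d = times[i], dists[i]
    return ok

def anatomically_plausible(wrist_pos, upper_arm_pos,
                           max_sep=MAX_LISTENER_SEPARATION):
    """Reject positions where the wrist and upper-arm listeners are too far apart."""
    diff = np.asarray(wrist_pos) - np.asarray(upper_arm_pos)
    return float(np.linalg.norm(diff)) <= max_sep
```

In this sketch the distance filter only accepts a new reading relative to the last accepted one, so an isolated reflection does not shift the reference point for the following samples.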

Figure 1. The standard deviation of the ultrasonic positioning during two gestures from our experimental set (3D view and x-y, x-z, y-z projections of the right and left hand positions).

Classification A possible approach to position based gesture classification is a frame based approach similar to the one explained in section 2.2, but with a window size of 1 to accommodate the low sampling rate of the ultrasonic system. The feature vector consists of the x, y and z coordinates of all three body worn ultrasonic devices as shown in Figure 2. The classifiers chosen for testing are C4.5 and k-Nearest-Neighbor, both explained in more detail in section 2.2. After the frame based classification for the isolated case, a majority decision is performed to decide on the overall classification result for each of the isolated segments.
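As a minimal sketch of this position based classification, the snippet below uses scikit-learn's k-NN classifier as a stand-in (not necessarily the implementation we used); the function name and the choice of k = 5 are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_position_segment(knn, segment):
    """Frame based ultrasonic classification with window size 1.

    segment: (n_samples, 9) array, one row per ultrasonic sample containing
             the x, y, z coordinates of the three body worn listeners
             (right wrist, left wrist, right upper arm).
    Returns the majority vote over the frame-wise k-NN decisions.
    """
    frame_labels = knn.predict(segment)
    values, counts = np.unique(frame_labels, return_counts=True)
    return values[np.argmax(counts)]

# Usage sketch: X_train holds (N, 9) coordinate vectors, y_train gesture labels.
# knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# gesture = classify_position_segment(knn, test_segment)
```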

2.2 Motion Sensors Analysis

Model Based Classification Some of the manipulative gestures in our bike repair experiment do not contain periodical motions at all. These gestures consist of a sequence of motions which can be modelled using Hidden Markov Models (HMMs). Earlier research at our institute shows that HMMs are an appropriate choice for modelling and recognizing the dynamically changing motions in our experiment. Each manipulative gesture in our experiment corresponds to an individually trained HMM. A thorough evaluation of the number of states per model, ranging from 2 to 12, resulted in models with 5 to 7 states; the number of states reflects the complexity of the respective manipulative gesture. We exclusively used so-called left-right models. A characteristic property of left-right models is that no transitions are allowed to states whose indices are lower than that of the current state. As features for the HMMs, only raw inertial sensor data has been used. The feature set comprises the following subset of the available sensor signals: three acceleration and two gyroscope signals from the user's right hand and three acceleration and two gyroscope signals originating at the user's right upper arm. The observations of the HMMs correspond to these raw sensor signals or features. Their continuous nature is modelled by a single Gaussian distribution for each state in all models.

Frame Based Classification Another classification approach is the classic sliding window approach: in a time window of fixed size N, a set of features is computed from the raw sensor data. This set of features forms the so-called feature vector. Then the sliding window is moved by an offset which determines the overlap with the previous window. The computed feature vectors are used either for training the classifier model or for testing on an already trained model. In our bike repair experiment, a special case of the sliding window approach has been chosen, which exhibits no overlap of adjacent windows. After partitioning an isolated manipulative gesture into so-called frames of size N, the feature vectors for all frames are computed. Our set of features has been chosen empirically and comprises mean, variance and median of the raw sensor signals. According to Section 3.3, a part of the feature vectors is passed to the training of the classifier while the remainder is used for being classified by the trained classifier. For comparison, two classifiers have been chosen. The so-called C4.5 classifier is based on building a decision tree during training. A characteristic property of this classifier is its modest computational complexity during testing. In contrast, the second classifier, k-Nearest-Neighbor (k-NN), which is an instance based (IB) classifier, requires many computations during testing. After a classification has been performed on all frames of one isolated manipulative gesture, a majority decision is applied to the raw classification results. This yields a filtered decision for the particular gesture which constitutes the final result of the frame based classification.
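The frame based feature extraction described above (non-overlapping frames, mean, variance and median per channel) can be sketched as follows; the helper name is hypothetical and the sketch is given for illustration only.

```python
import numpy as np

def frame_features(signal, frame_size):
    """Compute frame based feature vectors for one isolated gesture.

    signal: (n_samples, n_channels) raw accelerometer/gyroscope data
    Returns an (n_frames, 3 * n_channels) array with mean, variance and
    median of every channel per non-overlapping frame of size frame_size.
    """
    n_frames = len(signal) // frame_size
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_size:(i + 1) * frame_size]
        feats.append(np.concatenate([frame.mean(axis=0),
                                     frame.var(axis=0),
                                     np.median(frame, axis=0)]))
    return np.array(feats)
```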

2.3 Fused Classification

Plausibility Analysis (PA) The most obvious fusion method is the use of wrist position information to constrain the search space of the motion based classifier. Both the frame based and the HMM classifier produce a ranking for either the whole set of gestures (HMM) or a subset (frame based) of gestures. For the HMM classifier we chose this subset manually by taking the three most likely gesture classes. Beginning with the most likely gesture according to the motion result, we analyse the plausibility of this gesture class with respect to the position. If the result fits the position, it is assumed to be the correct class; otherwise the next candidate is tested. If the whole set of possible candidates has been tested and no class is plausible with respect to the position, the ultrasonic data is assumed to be too unreliable to be trusted and the most likely motion result is taken as the final result. The plausibility analysis is done by calculating the median distance of either the right hand or both hands (for two-handed gestures) to the trained location points of the tested gesture candidate. If the median distance is below a certain threshold, the candidate is assumed to be the correct gesture class (see the sketch below).

Joint Feature Vector Classification Another obvious fusion method is to construct a joint vector containing the ultrasonic and motion features and use it as input for a frame based classifier.
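The sketch referenced above illustrates the plausibility test for a single hand; the function name and the threshold value are assumptions, and the handling of two-handed gestures is omitted for brevity.

```python
import numpy as np

def plausibility_fusion(ranked_gestures, hand_positions, trained_locations,
                        threshold=0.25):
    """Return the first motion-ranked gesture whose location is plausible.

    ranked_gestures:   gesture classes ordered from most to least likely by
                       the motion classifier
    hand_positions:    (n, 3) ultrasonic hand positions during the gesture
    trained_locations: dict mapping gesture class -> (3,) trained location point
    threshold:         maximum allowed median distance (assumed value)
    """
    positions = np.asarray(hand_positions)
    for gesture in ranked_gestures:
        dists = np.linalg.norm(
            positions - np.asarray(trained_locations[gesture]), axis=1)
        if np.median(dists) < threshold:
            return gesture
    # No candidate is plausible: the ultrasonic data is considered unreliable
    # and the most likely motion result is kept.
    return ranked_gestures[0]
```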

Classifier Fusion The most complex fusion method is a true classifier fusion where a separate classification is performed on the ultrasonic and the motion signals. In general both classifications produce a ranking starting with the most likely and ending with the least likely class. The final classification is then based on a combination of those two rankings and the associated probabilities. In more advanced schemes, confusion matrices from a training set can also be used, taking into account how reliably each classifier distinguishes the classes in question. To compare the plausibility analysis with other fusion methods, the position based C4.5 classifier was fused with the k-NN and the HMM motion classifiers. This was done using three increasingly complex approaches: (1) comparing the average ranking of the top choices of both classifiers, which we will refer to as "average of the best" (avgOfBest); (2) comparing the average ranking of all gesture classes (avg); or (3) considering the confusion matrix (CM) that is produced when testing the training sets with the classifier. From this confusion matrix we get an estimate of the probability that a classifier recognizes class G_i although class G_j is true. Given class G_A as the result of classifier C_A and class G_B as the result of classifier C_B, we consider the probabilities P(C_A = G_A | G_B) and P(C_B = G_B | G_A) as estimates of the reliability of the classifiers C_A and C_B. The most reliable result is believed to be true.
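The following sketch shows one possible reading of the CM rule described above; it is an illustration under the assumption of row-normalised confusion matrices, not our exact implementation.

```python
def cm_fusion(pred_a, pred_b, cm_a, cm_b, classes):
    """Fuse two classifier decisions using training-set confusion matrices.

    cm_a, cm_b: row-normalised numpy confusion matrices,
                cm[i, j] = P(predict class j | true class i)
    classes:    list of class labels indexing the matrix rows/columns

    If classifier A often outputs pred_a when pred_b is actually true (a large
    entry in A's confusion matrix), A's decision is weak evidence against B,
    so B's result is preferred, and vice versa.
    """
    if pred_a == pred_b:
        return pred_a
    ia, ib = classes.index(pred_a), classes.index(pred_b)
    p_a_given_b = cm_a[ib, ia]  # P(C_A = pred_a | true class = pred_b)
    p_b_given_a = cm_b[ia, ib]  # P(C_B = pred_b | true class = pred_a)
    return pred_b if p_a_given_b > p_b_given_a else pred_a
```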

3 Experimental Setup

To validate our method of fusing inertial and ultrasonic sensor data, we set up an experiment comprising various manipulative gestures based on a bicycle repair task. This section describes the experimental setup including the placement of the sensors, the gestures to be recognized and the evaluation method.

3.1 Experimental Environment

For this experiment, we used a regular bicycle without any special features. No additional extensions have been carried out, i.e. no instrumentation or sensors of any kind have been attached to the bike itself. The bicycle has been mounted on a special repair stand for ease of reaching the different parts. In order to use our ultrasonic system, the room in which the experiments took place has been equipped with three so-called 'beacons'. These beacons have been placed at exactly predefined places and serve as the reference for the distance measurements using the ultrasonic sensors.

Sensor Placement Three different types of sensors have been used within this experiment: (a) ultrasonic sensors for distance measurement2, (b) acceleration sensors and (c) gyroscopes3, the latter two types to capture the motion of the relevant body parts of the user. Obviously, a very important piece of information for recognizing manipulative gestures is the location of the user's hand. Therefore, both wrists have been provided with an ultrasonic sensor module (listener) incorporated in biker gloves. A third listener has been attached to the user's right upper arm to enable our algorithms to recognize more than just mere hand location but also body location. That way, we additionally have a possibility to cope with occluded sensors at the user's hand due to movements of the hand while performing a certain manipulative gesture. To acquire motion data from the user, a set of nine inertial sensor modules (containing accelerometers and gyroscopes) has been attached to the user's hands, arms, chest and legs. However, for the recognition of manipulative gestures only a subset of the available inertial sensors has been used. Figure 2 illustrates the placement of the described sensors.

2 www.hexamite.com
3 www.xsens.com

Figure 2. Sensor placement

3.2 Set of Manipulative Gestures

For the described experiment, we determined a set of 21 manipulative gestures which are part of a regular bicycle repair task. These gestures can be divided into two classes: (a) gestures containing periodical motions and (b) gestures comprising only non-periodical motions. In terms of recognition difficulty the set has been composed in such a way that it contains a broad mixture. Thus there are gestures that contain very characteristic motions as well as ones that are highly unstructured. Similarly, there are activities that take place at different, well defined locations as well as ones that are performed at (nearly) the same locations or are associated with vague locations only. Table 1 gives a full overview of the used gestures. The key properties in terms of recognition challenges can be summarized as follows.

pumping (gestures 1 and 2) In our definition pumping begins with unscrewing the valve. Thus it consists of more than just the characteristic periodic motion. Pumping the front and the back wheel differs clearly in terms of location; however, depending on where the valve is during pumping, the location is rather vaguely defined. People tend to use different valve positions for the front and the back wheel, which means that statistically there is a difference in the acceleration signal as well.

screws (gestures 3 to 8) The sequence contains the screwing and unscrewing of three screws at different, clearly separable locations. Of the screws, B and C require a screwdriver and A a special wrench. Combined with the different arm positions required to handle each screw, this provides some acceleration information to distinguish between the screws (in addition to the location information).

pedals (gestures 9 to 11) The set contains three pedal related gestures: just turning a pedal, turning a pedal and switching gears (with the other hand), and turning the pedals and testing the brakes (with the other hand). The pedal turning is a reasonably well defined gesture.

(dis)assembly (gestures 12, 13, 18, 19) Among the most difficult gestures to recognize in the set are the assembly and disassembly of the front wheel and the seat. Both can be performed in many different ways, while the hand seldom remains at the same location for a significant time. In addition, gestures 18 and 19 are so short that only few location samples are available.

wheel spinning (16, 17) The wheel spinning gestures involve turning the front or the back wheel by hand. The gestures contain a reasonably well defined motion (the actual spinning). However, there is also a considerable amount of freedom in terms of the overall gesture. Front and back can easily be distinguished by location. In most cases different hand positions were used for turning the front and the back wheel.

carrier (20, 21) The most difficult gestures in the set are the placing and removing of items on/from the carrier. The motions involved are nearly entirely free. The location is only vaguely defined and the gesture is so short that often just one ultrasonic measurement is available.

ID  description                                  grouped as                    periodic
 1  pumping at front wheel                       pumping at front wheel        yes
 2  pumping at back wheel                        pumping at back wheel         yes
 3  unscrew screw A                              screw A                       yes
 4  screw down screw A                           screw A                       yes
 5  unscrew screw B                              screw B                       yes
 6  screw down screw B                           screw B                       yes
 7  unscrew screw C                              screw C                       yes
 8  screw down screw C                           screw C                       yes
 9  turning pedals                               turning pedals                yes
10  turning pedals and applying the back brake   turning pedals and testing    yes
11  turning pedals and switching gears           turning pedals and testing    yes
12  disassemble front wheel                      front wheel                   no
13  assemble front wheel                         front wheel                   no
14  test light generator                         test light generator          no
15  test bell                                    test bell                     no
16  turn front wheel                             turn front wheel              yes
17  turn back wheel                              turn back wheel               yes
18  disassemble seat                             seat                          no
19  assemble seat                                seat                          no
20  take item from carrier                       carrier                       no
21  put item on carrier                          carrier                       no

Table 1. Set of Manipulative Gestures

Data Recording In the course of this experiment, we recorded nine data sets, each of them including all 21 manipulative gestures. The corresponding repair task has been performed by three male, right-handed subjects. One of our goals is to show that our method of fusing different types of sensors supports multi-user recognition. For recording and labelling the sensor data, the framework proposed by [2] has been used. With this software tool, we have been able to synchronize and merge the 12 sensor streams into one data file and simultaneously label the data using a regular keyboard.

3.3 Evaluation Strategy

To obtain a stable classification result for the different classification and fusion methods, we used so-called biased cross-validation (BCV). For each iteration of this scheme, a different subset of the available recorded data sets has been chosen for training and the remaining sets have been used for testing. However, an additional constraint has been applied for each iteration: out of the three available data sets per bike repair subject, always choose two for training and the remaining one for testing. Summing up the number of all possible combinations according to this constraint, we get (3 choose 2) · (3 choose 2) · (3 choose 2) = 3^3 = 27 iterations of our biased cross-validation scheme. The large number of iterations per BCV ensures a representative classification result. In the evaluation of our bike repair experiment, the BCV has been applied to all classifier and fusion methods. To better understand the performance of the classifiers, we have grouped the gestures that by their nature provide no or little information to differentiate them by both their position and their motion information (see Table 1). The recognition rate for the grouped set is always given in brackets.
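The 27 train/test combinations of this scheme can be enumerated as in the following sketch; the function name and the data layout (one list of three data sets per subject) are assumptions for illustration.

```python
from itertools import product

def bcv_splits(sets_per_subject):
    """Enumerate the biased cross-validation iterations.

    sets_per_subject: dict mapping subject -> list of three recorded data sets.
    For every subject one set is held out for testing and the other two are
    used for training, giving 3 * 3 * 3 = 27 train/test combinations for
    three subjects.
    """
    subjects = list(sets_per_subject)
    for held_out in product(range(3), repeat=len(subjects)):
        train, test = [], []
        for subject, test_idx in zip(subjects, held_out):
            sets = sets_per_subject[subject]
            test.append(sets[test_idx])
            train.extend(s for i, s in enumerate(sets) if i != test_idx)
        yield train, test

# Usage sketch: with three subjects and three sets each,
# len(list(bcv_splits(data))) == 27.
```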

4 Results and Discussion

4.1 Separate Classification Results

Ultrasound Classification Results Overall, the biased cross-validation of the ultrasonic classification with the C4.5 classifier produces an average of 58.7% (81.5%) for the full (grouped) set. For k-NN the result is slightly better with 60.3% (84.9%). Taken by themselves these numbers are not very impressive. However, we need to remember that the set (including the grouped set) contains many gestures that are not distinguishable using hand location alone. In addition, the sampling rate of the ultrasonic sensors is so low that only very unreliable position readings are available for the more dynamic gestures. Taking the above into account, the corresponding confusion matrix in Table 5 shows the ultrasonic classifier to perform surprisingly well. This is due to the fact that the additional sensor on the upper arm provides information about the relative position of the arm segments, which tends to be characteristic for many activities. The classifier achieves 100% on the pumping gestures. Disregarding the difference between screw and unscrew actions, 100% is also achieved on 3 out of the 6 screw gestures, with the others reaching 99%, 94% and 85%. Taken as a single gesture (which it is for ultrasound) the pedals gestures also show excellent results. The same is true for the wheel assembly task, with confusion occurring only between assembly and disassembly. As expected, the performance becomes poor for the seat assembly and the carrier tasks, which are too short and often contain just a single measurement point.

Frame Based Motion Classification With an average of 84% (90% for the grouped set), the best frame based motion results were achieved using a k-NN classifier with mean, variance and median as features for the acceleration sensors and gyroscopes on the wrists and the upper arms. The main confusion occurs between the front and back wheel (in the pumping and wheel spinning gestures), between the screw and unscrew actions, and between the (dis)assemble actions. As expected, the carrier gestures are also poorly recognized.

motion   motion result   position   position result   fusion method      fusion result
k-N-N    84% (90%)       C4.5       59% (81%)         AvgOfBest          80% (90%)
k-N-N    84% (90%)       C4.5       59% (81%)         Avg                75% (89%)
k-N-N    84% (90%)       C4.5       59% (81%)         CM                 89% (95%)
k-N-N    84% (90%)       C4.5       59% (81%)         PA                 90% (96%)
HMM      65% (78%)       C4.5       59% (81%)         AvgOfBest          73% (89%)
HMM      65% (78%)       C4.5       59% (81%)         Avg                71% (90%)
HMM      65% (78%)       C4.5       59% (81%)         CM                 71% (85%)
HMM      65% (78%)       C4.5       59% (81%)         PA                 78% (91%)
-        -               -          -                 joint classifier   80% (91%)

Table 2. Comparison of fusion results

frame based after majority decision                 84.01% (90.30%)
frame based after plausibility analysis             89.59% (95.83%)
HMM                                                 65.14% (77.90%)
HMM after plausibility analysis                     78.25% (90.95%)
position k-N-N                                      60.3% (84.9%)
position k-N-N after hand plausibility analysis     60.5% (86.1%)

Table 3. Gains of the plausibility analysis

An interesting result is the fact that the motion classifier is able to provide some resolution between seemingly identical gestures performed at different locations. This is particularly true for the screws, where there is only 7% confusion between different locations. This is due to the fact that different tools (screwdriver, wrench) were used at the different locations. Also, the arm positions vary depending on location. This also explains why at least some degree of recognition was achieved between the front and the back wheels.

Model Based Time Series With 65.1% (77.9% for the grouped set), the overall recognition rate of the HMM approach is significantly lower than for the frame based classification. This has several causes. For one, the available training set is fairly small. Second, the HMMs would be expected to have an advantage on the shorter, non-periodic gestures while performing equally to slightly worse on the long, periodic ones. The problem with our data set is that while most of the periodic gestures are fairly easy to recognize, the non-periodic ones tend to be vaguely defined and unstructured. Together with the small training set this clearly favors the frame based sliding window approach. We have included the HMM classification in the paper since it is interesting to see the effect of fusion with the ultrasonic sensors on a poorer classifier.

4.2 Fusion Results

Plausibility Analysis In Table 3 the benefits of this strategy for the most promising classifiers of the frame based and the HMM methods are summarized. Results for the ultrasonic classifier are also given, although, as expected, they display only a marginal improvement. It can be seen that the plausibility analysis achieves a considerable improvement for both the k-NN (90% and 96% for the full and grouped sets respectively) and the HMM (78% and 91%) based classifiers. In Table 5 the confusion for the k-NN case is shown. It can be seen that the pumping gestures go up to 100%, clearing up the confusion between the wheels. In a similar way the values for turning the front and back wheel are also improved, although 100% is not reached. Another interesting point is the test gestures (testing the light generator and the bell). Both are rather subtle hand motions that are poorly recognized by motion analysis alone. In particular, testing the light generator is confused with turning the pedals in 22% of the tests. With plausibility analysis both test gestures go up to 100%. Note that the recognition rate for the light generator using ultrasound alone is just 31%. For the HMM classification (full confusion matrix not shown) the effect of plausibility analysis can be even more dramatic. For the turning-pedals-only gesture the recognition rate goes from 11% to 89%. For the wheel gestures the rate goes from 46% and 56% to 88% and 89%.

Joint Classification Results As can be seen from Table 2, the joint classifier achieves an average recognition rate of 80% (91% for the grouped set). This is actually a slightly worse result than the k-NN motion classifier. However, a closer look at the confusion matrix in Table 5 reveals that the average result is misleading. It is brought down by the bad performance in the carrier case and the large confusion between assembly and disassembly. Nearly all the other gestures remain the same or improve. The number of perfectly recognized gestures goes up from 1 to 5. Thus, it can be concluded that while not the best, the joint classification is still a viable fusion strategy.

Classifier Rankings Classifier fusion can only succeed if the classifiers provide complementary information. As a rough way of checking whether this is the case with the ultrasonic and motion classifiers, Table 4 shows how often the correct class appears in the top rank of both classifiers, of only one of them, or of neither. The same information is also shown for the first two ranks.

                                     motion k-N-N /     motion HMM /
                                     position C4.5      position C4.5
first rank        both correct       50.09%             37.80%
                  motion correct     33.92%             27.34%
                  position correct    8.64%             20.93%
                  none                7.35%             13.93%
first two ranks   both correct       79.13%             72.02%
                  motion correct     14.17%             12.27%
                  position correct    4.06%             11.17%
                  none                2.65%              4.53%

Table 4. Analysis of top and top-two ranking results.

It can be seen that in 97.35% (100% - 2.65%) of the cases the correct class is in one of the top two ranks of either the k-NN motion or the ultrasonic C4.5 classifier. By contrast, it is in the top two of the k-NN motion classifier alone in only 93.30% (79.13% + 14.17%). The gain is even higher for the HMM motion classifier and for the top rank only case.

Classifier Fusion Results Of the three investigated fusion methods, the confusion matrix based one (CM) has proven to be the most successful, followed by the average of the best. The CM method is only slightly (1%) worse than the plausibility analysis. Given the size of the data set this difference must be considered insignificant. However, in the confusion matrices a significantly different behaviour can be seen. In particular with the (dis)assembly gestures the CM strategy leads to more balanced results. Whereas the plausibility analysis tends to enforce the motion based preference for one of the two (e.g. 100% for assemble seat, only 58% for disassemble), the CM fusion leads to roughly equal recognition rates (86% and 75%). Summed over the two, the recognition rate actually goes up a few percent.

5 Conclusion and Future Work

We have demonstrated that despite all its problems ultrasonic hand tracking is a valuable addition to motion sensor based recognition of manipulative gestures. Specifically we have shown that

1. The best motion classification is improved by 6% through ultrasonic tracking based plausibility analysis to reach 90% on the full and 96% on the reduced data set.

2. The ultrasonic and motion based classifiers provide complementary information. This is shown through the ranking analysis in Table 4 and through the radical improvement in the recognition rate of some classes (e.g. for front wheel disassembly from 53% and 43% for ultrasound and acceleration respectively to 80% with CM based fusion).

The next step in our work will be to record and investigate a larger data set, in particular to study the effect of more training data on the confusion matrix based fusion. Such a data set will also enable more advanced fusion methods such as linear regression analysis of the probabilities of different rank combinations. Furthermore, ultrasonic devices with higher sampling rates will be employed and improvements such as multiple, spatially distributed receivers to avoid occlusions will be investigated. We would like to conclude by pointing out that while the work has focused on ultrasonic location, many of our results should be applicable to other tracking techniques such as recently developed UWB trackers or magnetic tracking. While some of the problems might be less grave (in particular the sampling rate issue), signal reliability issues as well as fusion considerations are likely to remain relevant.

References

[1] H. Balakrishnan and N.B. Priyantha. The cricket indoor location system: Experience and status. In Proceedings of the Workshop on Location-aware Computing, Seattle, WA, October 2003.
[2] D. Bannach. An online sensor data processing toolbox for wearable computers. In ISWC04, Eighth IEEE International Symposium on Wearable Computers, October 2004.
[3] L. Bao and S.S. Intille. Activity recognition from user-annotated acceleration data. In Proc. Pervasive Computing, 2004.
[4] D. Fox, J. Hightower, H. Kauz, L. Liao, and D.J. Patterson. Bayesian techniques for location estimation. In Proceedings of the Workshop on Location-aware Computing, Seattle, WA, October 2003.
[5] A. Harter, A. Hopper, P. Steggles, A. Ward, and P. Webster. The anatomy of a context-aware application. In Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, MOBICOM'99, pages 59-68, Seattle, Washington, USA, August 1999.
[6] J. Hightower. From position to place. In Proceedings of the Workshop on Location-aware Computing, Seattle, WA, October 2003.
[7] J. Hightower and G. Borriello. Location systems for ubiquitous computing. IEEE Computer, 34(8):57-66, August 2001.
[8] P. Lukowicz, J.A. Ward, H. Junker, G. Tröster, A. Atrash, and T. Starner. Recognizing workshop activity using body worn microphones and accelerometers. In Pervasive Computing, 2004.
[9] J. Mantyjarvi, J. Himberg, and T. Seppanen. Recognizing human motion with multiple acceleration sensors. In 2001 IEEE International Conference on Systems, Man and Cybernetics, volume 3494, pages 747-752, 2001.
[10] T. Nicolai, T. Sindt, H. Kenn, and H. Witt. Case study of wearable computing for aircraft maintenance. In Proc. 2nd Int. Forum on Applied Wearable Computing, Zurich, Switzerland. VDE, March 2005.
[11] N.B. Priyantha, A. Chakraborty, and H. Balakrishnan. The cricket location-support system. In Proceedings of the Sixth Annual ACM International Conference on Mobile Computing and Networking, August 2000.
[12] C. Randell and H. Muller. Context awareness by analysing accelerometer data. In Digest of Papers, Fourth International Symposium on Wearable Computers, pages 175-176, 2000.
[13] S. Antifakos, F. Michahelles, and B. Schiele. Proactive instructions for furniture assembly. In 4th Intl. Symp. on Ubiquitous Computing, UbiComp 2002, page 351, Göteborg, Sweden, 2002.
[14] A. Smith, H. Balakrishnan, M. Goraczko, and N.B. Priyantha. Tracking moving devices with the cricket location system. In The Second International Conference on Mobile Systems, Applications and Services, Boston, MA, June 2004.
[15] J.A. Tauber. Indoor location systems for pervasive computing. Technical report, MIT Laboratory for Computer Science, Cambridge, MA, August 2002.
[16] K. Van-Laerhoven and O. Cakmakci. What shall we teach our pants? In Digest of Papers, Fourth International Symposium on Wearable Computers, pages 77-83, 2000.
[17] A. Ward, A. Jones, and A. Hopper. A new location technique for the active office. IEEE Personal Communications, 4(5):42-47, October 1997.

Table 5. Confusion matrices (rows: true gesture IDs 1-21, columns: classified gesture IDs) for the separate methods (ultrasonic C4.5, motion k-N-N, motion HMM) and the different fusion strategies (motion k-N-N merged with ultrasonic C4.5 using the PA method, joint motion/ultrasound classification, and motion k-N-N merged with ultrasonic C4.5 using the CM method).