
Classifiers for Driver Activity Monitoring

Harini Veeraraghavan, Nathaniel Bird, Stefan Atev, Nikolaos Papanikolopoulos
[email protected] [email protected] [email protected] [email protected]
Department of Computer Science and Engineering, University of Minnesota

Abstract: The goal of this work is the detection and classification of driver activities in an automobile using computer vision. To this end, this paper presents a novel two-step classification algorithm, namely, an unsupervised clustering algorithm for grouping the actions of a driver over a certain period of time, followed by a supervised activity classification algorithm. The main contribution of this work is the combination of the two methods to provide a computationally fast solution for deployment in real-world scenarios. The unsupervised clustering groups the actions of the driver based on the relative motion detected using a skin-color segmentation algorithm, while the activity classifier is a binary Bayesian eigenimage classifier. Activities are grouped as safe or unsafe, and the results of the classification are shown for several subjects from two distinct sets of driving video sequences.

KEYWORDS: Driver activity, unsupervised clustering, activity history images, eigenimage classification, activity classification.

I. INTRODUCTION

The goal of this project is to develop a camera-based system for monitoring the activities of automobile drivers. While the range of activities performed by a driver is large (driving, eating, talking on a cellular phone, adjusting controls on the dashboard, as well as exhibiting symptoms of fatigue such as micro-sleeps), these actions can be broadly classified as safe or unsafe for the purpose of monitoring and issuing alerts to the driver. Thus, in this work, we focus on the two-class classification problem of safe and unsafe driving behaviors. Such a classification has potential uses in real-world applications such as in-vehicle driver assistance systems, as well as in the objective reporting of driver activities over long periods as opposed to subjective reports. Most work on driver activity monitoring, such as Baker et al. (2004), Jia & Yang (2002), Jiangwei et al. (2004), Smith et al. (2003), Wahlstrom et al. (2003), Zhu & Fujimura (2004), and Fletcher et al. (2005), focuses on the problem of driver gaze detection, as well as facial expression recognition. However, there is very little work that focuses on the overall activities of a driver. For instance, a driver can be engaged in adjusting the controls with little or no change in facial expression or gaze direction. On the other hand, when observing the entire profile of the driver, such an action can be easily distinguished from a driving action.


Hence, in this work, we focus on the overall action of the driver to infer whether their behavior is safe or unsafe. To the authors' knowledge, very little work addresses driver activity classification based on the observation of other features such as the motion of the hands inside the vehicle. Another important concern for activity recognition in the context of deployment in a real-world problem such as a driver assistance system is the speed of recognition. In this work, we present a novel approach to driver activity recognition using a two-step process that combines an unsupervised clustering of activities over an extended time period to produce activity history images, which are then classified into one of safe, unsafe, or unknown actions using a Bayesian eigenimage classifier.[1] The clustering algorithm, based on agglomeration of the skin-tone regions of the input video, can be performed in near real-time. The classification algorithm, on the other hand, cannot be used for real-time classification when applied to individual image frames. However, when using the results of the activity clustering, we need to classify relatively few data points, thereby reducing the computational complexity immensely. Experimental results show the performance of the classifier on different test subjects as well as on different frames of the video sequences. Two sets of experiments were performed. In the first set, the performance of the activity clustering and the activity classification are evaluated independently; in other words, the activity classification is performed on single-frame action shots of the individual as opposed to using the results of the activity clustering. The second set of experiments evaluated the performance of the system by performing the classification on the activity history images obtained as a result of the activity clustering.

A. Paper Outline

This paper is organized as follows: after introducing and motivating the problem, Section II discusses the related work. Details of the two-step classification method, namely, the unsupervised action clustering and the action classification using Bayesian eigenimage analysis, are presented in Section III and Section IV. Section V presents the experimental results and the discussion, while Section VI concludes the paper with some details of future work.

B. Approach Overview

The basic approach to activity classification is shown in Fig. 1. The input images are obtained from a commercial off-the-shelf camera mounted on the side of the vehicle's interior. The subject's actions are segmented using a skin-color segmentation algorithm followed by an unsupervised clustering algorithm.

[1] The third class, unknown, is used for actions that cannot be clearly classified as either safe or unsafe.


These agglomerated clusters are then classified into safe or unsafe actions by the Bayesian eigenimage classifier.

Fig. 1. Approach overview: image sequence → pre-processing (skin segmentation) → activity clustering (activity history images) → activity classification. Activity classification consists of three main steps, namely, the segmentation of relevant regions, activity clustering, and activity classification.

II. RELATED WORK

Most of the work on driver activity monitoring is motivated by applications in driver assistance systems for improving the safety of drivers and pedestrians. In the work by Fletcher et al. (2005), the eye gaze of the driver is combined with the scene context to infer where the driver was focusing on the road and to provide alerts to the driver in the case of missed road signs, unexpected motions such as jaywalking pedestrians, and driver inattentiveness. Other works, such as Baker et al. (2004), Smith et al. (2003), and Zhu & Fujimura (2004), focus on the problem of detecting driver inattentiveness resulting from fatigue by monitoring facial expressions, eyes, and eyelid movements. For instance, the work in Baker et al. (2004) used active appearance models of the driver's face in order to recognize facial expressions for inferring the state of the driver. Similarly, methods for detecting driver vigilance by monitoring the driver's eye gaze, eyelid motion patterns, head orientation, and motion are reported in Jia & Yang (2002), Jiangwei et al. (2004), and Wahlstrom et al. (2003). The work in Ji et al. (2004) combines multiple sensors, such as an infra-red (IR) camera with a standard CCD camera, for extracting the driver's face pose, head orientation, and eyelid motion patterns for detecting driver fatigue. Other methods for detecting human activity include spatio-temporal templates such as the motion history images proposed by Bobick & Davis (2001), where the sequence of actions performed by a subject is summarized over time using a motion history image. In the proposed method, the agglomerated clusters obtained for the actions in the first step are analogous to these temporal templates. However, the conceptual similarity of the two methods ends here. The activity history image is essentially a probability map of each pixel in the image being a skin-tone region for a given action cluster.


The action classification is then performed using a Bayesian eigenimage classifier that compares the activity history image to the stored set of eigenimages. Other activity recognition methods, such as the W4 system by Haritaoglu & Davis (2000) and Pfinder by Wren et al. (1997), make use of specific silhouettes of the subjects to recognize the actions. One of the main sources of poor recognition in silhouette-based methods is poor segmentation of the subjects, which mainly results from strong illumination and shadows. Infrared cameras have been used in methods such as Zhu et al. (2002) and Ji et al. (2004) for segmenting the pupils of the subjects. The work in Loy & Zelinsky (2003) uses a robust radial-symmetry based segmentation method for segmenting the pupils of the subjects for determining eye gaze. Most methods employ simple heuristics for recognizing the different activities. Examples include the PERCLOS measure, used for detecting fatigue based on the count of eyelid motions, and specific poses or sequences of poses used for recognizing human activities, as in Bobick & Davis (2001) and Ivanov & Bobick (2000). The work in Baluja & Pomerleau (1994) makes use of a neural network for recognizing driver alertness through analysis of the subject's face. Learning methods such as hidden Markov models, and more complex formulations such as coupled hidden Markov models, were used in the work of Oliver & Pentland (2000) for recognizing driver behaviors such as maneuvers, lane changes, steering, etc., in addition to a multitude of sensors for inferring driver behaviors. The work in Salvucci et al. (2001) describes an elaborate cognitive architecture for inferring the driver's intentions. An unsupervised method for learning human behaviors is presented in Song et al. (2001), using a maximum-likelihood method for learning the structure of a triangulated graph of feature point-based human motions. The work in Gao et al. (2004) makes use of the general body regions where significant motion occurs in order to detect various activities. Other methods for action recognition make use of dimensionality reduction techniques to represent the large variations in the actions across different subjects and varying illuminations. Examples include the work in Bartlett et al. (2002), Belhumeur et al. (1997), Belkin & Niyogi (2003), Roweis & Saul (2000), and Turk & Pentland (1991). This work makes use of a Bayesian eigenimage-based method as proposed in Moghaddam et al. (2000). This method allows us to represent the space of safe and unsafe driving behaviors of several subjects under uncontrolled illumination conditions using a small number of eigenimages. The eigenimages are obtained from a small number of subjects in the training step. The main difference from the existing methods is that this work combines an unsupervised clustering with the Bayesian eigenimage classifier, which makes use of supervised learning.


The combination of the two methods allows us to reduce the number of comparisons required for the classification. Hence, instead of performing action classification in each frame, classification is performed on an agglomerated image which contains the history of motion of the subject over a few frames. Thus, in addition to classification, this work also introduces a novel method for summarizing a long video sequence using a small set of representative action clusters.

III. UNSUPERVISED CLUSTERING OF DRIVING BEHAVIORS

The most basic cue about a driver's actions is his/her pose. However, tracking a driver's articulated motion in an environment with rapidly varying illumination and many potential self-occlusions is prohibitive, both in terms of computational resources (for model-based tracking) and in terms of the accuracy of matching an overly complex model using a limited number of models. Our approach does not depend on an estimation of a driver's pose, but on the observation that periods of safe driving are periods of little motion of the driver's body. Of course, a driver does not move much while talking on a cellular telephone (an unsafe driving behavior), so the need arises to classify periods of minimal motion into safe-driving periods and unsafe-driving periods. Detecting motion in a moving vehicle's interior is complicated since the illumination of the interior can change very rapidly. Furthermore, the outdoor environment is visible through the vehicle's windows, so motion will always be detected in the image regions corresponding to the vehicle's windows. To address this problem, we only detect motion of skin-like regions, for example a driver's face and hands. This approach is advantageous since skin color detection can be fairly robust to various illumination conditions. Skin tones are also unlikely to appear in the window regions, so motion in the outside environment is unlikely to be detected. Portions of the vehicle's interior that are misclassified as skin are static and will contribute nothing to the detected motion, so such regions are not problematic either.

A. Skin Color Detection

We perform the classification of color pixels into skin tones and non-skin tones by working in the normalized RGB space. The normalization is effective against varying illumination conditions, and can also be motivated by the fact that human skin tones have very similar chromatic properties regardless of race, as discussed in Vezhnevets et al. (2003). An RGB triplet (r, g, b), with values for each primary color between 0 and 255, is normalized into the triplet (r', g', b') using the relationships:

r' = \frac{255\,r}{r+g+b}, \quad g' = \frac{255\,g}{r+g+b}, \quad b' = \frac{255\,b}{r+g+b}.   (1)


We classify a normalized color (r', g', b') as a skin color if it lies within the region of normalized RGB space described by the following rules from Vezhnevets et al. (2003):

r' > 95, \quad g' > 45, \quad b' > 20,
\max\{r', g', b'\} - \min\{r', g', b'\} > 15,
r' - g' > 15, \quad r' > b'.   (2)

Fig. 2 shows the results of the skin color detection for various subjects and lighting conditions. It should be noted that other skin-tone detection methods can be used without affecting the rest of the algorithm. We tried using a non-parametric Bayesian skin probability map as an alternative approach, but its results were of unsatisfactory quality, as the number of training images used to create the map was small and the images themselves were obtained under radically different lighting conditions than those during our driver monitoring experiments. However, if a better skin-color detection method is available, it can be substituted for the rule-based one.

Fig. 2. Skin color-based segmentation (bottom row) on various images (top row). Skin-segmented regions are indicated in black. The results are post-processed by a sequence of morphological erosions and dilations.
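As an illustration, the sketch below applies the normalization of Eqn. (1) and the rules of Eqn. (2) to an image with NumPy. The function name, the assumed BGR channel order, and the small constant guarding against division by zero are our assumptions rather than details from the paper's implementation; the morphological post-processing mentioned in the caption of Fig. 2 would be applied to the returned mask separately.

```python
import numpy as np

def skin_mask(image_bgr):
    """Pixel-wise skin-tone classification in normalized RGB space (Eqns. 1-2)."""
    b, g, r = [image_bgr[..., i].astype(np.float64) for i in range(3)]
    s = r + g + b + 1e-6                                        # guard against black pixels
    rp, gp, bp = 255.0 * r / s, 255.0 * g / s, 255.0 * b / s    # Eqn. (1)

    mx = np.maximum(np.maximum(rp, gp), bp)
    mn = np.minimum(np.minimum(rp, gp), bp)
    return ((rp > 95) & (gp > 45) & (bp > 20) &                 # Eqn. (2)
            (mx - mn > 15) &
            (rp - gp > 15) & (rp > bp)).astype(np.uint8)
```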

B. Detecting Changes in Behavior

Since our goal is to detect and classify relatively motion-free periods, we use inter-frame differencing of the skin-tone masks to decide when a period starts and ends. If the change between two consecutive skin-color masks obtained by the color classification step is significant, the current low-motion period terminates. When the inter-frame difference drops, we start accumulating data about a new low-motion period. Given an image region R, the change between two consecutive binary skin-color masks I_{t-1} : R \to \{0, 1\} and I_t : R \to \{0, 1\} is described by the total number of pixels whose classification changed:

c(t) = \sum_{p \in R} |I_t(p) - I_{t-1}(p)|.   (3)


Whenever c(t) is large, a transition in driver behavior is detected. A global threshold cannot be used to determine whether the change c(t) is significant, since different low-motion actions differ in the typical amount of "natural" motion that occurs throughout the action. Additionally, the amount of noise in the skin classification masks may differ from one run of the algorithm to another. Finally, the significance of a change c(t) depends on how much of a driver's skin is exposed. For these reasons, we chose a relative threshold for the significance of c(t) that depends on the observed variation in c(t) over a period of time. We define a threshold θ for how many standard deviations away from the mean c(t_i) a given c(t) must be in order to start a new action period. At every frame t, a window over the past w frames is used, and the mean µ and standard deviation σ of all the c(t_i) within this window are computed. If c(t) > µ + θσ, then the inter-frame difference is considered significant, the current action period is determined to be over, and a new action period is started. Since we start recording data for a new action immediately at the onset of the significant change, the deviation in the first few samples (i.e., c(t_1), c(t_2), . . . , c(t_n)) is larger, which limits the number of spurious short periods identified by the algorithm. This is advantageous since the sequence of images leading to a low-motion action will contribute to the action model and thus will allow us to distinguish between otherwise similar low-motion periods based on information about the high-motion events that preceded them. In both experiments presented here, w is set to 900 frames, which corresponds to 30 seconds of past activity. In the first experiment θ is set to 2.0, while in the second experiment θ is varied over {1.0, 1.25, 1.50, 1.75, 2.0}.

C. Activity History Images

To model the distinct driver actions at different times, activity history images are used. An activity history image is a probability map of every pixel in the image being a skin-tone region for a given activity cluster. A significant change in the binary skin-tone masks starts a new activity history image, or action model. Thus the activity history image is a probability map that describes the expectation of observing a skin color at every location in the input images. Given the binary skin masks I_{t_1}, I_{t_2}, . . . , I_{t_n} with small spatial differences in actions (computed from the inter-frame differencing of the skin-tone masks) for the duration t_1 until t_n, the activity history image P is defined by

P = \frac{1}{n} \sum_{t=t_1}^{t_n} I_t.   (4)
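A minimal sketch of the change detection and activity history image accumulation described in Sections III-B and III-C is given below, assuming binary skin masks as NumPy arrays. The class name, its attributes, and the exact bookkeeping of the sliding window are our assumptions; the default parameters mirror the paper's w = 900 frames and θ = 2.0.

```python
import numpy as np
from collections import deque

class ActionPeriodSegmenter:
    """Detect behavior transitions from c(t) (Eqn. 3) using the adaptive threshold
    mu + theta*sigma over a sliding window, and accumulate an activity history
    image (Eqn. 4) for each low-motion period."""

    def __init__(self, theta=2.0, window=900):
        self.theta = theta
        self.changes = deque(maxlen=window)   # past c(t) values over w frames
        self.prev_mask = None
        self.mask_sum = None                  # running sum of masks in the current period
        self.n_frames = 0

    def update(self, mask):
        """Feed one binary skin mask; return a finished AHI when a new period starts."""
        mask = mask.astype(np.float64)
        finished = None
        if self.prev_mask is not None:
            c_t = np.abs(mask - self.prev_mask).sum()              # Eqn. (3)
            if len(self.changes) > 1:
                mu, sigma = np.mean(self.changes), np.std(self.changes)
                if c_t > mu + self.theta * sigma and self.n_frames > 0:
                    finished = self.mask_sum / self.n_frames       # Eqn. (4)
                    self.mask_sum, self.n_frames = None, 0
            self.changes.append(c_t)
        self.prev_mask = mask
        # accumulate the current period's activity history image
        self.mask_sum = mask.copy() if self.mask_sum is None else self.mask_sum + mask
        self.n_frames += 1
        return finished
```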


Fig. 3. Activity history images for several action clusters (representing more than 80% of the driver's activity). Darker regions indicate higher probability of observing skin tones.

Sample activity history images for several actions are shown in Fig. 3. Individual actions that are determined to be similar are merged together into clusters. The goal of the clustering is to produce clusters that correspond to a single type of behavior (safe or unsafe). Such clustering facilitates further analysis of a driver's activities as it tremendously reduces the amount of data that needs to be analyzed (thousands of video frames versus tens of activity models). The similarity between a new activity history image P and an existing activity history image Q is defined as:

d(P, Q) = \frac{\sum_{i \in R} \sqrt{P(i)\,Q(i)}}{\sqrt{\left(\sum_{i \in R} P(i)\right)\left(\sum_{i \in R} Q(i)\right)}}.   (5)

The measure defined in Eqn. (5) is the Bhattacharya coefficient, which measures the overlap between two normalized histograms and ranges from 0 to 1. A high similarity measure corresponds to similar activity history images (AHIs), while a low measure corresponds to dissimilar models. For the experiments presented in this work, a threshold value for the Bhattacharya coefficient of Eqn. (5) is used. In the first experiment, this value was set to 0.85, while it was varied over {0.75, 0.80, 0.85, 0.90} for the second experiment. An activity history image is compared to the means of all the stored activity history images and merged with the most similar one if the similarity measure exceeds the above Bhattacharya threshold. The new activity history image P is merged into the most similar activity history image, represented by Q, according to:

Q(i) \leftarrow \frac{n}{n+m} P(i) + \frac{m}{n+m} Q(i),   (6)

where n and m are the number of video frames represented in P and Q, respectively.
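For illustration, a minimal sketch of the similarity computation and the merging rule follows, assuming activity history images stored as NumPy arrays of equal shape. The list-of-(AHI, frame count) bookkeeping, the function names, and the small constant added to avoid division by zero are assumptions rather than details of the paper's implementation; the default threshold of 0.85 matches the value used in Experiment I.

```python
import numpy as np

def ahi_similarity(p, q):
    """Bhattacharya coefficient between two activity history images (Eqn. 5)."""
    return np.sqrt(p * q).sum() / (np.sqrt(p.sum() * q.sum()) + 1e-12)

def merge_ahi(p, n, q, m):
    """Merge AHI p (covering n frames) into AHI q (covering m frames) (Eqn. 6)."""
    return (n * p + m * q) / (n + m)

def assign_to_cluster(p, n, clusters, threshold=0.85):
    """Merge p into the most similar stored cluster if the similarity exceeds the
    threshold; otherwise start a new cluster. `clusters` holds (ahi, frame_count) pairs."""
    if clusters:
        scores = [ahi_similarity(p, q) for q, _ in clusters]
        best = int(np.argmax(scores))
        if scores[best] > threshold:
            q, m = clusters[best]
            clusters[best] = (merge_ahi(p, n, q, m), n + m)
            return best
    clusters.append((p, n))
    return len(clusters) - 1
```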

On the surface, activity history images appear similar to the motion history images described in Bobick & Davis (2001). However, this is not the case. Bobick & Davis (2001) use straight image differencing as the cue in their method, whereas we use the skin-tone regions. Also, whereas their motion history images retain information about the order in which motion in certain regions occurs, we forgo ordering information and instead look at the probabilities of skin tones being seen over the entire sequence. This is useful because, for the unsafe actions we are monitoring, it does not matter in which order the driver presses buttons on the dashboard stereo, merely that their hand is in that location.

IV. BAYESIAN EIGENIMAGE-BASED ACTIVITY CLASSIFICATION

In this work, the activity history images obtained through the unsupervised clustering step are used as the input to the activity classifier. The Bayesian eigenimage classification proposed by Moghaddam et al. (2000) is a dimensionality reduction method which essentially finds the similarity of the input target image to the candidate models of action for classification. In other words, each action class, safe and unsafe, is represented by a set of low-dimensional representations such as eigenimages, and the target image is classified as a given action based on the image it matches best in either of the two classes. When the spaces of image locations spanned by the two actions are mutually exclusive, classification is trivial. However, this is not so in our case. Fig. 4 illustrates an example of a safe and an unsafe action. As can be seen, the spaces of image locations occupied by the agglomerated images for the two action classes overlap, making classification a challenging task.

Fig. 4. Two examples of activity history images, for a safe and an unsafe action. As can be seen, the image regions where the two actions occur overlap, making classification a challenging task.

In this work, we currently assume that the action classes are distinct enough to perform the classification using a two-way classifier, safe and unsafe. A third class, unknown, is assigned to the activity history images that cannot be reliably classified as either a safe or an unsafe action. Details of the learning algorithm, the training, and the testing method are discussed in the following subsections.

A. Training Method

The goal of training is to find a representation for the images of a given class. Essentially, we want to find a low-dimensional representation for the data. Several methods exist, such as the Karhunen-Loève representation. We use the eigenimage method originally proposed by Turk & Pentland (1991) for face recognition and later extended by Moghaddam et al. (2000).


Robustness of the method to illumination variations and self-occlusions can be achieved by using multiple suitable training images. For a given set of images corresponding to a given class, the largest eigenvalues and eigenvectors represent the distribution of the data along the most significant component directions; this is the basis of the method. For a given set of images I_i^1, . . . , I_i^K belonging to a class C_i, an eigenvalue decomposition is performed to obtain Ω_i, the eigenvectors for the class C_i. This operation is performed off-line for each class of images. In this work, two different experiments were performed to evaluate the performance of the action classifier. In the first experiment, sets of single-frame shots of the driver were used, while the second experiment used the activity history images for training. The first experiment evaluated the efficacy of the Bayesian eigenimage classifier for this application. The second experiment combines the two modalities, namely, the activity clustering and the classification, in a unified framework. Fig. 5 and Fig. 6 show the second-largest principal eigenimage for the safe and unsafe driving actions, along with an example training image, for the two different experiments.

Fig. 5. Sample training image and the largest principal eigenimage for safe and unsafe actions for Experiment I using single action shots of the subjects. (a) Example training image. (b) Largest principal component for safe action. (c) Largest principal component for unsafe action.

Fig. 6. Sample training image and the largest principal eigenimage for the safe and unsafe actions for Experiment II using activity history images obtained from the unsupervised clustering. (a) Example training image. (b) Largest principal component for safe action. (c) Largest principal component for unsafe action.

Fig. 7 shows the effect of the number of training samples on the accuracy of classification for all samples belonging to the unsafe actions, i.e., the number of correctly classified and misclassified images for different training sample sizes. Based on Fig. 7, we chose a sample size of 40, where the number of correctly classified samples is maximized and the number of incorrectly classified samples is minimized.


Fig. 7. The results of classification for the training images of the unsafe driving class. A training sample size of 40 was chosen since the number of correctly classified examples is maximized and the number of incorrectly classified examples minimized for this sample size.

Fig. 8 shows the results of classification for examples containing the safe driving action for different sample sizes. Finally, Fig. 9 shows the distribution of the two classes along the three principal components. The largest eigenvalues and eigenvectors capture the largest variation within a class. However, given the small size of the training dataset in comparison to the dimensionality of the data, the variations resulting from erroneous skin segmentation are captured in the first principal component. Hence, we take the eigenvalues and eigenvectors starting from the second-highest principal component for the classification models.

Fig. 8. Classification results for the safe driving class on the training set. A sample size of 20 was chosen since the number of correctly classified examples was maximized and the incorrectly classified examples minimized for this sample size.
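To make the training step concrete, the sketch below performs a per-class principal component analysis on vectorized training images and drops the first principal component, which the discussion above attributes mainly to segmentation noise. Whether the paper mean-centres the images, how many components it retains, and the function name are our assumptions; the sketch only illustrates the general eigenimage computation.

```python
import numpy as np

def train_class_eigenimages(images, n_components=10, skip_first=1):
    """Per-class eigenimage training (Sec. IV-A): PCA on vectorized images,
    skipping the first principal component. Returns eigenvalues D, eigenimages S
    (one per row), and the class mean."""
    X = np.stack([im.ravel().astype(np.float64) for im in images])  # K x d
    mean = X.mean(axis=0)
    Xc = X - mean
    # eigen-decomposition via the small K x K Gram matrix (K << d)
    vals, vecs = np.linalg.eigh(Xc @ Xc.T / len(X))
    order = np.argsort(vals)[::-1][skip_first:skip_first + n_components]
    D = vals[order]
    S = (Xc.T @ vecs[:, order]).T                      # back-project to image space
    S /= np.linalg.norm(S, axis=1, keepdims=True)      # unit-length eigenimages
    return D, S, mean
```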

Fig. 9. Distribution of the two classes under the three largest principal components (starting from the second highest). The class safe drive is represented by o and the class talk is represented by x.

B. Activity Classification

The activity in each frame is evaluated by computing its similarity to the set of eigenimages obtained as a result of training. The standard measures for computing similarity between candidate and target models are distance measures such as the Euclidean distance. In this work, we use a probabilistic measure where, given a candidate image I_x, its similarity to an image I_i^j from a class C_i is computed by projecting the difference of the two images, µ = I_x − I_i^j, onto the principal eigenvectors of class C_i. This can be expressed as

P(\mu | C_i) = \frac{e^{-\frac{1}{2} \mu^{T} \Omega_i^{-1} \mu}}{(2\pi)^{d/2} |\Omega_i|^{1/2}},   (7)

where Ω_i contains the m largest eigenvectors for class C_i and d is the dimensionality of the data. This operation is repeated over all member images in all the classes. The computation can become very expensive as the number of classes and the number of images increases. To reduce the computational burden, an off-line whitening transformation is performed, as described in Moghaddam et al. (2000). Each of the images I_i^1, . . . , I_i^K in class C_i is transformed using the eigenvalues and eigenvectors:

im_i^j = D_i^{-1/2} S_i I_i^j,   (8)

where D_i and S_i are the eigenvalues and eigenvectors computed for the class C_i. Given these pre-computed transformations, the match for a new image I_x is computed as:

P(\mu | C_i) = \frac{e^{-\frac{1}{2} \| im_x - im_i^j \|^2}}{(2\pi)^{d/2} |\Omega_i|^{1/2}},   (9)

where im_x is the transformed image of I_x computed from the eigenvectors and eigenvalues of C_i as in Eqn. (8). The activity in a frame is classified as safe driving or unsafe driving based on the relative values of P(µ|SafeDriving) and P(µ|UnsafeDriving). Activities having almost equal probabilities for both classes are rejected and not classified as belonging to either class; this occurs when the probability of association is in the range from 0.45 to 0.55.
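For illustration, a sketch of the whitened matching and the safe/unsafe/unknown decision is given below. It pairs with the training sketch in Section IV-A; the mean-centring, the omission of the constant normalization term of Eqn. (9), and the function and variable names are our assumptions rather than details from the paper.

```python
import numpy as np

def whiten(image, D, S, mean):
    """Whitening transform of Eqn. (8): project onto the class eigenvectors and
    scale by the inverse square roots of the eigenvalues."""
    return (S @ (image.ravel().astype(np.float64) - mean)) / np.sqrt(D)

def class_score(image, model):
    """Best match against a class following Eqn. (9), with the constant
    normalization factor omitted for brevity. `model` is (D, S, mean, training_images)."""
    D, S, mean, training_images = model
    x = whiten(image, D, S, mean)
    dists = [np.sum((x - whiten(t, D, S, mean)) ** 2) for t in training_images]
    return np.exp(-0.5 * min(dists))

def classify(image, safe_model, unsafe_model, reject_band=(0.45, 0.55)):
    """Relative comparison of the two class scores; near-equal normalized
    probabilities fall into the reject band and are labeled unknown."""
    p_safe = class_score(image, safe_model)
    p_unsafe = class_score(image, unsafe_model)
    post = p_safe / (p_safe + p_unsafe + 1e-300)
    if reject_band[0] <= post <= reject_band[1]:
        return "unknown"
    return "safe" if post > 0.5 else "unsafe"
```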


C. Combining Unsupervised Clustering and Eigenimage Classification

The unsupervised clustering step summarizes the activity in a time period by an average activity image. Assuming that the variance in the activity of a cluster is small, the activity history image may be assumed to correspond to a single activity. Hence, the activity history image is a good candidate for classifying the action occurring in that period. The eigenimage classifier is trained on activity history images (as shown in Fig. 6) obtained by manually marking the activity periods in the training step.

V. EXPERIMENTAL RESULTS

A. Objective

The objective of the experiments was to evaluate the performance of the proposed activity classification method. The experiments were designed to evaluate the performance of the method on novel subjects under outdoor lighting, as well as the performance of the system when combining clustering with eigenimage classification.

B. Experimental Setup

Experiments were performed in outdoor, uncontrolled illumination settings. For the safety of the subjects, videos were taken in a stationary vehicle with the subjects pretending to drive. However, there was motion outside of the vehicle due to pedestrians and other moving vehicles in the parking lot. The camera was mounted on the side or close to the rear of the vehicle such that the subject's profile view was clearly visible. In one set of experiments, the subjects' faces, upper arms, and hands were all uncovered, while in the second set, which was taken in winter, only the face and the hands of the subjects were skin-segmentable. For Experiment I, videos of three individuals were used. The video camera used to record the videos was placed on a tripod directly outside the passenger-side window, viewing the driver's profile. Each video in this test set features a different individual sitting in the vehicle pretending to drive; different ethnicities and genders are represented. The lighting conditions vary throughout the videos as the vehicle was in an outdoor parking lot. Each of the three videos is about six minutes long (between 10,500 and 11,000 frames), full-color, and at full 720 × 480 resolution. Because these test sequences were filmed in the summer, all individuals were wearing short-sleeve shirts that expose the forearms. For Experiment II, videos of six individuals were used. These sequences were recorded using a camera placed behind the driver's right shoulder. Again, the lighting conditions vary throughout the sequences as the filming was performed outdoors. Each of these videos was about five minutes long, full-color, and at full 720 × 480 resolution. The forearms of the individuals filmed in these sequences are not exposed. During the course of each test sequence, the driver goes through periods of driving normally and periods of performing distracting actions. Distracting actions include talking on a cellular telephone, adjusting the controls of the dashboard radio, drinking from a soda can, and, in Experiment II, a "free" dangerous behavior that was left up to the participant. These actions were chosen as the unsafe behaviors to test for because they are very common.

C. Experiment I

This experiment was performed in an outdoor setting where the subject's upper arms and forearms could be segmented using the skin-segmentation algorithm, thereby providing more information about the subject's actions. The videos consist of three subjects of different ethnic groups. The objective of this experiment was to evaluate the performance of the clustering algorithm for clustering activities, as well as to evaluate the baseline performance of the activity classification. The goal of the clustering method is to group consecutive frames of similar activity together while placing distinct activities in different clusters. Producing too few clusters carries the danger that both safe and unsafe actions are merged into the same cluster. On the other hand, using a very small threshold to group the action frames can result in a large number of clusters, thereby defeating the purpose of clustering. In this work, we chose a threshold value of 0.85 for clustering. The threshold was determined empirically on the different video sequences, since it produced the best results with the least confusion between the two classes. Table I shows the results of clustering using a threshold of 0.85. The total number of clusters corresponds to the number of distinct activities recognized by the method. Singleton clusters are clusters that contain only one action model; such clusters usually reflect short periods of high motion indicative of transitions between different actions. Each sequence was manually segmented into safe and unsafe driving periods. Since the goal of the clustering method is to group activities for further analysis, it must not group together activities from the two different classes. The confusion score in the last column of Table I is thus used to indicate the proportion of data that was incorrectly grouped by the clustering algorithm. Specifically, the confusion score is the percentage of frames whose ground truth classification does not match the predominant classification of the cluster to which they were assigned by the algorithm. The goal of activity classification in this experiment was to evaluate the performance of the eigenimage classifier for the task of driver activity classification. The classifier was tested on the same three subjects as those used in the unsupervised clustering method, on the same video sequences.
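As a small illustration of the confusion score just defined, the sketch below computes it from per-frame ground-truth labels and per-frame cluster assignments; the array-based representation and the function name are our assumptions.

```python
import numpy as np

def confusion_score(frame_labels, frame_clusters):
    """Fraction of frames whose ground-truth label (safe/unsafe) differs from the
    predominant ground-truth label of the cluster they were assigned to."""
    frame_labels = np.asarray(frame_labels)
    frame_clusters = np.asarray(frame_clusters)
    mismatched = 0
    for c in np.unique(frame_clusters):
        labels = frame_labels[frame_clusters == c]
        values, counts = np.unique(labels, return_counts=True)
        mismatched += np.sum(labels != values[np.argmax(counts)])
    return mismatched / len(frame_labels)
```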


TABLE I
EXPERIMENT I: RESULTS OF ACTIVITY CLUSTERING USING A THRESHOLD OF 0.85. THE SINGLETON CLUSTERS CORRESPOND TO CLUSTERS CONTAINING FRAMES OCCURRING DURING ACTION TRANSITIONS.

Subject   Num Frames   Clusters (Singletons)   Confusion Score
1         10231        33 (24)                 4.54%
2         10110        38 (23)                 16.8%
3         10380        16 (8)                  11.1%

The training and test sets were chosen from different portions of the video sequences. The classification was performed on the individual frames, or action snapshots, of the subjects without the irrelevant portions of the images; the relevant parts of the activity were obtained through skin segmentation. Example snapshots used for activity classification in Experiment I are shown in Fig. 5. While the training images were chosen to have as little noisy segmentation as possible, no such restriction was placed on the test sets. The number of training images was 20 for the safe driving class and 40 for the unsafe driving class. A total of 963 images were used for testing. Table II shows the results of activity classification.

TABLE II
EXPERIMENT I: RESULTS OF ACTIVITY CLASSIFICATION. THE UNKNOWN CLASS CORRESPONDS TO THE CASE WHEN THE PROBABILITY OF CLASSIFYING AN ACTIVITY IN EITHER CLASS IS SIMILAR.

Actual \ Classified as   Safe   Unsafe   Unknown
Safe                     95.2   1.6      3.2
Unsafe                   14     74       12

D. Experiment II

This experiment was performed in a more challenging setting where the subjects wore long-sleeved clothing, making their forearms unsegmentable. The camera was mounted more towards the rear of the subject. Six different subjects were used in this experiment. Furthermore, the experiments were performed by combining the activity clustering with the eigenimage-based activity classifier. The eigenimage classifier was trained using a small set of activity-clustered mean images from a small set of subjects, in our case three, while the classification was tested on all the subjects. The training images were chosen manually so that all the cluster images consisted of a single action, safe or unsafe, and included little noise in the segmentation. No such restriction was placed on the test images. Some of the test clusters included both actions and were sometimes segmented poorly. The results of the activity classification are given in Table III. The fifth column, ambiguous clusters, corresponds to the number of images that were difficult to classify manually as safe or unsafe; these clusters included both actions incorrectly grouped into a single cluster.


TABLE III
EXPERIMENT II: RESULTS OF ACTIVITY CLASSIFICATION. CLASSIFICATION WAS PERFORMED ON THE ACTIVITY CLUSTERS COMPUTED BY THE UNSUPERVISED CLUSTERING METHOD. AMBIGUOUS IMAGES WERE THOSE THAT WERE MISCLASSIFIED BY MANUAL INSPECTION.

Subject   Correct   Incorrect   Unknown   Ambiguous
1         37        9           1         0
2         29        15          5         7
3         27        17          3         7
4         41        20          3         0
5         19        12          1         14
6         19        13          1         14

The results of activity clustering for all the subjects, using the view used in the classification experiment, namely the back view, are shown in Fig. 10. Fig. 10 shows the confusion scores as a function of the standard deviation cutoff and the Bhattacharya coefficient used for the clustering. As can be seen, the clustering performance is in general directly proportional to the Bhattacharya coefficient and inversely proportional to the standard deviation cutoff.


Fig. 10. Results of activity clustering. The plots depict the change in the confusion scores with different values of the Bhattacharya coefficient (BHAT) and the standard deviation cutoff (STD) used for grouping the consecutive image frames in activity clusters.

Fig. 11 shows the mean variation of the confusion scores across all subjects with varying Bhattacharya coefficients and standard deviation cutoffs. Fig. 12 shows the mean and standard deviation of the confusion scores for all the subjects for different values of the Bhattacharya coefficient at fixed standard deviation cutoffs. As can be seen, the mean standard deviation of the confusion scores is lowest for the Bhattacharya coefficient value of 0.90. However, the number of clusters produced is directly proportional to the Bhattacharya coefficient, as depicted in Fig. 14. Furthermore, larger values of the Bhattacharya coefficient also increase the number of singleton clusters produced, as shown in Fig. 15. Hence, in order to minimize the number of clusters, particularly the singleton clusters, while also maintaining a reasonably small confusion score, we chose a value of 0.85 for the Bhattacharya coefficient.


Fig. 11. Mean confusion scores across all the subjects with varying Bhattacharya coefficients (BHAT) and standard deviation cutoffs (STD).


Fig. 12. Mean and standard deviation of the confusion scores across all subjects for different values of the Bhattacharya coefficients (BHAT) for fixed standard deviation cutoffs (STD).

E. Discussion of Results

As mentioned in the results, the performance of the clustering algorithm is directly proportional to the Bhattacharya coefficient and inversely proportional to the standard deviation cutoff. In other words, the confusion from grouping unlike activities into a single cluster decreases when increasing the Bhattacharya coefficient as well as when reducing the allowed standard deviation of the clusters. While increasing the Bhattacharya coefficient and reducing the standard deviation cutoff increases the number of clusters, the number of clusters computed for each subject for a 5-minute-long video segment was still below 100. Fig. 14 shows the number of clusters extracted for each subject with varying Bhattacharya coefficient values of {0.75, 0.80, 0.85, 0.90} and standard deviation cutoffs of {1.0, 1.25, 1.50, 1.75, 2.0}. Fig. 15 shows the number of singleton clusters, which are basically activity history images with a history of one frame. Thus, the chosen values of the Bhattacharya coefficient and the standard deviation cutoff not only reduce the confusion but also keep the number of singleton clusters small.

Fig. 13. Bad segmentation and self-occlusions can affect the accuracy of classification. Poor segmentation results typically from parts of the background sharing a similar color to the skin-tone regions. (a) Bad skin-tone segmentation. (b) Pose ambiguity due to self-occlusion.


Fig. 14. Number of clusters extracted for each subject using Bhattacharya coefficients (BHAT) {0.75, 0.80, 0.85, 0.90} and standard deviation cutoffs (STD) of {1.0, 1.25, 1.50, 1.75, 2.0}.

The performance of the activity classifier is affected by the number of images used for training, the noise in the segmentation of the individual images, and ambiguous subject poses. Noisy segmentation adversely affects both the clustering and the activity classification. Noise results from illumination changes as well as from distractions caused by the subject's clothing. Fig. 13 is an example of a poorly segmented image resulting in poor classification. In the second experiment, where only a small portion of the subject's hand is visible in addition to the face, segmentation errors severely affect the accuracy of classification.


Fig. 15. Number of singleton clusters obtained for each subject with varying Bhattacharya coefficients (BHAT) {0.75, 0.80, 0.85, 0.90} and fixed standard deviation cutoffs (STD) of {1.0, 1.25, 1.50, 1.75, 2.0}.

Ambiguous postures, such as a subject leaning close to the window, as well as self-occlusions, make the classification more challenging. For instance, a subject whose head was constantly in motion within a cluster resulted in misclassification. In the case of the classifier that uses the results of the clustering algorithm, the accuracy is also affected by the accuracy of the clustering.

VI. FUTURE WORK AND CONCLUSIONS

This work introduced a novel method for driver activity classification by combining an unsupervised clustering algorithm and a supervised eigenimage-based classification algorithm. We have shown that unsupervised clustering can greatly reduce a video sequence to a much smaller set of activity history images. We have also shown that the eigenimage classifier used can distinguish between safe and unsafe driving frames. By combining these two approaches, the best aspects of each were retained, allowing increased throughput due to clustering and accurate recognition due to eigenimage-based classification. Such a system can be useful for increasing driver safety on the road. Potential areas of future work include expanding the eigenimage classification method to identify more than two classes of activities. Instead of just safe or unsafe, a more useful classification would categorize the subject's activities in finer detail, for instance treating talking on the cell phone as different from eating. It may also be possible to use other techniques or sensors, such as gaze detection, and incorporate these inputs into the activity model. It may be possible to improve the results of the clustering method if features other than skin tone can be found that more accurately keep track of head and hand locations in the image. A limitation of the classifier occurs due to the significant overlap in the action spaces for the safe and unsafe classes. Another classifier, such as a support vector machine, that might be able to transform the images into distinct sub-spaces would improve the classification performance. This is a possible direction for future work. In conclusion, this work presented a two-class action classification method for classifying driver activities. The clustering method in the first step speeds up the algorithm by reducing the number of images that need to be classified, and also summarizes the actions in a video into a small set of sample images. The eigenimage classifier is easily scalable to different subjects as long as the subjects do not deviate from typical poses for the safe and unsafe actions.

VII. ACKNOWLEDGMENTS

This work was supported by the National Science Foundation through grant #IIS-0219863, the ITS Institute at the University of Minnesota, and the Minnesota Department of Transportation. The authors would also like to thank the subjects for participating in the experiments.

REFERENCES

Baker, S., Matthews, I., Xiao, J., Gross, R., Kanade, T. and Ishikawa, T. (2004), 'Real-time non-rigid driver head tracking for driver mental state estimation', 11th World Congress on Intelligent Transportation Systems.
Baluja, S. and Pomerleau, D. (1994), Non-intrusive gaze tracking using artificial neural networks, Technical Report CMU-CS-94-102, Carnegie Mellon University.
Bartlett, M. S., Movellan, J. and Sejnowski, T. (2002), 'Face recognition by independent component analysis', IEEE Trans. on Neural Networks 13(6), 1450–1464.
Belhumeur, P., Hespanha, J. and Kriegman, D. (1997), 'Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection', IEEE Trans. Pattern Analysis and Machine Intelligence 19(7), 711–720.
Belkin, M. and Niyogi, P. (2003), 'Laplacian eigenmaps for dimensionality reduction and data representation', Neural Computation 15(6), 1373–1396.
Bobick, A. and Davis, J. (2001), 'The representation and recognition of action using temporal templates', IEEE Trans. Pattern Analysis and Machine Intelligence 23(3), 257–267.
Fletcher, L., Loy, G., Barnes, N. and Zelinsky, A. (2005), 'Correlating driver gaze with the road scene for driver assistance systems', Robotics and Autonomous Systems 52, 71–84.
Gao, J., Collins, R., Hauptmann, A. and Wactlar, H. (2004), 'Articulated motion modeling for activity analysis', IEEE Workshop on Articulated and Nonrigid Motion.
Haritaoglu, I., Harwood, D. and Davis, L. (2000), 'W4: Real-time surveillance of people and their activities', IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 809–830.
Ivanov, Y. and Bobick, A. (2000), 'Recognition of visual activities and interactions by stochastic parsing', IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 852–872.
Ji, Q., Zhu, Z. and Lan, P. (2004), 'Real-time nonintrusive monitoring and prediction of driver fatigue', IEEE Trans. Vehicular Technology 53(4), 1052–1068.
Jia, Q. and Yang, X. (2002), 'Real-time eye, gaze, and face position tracking for monitoring driver vigilance', Real-Time Imaging 8(5), 357–377.
Jiangwei, C., Linsheng, J., Lie, G., Keyou, G. and Rongben, W. (2004), 'Driver's eye state detecting method based on eye geometry feature'.
Loy, G. and Zelinsky, A. (2003), 'Fast radial symmetry for detecting points of interest', IEEE Trans. Pattern Analysis and Machine Intelligence 25(8), 959–973.
Moghaddam, B., Jebara, T. and Pentland, A. (2000), 'Bayesian face recognition', Pattern Recognition 33(11), 1771–1782.
Oliver, N. and Pentland, A. (2000), 'Graphical models for driver behavior recognition in a smartcar', Proc. IEEE Conf. on Intelligent Vehicles, pp. 7–12.
Roweis, S. and Saul, L. (2000), 'Nonlinear dimensionality reduction by locally linear embedding', Science 290, 2323–2326.


Salvucci, D., Boer, E. and Liu, A. (2001), 'Toward an integrated model of driver behavior in a cognitive architecture', Transportation Research Record 1779, 9–16.
Smith, P., Shah, M. and da Vitoria Lobo, N. (2003), 'Determining driver visual attention with one camera', IEEE Trans. Intelligent Transportation Systems 4(4), 205–218.
Song, Y., Goncalves, L. and Perona, P. (2001), 'Learning probabilistic structure for human motion detection', Proc. IEEE Conf. on Computer Vision and Pattern Recognition 2, 771–777.
Turk, M. and Pentland, A. (1991), 'Eigenfaces for recognition', Journal of Cognitive Neuroscience 3(1), 71–86.
Vezhnevets, V., Sazonov, V. and Andreeva, A. (2003), 'A survey on pixel-based skin color detection techniques', Proceedings Graphicon, pp. 85–92.
Wahlstrom, E., Masoud, O. and Papanikolopoulos, N. (2003), 'Vision-based methods for driver monitoring', IEEE Intelligent Transportation Systems Conf. 2, 903–908.
Wren, C., Azarbayejani, A., Darrell, T. and Pentland, A. (1997), 'Pfinder: Real-time tracking of the human body', IEEE Trans. Pattern Analysis and Machine Intelligence 19(7), 780–785.
Zhu, Y. and Fujimura, K. (2004), 'Head pose estimation for driver monitoring', IEEE Intelligent Vehicles Symposium, pp. 501–506.
Zhu, Y., Fujimura, K. and Ji, Q. (2002), 'Real-time eye detection and tracking under various light conditions', Proceedings Symposium on Eye Tracking Research and Applications, pp. 134–144.
