Occupant Classification Using Range Images

Pandu RangaRao Devarakota∗+, Student Member, IEEE, Marta Castillo-Franco+, Romuald Ginhoux+, Bruno Mirbach+, and Björn Ottersten∗, Fellow, IEEE

∗ The authors are with the Signal Processing group of the School of Electrical Engineering, Royal Institute of Technology (KTH), SE-100 44 Stockholm, Sweden (e-mail: [email protected], [email protected]).
+ The authors are with IEE S.A., ZAE Weiergewan, 11, rue Edmond Reuter, L-5326 Contern, Luxembourg (e-mail: [email protected], [email protected], [email protected]).
Abstract—Static occupant classification is an important requirement in the design of so-called "smart airbags". Systems for this purpose can be based either on pressure sensors or on vision sensors. Vision-based systems are advantageous over pressure-sensor-based systems, as they can provide additional functionality such as dynamic occupant position analysis or child seat orientation detection. The focus of this paper is to evaluate and analyze static occupant classification using a low-resolution range sensor based on the time-of-flight principle. This range sensor is advantageous since it directly provides a dense range image, independent of the ambient illumination conditions and object textures. Herein, the realization of an occupant classification system using a novel low-resolution range image sensor is described, methods for extracting robust features from the range images are investigated, and different classification methods are evaluated for classifying occupants. A Bayes quadratic classifier, a Gaussian mixture model (GMM) classifier, and a polynomial classifier are compared to a clustering-based linear regression classifier using a polynomial kernel. The latter shows improved results compared to the first three classification methods. Full-scale tests have been conducted on a wide range of realistic situations with different adults and child seats in various postures and positions. The results prove the feasibility of low-resolution range images for the current application.

Index Terms—real-time vision, time-of-flight principle, range imaging, 3-D object classification, clustering, polynomial classification.
I. INTRODUCTION
Most new cars have air bags for front-seat passengers, since they limit head and chest injuries of passengers in the case of a crash. A US report published in 2004 shows that throughout 1987-2003 airbags saved approximately 16,905 lives [3]. There are, however, instances where occupants are severely injured not by a frontal crash but by the deployment of the airbag itself. For instance, when an airbag deploys against a rear-facing infant seat, it can cause severe or even fatal injuries to the infant. The same holds if a passenger is too close to the airbag while it is inflating. According to the American National Highway Traffic Safety Administration (NHTSA), since 1990, 227 deaths have been attributed to airbags deployed in low-speed crashes, including 119 children between the ages of 1 and 11, and 22 infants [3]. According to FMVSS208, a safety standard introduced by NHTSA that will be fully effective in 2006, current airbag technology requires an occupant classification system that deactivates an airbag in the case of an infant seat. The same should hold for a non-occupied seat, in order to reduce repair costs. In addition to these safety requirements, it would be beneficial if the system could determine the orientation of an infant seat and the position of an adult, as this could allow the adjustment of the airbag deployment to both the type and position of the occupancy. Based on these requirements, the problem studied in this paper is formulated as follows: classification of an occupant on the passenger seat into one of the following classes (see Fig. 1):
1. Empty seat
2. Rearward facing infant seat (RFIS)
3. Forward facing child seat (FFCS)
4. Adult (P)
The main challenges for such an occupant classification system are:
1) To handle large variations in the scenes with high reliability,
2) To cope with changing environmental conditions such as light, temperature, and humidity,
3) To keep costs low, which also means that the numerical methods that are applied should be computationally inexpensive.
While item 2) is mainly an issue for the system hardware, items 1) and 3) are challenges for the image classification methods applied.
Sensor technology for this application can be based on pressure sensors (see for example the IEE OC® sensor [2]) or on vision systems, which use a video, stereo, or range camera system. The latter systems have the advantage that the position of the occupant and the orientation of the child seat can also be detected. However, 2-D camera systems are not robust against changing light conditions and are not suited for dynamic occupant position detection. Herein, to overcome these drawbacks, a 3-D camera that acquires a range image using the time-of-flight principle is used.
Different sensor systems have previously been investigated for the current application. In [26], a sensing system that determines whether an airbag deployment is safe or unsafe for the occupant was studied. In [12], an occupant sensing system for detecting the presence of an occupant on the driver seat and on the passenger seat was investigated. In both papers a stereo camera was used as the sensor. A good performance over a large variety of scenes was reported; however, the problem of classification was not considered. In [20], an occupant classification system using a stereo camera was proposed, but rather poor performance was reported (70%). In [22], both 2-D and 2 1/2-D range data were used for the classification of an occupant inside the vehicle. A three-class problem was treated and a good classification accuracy was reported; however, a detailed description of the occupant recordings was not provided.
In [13], the authors treated a four-class occupant classification problem with a single gray-scale camera and a digital signal processor that can perform this function in "real time". About 21,000 real-world images with a wide variety of passengers in five different vehicles under moderate lighting conditions were used for the classification task. A performance rate of 95% was reported. However, their database did not contain many complex situations, such as persons or infant seats covered by a blanket, although this is indeed a requirement of [3]. To conclude, in most papers the sensor technology was based either on a stereo camera or simply on an intensity camera. Such systems can be cost effective but need more processing time. Only two studies in which a time-of-flight sensor was proposed for an occupant classification system have been reported. In [15], a so-called PMD camera was proposed. The results presented were, however, obtained by emulating the PMD camera with a non-miniaturized prototype installed in a vehicle mock-up. The classification method used was a quadratic polynomial classifier using PCA features extracted from the raw image, and it showed promising results of more than 95% correct classification. In [8], a camera working with a modulated illumination similar to the camera of the current paper was used for occupant classification. However, the camera was not installed in a vehicle; a rather unrealistic laboratory setup was used instead. In [30], another technology based on a thermal long-wavelength infrared camera was evaluated, but the focus was mainly on dynamic occupant posture analysis.
The 3-D imaging technology used in this paper has been developed by CSEM [1]. It uses continuously modulated near-infrared light. Since this technology uses an active near-infrared illumination, it is independent of the ambient illumination conditions and object textures. However, depending on the remission coefficient of the objects under surveillance, noise is present in the range data. Based on this technology, IEE S.A. [2] is presently developing a camera for the application of occupant classification. A prototype of this camera has been the basis for this study; the camera technology is described in more detail in Section II.
The contribution of this paper is the following:
1) Description of a novel low-resolution 3-D range camera for the application of occupant classification.
2) Investigation of geometric feature extraction from range image sequences for the classification task.
3) Comparison of different feature selection algorithms (for selecting a subset of features) to optimize the classification performance.
4) Evaluation and comparison of different classification methods.
The first important task in any classification algorithm is to extract from the raw data the information relevant for the classification task while removing redundant or irrelevant information. The goal is to obtain a compact representation of this information in terms of so-called feature vectors, which must be robust against noise. Several techniques for feature extraction in 3-D range images have been proposed (see for example [4] and [5]). However, these techniques mainly focus on machine vision applications and must be refined and adapted to the current application and to the specifics of the 3-D camera.
Specifically, the low resolution of the 3-D camera and the noisy range measurements must be taken into account, which is particularly challenging. In this context, geometric features are preferred, as they can be defined to be invariant under translation and rotation and have good robustness against noise. For the classification task, two kinds of classification techniques are considered: The first is based on Bayesian decision theory, in which the probability structure underlying the classes is assumed to be known. The second is a discriminant function classifier, where the goal is to construct a decision boundary between the classes [11]. For the former classifier model, two different classifiers are considered in the evaluation: a Bayes quadratic classifier (BQ), which assumes that the data of each class are normally distributed, and a Gaussian mixture model (GMM) classifier, where the class densities are estimated using a mixture of Gaussians. For the latter classifier model, a polynomial-based regression classifier (PC) and a clustering-based linear classifier using a polynomial kernel are considered.
Fig. 1. Possible occupancy of the passenger seat: (from left to right) Empty seat, Rearward Facing Infant Seat (RFIS), Forward Facing Child Seat (FFCS), and Adult (P).
The remainder of this paper is structured as follows: Section II explains the data acquisition with a brief description of the camera technology employed in the study. Pre-processing of the images is described in Section III, and in Section IV feature extraction methods are presented. The classification methods are presented in Section V. A comparison of the results of all classifiers evaluated in this paper is presented in Section VI, and in Section VII discussion and conclusions are reported.

II. DATA ACQUISITION AND SENSOR PRINCIPLE

In this section, different range sensing techniques which have been used in the past to capture the 3-D shape of an object are explained. Then the details of the current 3-D range sensor are introduced. A characteristic of this camera is that it provides a dense range image of a scene (not only at edge or textured pixels, as in the case of standard stereo vision), but it has only a low spatial resolution.
Range sensing systems can be based on either active or passive approaches [4]: In the passive approach, the 3-D shape of a scene is extracted using the light already existing in the environment, for example by measuring the visible light already present in the scene. The stereo technique with intensity cameras, where the range is computed by triangulation between the locations of matching pixels in images, is an example of a passive approach [7].
In the active approach, by contrast, the 3-D information is extracted by projecting an external light source, for instance infrared light or a laser, onto the scene of interest, and the shape of the scene is captured by analyzing the reflected light. Active range sensing can further be divided into two main classes, based either on the principle of triangulation or on the time-of-flight principle. In active triangulation, a scanning stripe of light (a laser, for example) is projected onto the scene and the depth of a point on the object is calculated using a set of calibration parameters [17]. Though triangulation-based range sensors have proved efficient for different applications, their main drawbacks are: a resolution that depends on the baseline of the system, the frame rate of the sensor, and shadow effects (a good survey of various active sensors for 3-D sensing can be found in [4], [17]). In the case of active sensors based on the time-of-flight principle, the scene is broadly illuminated with a near-infrared illumination, and thus the frame rate of the system is generally better. The approach advocated here is based on active sensing using this principle. Further details about the camera are described below.
First, the scene is broadly illuminated by a modulated infrared light beam; the modulated beam reflected by an object is then detected by the receiver. The amplitude of the emitted light is altered due to the material reflectivity and the distance to the receiver [23]. Due to the time of flight of the light to and from the target, the detected beam has a phase shift compared to the phase of the modulation signal of the illumination. This phase difference ∆φ can be calculated by sampling at four temporal positions (see Fig. 2) and is given in (1). The measured phase delay can be directly converted into the distance between the target and the camera [33], as shown in (2):

\Delta\phi = \arctan\left[ \frac{a_3 - a_1}{a_2 - a_0} \right]   (1)

L = L_{\max} \, \frac{\Delta\phi}{2\pi}   (2)
Fig. 2. Detected light intensity as a function of time. The sinusoidal modulation (top curve) of the illumination causes a periodically modulated signal in the receiver (lower area). The phase offset ∆φ between the emitted and received light can be computed by evaluating the signal amplitudes a0, . . . , a3 at four different temporal positions t0, . . . , t3.
Fig. 3. Test vehicle equipped with a prototype camera mounted in the overhead console; the camera is able to survey the passenger compartment of the vehicle.
The parameter Lmax is called the non-ambiguity range of the sensor (Lmax = c/2fmod, fmod being the modulation frequency of the illumination). By performing this phase measurement independently but in parallel for each pixel of the image, a range image (also called a distance image) is acquired; that is, a 2-D image where each pixel value represents the radial distance from the camera center to the point in the scene viewed by the corresponding pixel. Note that if the amplitude of the received light is below a certain threshold, the distance in (2) is not computed and the corresponding pixel is marked black. The camera has 52 × 50 pixels, where each pixel corresponds to a 3 cm × 2 cm lateral resolution at the average distance of an occupant from the camera. The radial resolution is around 1.0 cm. See [23] for more details on the camera. The camera has been installed in the overhead module of a car as shown in Fig. 3. A field of view of 120◦ × 90◦ allows a complete survey of the passenger seat.
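To make the phase measurement of (1) and (2) concrete, the following minimal sketch computes a per-pixel distance map from the four temporal samples. It is an illustration only: the function name, the modulation frequency of 20 MHz, and the amplitude threshold are assumptions, not values taken from the camera specification; arctan2 is used instead of the plain arctangent of (1) so that the phase covers the full non-ambiguity range.

```python
import numpy as np

C_LIGHT = 3.0e8    # speed of light [m/s]
F_MOD = 20.0e6     # modulation frequency [Hz]; assumed, not stated in the paper
L_MAX = C_LIGHT / (2.0 * F_MOD)   # non-ambiguity range Lmax = c / (2 f_mod)

def phase_to_distance(a0, a1, a2, a3, amp_threshold=1.0):
    """Per-pixel distance from the four temporal samples a0..a3 of Fig. 2.

    a0..a3 are arrays of sampled signal amplitudes at t0..t3. Pixels whose
    modulation amplitude falls below amp_threshold are marked invalid (0),
    mirroring the black pixels described in the text.
    """
    delta_phi = np.mod(np.arctan2(a3 - a1, a2 - a0), 2.0 * np.pi)  # eq. (1), folded to [0, 2*pi)
    distance = L_MAX * delta_phi / (2.0 * np.pi)                   # eq. (2)
    amplitude = 0.5 * np.sqrt((a3 - a1) ** 2 + (a2 - a0) ** 2)     # modulation amplitude
    return np.where(amplitude >= amp_threshold, distance, 0.0)
```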
III. PREPROCESSING

Prior to the feature extraction, the image is preprocessed. This involves a distance clipping of the range images: the range measurement at each pixel location is compared with a pre-defined reference distance and set to background if it is larger. Any information regarding the background (or objects outside the car) is thereby removed. Consequently, this method reduces the number of pixels to be considered for further processing by discarding background regions in the field of view; i.e., a binary mask can be generated where all background pixels are set to 0 and non-background pixels to 1. A 2-D median filter is applied to the foreground pixels in order to reduce the range measurement noise present in the image. Fig. 4 shows an example of a distance image in false-color representation, and Fig. 5 shows an intensity image of the same scene taken with a high-resolution 2-D camera to provide a reference. Note that for the dark brown (black) pixels in Fig. 4, the corresponding distance values are not computed, as explained in Section II. Fig. 6 shows the same image after the preprocessing is applied.
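A minimal sketch of this preprocessing step is given below. The per-pixel reference distances and the 3 × 3 filter window are assumptions for illustration; the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess(range_img, reference_dist):
    """Distance clipping followed by 2-D median filtering.

    range_img      : 52x50 array of radial distances (0 marks invalid pixels)
    reference_dist : per-pixel reference distances separating the seat region
                     from the background (assumed to be given)
    """
    mask = (range_img > 0) & (range_img < reference_dist)   # binary foreground mask
    clipped = np.where(mask, range_img, 0.0)                # background pixels -> 0
    smoothed = median_filter(clipped, size=3)               # reduce range noise
    return np.where(mask, smoothed, 0.0), mask
```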
Fig. 4. An example of a color-coded raw distance image before preprocessing: blue (dark gray) represents the points closest to the camera and red (light gray) the points farthest from the camera.
Fig. 5. The same scene as in Fig. 4, taken with a high-resolution intensity camera.
In the second step of the preprocessing, the distance image is transformed to Cartesian coordinates in the vehicle coordinate system. The transformation from the range image to Cartesian coordinates, where a triple (x, y, z) is assigned to each pixel, is realized as follows:
• On system initialization, a set of unit vectors is calculated from the internal camera parameters. Each pixel is associated with a unit vector pointing in the direction from which it receives light. These unit vectors are kept in memory during the runtime of the system.
• The transformation to vehicle coordinates is achieved by multiplying the range image point coordinates with the unit vectors and by applying the constant translation and rotation between the camera and the origin of the vehicle coordinate system (external camera parameters).
The internal and external parameters of the camera are obtained by calibrating the camera using a MATLAB® toolbox that has been developed for this purpose.
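The two steps above can be summarized in a few lines. This is a sketch under assumed conventions: the array layouts and the names R, t for the external rotation and translation are illustrative, not taken from the calibration toolbox.

```python
import numpy as np

def range_to_vehicle_coords(range_img, unit_vectors, R, t):
    """Map a range image to (x, y, z) points in the vehicle coordinate system.

    unit_vectors : HxWx3 per-pixel viewing directions from the internal
                   calibration, precomputed once at system initialization
    R, t         : 3x3 rotation and length-3 translation from the camera frame
                   to the vehicle frame (external calibration)
    """
    pts_cam = range_img[..., None] * unit_vectors   # radial distance along each ray
    return pts_cam @ R.T + t                        # rigid transform; HxWx3 (x, y, z)
```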
Fig. 6. The same distance image as in Fig. 4, after the preprocessing is applied.
IV. FEATURE COMPUTATION

In 3-D object recognition, features extracted from surfaces, lines, and points that describe the geometry of a 3-D object are commonly used. For this purpose, a surface representation or a volumetric representation of the objects can be used [4]. At first glance, one might consider local invariant surface properties such as surface curvature and surface normals for the classification (refer to [19] for instance). The signs of the extracted surface curvature and surface normals differ between the objects contained in an image, and thus these properties can also be used for segmentation [5]. However, the calculation of curvature involves the computation of second-order derivatives, magnifying the effects of any noise present. It therefore requires a smoothing prior to the feature extraction, which is not feasible for low-resolution images. Furthermore, features extracted from a range image can vary with the position of the object with respect to the camera. Effective features should be invariant under the relative position and should thus be derived from the Cartesian coordinates rather than from the range image. The following section explains the methodology of feature extraction investigated in this paper.

A. Feature Extraction

Features can be extracted either by a projection of the image onto a lower-dimensional space or by extracting geometrical descriptors. A standard projection method that is often used for feature extraction is principal component analysis (PCA) (see for example [15], [22]). The PCA transformation aims at reducing or compressing the information contained in the whole image by removing redundant information. Though PCA often gives the best performance, the physical meaning of each feature is lost. Moreover, the features are not invariant under variation of the relative position of the object with respect to the camera. On the other hand, geometrical modeling of the object can provide geometric descriptors of the contour contained in the 3-D image that can be made invariant and have a mathematical meaning. The strategies advocated here for extracting features are as follows:
1) Features extracted by a geometrical modeling of the object.
2) Features from the coefficients of a PCA transformation applied to the preprocessed image.
The latter is considered as a benchmark against which the performance obtained with the geometric features is compared. For the feature computation we use the preprocessed range image as well as the range image data represented in the Cartesian vehicle coordinate system, since Cartesian coordinates allow the definition of feature
invariance with respect to the viewing direction. For example, while the size and distance of a person in the range image change with the person's position, the Cartesian coordinates allow the calculation of, for instance, the person's height independent of the position. Each range image is additionally segmented into two parts, a left and a right image, by a vertical plane. The reason for this segmentation is that the backrest of the vehicle seat is not eliminated in the preprocessing stage; a segmentation by a vertical plane nevertheless allows its separation from a person or object. The following features are extracted from the full image as well as from the segmented images (except for the PCA coefficients, where only the unsegmented preprocessed range image data are considered).
1) Moments of coordinates: The classification of objects in an image in a manner that is independent of scale, position, and orientation may be achieved by characterizing an object with a set of geometrical moments. Several recognition techniques have been demonstrated that utilize moments to generate invariant features [27], [28]. The benefits of moment-based features are:
• Noise present in the coordinates is reduced, due to the fact that all pixels belonging to an object are included in the calculation.
• The computational cost of moment-based features is low.
The first-order moments (the so-called center of mass, COM) and the second-order moments (MOI, moment of inertia) for a set of coordinates are given by

COM(\{x, y, z\}) = \begin{pmatrix} \langle x \rangle \\ \langle y \rangle \\ \langle z \rangle \end{pmatrix}   (3)

MOI(\{x, y, z\}) = \begin{pmatrix} \langle x^2 \rangle & \langle xy \rangle & \langle xz \rangle \\ \langle yx \rangle & \langle y^2 \rangle & \langle yz \rangle \\ \langle zx \rangle & \langle zy \rangle & \langle z^2 \rangle \end{pmatrix}   (4)

where ⟨·⟩ denotes the average over all pixels, or over the pixels of the left and right part of the image respectively, and (x, y, z) is the location of each pixel in space. These moment features are useful for defining other features. For instance, the combination of the first-order with the second (or higher) order moments allows the definition of features that describe the parameters of certain shapes fitted to the data, for example the parameters of a plane, an ellipse, or a paraboloid fitted to any coordinates or to the range data. These parameters can be calculated implicitly from scatter matrices as follows: First, the scatter matrix of any pair of coordinates can be computed using its first- and second-order moments. For instance, the scatter matrix of the (x, z) coordinates is given by

S_{xz} = \mathrm{cov}(\{x, z\}) = \begin{pmatrix} \langle x^2 \rangle - \langle x \rangle^2 & \langle xz \rangle - \langle x \rangle\langle z \rangle \\ \langle xz \rangle - \langle x \rangle\langle z \rangle & \langle z^2 \rangle - \langle z \rangle^2 \end{pmatrix}   (5)

where ⟨·⟩ again represents the average over the number of pixels. Now consider a plane fitted to the (x, z) data points, whose parameters we wish to define. One interesting feature is the slope of this plane, which can be calculated as

\mathrm{Slope}(\{x, z\}) = \frac{\langle xz \rangle - \langle x \rangle\langle z \rangle}{\langle x^2 \rangle - \langle x \rangle^2} = \frac{S_{xz}}{S_{xx}}.   (6)
Using the scatter matrices we can thus calculate implicitly the slope of a plane fitted to the (x, z) data points. In addition, the diagonalization of the above scatter matrix Sxz yields the parameters of an ellipse fitted to the (x, z) scatter plot. The lengths of the major axis (L1) and minor axis (L2) as well as the volume of the ellipse (V) are given by

L_{1/2} = \sqrt{\lambda_{1/2}(S_{xz})}, \qquad V = \frac{\pi}{4}\sqrt{\det(S_{xz})}   (7)

where λ1 and λ2 are the eigenvalues of the scatter matrix Sxz.
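As an illustration of how the features of (3)-(7) follow from simple coordinate averages, consider the sketch below. The function names and the use of NumPy are assumptions; the arithmetic is exactly the moments and scatter-matrix algebra above.

```python
import numpy as np

def com_moi(pts):
    """First-order moments (COM, eq. (3)) and second-order moment matrix (MOI, eq. (4)).

    pts : Nx3 array of (x, y, z) pixel coordinates.
    """
    com = pts.mean(axis=0)           # (<x>, <y>, <z>)
    moi = (pts.T @ pts) / len(pts)   # 3x3 matrix of <x^2>, <xy>, ...
    return com, moi

def plane_ellipse_features(x, z):
    """Slope and ellipse features from the (x, z) scatter matrix, eqs. (5)-(7)."""
    S = np.cov(np.vstack([x, z]), bias=True)        # 2x2 scatter matrix of eq. (5)
    slope = S[0, 1] / S[0, 0]                       # eq. (6)
    lam = np.linalg.eigvalsh(S)                     # eigenvalues, ascending
    L1, L2 = np.sqrt(lam[1]), np.sqrt(lam[0])       # major/minor axis lengths, eq. (7)
    V = (np.pi / 4.0) * np.sqrt(np.linalg.det(S))   # "volume" of the fitted ellipse
    return slope, L1, L2, V
```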
2) Extraction of curvature features: The calculation of curvature involves the computation of second-order derivatives, which magnifies the effect of noise and is thus not suitable for low-resolution images. To overcome this, a surface approximation with a paraboloid is used to represent the shape of the object in the scene [21]. The curvature features can then be extracted by evaluating the paraboloid surface; the second-order derivatives needed for the curvature calculation are computed directly from the coefficients of the paraboloid fit. The Gaussian curvature (K), the mean curvature (H), and the residual of the paraboloid fit are considered as features. See [5] for the properties of the Gaussian and mean curvatures. We shall now explain the computation of the curvature features. A range image can be represented as z = f(x, y), where (x, y, z) are the coordinates of a surface point with respect to the camera coordinate system [21]. A general form of a paraboloid is given by

z = f(x, y) = ax^2 + by^2 + cxy + dx + ey + f   (8)
where A = [a, b, c, d, e, f] are the parameters of the paraboloid. The coefficients of the paraboloid are obtained as the least-squares solution of an over-determined system of linear equations, using the coordinates of a set of points in the neighborhood of the point of interest:

Z = X A^T,   (9)

\underbrace{\begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}}_{Z} = \underbrace{\begin{pmatrix} x_1^2 & y_1^2 & x_1 y_1 & x_1 & y_1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n^2 & y_n^2 & x_n y_n & x_n & y_n & 1 \end{pmatrix}}_{X} \underbrace{\begin{pmatrix} a \\ \vdots \\ f \end{pmatrix}}_{A^T}   (10)

A solution of the above linear equation which minimizes the residual ⟨(Z − XA^T)^2⟩ can be found by solving for A in the equation

\langle X^T Z \rangle = \langle X^T X \rangle A^T,   (11)

where the average values are calculated from the set of coordinates over the n pixels. Note that the 6 × 6 matrix ⟨X^T X⟩ contains coordinate moments up to 4th order and the vector ⟨X^T Z⟩ contains moments up to 3rd order. Expressions for the Gaussian curvature (K) and the mean curvature (H) are given by [31]

K = \frac{z_{xx} z_{yy} - z_{xy}^2}{(1 + z_x^2 + z_y^2)^2}, \qquad 2H = \frac{(1 + z_x^2) z_{yy} - 2 z_x z_y z_{xy} + (1 + z_y^2) z_{xx}}{(1 + z_x^2 + z_y^2)^{3/2}}   (12)
where zx, zy are the first-order derivatives and zxx, zyy, zxy the second-order derivatives of the function z = f(x, y) with respect to x and y. We are, however, interested in evaluating the above curvature values at the point (x0, y0) on the paraboloid where the extreme values of the curvature occur. From (12) one can see that the curvatures H and K are maximal where zx and zy both equal zero. The maximal mean and Gaussian curvatures are therefore given by

K = (z_{xx} z_{yy} - z_{xy}^2)\big|_{(x_0, y_0)} = 4ab - c^2, \qquad 2H = (z_{yy} + z_{xx})\big|_{(x_0, y_0)} = 2(a + b).   (13)

Thus, in the end, these curvature features can be computed from the coefficients of the paraboloid, which are in turn obtained from coordinate moments of order up to 4. One realizes that the Gaussian and mean curvature are nothing else than the determinant and the trace of the Hessian matrix

H_s = \begin{pmatrix} z_{xx} & z_{xy} \\ z_{yx} & z_{yy} \end{pmatrix}.   (14)

It is thus evident that these features are invariant under any orthogonal transformation of the (x, y)-coordinates.
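The following sketch ties (8)-(13) together: a least-squares paraboloid fit followed by the curvature evaluation at the extremum. The names are illustrative; note that K and 2H are expressed directly in the fit coefficients a, b, c as derived above.

```python
import numpy as np

def curvature_features(x, y, z):
    """Fit the paraboloid of eq. (8) by least squares (eqs. (9)-(11)) and return
    the Gaussian curvature K, twice the mean curvature 2H (eq. (13)), and the
    residual of the fit."""
    X = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])  # design matrix of eq. (10)
    A, *_ = np.linalg.lstsq(X, z, rcond=None)   # normal equations <X^T X> A^T = <X^T Z>, eq. (11)
    a, b, c = A[:3]
    K = 4.0 * a * b - c**2                      # z_xx z_yy - z_xy^2 at the extremum
    H2 = 2.0 * (a + b)                          # z_xx + z_yy at the extremum
    residual = float(np.mean((z - X @ A) ** 2))
    return K, H2, residual
```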
The first 6 features listed in Table I are calculated from the full image as well as from the left and right segmented parts of each image. The last 3 features, which come from the paraboloid fit, are calculated only from the images obtained after the segmentation. Features from the preprocessed range image are also calculated, such as the number of pixels and the first- and second-order moments of the distance. In total, 43 features are calculated, and a feature subset selection is then used to select a subset of useful features for the classification task (see Section IV-C).

TABLE I
FEATURES CALCULATED FROM COORDINATE MOMENTS UP TO FOURTH ORDER

1. COM (3)
2. MOI (4)
3. Slope of the plane (6)
4. Length of the major axis of the fitted ellipse (7)
5. Length of the minor axis of the fitted ellipse (7)
6. Volume of the fitted ellipse (7)
7. Mean curvature (12)
8. Gaussian curvature (12)
9. Residual of the paraboloid fit
B. Principal Component Analysis (Eigenimages)

PCA features have been used for a similar application with a nearest-neighbor technique in [22] and with a polynomial quadratic classifier in [15]. Since the performance reported with PCA features was promising, the performance obtained using the geometric features described above is compared with the performance obtained using PCA features. Assume that each range image in the training data is represented as a d-dimensional vector pi. Let P = [p1, p2, . . . , pN] be a d × N matrix where each column represents one image, with mean vector M and covariance matrix C. Let A be a matrix whose rows are formed by the eigenvectors of C, ordered so that the first row of A is the eigenvector corresponding to the largest eigenvalue and the last row is the eigenvector corresponding to the smallest eigenvalue. Then the PCA transformation is defined as

Y = A (P - M)   (15)

where Y = [y1, y2, . . . , yN] are the so-called principal components. For classification, the first few principal components, corresponding to the largest eigenvalues, are taken into account. More details about principal component analysis can be found in [16] and [25].
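A minimal sketch of (15), assuming vectorized 52 × 50 images stacked as columns; the function name and the choice of 17 components (matching the size of the geometric feature subset) are illustrative.

```python
import numpy as np

def pca_features(P, n_components=17):
    """Project images onto the leading eigenvectors of their covariance, eq. (15).

    P : d x N matrix whose columns are vectorized preprocessed range images.
    """
    M = P.mean(axis=1, keepdims=True)          # mean image
    C = np.cov(P)                              # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :n_components].T   # rows = leading eigenvectors
    return A @ (P - M)                         # Y = A (P - M)
```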
C. Feature subset selection

For the geometric features it is not a priori evident which ones are best suited for the classification task. It is moreover important to reduce the number of features as far as possible, since a larger number of features requires an increasing amount of training data to avoid so-called overfitting [11]. Moreover, with too many parameters, the complexity of the classifier increases. Thus, a feature selection method has to be applied to reduce the total number of geometric features for the classification. Among the various feature selection algorithms, a sequential forward feature subset selection with the Mahalanobis class separability as criterion [24] was chosen, since it selects features independently of the classification error. In this way we can ensure that the same feature subset is used to evaluate the different classification methods. A subset of 17 out of the 43 features was selected in this manner.
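The sketch below illustrates sequential forward selection with a Mahalanobis-type separability criterion. The exact criterion of [24] may differ; here, as an assumption, the average pairwise Mahalanobis distance between class means under a pooled within-class covariance is used.

```python
import numpy as np
from itertools import combinations

def separability(X, y):
    """Average pairwise Mahalanobis distance between class means."""
    classes = np.unique(y)
    Sw = sum(np.atleast_2d(np.cov(X[y == c].T, bias=True)) for c in classes) / len(classes)
    Swi = np.linalg.inv(Sw + 1e-6 * np.eye(X.shape[1]))   # regularized pooled covariance
    means = [X[y == c].mean(axis=0) for c in classes]
    return np.mean([(m - n) @ Swi @ (m - n) for m, n in combinations(means, 2)])

def forward_select(X, y, k=17):
    """Greedily add the feature that most increases class separability."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: separability(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```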
V. CLASSIFICATION TASK

In this section, the different classification methods evaluated in this study are presented. The purpose of this paper is not to propose new classification methods, but to investigate different classification methods which are suitable for the complexity of the application and which exploit the advantage of the 3-D information available directly from the sensor. The methods are well established in the literature, and only a brief description is given. Two classification techniques are considered: classification based on Bayesian decision theory and classification based on discriminant functions [11]. For the former classifier model, first
a Bayes quadratic classifier is considered, where the density of the data in each class is approximated by one multivariate Gaussian function in the feature space. Then, a GMM classifier is considered, where the density of each class is approximated by a mixture of several Gaussians; an Expectation-Maximization (EM) algorithm is used to estimate the parameters of the Gaussian density functions [6]. In the case of the discriminant function based classifiers, a regression classifier using a polynomial approximation (PC) is considered. Then, an unsupervised clustering of the data is performed to split the data of some of the classes into several clusters, after which the polynomial classifier technique follows; we refer to the resulting classifier as the polynomial cluster classifier (PCC). The details of the different classifiers are described next.

A. Bayes Quadratic Classifier

The structure of the Bayes quadratic classifier is determined by the conditional densities p(v|ωi) as well as by the prior probabilities P(ωi) [29]. In pattern recognition applications one rarely, if ever, has this kind of complete knowledge about the a priori probabilities P(ωi) and the class conditional densities p(v|ωi). In the present study, all classes are assumed to be equiprobable. The conditional probabilities can be estimated by assuming a multivariate normal distribution:

p(v|\omega_i) = N(v, \mu_i, \Sigma_i) = \frac{1}{\sqrt{(2\pi)^N |\Sigma_i|}} \exp\left[ -\frac{1}{2} (v - \mu_i)^T \Sigma_i^{-1} (v - \mu_i) \right]   (16)

where µi is the mean and Σi the covariance matrix of the input data v belonging to class ωi. The a posteriori probability of a pattern is then given by Bayes' formula,

P(\omega_i | v) = \frac{p(v|\omega_i) P(\omega_i)}{p(v)}   (17)

where

p(v) = \sum_{i=1}^{c} p(v|\omega_i) P(\omega_i).   (18)
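A compact sketch of the Bayes quadratic classifier of (16)-(18) follows. The class and method names are assumptions; with equal priors, maximizing the posterior (17) reduces to maximizing the class log-likelihood of (16).

```python
import numpy as np

class BayesQuadratic:
    """One multivariate Gaussian per class, equal priors (eqs. (16)-(18))."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {c: (X[y == c].mean(axis=0), np.cov(X[y == c].T))
                       for c in self.classes}
        return self

    def _log_gauss(self, X, mu, cov):
        d = X.shape[1]
        diff = X - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        return -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))   # log of eq. (16)

    def predict(self, X):
        ll = np.column_stack([self._log_gauss(X, *self.params[c]) for c in self.classes])
        return self.classes[np.argmax(ll, axis=1)]   # equal priors: max likelihood
```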
B. GMM Classifier

For the GMM classifier, the density of the data belonging to one class is approximated by a mixture of Gaussians instead of a single Gaussian as in the Bayes quadratic classifier. Thus (16) for the conditional density of each class becomes (see also [10])

p(v|\omega_i) = \sum_{n=1}^{L} w_n \, N(v, \mu_n, \Sigma_n)   (19)

where wn is the weight of each Gaussian and L is the number of Gaussians per class. The expectation-maximization (EM) algorithm is used to estimate the parameters of the mixture components; a good tutorial on learning GMMs can be found in [14]. Note that (17) for estimating the a posteriori probability and the Bayes rule (18) for deciding a class label remain unchanged in the case of the GMM classifier.
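A sketch of the GMM classifier, with EM training delegated to scikit-learn as an assumed implementation choice (the paper uses its own EM with K-means initialization). Here n_components maps each class label to its number of mixture components, e.g. the empirically chosen [1 1 2 4] of Section VI.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One Gaussian mixture per class, eq. (19); equal class priors assumed."""

    def __init__(self, n_components):
        self.n_components = n_components   # dict: class label -> number of Gaussians

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.models = {c: GaussianMixture(n_components=self.n_components[c],
                                          covariance_type='full',
                                          init_params='kmeans').fit(X[y == c])
                       for c in self.classes}
        return self

    def predict(self, X):
        # With equal priors, the Bayes rule (17)-(18) picks the class whose
        # mixture assigns the highest log-likelihood to the sample.
        ll = np.column_stack([self.models[c].score_samples(X) for c in self.classes])
        return self.classes[np.argmax(ll, axis=1)]
```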
C. Polynomial Classifier

Classifiers based on polynomial regression are well-known techniques [29], [9]. The advantage of this approach is that it makes no assumptions about the underlying statistical distributions and leads, at least when using the least mean-square error optimization criterion, to a closed-form solution of the optimization problem without iterations. Here, the discriminant function is given by

d(v) = A^T x(v),   (20)

where A is a coefficient matrix which is optimized on a set of training data, and x(v) is a polynomial expansion of the input feature vector. For example, for two features v = [v1, v2]^T and second order, the expansion is given by

x(v) = [1 \;\; v_1 \;\; v_2 \;\; v_1^2 \;\; v_1 v_2 \;\; v_2^2]^T.   (21)

The discriminant function has as many components as there are classes to be discriminated. Finally, the decision is based on the maximal discriminant value,

\text{Best match} = \arg\max_i \, (d_i(v)).   (22)

The optimal coefficient matrix A is found as the least-squares solution to (20):

A = E\{x x^T\}^{-1} E\{x y^T\}.   (23)
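A sketch of (20)-(23) in batch form: the expectations in (23) become a least-squares fit of one-hot class targets onto the polynomial expansion. The names and the one-hot encoding are implementation assumptions.

```python
import numpy as np

def poly_expand(V):
    """Second-order polynomial expansion x(v) of eq. (21), one row per sample."""
    n, d = V.shape
    cols = [np.ones(n)] + [V[:, i] for i in range(d)]
    cols += [V[:, i] * V[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack(cols)

class PolynomialClassifier:
    """Least-squares polynomial regression classifier, eqs. (20)-(23)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        Phi = poly_expand(X)
        Y = (y[:, None] == self.classes[None, :]).astype(float)  # one-hot targets
        self.A, *_ = np.linalg.lstsq(Phi, Y, rcond=None)          # eq. (23) in batch form
        return self

    def predict(self, X):
        D = poly_expand(X) @ self.A                 # discriminant values d(v), eq. (20)
        return self.classes[np.argmax(D, axis=1)]   # decision rule, eq. (22)
```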
See [29] for more details. For a multi-class classification task, there are several strategies for applying the polynomial classifier. If C > 2 is the number of classes, one can:
• Generate one classifier for all classes, i.e., the all-versus-all strategy;
• Generate C(C − 1)/2 pairwise classifiers and apply a voting scheme for the class decision;
• Generate C classifiers, each of which discriminates one class against the union of the others, the so-called one-versus-all strategy.
The latter two approaches are advantageous over the first, as the mapping polynomial is controlled locally. However, the variation of the data belonging to a class can be very large, and the approximation of a class by a single polynomial function may not be sufficient to handle this variation. In order to account for the variation of the data within each class, the following approach is considered.
D. Polynomial Cluster Classifier

First, the data belonging to a certain class is split into a number of clusters, within each of which the variation is small. This means grouping together similar data within each class and treating each group as a subclass. A simple K-means algorithm [18] is used as the clustering method to split the data of each class into a number of clusters. A polynomial regression classifier is then trained where the number of classes to be discriminated is equal to the number of clusters. The resulting classifier is referred to as the polynomial cluster classifier (PCC). This is different from the approach described in [32], where a number of prototypes is found within each class (for example, the mean of each cluster) and the nearest subclass classifier principle is then used; that classifier implements a pre-supervised scheme and labels unseen objects with the class of their nearest prototype. In the PCC approach, in contrast, each individual cluster is treated as a separate class during training. In the end, the classification decision is based on the decision rule (22), and the class label is assigned according to the class to which the winning cluster belongs. A one-versus-all strategy is the only feasible way to train this classifier.
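The PCC can be sketched by composing K-means with the PolynomialClassifier of the previous sketch; the cluster bookkeeping and the per-class cluster counts (e.g. the [1 1 1 3] of Section VI) are passed in as a mapping. scikit-learn's KMeans is an assumed substitute for the paper's own K-means implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

class PolynomialClusterClassifier:
    """Split each class into K-means clusters, train a polynomial classifier on
    the cluster labels, and map the predicted cluster back to its parent class."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters   # dict: class label -> number of clusters

    def fit(self, X, y):
        cluster_labels = np.empty(len(y), dtype=int)
        self.cluster_to_class, next_id = {}, 0
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            k = self.n_clusters[c]
            sub = (KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
                   if k > 1 else np.zeros(len(idx), dtype=int))
            cluster_labels[idx] = sub + next_id
            for s in range(k):
                self.cluster_to_class[next_id + s] = c
            next_id += k
        self.clf = PolynomialClassifier().fit(X, cluster_labels)  # sketch of Section V-C
        return self

    def predict(self, X):
        return np.array([self.cluster_to_class[p] for p in self.clf.predict(X)])
```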
VI. DATABASE AND EXPERIMENTS

In this section, a description of the database is presented first, and then the experiments conducted to examine the effectiveness of the above-described classifiers are presented. To train and test the classifiers and to perform the recognition task, a database of test sequences has been acquired. For this purpose, the camera was installed in a car (Fig. 3) equipped with a computer and data acquisition software. Capturing and processing of images is done simultaneously, whereas algorithm training and evaluation are done off-line. At present, the goal is to discriminate between four classes: Empty, rearward facing infant seat (RFIS), forward facing child seat (FFCS), and adult person (P). The database consists of 450 sequences of different sizes, with a total of 25,878 images. In order to take the variation of occupant scenes into account, different occupants with varying hand postures, leg postures, and torso gestures were recorded. Possible variations in position were also introduced for the RFIS and FFCS classes, e.g., the type, height, and orientation of the child seats. For the recordings, child seats with infants and persons with a large variation in size, ranging from a 5th-percentile female to a 95th-percentile male, were considered. These occupant types and their positions are defined in accordance with the new airbag safety standard released by NHTSA [3], but the database contains a much larger variation in person postures than the legal requirements. Table II shows the number of sequences in each class and the size of the individual class data. Note that a convertible seat can be used as a rear-facing infant seat as well as a forward-facing child seat.
The goals of the experiments are as follows: First, the classification performance of all classifiers mentioned in Section V is compared. For this purpose, only the geometrical features described in Section IV are used. The feature set is reduced from 43 to 17 features using the feature subset selection algorithm described in Section IV-C. Once the features are selected, the classifiers of Section V are evaluated with the same selected feature set. The optimized performance is then compared with the performance of a classifier using only PCA features as a benchmark.
A. Validation criterion

Cross-validation, the leave-one-out method, and re-substitution are standard procedures for the validation of a classifier [11]. The cross-validation procedure is used here for the evaluation of all classifiers. Training and validation datasets are selected from the dataset randomly: two-thirds of the data is used for training and one-third for validation, so there is no overlap between the training and testing data. For all methods, the same training and testing datasets are used so that a fair comparison can be made between them. To ensure the independence of the overall performance from a particular split, 100 different sets are selected out of the whole dataset, and the average of all individual performances is reported as the overall performance. A sequence-wise selection is chosen to maintain the independence between the training set and the test set; however, a frame-wise classification is used for the evaluation. A sequence-wise classification (where the decision is based on the class to which the maximum number of images is assigned) might improve the overall performance, but this is not considered here.
Classifier performance is presented in a confusion matrix, where the elements on the main diagonal represent the percentage of images correctly classified for each class, and the off-diagonal elements represent the percentage of images confused with the other classes. The classification rate of each class is the number of frames correctly classified in that class divided by the total number of images in the class. The overall performance of the classifier is given by the ratio

OP = \frac{N_{IC}}{T_I}   (24)

where NIC is the number of images correctly classified and TI the total number of images. The classification results obtained from the 100 data sets are then averaged to obtain the final classification results for each test set. This is done for each classification method and for every test set.
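The validation loop can be sketched as follows; the sequence-id bookkeeping and the classifier factory argument are assumptions about how the data is organized.

```python
import numpy as np

def overall_performance(clf_factory, X, y, seq_id, n_rounds=100, train_frac=2/3, seed=0):
    """Repeated sequence-wise 2/3-1/3 splits with frame-wise scoring, eq. (24)."""
    rng = np.random.default_rng(seed)
    sequences = np.unique(seq_id)
    rates = []
    for _ in range(n_rounds):
        train_seqs = rng.choice(sequences, size=int(train_frac * len(sequences)),
                                replace=False)             # sequence-wise split
        tr = np.isin(seq_id, train_seqs)
        clf = clf_factory().fit(X[tr], y[tr])
        rates.append(np.mean(clf.predict(X[~tr]) == y[~tr]))   # OP = NIC / TI
    return float(np.mean(rates))
```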
B. Results

Table III shows the performance of the Bayes quadratic classifier, where each class density is approximated by a single Gaussian. Equal a priori probabilities were chosen for all classes. A multivariate Gaussian model for each class works very well, and the overall performance is 94.72%. As evident from the table, the classification is rather good for the first two classes, and most of the misclassifications occur for the FFCS class and the Adult class. This is due, on the one hand, to the fact that the Bayes quadratic classifier assumes a specific form for the statistical distributions of the classes, which may be a wrong assumption for our dataset, and on the other hand to the lack of sufficient variation in the training data to estimate the class conditional densities. The former problem can be handled by choosing a better approximation of the data; at this point it can be expected that a better estimate of the density of the data is possible using a mixture of Gaussians (GMM). A GMM classifier using the Bayes rule with EM training is examined next.
TABLE II
DATASET USED FOR THE TRAINING AND THE EVALUATION OF ALL CLASSIFIERS

Class  | Test Object                                 | No. of Sequences
Empty  | Empty vehicle seat                          | 45 × 50
RFIS   | 1 car bed, 10 infant seats, 6 convertibles  | 126 × 50
FFCS   | 6 convertibles                              | 43 × 50
P      | 6 adult persons: 5%il to 95%il              | 45 × 200
Table IV shows the confusion matrix of the GMM classifier. When using GMM classifiers, the parameter to be optimized is the number of mixture components within each class for which the overall classification rate is maximal. In [14], a method to determine the optimal number of components automatically was proposed; here it was determined empirically that the minimum classification error was achieved for a GMM classifier with the component numbers [1 1 2 4], where each element gives the number of components of the respective class. Thus, only the data of the last two classes is split, as a split is not needed for the other two classes, where the performance is already very good. The EM algorithm is used to determine the parameters of each class (see Section V-B), that is, the mean, covariance, and weight of each component [6]. The initial values of these parameters can be assigned randomly or using a clustering algorithm; here a K-means algorithm [11] is used. As expected, the performance for the Adult class improves significantly; however, there is clearly a decrease in performance for the Empty, RFIS, and FFCS classes. This indicates that some of the Gaussians of the Adult class are spread so widely that they overlap with some of the components of the Empty, RFIS, and FFCS classes.
Table V shows the performance of the polynomial classifier; the overall performance in this case is 95%. Note that the individual class performance for the Adult class is improved compared to the Bayesian decision model classifiers. A polynomial degree of 2 is selected for the classification task; it was observed that this is sufficient for optimized performance, and a further increase of the polynomial degree actually caused over-fitting of the training data during the training phase. The approximation of the class decision boundaries using a polynomial is indeed successful. All misclassified images occurred in the Adult class. One reason could be that the data belonging to the Adult class shows a large variation within the class, which the approximation by a single polynomial function cannot handle, thus producing undesirable values. To overcome this problem, a clustering approach prior to the polynomial classifier is examined next.
Table VI shows the classification results of the combination of a clustering of each class and a polynomial classifier. A K-means algorithm is used for the clustering. As evident from the table, there is an improvement in the overall performance. The parameter to be optimized is the number of clusters within each class; it was determined empirically that [1 1 1 3] clusters per class gives
the maximum classification rate. After the evaluation of all the classifiers, the polynomial cluster classifier (PCC) shows the best overall performance of all the classifiers compared. The overall performance is close to 100%, except for a few outliers, which can easily be eliminated by maintaining a history buffer at the class output.
The above performance is now compared with the performance obtained using only PCA features. A polynomial classifier is employed as the classification technique. Table VII shows the performance of the polynomial classifier with PCA features only. To make the results directly comparable, 17 principal components were used for the classification task. Clearly, the classifier performance with geometrical features is better than the performance obtained using PCA features. The performance with PCA features is good for the first three classes, whereas for the Adult class it is poor. This was expected, as the PCA analysis does not always provide enough information to discriminate the classes.
TABLE III
BAYES QUADRATIC CLASSIFIER RESULTS

           |         Estimated Class
True Class | Empty   RF      FF      P
Empty      | 98.2    0       0       1.8
RFIS       | 0       99.8    0       0.2
FFCS       | 0       0       84.8    15.2
P          | 0       5.2     0       94.8
TABLE IV
GMM CLASSIFIER RESULTS

           |         Estimated Class
True Class | Empty   RF      FF      P
Empty      | 97.4    0       0       2.6
RFIS       | 0       99.2    0       0.8
FFCS       | 0       0       76.9    23.1
P          | 0       1       0       99
VII. DISCUSSION AND CONCLUSIONS

In this paper we presented a novel camera technique for the classification of vehicle occupants. In this safety-critical application it is very challenging to provide high reliability due to the large variations in the scene. Past research on this
TABLE V
POLYNOMIAL CLASSIFIER RESULTS

           |         Estimated Class
True Class | Empty   RF      FF      P
Empty      | 100     0       0       0
RFIS       | 0       100     0       0
FFCS       | 0       0       100     0
P          | 0       2.5     0.6     96.9

TABLE VI
POLYNOMIAL CLUSTER CLASSIFIER RESULTS

           |         Estimated Class
True Class | Empty   RF      FF      P
Empty      | 100     0       0       0
RFIS       | 0       100     0       0
FFCS       | 0       0       100     0
P          | 0       0.1     0       99.9

TABLE VII
POLYNOMIAL CLASSIFIER PERFORMANCE USING PCA FEATURES

           |         Estimated Class
True Class | Empty   RF      FF      P
Empty      | 94      0       0       6
RFIS       | 0       92      0.8     7.2
FFCS       | 0       0       99.8    0.2
P          | 0.5     8.9     4.6     86
application used 2-D and stereographic systems. A 3-D camera technology similar to that of this paper was used in [15] with PCA features. Using the camera described herein, the distance of the object/scene to the camera is obtained directly without any further processing, in contrast to stereographic systems, where additional image processing, and thus extra processing time, is needed. A further advantage of the present camera system is that it yields a distance image which is independent of illumination and textures. The evaluation was performed within a test vehicle and considered a large number of both simple and complex situations.
Different geometrical feature extractions from range images, invariant under scene variation, were presented, and their performance was compared with the performance obtained with PCA features. The results have improved compared to the previous work of [15]. A total of 43 features were extracted from each range image, and the size of the feature vector was further reduced to 17 using a feature subset selection algorithm. For the comparison to be fair, the feature subset selection was performed independently of the classification performance.
For the classification task, different classifiers were implemented, based on Bayesian decision theory and on the estimation of a decision boundary using discriminant functions. It was found that the classifier based on the estimation of a decision boundary using a polynomial performed better than the other methods. However, it was observed that a single class approximation function could not handle the large variation that may occur within one class. To overcome this, a modification of the classification method was proposed, which uses a clustering approach to split the data into clusters prior to the classification task. With this modification it was shown that a performance close to 100% can be achieved. The results of the extensive evaluation showed the feasibility of the present camera system for the occupant classification problem.
ACKNOWLEDGMENTS

This project is funded by IEE S.A., Luxembourg, and the Luxembourg International Advanced Studies in Information Technology (LIASIT), Luxembourg.

REFERENCES

[1] "Centre Suisse d'Electronique et de Microtechnique SA," http://www.csem.ch.
[2] "IEE S.A., Luxembourg," http://www.iee.lu.
[3] "Traffic safety facts 2004," Traffic Safety Annual Report, National Highway Traffic Safety Administration (NHTSA), Tech. Rep., 2004, www-nrd.nhtsa.dot.gov/pdf/nrd-30/NCSA/TSFAnn/TSF2004.pdf.
[4] F. Arman and J. K. Aggarwal, "Model-based object recognition in dense-range images — a review," ACM Computing Surveys, vol. 25, no. 1, March 1993.
[5] P. Besl, Surfaces in Range Image Understanding. Springer-Verlag, 1988.
[6] J. Bilmes, "A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," University of Berkeley, Tech. Rep. ICSI-TR-97-021, 1997, citeseer.ist.psu.edu/bilmes98gentle.html.
[7] K. Boyer and A. Kak, "Color-encoded structured light for rapid active ranging," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 9, no. 1, pp. 14–28, 1987.
[8] S. Burak-Goktuk and A. Rafii, "An occupant classification system — eigen shapes or knowledge-based features," Proc. IEEE International Conference on Computer Vision and Pattern Recognition, San Diego, USA, pp. 57–64, 2005.
[9] W. M. Campbell and K. T. Assaleh, "Polynomial classifier techniques for speaker verification," IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 15–19, March 1999.
[10] F. Cardinaux, C. Sanderson, and S. Marcel, "Comparison of MLP and GMM classifiers for face verification on XM2VTS," Lecture Notes in Computer Science, vol. 2688, pp. 911–920, 2003.
[11] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, 1991.
[12] P. Faber, "Seat occupation detection inside vehicles," 4th IEEE Southwest Symp. on Image Analysis and Interpretation, Austin, TX, pp. 187–191, April 2000.
[13] M. E. Farmer and A. K. Jain, "Occupant classification system for automobile airbag suppression," Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. I-756–I-761, June 2003.
[14] M. A. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 1–16, March 2002.
[15] M. Fritzsche, M. Oberlander, T. Schwarz, B. Woltermann, B. Mirbach, and H. Riedel, "Vehicle occupancy monitoring with optical 3-D sensors," Proc. IEEE Intelligent Vehicle Symposium, pp. 90–94, June 2001.
[16] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[17] M. Hebert, "Active and passive range sensing for robotics," Proc. IEEE International Conference on Robotics and Automation, vol. 1, pp. 102–110, April 2000.
[18] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[19] B. Jan and B. Claus, "Curvature based range image classification for object recognition," Proceedings of SPIE Intelligent Robots and Computer Vision: Algorithms, Techniques, and Active Vision, pp. 211–220, October 2000.
[20] H. Kong, Q. Sun, W. Bauson, S. Kiselewich, P. Ainslie, and R. Hammoud, "Disparity based image segmentation for occupant classification," IEEE Computer Vision and Pattern Recognition Workshop, Washington D.C., pp. 126–133, June 2004.
[21] P. Krsek, G. Lukash, and R. Martin, "Algorithms for computing curvature from range images," Computer and Automation Institute, Hungarian Academy of Sciences, Budapest, Tech. Rep., 1998, http://www.ralph.cs.cf.ac.uk/papers/Geometry/curvcompare.pdf.
[22] J. Krumm and G. Kirk, "Video occupant detection for airbag deployment," Proceedings of the IEEE Workshop on Applications of Computer Vision, Princeton, NJ, pp. 30–35, October 1998.
[23] R. Lange, "3D time-of-flight distance measurement with custom solid-state image sensors in CMOS/CCD technology," Ph.D. dissertation, University of Siegen, Department of Electrical Engineering and Computer Science, Germany, 2000. [Online]. Available: http://www.ub.uni-siegen.de/epub/diss/lange.htm
[24] K. Z. Mao, "Orthogonal forward selection and backward elimination algorithms for feature subset selection," IEEE Trans. on Systems, Man, and Cybernetics—Part B: Cybernetics, vol. 34, no. 1, pp. 629–634, February 2004.
[25] A. M. Martinez and A. C. Kak, "PCA versus LDA," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, February 2001.
[26] Y. Owechko, N. Srinivasa, S. Medasani, and R. Boscolo, "Vision-based fusion system for smart airbag applications," Proc. IEEE Intelligent Vehicle Symposium, vol. 1, pp. 245–250, June 2002.
[27] R. J. Prokop and A. P. Reeves, "A survey of moment-based techniques for unoccluded object representation and recognition," CVGIP: Graphical Models and Image Processing, vol. 54, no. 5, pp. 438–460, September 1992.
[28] A. Reeves, R. Prokop, S. E. Andrews, and F. P. Kuhl, "Three-dimensional shape analysis using moments and Fourier descriptors," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 10, no. 6, pp. 937–943, November 1988.
[29] J. Schürmann, Pattern Classification: Statistical and Neural Network based Approach. John Wiley and Sons, Inc., New York, 1990.
[30] M. M. Trivedi, S. Y. Cheng, E. M. C. Childers, and S. J. Krotosky, "Occupant posture analysis with stereo and thermal infrared video: algorithms and experimental evaluation," IEEE Trans. on Vehicular Technology, vol. 53, no. 6, pp. 1698–1712, November 2004.
[31] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.
[32] C. J. Veenman and M. J. T. Reinders, "The nearest subclass classifier: A compromise between the nearest mean and nearest neighbor classifier," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 9, pp. 1417–1429, September 2005.
[33] J. W. Weingarten, G. Gruener, and R. Siegwart, "A state-of-the-art 3D sensor for robot navigation," Proceedings of the IEEE Intelligent Robots and Systems Conference (IROS), pp. 2155–2160, September 2004.