Machine Learning Project: Identify a Car’s Driver from Driving Behavior
Fan Yang, Chunjing Jia
December 12, 2013
1
Introduction
Each individual has a personal driving behavior that could be used as an identifying characteristic, similar to handwriting. Under this hypothesis, we propose a learning study of the connection between a driver’s identity and the vehicle’s measured signals, such as acceleration, heading, and speed, which can usually be collected from the vehicle's electronic system or from additional sensors. The dataset includes real-time, high-frequency accelerometer, heading, speed, odometer, and gas usage readings. We first convert the time-dependent data into a large number of time-independent features, which can then be used to train vehicle-against-vehicle classifiers. We aim to obtain a reliable supervised learning algorithm for the case of a single driver driving the same car, as well as unsupervised clustering to detect when a vehicle has multiple drivers.
2
Data Collecting
The data were collected by MetroMile, Inc. and saved in CSV (comma-separated values) format, which can be viewed and manipulated in Microsoft Excel and Matlab. Each CSV file holds the information for one car, collected over a number of trips. Each trip covers a continuous section of time, usually sampled every second or every few seconds. The recorded information includes the velocity in mph, the orientation (heading) of the car, the accelerations in three dimensions, and the instantaneous gas usage. See figure 1 for the first trip collected for car #133000249. The typical number of trips collected for each car is a few thousand; for example, car #133000249 has 2281 trips when any two consecutive data points separated by more than 60 seconds are treated as belonging to two different trips. This gives us a lot of information for studying the driving behavior of each driver. Under the further assumption that driving behavior is unique to each person, we can identify the driver just by looking at the way he/she drives. We note that we assume each driver’s driving behavior is independent of the car’s make/model/condition, just as the kind of pen is ignored when people recognize a signature. The same data collection procedure was performed for 18 different cars. We know from the data provider that some of the cars are driven by a single driver, while others are driven by multiple people in a family. Table 1 shows the list of car models, the corresponding car numbers used in the study, and the number of drivers.
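The trip segmentation rule described above (a gap of more than 60 seconds between consecutive samples starts a new trip) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_into_trips`, the sample timestamps, and the assumption that timestamps are given in seconds and sorted in increasing order are all hypothetical.

```python
from typing import List, Sequence


def split_into_trips(timestamps: Sequence[float],
                     max_gap_s: float = 60.0) -> List[List[int]]:
    """Group sample indices into trips.

    A new trip starts whenever the time since the previous sample
    is strictly greater than max_gap_s (60 s in the text).
    Assumes timestamps are in seconds and sorted in increasing order.
    """
    trips: List[List[int]] = []
    for i, t in enumerate(timestamps):
        if not trips or t - timestamps[i - 1] > max_gap_s:
            trips.append([])  # first sample, or gap too long: start a new trip
        trips[-1].append(i)
    return trips


# Hypothetical timestamps (seconds), not taken from the dataset.
example_times = [0, 1, 2, 5, 70, 71, 73, 200]
example_trips = split_into_trips(example_times)
```

Here `example_trips` holds three lists of sample indices, one per trip. Each list can then be used to slice the speed, heading, and acceleration columns of the same CSV file.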
[Figure 1 panels, each plotted over a horizontal axis running from 0 to 800: heading (degrees); speed (mph) and gas (mpg); accel x, accel y, and accel z (g).]
Figure 1: The information of the first trip/section collected for car #133000249.
3
Feature selection
Extracting the key features from the large volume of data that we have obtained is one of the central questions of this study. We treat each trip as one data point, so that we can extract a vector $x$ containing all the useful features to represent this data point. For car #133000249, for example, this gives 2281 data points. This provides a large enough data set for either the regression in the single-driver cases or the multi-class classification in the multiple-driver cases. Finding good and useful features turns out to be a tough question, especially considering the complexity of the collected data and of the problem itself. The features that we propose are:
(1) average speed in each section, $x_1$
(2) max speed in each section, $x_2$
(3) average speed on the ramp when entering the highway, $x_3$
(4) average speed on the ramp when leaving the highway, $x_4$
(5) frequency of lane changing, $x_5$
(6) speed at 1 second before stop, $x_6$
(7) speed at 2 seconds before stop, $x_7$
(8) speed at 3 seconds before stop, $x_8$
(9) speed at 1 second after start, $x_9$
(10) speed at 2 seconds after start, $x_{10}$
(11) speed at 3 seconds after start, $x_{11}$
so that $x = [x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}, x_{11}]^T$. We find that by including these features we neither oversimplify the model nor make it so complicated that it overfits.

Table 1: The list of car models, with the car number and the number of drivers, used for the data collection.

Car model and make                                  Car number   Driver(s)
2005 Volkswagen GTI 2-door Hatchback, 4-cylinder    133000249    2
2004 Honda Pilot 6-cylinder, 4WD                    133000250    2
2012 Toyota Prius v 4-door Wagon, 4-cylinder        133000251    2
2003 Toyota Corolla 4-door Sedan, 4-cylinder        133000252    1
2011 Infiniti G37 4-door Sedan, 6-cylinder          133000253    Family of 3 drivers
2011 Mercedes-Benz GL450 8-cylinder, 4WD            133000254    same as 254
2008 Subaru Outback 4-door Wagon, 4-cylinder        133000257    Family of 3 drivers
2003 Honda Accord 4-door Sedan, 4-cylinder          133000258    same as 257
2005 Toyota Camry 4-door Sedan, 4-cylinder          133000259    1
2012 Subaru Impreza 4-door Wagon, 4-cylinder        133000261    1
2011 Volkswagen Jetta 4-door Sedan, 5-cylinder      133000263    1
2011 Nissan Versa 4-door Hatchback, 4-cylinder      133000265    2
2007 Acura MDX 6-cylinder, 4WD                      133000284    1
2000 Toyota Camry 4-door Sedan, 4-cylinder          133000374    1
2007 BMW 335 4-door Sedan, 6-cylinder               133000381    2
2001 BMW X5 8-cylinder, 4WD                         133000386    1
2006 Honda Civic 2-door Coupe, 4-cylinder           133000485    1
2003 Cadillac CTS 4-door Sedan, 6-cylinder          133000623    1
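As an illustration of how a trip's time series can be reduced to some of the features above, the sketch below computes the speed-based features (average speed, max speed, and the speeds 1 to 3 seconds before a stop and after a start) from a list of speed samples. Several things here are assumptions rather than details from the study: the function name `speed_features`, the 1 Hz sampling, the definition of a stop as the first zero-speed sample after motion and of a start as the last zero-speed sample before motion, and the averaging over all stop/start events in a trip. The ramp and lane-changing features are omitted because their detection rules are not described.

```python
from statistics import mean
from typing import Dict, List, Sequence


def speed_features(speed_mph: Sequence[float]) -> Dict[str, float]:
    """Compute x1, x2 and x6..x11 for one non-empty trip.

    Assumes one speed sample per second (hypothetical).
    """
    n = len(speed_mph)
    # Stop event: speed is zero now and was positive one sample earlier.
    stops: List[int] = [i for i in range(1, n)
                        if speed_mph[i] == 0 and speed_mph[i - 1] > 0]
    # Start event: speed is zero now and positive one sample later.
    starts: List[int] = [i for i in range(n - 1)
                         if speed_mph[i] == 0 and speed_mph[i + 1] > 0]

    def avg_at_offset(events: List[int], offset: int) -> float:
        # Average the speed found `offset` samples away from each event,
        # skipping events too close to the edge of the trip.
        vals = [speed_mph[i + offset] for i in events if 0 <= i + offset < n]
        return mean(vals) if vals else float("nan")

    feats: Dict[str, float] = {"x1": mean(speed_mph), "x2": max(speed_mph)}
    for k in (1, 2, 3):
        feats[f"x{5 + k}"] = avg_at_offset(stops, -k)   # x6, x7, x8
        feats[f"x{8 + k}"] = avg_at_offset(starts, k)   # x9, x10, x11
    return feats


# Hypothetical trip: accelerate from rest, cruise, then brake to a stop.
example_speed = [0, 5, 12, 20, 20, 12, 6, 2, 0]
example_feats = speed_features(example_speed)
```

Trips with no stop or no start event yield `nan` for the corresponding features, so such trips would need to be filtered or handled separately before any model fitting.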
4
Supervised learning
We performed supervised learning for the cars with a single driver. The internal relations among the features can be modeled as:
$x_2 \sim N(a_1 x_1^2 + a_2 x_1 + a_3,\ a_4 x_1 + a_5)$
$x_3 \sim N(a_6,\ a_7)$
$x_4 \sim N(a_8,\ a_9)$
$x_5 \sim N(a_{10},\ a_{11})$
$x_6 \sim N(a_{12} x_7 + a_{13},\ a_{14} x_7 + a_{15})$
$x_7 \sim N(a_{16} x_8 + a_{17},\ a_{18} x_8 + a_{19})$
$x_9 \sim N(a_{20} x_{10} + a_{21},\ a_{22} x_{10} + a_{23})$
$x_{10} \sim N(a_{24} x_{11} + a_{25},\ a_{26} x_{11} + a_{27})$
For the cars with a single driver, we fit the features with the model described above and find the parameter vector $a = [a_1, a_2, \ldots, a_{27}]^T$. The parameter vector $a$ can be used as the identifier of the driver. A model fit of the features for car #133000259 is shown in figure 2.
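To make one of these relations concrete, the sketch below estimates the four parameters of a model of the form $y \sim N(s x + c,\ v x + w)$, which matches the shape of the relation between $x_6$ and $x_7$. The fitting procedure is not described in the text, so this uses a simple two-stage least-squares approach as a stand-in: first fit the mean line, then fit a second line to the squared residuals. It also assumes that the second argument of $N$ is a variance. The function names and the data are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx  # requires at least two distinct x values
    return slope, my - slope * mx


def fit_linear_gaussian(x: Sequence[float],
                        y: Sequence[float]) -> Tuple[float, float, float, float]:
    """Two-stage estimate of (s, c, v, w) for y ~ N(s*x + c, v*x + w).

    Stage 1 fits the mean line; stage 2 fits a line to the squared
    residuals as a rough estimate of the x-dependent variance.
    """
    s, c = fit_line(x, y)
    sq_resid = [(yi - (s * xi + c)) ** 2 for xi, yi in zip(x, y)]
    v, w = fit_line(x, sq_resid)
    return s, c, v, w


# Hypothetical data: the spread of y grows with x.
example_x = [0.0, 0.0, 2.0, 2.0]
example_y = [0.0, 2.0, 3.0, 7.0]
example_params = fit_linear_gaussian(example_x, example_y)
```

For this toy data the mean line is $y = 2x + 1$ and the fitted variance line is $1.5x + 1$. A fitted variance line can become negative for some values of $x$, so a maximum-likelihood fit with a positivity constraint may be preferable in practice.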
5
Unsupervised learning
For cars with multiple drivers, we use the k-means clustering algorithm to separate the different drivers. For example, for car #133000249, shown in figure 3, the frequency of lane changing (highlighted by the dotted circles) forms two clusters that can be used directly to separate the two drivers. This approach is very useful for separating drivers whose lane-changing frequencies differ markedly, but it may not work well when different drivers have similar lane-changing frequencies.
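The clustering step can be illustrated with a minimal one-dimensional k-means (Lloyd's algorithm) applied to a single feature such as the lane-changing frequency. This is a sketch under assumptions: the function name `kmeans_1d`, the deterministic initialization at evenly spaced sorted values, and the sample frequencies are hypothetical, and the study may have clustered on more than one feature.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], k: int = 2,
              iters: int = 100) -> Tuple[List[float], List[int]]:
    """Lloyd's algorithm on scalar data.

    Centers are initialized at evenly spaced positions in the sorted
    data (the minimum and maximum when k = 2).
    """
    ordered = sorted(values)
    centers = [ordered[(len(ordered) - 1) * j // max(k - 1, 1)]
               for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members)
                               if members else centers[c])
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, labels


# Hypothetical per-trip lane-changing frequencies for a two-driver car.
example_freq = [0.01, 0.02, 0.015, 0.20, 0.22, 0.19]
example_centers, example_labels = kmeans_1d(example_freq, k=2)
```

With two well-separated groups, as in this toy data, each trip is labeled with one of the two clusters, and each cluster can be read as a candidate driver. When the groups overlap, the labels depend heavily on where the cluster boundary falls, which is the limitation noted above.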
[Figure 2 panels: max speed (mph) versus average speed (mph); average (accel) speed on ramp (mph), average (decel) speed on ramp (mph), and frequency of lane changing (#/5000 s); speed last second versus speed this second, for 1 second and 2 seconds before stop; speed next second versus speed this second, for 1 second and 2 seconds after stop.]
Figure 2: Features and the model parameters for car #133000259 (1 driver).
[Figure 3 panels: max speed (mph) versus average speed (mph); average (accel) speed on ramp (mph), average (decel) speed on ramp (mph), and frequency of lane changing (#/5000 s); speed last second versus speed this second, for 1 second and 2 seconds before stop; speed next second versus speed this second, for 1 second and 2 seconds after stop.]
Figure 3: Features and the model parameters for car #133000249 (2 drivers).