2017-01-1372
Driver Identification Using Vehicle Telematics Data Bo Wang, Smruti Panigrahi, Mayur Narsude, Amit Mohanty Ford Motor Company
Abstract
An increasing number of vehicles are equipped with telematics devices and are able to transmit vehicle CAN bus information remotely. This paper examines the possibility of identifying individual drivers from their driving signatures embedded in these telematics data. The vehicle telematics data used in this study were collected from a small fleet of 30 Ford Fiesta vehicles driven by 30 volunteer drivers over 15 days of real-world driving in London, UK. The collected CAN signals included vehicle speed, accelerator pedal position, brake pedal pressure, steering wheel angle, gear position, and engine RPM. These signals were collected at approximately 5 Hz and transmitted to the cloud for offline driver identification modeling. A list of driving metrics, such as mean brake pedal pressure and longitudinal jerk, was developed to quantify driver behaviors. Random Forest (RF) was used to predict driver IDs based on the developed driving metrics. The RF model was also used to rank the importance of each driving metric for driver identification. In conclusion, this paper demonstrated the possibility of identifying drivers from their on-road naturalistic driving behaviors with 100% accuracy within 6 minutes of driving by training the RF model with 4 hours of driving data.
1. Introduction
With the introduction of the electronic control unit (ECU) in the early 1970s [1], vehicles gained the ability to record and control their individual dynamic states. With intensive efforts from the automotive OEMs and suppliers, these ECUs have continued to evolve in terms of performance while enhancing the scope of control actions performed by the vehicle on which they are installed. ECUs usually comprise various control modules, such as the Engine Control Module (ECM) and the Brake Control Module (BCM). Newer vehicles now have ECU counts approaching the hundreds [1, 2]. The sophistication and complexity of the embedded software that goes into these ECUs have continued to increase. The ECUs and sensors in a vehicle are linked through a controller area network (CAN), through which the various control modules communicate with each other [1]. The signals on the CAN bus can be captured through the on-board diagnostics (OBD-II) port using a wireless dongle, often referred to as a plug-in-device (PID). In addition to sensors used in monitoring the vehicle's internal systems, the technology for external-world sensing has taken a leap in recent years. Advancements in external-world sensing technologies, such as a fusion of camera, GPS, RADAR, LIDAR, ultrasonic sonar, and Dedicated Short Range Communications (DSRC) [2], have fueled growth in the 21st-century automotive industry and opened the door to modern-day digital and autonomous vehicles.
Apart from the significant digitization of automobiles in the past 50 years, data transmission and connectivity have enabled vehicle sensor data collection and wireless transmission via vehicles' built-in telematics units or via the cellular connectivity of plug-in-devices (PIDs). Telematics in this paper refers to the communication technology used to send and receive vehicle CAN bus data. The large amounts of vehicles' high-frequency sensor data streams, together with vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) connectivity, have created a new market dubbed the mobility services industry. With mobility services rapidly growing, the utilization of these technologies has been of growing interest in the past few years to companies like Uber and Lyft [3].
Page 1 of 7 1/18/2017
The mobility movement has led to further advancements in the development of smart chip-based electronic devices that can collect and transmit vehicle CAN bus signals wirelessly to the cloud. These CAN bus or telematics data contain rich information about a particular trip or drive segment. Various researchers have taken advantage of these data, either in real time or offline, in developing predictive models for various mobility applications. One example of such efforts has been identifying drivers based on their driving habits. This has prompted researchers to explore techniques for developing models that identify drivers from the driving signature embedded in the vehicle telematics data.
Various authors have used virtual simulators to collect data similar to vehicle telematics data. In particular, Zhang et al. and Wakita et al. used data collected from a simulation environment in developing predictive models for driver identification [4, 5]. They used various controlled routes and settings and reached a prediction accuracy of 85% with 20 drivers using a Hidden Markov Model (HMM) [4], and 73% with 30 drivers and 81% with 12 drivers using a Gaussian Mixture Model (GMM) [5]. These studies, however, did not capture real-world driving conditions. Although these analyses provide insights into the modeling techniques, the results cannot be compared to real-world situations, which involve various uncontrolled settings such as different traffic patterns and weather conditions.
Driver identification using a mobile phone's inertial sensor data has been explored by Van Ly et al. in distinguishing between two drivers on a controlled route including residential and highway segments [6]. In this experimental study they achieved 60% accuracy using a supervised support vector machine (SVM) and an unsupervised k-means clustering technique. Because a phone's accelerometer provides acceleration signals only, it is not sufficient to capture drivers' driving signatures. Miyajima et al. and Nishiwaki et al. [7, 8] developed driver identification models with data collected from multiple expensive sensors, cameras, and on-board instrumentation. This driving simulator study achieved 86% accuracy among 11 drivers and 77% accuracy among 274 drivers, using previously recorded multimedia and vehicle sensor data. The most promising results for a controlled-environment study have been presented by Enev et al. [9]. The authors used random forest techniques to reach 87% accuracy in distinguishing between 15 drivers using only the brake pedal signal, and 99% accuracy using 5 sensor signals, in a controlled and predefined route experiment. This study used 48 different features extracted from 3-second windows of 60 Hz time series data. These results provided insights into the particular signals that were more prominent in the prediction model.
These investigations for driver identification, however, were set up either in a controlled environment or using simulated data. Consequently, most studies lacked real-world driving conditions such as weather, time of day, and heavy versus light traffic. Most of these studies also used the same vehicle, which ensures identical vehicle quality and status. In this paper, we did not constrain any of those sources of variability. The data we used for driver identification were taken without the authors' knowledge of the weather, time of day, or traffic conditions. The drivers in this study used their own personal vehicles, which were Ford Fiesta vehicles from 2009 to 2015. One important assumption of this study was that the differences captured in the driving metrics were mainly caused by drivers' behaviors, even though vehicle age can have some impact on vehicle dynamics.
Section 2 provides details of our experimental data collection and the driver recruitment phase of Ford's driver behavior experiment. The driver identification framework is presented in Section 3. Section 4 provides a detailed description of the methods and algorithms used in the machine learning multiclass classification prediction model. In particular, the random forest classifier is used. The results of the random forest classification are then reported in Section 5. Finally, the major findings and contributions are presented in Section 6.
2. Experimental Data Collection
Ford Motor Company is currently expanding its business to be both an auto and a mobility company; as such, the company is pursuing several emerging opportunities. Various mobility experiments are being conducted worldwide under these initiatives. One such experiment is the Driver Behavior experiment [10], which is conducted in London in order to better understand driver behavior and explore various incentive-driven instruments to improve it. The geographical area where the drivers lived and took part in this experiment is shown in Figure 1.
Figure 1. Map of London where most of the drivers took part in Ford's Smart Mobility Experiment on Driver Behavior. (Google Map, 2016)
Over a six-month period, plug-in devices gathered data from more than 43 Ford Fiestas at the beginning of the experiment. However, only 30 drivers were kept in the driver identification study due to small driving sample sizes and missing vehicle signals. The CAN signals from the vehicles were transmitted to the cloud as the volunteer drivers drove their personal vehicles to different places, as they would in their day-to-day commute. Using the data collected during this driver behavior experiment, we aim to identify the drivers by training a machine-learning model on the historical driving data.
2.1 Vehicle Sensors and Data Sources
The recent development of vehicle telematics technology allows OEMs to collect high-frequency vehicle data from CAN bus signals, such as vehicle speed, brake pedal pressure, and steering wheel angle. During a vehicle's operation, a wide range of signals are transmitted through the CAN bus at a very high frequency. Some of these signals relate to the dynamics of the vehicle and some relate to the driver's control inputs. These signals are readily available through the OBD-II port of the vehicle by using aftermarket PIDs. As computing power continues to grow, these aftermarket devices continue to improve their offerings in terms of both hardware and software. The PIDs used in Ford's Driver Behavior experiments were from a third-party supplier from France [11]. Using this OBD-II wireless PID [12], the sensor data streams were transmitted to the cloud server using a 2G SIM card. An illustration of the data flow from the vehicle to the cloud is shown in Figure 2.
Figure 2. Data flow diagram for vehicle CAN signals being collected from the on-board diagnostics port (OBD-II) by a plug-in-device (PID) equipped with a SIM card that transmits the CAN data through 2G connectivity to the cloud server.
2.2 Driver Recruitment
The initial phase of the experiment was to recruit drivers with wide coverage of driver ages (19 to 70 years) and an almost half-and-half split between driver genders. The test vehicles were all 2009 to 2015 Ford Fiesta vehicles that were already owned by the participants. The drivers were recruited through email and phone calls. The participants signed the necessary data consent form and authorized Ford Motor Company to collect the CAN signals of interest. These drivers were then provided with a wireless PID that was plugged into the OBD-II port of the vehicle as shown in Figure 2. The Ford UK team helped configure and set up the PIDs in each participant's vehicle. The drivers were instructed to leave the PID untouched throughout the experiment. The PID was configured such that the CAN signals collected from the vehicle through the OBD-II port were transmitted to the cloud server for future use. All of the volunteer participants of this experiment used their own vehicle with a PID provided by Ford.
3. Data Analysis Methods
Approximately 83 million rows of real-world driving data were collected from 15 days of driving by volunteer drivers in London, UK. We focused on the CAN signals collected at 5 Hz and used them to extract unique information about each driver's driving style in order to identify the drivers among a pool of 30 drivers. All drivers in this experiment drove their own Fiesta vehicles. This is a differentiating factor relative to various other efforts in driver identification. Our goal is to find out whether our model can predict if Vehicle-X is driven by Driver-X. A total of 4 hours of driving activity per driver was used as the training data set. The 4 hours of data were used because most participants drove at least 4 hours in the 15 days. Additionally, different sample sizes were examined in the testing dataset to see how long it takes to differentiate all drivers. The challenges in this classification study are significant because nothing about the driving routes, speed limits, or traffic information is known. This type of study is also known as a naturalistic driving study, which is defined as an unobtrusive observation method to study drivers' everyday driving behavior without any interventions [13]. The data stream could include not only highway driving but also complex city driving, as the drivers lived in the greater London area. The routes could include multiple speed limits, traffic stops, and parking lots. The driving weather condition is also an important factor that could affect someone's driving signature. The driving style also changes depending on time of day, daylight conditions, and traffic patterns. We expect all of these factors to be present in our data set, making this a uniquely challenging identification task.
Though various machine-learning methods were initially considered in this study, the random forest model was used due to its proven high-accuracy predictions as reported in the literature [14-16]. The next few sections discuss the methodology for data imputation, driving metrics development, and the random forest model.
3.1 Signal Pre-Processing
Figure 3. Plot of Vehicle Signal Data from a Sample Trip
3.1.1 Vehicle CAN Signals
Vehicle CAN/telematics data streams that were used for the machine learning application are listed in Table 1 below:
Table 1. Vehicle Sensors and Units
Sensor Signal                 Symbol        Unit
Vehicle Speed                 $v_x$         KPH
Engine Speed                  $\omega_E$    RPM
Engine Torque                 $\tau$        Nm
Brake Pressure                $P_B$         Bar
Accelerator Pedal Position    $x_A$         %
Steering Wheel Angle          $\theta_s$    Degree
The first three signals above are the vehicle's response to driver input and the rest are the driver's control inputs for maneuvering the vehicle. The brake pressure and the steering wheel angle signal streams were collected from the BCM. Vehicle speed, engine speed, engine torque, and accelerator pedal position were transmitted from the ECM.
3.1.2 Derived Sensor Signals
To maintain efficient cloud data transmission, the data from the PIDs were transmitted to the cloud only when the CAN signals changed values. When the value of a particular signal did not change for 200 milliseconds, the data field was filled as NULL on the cloud server. Before the data can be used in the classifier, the missing data (NULL values) must be filled with appropriate values. All the telematics data were pre-processed using linear interpolation to fill the missing or NULL values so that they could be used in the machine learning model. Figure 3 illustrates different vehicle signals collected from a sample trip, including vehicle speed, accelerator pedal position, brake pedal pressure, engine speed, steering wheel angle, and engine torque. The units of the variables are shown in Table 1.
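The linear interpolation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `interpolate_nulls`, the use of `None` for NULL, and the choice to leave leading or trailing gaps unfilled are all assumptions made for illustration.

```python
from typing import List, Optional


def interpolate_nulls(times: List[float],
                      values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None entries by linear interpolation between the nearest known samples.

    Assumption: gaps before the first or after the last known sample are left
    as None, since there is no second point to interpolate against.
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of consecutive known samples and fill the gap between them.
    for left, right in zip(known, known[1:]):
        t0, t1 = times[left], times[right]
        v0, v1 = values[left], values[right]
        for i in range(left + 1, right):
            frac = (times[i] - t0) / (t1 - t0)
            filled[i] = v0 + frac * (v1 - v0)
    return filled


# Hypothetical 5 Hz speed samples with NULLs stored as None.
sample_times = [0.0, 0.2, 0.4, 0.6, 0.8]
sample_speed = [10.0, None, None, 16.0, None]
filled_speed = interpolate_nulls(sample_times, sample_speed)
print(filled_speed)
```

The two interior gaps are filled along the straight line between the known samples at 0.0 s and 0.6 s, while the trailing gap stays unfilled under the stated assumption.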
Apart from the vehicle CAN/telematics data streams collected as shown above, we have computed the following three signals, shown in Table 2, in order to capture a driver's intrinsic behavior such as harshness and smoothness of the driving.
Table 2. Derived Sensor Signals
Derived Signal               Symbol                              Unit
Longitudinal Acceleration    $a_x = \frac{dv_x}{dt}$             m/s^2
Longitudinal Jerk            $j_x = \frac{d^2 v_x}{dt^2}$        m/s^3
Steering Speed               $S_x = \frac{d\theta_s}{dt}$        degree/s
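As an illustration of how the derived signals in Table 2 might be computed from sampled data, the sketch below uses backward differences between consecutive time stamps. The helper name `backward_difference` and the sample values are hypothetical. The conversion from KPH to m/s is also an assumption: Table 1 lists speed in KPH while Table 2 lists acceleration in m/s^2, and the text does not state where the conversion happens.

```python
from typing import List

KPH_TO_MS = 1000.0 / 3600.0  # assumed unit conversion, km/h to m/s


def backward_difference(values: List[float], times: List[float]) -> List[float]:
    """Approximate d(values)/dt at each sample from the second one onward.

    The result has one element fewer than the input and is aligned with times[1:].
    """
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    return [
        (values[i] - values[i - 1]) / (times[i] - times[i - 1])
        for i in range(1, len(values))
    ]


# Hypothetical samples at 5 Hz.
times = [0.0, 0.2, 0.4, 0.6]
speed_kph = [36.0, 37.8, 39.6, 39.6]
steering_deg = [0.0, 1.0, 3.0, 2.0]

speed_ms = [v * KPH_TO_MS for v in speed_kph]
accel = backward_difference(speed_ms, times)           # m/s^2, aligned with times[1:]
jerk = backward_difference(accel, times[1:])           # m/s^3, aligned with times[2:]
steer_speed = backward_difference(steering_deg, times)  # degree/s, aligned with times[1:]

print(accel, jerk, steer_speed)
```

Each differencing step shortens the series by one sample, so jerk is only available from the third time stamp onward.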
In particular, longitudinal jerk is the second derivative of the vehicle speed, and it captures the subtle difference between smooth and rough driving. Since the collected signals did not contain the longitudinal acceleration, we also derived it by computing the first derivative of the longitudinal velocity. Steering speed is calculated as the first derivative of the steering wheel angle. It is used to understand how fast the driver turns the steering wheel. The acceleration ($a_x$) is calculated from the discrete longitudinal speed ($v_x$) using the following formula:
$$a_x[t_2] = \frac{v_x[t_2] - v_x[t_1]}{t_2 - t_1} \qquad (1)$$
where $t_2$ is the current time stamp and $t_1$ is the previous time stamp. Subsequently, utilizing equation (1), the longitudinal jerk, $j_x$, is computed as:
$$j_x[t_2] = \frac{a_x[t_2] - a_x[t_1]}{t_2 - t_1} \qquad (2)$$
The steering speed ($S_x$) is calculated as the first derivative of the steering angle:
$$S_x[t_2] = \frac{\theta_s[t_2] - \theta_s[t_1]}{t_2 - t_1} \qquad (3)$$
where the notations and units were defined previously in Table 1 and Table 2.
3.2 Statistical Features
We aim to build an efficient model using the few signals recorded and statistical features that can capture most of the driving signature. To this end, the statistical features used in the machine-learning model are the following:
• Mean
• Minimum
• Maximum
• 85th Percentile
• Standard Deviation
While the above features are expected to capture the intrinsic driving behavior from the driver's control input and the vehicle's dynamic response, this simple set of features enables a model that is computationally less expensive and easy to implement. The forward stepwise selection method was used for variable selection.
3.3 Multi-Class Classification
While there are various classification methods available, for the purpose of our investigation on the prediction accuracy of driver identification, we used the random forest model, as it has been proven to be an effective and powerful model for multiclass classification [14-16]. Random forest is an algorithm for classification developed by Leo Breiman [17] that uses an ensemble of classification trees, and it has been very successful in producing higher accuracy than a regular tree model on test datasets [18-19]. The random forest algorithm has several characteristics that make it well suited to the driver identification task:
• it is a flexible method that requires few prior assumptions on types of data
• it provides a probabilistic output and the ranking of variable importance
• it is an efficient algorithm and does not easily overfit the dataset
3.3.1 Methods and Criteria for Classification
In order to train the random forest model, we split the data into sliding windows. Different window sizes (3s, 5s, 10s, 20s, and 30s) were tested to find the most efficient sliding window size. Windows shorter than 3 seconds contain fewer data points, and the driving metrics might not provide meaningful statistical features. Driving metrics were summarized from each sliding window. Additionally, the overlap between consecutive windows was varied to find an optimum overlap value. The study used 25% overlapping windows and non-overlapping windows to summarize the driving features. The overlapping windows did not significantly improve the model accuracy, so overlap was not implemented in the final model.
The feature vector computed from each sliding window is then used in training or testing of the classifier. The segmentation of the training and testing samples, and the method used for multi-class classification, are described below.
3.3.2 Training vs Testing Data Segmentation
The driver behavior experiment database was used for the driver identification. In the 15-day study period, an average driver recorded about 7 hours of driving data. Some drivers drove less than 3 hours, whereas one driver drove more than 70 hours in the 15 days. From this database, we chose drivers that had driven for at least 6 hours to ensure that an adequate amount of data was available for the training of the classifier model. Additionally, a significant portion of vehicle brake pedal pressure data was lost for three vehicles, which were later excluded from the modeling process. Within the study period, another three vehicles were reported to have more than one driver, and they were excluded from the model in order to remove bias. After applying these filtering criteria, the total number of vehicles considered for the classifier training was reduced from 43 to 30.
The 15 days of data were then split into two segments. The first 12 days of data were used as the training dataset. The last 3 days of data were used as the testing dataset in this study. Because different drivers had different amounts of driving time in the first 12 days, a random sampling technique was used to draw 4 hours' worth of driving sliding windows from each driver. For the testing dataset, different driving durations (5, 10, 20, 30, 40, and 50 minutes) were tested to check the improvement of prediction accuracy with a larger testing dataset. This helps us understand how long it takes for the random forest model to differentiate all drivers.
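The windowing of Section 3.3.1 and the summary statistics of Section 3.2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the study's code: the names `percentile` and `window_features` are hypothetical, windows are non-overlapping, the standard deviation is the population form, and the 85th percentile uses linear interpolation between order statistics, none of which is specified in the text.

```python
import math
from typing import Dict, List


def percentile(sorted_vals: List[float], q: float) -> float:
    """Percentile with linear interpolation between order statistics (assumed method)."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def window_features(signal: List[float], window_len: int) -> List[Dict[str, float]]:
    """Summarize each non-overlapping window with the five statistics of Section 3.2."""
    features = []
    # Incomplete trailing windows are dropped (an assumption).
    for start in range(0, len(signal) - window_len + 1, window_len):
        w = signal[start:start + window_len]
        mean = sum(w) / len(w)
        std = math.sqrt(sum((x - mean) ** 2 for x in w) / len(w))  # population form
        features.append({
            "mean": mean,
            "min": min(w),
            "max": max(w),
            "p85": percentile(sorted(w), 85.0),
            "std": std,
        })
    return features


SAMPLE_RATE_HZ = 5       # approximate rate stated in the paper
WINDOW_SECONDS = 5       # the window size the paper found most efficient
samples_per_window = SAMPLE_RATE_HZ * WINDOW_SECONDS

# Hypothetical brake pressure trace of 60 samples (12 seconds at 5 Hz).
brake_pressure = [float(i % 10) for i in range(60)]
feature_rows = window_features(brake_pressure, samples_per_window)
print(len(feature_rows), feature_rows[0])
```

In a full pipeline, one such row per signal would be concatenated into the feature vector for each window, and the rows from the first 12 days would be sampled for training while those from the last 3 days would be held out for testing.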
3.3.3 Data Standardization
After windowing, we normalized the training data for each 10 seconds window using the following formula:
$$x_i^N = \frac{x_i^W - \bar{x}^W}{\sigma(x^W)} \qquad (4)$$
where $x_i^W$ is the $i$th raw data point in any particular window, $\bar{x}^W$ and $\sigma(x^W)$ are the mean and standard deviation value of this window respectively, and $x_i^N$ is the normalized value of the $i$th data point in the window. This normalization is used in order to prevent skewness of the data window and to bring all the features into the same scale.
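Equation (4) is a per-window z-score. A minimal sketch is given below; the function name `standardize_window`, the use of the population standard deviation, and the handling of a constant window (zero standard deviation) are assumptions not stated in the text.

```python
import math
from typing import List


def standardize_window(window: List[float]) -> List[float]:
    """Apply equation (4): subtract the window mean and divide by the window std."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0.0:
        # Assumption: a constant window carries no variation, so map it to zeros
        # instead of dividing by zero.
        return [0.0 for _ in window]
    return [(x - mean) / std for x in window]


normalized = standardize_window([2.0, 4.0, 6.0])
print(normalized)
```

After this step every window has zero mean and unit variance, which places signals with very different units (for example RPM and Bar) on a common scale.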
4. Random Forest Model Prediction
Random forest has the advantage of building a high-accuracy prediction model with multiple classes. It can be used to fit categorical and continuous variables with both linear and non-linear relationships. The main concept of the random forest algorithm is ensemble learning: a group of decision trees ($T_k$) is built from bootstrap samples ($\mathcal{L}_k$), where $k = 1, \dots, K$ indexes the trees and $K$ is the total number of trees to be grown. Each learning sample is bootstrapped from the whole dataset ($\mathcal{L}$) with replacement. In addition, the input variables are randomly selected from the pool of all variables. This random process is the main reason why the method is called random forest.
In order to reduce bias, each tree is grown to the maximum depth with no pruning. Each tree classifier is independent of all previous classifiers. The final class of an observation is determined by the largest number of votes from all trees. The idea is that the averaged result from a group of tree models outperforms each individual tree. Although some trees might fit the dataset poorly, the limitations of those trees are averaged out by a large group of trees. There are two important tuning parameters in the random forest model. The first is the number of randomly selected variables (m) considered at each node split, which is often set to the square root of the total number of variables [17-18]. The second is the number of trees to be grown in the forest, for which 1,000 is a common choice. A sensitivity analysis showed that the model performance does not improve significantly after 200 trees. Although 200 trees could produce reasonable results, it was still decided to use 1,000 trees to improve model accuracy without a significant increase in model runtime. The generalization error for a random forest with $K$ trees is denoted as $PE_K$. As the number of fitted trees increases, the generalization error of a random forest ($PE_K$) converges to a limit, which is the main reason that the random forest does not easily overfit the dataset. Another advantage of the random forest model is that it gives a ranking of variable importance. The GINI index measures the impurity at each splitting node. The decrease in model accuracy in the absence of a variable is also used to measure variable importance. The randomForest package in the R statistical software was used to fit the model.
4.1 Model Accuracy
Different combinations of window sizes (3s, 5s, 10s, 20s, and 30s) and testing sample sizes (5, 10, 20, 30, 40, and 50 minutes) were examined in the analysis. The random forest prediction accuracy with different window sizes and testing driving durations is plotted in Figure 4. The five lines in Figure 4 indicate different window sizes. The x-axis shows the driving time in the testing dataset. The y-axis indicates the model prediction accuracy. In general, as the testing driving time increased, the model prediction accuracy increased. All five sliding windows reached 100% prediction accuracy with 40 minutes of testing data. With only 5 minutes of testing data, the 3s, 5s, and 10s sliding windows achieved almost 93% accuracy. Additionally, the 5-second sliding window was the quickest to reach 100% prediction accuracy, taking about 6 minutes. It took 40 minutes for the 30s window to reach 100% prediction accuracy. Therefore, it is concluded that a larger sliding window does not utilize driving data efficiently. Based on the window sizes we considered, it is recommended to use a 5-second sliding window to achieve higher prediction accuracy with the most efficient use of driving data. A better understanding of the optimal sliding window sizes and testing sample sizes has a direct impact on the real-world application of driver identification.
Figure 4. Model Prediction Accuracy with Different Sliding Windows and Testing Sample Sizes (x-axis: testing driving time in minutes, 5 to 50; y-axis: prediction accuracy, 80% to 100%; one line per window size: 3s, 5s, 10s, 20s, and 30s)
4.2 Voting Strategy for Driver Identification
The best prediction model, in the sense of needing the least testing data, is the 5-second sliding window with 6 minutes of testing data. This model reached 100% prediction accuracy with the least amount of testing data used. In this case, the random forest made about 70 predictions in the six-minute period. Some predictions were correct, while others wrongly assigned windows to other drivers. The majority rule was used to determine who was driving the vehicle in the 6-minute period. A confusion matrix, as shown in Table 3, is a helpful way to visualize the prediction results. The actual drivers and predicted drivers are shown as rows and columns respectively. The drivers were anonymized in this study using driver IDs A, B, C, …, Z, AA, AB, AC, and AD. If the drivers are correctly predicted as the actual driver, the largest number falls on the diagonal of the confusion matrix. The confusion table is color coded within each row: the highest number in each row is shown in green, and the lowest value is shown in white. As shown in the table, the largest number is always on the diagonal, which means the predicted driver is always the same as the actual driver. In summary, the overall driver identification model in this case (5-second window with 6 minutes of testing data) has 100% prediction accuracy.
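The majority rule described above can be sketched in a few lines. The function name `identify_driver` and the example vote counts are hypothetical. The window count is simple arithmetic: 6 minutes of non-overlapping 5-second windows gives 360 / 5 = 72 windows, consistent with the "about 70 predictions" mentioned in the text.

```python
from collections import Counter
from typing import List, Tuple


def identify_driver(window_predictions: List[str]) -> Tuple[str, int]:
    """Return the driver ID with the most per-window votes and its vote count.

    Ties are resolved in favor of the ID that was seen first, which is how
    Counter.most_common orders elements with equal counts.
    """
    counts = Counter(window_predictions)
    driver_id, votes = counts.most_common(1)[0]
    return driver_id, votes


# Non-overlapping 5-second windows in a 6-minute test segment.
n_windows = (6 * 60) // 5

# Hypothetical per-window predictions for one test segment.
predictions = ["A"] * 40 + ["B"] * 20 + ["C"] * 12
winner, winner_votes = identify_driver(predictions)
print(n_windows, winner, winner_votes)
```

Individual windows may be misclassified, but the segment-level decision is correct as long as the true driver collects more votes than any other single driver, which corresponds to the diagonal entry being the largest in its row of the confusion matrix.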
Table 3. Confusion Table between Actual Drivers (Rows) and Predicted Drivers (Columns).
Figure 5. Ranking of Variable Importance
4.3 Top CAN signals and Features
Random forest has the advantage of providing variable importance through two statistical measures: decreased accuracy and GINI value. The decreased accuracy is calculated by dropping one variable at a time while keeping all other variables the same. If the model performs worse without the dropped variable, the variable played an important role in classification and prediction. The other way to measure the importance of a variable is the GINI value, which indicates the impurity of the split with the given variable. A larger decrease in GINI value indicates higher importance of the variable. The top 10 variables were selected and plotted against model accuracy and GINI index in Figure 5. The most important variables in the random forest model are:
• Maximum Brake Pressure
• Mean Engine Speed
• Maximum Engine Torque
• Maximum Engine Speed
• Maximum Steering Value
• Mean Steering Speed
• Maximum Jerk
It is interesting to observe that the majority of the important variables are maximum statistics. This suggests that driver aggressiveness plays an important role in differentiating drivers from each other. Another observation worth noting is that three driving metrics measured driver inputs (maximum brake pressure, maximum steering value, and mean steering speed), whereas the rest of the metrics measured vehicle response. Inter-vehicle differences might therefore be captured in the random forest model. However, the differences in vehicle dynamics were assumed to be minimal, because all vehicles were Ford Fiesta vehicles between model years 2009 and 2015. It should be noted that the variable importance might also change slightly with the number of variables chosen at each node, the initial seed for randomization, and the number of bootstrap trees, but the changes in variable importance should be minimal.
5. Discussion
This study examined the possibility to use machine learning to identify drivers using vehicle telematics data. It has some important differences from previous research. The previous driver identification studies were conducted on either simulated environment or predefined driving routes. This driver identification study allowed drivers to perform their daily driving without any experimental intervention. The naturalistic driving setting best replicates the real-world driving scenarios. These real-world scenarios introduce a large quantity of noises into the model, which made the driver identification task more challenging. Additionally, this study conducted sensitivity analysis to test the influence of window sizes and testing sample sizes on model accuracy. It is important to understand the most efficient way to use the telematics data to achieve high prediction power. Last but not least, some previous studies merged driving data with roadway data, but this study predicted drivers using vehicle telematics data only. This study has a number of important contributions to the state-of-theart telematics data analysis. This study successfully set up an experiment to collect vehicle telematics data and transmit the high frequency data through cloud server. A number of data quality issues were identified from the dataset. For example, some vehicles lost brake pedal pressure data for a significant portion of driving data, but they were excluded from the dataset. Several imputation methods were proposed to properly impute the missing data. Driving metrics were proposed in this study to characterize different driving styles. Random forest model was used in this study to predict a driver from a group of drivers. The model was found to have high accuracy in identifying drivers from a group of 30 drivers. Among different sliding window sizes tested, the 5 seconds window size was found to utilize the vehicle data most efficiently. 
The sensitivity analysis of testing sample size also found that the accuracy reached 100% after merely 6 minutes into driving. In other words, it is possible to identify drivers with 100% accuracy after observing 6 minutes driving data given the potential complexity of the driving environments. One important limitation of this study is the quality of real-world vehicle telematics data. Some signals were missing from the vehicles for a significant time period. Some vehicles were excluded from the study because of the data quality issue. One major assumption made in this study was that the difference captured in the CAN signal were mainly caused by drivers’ behaviors, because the vehicle models were all controlled as Ford Fiesta vehicle. However, the different mileage and ages of vehicles might cause differences in vehicle dynamics. That means the random forest model might potentially pick up some differences between the vehicles. It is recommended to collect driving data from the same vehicle, so that the inter-vehicle differences are eliminated from the dataset.
Moreover, the random forest model in this study was implemented as an ad-hoc analysis. The model took about 5 minutes to run on a commercial laptop with 16 GB of RAM. Therefore, the current implementation of random forest does not support real-time driver identification. It is recommended to implement the algorithm on a cloud server to support streaming analysis of driver identification with large-scale datasets. Another limitation is that the random forest classifier used here predicts only discrete class labels. In addition, although the RF model gives good predictions, it is often treated as a black box with limited explanatory power. Future work will include differentiating drivers with unsupervised learning techniques using unlabeled data.
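To make the streaming setting recommended above more concrete, the sketch below accumulates window-level driver predictions and reports the running majority vote as new windows arrive. This is only an illustration: the paper does not state how per-window predictions were combined, so majority voting is an assumption, and the class name, function name, and driver labels are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 5  # matches the window size favoured in the study


class RunningDriverVote:
    """Accumulate per-window driver predictions and track the leader."""

    def __init__(self):
        self.votes = Counter()
        self.windows_seen = 0

    def update(self, predicted_driver_id):
        """Add one window-level prediction; return (leader, vote share)."""
        self.votes[predicted_driver_id] += 1
        self.windows_seen += 1
        leader, count = self.votes.most_common(1)[0]
        return leader, count / self.windows_seen


def windows_in(minutes, window_s=WINDOW_SECONDS):
    """Number of non-overlapping windows in a stretch of driving."""
    return int(minutes * 60 // window_s)


# Hypothetical stream of per-window predictions from a trained classifier.
vote = RunningDriverVote()
for pred in ["d07", "d12", "d07", "d07", "d03", "d07"]:
    leader, share = vote.update(pred)
print(leader, round(share, 2), windows_in(6))
```

With non-overlapping 5-second windows, 6 minutes of driving corresponds to 72 window-level predictions. Overlapping sliding windows would produce more, depending on the step size.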
6. Conclusions This paper presented a detailed framework for driver identification using machine learning. Vehicle telematics data were successfully collected using third-party PIDs with cloud transmission capability. Data quality issues were carefully examined, and imputation techniques were used to fill in missing values. Various driving metrics were developed to characterize different driving styles. A random forest model was used to predict drivers from a group of 30 drivers. The sensitivity analysis found that a 5-second sliding window used the data most efficiently. After training on about 4 hours of real-world naturalistic driving data, the random forest model with a 5-second sliding window was able to predict the driver with 100% accuracy within 6 minutes of driving. The features derived from the CAN and computed signals were ranked by their importance to identification accuracy. The majority of the important variables were maximum statistics, suggesting that driver aggressiveness is an intrinsic signature unique to each driver. The classification task in this study was an ad-hoc analysis and does not support real-time streaming analysis. In the future, this algorithm can be extended to a large-scale dataset with more drivers, and the possibility of differentiating drivers using unsupervised machine learning can be explored.
References
1. Leen, G., and Heffernan, D., “Expanding automotive electronic systems,” Computer, 2002, 35 (1), pp. 88-93.
2. Varghese, J.Z., and Boone, R.G., “Overview of Autonomous Vehicle Sensors and Systems,” Proceedings of the 2015 International Conference on Operations Excellence and Service Engineering.
3. Lyft and Uber Are ‘Allies’ in the Transit Revolution: http://time.com/4259615/lyft-uber-apta-mobility-study/
4. T. Wakita, K. Ozawa, C. Miyajima, K. Igarashi, I. Katunobu, K. Takeda, and F. Itakura. Driver identification using driving behavior signals. IEICE TRANSACTIONS on Information and Systems, 89(3):1188–1194, 2006.
5. X. Zhang, X. Zhao, and J. Rong. A study of individual characteristics of driving behavior based on hidden markov model. Sensors & Transducers (1726-5479), 167(3), 2014.
6. M. Van Ly, S. Martin, and M. M. Trivedi. Driver classification and driving style recognition using inertial sensors. In Intelligent Vehicles Symposium (IV), 2013 IEEE, pages 1040–1045. IEEE, 2013.
7. C. Miyajima, Y. Nishiwaki, K. Ozawa, T. Wakita, K. Itou, K. Takeda, and F. Itakura. Driver modeling based on driving behavior and its evaluation in driver identification. Proceedings of the IEEE, 95(2):427–437, 2007.
8. Y. Nishiwaki, K. Ozawa, T. Wakita, C. Miyajima, K. Itou, and K. Takeda. Driver identification based on spectral analysis of driving behavioral signals. In Advances for In-Vehicle and Mobile Systems, pages 25–34. Springer US, 2007.
9. Enev, M., Takakuwa, A., Koscher, K. and Kohno, T., 2016. Automobile Driver Fingerprinting. Proceedings on Privacy Enhancing Technologies, 2016(1), pp. 34-50.
10. “Ford Mobility Experiment Shows Drivers How Good They Really Are...And Could Save Them Money.” Fleetpoint.com. Last modified June 21st 2016, http://www.fleetpoint.org/fleet-industry-news/news-bydate/ford-mobility-experiment-shows-drivers-how-goodthey-really-are-and-could-save-them-money/
11. “Mobile Devices Ingenierie.” Mobiledevice., last modified September 1st 2016, http://www.mobile-devices.com
12. “Munic Box.” Munic Box. Last modified September 1st, 2016: https://www.munic.io
13. Wang, B. 2015. Modeling drivers’ naturalistic driving behavior on rural two-lane curves. Dissertation. Iowa State University.
14. Hallac, D., Sharang, A., Stahlmann, R., Lamprecht, A., Huber, M., Roehder, M., Sosic, R., and Leskovec, J., “Driver Identification Using Automobile Sensor Data from a Single Turn,” http://stanford.edu/~hallac/ITSC.pdf
15. Vakati, K., “Driver Telematics Analysis,” Master’s Project. San Jose State University, 2015.
16. Xu, L., Fujimura, K., “Real-time Driver Activity Recognition with Random Forests,” Proceeding of Automotive UI, 2014.
17. Breiman, L., Friedman, J., Olshen, R., Stone, C., “Classification and Regression Trees,” New York, Chapman & Hall, 1984.
18. Breiman, L., “Bagging Predictors,” Machine Learning, 1996, 24, pp. 123-140.
19. Ripley, B.D., “Pattern Recognition and Neural Networks,” Cambridge, Cambridge University Press, 1996.
20. Hastie, T., Tibshirani, R., Friedman, J., “The Elements of Statistical Learning,” New York: Springer; 2001.
Contact Information
Bo Wang
Research and Innovation Center, Ford Motor Company
2101 Village Road, Dearborn, MI 48121
Email: [email protected]
Tel: 515-509-3879
Acknowledgments We appreciate the involvement of all the drivers in this mobility experiment and are grateful for the planning and set-up of the experiment handled by Mr. Jonathan Scott and Mr. Robin Giles of Ford UK. The mobility experiment is sponsored by Ford Motor Company.