A Framework for Realistic Vehicular Network ...

A Framework for Realistic Vehicular Network Modeling using Planet-scale Public Webcams Gautam S. Thakur‡§ Pan Hui‡ Ahmed Helmy§ §

CISE, University of Florida, Gainesville, ‡ Deutsche Telekom Laboratories, Berlin

[email protected], [email protected], [email protected] ABSTRACT

primitives and mechanisms, including broadcast and geocast protocols[1]. Initial efforts to capture realistic vehicular density distributions were limited by availability of sensed vehicular data[16]. Hence, there is a need to collect and conduct vehicular density modeling using larger scale and more comprehensive datasets. Furthermore, commonly used assumptions, such as exponential distribution[15] of vehicular inter-arrival times[1], have been used to derive many theories and conduct several analyses, the validity of which bears further investigation. In this study, we provide a novel framework for the systematic monitoring, analysis and modeling of vehicular traffic density distributions at a large-scale. To avoid the limitations of sensed vehicular data, we instead utilize the existing global infrastructure of tens of thousands of video cameras providing a continuous stream of street images from dozens of cities around the world. Millions of images captured from publicly available traffic web cameras are processed using a novel density estimation algorithm to build an extensive measurement library of spatio-temporal vehicular traffic densities. We perform a comprehensive analysis of this data to characterize the underlying statistical patterns at individual intersections and highways of major cities. Briefly, the temporal correlations between consecutive hours of individual locations are nearly 80% correlated, but go down to 30% for a 3-4 hours lag difference. We also investigate the best distribution fitting by comparing the frequencies, observed in the empirical density distribution to the expected frequencies of the theoretical distribution. We discover that empirical values closely follow (less than 3% deviation on KS-test) ‘log-logistic’ and ‘gamma’ distributions. These results question the adequacy to apply exponential distribution for vehicular traffic modeling and can impact the design and evaluation of vehicular networks. The contributions of this work are: • To the best of our knowledge, we provide by far the largest and most extensive library for future

Realistic design and evaluation of vehicular mobility has been particularly challenging due to a lack of largescale real-world measurements in the research community. Current mobility models and simulators rely on artificial scenarios and use small and biased samples. To overcome these challenges, we introduce a novel framework for large-scale monitoring, analysis, and modeling of vehicular traffic using freely available online webcams. So far, we collected 125 million vehicular mobility records from 2,700 cameras located in ten different cities. Initial analysis of traffic densities show 80% temporal correlation during various hours of a day. The modeling of traffic densities against known theoretical distributions show less than 5% deviation for log-logistic and gamma distribution. This is surprising because the assumption of an exponential distribution for traffic modeling. To the best of our knowledge, this is also the largest dataset ever used. We believe this framework and the dataset provide a much-needed contribution to the research community for realistic and data-driven design and evaluation of vehicular networks.

1. INTRODUCTION Research in the area of vehicular networks has increased dramatically in the recent years. With the proliferation of mobile networking technologies and their integration with the automobile industry, various forms of vehicular networks are being realized. These networks include vehicleto-vehicle[3], vehicle-to-roadside[8], and vehicle-toroadside-to-vehicle architectures. Realistic modeling, simulation and informed design of such networks face several challenges, mainly due to the lack of large-scale community-wide libraries of vehicular data measurement, and representative models of vehicular mobility. Earlier studies in this area have clearly established a direct link between vehicular density distribution and the performance[4, 13] of vehicular networks 1

City Bangalore Beaufort Connecticut Georgia London London(BBC) New york Seattle Sydney Toronto Washington Total

vehicular network analysis. This potentially addresses a severe shortage of such datasets in the community. The library will be made available to the research community in the future. • A fast algorithm to efficiently process millions of images for novel traffic density estimation. • Establish log-logistic and gamma distributions as the most suitable fits for vehicular traffic density. We believe our framework helps ‘fills a gap’ between expected and realized necessity for the ‘design and evaluation of realistic and data-driven models’ for the future generations of vehicular networks. Next, in section 2, we discuss related work. In section 3, we propose framework and its related components like measurements and pre-processing, knowledge discovery and statistical modeling. In section 4, we briefly discuss impact and challenges on the vehicular networks. Finally, we conclude in section 5 and give insight into the future work.

# of Cameras 160 70 120 777 182 723 160 121 67 89 240 2709

Duration 30/Nov/10 - 01/Mar/11 30/Nov/10 - 01/Mar/11 21/Nov/10- 20/Jan/11 30/Nov/10 - 02/Feb/11 11/Oct/10 - 22/Nov/10 30/Nov/10 - 01/Mar/11 20/Oct/10 - 13/Jan/11 30/Nov/10 - 01/Mar/11 11/Oct/10 - 05/Dec/10 21/Nov/10 - 20/Jan/11 30/Nov/10 - 01/Mar/11 -

Records 2.8 million 24.2 million 7.2 million 32 million 1 million 20 million 26 million 8.2 million 2.0 million 1.8 million 5 million 125.2 million

Table 1: Global Webcam Datasets traffic for the purpose of inter-vehicle communications. Data collected from realistic scenarios shows the effectiveness of exponential model for highway vehicle traffic. Along the same lines, quantitative characteristics of vehicle arrival patterns on highways is studied in [10]. These findings enrich traffic modeling, but were carried out on very small sample of data and mainly localized to one or two locations. In our study, we use 42 days of vehicular imagery data from four cities to model the traffic and characterize its density distribution.

2. RELATED WORK

3. VEHICULAR MOBILITY FRAMEWORK

Large-scale mobility datasets are very important for the mobile networking and computing research communities, but collecting them is even more challenging and usually expensive [3, 7, 14]. In some cases, commercial vendors log number of vehicles, GPS coordinates, speed and movement traces. However, there are three downsides to it. First, these traces are not publicly available to the research community. Second, they contain only particular vehicles with vendor specific hardware. Third, they are from individual vehicles with short driving distances and in most cases with non-repetitive journeys. Invariably, these issues undermine the efficacy of using them for any kind of longitudinal analysis. In this paper, we propose an inexpensive method to collect global scale vehicular mobility traces using thousands of freely available webcams that provide continuous and fine-grained monitoring of the vehicular traffic. Simulation tools like CORSIM[6] and VISSIM[9] are geared to model specific scenarios for planning future traffic conditions on a micromobility and small scale level. But they lack tools to perform network analysis[14] . From a networking perspective, mobility models[5, 11] and routing[18] techniques investigate how mobility impact the performance of routing protocols[2]. If the mobility model is unrealistic then routing performance becomes questionable[12]. We need models inspired from real datasets- by way of this work, we believe a comprehensive set of parameters can be extracted to develop such models. In a recent work, Bai et.al [1] analyzed spatio-temporal variations in vehicular

In this section, we describe our proposed framework, shown in Fig.-1, which is comprised of three parts: (i) Measurements and pre-processing, (ii) Knowledge discovery, and (iii) Modeling and analysis. The measurements and pre-processing component (in green, fully covered) is responsible to capture imagery snapshots, sanitize data and generate a quantifiable value of vehicular traffic, hereafter known as traffic density(d). The knowledge discovery(orange, partially covered) focuses on applying data mining tools to extract traffic patterns, and spatio-temporal information. This activity can help to develop rich mobility scenarios. Next, the modeling and analysis component focus on characterizing the vehicular traffic densities. It can aid in designing and developing new data-driven vehicular mobility models and simulators. Next, we discuss these components.

3.1 Measurements and pre-processing There are thousands, if not millions, of outdoor cameras currently connected to the Internet, which are placed by governments, companies, conservation societies, national parks, universities, and private citizens. We view the connected global network of webcams as a highly versatile platform, enabling an untapped potential to monitor global trends, or changes in the flow of the city, and providing largescale data to realistically model vehicular, or even human mobility. We developed a crawler that collects vehicular mobility traces from these online webcams. A majority of these webcams are deployed by 2

Measurements and Pre-processing

Filtering Outliers Detection Traffic Density Estimation Normalization

Traffic Web Camera Images

(43.11, -75.59) (33.75, -84.38) (41.01,-73.64) (151.18, -33.86)

Knowledge Discovery

Modeling and Analysis

Spatial, Temporal and Spatio-Temporal Mining

Models

Clustering

Vehicular Traffic Modeling

Patterns and Rules

Vehicular Related Networks

Design and Evaluation

Prediction Models

Graphical Visualization

Mobility Models and Simulators

Media Image Storage Server

Location and KML Processing Classification and associations Database Server

Geo-Coordinates

Figure 1: A framework for monitoring, analysis and modeling of vehicular traffic densities.

Algorithm to Extract Traffic Information

(a) London

We aim to estimate traffic density(d) on roads considering the number of vehicles or pedestrians crossing the road. We have a sequence of images captured by webcams. Considering our problem, we have to be able to separate information we need, e.g. number of vehicles and pedestrians from the back ground image, which is normally road and buildings around. The main factor that can distinguish between vehicles and background image (road, buildings) is the fact that the vehicles are not in a stationary situation for a long period of time, however the back ground is stationary. The solution for the problem then seems to be applying a sort of high pass filtering over a sequence of images captured by a webcam over time. The high pass filter removes the stationary part of the images (road, buildings, etc.), and keeps the moving components (mainly vehicles). In order to implement such a high pass filter, we subtract result of a low pass filter over a sequence of images, from each still image. This is practically equivalent to implementing a high pass filter over sequence of images. In order to obtain low pass filtering effect, we run a moving average filter over a time sequence of images obtained from one webcam. The duration of the moving average filter can be adjusted in an adhoc way. The moving average filter is simply implemented by averaging over the intensity map for several images in a certain duration. At the output of the moving average filter, the intensity of each pixel is obtained by averaging intensity of corresponding pixels in the interval. The output of the moving average filter (low pass filter) is normally the required background image, which is still part of the image. Therefore, subtracting each image from the output of the low pass filter, gives us the moving components (e.g. vehicles). Having the

(b) Sydney

Figure 2: The red dots show the location of cameras in London and Sydney. city’s Department of Transportations (DoT). These web cameras are installed on traffic signal poles facing towards the roads of some prominent intersections throughout city and highways. At regular interval of time, they capture still pictures of on-going road traffic and send them in the form of feeds to the DoTs media server. For the purpose of this study, we made agreements with DoTs of 10 cities with large coverage to collect these vehicular imagery data for several months. We cover cities in North America, Europe, Asia, and Australia. Since, these cameras provide better imagery during the daytime, we limit our study to download and analyze them only during such hours. On average, we download 15 Gigabytes of imagery data per day from over 2700 traffic web cameras, with a overall dataset of 7.5 Terabytes containing around 125 million images. Table-3 gives a high level statistics of the dataset. Each city has a different number of deployed cameras and a different interval time to capture images. In Fig.-2, we show a geological snapshot of the cameras deployed in the city of London and Sydney. The area covered by the cameras in London is 950km2 and that in Sydney is 1500km2. Hence, we believe our study will be comprehensive and will reflect major trends in traffic movement. Next, we discuss the algorithm to extract traffic information from images. 3

1

CDF (% Cameras)

0.8 0.6 0.4 1−Hour 2−Hour 3−Hour 4−Hour

0.2 0

−0.2

0

0.2 0.4 Correlation Coefficient (ρ)

0.6

(a) d = 2023, 0.28 (b) d = 5400, 0.55 (c) d = 9230, 0.93

Figure 4: Traffic with varying densities[(a)low/(b)medium/(c)high] is shown. The first values is the result of background subtraction and later is the normalized value.

0.8

(a) London 1

CDF (% Cameras)

0.8

City 1st Best Fit 2nd Best Fit 3rd Best Fit Connecticut L[87%] G[11%] E[0.5%] London L[42%] G[39%] W[16%] Sydney L [62%] G[32%] N[2%] Toronto G[46%] W[31%] L[21%] E=Exponential. G=Gamma, L=Loglogistic, N= Normal, W=Weibull

0.6 1−Hour 2−Hour 3−Hour 4−Hour

0.4 0.2 0 −0.2

−0.1

0

0.1

0.2 0.3 0.4 0.5 Correlation Coefficient (ρ)

0.6

0.7

Table 2: Dominant distribution as Best Fits[By Ranking]

0.8

(b) Sydney

City Connecticut London Sydney Toronto

Figure 3: CDF showing correlation of traffic densities between hour differences of the day. high pass component of the image, the vehicles are highlighted from background. One could then use regular object detection techniques to identify and count number of vehicles in the high pass filtered image. However, this is computationally expensive and unnecessary. As an alternative, we simply count the number of active pixels (pixels with a value higher than a certain threshold). This is much faster than detecting and counting objects in an image. At the same time, it is more effective, because we are looking at the traffic densities (d), i.e. percentage of the street (road) which is covered by vehicles (as an indicator of how crowded is the street), rather than number of vehicles. Number of vehicles is not a good indicator of crowdedness, as a long vehicle may introduce more traffic than a small one. Second, our method overcomes the issues that object detection face in case of severe congestion. Counting number of active pixels can indicate what percentage of the road is covered, no matter how many vehicles are in the road. In many instances, images are duplicate, corrupted with zero sized or with extraneous bytes (noise). We use semi-supervised learning and hierarchical clustering to overcome the challenges of outliers’ detection and removal. Due to limited space, we omit the details.

63% L[62%], G[15%], W[3%] G[34%], L[34%], W[10%], N[0.5%] L[88%], G[61%], W[4%], N[2%] G[75%], W[58%], L[34%]

65% L[94%], G[44%], W[19%] L[82%], G[70%], W[47%], N[7%] L[98%], G[88%], W[44%], N[18%] G[94%], W[88%], L[87%], E[4%], N[1%]

Table 3: Dominant distributions as Best Fits [By % Deviation KS-Test.] day, from 7 AM to 6 PM. For example, we investigate what the correlation is between the traffic at 7 AM and 8 AM(1-hour lag), 1 PM and 3 PM(2-hour lag) etc. In Fig.-3, we show CDF for various hours lag of the day. For the city of Sydney the hourly traffic change is highly correlated, almost 80% of cameras’ next hour traffic is 70% correlated to its current hour. For next two hours from the current, the traffic for 80% of the cameras are only 50% or less correlated. And around 60% cameras have only 30% correlation for a time lag of 3-4 hours. While in case of the city of London, the next hour traffic density for 80% cameras is close to 60% correlated to the current hour. It goes further down to 30% for next two hours and around 15-20% for a 3-4 hour difference. Thus, vehicular traffic has temporal richness, which in-turn affects the mobility of vehicles and therefore, have an impact on the performance of routing protocols[2]. Similar trends are observed in other cities, but omitted here for brevity.

3.3 Modeling and Analysis Here, we focus on modeling empirical traffic densities against known theoretical distributions. The objective of this study is to help understand the underlying statistical patterns. Due to page limits, we only discuss the results from four cities (London, Sydney, Toronto, and Connecticut) with total 458 camera locations and 12 million images. Traffic density is an approximation of traffic on roads. This assumption is different from counting cars using loop

3.2 Knowledge Discovery We investigate correlation coefficients(ρ) to measure the degree to which traffic of a camera is linearly associated with itself for 42 days. In our case, we are using this to analyze the change in traffic densities. We analyze the correlations for 1-4 hour lags for each camera against itself during 12 hours of the 4

1

0.8

0.8

0.8

0.6 London Traffic Exponential Gamma Loglogistic Normal Weibull

0.4 0.2 0 0

0.1

0.2

0.3

0.4

0.5 Data

0.6

0.7

0.8

0.9

1.0

CDF (CI 95%)

1

CDF (CI 95%)

CDF (CI 95%)

1

0.6 London Traffic Exponential Gamma Loglogistic Normal Weibull

0.4 0.2 0 0

0.1

0.2

(a) London(L)

0.3

0.4

0.4 Data

0.6

0.7

0.8

0.9

0.6

London Traffic Exponential Gamma Loglogistic Normal Weibull

0.4 0.2

1.0

0 0

(b) London(M)

0.1

0.2

0.3

0.4

0.5 Data

0.6

0.7

0.8

0.9

(c) London(H)

Figure 5: CDF plot for three varying traffic intensities, Low(L), Medium(M) and High(H).

(a) Low Traffic

(b) High Traffic

(c) Periodic Traffic

(d) Random Traffic

Figure 6: Several variations in traffic densities across 42 days traffic monitoring is shown. Fig-(a) show relatively mild traffic during various hours of the day, while (b) show high traffic recording for the full trace periods. In Fig-(c) we find a regularity patterns during the morning and evening hours when the traffic is relatively higher than afternoon intervals. A random traffic characterization is recorded in the last. Avg. % [1st Best Fit]

100 80