WORKING PAPER
Map Inference Under Low-frequency GPS Data: A Shortest-Paths Based Approach
Wen Jin and Hai Jiang
Abstract—In the past decade, there has been a proliferation of studies that leverage GPS data reported by probe vehicles, such as taxicabs or shuttles, to infer the underlying road networks. Although existing map inference algorithms can provide satisfactory results under high-frequency GPS data, for example, GPS data collected every second, their performance deteriorates when the reporting frequency of the GPS data is low, for example, every 30 seconds. In this study, we propose a shortest-paths based approach to infer road networks under low-frequency GPS data. Our approach works with the GPS locations reported by the vehicles directly, rather than the trajectories formed by connecting consecutive GPS locations reported by each individual vehicle. We first perform an initial inference of the underlying road network through an iterative process involving a shortest-paths based algorithm. We then apply a series of map refinement techniques to further improve the appearance of the inferred road network. We compare the performance of the proposed algorithm against two popular trajectory-based algorithms in the literature. Using real low-frequency GPS data reported by taxicabs in Changsha, China, we find that the proposed algorithm outperforms the two trajectory-based algorithms by 19.66% and 17.53%, respectively, in terms of F-score. To further understand the influence of reporting frequency on the algorithms, we carry out a sensitivity analysis using a publicly available high-frequency GPS data set collected by commuter shuttles in Chicago, United States. We find that our algorithm is more robust with respect to the increase in reporting intervals.
Index Terms—GPS data, low frequency, map inference, shortest paths
I. INTRODUCTION
Accurate and up-to-date road maps are critical to route planning and vehicle navigation. In the past decade, there has been a proliferation of studies that leverage GPS data reported by probe vehicles, such as taxicabs or shuttles, to infer the underlying road networks. The inferred network is then used to update maps for navigation. Biagioni and Eriksson [1] provide a comprehensive survey on this topic and Ahmed et al. [2] cover the latest developments in this field. According to [1], map inference algorithms can be classified into three categories, that is, K-means clustering [3–5], kernel density estimation [6–9], and trace merging [10, 11]. Most, if not all, existing studies work with the GPS trajectories of the vehicles, which are formed by connecting consecutive GPS locations reported by each individual vehicle. For example, the algorithms from the K-means clustering category select a series of cluster centers in the GPS trajectories and apply K-means clustering to adjust their positions. The algorithms from the kernel density estimation category discretize the space covered by the GPS data into grid cells of pixels and calculate the lengths of GPS trajectories that fall within each cell. The algorithms from the trace merging category merge GPS trajectories to identify road networks.
Although existing map inference algorithms can provide satisfactory results under high-frequency GPS data, for example, GPS data collected every second, their performance deteriorates when the reporting frequency of the GPS data is low, for example, every 30 seconds. This is because existing algorithms are trajectory-based, that is, they rely on the assumption that the GPS trajectories of the vehicles track links in the actual road network well. Although this assumption is true when the reporting frequency of the GPS data is high, it no longer holds when the frequency of the GPS data is low. In Figure 1, we illustrate the trajectories of two probe vehicles, one of which reports high-frequency GPS data (Vehicle 1) while the other reports low-frequency GPS data (Vehicle 2). The gray grid represents the actual road network and the solid dots correspond to the locations where vehicles report their GPS data. We can see that the GPS trajectory of Vehicle 1 closely tracks the links in the actual road network. This, however, is not true for Vehicle 2, whose GPS trajectory deviates significantly from the links in the actual road network.
W. Jin and H. Jiang are with the Department of Industrial Engineering, Tsinghua University, Beijing 100084, China. e-mail:
[email protected] (H. Jiang) Manuscript received July 12, 2017.
Fig. 1: Illustrative examples for the GPS trajectories formed by high-frequency and low-frequency GPS data: Vehicle 1 reports high-frequency GPS data and its GPS trajectory closely follows the links in the actual road network. Vehicle 2 reports low-frequency GPS data and its GPS trajectory deviates from the links in the actual road network.
In this study, we propose a shortest-paths based approach to infer road networks under low-frequency GPS data. Our approach works with the GPS locations reported by the vehicles directly, rather than the trajectories formed by connecting consecutive GPS locations reported by each individual vehicle. Our approach first performs an initial inference of the underlying road network through an iterative process. In this process, we construct an auxiliary network by connecting the GPS locations reported by all vehicles as long as their Euclidean distances are within a certain threshold. In this auxiliary network, we apply a shortest-paths based algorithm to identify the subset of links that closely represent the actual road network. This subset of links forms the initially inferred road network. We then apply a series of map refinement techniques to further improve the appearance of the initially inferred road network. Specifically, we remove zigzags, merge nearby junction nodes, eliminate erroneous dead-end paths, and fine-tune the locations of nodes to improve the overall match between the inferred road network and the underlying GPS data. We compare the performance of the proposed algorithm against two popular trajectory-based algorithms in the literature. Using real low-frequency GPS data reported by taxicabs in Changsha, China, we find that the proposed algorithm outperforms existing algorithms and the average improvements relative to the two trajectory-based algorithms are 19.66% and 17.53% in terms of F-score. To further understand the influence of reporting frequency, we carry out a sensitivity analysis using a publicly available high-frequency GPS data set collected by commuter shuttles in Chicago, United States, where we systematically vary the reporting interval from 3.5 seconds to 70 seconds.
We find that our algorithm is quite robust with respect to the increase in reporting intervals in that its F-score stays around 0.5 while those for the base algorithms drop dramatically when the reporting interval exceeds 35 seconds. The remainder of this paper is organized as follows. In Section II, we introduce the details of the initial map inference algorithm. In Section III, we present the techniques that refine the inferred road network. In Section IV, we describe the data sets and the empirical evaluation of the proposed algorithm. Finally, we give our conclusions in Section V.
II. INITIAL MAP INFERENCE
In this section, we detail the procedure to perform initial map inference. We first create an auxiliary network G from the input GPS data. In the auxiliary network G, we develop a shortest-paths based algorithm that identifies a subset of links that approximates the actual road network. This subset of links forms the initially inferred road network M.
A. Construction of the auxiliary network
Let N be the set of nodes that correspond to the GPS locations reported by the probe vehicles. For node $i \in N$, let $x_i$ and $y_i$ be its longitude and latitude. Arc (i, j) is created between nodes i and j if their Euclidean distance $e_{ij}$ is below a threshold $e_{\mathrm{MAX}}$. The travel cost on arc (i, j) is denoted as $c_{ij} = e_{ij}^2$, which implies that if the
Euclidean distance between two nodes doubles, their travel cost grows by a factor of $2^2 = 4$. The benefit associated with this treatment is detailed later in Section II-D. Let A be the set of arcs created and G = (N, A). In Figure 2(a), we give an example of the auxiliary network. The black dots are the GPS locations reported by probe vehicles and they become the nodes in the auxiliary network. We connect two nodes if their Euclidean distance is below $e_{\mathrm{MAX}}$, which is set to 20 meters in this research.
B. The idea for the initial map inference algorithm
The initial map inference algorithm identifies a subset of links in G to approximate the underlying actual road network through an iterative process. At the beginning of the algorithm, or Iteration 0, the initially inferred road network M is empty. We arbitrarily select a node s in the auxiliary network as an origin node and, in each iteration of this process, we carry out the following three steps:
• Step 1: Identify $D_s$, which contains the nodes whose Euclidean distances from s are between R and R + ∆R, where R is the step size and ∆R is a small increment ($R \gg \Delta R$). Find the shortest paths from the starting node s to all nodes in $D_s$. For a node $t \in D_s$, let $\langle s, t \rangle$ refer to the shortest path from s to t. These paths are stored in our candidate path list L;
• Step 2: If L is empty, stop. Otherwise, for each path $\langle s, t \rangle \in L$, compute $P(\langle s, t \rangle \mid M)$, its potential to make a positive contribution to links already in M. A path with a high potential shall meet the following two criteria: (1) it has a high likelihood of being a correct inference of the links in the actual road network; and (2) it does not overlap with links already included in M. To avoid distracting the readers from understanding the idea for the initial map inference, the details associated with the calculation of a path's potential are deferred to Section II-C; and
• Step 3: Let $\langle s^*, t^* \rangle \in L$ be the path that has the highest potential. Remove $\langle s^*, t^* \rangle$ from L and add its links to M. For simplicity, the addition of links in path $\langle s^*, t^* \rangle$ to M is expressed as $M \leftarrow M \cup \{\langle s^*, t^* \rangle\}$, which means that $\langle s^*, t^* \rangle$ also refers to the set of links in this path. Let $t^*$ be the new starting node, that is, go back to Step 1 and treat node $t^*$ as node s.
Let us explain the above steps using the examples shown in Figures 2(b) through 2(f). The details are also presented in Table I, where Column 1 shows the iteration ID, Column 2 shows the origin node for each iteration, Column 3 shows the set of destination nodes whose shortest paths are to be found, Column 4 shows the elements in the candidate path list L, and Column 5 shows the path selected, that is, the path that has the highest potential to make a positive contribution to M. The last column shows the elements in M at the end of the iteration:
• Iteration 0: The candidate path list L and the inferred road network M are both set to empty sets;
• Iteration 1: We start from node s in Figure 2(b). Two circles whose radii are R and R + ∆R, respectively, are drawn to identify the set of destination nodes $D_s$. The area inside the dashed rectangle in Figure 2(b)
(a) The auxiliary network.
(b) Iteration 1: Node s is the origin node and two circles, with radii R and R + ∆R, are drawn to identify the set of destination nodes.
(c) Iteration 1: For origin node s, the set of destination nodes $D_s = \{t_1, t_2, t_3, t_4\}$ is identified and the shortest paths from s to them are found.
(d) Iteration 1: At the end of this iteration, the initially inferred road network M becomes $\{\langle s, t_1 \rangle\}$.
(e) Iteration 2: For origin node $t_1$, the set of destination nodes $D_{t_1} = \{u_1, u_2, \cdots, u_6, s\}$ is identified and the shortest paths from $t_1$ to them are found.
(f) Iteration 2: At the end of this iteration, the initially inferred road network M becomes $\{\langle s, t_1 \rangle, \langle t_1, u_4 \rangle\}$.
Fig. 2: An illustrative example for the initial map inference algorithm.
TABLE I: The first two iterations for the illustrative example shown in Figures 2(a) through 2(f)
Iteration | Origin node | Destination nodes | Candidate path list L | Selected path | Inferred network M
0 | - | - | ∅ | - | ∅
1 | s | $t_1, t_2, t_3, t_4$ | $\{\langle s, t_1 \rangle, \langle s, t_2 \rangle, \langle s, t_3 \rangle, \langle s, t_4 \rangle\}$ | $\langle s, t_1 \rangle$ | $\{\langle s, t_1 \rangle\}$
2 | $t_1$ | $u_1, u_2, u_3, u_4, u_5, u_6, s$ | $\{\langle s, t_2 \rangle, \langle s, t_3 \rangle, \langle s, t_4 \rangle, \langle t_1, u_1 \rangle, \langle t_1, u_2 \rangle, \langle t_1, u_3 \rangle, \langle t_1, u_4 \rangle, \langle t_1, u_5 \rangle, \langle t_1, u_6 \rangle, \langle t_1, s \rangle\}$ | $\langle t_1, u_4 \rangle$ | $\{\langle s, t_1 \rangle, \langle t_1, u_4 \rangle\}$
is further magnified in Figures 2(c) through 2(f), so that we can see the details clearly. In Figure 2(c), the set of destination nodes $D_s = \{t_1, t_2, t_3, t_4\}$ and the shortest paths to them, that is, $\langle s, t_1 \rangle$, $\langle s, t_2 \rangle$, $\langle s, t_3 \rangle$, and $\langle s, t_4 \rangle$, are shown. These four paths are stored in our candidate path list L, which now becomes $\{\langle s, t_1 \rangle, \langle s, t_2 \rangle, \langle s, t_3 \rangle, \langle s, t_4 \rangle\}$. Suppose that $\langle s, t_1 \rangle$ has the highest potential to make a positive contribution to M. We then remove $\langle s, t_1 \rangle$ from L and add links in this path to M, which becomes $\{\langle s, t_1 \rangle\}$. Figure 2(d) shows the appearance of M at the end of Iteration 1. Node $t_1$ is selected as our new origin node for the second iteration;
• Iteration 2: We start from $t_1$ in Figure 2(e). Two circles whose radii are R and R + ∆R, respectively, are drawn to identify the set of destination nodes $D_{t_1} = \{u_1, u_2, u_3, u_4, u_5, u_6, s\}$. The shortest paths between node $t_1$ and nodes in $D_{t_1}$ are also shown. These new paths are added to L. Note that path $\langle t_1, s \rangle$ is the reverse of $\langle s, t_1 \rangle$ found in the first iteration. Path $\langle t_1, s \rangle$ will have a very low potential to contribute to $M = \{\langle s, t_1 \rangle\}$ because its links completely overlap with links already in M. Similarly, path $\langle t_1, u_6 \rangle$ will not have a high potential either, because the majority of its links are already in M. Suppose that path $\langle t_1, u_4 \rangle$ has the highest potential. We then remove $\langle t_1, u_4 \rangle$ from L and add links in this path to M. Figure 2(f) shows the appearance of M at the end of Iteration 2. Node $u_4$ is then selected as the new origin node for the third iteration.
This iterative process continues until the candidate path list L becomes empty. An example of the output of the initial map inference algorithm is shown in Figure 3. The thick gray lines in the background correspond to the actual road network. We can see that the initially inferred road network roughly follows the underlying road network. However, since it is composed of a subset of links in the auxiliary network, there is considerable room for improvement: for example, the path connecting nodes B and C has many zigzags, the path between nodes C and F is a false inference of a dead end, and junctions A and B should be combined to form an intersection. In Section III, we present a series of techniques to further refine the initially inferred road network.
Fig. 3: The output of the initial map inference algorithm.
C. Calculating the potential of a path
In each iteration of the initial map inference algorithm, we need to compute the potential for each path in L to make a positive contribution to M. This is accomplished by a logistic regression model that predicts the likelihood of being a correct inference. The variable selection and coefficient estimation are detailed in a parallel study conducted by the authors [12]. For a given path $\langle s, t \rangle \in L$, we compute its potential $P(\langle s, t \rangle \mid M)$ by taking into consideration the following independent variables in the logistic regression model:
• $\mathrm{DENSITY}_{\langle s,t \rangle}$ is the density of the original GPS points in the area traversed by path $\langle s, t \rangle$. Let $\Phi_{\langle s,t \rangle}$ be the set of GPS points that are within 50 meters of path $\langle s, t \rangle$ and $\mathrm{NUM\_GPS}_{\langle s,t \rangle}$ be the size of $\Phi_{\langle s,t \rangle}$. Let $\mathrm{LENGTH}_{\langle s,t \rangle}$ be the length of path $\langle s, t \rangle$. $\mathrm{DENSITY}_{\langle s,t \rangle}$ is defined as the ratio between $\mathrm{NUM\_GPS}_{\langle s,t \rangle}$ and $\mathrm{LENGTH}_{\langle s,t \rangle}$. Intuitively, if a path traverses a neighborhood with a high GPS density, it is more likely to be a correctly inferred path;
• $\mathrm{DISPERSION}_{\langle s,t \rangle}$ is calculated as the standard deviation of the distances between the GPS points in $\Phi_{\langle s,t \rangle}$ and path $\langle s, t \rangle$. When the distances of the GPS points in $\Phi_{\langle s,t \rangle}$ to path $\langle s, t \rangle$ have a high standard deviation, these GPS points are scattered around the path and path $\langle s, t \rangle$ is prone to be erroneous; and
• $\mathrm{OVERLAP}_{\langle s,t \rangle}$ measures the difference between path $\langle s, t \rangle$ and links already in M. This is used to prevent the selection of a path that overlaps significantly with links already in M. For example, in Iteration 2 of Table I, path $\langle t_1, s \rangle$ overlaps with links already in $M = \{\langle s, t_1 \rangle\}$. It should then have very low potential to contribute to M.
In this research, we follow the procedure detailed in [12] to estimate the coefficients for the above three variables.
D. The choice of travel cost for an arc
When we construct the auxiliary network in Section II-A, we set the travel cost of arc (i, j) to $e_{ij}^2$, which means that the travel cost between two nodes grows much faster than the Euclidean distance between them. The advantage of doing so is that it encourages the shortest-paths algorithm to produce paths that traverse areas with dense nodes, or GPS data. This is particularly helpful around intersections. During Iteration 2 in the previous example, when we need to find the shortest path from $t_1$ to $u_4$ in Figure 2(e), the setup of the travel cost makes Path 2 more costly although its Euclidean distance is shorter than that of Path 1. Please refer to Figure 4 and Table II for details.
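The effect of the squared travel cost can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_auxiliary_network` and `shortest_path`, the `exponent` parameter, and the seven toy coordinates are assumptions made for illustration. The sketch builds an auxiliary network with the 20-meter threshold of Section II-A and compares the path found under squared costs with the one found under plain Euclidean costs.

```python
import heapq
import math


def build_auxiliary_network(points, e_max, exponent=2):
    """Connect every pair of points closer than e_max; arc cost = distance ** exponent."""
    adj = {i: {} for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            e_ij = math.dist(points[i], points[j])
            if e_ij < e_max:
                adj[i][j] = adj[j][i] = e_ij ** exponent
    return adj


def shortest_path(adj, s, t):
    """Plain Dijkstra; returns the node sequence of a cheapest s-t path (t must be reachable)."""
    dist, pred = {s: 0.0}, {s: None}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, c in adj[u].items():
            if d + c < dist.get(v, math.inf):
                dist[v], pred[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    path = []
    while t is not None:
        path.append(t)
        t = pred[t]
    return path[::-1]


# Hypothetical toy data (meters): nodes 0..5 form a dense, curved chain of GPS
# points, while node 6 is a single point on the straight line from node 0 to node 5.
points = [(0, 0), (7, 7), (14, 10), (22, 10), (29, 7), (36, 0), (18, 0)]

squared = shortest_path(build_auxiliary_network(points, 20.0, exponent=2), 0, 5)
plain = shortest_path(build_auxiliary_network(points, 20.0, exponent=1), 0, 5)
print(squared)  # [0, 1, 2, 3, 4, 5]: many short hops through the dense chain
print(plain)    # [0, 6, 5]: two long hops along the straight line
```

With squared costs the two 18-meter hops cost 648 in total, whereas the five short hops cost 376, so the longer but denser route wins, which mirrors the Path 1 versus Path 2 comparison in Table II.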
TABLE II: Path 1 has lower travel cost although its distance is longer.
Path | Travel cost $c_{ij}$ (meters²) | Distance $e_{ij}$ (meters) | Selected
Path 1 | 2411.9 | 194.3 | √
Path 2 | 6030.5 | 133.5 |
Fig. 4: The shortest path from $t_1$ to $u_4$ follows Path 1 because it has lower travel cost although its distance is longer than Path 2.
Algorithm 1: Initial Map Inference Algorithm
1   Initialization: s, M ← ∅, L ← ∅
2   begin
3     repeat
4       $D_s = \{t \mid R \le e_{st} \le R + \Delta R, \ \forall t \in N\}$
5       for $t \in D_s$ do
6         Find $\langle s, t \rangle$, the shortest path from s to t
7         $L \leftarrow L \cup \{\langle s, t \rangle\}$
8       end
9       if L ≠ ∅ then
10        for $\langle s', t' \rangle \in L$ do
11          Compute $P(\langle s', t' \rangle \mid M)$
12          if $P(\langle s', t' \rangle \mid M) < P_0$ then
13            $L \leftarrow L \setminus \{\langle s', t' \rangle\}$
14          end
15        end
16        $\langle s^*, t^* \rangle \leftarrow \arg\max_{\langle s', t' \rangle} P(\langle s', t' \rangle \mid M)$
17        $L \leftarrow L \setminus \{\langle s^*, t^* \rangle\}$
18        $M \leftarrow M \cup \{\langle s^*, t^* \rangle\}$
19        $s \leftarrow t^*$
20      end
21    until L = ∅;
22  end
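The repeat/until loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the graph representation (`adj[u] = {v: cost}`), the helper names, and in particular `new_link_share`, a hypothetical stand-in for the logistic-regression potential of Section II-C that simply returns the share of a path's links not yet in M, are all introduced for illustration. The sketch also guards the case in which pruning empties L before the argmax, which the pseudocode leaves implicit.

```python
import heapq
import math


def dijkstra(adj, s):
    """Predecessor map of cheapest paths from s; adj[u] = {v: cost}."""
    dist, pred = {s: 0.0}, {s: None}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, c in adj[u].items():
            if d + c < dist.get(v, math.inf):
                dist[v], pred[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    return pred


def path_from(pred, t):
    """Node sequence from the origin to t, as a hashable tuple."""
    path = []
    while t is not None:
        path.append(t)
        t = pred[t]
    return tuple(reversed(path))


def links_of(path):
    """Undirected links traversed by a path."""
    return {frozenset(e) for e in zip(path, path[1:])}


def new_link_share(path, M):
    """Hypothetical potential: share of the path's links not yet in M."""
    links = links_of(path)
    return len(links - M) / len(links)


def initial_map_inference(coords, adj, s, R, dR, potential=new_link_share, P0=0.5):
    M, L = set(), set()  # inferred links, candidate paths
    while True:
        pred = dijkstra(adj, s)
        for t in pred:  # nodes reachable from s
            if R <= math.dist(coords[s], coords[t]) <= R + dR:  # t is in D_s
                L.add(path_from(pred, t))
        L = {p for p in L if potential(p, M) >= P0}  # prune low-potential paths
        if L:
            best = max(L, key=lambda p: potential(p, M))
            L.remove(best)
            M |= links_of(best)
            s = best[-1]  # the end node of the selected path is the next origin
        if not L:
            return M


# Hypothetical toy network: seven nodes, 10 m apart on a line, neighbors connected
# with squared-distance costs; inference starts from the middle node.
coords = {i: (10.0 * i, 0.0) for i in range(7)}
adj = {i: {j: 100.0 for j in (i - 1, i + 1) if j in coords} for i in coords}
M = initial_map_inference(coords, adj, s=3, R=30.0, dR=5.0)
print(len(M))  # 6: both halves of the line are recovered
```

Because the potential is recomputed against the current M in every pass, a path that merely retraces links already in M (such as the reverse of a previously selected path) falls below P0 and is dropped, which is the behavior described for path $\langle t_1, s \rangle$ in Iteration 2.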
E. Algorithm Statement for Initial Map Inference
We now formally present the statement for the initial map inference algorithm in Algorithm 1. In each iteration, given s, we define $D_s$ on Line 4. On Lines 5 through 8, we find the shortest paths from s to all nodes in $D_s$ and add the shortest paths to L. If L is not empty, on Line 11, we compute $P(\langle s, t \rangle \mid M)$ for each path $\langle s, t \rangle \in L$. On Lines 12 through 14, we remove paths whose potential is below $P_0$ from L. This helps to prevent L from growing too large. We then select $\langle s^*, t^* \rangle$, the path with the highest potential in L, on Line 16, remove it from L on Line 17, and add it to M on Line 18. Finally, we set $t^*$ as the new starting node on Line 19. The algorithm stops when L becomes empty.
III. MAP REFINEMENT TECHNIQUES
In this section, we work on the initially inferred road network and apply a series of techniques to further improve its appearance. The goal is to improve the match between the initially inferred road network and the original GPS data. We first identify junction and dead-end nodes in M. We then remove the zigzags along the paths connecting adjacent junction or dead-end nodes. Afterwards, we merge junction nodes that are close to each other and remove erroneous dead-end nodes. Finally, we fine-tune the locations of junction and dead-end nodes to improve the match between the final inferred road network and the raw GPS data.
A. Identification of junction nodes and dead-end nodes
The initially inferred road network is composed of nodes and arcs in the auxiliary network. To facilitate the refinement of the network, we need to first identify junction and dead-end nodes in the network. These nodes and the paths connecting them form the skeleton of the inferred network. A junction node is defined as one whose degree is at least three. Such a node is likely to be a three-way or multi-way intersection in the actual road network. In Figure 3, nodes A, B, C, D, and E are examples of junction nodes. A dead-end node is defined as one whose degree is exactly one. A dead-end node corresponds to the terminal node of a dead end. In Figure 3, node F is an example of a dead-end node.
B. Removal of zigzags along paths connecting adjacent junction and dead-end nodes
Consider a path between two adjacent junction nodes in Figure 3, for example, the path between nodes B and C. We can observe that there are many zigzags along the path, which is not surprising because this path traverses quite a few nodes in the auxiliary network and is composed of tiny links. To remove the zigzags, we need to find a simplified curve traversing fewer nodes. This can be easily accomplished by the Douglas-Peucker algorithm as is suggested in [6]. Figure 5 shows the effects of the Douglas-Peucker algorithm on the inferred road network in Figure 3.
C. Merge of nearby junction nodes
In Figure 3, we can observe that in the initially inferred road network, a four-way intersection tends to be identified as two three-way junction nodes. We correct this by scanning through all junction nodes in the initially inferred network. For each junction node, we check if there is another junction node nearby. If so, we combine these two nodes into a single junction node. For example, junction nodes A and B in Figure 3 are merged into a new junction node G in Figure 6.
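For the zigzag removal of Section III-B, the Douglas-Peucker algorithm can be sketched as below. This is a textbook recursive version written for illustration, not the implementation used in the experiments; the function names, the planar (x, y) coordinates in meters, and the toy path are assumptions. The tolerance argument plays the role of the distance parameter θd.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p onto a-b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, theta_d):
    """Simplify a polyline, keeping only vertices farther than theta_d from the chord."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, d_max = max(
        ((i, point_segment_distance(points[i], a, b)) for i in range(1, len(points) - 1)),
        key=lambda item: item[1],
    )
    if d_max <= theta_d:
        return [a, b]  # every interior vertex is close enough to the chord
    left = douglas_peucker(points[: idx + 1], theta_d)
    right = douglas_peucker(points[idx:], theta_d)
    return left[:-1] + right  # drop the duplicated split vertex


# Hypothetical zigzag path between two junctions, followed by a real turn.
zigzag = [(0, 0), (10, 3), (20, -3), (30, 2), (40, 0), (40, 30)]
print(douglas_peucker(zigzag, theta_d=10))  # [(0, 0), (40, 0), (40, 30)]
```

In this toy example the small oscillations within 3 meters of the straight segment are removed, while the vertex at the turn, which lies 24 meters from the chord, is kept.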
Fig. 5: Removal of zigzags.
Fig. 7: Elimination of erroneous dead-end paths.
D. Elimination of erroneous dead-end paths
In Figure 3, we can notice that the initially inferred road network contains false inferences of dead-end roads. This is mainly due to the drifting of the GPS locations reported by the probe vehicles. We, therefore, apply the logistic regression model introduced in Section II-C to remove dead-end paths such as the path connecting nodes C and F.
E. Fine-tune the locations of nodes
After we apply the techniques described above, the number of nodes in the initially inferred road network is dramatically reduced. We call this new network $M'$. We now fine-tune the locations of the nodes in $M'$ to improve the match between the initially inferred road network and the underlying GPS data using an optimization model. For each node i in $M'$, we make minor adjustments to its coordinates $(x_i, y_i)$, so that the overall distance between all the GPS points and links in $M'$ is minimized. Note that when $(x_i, y_i)$ changes, the appearance of $M'$ also
Fig. 6: Merge nearby junction nodes.
Fig. 8: The final output of our map inference algorithm.
changes; therefore, the overall distance between all the GPS points and links in $M'$ changes. This is analogous to linear regression, except that our $M'$ contains a set of line segments. The final inferred road network is shown in Figure 8. We want to point out that due to GPS errors, not all intersections can be nicely inferred. For example, the intersection in the upper right corner of Figure 8 still contains diagonals.
IV. CASE STUDY
We use two GPS data sets to evaluate the performance of the proposed algorithm. The first data set is a real GPS data set collected from about 12,000 taxicabs in Changsha, China on July 1st, 2015. It covers an area of 12 km × 12 km and contains around 2 million GPS records. The average reporting interval is 38.7 seconds and the distribution of the reporting intervals is depicted in Figure 9(a). Since different taxicab companies install different GPS devices on their vehicles, we can see that the reporting intervals are concentrated on a few values.
To understand the influence of reporting frequency, we also test our algorithm using a public data set from Chicago, United States whose average reporting interval is about 3.5 seconds. This data set covers an area of 7 km × 4.5 km and contains around 120 thousand GPS records. The distribution of the GPS reporting intervals is presented in Figure 9(b). We then systematically increase the reporting interval from 3.5 seconds to 70 seconds and test the performance of the algorithm. The algorithms are implemented in Python and the numerical experiments are conducted on a workstation with an Intel i5 2.6 GHz processor and 8 GB of RAM.
A. Base algorithms
We benchmark our algorithm against two popular trajectory-based algorithms proposed by Davies et al. [6] and Cao and Krumm [10], respectively. For ease of discussion, these two algorithms are referred to as Algorithms A and B as follows:
• Algorithm A: Davies et al. [6] first discretize the map into grid cells of pixels and calculate the density of each cell based on the GPS trajectories or points within a cell. A binary image representation of the road network is then produced using a global threshold. Finally, the road centerlines are generated from this binary image; and
• Algorithm B: Cao and Krumm [10] first introduce a clarification step that merges GPS trajectories that are located near each other. Then new arcs are identified by incrementally inserting each trajectory into the inferred road network following local criteria such as distance and direction.
B. Evaluation criteria
Given a final inferred road network $M'$, we need to compare it against T, the actual road network, to evaluate the quality of the inference algorithm. Since both $M'$ and T are composed of nodes and links, it is not an easy task to measure their similarity/dissimilarity and produce the conventional precision and recall rates.
Inspired by the idea detailed in [1], we use the following procedure to compute the precision and recall rates: First create a grid mesh over the area covered by the inferred and actual road networks. The size of each cell in the grid mesh is 5 m × 5 m. Figures 10(a) and 10(b) show the inferred road network and the actual road network on the same grid mesh, respectively. Let S be the set of cells in the grid mesh. For a given cell $i \in S$, we assign two binary indicators. The first one is $m_i$, which takes value 1 if it is traversed by a link in $M'$; and 0, otherwise. The second one is $t_i$, which takes value 1 if it is traversed by a link in T; and 0, otherwise. We then calculate the precision and recall rates as follows.
$$\text{precision} = \frac{\sum_{i \in S} (m_i \times t_i)}{\sum_{i \in S} m_i} \tag{1}$$
$$\text{recall} = \frac{\sum_{i \in S} (m_i \times t_i)}{\sum_{i \in S} t_i}, \tag{2}$$
where $\sum_{i \in S} m_i$ is the number of cells traversed by $M'$, $\sum_{i \in S} t_i$ is the number of cells traversed by T, and $\sum_{i \in S} (m_i \times t_i)$ corresponds to the number of cells traversed by both $M'$ and T. The F-score is defined as the harmonic average of the precision and recall as follows:
$$\text{F-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{3}$$
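Equations (1) through (3) translate directly into a few lines of Python. The sketch below is illustrative only: the function name `grid_scores` and the representation of traversed cells as sets of (column, row) indices are assumptions, and the toy cell sets are constructed solely to reproduce the cell counts of the Figure 10 example (16 inferred cells, 15 actual cells, 14 shared cells). Exact fractions are used so that the results can be read off without rounding.

```python
from fractions import Fraction


def grid_scores(inferred_cells, actual_cells):
    """Precision, recall, and F-score from the sets of grid cells traversed
    by the inferred network and by the actual network (Equations (1)-(3))."""
    both = len(inferred_cells & actual_cells)          # sum of m_i * t_i
    precision = Fraction(both, len(inferred_cells))    # divide by sum of m_i
    recall = Fraction(both, len(actual_cells))         # divide by sum of t_i
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical cell sets with the same counts as the Figure 10 example.
inferred = {(i, 0) for i in range(16)}      # 16 cells traversed by the inferred network
actual = {(i, 0) for i in range(2, 17)}     # 15 cells traversed by the actual network
precision, recall, f_score = grid_scores(inferred, actual)
print(precision, recall, f_score)  # 7/8 14/15 28/31
```

Note that 7/8 is 14/16 in lowest terms; a production version would also need to handle the degenerate cases in which either network traverses no cell or the two networks share no cell.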
In the example shown in Figure 10, we have $\sum_{i \in S} m_i = 16$, $\sum_{i \in S} t_i = 15$, and $\sum_{i \in S} (m_i \times t_i) = 14$. According to the above equations, the precision rate of the inferred road network in Figure 10 is 14/16, the recall rate is 14/15, and the F-score is 28/31. Since the F-score is a combined performance measure of the precision and recall for the inferred road network, it is the main performance metric used in this paper.
C. Results for low-frequency GPS data in Changsha
We divide the entire region into 36 subareas measuring 2 km × 2 km and randomly split the 36 subareas into 24 training subareas and 12 test subareas. Then, we experiment with combinations of parameters on the training subareas to obtain parameters that lead to the highest F-scores, and use those trained parameters on the test subareas to benchmark the algorithms. The detailed results on the test subareas are summarized in Table III. Column 1 shows the IDs of the test subareas, where Subareas 1 through 7 are urban subareas and Subareas 8 through 12 are suburban subareas. Columns 2 through 5 show the precision, recall, F-score, and runtime for the test subareas using Algorithm A, Columns 6 through 9 show those statistics using Algorithm B, and Columns 10 through 13 show those using the proposed algorithm. The last two columns show the improvement in F-score for the proposed algorithm relative to Algorithms A and B. The last row in this table averages the results over all test subareas. From the last two columns of the table, we can see that on average, the proposed algorithm is superior to both Algorithms A and B and the average improvements relative to Algorithms A and B are 19.66% and 17.53%, respectively. We also show the runtime (in minutes) of the three algorithms in Table III. As much as possible, we use the same functions to perform common operations. The average runtime of Algorithm A is 1.0 minutes, and that of Algorithm B is 410.4 minutes. The proposed algorithm finishes in 1.1 minutes.
We observe that Algorithm A and the proposed algorithm have similar runtimes, while Algorithm B requires a significantly greater runtime. Figure 11 shows the road networks inferred by each algorithm in an urban subarea and a suburban subarea. Consistent with the findings in [6], Algorithm A is prone to produce zigzags when the centerline is taken. The inferred road network of Algorithm B contains a large number of extra lines, especially in Figure 11(c). The proposed algorithm performs better in both subareas in terms of F-score.
D. Parameter sensitivity
In this section, we test three parameters in the parameter sensitivity analysis, namely: (1) the step size R in the initial map inference algorithm, (2) the increment ∆R
(a) Changsha data set. (b) Chicago data set. (Both panels plot Proportion against Sampling interval (s).)
Fig. 9: The distribution of GPS reporting intervals.
(a) The inferred road network traverses 16 cells of the grid mesh.
(b) The actual road network traverses 15 cells of the grid mesh.
Fig. 10: Examples showing the calculation of precision, recall, and F-score using a grid mesh.
in the initial map inference algorithm (here, we consider the ratio of ∆R to R), and (3) the Douglas-Peucker algorithm’s distance parameter, θd. To test the effect of parameter variations, we change each parameter separately and run the proposed algorithm on all 36 subareas to calculate the average precision, recall, and F-score. When changing a parameter individually, the other two parameters each take three values (that is, high, medium, and low) to average out the effect of their values on the results. The results of the parameter sensitivity analysis are shown in Figure 12. The left side shows the change in the precision and recall, while the right side shows the change in the F-score. From Figures 12(a) and 12(b), we see that when the step size R is small, both the precision and the recall are high, resulting in an overall high F-score. As R becomes larger, the precision and recall gradually decrease, and the recall drops faster. From Figures 12(c) and 12(d), we see that the ratio of ∆R to R has little effect on the F-score. From the perspective of the algorithm design, this value determines the size of the destination node set. As long as ∆R is within a reasonable range, the destination set is neither empty nor too small, so that we can obtain a destination node as the next starting node. From Figures 12(e) and 12(f), we see that the performance of the algorithm decreases as the distance parameter of the Douglas-Peucker algorithm increases, i.e., the smaller the parameter, the higher the precision, recall, and F-score. When the parameter is 10 m, we obtain the best results. This may be because we use line segments to approximate road shapes, and this parameter determines how closely the curvature of the road is followed. When the parameter is larger, the inferred road is straighter and the road shape is captured less accurately, which degrades the performance of the algorithm.
E. Sensitivity to reporting frequency (Chicago data set)
In this section, we investigate the influence of GPS reporting frequency on the F-score. Since the reporting frequency of the GPS data in the Changsha data set is already relatively low, we use the Chicago data set, whose average reporting interval is 3.5 seconds. To simulate the reduction in reporting frequency, we resample this data set to keep 100%, 90%, ..., 20%, 10%, and 5% of the original GPS records. The corresponding average reporting intervals range from 3.5 to 70 seconds. We then benchmark the three algorithms and compare their performance. The changes in the precision, recall, and F-score values when the average reporting interval changes from 3.5 to 70 seconds are shown in Figure 13. The precision of Algorithm A is the highest when the reporting interval is small. Then, it begins to decline when the average reporting interval gets longer. The precision of Algorithm B is the lowest among the three algorithms. The precision of the proposed algorithm increases when the average
[Maps omitted. Each panel carries a 200 m scale bar.]
(a) The urban road network inferred by Algorithm A. (Precision=0.66, Recall=0.43, F-score=0.52)
(b) The suburban road network inferred by Algorithm A. (Precision=0.64, Recall=0.66, F-score=0.65)
(c) The urban road network inferred by Algorithm B. (Precision=0.61, Recall=0.40, F-score=0.48)
(d) The suburban road network inferred by Algorithm B. (Precision=0.58, Recall=0.74, F-score=0.65)
(e) The urban road network inferred by the proposed algorithm. (Precision=0.74, Recall=0.54, F-score=0.62)
(f) The suburban road network inferred by the proposed algorithm. (Precision=0.85, Recall=0.70, F-score=0.77)
Fig. 11: The performance of each algorithm in an urban subarea and a suburban subarea.
[Plots omitted. Horizontal axes: R (m), ∆R/R, and θd (m). Vertical axes: precision and recall in the left panels, F-score in the right panels.]
(a) Change in precision and recall when R increases from 50 to 500 m.
(b) Change in F-score when R increases from 50 to 500 m.
(c) Change in precision and recall when ∆R/R increases from 0.1 to 0.7.
(d) Change in F-score when ∆R/R increases from 0.1 to 0.7.
(e) Change in precision and recall when θd increases from 5 to 40 m.
(f) Change in F-score when θd increases from 5 to 40 m.
Fig. 12: Results of parameter sensitivity analysis.
reporting interval increases. The recalls of Algorithm A and the proposed algorithm decrease as the reporting interval increases, while that of Algorithm B increases. The recall of Algorithm B is always the highest, and that of Algorithm A is always the lowest. Overall, the proposed algorithm achieves the highest F-score. The F-scores of the other two algorithms decline throughout, and they drop significantly when the average reporting interval exceeds 12 seconds. Figure 14 shows the road networks inferred by the three algorithms when the average reporting interval is 35 seconds. Similar to the previous results, Algorithm A has a higher precision but misses some of the roads, while Algorithm B has a higher recall but produces dispersed lines. The proposed algorithm produces the best results of the three.

V. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

In this study, we propose a shortest-paths based approach to infer road networks under low-frequency GPS data. Our approach works with the GPS locations reported by the vehicles directly, rather than the trajectories formed by connecting consecutive GPS locations reported by each individual vehicle. Our approach first performs an initial inference of the underlying road network through an iterative process. In this process, we construct an auxiliary network by connecting the GPS locations reported by all
TABLE III: A summary of precision, recall, F-score statistics, and runtime (in minutes).

ID | Alg. A Precision | Alg. A Recall | Alg. A F-score | Alg. A Runtime | Alg. B Precision | Alg. B Recall | Alg. B F-score | Alg. B Runtime | This paper Precision | This paper Recall | This paper F-score | This paper Runtime | Improvement in F-score over Alg. A | Improvement in F-score over Alg. B
1 | 0.62 | 0.45 | 0.52 | 0.7 | 0.65 | 0.53 | 0.59 | 111.2 | 0.85 | 0.62 | 0.71 | 0.7 | 37.04% | 21.60%
2 | 0.66 | 0.43 | 0.52 | 1.1 | 0.61 | 0.40 | 0.48 | 228.7 | 0.74 | 0.54 | 0.62 | 1.8 | 20.05% | 28.90%
3 | 0.64 | 0.51 | 0.57 | 2.4 | 0.38 | 0.77 | 0.51 | 787.1 | 0.79 | 0.62 | 0.70 | 2.2 | 22.30% | 37.09%
4 | 0.75 | 0.52 | 0.61 | 0.6 | 0.56 | 0.69 | 0.62 | 94.1 | 0.86 | 0.51 | 0.64 | 0.6 | 5.22% | 3.77%
5 | 0.55 | 0.56 | 0.55 | 3.2 | 0.33 | 0.88 | 0.48 | 2807.6 | 0.75 | 0.68 | 0.71 | 2.8 | 28.50% | 47.76%
6 | 0.78 | 0.36 | 0.49 | 0.7 | 0.56 | 0.62 | 0.59 | 49.9 | 0.86 | 0.39 | 0.54 | 0.5 | 8.53% | -9.16%
7 | 0.72 | 0.30 | 0.43 | 1.3 | 0.47 | 0.64 | 0.54 | 560.0 | 0.83 | 0.41 | 0.55 | 1.1 | 28.22% | 1.11%
8 | 0.64 | 0.66 | 0.65 | 0.7 | 0.58 | 0.74 | 0.65 | 60.8 | 0.85 | 0.70 | 0.77 | 0.8 | 18.31% | 18.37%
9 | 0.72 | 0.44 | 0.54 | 1.1 | 0.66 | 0.48 | 0.56 | 214.0 | 0.87 | 0.48 | 0.61 | 1.5 | 13.28% | 10.02%
10 | 0.70 | 0.32 | 0.44 | 0.2 | 0.63 | 0.33 | 0.43 | 6.8 | 0.82 | 0.39 | 0.53 | 0.4 | 20.29% | 22.45%
11 | 0.78 | 0.34 | 0.47 | 0.2 | 0.81 | 0.24 | 0.37 | 3.3 | 0.85 | 0.40 | 0.54 | 0.3 | 14.14% | 45.53%
12 | 0.71 | 0.40 | 0.51 | 0.1 | 0.83 | 0.67 | 0.74 | 1.1 | 0.88 | 0.47 | 0.61 | 0.3 | 19.99% | -17.09%
Avg. | 0.69 | 0.44 | 0.53 | 1.0 | 0.59 | 0.58 | 0.55 | 410.4 | 0.83 | 0.50 | 0.63 | 1.1 | 19.66% | 17.53%

[Plots omitted. Horizontal axis: average sampling interval (s), with ticks at 3.5, 4, 4.5, 5, 6, 7, 9, 12, 18, 35, and 70. Curves: this paper, Alg. A, and Alg. B.]
(a) Precision.
(b) Recall.
(c) F-score.
Fig. 13: Variation in each statistic when the average reporting interval increases from 3.5 to 70 s.
vehicles as long as their Euclidean distances are within a certain threshold. In this auxiliary network, we apply a shortest-paths based algorithm to identify the subset of links that closely represents the actual road network. This subset of links forms the initially inferred road network. We then apply a series of map refinement techniques to further improve the appearance of the initially inferred road network. Specifically, we remove zigzags, merge nearby junction nodes, eliminate erroneous dead-end paths, and fine-tune the locations of nodes to improve the overall match between the inferred road network and the underlying GPS data. We compare the performance of the proposed algorithm against two popular trajectory-based algorithms in the literature. Using real low-frequency GPS data reported by taxicabs in Changsha, China, we find
that the proposed algorithm outperforms existing algorithms and the average improvements relative to the two trajectory-based algorithms are 19.66% and 17.53% in terms of F-score. To further understand the influence of reporting frequency, we carry out sensitivity analysis using publicly available high-frequency GPS data collected by commuter shuttles in Chicago, United States, where we systematically vary the reporting interval from 3.5 seconds to 70 seconds. We find that our algorithm is quite robust with respect to the increase in reporting intervals in that its F-score stays around 0.5, while those of the baseline algorithms drop dramatically when the reporting interval exceeds 35 seconds.

We outline possible future research directions in this field as follows: (1) In this research, we use line segments to approximate actual roads. However, it may be beneficial to include curve functions, which can be useful for approximating ramps or cloverleafs; (2) Existing map inference algorithms focus on the inference of city streets. The next step is to develop algorithms that can infer more complex road infrastructure such as parking lots, overpasses, and cloverleafs; (3) Drifting of GPS locations is quite common in GPS data sets. At this moment, existing algorithms have not explicitly investigated this issue. It would be of interest to systematically study the impact of GPS error and develop appropriate solutions accordingly; and (4) In our algorithm, although we try to fine-tune the locations of junction nodes in Section III-E, not all intersections can be nicely inferred. For example, the intersection in the upper right corner of Figure 8 still contains diagonals. It would be good to develop better map refinement techniques to address this issue.

[Maps omitted. Each panel carries a 400 m scale bar.]
(a) The road network inferred by Algorithm A. (Precision=0.58, Recall=0.28, F-score=0.38)
(b) The road network inferred by Algorithm B. (Precision=0.22, Recall=0.84, F-score=0.35)
(c) The road network inferred by the proposed algorithm. (Precision=0.74, Recall=0.43, F-score=0.55)
Fig. 14: The performance of each algorithm using the Chicago data set when the average reporting interval is 35 seconds.

APPENDIX

We summarize key notation used in this paper as follows:

$G = (N, A)$: The auxiliary network created from the GPS points, where $N$ is the set of nodes and $A$ is the set of arcs.
$x_i, y_i$: Coordinates of node $i \in N$.
$e_{ij}$: Euclidean distance of arc $(i, j) \in A$.
$c_{ij}$: Cost of arc $(i, j) \in A$, defined as $c_{ij} = e_{ij}^2$.
$D_s$: The set of destination nodes given a starting node $s$.
$\langle s, d \rangle$: The shortest path from $s$ to $d$.
$M$: The initially inferred road network.
$M'$: The final inferred road network, which is obtained by improving $M$.
$T$: The actual road network.
$P(\langle s, d \rangle \mid M)$: The potential contribution of path $\langle s, d \rangle$ to $M$.
$e_{MAX}$: Upper limit of the distance between two nodes in $G$.
$R$: Step size in the initial map inference algorithm.
$\Delta R$: Increment in the initial map inference algorithm.
$\theta_d$: Parameter for the Douglas-Peucker algorithm.
$S$: The set of cells in the grid mesh.
$m_i$: For cell $i \in S$, $m_i$ equals 1 if this cell is traversed by $M'$, and zero otherwise.
$t_i$: For cell $i \in S$, $t_i$ equals 1 if this cell is traversed by $T$, and zero otherwise.
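To make the notation above concrete, the following Python sketch builds a small auxiliary network G, assigns each arc the squared-distance cost c_ij = e_ij^2, finds a shortest path with Dijkstra's algorithm, and computes precision, recall, and F-score from the cell indicators m_i and t_i. It is a minimal illustration, not the implementation used in this paper. The function names are hypothetical, and the precision and recall formulas are the usual cell-overlap ratios, which we assume here rather than take from the definitions given elsewhere in the paper.

```python
import heapq
import math


def build_auxiliary_network(points, e_max):
    """Connect every pair of GPS points whose distance is at most e_max.

    Each arc (i, j) gets cost c_ij = e_ij ** 2, as in the notation list.
    The brute-force pairwise loop is an assumption made for brevity.
    """
    arcs = {i: [] for i in range(len(points))}
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            e_ij = math.hypot(xi - xj, yi - yj)
            if e_ij <= e_max:
                c_ij = e_ij ** 2
                arcs[i].append((j, c_ij))
                arcs[j].append((i, c_ij))
    return arcs


def shortest_path(arcs, s, d):
    """Dijkstra's algorithm; returns the node list from s to d, or None."""
    dist = {s: 0.0}
    prev = {}
    heap = [(0.0, s)]
    done = set()
    while heap:
        cost, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == d:
            break
        for v, c in arcs[u]:
            new_cost = cost + c
            if new_cost < dist.get(v, math.inf):
                dist[v] = new_cost
                prev[v] = u
                heapq.heappush(heap, (new_cost, v))
    if d not in dist:
        return None
    path = [d]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return path[::-1]


def grid_scores(m, t):
    """Precision, recall, and F-score from 0/1 cell indicators m_i and t_i.

    Assumed definitions: precision = sum(m_i * t_i) / sum(m_i) and
    recall = sum(m_i * t_i) / sum(t_i).
    """
    matched = sum(mi * ti for mi, ti in zip(m, t))
    precision = matched / sum(m) if sum(m) else 0.0
    recall = matched / sum(t) if sum(t) else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Four points along a straight road plus one off-road point (index 4).
    points = [(0, 0), (10, 0), (20, 0), (30, 0), (15, 12)]
    arcs = build_auxiliary_network(points, e_max=25)
    print(shortest_path(arcs, 0, 3))  # [0, 1, 2, 3]
    print(grid_scores([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

In this toy example, the squared-distance cost makes the path through three 10-unit hops (total cost 300) cheaper than either two-hop alternative along the road (cost 500) or the detour through the off-road point (cost 738), so the shortest path follows the densely sampled points. With plain Euclidean cost, the on-road alternatives would tie at 30.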
Wen Jin is a PhD student in the Department of Industrial Engineering at Tsinghua University. She received her BS from Tsinghua University. Her research interests include transportation system analysis and travel behavior modeling.
Hai Jiang is an Associate Professor in the Department of Industrial Engineering at Tsinghua University. His teaching and research interests involve advanced consumer behavior models and system optimization methods, as well as their applications in transportation, e-commerce, and urban studies. Prior to joining Tsinghua University, he worked in the corporate research group of Sabre Holdings (parent company of Travelocity.com) in Dallas, TX. He holds a doctoral degree from MIT and a bachelor's degree from Tsinghua University.