International Journal of Geographical Information Science, 2015 http://dx.doi.org/10.1080/13658816.2014.981191
Split-Match-Aggregate (SMA) algorithm: integrating sidewalk data with transportation network data in GIS
Bumjoon Kang^a*, Jason Y. Scully^b, Orion Stewart^c, Philip M. Hurvitz^b and Anne V. Moudon^b
Downloaded by ["University at Buffalo Libraries"] at 08:48 27 April 2015
^a Department of Urban and Regional Planning, University at Buffalo, the State University of New York, Buffalo, USA; ^b Urban Form Lab and the Department of Urban Design and Planning, University of Washington, Seattle, USA; ^c Urban Form Lab and the Department of Epidemiology, University of Washington, Seattle, USA
(Received 30 December 2013; final version received 15 October 2014)
Sidewalk geodata are essential to understand walking behavior. However, such geodata are scarce and are available only at the local jurisdiction level, not at the regional level. If they exist, the data are stored in geometric representational formats without network characteristics such as sidewalk connectivity and completeness. This article presents the Split-Match-Aggregate (SMA) algorithm, which automatically conflates sidewalk information from secondary geometric sidewalk data to existing street network data. The algorithm uses three parameters to determine geometric relationships between sidewalk and street segments: the distance between street and sidewalk segments; the angle between sidewalk and street segments; and the difference between the lengths of matched sidewalk and street segments. The SMA algorithm was applied in urban King County, WA, to 13 jurisdictions’ secondary sidewalk geodata. Parameter values were determined based on agreement rates between results obtained from 72 pre-specified parameter combinations and those of a trained geographic information systems (GIS) analyst using a randomly selected 5% of the 79,928 street segments as a parameter-development sample. The algorithm performed best when the distances between sidewalk and street segments were 12 m or less, their angles were 25° or less, and the tolerance was set to 18 m, showing an excellent agreement rate of 96.5%. The SMA algorithm was applied to classify sidewalks in the entire study area and it successfully updated sidewalk coverage information on the existing regional-level street network data.
The algorithm can be applied for conflating attributes between associated, but geometrically misaligned line data sets in GIS. Keywords: sidewalk; polyline conflation; algorithm; pedestrian network data; GIS
*Corresponding author. Email: [email protected]
© 2015 Taylor & Francis
1. Introduction Sidewalks are the primary piece of infrastructure for pedestrian travel. Quality sidewalk data are thus crucial to understand walking travel behaviors. The relationships between the presence or absence of sidewalks and people’s choice of walking as a travel mode have been studied in the fields of public health (Saelens et al. 2003, Reed et al. 2006, Saelens and Handy 2008, McCormack et al. 2012) and urban planning and transportation (Kitamura et al. 1997, Rodriguez and Joo 2004, Lee and Moudon 2006a, Ewing and Cervero 2010). However, research has been hindered by the limited availability of sidewalk geodata, with some studies using on-site audits to collect sidewalk data, such as in
Cervero and Kockelman (1997). Sidewalk measures are typically aggregated at the area level, capturing line density (Lee and Moudon 2006b, McCormack et al. 2012) or ratios of sidewalk to roadway lengths (Cervero 2002) within home census tracts or within home-based buffers. Though they describe local sidewalk availability, these simple measures cannot be used for high-resolution analyses at the trip or route level or for origin-and-destination studies (Handy et al. 2002). The topologic network features of sidewalks are needed for walking mode choice modeling (Frank and Engelke 2001, Chin et al. 2008). Recent studies have found that route-level sidewalk completeness and connectivity were among the most significant factors in pedestrian route choice (Gallimore et al. 2011, Rodriguez et al. 2014).
Network-based sidewalk coverage data are necessary to better understand, analyze, and model walking behaviors. Such data need to be fine-grained, preferably at the street segment level, yet available for the large geographic extents of contemporary metropolitan areas.
Navigation systems such as NAVTEQ and TomTom, which are commercially available, include information on the parts of the transportation networks that can be used by pedestrians (Neis and Zielstra 2014). The ESRI StreetMap Premium for ArcGIS data set, based on NAVTEQ’s digital map, allows users to calculate walking routes between given origins and destinations using a walking mode restriction option with the ArcGIS Network Analyst extension (NAVTEQ 2012). Each road network segment in the data set is labeled as traversable or not by walking mode. However, no further information is provided on the characteristics of segments (e.g., presence or completeness of the sidewalk). Aside from these limitations, proprietary data sets are expensive to acquire and must be used under the terms of the licensing agreement, preventing users from editing or customizing the data.
As such, they have limited applicability in research and modeling.
Public sidewalk geodata are typically only available from jurisdictions that have large numbers of pedestrians or from those that have the resources to develop and maintain comprehensive data on their transportation infrastructure. Compiled primarily for transportation asset management purposes, these data are stored in GIS vector data sets whose records indicate the locations and characteristics of sidewalks. Data formats differ, with sidewalks being represented as single or double lines (on either side of a street), or as polygons that delineate not only sidewalks but all impervious areas (i.e. paved surfaces). Data attributes are few, typically restricted to the length of each polyline or the area of the polygon (Moudon et al. 2013). Most importantly, individual records in sidewalk geodata are not referenced to street network geodata, which means that it is not possible to determine automatically where sidewalks do or do not line streets without further data processing. The lack of reference to street networks and the lack of standardization over the large spatial extents of metropolitan regions limit the usability of these public data.
Some researchers have carried out environmental audits to manually develop network-based sidewalk data. Studies using these data are few and limited to small, neighborhood-sized areas (Gallimore et al. 2011, Rodriguez et al. 2014). Recent remote audit approaches using Google Earth mapping software can facilitate the task, but they too require tedious visual examination of each road segment (Janssen and Rosu 2012). Regardless of the method used, building objective and standardized network-based sidewalk data at the regional level is onerous work.
We developed a procedure, termed the Split-Match-Aggregate (SMA) algorithm, which takes advantage of multiple existing public sidewalk data sets and standardizes the data at the metropolitan region level.
The algorithm conflates sidewalk coverage information available from several local jurisdictions to an existing regional street network data set.
The algorithm results in a street network attribute table updated with sidewalk coverage information for each side of the street, for each network segment.
2. Problem and strategy There are many existing polyline conflation methods that mostly use proximity and geometric similarity to detect matching correspondences (Yuan and Tao 1999, Savary and Zeitouni 2005). The proposed algorithm is also designed to use the same approach and to match street and sidewalk geometries based on their proximity and angle. However, there are unique challenges for sidewalks. Unlike other polyline conflation methods developed for matching pairs of different polylines representing the same objects in the real world (Yuan and Tao 1999), street and sidewalk geometries are not the same objects, although they are spatially associated. Thus, the proposed SMA algorithm has to solve problems resulting from spatial discrepancies, with the two key problems being described next.
The first problem is that multiple-cardinality relationships exist between matching pairs. A street polyline can be matched to multiple sidewalk polylines. For example, the City of Seattle’s sidewalk geodata depict sidewalk presence as polylines about 10 m from the left or right sides of street center polylines. It is common to find one long street centerline segment between two intersections, while its associated sidewalks have multiple segments of varying lengths, due to sidewalk conditions and source data quality (e.g., curb cuts, inconsistent coding protocols used by jurisdictions). The irregularity in shape, length, and angle makes it difficult to determine which portion of a sidewalk segment is associated with which of many potentially matching nearby street segments.
The second problem is the curvilinearity of polylines. Because sidewalk polylines are typically located on each side of a street polyline, in the case of curvilinear shapes, the radius of curvature of the sidewalk polyline differs from the radius of curvature of the corresponding street segment.
Furthermore, the digitization process of a single curvilinear line segment typically uses multiple linear sub-segments of different lengths and angles (Figure 1).
Figure 1. Street segments Sk (S1, S2) with the corresponding sidewalk segments Wi (W1, W2, W3).
To match the street and sidewalk polylines, which have such complex relationships, the two aforementioned problems should be considered simultaneously. Commonly used conflation methods do not address multiple-cardinality relationships or handle curvilinear polylines well because they conflate data at the segment or polyline level. For example, point-based approaches match pairs of start and end points between two polylines for conflation at the polyline level (Gabay and Doytsher 1994, Cobb et al. 1998) and line-based approaches match multiple within-polyline breakpoints between two polylines for conflation at the segment level (Doytsher et al. 2001, Haunert 2005, López-Vázquez and Manso Callejo 2013). The Buffer Growing algorithm was developed to address issues of multiple-cardinality (Walter and Fritsch 1999). It screens correspondence candidates within a buffer around a starting polyline, then the buffer is enlarged around the next polyline until no further correspondence candidates are detected. As the buffer grows, matching result information is aggregated to multiple polylines. The Buffer Growing approach still matches at the polyline level, which was found to be too coarse for application to street and sidewalk data.
The proposed SMA algorithm attempts to simultaneously solve the problems of multiple-cardinality and curvilinearity through conflation at the sub-segment level. It uses a three-step process of splitting, matching, and aggregating. First, source street and sidewalk polylines are split into short sub-segments of straight or ‘nearly straight’ lines, thus solving the curvilinearity problem. Second, matching between sidewalks and streets is assessed at the sub-segment level, based on two geometric properties: proximity and angle. Proximity considers the distance between the street and the sidewalk sub-segments, while angle takes the difference in the directions of the street and the sidewalk sub-segments into account.
Matching at the sub-segment level addresses the multiple-cardinality problem because lengths of sub-segments are standardized. Third, sub-segments need to be aggregated back to the original street network segments. For each side of each street segment, the length portion of the associated sidewalks is calculated by summing matched sidewalk sub-segment lengths. Finally, a tolerance addresses minor dimensional mismatches in cases where sidewalk segments are discontinuous (e.g., at curb cuts or at intersections). These steps provide more precise matching results than other conflation approaches because they use a much finer matching unit, after which matched results are aggregated back to the original street segments. This strategy of splitting, matching, and aggregating is a substantial new contribution to existing GIS conflation methods.
3. Method The SMA algorithm was developed to create standardized sidewalk data for the street network within a study site located inside the King County Urban Growth Area (UGA), the urbanized area in the western part of Washington State, US, which includes the City of Seattle. The following describes the SMA algorithm and parameter selection process.
3.1. Source data Sidewalk data were available in polyline GIS formats and collected from jurisdictions and a planning agency between 2008 and 2011. The study area covered 13 jurisdictions within the King County UGA. It included the county’s three largest cities (Seattle, Bellevue, and Kent), and covered 681 km2, or more than 50% of the county’s UGA. Sidewalk GIS data for four of the jurisdictions were obtained from the Puget Sound Regional Council
(PSRC), the region’s Metropolitan Planning Organization, in October 2011. Eight jurisdictions had sidewalk data in a double-line format, representing sidewalks on the left or the right side of street centerlines, while five jurisdictions used a single-line format, in which sidewalk attributes were attached to a segment located at the street centerline. King County Metro Transportation Network (TNET) data were selected as the master street network data set to which all local jurisdictions’ sidewalk data were transferred. Downloaded from the King County GIS Center in December 2010, it contained 79,928 segments for a total of 8,430 km of streets within the study area.
3.2. Standardizing sidewalk source data The SMA algorithm could be directly applied to data available in double-line format (eight jurisdictions provided such data). Data available in a single-line format had to be transformed into a double-line format (five jurisdictions provided such data) first and processed afterwards. This was done by offsetting sidewalk geometries to the appropriate side of street centerlines by 6 m, using the Copy Parallel function in ArcMap 10. The 6-m offset distance was selected to represent the minimum distance from the street centerline to the edge of typical 2- or 3-lane urban arterials or collectors, as determined by local road standards (Haff 1993). The final standardized sidewalk input data set used to apply the SMA algorithm had 722,276 sidewalk segments.
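The single-line to double-line conversion can be illustrated for one straight segment. The following is a minimal sketch, not the ArcMap Copy Parallel function used in the study: the function name `offset_segment` is hypothetical, coordinates are assumed to be in a projected system with units of meters, and "right" is taken relative to the digitized direction of the centerline.

```python
import math

def offset_segment(p0, p1, distance=6.0, side="right"):
    """Shift a straight segment sideways by `distance` (same units as the
    coordinates). Hypothetical helper for illustration only."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("zero-length segment has no direction")
    # Unit normal pointing to the right of the direction of travel.
    nx, ny = dy / length, -dx / length
    if side == "left":
        nx, ny = -nx, -ny
    return ((x0 + nx * distance, y0 + ny * distance),
            (x1 + nx * distance, y1 + ny * distance))

# A centerline heading east: its right side lies to the south.
print(offset_segment((0, 0), (10, 0), 6.0, "right"))
```

A full polyline offset also has to handle the joins between consecutive segments, which this sketch deliberately leaves out.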
3.3. The Split-Match-Aggregate (SMA) algorithm The SMA algorithm estimates sidewalk coverage, separately for the left and right sides of a street segment, by calculating the ratio of the total length of sidewalks to the length of the street segment. Algorithm 1 describes the SMA process for estimating the right side sidewalk information of a street segment S, along with a graphical example. The algorithm consists of three steps. Step 1 describes the split procedure for making short sub-segments. Step 2, consisting of sub-steps 2a to 2e, explains the match procedure, which calculates the lengths of associated sidewalk sub-segments. Step 3, consisting of sub-steps 3a to 3c, is the aggregate procedure, which produces the segment-level sidewalk classifications. The algorithm itself is iteratively applied to each of the street segments in the data set. An identical procedure is applied for the left sidewalk calculation, but with a buffer created to the left rather than to the right in Step 2a.
The sidewalk coverage is determined as the ratio of the total sidewalk segment length SDW, or its tolerance-added length SDWT, to the street segment length STR (Step 3c). It is calculated separately for right sidewalks SWKright and left sidewalks SWKleft, and recoded as full coverage, partial coverage, or no coverage. The variable SDWT = SDW + T, which represents the total sidewalk length SDW increased by a pre-specified tolerance value T, is necessary because street centerline segments are usually longer than their associated sidewalk segments, due to the lack of sidewalk representation and the absence of sidewalks in reality at street intersections or at driveways. Based on the previously described variables, full sidewalk coverage is defined as SDWT being greater than or equal to 90% of the street segment length STR.
Partial sidewalk coverage is defined as SDWT being less than 90% of the street segment length STR, but with SDW greater than or equal to 10% of STR. No sidewalk coverage is assigned in all other cases.
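The three coverage classes can be expressed as a short function. This is a minimal sketch that reads the rule of Step 3c literally; the function name `classify_coverage` and the constant names are illustrative assumptions, while the integer codes (1 = full, 2 = partial, 0 = none) and the 90% and 10% thresholds follow the text.

```python
FULL, PARTIAL, NONE = 1, 2, 0  # integer codes as used in Step 3c

def classify_coverage(sdw, street_length, tolerance=18.0):
    """Classify one side of a street segment.

    sdw           -- total matched sidewalk length SDW (m)
    street_length -- street segment length STR (m)
    tolerance     -- tolerance T (m); 18 m is the value selected in the study
    """
    sdwt = sdw + tolerance  # SDWT = SDW + T
    if sdwt >= 0.9 * street_length:
        return FULL
    if sdw >= 0.1 * street_length:
        return PARTIAL
    return NONE

# Hypothetical 100 m street segment with different matched sidewalk lengths.
for sdw in (80.0, 50.0, 5.0):
    print(sdw, classify_coverage(sdw, 100.0))
```

The tolerance enters only the full-coverage test, so a sidewalk that stops short of the intersections at both ends can still count as full coverage.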
Algorithm 1: Split-Match-Aggregate (SMA) algorithm.

Input: One street segment $S$ and $l$ sidewalk segments $W_1, \ldots, W_l$.
Output: The right side sidewalk coverage variable $SWK_{right}$ for each $S$. The left side sidewalk coverage variable $SWK_{left}$ is computed in an analogous way by creating a buffer to the left in Step 2a.

Step 1: Split the street segment $S$ and each of the sidewalk segments $W_1, \ldots, W_l$ at every vertex first and then split again with lengths less than or equal to 3 m, thus obtaining $n$ street sub-segments $s_1, \ldots, s_n$ and $p = m_1 + \ldots + m_l$ sidewalk sub-segments $w_{1,1}, \ldots, w_{1,m_1}, w_{2,1}, \ldots, w_{2,m_2}, \ldots, w_{l,1}, \ldots, w_{l,m_l}$.
Step 2: For each $s_k$:
Step 2a: Create a buffer $b_k$ with a pre-specified width $B$ to the right side of $s_k$.
Step 2b: Select those sidewalk sub-segments $w_{i,j}$ which intersect the buffer $b_k$.
Step 2c: Deselect $w_{i,j}$ of which the angle $a(w_{i,j}, s_k)$ between $w_{i,j}$ and $s_k$ is greater than or equal to a pre-specified angle $\alpha$.
Step 2d: Calculate the length $d(w_{i,j}, s_k)$ of those portions of $w_{i,j}$ that lie completely within the buffer $b_k$.
Step 2e: Calculate the total sub-segment sidewalk length $sdw_k = \sum_{i,j} d(w_{i,j}, s_k)$ associated with the street sub-segment $s_k$.
Graphical example (Step 2e): $sdw_1 = d(w_{1,1}, s_1) + d(w_{1,2}, s_1)$; $sdw_2 = d(w_{1,2}, s_2)$.
Step 3a: Calculate the total sidewalk segment length $SDW = \sum_k sdw_k$.
Step 3b: Add a pre-specified tolerance value $T$ to $SDW$ in order to obtain the total sidewalk length $SDWT$ associated with $S$ on the right side.
Step 3c: Determine the right side sidewalk coverage $SWK_{right}$ of the street segment $S$, expressed as the ratio of the total sidewalk length $SDW$ or $SDWT$ to the length $STR$ of the street segment $S$:

$$
SWK_{right} =
\begin{cases}
2, & \text{if } SDW \geq 0.1\,STR \text{ and } SDWT < 0.9\,STR \text{ (partial coverage)} \\
1, & \text{if } SDWT \geq 0.9\,STR \text{ (full coverage)} \\
0, & \text{in all other cases (no coverage)}
\end{cases}
$$
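Step 1 and the angle test of Step 2c can be sketched in plain Python. This is an illustration under stated assumptions, not the PostGIS implementation used in the study: the function names are hypothetical, each edge between two vertices is cut into equal pieces no longer than 3 m (the text does not say how the remainder is handled), and the angle is treated as undirected, so that two sub-segments digitized in opposite directions count as parallel.

```python
import math

def split_polyline(vertices, max_len=3.0):
    """Step 1 sketch: split at every vertex, then cut each edge into
    equal-length pieces no longer than max_len (assumed splitting rule)."""
    subsegments = []
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        if length == 0:
            continue  # skip duplicate vertices
        pieces = max(1, math.ceil(length / max_len))
        for i in range(pieces):
            t0, t1 = i / pieces, (i + 1) / pieces
            subsegments.append((
                (x0 + t0 * (x1 - x0), y0 + t0 * (y1 - y0)),
                (x0 + t1 * (x1 - x0), y0 + t1 * (y1 - y0)),
            ))
    return subsegments

def angle_between(seg_a, seg_b):
    """Step 2c sketch: undirected angle in degrees (0 to 90) between two
    straight sub-segments. Ignoring direction is an assumption."""
    ax, ay = seg_a[1][0] - seg_a[0][0], seg_a[1][1] - seg_a[0][1]
    bx, by = seg_b[1][0] - seg_b[0][0], seg_b[1][1] - seg_b[0][1]
    diff = abs(math.degrees(math.atan2(ay, ax) - math.atan2(by, bx))) % 180.0
    return min(diff, 180.0 - diff)

# A 10 m edge followed by a 2 m edge: four 2.5 m pieces plus one 2 m piece.
street_subs = split_polyline([(0, 0), (10, 0), (10, 2)])
print(len(street_subs))
```

A sidewalk sub-segment would then be kept in Step 2c only if `angle_between` returns a value below the chosen α.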
The algorithm uses the following three parameters for recognizing and harmonizing differences in geometries between the sidewalk data and the street data: (1) the buffer size B, which defines the maximum allowed distance between a street sub-segment and a sidewalk sub-segment; (2) the angle α, which specifies the maximum allowed angle between the street sub-segment and the sidewalk sub-segment; and (3) the tolerance T, which represents an increment of length being added to the calculated sum of total sidewalk sub-segment lengths in determining full sidewalk coverage. To our knowledge, no other study has analyzed these parameters and no known values or gold standards were available to aid in selecting parameter values. Therefore, we used a small random sample of street segments to select the best parameter combination from candidate values. We chose 3 different values for B (12 m, 15 m, and 18 m), 4 values for α (10°, 15°, 20°, and 25°), and 6 values for T (15 m, 18 m, 21 m, 24 m, 27 m, and 30 m). In total, all possible 72 combinations of the parameters were tested. The algorithm was implemented within the computing environment of PostgreSQL 9.1.1, PostGIS 2.0.1, and R 2.15.1.
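The 72 candidate combinations are the Cartesian product of the three value lists given above. A minimal sketch of how such a grid can be enumerated follows; the variable names are illustrative.

```python
from itertools import product

BUFFER_M = (12, 15, 18)                  # candidate values for B (m)
ANGLE_DEG = (10, 15, 20, 25)             # candidate values for alpha (degrees)
TOLERANCE_M = (15, 18, 21, 24, 27, 30)   # candidate values for T (m)

# Every (B, alpha, T) triple: 3 * 4 * 6 = 72 combinations.
parameter_grid = list(product(BUFFER_M, ANGLE_DEG, TOLERANCE_M))
print(len(parameter_grid))
```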
3.4. Parameter value selection process A randomly selected 5% parameter-development sample of the TNET street network segments (4,043 of 79,928) was used to select the best combination of values for the parameters B, α, and T with respect to the agreements between manual and algorithmic classifications. A trained analyst assessed and manually classified sidewalk coverage of the parameter-development sample. The analyst determined the association when a street segment and its corresponding sidewalk segments had reasonably acceptable proximity and angle between them. To minimize human errors, the analyst conducted two rounds of manual classifications. The analyst used a background of aerial photos for providing the environmental context of sidewalk and street segments. The parameter-development sample’s manual classification results were compared with the algorithmic results for the 72 parameter combinations. The best parameter combination was selected as having the greatest percentage of street segments for which manual and algorithmic classification results were in agreement.
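The selection rule can be sketched as follows, assuming that the overall agreement rate is the mean of the left-side and right-side agreement rates, as reported in the Results. All function names and the toy labels are hypothetical; the real comparison used 4,043 sampled segments and 72 combinations.

```python
def agreement_rate(manual, algorithmic):
    """Percentage of segments on which the two classifications agree."""
    if not manual or len(manual) != len(algorithmic):
        raise ValueError("label lists must be non-empty and equally long")
    matches = sum(m == a for m, a in zip(manual, algorithmic))
    return 100.0 * matches / len(manual)

def overall_agreement(manual, algorithmic):
    """Mean of left-side and right-side agreement; each argument is a
    (left_labels, right_labels) pair."""
    left = agreement_rate(manual[0], algorithmic[0])
    right = agreement_rate(manual[1], algorithmic[1])
    return (left + right) / 2.0

def select_best(manual, results):
    """Return the (B, alpha, T) key with the highest overall agreement."""
    return max(results, key=lambda params: overall_agreement(manual, results[params]))

# Toy data for four street segments (hypothetical labels).
manual = (["full", "none", "partial", "full"], ["none", "none", "full", "full"])
results = {
    (12, 25, 18): (["full", "none", "partial", "full"], ["none", "none", "full", "full"]),
    (15, 10, 30): (["full", "full", "partial", "full"], ["none", "none", "none", "full"]),
}
print(select_best(manual, results))
```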
4. Results When using the mean of the left- and right-side classification results as an indicator, all 72 parameter combinations showed overall agreement rates greater than 93.6% with the manual classification of the parameter-development sample. Table 1 shows that the combinations with the five highest overall agreement rates had a 96.44% rate or greater (the last column). Full and no sidewalk coverage classifications showed the highest levels of agreement rates of 96.2% or greater (columns 5, 6, 9, and 10), while partial coverage classifications varied from 69.0% to 81.4% (columns 7 and 11). The first combination (B = 12 m; α = 25°; T = 18 m) had the best overall agreement rate (96.50%) and was selected for the algorithm-based classification of sidewalk coverage to street network data for the entire sample. The selected best parameter combination showed an excellent overall agreement rate of 96.50% with the manual classification. All 72 combinations yielded robust results with the overall agreement rates ranging between the best combination (96.50%) and the worst one (93.61%). However, when analyzing the results with respect to the three different coverage values, the estimation of partial coverage was sensitive to changes in parameter values. Partial coverage had a relatively large range of the agreement rates among 72 combinations, ranging from 56.6% to 85.0% (data not shown) and even among the highest five combinations, ranging from 69.0% to 81.4% (Table 1), while full coverage and no coverage had a small range of 6.1% and 5.1% respectively among 72 combinations. There were 158 disagreements in the left-side classification and 125 disagreements in the right side between the manual- and algorithmic classification results for the selected best parameter combination within the 4,043 parameter-development sample street segments. 
After accounting for 48 disagreements on both sides, there were a total of 235 street segments showing segment-level disagreement incidents (5.81% of 4,043 street segments). Table 2 compares the results of the best automatic classification to those of the manual classification in the parameter-development sample for the selected parameter combination. The agreement rate was 96.1% for the segments on the left side of the streets and 96.9% for the segments on the right side. Cohen’s kappa statistics showed high agreement results (left side: κ = 0.926, P-value < 0.0001; right side: κ = 0.941, P-value < 0.0001). Among the 4,043 street segments in the parameter-development sample, there were 158 disagreements (3.9%) in the left-side classification and 125 (3.1%) in the right-side classification, showing a similar distribution pattern of disagreements by coverage type between the two street sides. Most disagreements were observed where sidewalk coverage was coded as no coverage by the manual classification, but as full coverage by the algorithm-based classification (53 disagreements in the left-side classification and 29 in the right-side classification). When the left- and right-side classifications were considered together, 48 street segments had disagreements on both sides and 187 street segments on either side.

Table 1. Agreement rates (%) for the top five parameter combinations, by sidewalk coverage class on the left and right sides of the street.

| No. | B (m) | α (°) | T (m) | Left: No | Left: Full | Left: Partial | Left: All | Right: No | Right: Full | Right: Partial | Right: All | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1* | 12 | 25 | 18 | 96.3 | 96.9 | 80.5 | 96.1 | 97.3 | 97.4 | 81.4 | 96.9 | 96.50 |
| 2 | 12 | 20 | 24 | 96.4 | 97.4 | 75.4 | 96.2 | 97.5 | 97.6 | 69.0 | 96.7 | 96.49 |
| 3 | 12 | 25 | 21 | 96.2 | 97.4 | 75.4 | 96.1 | 97.3 | 97.7 | 72.6 | 96.8 | 96.45 |
| 4 | 12 | 25 | 24 | 96.2 | 97.6 | 73.7 | 96.2 | 97.2 | 97.9 | 69.9 | 96.7 | 96.45 |
| 5 | 12 | 20 | 21 | 96.4 | 97.1 | 75.4 | 96.1 | 97.5 | 97.4 | 71.7 | 96.8 | 96.44 |

Note: * Selected parameter combination.

Table 2. Comparison of the results between the selected parameter combination (B = 12 m; α = 25°; T = 18 m) and the manual classifications of the parameter-development sample. Rows give the algorithm-based classification; columns give the manual classification for the left and right sides of the street.

| Algorithm-based classification | Left: No | Left: Full | Left: Partial | Left: All | Right: No | Right: Full | Right: Partial | Right: All |
|---|---|---|---|---|---|---|---|---|
| No | 2,041 | 39 | 11 | 2,091 | 2,059 | 27 | 7 | 2,093 |
| Full | 53 | 1,749 | 12 | 1,814 | 29 | 1,767 | 14 | 1,810 |
| Partial | 26 | 17 | 95 | 138 | 28 | 20 | 92 | 140 |
| All | 2,120 | 1,805 | 118 | 4,043 | 2,116 | 1,814 | 113 | 4,043 |
| Agreement rate (%) | 96.3 | 96.9 | 80.5 | 96.1 | 97.3 | 97.4 | 81.4 | 96.9 |

Agreement test: left side κ = 0.926, P-value < 0.0001; right side κ = 0.941, P-value < 0.0001.

The SMA run using the selected best parameter combination classified each side of the 79,928 TNET street segments in the study area. As shown in Table 3, 37.8% had full coverage on both left and right sides (accounting for 33.6% of total street length), 44.6% had no coverage on both sides (accounting for 48.1% of total street length), and the remaining 17.6% had either full or partial coverage on one side and either partial or no coverage on the other side at the street segment level (accounting for 18.3% of total street length).

Table 3. Algorithm-based sidewalk classification results within the entire study area at the street segment level.

| Coverage | Street segments (count) | Street segments (%) | Total street length (km) | Total street length (%) |
|---|---|---|---|---|
| Full coverage both sides | 30,223 | 37.8 | 2,831 | 33.6 |
| Full coverage one side and partial the other side | 1,740 | 2.2 | 221 | 2.6 |
| Full coverage one side and no coverage the other side | 9,054 | 11.3 | 793 | 9.4 |
| Partial coverage one side and no coverage the other side | 2,088 | 2.6 | 343 | 4.1 |
| Partial coverage both sides | 1,156 | 1.5* | 184 | 2.2 |
| No coverage both sides | 35,667 | 44.6 | 4,057 | 48.1 |
| Total | 79,928 | 100.0 | 8,430 | 100.0 |

Note: * Rounded by the largest remainder method from 1.4 to 1.5, making percentages added up to 100% in total.
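The agreement statistics reported for the selected parameter combination can be recomputed from the Table 2 counts. The sketch below uses the standard unweighted Cohen's kappa formula; the function names are illustrative, and rows are the algorithm-based classes and columns the manual classes, both in the order no, full, partial.

```python
def percent_agreement(matrix):
    """Share of segments on the diagonal of a square confusion matrix (%)."""
    total = sum(sum(row) for row in matrix)
    agreed = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * agreed / total

def cohens_kappa(matrix):
    """Unweighted Cohen's kappa for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1.0 - expected)

# Counts from Table 2.
LEFT = [[2041, 39, 11], [53, 1749, 12], [26, 17, 95]]
RIGHT = [[2059, 27, 7], [29, 1767, 14], [28, 20, 92]]

print(round(percent_agreement(LEFT), 1), round(cohens_kappa(LEFT), 3))    # 96.1 0.926
print(round(percent_agreement(RIGHT), 1), round(cohens_kappa(RIGHT), 3))  # 96.9 0.941
```

The off-diagonal totals are 158 for the left side and 125 for the right side, matching the disagreement counts given in the text.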
5. Discussion The algorithm showed an excellent overall agreement rate for the selected parameter combination and robust overall classification results for other combinations. Although classification results in partial coverage were relatively sensitive to parameters, because most street segments were classified as having full coverage or no coverage, overall agreement rates did not vary much across the 72 combinations.
We visually investigated the total of 235 street segments within the parameter-development sample to find possible sources of the disagreements. Three sources were identified: (1) limitations of the algorithm, (2) limitations of the input data, and (3) ambiguity and irregularity of geometries.
First, 65 disagreements (27.7% of the total of 235) came from algorithm-based misclassification. The analyst identified 30 misclassifications of partial coverage. For example, the algorithm misclassified as partial coverage instead of full coverage when large portions of sidewalk geometries were not parallel to street geometries (Figure 2a). The algorithm made 35 additional misclassifications based on false associations between street and sidewalk segments. False associations typically occurred when street segments were merged or split at sharp angles (Figure 2b).
Figure 2. Examples of disagreements: (a) Misclassification from rounded sidewalk geometries; (b) False association of sidewalks; (c) Input data misalignments; (d) Short intersection street segments; (e) Ambiguity from traffic islands.
Second, 81 disagreements (34.5% of the total of 235) resulted from input data limitations. Misalignments between GIS data from the jurisdictions and from King County produced 39 disagreements (Figure 2c). Also, the TNET street network data set used segments shorter than about 7.5 m to span a number of seemingly randomly distributed intersections (Figure 2d). Because the number of these intersection segments was relatively small, the manual classifier considered them as parts of longer street segments to which they were connected, and coded them in the same fashion as the longer segments. In contrast, the algorithm, which relied on the geometric relationships of streets and sidewalks, classified these intersection segments as having no sidewalk. In the parameter-development sample, 37 cases were observed to have this issue. Five disagreements came from invalid street segments in the input data.
Third, 89 disagreements (37.9% of the total of 235) came from ambiguous street and sidewalk geometries for which associations were difficult to determine. For example, streets whose travel lanes were separated by a traffic island or median were often represented by a separate polyline for each travel lane (Figure 2e).
Disaggregating segments into sub-segments is essential to ensure the effectiveness of the algorithm’s classification process. Evidently, errors at the sub-segment level will have a lower impact on the results than errors occurring at the segment level. In our case, mismatching one or two sidewalk sub-segments would yield a 3- or 6-m error in estimating sidewalk length. This error is equivalent to only 3–6% of the 105-m mean street segment length in our entire sample. In contrast, errors in results of analyses conducted at the segment level (i.e.
spatially joining sidewalk to street segments directly) would be in the range of hundreds of meters. Finally, the use of sub-segments also allows for keeping the tolerance parameter small. The SMA algorithm has five major benefits. First, the algorithm constitutes an automated method which substantially reduces human labor time and human errors. The trained analyst worked 90 hours in total for the manual classification of 4,043 street segments in the parameter-development sample. He made 95 classification errors (2.4% of the sample) (e.g., coding typos, confusion in the direction, and sides of streets), which were identified in the second round of manual classification. This implied that manual classification would take 223 8-hour days to complete the entire sample of 79,928 street segments, while the algorithm took less than 1 day, including data preparation time, with a medium-level server (Intel Xeon E3-1270 CPU, 64-bit, 16 GB RAM with OS RedHat Linux 6.3). Second, automation allows easy creation of new output data as soon as updated input data become available. Third, the algorithm can be applied or adapted to other regions where street network data sets are not integrated with sidewalk data. Fourth, the algorithm can be generalized and used to conflate polyline data sets other than streets and sidewalks. For example, collaborative mapping technologies using massive numbers of volunteers’ GPS traces show new ways of developing GIS data (e.g., pedestrian network maps from volunteers’ GPS traces) (Cao and Krumm 2009, Karimi and Kasemsuppakorn 2013). The SMA algorithm can be useful to integrate these newly created collaborative data with existing GIS data. Fifth, the proposed schema and the use of the algorithm offer an economical way of augmenting transportation network data where separate sidewalk data already exist. 
Future efforts to develop new sidewalk data in a GIS should consider referencing sidewalk segments to existing street network segments rather than creating distinct layers.
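The two numerical arguments in this section, the relative size of a sub-segment mismatch and the extrapolated manual workload, can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 3-m sub-segment length is inferred from the 3- and 6-m errors quoted above.

```python
import math


def subsegment_error_share(n_mismatched, subsegment_len_m=3.0, mean_segment_len_m=105.0):
    """Length error from mismatched sub-segments as a share of the mean street segment length."""
    return n_mismatched * subsegment_len_m / mean_segment_len_m


def manual_days(sample_hours, sample_segments, total_segments, hours_per_day=8.0):
    """Extrapolate manual classification time from the sample to the full data set."""
    hours_per_segment = sample_hours / sample_segments
    return hours_per_segment * total_segments / hours_per_day


# Figures reported in the text: 3-m sub-segments, 105-m mean segment length,
# 90 hours for 4,043 sampled segments, 79,928 segments in total.
one = subsegment_error_share(1)
two = subsegment_error_share(2)
days = manual_days(90, 4043, 79928)

print(f"{one:.1%}, {two:.1%}, {math.ceil(days)} days")
```

Rounded to whole percentages, the two shares correspond to the 3–6% range given above, and rounding the workload up to whole working days gives the 223 days cited.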
The sidewalk and street network data set created by the SMA algorithm can be used in research and transportation modeling as well as in transportation asset management. Preferred pedestrian networks can be developed by isolating street segments based on their sidewalk coverage information and other attributes that influence the safety and comfort of walking, such as street classification (e.g., principal arterial, minor arterial, collector arterial, and local roads), street width, and posted traffic speed. Sidewalk coverage provided at the segment level can also be aggregated by street-block face, neighborhood, or selected route. When sidewalk data exist, the algorithm can build pedestrian network data, enabling high-resolution route-level pedestrian behavior analyses such as walking route selection and walking mode choice analyses given origin and destination points. Future work will be required to develop detailed application methods and strategies in pedestrian network computations.
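As a rough illustration of how segment-level output could be used in this way, the following sketch aggregates sidewalk coverage by area and isolates candidate segments for a preferred pedestrian network. The record layout (`Segment` and its fields), the 0.5 coverage threshold, the allowed street classes, and the sample values are all hypothetical and do not reflect the schema of the study data.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    # Hypothetical record layout, not the schema used in the study.
    seg_id: int
    area: str              # e.g., neighborhood or street-block face identifier
    street_class: str
    length_m: float
    sidewalk_left_m: float
    sidewalk_right_m: float


def coverage(seg):
    """Share of the two street sides that have a sidewalk (0.0 to 1.0)."""
    return (seg.sidewalk_left_m + seg.sidewalk_right_m) / (2 * seg.length_m)


def coverage_by_area(segments):
    """Length-weighted sidewalk coverage aggregated per area."""
    sidewalk = defaultdict(float)
    street_sides = defaultdict(float)
    for s in segments:
        sidewalk[s.area] += s.sidewalk_left_m + s.sidewalk_right_m
        street_sides[s.area] += 2 * s.length_m
    return {area: sidewalk[area] / street_sides[area] for area in street_sides}


def preferred_segments(segments, min_coverage=0.5,
                       allowed_classes=("local", "collector arterial")):
    """IDs of segments that meet an assumed coverage and street-class filter."""
    return [s.seg_id for s in segments
            if coverage(s) >= min_coverage and s.street_class in allowed_classes]


# Made-up example data.
segments = [
    Segment(1, "A", "local", 100.0, 100.0, 100.0),
    Segment(2, "A", "principal arterial", 300.0, 0.0, 150.0),
    Segment(3, "B", "local", 50.0, 0.0, 0.0),
    Segment(4, "B", "collector arterial", 150.0, 150.0, 90.0),
]

by_area = coverage_by_area(segments)
preferred = preferred_segments(segments)
print(by_area, preferred)
```

Weighting by length rather than averaging per-segment ratios keeps short segments from dominating the area-level figure.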
6. Conclusions
The SMA algorithm automatically conflates sidewalk coverage information to street network data, based on the geometric relationships between street and sidewalk segments. Using a random parameter-development sample of 4,043 street segments, the algorithm was compared with manual sidewalk coverage classification. The agreement rate of 96.5% between the automatic and the manual classification was excellent. The algorithm provides a transparent and replicable way to integrate secondary sidewalk GIS data with existing street network data sets. The SMA algorithm can be applied more generally to combine line data sets in GIS, such as those emerging in the field of GPS-driven data development. The integration of sidewalk and street data offers promising new developments in the study, analysis, and monitoring of pedestrian infrastructure.
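To make the notion of a geometric relationship concrete, the sketch below tests whether a sidewalk sub-segment is a candidate match for a street sub-segment, using the best-performing distance (12 m) and angle (25°) thresholds reported in the abstract. It is not the authors' implementation: measuring distance between midpoints, treating segments as undirected, and all function names are assumptions made for illustration, and the third parameter (the 18-m length tolerance) is not covered.

```python
import math

MAX_DISTANCE_M = 12.0  # best-performing distance threshold reported in the abstract
MAX_ANGLE_DEG = 25.0   # best-performing angle threshold reported in the abstract


def midpoint(seg):
    """Midpoint of a segment given as ((x1, y1), (x2, y2)) in projected meters."""
    (x1, y1), (x2, y2) = seg
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def angle_between_deg(seg_a, seg_b):
    """Smallest angle between two undirected segments, in the range 0 to 90 degrees."""
    def bearing(seg):
        (x1, y1), (x2, y2) = seg
        return math.atan2(y2 - y1, x2 - x1)

    diff = abs(math.degrees(bearing(seg_a) - bearing(seg_b))) % 180.0
    return min(diff, 180.0 - diff)


def is_candidate_match(street_sub, sidewalk_sub):
    """True if the sub-segments are close and nearly parallel.

    Assumption: distance is measured between midpoints, which is a
    simplification chosen for this sketch.
    """
    dist = math.dist(midpoint(street_sub), midpoint(sidewalk_sub))
    angle = angle_between_deg(street_sub, sidewalk_sub)
    return dist <= MAX_DISTANCE_M and angle <= MAX_ANGLE_DEG


# Made-up coordinates: a 3-m street sub-segment and three sidewalk sub-segments.
street = ((0.0, 0.0), (3.0, 0.0))
parallel_near = ((3.0, 8.0), (0.0, 8.0))   # parallel, 8 m away, digitized in reverse
parallel_far = ((0.0, 20.0), (3.0, 20.0))  # parallel but 20 m away
crossing = ((1.5, -1.5), (1.5, 1.5))       # perpendicular, crossing the street

print(is_candidate_match(street, parallel_near),
      is_candidate_match(street, parallel_far),
      is_candidate_match(street, crossing))
```

Folding the angle into the 0 to 90 degree range means that the direction in which a line was digitized does not affect the result.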
Acknowledgments
The authors want to thank Ms. Paula Reeves for her support.
Funding
This study was supported in part by the National Institutes of Health [R01 HL091881] (PI: Brian E. Saelens) and by the Washington State Department of Transportation [Agreement T4118 Task 87] (PI: Anne V. Moudon).
References
Cao, L. and Krumm, J., 2009. From GPS traces to a routable road map. In: 17th ACM SIGSPATIAL international conference on advances in geographic information systems (ACM SIGSPATIAL GIS), 4–6 November, Seattle, WA. New York: ACM, 3–12.
Cervero, R., 2002. Built environments and mode choice: toward a normative framework. Transportation Research Part D: Transport and Environment, 7 (4), 265–284. doi:10.1016/S1361-9209(01)00024-4
Cervero, R. and Kockelman, K., 1997. Travel demand and the 3Ds: density, diversity, and design. Transportation Research Part D: Transport and Environment, 2 (3), 199–219. doi:10.1016/S1361-9209(97)00009-6
Chin, G.K.W., et al., 2008. Accessibility and connectivity in physical activity studies: the impact of missing pedestrian data. Preventive Medicine, 46 (1), 41–45. doi:10.1016/j.ypmed.2007.08.004
Cobb, M.A., et al., 1998. A rule-based approach for the conflation of attributed vector data. GeoInformatica, 2 (1), 7–35. doi:10.1023/A:1009788905049
Doytsher, Y., Filin, S., and Ezra, E., 2001. Transformation of datasets in a linear-based map conflation framework. Surveying and Land Information Systems, 61 (3), 165–176.
Ewing, R. and Cervero, R., 2010. Travel and the built environment. Journal of the American Planning Association, 76 (3), 265–294. doi:10.1080/01944361003766766
Frank, L.D. and Engelke, P.O., 2001. The built environment and human activity patterns: exploring the impacts of urban form on public health. Journal of Planning Literature, 16 (2), 202–218. doi:10.1177/08854120122093339
Gabay, Y. and Doytsher, Y., 1994. Automatic adjustment of line maps. In: GIS/LIS 1994 annual convention, 25–27 October, Phoenix, AZ. Bethesda, MD: ASPRS, 333–341.
Gallimore, J.M., Brown, B.B., and Werner, C.M., 2011. Walking routes to school in new urban and suburban neighborhoods: an environmental walkability analysis of blocks and routes. Journal of Environmental Psychology, 31 (2), 184–191. doi:10.1016/j.jenvp.2011.01.001
Haff, L.J., 1993. King county road standards. Seattle, WA: King County Department of Public Works.
Handy, S.L., et al., 2002. How the built environment affects physical activity. American Journal of Preventive Medicine, 23 (2), 64–73. doi:10.1016/S0749-3797(02)00475-0
Haunert, J.-H., 2005. Link based conflation of geographic datasets. In: 8th ICA workshop on generalisation and multiple representation, 7–8 April 2005, A Coruña, Spain.
Janssen, I. and Rosu, A., 2012. Measuring sidewalk distances using Google Earth. BMC Medical Research Methodology, 12 (1), 39. doi:10.1186/1471-2288-12-39
Karimi, H.A. and Kasemsuppakorn, P., 2013. Pedestrian network map generation approaches and recommendation. International Journal of Geographical Information Science, 27 (5), 947–962. doi:10.1080/13658816.2012.730148
Kitamura, R., Mokhtarian, P.L., and Laidet, L., 1997. A micro-analysis of land use and travel in five neighborhoods in the San Francisco Bay Area. Transportation, 24 (2), 125–158. doi:10.1023/A:1017959825565
Lee, C. and Moudon, A.V., 2006a. The 3Ds + R: quantifying land use and urban form correlates of walking. Transportation Research Part D: Transport and Environment, 11 (3), 204–215. doi:10.1016/j.trd.2006.02.003
Lee, C. and Moudon, A.V., 2006b. Correlates of walking for transportation or recreation purposes. Journal of Physical Activity and Health, 3, 77.
López-Vázquez, C. and Manso Callejo, M., 2013. Point- and curve-based geometric conflation. International Journal of Geographical Information Science, 27 (1), 192–207. doi:10.1080/13658816.2012.677537
McCormack, G.R., et al., 2012. The association between sidewalk length and walking for different purposes in established neighborhoods. International Journal of Behavioral Nutrition and Physical Activity, 9 (1), 92. doi:10.1186/1479-5868-9-92
Moudon, A.V., et al., 2013. Sidewalk data in King county’s urban growth boundary. Seattle, WA: Washington State Department of Transportation.
NAVTEQ, 2012. StreetMap Premium for ArcGIS North America NAVTEQ 2012. Release 2nd ed. Redlands, CA: ESRI.
Neis, P. and Zielstra, D., 2014. Generation of a tailored routing network for disabled people based on collaboratively collected geodata. Applied Geography, 47, 70–77. doi:10.1016/j.apgeog.2013.12.004
Reed, J., et al., 2006. Perceptions of neighborhood sidewalks on walking and physical activity patterns in a southeastern community in the US. Journal of Physical Activity & Health, 3 (2), 243.
Rodriguez, D.A. and Joo, J., 2004. The relationship between non-motorized mode choice and the local physical environment. Transportation Research Part D: Transport and Environment, 9 (2), 151–173. doi:10.1016/j.trd.2003.11.001
Rodriguez, D.A., et al., 2014. Influence of the built environment on pedestrian route choices of adolescent girls [online]. Environment and Behavior. Available from: http://eab.sagepub.com/content/early/2014/01/22/0013916513520004 [Accessed 7 November 2014]. doi:10.1177/0013916513520004
Saelens, B.E. and Handy, S.L., 2008. Built environment correlates of walking: a review. Medicine and Science in Sports and Exercise, 40 (Supplement), S550–S566. doi:10.1249/MSS.0b013e31817c67a4
Saelens, B.E., Sallis, J.F., and Frank, L.D., 2003. Environmental correlates of walking and cycling: findings from the transportation, urban design, and planning literatures. Annals of Behavioral Medicine, 25 (2), 80–91. doi:10.1207/S15324796ABM2502_03
Savary, L. and Zeitouni, K., 2005. Automated linear geometric conflation for spatial data warehouse integration process [online]. In: Proceedings of 8th AGILE conference on GIScience, 26–18 May 2006, Estoril. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.2404&rep=rep1&type=pdf [Accessed 10 November 2014].
Walter, V. and Fritsch, D., 1999. Matching spatial data sets: a statistical approach. International Journal of Geographical Information Science, 13 (5), 445–473. doi:10.1080/136588199241157
Yuan, S. and Tao, C., 1999. Development of conflation components. In: Proceedings of the international conference on geoinformatics and socioinformatics. Ann Arbor: University of Michigan, 1–13.