To cite this article: Qi Zhou & Zhilin Li (2015): How many samples are needed? An investigation of binary logistic regression for selective omission in a road network, Cartography and Geographic Information Science, DOI: 10.1080/15230406.2015.1104265
This article is also available online: http://www.tandfonline.com/eprint/tBVWiIpfYtD6mS5Szdit/full
How Many Samples are Needed? An Investigation of Binary Logistic Regression for Selective Omission in a Road Network
Qi Zhou a,* and Zhilin Li b
a Faculty of Information Engineering, China University of Geosciences, Wuhan, P.R. China
b Department of Land Surveying and Geo-Informatics, Hong Kong Polytechnic University, Kowloon, Hong Kong
*Corresponding author: Qi Zhou. E-mail: [email protected]
Abstract: Selective omission in a road network (or road selection) means retaining the more important roads, and it is a necessary operator for transforming a road network at a large scale into one at a smaller scale. This study discusses the use of a supervised learning approach for road selection and investigates how many samples are needed for a good selection performance. More precisely, binary logistic regression is employed, and three road network datasets with different sizes and different target scales are used for testing. Different percentages and numbers of strokes are randomly chosen for training a logistic regression model, which is then applied to the untrained strokes for validation. The performances obtained with different sample sizes are mainly evaluated by an error rate estimate. Significance tests are also employed to investigate whether the use of different sample sizes shows statistically significant differences. The experimental results show that in most cases the error rate estimate is around 0.1-0.2; more importantly, only a small number (e.g., 50-100) of training samples is needed, which indicates the usability of binary logistic regression for road selection.
Keywords: road network; selective omission; map generalization; binary logistic regression
1. Introduction
Spatial data can be represented at different scales, which facilitates map navigation and spatial analysis. However, most national mapping agencies currently still face the problem of how to update the different representations in a timely manner, because such work is usually labor intensive, time consuming and costly. An ideal solution to this problem is to update the representation at the largest scale, with the representations at smaller scales produced or updated automatically (Li 2006). This solution requires the automated transformation of the largest-scale representation to smaller scales, a process known as automated map generalization. Although automated generalization is becoming increasingly available (Stoter et al. 2014), fully automated transformation of a map from one scale to a smaller scale is still a research topic of intense interest in the field of mapping and cartography. This study is concerned with selective omission in road network data, because roads are among the most important geographical features on a map, and selective omission (i.e., the retention of the more important roads) is a necessary operation for automated road network generalization.
Selective omission in a road network has been the subject of extensive study. Some researchers analyzed road segments (Mackaness and Beard 1993; Mackaness 1995; Thomson and Richardson 1995; Kreveld and Peschier 1998) or road intersections (Mackaness and Mackechnie 1999) for selection, because a road network is usually stored in a database as intersections and segments. Others built strokes, which are defined as "a set of one or more arcs in a non-branching and connected chain" (Thomson and Richardson 1999), and based the selection on those strokes. Indeed, the use of strokes makes it possible to analyze a road network based on the importance of individual roads, even in the absence of all other thematic information (Thomson and Brooks 2007). The importance of each stroke may be determined by various properties, such as road length (Chaudhry and Mackaness 2005), stroke connectivity (Zhang 2004), and degree, closeness and betweenness centralities (Jiang and Claramunt 2004; Jiang and Harrie 2004). Some researchers proposed the use of areal partitions (Edwards and Mackaness 2000), mesh density (i.e., the total length in a unit area) (Chen et al. 2009) or blocks (Gulgen and Gokgoz 2011) to determine whether road segments are retained. Others (Touya 2010; Li and Zhou 2012) proposed integrated approaches that consider different structures or patterns in a road network.
Although a number of approaches for selective omission in a road network are now available, each approach involves at least one parameter that determines the representation of the road network at a specific scale. For instance, the approach proposed by Chen et al. (2009) involves a mesh density parameter that determines which meshes or roads are retained. The commercial software ArcGIS recently introduced a tool called Thin Road Network, which involves a minimum road length parameter defining the shortest road segment that is reasonable to display at the target scale (ArcGIS Resources 2012). Some approaches (Chaudhry and Mackaness 2005) used the so-called "Principle of Selection" or "Radical Law" proposed by Töpfer and Pillewizer (1966) to determine how many objects are retained at a specific scale, but the appropriate number or percentage may vary from case to case (Chen et al. 2009). Hence, the determination of thresholds for these parameters has become an essential requirement for road selection (Touya 2010).
Several studies have focused on determining such parameters for selective omission in a road network. Zhou and Li (2014a; 2014b) employed supervised learning approaches (e.g., an artificial neural network), and found that the criteria determining which and how many roads are retained at a specific scale can be derived adaptively. Supervised learning is not a new technique for automated map generalization; it has long been viewed as a promising tool for acquiring cartographic knowledge during the automated process (Weibel et al. 1995). Successful applications of supervised learning approaches to road classification (Balboa and López 2008) and to road simplification and smoothing (Mustiere 2005) have also been presented. Although the use of supervised learning approaches requires a set of training examples, national mapping agencies already have a series of existing maps at different scales, which may be used as training samples. It is therefore possible to use old features (e.g., roads or buildings) in the existing maps as examples to acquire cartographic knowledge, and then to infer the corresponding representation of updated features based on the acquired knowledge. However, an appropriate percentage or number of training samples is important for a good performance, because if the training samples are too few, supervised learning approaches are prone to overfitting (i.e., a model fits the training data very well but not the validation data). To our knowledge, existing research has mainly focused on the effectiveness of applying supervised learning approaches to automated map generalization rather than on the appropriate percentage or number of samples needed for learning. Therefore, this study aims to investigate how many samples are needed for a good performance of selective omission in a road network. To answer this question, a series of experiments was designed to test the performances obtained with different percentages or numbers of training samples. Although a number of supervised learning approaches could be used, this study mainly employs binary logistic regression for testing because (1) this approach does not involve any parameters; and (2) it can perform significantly better than some other supervised learning approaches (e.g., K-nearest neighbor and naïve Bayes), as found by Zhou and Li (2014b).
This paper is structured as follows: Section 2 introduces the basic principles of using binary logistic regression for selective omission in a road network; Section 3 designs experiments to investigate how many training samples are needed and analyzes the experimental results; Section 4 further investigates the needed sample size with significance tests; Section 5 concludes with the main findings and discusses the limitations of this study.
2. Selective omission in a road network with binary logistic regression

2.1 Binary logistic regression (BLR)
In statistics, binary logistic regression is a type of regression analysis. It is used to model the relationship between a binary response variable (e.g., pass/fail, go/no-go, dead/alive) and one or more predictor variables. Unlike the widely used linear regression analysis, logistic regression does not assume a linear relationship between the predictor variables and the response variable, and it never generates values greater than 1 or less than 0. Thus logistic regression is well suited to categorical response variables (Garson 2012). The logistic regression equation is represented with a logistic function (Eq. 1):

P(Y) = \frac{e^{b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n}}{1 + e^{b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n}}    (Eq. 1)

where x_i is a predictor variable, b_i is the regression coefficient of x_i, and P(Y) is the probability of the response variable Y occurring.
If the response variable has more than two categories, the logistic regression is called multinomial logistic regression; if the response variable has just two categories, it is called binary logistic regression. Normally, multinomial logistic regression yields results similar to those of binary logistic regression. In this study, binary logistic regression is employed for road selection because a road is either retained or eliminated, so the output can be viewed as a binary response variable.
Binary logistic regression can be used for classification. To be specific, data with known predictor variables and a known response variable are first used as samples to obtain an estimate of the regression coefficient vector \mathbf{b} = (b_0, b_1, b_2, \ldots, b_n). Commonly, an iterative maximum likelihood method is used to compute the regression coefficient vector (Greene 2011). The model with the estimated regression coefficient vector is then applied to classify new data for which only the predictor variables are known. Therefore, binary logistic regression is in fact a supervised learning approach. To use binary logistic regression for selective omission in a road network, it is necessary to first determine both the predictor and response variables.
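The following is a minimal sketch (not the authors' code) of this fit-then-classify workflow, assuming the scikit-learn library and a few made-up stroke records; the predictor values and class labels are illustrative assumptions only.

```python
# A minimal sketch (not the authors' code) of the fit-then-classify workflow,
# assuming scikit-learn and a few made-up stroke records.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training samples: one row per stroke, columns are the predictor
# variables (length, degree, closeness, betweenness).
X_train = np.array([
    [1200.0, 5, 0.012, 0.30],   # long, well-connected stroke (retained)
    [  80.0, 1, 0.004, 0.00],   # short dead-end stroke (eliminated)
    [ 950.0, 4, 0.010, 0.22],
    [ 150.0, 2, 0.005, 0.01],
])
y_train = np.array([1, 0, 1, 0])  # response variable: 1 = retained, 0 = eliminated

# Fit the model; note that scikit-learn's LogisticRegression applies a light L2
# penalty by default, which differs slightly from a plain maximum-likelihood fit.
model = LogisticRegression()
model.fit(X_train, y_train)

# Apply the fitted model to a new stroke whose predictor variables are known.
X_new = np.array([[700.0, 3, 0.008, 0.15]])
print(model.predict_proba(X_new))  # [P(eliminated), P(retained)]
print(model.predict(X_new))        # predicted class label
```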
2.2 Predictor and response variables of using BLR for selective omission in a road network

(1) Predictor variables
In a road network, normally the relatively important roads are selected. A number of properties can be used to describe the importance of a road. All these properties may be viewed as predictor variables; they are introduced below as geometric, topological and thematic properties, respectively.
Geometric properties: common sense tells us that long and wide roads tend to be more important (Jiang and Harrie 2004). However, in a database, because a road network is often represented by single lines, road length is more widely used. Normally the longer the road length, the more important the road is.
Topological properties: the road network can be represented as a dual graph in which individual roads are taken as nodes and road intersections are taken as links (Porta et al. 2006). Then the importance of each road can be computed according to different topological properties, among which degree, closeness and betweenness are three famous ones (Jiang and Harrie 2004; Crucitti et al. 2006).
Degree measures the number of connections between a given road and other roads within a network. The degree of a given road i can be calculated as (see Eq. 2):
C_i^D = \sum_{j \in N} a_{ij}    (Eq. 2)

where N is the total number of roads within the network; a_{ij} equals 1 if there is a connection between road i and road j, and 0 otherwise.
Closeness measures the shortest distance from a given road to all other roads. The closeness of a given road i can be calculated as (see Eq. 3):

C_i^C = \frac{N - 1}{\sum_{j \in N, j \neq i} d_{ij}}    (Eq. 3)

where d_{ij} is the shortest distance between road i and road j.
Betweenness measures the extent to which a given road is located between the paths that connect all other pairs of roads. The betweenness of a given road i can be calculated as (see Eq. 4):
C_i^B = \frac{1}{(N-1)(N-2)} \sum_{j,k \in N;\, j \neq k;\, k \neq i} \frac{n_{jk}(i)}{n_{jk}}    (Eq. 4)

where n_{jk} is the number of shortest paths from j to k, and n_{jk}(i) is the number of shortest paths from j to k that pass through road i.
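As an illustration of these three measures, the sketch below computes them with the networkx library on a small hypothetical dual graph (the road names and connections are invented); note that networkx's normalization of betweenness for undirected graphs may differ from Eq. 4 by a constant factor.

```python
# A minimal sketch, assuming the networkx library, of computing degree, closeness
# and betweenness on a small hypothetical dual graph.
import networkx as nx

# Dual graph: each node is a road (stroke); an edge means the two roads intersect.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("C", "D"), ("D", "E"),
])

degree = dict(G.degree())                   # Eq. 2: number of connected roads
closeness = nx.closeness_centrality(G)      # Eq. 3: (N-1) / sum of shortest distances
betweenness = nx.betweenness_centrality(G)  # Eq. 4: share of shortest paths through a road

for road in sorted(G.nodes):
    print(road, degree[road], round(closeness[road], 3), round(betweenness[road], 3))
```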
Thematic properties: thematic properties, if present in the database, may also be used to determine the importance of an individual road. Li and Choi (2002) investigated several such properties, namely road type (or class), number of lanes, number of traffic directions and width. However, from our observations, these thematic properties are not always available.
(2) Response variable
The response variable is a binary decision on whether a road/road segment is retained or not at a specific target scale. That is, the binary response variable involves two categories:
A road is retained at a specific scale; and
A road is not retained (or eliminated) at a specific scale.
3. Comparing the performances of using different percentages of training samples
3.1 Experimental data
The experimental data involve three road networks with different target scales and different sizes (Figure 1). The first is a digital road network of Hong Kong at 1:50,000 scale, which was used as source data (Figure 1a). The corresponding road map at 1:100,000 scale was viewed as the benchmark (or target data) for evaluation. Both datasets were provided by the Lands Department of Hong Kong. The second is a digital road network of Lower Hutt City, New Zealand (Figure 1b). The source data are also at 1:50,000 scale, but the corresponding road map or target data is at 1:250,000 scale. Both datasets were provided by Land Information New Zealand. The third is also a digital road network in New Zealand, but it covers a much larger region (Hawke's Bay, Figure 1c); the source data and target data are again at 1:50,000 and 1:250,000 scale, respectively. In order to make the digital road network and the road map comparable, each road segment in the source road network was manually marked with an attribute indicating whether or not it was retained at the corresponding target scale.
[Figure 1 near here]
3.2 Experimental design
Predictor variables: four properties were used as the predictor variables: road length, degree, closeness and betweenness. Only these four properties were considered because they are geometric and topological properties, which are available for all road network data. These properties were computed for each stroke in a road network, and all the strokes were built according to the principle of "good continuation" proposed by Thomson and Richardson (1999).
Response variable: theoretically, the response variable should indicate whether or not a stroke is retained in the road network at the target scale. In practice, however, a stroke may comprise both retained and non-retained road segments. So the road segments of a stroke were used for validation, and the response variable is thus an attribute recorded in the database indicating whether a road segment is retained or not: "1" means a road segment is retained, and "0" means a road segment is not retained (eliminated). On the other hand, in order to use the stroke as the basic unit for selective omission, the predictor variables of each stroke were assigned to the road segments that constitute that stroke, as illustrated in the code sketch at the end of this subsection. For example, if the length of a stroke was 1000 meters, the lengths of all its road segments were also set to 1000 meters. Without this step, the road segment would become the basic unit for selective omission, and the continuous road segments of a stroke might be disconnected. As three road networks were involved in this experiment, there are three categories of response variables (Table 1).
[Table 1 near here]
Training data: a certain percentage of strokes was used as samples for training. To be specific, the percentage ranged from 1% to 9% at an interval of 1%, and from 10% to 90% at an interval of 10%. For each percentage, 30 groups of different random samples were chosen in order to minimize subjectivity.
Validation data: the strokes in a road network which were not used for training were viewed as validation data.
Evaluation measure: the performance of selective omission was evaluated by a measure called the error rate, which is the ratio of the number of road segments incorrectly classified to the total number of road segments. This measure varies from 0 to 1, and it has been widely used for evaluating the performance of other supervised learning approaches (Balboa and López 2008; Zhou and Li 2014a; Zhou and Li 2014b).
Implementation platform: the binary logistic regression was implemented with the free data mining software TANAGRA (Rakotomalala 2005). Detailed steps for implementing binary logistic regression in TANAGRA can be found in the tutorials (http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html).
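The sketch below (hypothetical tables and column names, pandas assumed) illustrates the assignment of stroke-level predictor values to the constituent road segments described above; it is an illustration of the data preparation, not the authors' implementation.

```python
# A minimal sketch (hypothetical tables and column names, pandas assumed) of
# assigning stroke-level predictor values to the road segments of each stroke.
import pandas as pd

# One row per stroke with its predictor variables.
strokes = pd.DataFrame({
    "stroke_id": [1, 2],
    "length": [1000.0, 250.0],
    "degree": [4, 1],
    "closeness": [0.012, 0.004],
    "betweenness": [0.30, 0.00],
})

# One row per road segment, with the recorded response variable
# (1 = retained at the target scale, 0 = not retained/eliminated).
segments = pd.DataFrame({
    "segment_id": [10, 11, 12, 13],
    "stroke_id": [1, 1, 2, 2],
    "retained": [1, 1, 0, 0],
})

# Every segment inherits the predictor values of its parent stroke,
# so the stroke remains the basic unit for selective omission.
segments = segments.merge(strokes, on="stroke_id", how="left")
print(segments)
```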
3.3 Experimental steps
The main experimental steps are listed as follows (a code sketch of these steps follows the list):
Step 1: Randomly choose a certain percentage of strokes in the road network, and use all the road segments of these strokes to train a logistic regression model;
Step 2: Apply this model to the untrained strokes, and calculate the error rate according to the response variable of all the road segments of these strokes;
Step 3: For each percentage, repeat the above two steps 30 times, and calculate an error rate estimate by averaging the 30 error rates, and also calculate the confidence interval for this error rate estimate;
Step 4: Repeat the above three steps with different percentages of training samples.
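A minimal sketch of Steps 1-4 is given below, assuming scikit-learn/numpy and the hypothetical segment table of the earlier sketch (columns stroke_id, the four predictors and the binary retained attribute); it illustrates the procedure, not the authors' TANAGRA workflow.

```python
# A minimal sketch of Steps 1-4 (not the authors' implementation), assuming
# scikit-learn/numpy and a hypothetical 'segments' DataFrame with columns
# 'stroke_id', the four predictors and a binary 'retained' response variable.
import numpy as np
from sklearn.linear_model import LogisticRegression

PREDICTORS = ["length", "degree", "closeness", "betweenness"]
rng = np.random.default_rng(0)

def error_rate_estimate(segments, percentage, repetitions=30):
    """Average error rate over repeated random draws of training strokes."""
    stroke_ids = segments["stroke_id"].unique()
    n_train = max(1, int(round(percentage * len(stroke_ids))))
    error_rates = []
    for _ in range(repetitions):
        # Step 1: randomly choose strokes and train on their road segments
        # (the draw is assumed to contain both retained and eliminated segments).
        train_ids = rng.choice(stroke_ids, size=n_train, replace=False)
        train = segments[segments["stroke_id"].isin(train_ids)]
        model = LogisticRegression().fit(train[PREDICTORS], train["retained"])
        # Step 2: apply the model to the segments of the untrained strokes.
        test = segments[~segments["stroke_id"].isin(train_ids)]
        error_rates.append(np.mean(model.predict(test[PREDICTORS]) != test["retained"]))
    # Step 3: error rate estimate (mean) and a simple 95% confidence interval.
    mean = float(np.mean(error_rates))
    half_width = 1.96 * np.std(error_rates, ddof=1) / np.sqrt(repetitions)
    return mean, (mean - half_width, mean + half_width)

# Step 4: repeat with different percentages of training samples, e.g.
# for p in (0.01, 0.02, 0.05, 0.10, 0.30):
#     print(p, error_rate_estimate(segments, p))
```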
3.4 Experimental results and analyses
Figure 2 plots the relationship between the different percentages (i.e., 1%-9% and 10%-90%) of training samples and the corresponding error rate estimates for the three study areas.
[Figure 2 near here]
It can be seen from Figure 2 that:
1) In most cases, the error rate estimates for the different percentages of training samples were around 0.1 or 0.2, so most of the road segments were correctly classified with binary logistic regression. This finding is consistent with that of Zhou and Li (2014b). However, the error rate estimates differ among the three study areas. For example, the error rate estimate is around 0.2 for Hong Kong and around 0.1 for Hawke's Bay. This is possibly due to the different road patterns of these two road networks: the tested road network of Hawke's Bay has relatively many more rural roads than urban roads.
2) The error rate estimate for a relatively small percentage (e.g., 1%-4%) of training samples may have a wider confidence interval, but it tends to become stable when more samples are involved in training. This is especially true for the study area of Lower Hutt City, which indicates that only a small percentage of training samples is needed for road selection with binary logistic regression. For the study areas of Hong Kong and Hawke's Bay, an even smaller percentage (e.g., 1%-2%) of training samples suffices, because these road networks have relatively more strokes in total (Table 1).
Figure 3 then shows the selection results and corresponding error rates obtained when using different percentages of strokes for training. These results were obtained according to Steps 1 and 2 in Section 3.3. As an example, 2%, 3%, 20% and 30% of the strokes in the tested road network of Lower Hutt City were randomly chosen for training a logistic regression model, which was then applied to the entire road network for validation.
[Figure 3 near here]
Comparing with the benchmark (Figure 3a), it can be found that:
1) The result in Figure 3b is not an effective selection because it retains many short and relatively unimportant roads while eliminating long and relatively important roads. The selection results in Figures 3c, 3d and 3e perform much better, although it is not easy to determine visually which one has the best performance. These selection results also have much lower error rates than that in Figure 3b. This further illustrates that a small percentage (e.g., 3%) of training samples may already result in a good performance of road selection.
2) However, the selection results may still fail to preserve the connectivity of the retained road network. Thus other algorithms (Kreveld and Peschier 1998; Chaudhry and Mackaness 2005) may be needed to refine the selection result.
4. Comparing the performances of using different sample sizes with significance tests
In this section, significance tests are further employed to investigate whether the performances obtained with different sample sizes show statistically significant differences.

4.1 Design of experiments
A significance test involves both a null hypothesis and an alternative hypothesis. In this experiment:
The null hypothesis (H_0) is that the use of different sample sizes for training makes no difference;
The alternative hypothesis (H_a) is that the use of different sample sizes for training makes a difference.
The observations for the significance tests were error rate estimates for several different sample sizes (i.e., 20, 50, 100, 200, 500 and 1000) obtained by 10-fold cross-validation (Refaeilzadeh et al. 2009). The detailed steps are listed as follows:
Step 1: Randomly partition all the strokes in a tested road network into ten non-overlapping subsamples, so that each stroke belongs to exactly one subsample.
Step 2: Randomly choose a certain number of samples (e.g., 20, 50, 100, 200, 500 or 1000) from nine of the ten subsamples for training a classifier, and use the remaining subsample for testing this classifier. Estimate the error rate for each sample size (e.g., 50) by randomly sampling 30 times.
Step 3: Repeat Step 2 ten times, so that each of the ten subsamples is used for testing once. The ten error rate estimates for the same sample size constitute one group, and the significance tests compare the groups of different sample sizes.
As the above groups are not independent and the distribution of each group is unknown, we employed the non-parametric Friedman test (Friedman 1937) to test for statistically significant differences among the different sample sizes. If the null hypothesis of the Friedman test is rejected, we accept the alternative hypothesis that the use of different sample sizes for training makes a difference. However, which pair(s) of sample sizes differ remains unknown. Thus Dunn's Multiple Comparison Test (Dunnett 1964) was further employed to make comparisons between all pairs of sample sizes.
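The sketch below illustrates such a test procedure on hypothetical error rates, using scipy's Friedman test and, as a stand-in for the Dunn's multiple comparison test used in this study, the Nemenyi post hoc test from the scikit-posthocs package (an assumed extra dependency).

```python
# A minimal sketch on hypothetical error rates: the Friedman test from scipy and,
# as a stand-in for Dunn's multiple comparison test, the Nemenyi post hoc test
# from the scikit-posthocs package (an assumed extra dependency).
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

sample_sizes = [20, 50, 100, 200, 500, 1000]
# Hypothetical error rate estimates: one row per subsample (block),
# one column per sample size.
rng = np.random.default_rng(1)
errors = pd.DataFrame(
    0.25 - 0.02 * np.log10(sample_sizes) + rng.normal(0.0, 0.01, (10, 6)),
    columns=sample_sizes,
)

# Friedman test: do the groups (sample sizes) differ significantly?
statistic, p_value = friedmanchisquare(*[errors[s] for s in sample_sizes])
print(f"Friedman statistic = {statistic:.2f}, p = {p_value:.4f}")

# If the null hypothesis is rejected, compare all pairs of sample sizes.
if p_value < 0.05:
    print(sp.posthoc_nemenyi_friedman(errors.to_numpy()))
```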
4.2 Experimental results and analyses
Figure 4 plots the relationship between the sample size used for training and the corresponding error rate estimate. As an example, only the plots for the ten subsamples of the Hong Kong road network are shown here.
[Figure 4 near here]
It can be seen from Figure 4 that the error rate estimate first drops as the sample size for training increases, and then tends to become stable even when the sample size increases further. In most cases, the error rate estimate tends to be stable once the sample size for training exceeds 50 or 100.
Then, the error rate estimates for the different sample sizes (i.e., 20, 50, 100, 200, 500 and/or 1000) were statistically compared using two significance tests (the Friedman test and Dunn's Multiple Comparison Test). The results of these two tests are listed in Tables 2 and 3.
It can be seen from Tables 2 and 3 that the statistical results of using the different sample sizes for training are not exactly the same for the three study areas. To sum up:
1) The error rate estimate for the sample size 20 is significantly higher than that for sample sizes of 200 or more (e.g., 500 and 1000), and sometimes significantly higher than that for the sample size 100.
2) The error rate estimate for the sample size 50 is sometimes significantly higher than that for sample sizes of 200 or more (e.g., 500 and 1000).
3) However, no statistically significant difference was found among the error rate estimates for sample sizes larger than 50 (i.e., 100, 200, 500 and 1000). Thus the needed sample size for training may be between 50 and 100.
5. Conclusions and discussion
This paper investigated how many samples are needed when using a supervised learning approach for road selection. The basic idea is to use different percentages and numbers of strokes as samples for training a supervised learning model (binary logistic regression), which is then applied to the untrained strokes for validation. Three road networks with different sizes and different target scales were used for testing. The performance for each sample size was mainly evaluated with an error rate estimate obtained by simple random sampling. The results showed that: (1) most of the error rate estimates are around 0.1-0.2; and (2) more importantly, only 50-100 random samples are needed when using binary logistic regression for road selection.
However, this study has several limitations.
First, we used simple random sampling, and thus the error rate estimate depended not only on the sample size but also on the distribution of the training samples. As an example, Figure 5 visualizes random training samples (strokes) that have the same sample size (2%) but different distributions; the error rate for each group of training samples is also listed. It can be seen that the error rate for Figure 5a is 0.848, which is much higher than that for Figure 5b (0.184). This is possibly because the samples in Figure 5a are all short and relatively unimportant strokes. It is therefore better to choose both strokes to be retained and strokes to be eliminated as training samples, as sketched below (after Figure 5). An alternative solution is to use purposive sampling (Tongco 2007) or model-based sampling (Hansen et al. 1983) rather than simple random sampling. More importantly, how to choose the fewest and most effective training samples is still worth investigating in the future.
[Figure 5 near here]
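A minimal sketch of such a stratified choice of training strokes is given below, assuming pandas and a hypothetical stroke table with a binary 'retained' column; it shows one possible remedy, not the sampling scheme used in this study.

```python
# A minimal sketch (pandas assumed, hypothetical 'strokes' table with a binary
# 'retained' column) of stratified random sampling so that the training strokes
# include both strokes to be retained and strokes to be eliminated.
import pandas as pd

def stratified_stroke_sample(strokes: pd.DataFrame, fraction: float, seed: int = 0) -> pd.DataFrame:
    """Draw the same fraction of strokes from each class of the response variable."""
    return (
        strokes.groupby("retained", group_keys=False)
        .apply(lambda cls: cls.sample(frac=fraction, random_state=seed))
    )

# Usage (hypothetical): train_strokes = stratified_stroke_sample(strokes, fraction=0.02)
```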
Second, this study mainly investigated the needed sample size when using binary logistic regression. The needed sample size for other supervised learning approaches should also be investigated in the future. Moreover, most other supervised learning approaches (e.g., artificial neural networks and decision trees) involve multiple parameters; thus it would also be interesting to investigate whether the needed sample size varies with different parameter settings.
Last but not least, this study tested the needed sample size with only three road network datasets. Although road network data have become increasingly available (e.g., OpenStreetMap and TIGER/Line), there is still a lack of free road network data covering the same region at multiple scales. Therefore, the needed sample size for training should be further verified by testing on more road network data.
Acknowledgments
The project was supported by the National Natural Science Foundation of China (No. 41301523) and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan, P.R. China, No. CUGL140419). The authors are thankful to Land Information New Zealand and the Lands Department of Hong Kong for providing the experimental data. The authors would also like to express special thanks to the anonymous reviewers and the editor for their valuable comments.
References
ArcGIS Resources. (2012). Thin Road Network (Cartography). In: ArcGIS Help 10.1. Redlands, CA: Esri. http://resources.arcgis.com/en/help/main/10.1/index.html#//007000000014000000 (Accessed August 15, 2014).
Balboa, J.L.G. and López, F.J.A., (2008). Generalization-oriented Road Line Classification by Means of an Artificial Neural Network. Geoinformatica, 12(3): 289-312.
Chaudhry, O. and Mackaness, W., (2005). Rural and urban road network generalization: deriving 1:250,000 from OS MasterMap. www.era.lib.ed.ac.uk/bitstream/1842/1137/1/ochaudry001.pdf (Accessed January 31, 2009).
Chen, J., Hu, Y.G., Li, Z.L., Zhao, R.L. and Meng, L.Q., (2009). Selective omission of road features based on mesh density for automatic map generalization. International Journal of Geographical Information Science, 23(8): 1013 – 1032.
Crucitti, P., Latora, V. and Porta, S., (2006). Centrality measures in spatial networks of urban roads. Physical Review E, 73(3): 036125.
Dunnett, C.W., (1964). New tables for multiple comparisons with a control. Biometrics, 20(1964): 482–491.
Edwards, A. and Mackaness, W.A., (2000). Intelligent generalisation of urban road networks. In: Proceedings of the GIS Research UK 2000 Conference (GISRUK 2000), University of York, April 5-7, pp. 81-85.
Friedman, M., (1937). The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. Journal of the American Statistical Association, 32(200): 675- 701.
Garson, G.D., (2012). Logistic Regression: Binomial and Multinomial. Asheboro, NC: Statistical Associates Publishers.
Greene, W., (2011). Econometric Analysis (7th Edition). Prentice Hall Press.
Gulgen, F. and Gokgoz, T., (2011). A block-based selection method for road network generalization. International Journal of Digital Earth, 4(2): 133-153.
Hansen, M.H., Madow, W.G. and Tepping, B.J., (1983). An Evaluation of Model-Dependent and Probability-Sampling Inferences in Sample Surveys. Journal of the American Statistical Association, 78 (384): 776-793.
Jiang, B. and Claramunt, C., (2004). A Structural Approach to the Model Generalization of urban Street Network. GeoInformatica, 8(2): 157-173.
Jiang, B. and Harrie, L., (2004). Selection of roads from a Network Using Self-Organizing Maps. Transactions in GIS, 8(3): 335-350.
Kreveld, M. and Peschier, J., (1998). On the Automated Generalization of Road Network Maps. In: Proceedings of the 3rd International Conference on GeoComputation. http://www.geocomputation.org/1998/21/gc_21.htm (Accessed January 31, 2010).
Leitner, M. and Buttenfield, B.P., (1995). Acquisition of Procedural Cartographic Knowledge by Reverse Engineering. Cartography and Geographic Information Systems, 22(3): 232-241.
Li, Z.L., (2006). Algorithmic Foundation of Multi-scale Spatial Representation. CRC Press (Taylor & Francis Group), Boca Raton, 280 pp.
Li, X. and Claramunt, C., (2006). A spatial entropy-based decision tree for classification of geographical information. Transactions in GIS, 10(3): 451-467.
Li, Z.L. and Choi, Y.H., (2002). Topographic Map Generalization: Association of Road Elimination with Thematic Attributes. The Cartographic Journal, 39(2): 153-166.
Li, Z.L. and Zhou, Q., (2012). Integration of Linear- and Areal-Hierarchies for Continuous Multi-Scale Representation of Road Networks. International Journal of Geographical Information Science, 26(5): 855-880.
Mackaness, W.A. and Beard, M.K., (1993). Use of graph theory to support map generalization. Cartography and Geographic Information Systems, 20(4): 210-211.
Mackaness, W., (1995). Analysis of Urban Road Networks to Support Cartographic Generalization. Cartography and Geographic Information Systems, 22(4): 306-316.
Mackaness, W. and Mackechnie, G., (1999). Automating the Detection and Simplification of Junctions in Road Networks. GeoInformatica, 3(2): 185-200.
Mustiere, S., (2005). Cartographic generalisation of roads in a local and adaptive approach: a knowledge acquisition problem. International Journal of Geographical Information Science, 19(8): 937-955.
Porta, S., Crucitti, P. and Latora, V., (2006). The network analysis of urban roads: A dual approach. Physica A: Statistical Mechanics and its Applications, 369(2): 853-866.
Rakotomalala, R., (2005). TANAGRA: a free software for research and academic purposes. In: Proceedings of EGC'2005, RNTI-E-3, vol. 2, pp. 697-702. (In French)
Refaeilzadeh, P., Tang, L., and Liu, H., (2009). Cross-Validation. Encyclopedia of Database Systems 2009, pp.532-538.
Stoter, J., Post, M., Altena, V.V., Nijhuis, R. and Bruns, B., (2014). Fully automated generalization of a 1: 50k map from 1: 10k data. Cartography and Geographic Information Science, 41(1): 1-13.
Thomson, R. and Richardson, D., (1995). A Graph Theory Approach to Road Network Generalisation. In: Proceeding of the 17th International Cartographic Conference, pp.1871–1880.
Thomson, R. and Richardson, D., (1999). The "good continuation" principle of perceptual organization applied to the generalization of road networks. In: Proceedings of the 19th International Cartographic Conference, Ottawa, Canada, pp. 1215-1223.
Thomson, R. and Brooks, R., (2007). Generalisation of Geographical Networks. In: Ruas, A., Mackaness, W.A. and Sarjakoski, L.T. (eds.), Generalisation of Geographic Information: Cartographic Modelling and Applications. Elsevier, pp. 255-267.
Tongco, M.D.C., (2007). Purposive Sampling as a Tool for Informant Selection. A Journal for Plant, People and Applied Research, 5(2007): 147-158.
Touya, G., (2010). A road network selection process based on data enrichment and structure detection. Transactions in GIS, 14(5): 595-614.
Töpfer, F. and Pillewizer, W., (1966). The principles of selection. The Cartographic Journal, 3(1): 10-16.
Weibel, R., Keller, S. and Reichenbacher, T., (1995). Overcoming the knowledge acquisition bottleneck in map generalization: the role of interactive systems and computational intelligence. In: Proceedings of the 2nd International Conference on Spatial Information Theory, pp. 139-156.
Yu, Z., (1993). The effects of scale change on map structure. Doctoral Thesis, Department of Geography, Clark University.
Zhang, Q., (2004). Road network generalization based on connection analysis. In: The 11th International Symposium on Spatial Data Handling, pp.343-353.
Zhou, Q. and Li, Z.L., (2014a). Use of Artificial Neural Networks for Selective Omission in Updating Road Networks. The Cartographic Journal, 51(1): 38-51.
Zhou, Q. and Li, Z.L., (2014b). A Comparative Study of Various Supervised Learning Approaches to Selective Omission in a Road Network. The Cartographic Journal, DOI: 10.1179/1743277414Y.0000000083.
Table 1 Three categories of response variables

Study area | Category | Scale of source map | Scale of target map | Total number of strokes | Response variables
Hong Kong | 1 | 1:50,000 | 1:100,000 | 4012 | 'R_Y_1' - the road segment is retained at 1:100,000 scale; 'R_N_1' - the road segment is not retained at 1:100,000 scale
Lower Hutt City | 2 | 1:50,000 | 1:250,000 | 818 | 'R_Y_2' - the road segment is retained at 1:250,000 scale; 'R_N_2' - the road segment is not retained at 1:250,000 scale
Hawke's Bay | 3 | 1:50,000 | 1:250,000 | 3711 | 'R_Y_3' - the road segment is retained at 1:250,000 scale; 'R_N_3' - the road segment is not retained at 1:250,000 scale
Table 2 Results of the Friedman test

Study area | Number of groups | Friedman statistic | Are means significantly different? (P < 0.05)
Hong Kong | 6 | 38.18 | Yes
Lower Hutt City | 5 | 29.2 | Yes
Hawke's Bay | 6 | 31.03 | Yes
Table 3 Results of the Dunn's Multiple Comparison Test

Sample size comparison | Hong Kong: difference in rank sum | Hong Kong: significant? (P < 0.05) | Lower Hutt City: difference in rank sum | Lower Hutt City: significant? (P < 0.05) | Hawke's Bay: difference in rank sum | Hawke's Bay: significant? (P < 0.05)
20 vs. 50 | 8 | No | 6 | No | 16 | No
20 vs. 100 | 23.5 | No | 16 | No | 28 | Yes
20 vs. 200 | 39 | Yes | 24 | Yes | 30 | Yes
20 vs. 500 | 40.5 | Yes | 29 | Yes | 40 | Yes
20 vs. 1000 | 27 | Yes | - | - | 36 | Yes
50 vs. 100 | 15.5 | No | 10 | No | 12 | No
50 vs. 200 | 31 | Yes | 18 | Yes | 14 | No
50 vs. 500 | 32.5 | Yes | 23 | Yes | 24 | No
50 vs. 1000 | 29 | Yes | - | - | 20 | No
100 vs. 200 | 15.5 | No | 8 | No | 2 | No
100 vs. 500 | 17 | No | 13 | No | 12 | No
100 vs. 1000 | 3.5 | No | - | - | 8 | No
200 vs. 500 | 1.5 | No | 5 | No | 10 | No
200 vs. 1000 | -12 | No | - | - | 6 | No
500 vs. 1000 | -13.5 | No | - | - | -4 | No

(Note: the sample size 1000 is not applicable to the tested road network of Lower Hutt City, which has fewer than 1000 strokes in total.)
A list of Figures
Figure 1 Three road networks at two different scales
Figure 2 Plot of the relationship between the different percentages of training samples and the corresponding error rate estimates for the three road networks
Figure 3 Selection results (bold lines) and corresponding error rates for Lower Hutt City, using different percentages of training samples
Figure 4 Plot of the relationship between the different sample sizes and the corresponding error rate estimates for the ten subsamples of the Hong Kong road network
Figure 5 Error rates of using different training samples (strokes, highlighted with bold lines) for Lower Hutt City