An Empirical Comparison of Edge Effect Correction

Ikuho Yamada & Peter A. Rogerson

An Empirical Comparison of Edge Effect Correction Methods Applied to K-function Analysis

This paper explores various edge correction methods for K-function analysis via Monte Carlo simulation. The correction methods discussed here are Ripley's circumference correction, a toroidal correction, and a guard area correction. First, simulation envelopes for a random point pattern are constructed for each edge correction method. Then the statistical powers of these envelopes are analyzed in terms of the probability of detecting clustering and regularity in simulated clustering/regularity patterns. In addition to the K-function, K(h), determined for individual distances, h, an overall statistic k is also examined. A major finding of this paper is that the K-function method adjusted by either the Ripley or toroidal edge correction method is more powerful than the method with no adjustment or with the guard area adjustment. Another is that the overall statistic k outperforms the individual K(h) across almost the entire range of potential distances h.

1. INTRODUCTION

In the context of spatial data analysis, point pattern analysis has long been considered one of the most important topics. A major interest of spatial point pattern analysis is whether an observed point pattern follows some systematic process of clustering or regularity, or whether it follows a random process (Bailey and Gatrell 1995). Since many fields such as ecology, epidemiology, criminology, and geography share this interest, various methods for point pattern analysis, especially cluster detection methods, have been developed; some of them are related to the first-order trend of a point pattern, and others are related to the second-order intensity. Besag and Newell (1991) classify cluster detection methods into three categories. The first is general tests, where a single global statistic indicates whether any deviation from an expected pattern exists or not. The second is focused tests, which are designed to detect clustering around predefined foci. The third category is tests for the detection of cluster-

Ikuho Yamada is a Ph.D. student in the Department of Geography at the University at Buffalo. Peter Rogerson is a professor in the Department

2.2. Edge Correction Methods for the K-function

To estimate the K-function for an observed point pattern, Ripley (1977) proposed an edge correction method that weights ordered pairs of points based on their relative locations within the study area. The weight, w_ij, for a pair of points i and j is defined as the proportion of the circumference of a circle centered at point i and passing through point j that lies within the study area, represented by the thick parts of the arcs in Figure 1(a). Our implementation of Ripley's circumference method takes into account three possible spatial relations between the circle and the edge of a rectangular study area illustrated in Figure 1(a), whereas some authors consider only some of them, or treat them individually. For a detailed discussion of this issue, see Hasse (1995). The edge-corrected estimator of the K-function is given by

This method can avoid the bias caused by edge effects for values of h that are less than the radius of the circle that circumscribes the study area; for example, for a square area with side s, that radius is s/√2, so the method can avoid the bias as long as h is less than 70.7% of the side of the area (Getis 1983).
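As an illustration of the weighting idea, the following minimal Python sketch approximates w_ij numerically for a rectangular study area by sampling angles around the circle, rather than using the explicit formulae. The function names, the number of sampled angles, and the estimator form A/n² times the sum of 1/w_ij over pairs within distance h are assumptions made for this sketch (a commonly used form of Ripley's estimator), not a reproduction of our implementation.

```python
import math


def ripley_weight(x, y, radius, width, height, n_angles=3600):
    """Approximate w_ij: the share of the circle of the given radius centred
    at (x, y) that lies inside the rectangle [0, width] x [0, height]."""
    if radius == 0.0:
        return 1.0
    inside = 0
    for k in range(n_angles):
        # Sample the circle at evenly spaced mid-interval angles.
        theta = 2.0 * math.pi * (k + 0.5) / n_angles
        px = x + radius * math.cos(theta)
        py = y + radius * math.sin(theta)
        if 0.0 <= px <= width and 0.0 <= py <= height:
            inside += 1
    return inside / n_angles


def k_ripley(points, h, width, height):
    """Edge-corrected K estimate under the assumed form
    K(h) = A / n^2 * sum over ordered pairs (i, j), i != j, of I(d_ij <= h) / w_ij.
    Assumes h is below half the shorter side, so every weight is at least
    about 0.25 and the division is safe."""
    n = len(points)
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            d = math.hypot(xj - xi, yj - yi)
            if d <= h:
                total += 1.0 / ripley_weight(xi, yi, d, width, height)
    return width * height * total / (n * n)
```

A point in the interior receives weight 1 for small radii, a point on an edge about 0.5, and a point at a corner about 0.25, so pairs near the boundary count more heavily in the sum.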

100 / Geographical Analysis

For such simple shapes as circles and rectangles, w_ij can be given by explicit formulae (Cressie 1991). Although it is difficult to derive w_ij analytically for an arbitrarily shaped study area, it would still be possible to derive it numerically using GIS.

In addition to Ripley's circumference method, which is specifically designed for the K-function, there are several edge correction methods that are generally used in spatial data analysis (Ripley 1979; Cressie 1991; Bailey and Gatrell 1995). One is a toroidal edge correction, which assumes that the top and the left of the study area are connected to the bottom and the right, respectively, as if the study area were a torus. This method is only available for rectangular study areas; in practice, it is implemented by surrounding the original study area with eight identical copies of it as in Figure 1(b). Since this method is based on an assumption that the point pattern outside of the study area is the same as within the study area, it has the potential risk of bias, especially when a cluster exists close to the study area edge. Thus, when using the toroidal method, one needs to carefully consider whether a phenomenon under study would satisfy that assumption.

Another general method of edge correction is a guard area correction, which constructs a guard area inside or outside of the study area and uses points in the guard area only as destinations in measuring distance between points as in Figure 1(c) and (d). Unlike the toroidal method, the guard area methods are applicable regardless of the shape of the study area, though the outer guard area method can be used only when data from outside the study area are available.

Although Ripley (1979) excluded the guard area method from his study because his preliminary study showed its substantially inferior performance, we decided to include the outer guard area method in our study because it could be the only available choice in some situations. For example, when a study area is not a simple rectangle, the toroidal method is not applicable, and Ripley's method could be too complicated without proper software or skilled programmers. We, however, exclude the inner guard area method because of its inevitable disadvantage resulting from the abandonment of information (i.e., points within the guard area). The seriousness of this disadvantage depends on the size of the guard area relative to the study area. In addition, our preliminary study showed that correcting edge effects by the inner guard area method does no better than applying no edge correction.

FIG. 1. Edge correction methods: (a) Ripley's circumference method; (b) toroidal method; (c) inner guard area method; (d) outer guard area method.
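The toroidal correction described above can be sketched in a few lines of Python. Instead of physically copying the rectangle eight times, this sketch wraps each coordinate difference around the rectangle, which gives the same nearest-copy distance as long as h is less than half the shorter side. The function names and the normalization A/n² are assumptions of the sketch.

```python
import math


def toroidal_distance(p, q, width, height):
    """Distance between p and q when opposite edges of the rectangle
    are treated as joined (the study area behaves like a torus)."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    # Going "around" the torus may be shorter than the direct route.
    dx = min(dx, width - dx)
    dy = min(dy, height - dy)
    return math.hypot(dx, dy)


def k_toroidal(points, h, width, height):
    """K estimate with toroidal edge correction, assuming the form
    K(h) = A / n^2 * number of ordered pairs within toroidal distance h."""
    n = len(points)
    count = 0
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j and toroidal_distance(p, q, width, height) <= h:
                count += 1
    return width * height * count / (n * n)
```

For example, in a 100 × 100 area the points (1, 50) and (99, 50) are 98 apart in the plane but only 2 apart on the torus, so they count as neighbours for any h of 2 or more.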

2.3. Statistical Tests

Although it is theoretically known that the K-function under CSR is equal to πh², knowledge of the sampling distribution of its estimate is required to test the significance of clustering/regularity detected in an observed point pattern (Bailey and Gatrell 1995). Since the theoretical derivation of this distribution is complicated by edge effects, Monte Carlo simulation is usually carried out to obtain criteria for testing the significance of the observed pattern. That is, the observed pattern is tested by comparison with the upper and lower envelopes obtained from Monte Carlo simulation of CSR. In the original study by Ripley (1976, 1977), the observed K-function values were examined by means of a plot of K(h) against distance h, along with the envelopes. Besag (1977) suggests plotting √(K(h)/π) against h so that the theoretical value of the K-function under the null hypothesis becomes a straight line. Following his suggestion, some authors use the L-function, L(h), defined as

$$\hat{L}(h) = \sqrt{\frac{\hat{K}(h)}{\pi}} - h, \tag{4}$$

whose theoretical value is constant and equal to zero. In what follows, we use this convention. To avoid problems associated with multiple testing at different spatial scales, a statistic, k, which indicates the overall discrepancy of the observed point pattern from CSR, is often used in addition to the values of K(h) determined for individual h. The overall statistic k is defined as follows (Cressie 1991):

$$k = \int_0^{h_0} \left(\sqrt{\hat{K}(h)} - \sqrt{\pi}\,h\right)^2 dh, \tag{5}$$

where h_0 is the maximum distance to be considered. The overall k is tested with a one-tail critical value obtained via Monte Carlo simulation.

3. METHODS

We examined the performance of the K-function analyses with the different edge correction methods discussed above, along with one without any edge correction (hereinafter, the non-correction method), via Monte Carlo simulation. We used a 100 × 100 square as the study area and a point intensity of λ = 0.01, so that the study area has 100 points on average. We also set the maximum distance to be examined to 40, and thus used an outer guard area with a width of 40. For the simulations explained below, we applied all four edge correction methods (i.e., the non-correction, Ripley's circumference, toroidal, and outer guard area methods) to the same set of point patterns in order to avoid a simulation bias. That is, we first generated a pattern within a 180 × 180 square with 324 (= 180² × 0.01) points on average, and then used the central 100 × 100 square as the study region and the rest as the outer guard area.

First, for each edge correction method, we created upper and lower envelopes of the estimated L(h) for individual h and the critical value of the overall k based on 2,000 simulated CSR patterns. The significance level chosen is 5%, so that the 100th and 1,901st largest L(h) were determined to be the upper and lower envelopes, respectively, and the 100th largest k was determined to be the critical value of k.

Next, we examined the statistical power of the envelopes and the critical values of k. We generated many types of clustering and regularity patterns and applied the K-function analyses with the four edge correction methods to them to determine the probability of detecting clustering/regularity (i.e., statistical power). Note that a clustering or regularity tendency was detected when the observed L-function value was greater than the upper envelope or less than the lower envelope, respectively.

A clustering pattern examined is characterized by an average number of children, c, for each parent and a radius, r, of a disk centered at each parent within which its children can be located. The procedure to generate a clustering pattern is as follows:

(1) set the simulation area to be a (180+2r) × (180+2r) square and the number of parents, p, to be 0.01(180+2r)²/c (recalling that λ = 0.01);
(2) locate p parents at random within the simulation area;
(3) assign each parent a number of children randomly chosen among c ± 10;
(4) assign each offspring a location randomly chosen within a disk of radius r centered at its parent;
(5) delete the parents and clip the central 180 × 180 square as the study area with the outer guard area.
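The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the number of children is drawn uniformly from the integers c − 10 to c + 10, the number of parents is rounded to the nearest integer, and the function and parameter names are hypothetical.

```python
import math
import random


def simulate_cluster_pattern(c, r, intensity=0.01, side=180.0, spread=10, rng=None):
    """Generate one clustering pattern on the side x side square
    (study area plus outer guard area), following steps (1) to (5)."""
    rng = rng or random.Random()
    sim_side = side + 2.0 * r                          # step (1): enlarged simulation area
    n_parents = round(intensity * sim_side ** 2 / c)   # rounding is an assumption
    points = []
    for _ in range(n_parents):
        # Step (2): parent placed uniformly at random in the simulation area.
        px = rng.uniform(0.0, sim_side)
        py = rng.uniform(0.0, sim_side)
        # Step (3): children count, assumed uniform on c - spread .. c + spread.
        n_children = rng.randint(c - spread, c + spread)
        for _ in range(n_children):
            # Step (4): uniform location in the disk of radius r around the parent.
            rho = r * math.sqrt(rng.random())
            theta = rng.uniform(0.0, 2.0 * math.pi)
            # Shift by r so the central square becomes [0, side] x [0, side].
            x = px + rho * math.cos(theta) - r
            y = py + rho * math.sin(theta) - r
            # Step (5): parents are never stored; keep only children in the square.
            if 0.0 <= x <= side and 0.0 <= y <= side:
                points.append((x, y))
    return points


def study_area_points(points, side=180.0, guard=40.0):
    """Return the points in the central study area, shifted to start at (0, 0)."""
    hi = side - guard
    return [(x - guard, y - guard) for x, y in points
            if guard <= x <= hi and guard <= y <= hi]
```

Taking the square root of a uniform random number for the radial distance makes the children uniform over the disk's area rather than concentrated near the parent.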

We studied a clustering scenario with c = 25 and r = (20, 30, 40, 50, 60, 70). Since this procedure does not restrict the number of points within the study area, the study area could have a much smaller or larger number of points than expected. We accepted such a pattern except when it had either 1 or 2 points within the study area; the K-function cannot be calculated in the former case, and in the latter case our preliminary study showed that the resulting K-function would perform very differently from the other cases. Note that such extreme cases are very rare (e.g., only 1 of 1,000 simulations when c = 25 and r = 20) and the average number of points within the study area is about 100 for every radius r.

For generating a regularity pattern, we adopted the sequential spatial inhibition process (Kaluzny et al. 1997):

(1) locate a point within a 180 × 180 square randomly;
(2) if the assigned location is within a distance, d, from any of the previously located points, ignore the new point;
(3) repeat (1) and (2) until 324 points are obtained;

where the scale of regularity d ranges from 2 to 7, with an interval of 1. We examined 6 clustering and 6 regularity patterns with 1,000 simulations for each pattern, and they are all stationary as required by an assumption of the K-function analysis. Sample realizations of the random, clustering, and regularity patterns are shown in Figure 2.

4. RESULTS

4.1. The Simulation Envelopes and the Critical Value of k

Figure 3 shows the envelopes of L(h) determined by 2,000 random simulations, where h ranges from 1 to 40. For the sake of simplicity, we will use "Non," "Ripley," "Torus," and "Outer" to represent the edge correction methods in the figures and tables. As is well known, the non-correction method performs differently from the others; its envelopes are badly biased downward from the horizontal axis, which corresponds to the theoretical value of L(h), whereas the other methods have the axis in the midst of their upper and lower envelopes. The biased envelopes, however, are not necessarily a problem as long as one's purpose is not to estimate the L-function (or K-function) value itself but to detect clustering/regularity in a point pattern, because both L(h) for the observed pattern and the envelopes are biased in the same manner.

FIG. 2. Sample realization of point patterns: (a) randomness; (b) clustering (r = 40); (c) regularity (d = 5).

FIG. 3. The upper and lower envelopes (5% significance level; 2,000 realizations). Legend: Non, Ripley, Torus, Outer.

FIG. 4. The width of the envelopes (5% significance level; 2,000 realizations).

For the purpose of pattern detection, what may be more important is the width of the envelopes, which is related to the amount of variance in the estimated values of L(h). Intuitively, an edge correction method with wide envelopes is a method that yields more widely fluctuating estimates and may have lower statistical power under clustering and regularity alternatives. The width of the envelopes shown in Figure 4 reveals that the toroidal method is the most stable, followed by the Ripley method and the non-correction method, in this order. Figure 4 also suggests that there is not much difference among the methods at a small scale, say, up to h = 15, except for the outer guard area method, although the toroidal method actually decreases its envelope width beyond that scale. The variances of the estimated values of K(h), not presented here, give the same suggestion. This may be because the toroidal method uses exactly the same information repeatedly so that it has less variability. Note that the non-correction method has narrower envelopes than the outer guard area method at every scale.

Table 1 shows the critical value and variance of the overall k. The variance has the same tendency as the width of the envelopes discussed above, with the exception of

the appreciably high value of the non-correction method. This is because the overall k accumulates deviations of the estimated K(h) from the theoretical K(h) as shown in equation (5), and thus the non-correction method, whose estimated K(h) is biased downward, does not show a good performance. Therefore, if one uses only the overall k with the non-correction method, one may get a misleading result. A possible adjustment is to replace the theoretical K(h) in equation (5) with the average of the estimated $\hat{K}(h)$, $\bar{K}(h)$. That is,

$$k' = \int_0^{h_0} \left(\sqrt{\hat{K}(h)} - \sqrt{\bar{K}(h)}\right)^2 dh. \tag{7}$$

The resulting critical value and variance of k' for the non-correction method are also presented in Table 1, showing a clear improvement. Hence we used k' instead of k for the non-correction method in examining its statistical power in the next section.
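To make the difference between k and k' concrete, the following Python sketch evaluates both statistics on a grid of distances with the trapezoidal rule and picks a one-tail critical value by rank. The grid, the trapezoidal integration, and all function names are assumptions of this sketch; only the definitions of the statistics and the "100th largest of 2,000" rule come from the text.

```python
import math


def overall_k(k_hat, k_ref, hs):
    """Trapezoidal approximation of the integral over hs of
    (sqrt(K_hat(h)) - sqrt(K_ref(h)))^2, where k_ref is either the
    theoretical CSR curve (statistic k) or a simulated mean (statistic k')."""
    sq = [(math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(k_hat, k_ref)]
    return sum(0.5 * (sq[i] + sq[i + 1]) * (hs[i + 1] - hs[i])
               for i in range(len(hs) - 1))


def csr_reference(hs):
    """Theoretical K(h) = pi * h^2 under CSR."""
    return [math.pi * h * h for h in hs]


def mean_reference(simulated_k_hats):
    """Pointwise average of the estimated K(h) curves, used as the reference for k'."""
    return [sum(col) / len(col) for col in zip(*simulated_k_hats)]


def critical_value(simulated_stats, alpha=0.05):
    """One-tail critical value: with 2,000 values and alpha = 0.05,
    this is the 100th largest simulated statistic."""
    rank = round(alpha * len(simulated_stats))
    return sorted(simulated_stats, reverse=True)[rank - 1]
```

With an uncorrected estimator, every simulated curve lies below πh², so each CSR pattern already accumulates a large deviation under k; measuring deviations from the simulated mean instead removes that shared bias.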

TABLE 1
The Critical Value and Variance of the Overall Statistic k (5% significance level; 2,000 realizations)

Method                       Critical value of k   Variance of k
Non by Eq. (5) (i.e., k)                    2152          194489
Non by Eq. (7) (i.e., k')                    162            3699
Ripley                                        81             692
Torus                                         44             218
Outer                                        317           12281

4.2. The Probability of Detecting Clustering/Regularity in a Clustering/Regularity Pattern

FIG. 5. The statistical power of cluster detection (c = 25, r = 40).

As an example, the results for clustering radius r = 40 are shown in Figure 5, where the vertical axis indicates the statistical power of clustering detection, i.e., the probability of detecting clustering. With the exception of small scales, the non-correction, Ripley's, and toroidal methods detect clustering patterns more than 95% of the time up to h = 15, whereas the outer guard area method has power of less than

88%. The power of all methods decreases as h increases, and the non-correction method shows this tendency most strongly.

Figure 6(a) demonstrates the statistical power of the non-correction method for different clustering radii r, where the solid and broken lines represent the individual L(h) and the overall statistic k (in this case, k'), respectively. As the clustering radius r increases, both L(h) and the overall k' become less likely to detect clustering. Except for the radius r = 30, the individual L(h) outperforms the overall k' at certain ranges of h, though the difference between the two is smaller at those ranges than the difference where the overall k' outperforms L(h). This result is similar to that of Tango (2000), who finds that a composite clustering statistic defined over a range of spatial scales is generally superior to the same statistic defined at a particular geographic scale.

Figure 6(b) shows the power of Ripley's method, illustrating that the range over which the individual L(h) is superior to the overall k is smaller than in the case of the non-correction method, as is the magnitude of any advantage of L(h) over k over that range. Results for the other two methods are very similar to Ripley's method. We also looked at another clustering scenario with c = 50 and r = (20, 30, 40, 50, 60, 70) and found qualitatively similar results.

Figure 7 compares the statistical power of regularity detection for the four methods when the scale of regularity d is 5. Note that in Figures 7 and 8 h = 1 is omitted because these regularity patterns by their definition always result in L(1) = -1, which is exactly the same as the lower envelopes, and so no regularity pattern will be detected at h = 1. Here again, the non-correction, Ripley's, and toroidal methods appear to be superior to the outer guard area method. The three methods perform similarly to one another, while the toroidal method becomes slightly more powerful than the others as h increases.
All four methods have their highest power around the true scale of the pattern, and they lose their power rapidly when h exceeds the true scale. This tendency is most clear for the outer guard area method. Figure 8 demonstrates the powers of the Ripley and toroidal methods for various scales of regularity. There is little difference between them, except that for the regularity pattern with d = 3 the Ripley method has notably low performance of the overall k where the individual L(h) has its maximum power. In the case of the non-correction and outer guard area methods, the powers of the overall k' and k are only about 7% and 4%, respectively.

FIG. 6. The statistical power of cluster detection (r = 30, 50, and 70): (a) non-correction method; (b) Ripley's method. NOTE: the power of the overall statistic k with Ripley's method is 100% for r = 30, so that the line for "k30" in (b) is not visible because it overlaps the horizontal frame line of the diagram.

FIG. 7. The statistical power of regularity detection (d = 5).

FIG. 8. The statistical power of regularity detection (d = 3, 5, and 7): (a) Ripley's method; (b) toroidal method. NOTE: the power of the overall statistic k is 100% for d = 5 and 7, so that the lines for "k5" and "k7" are not visible because they overlap the horizontal frame line of the diagram.

The results above indicate that, whether an observed pattern has clustering or regularity tendencies, the individual L(h) outperforms the overall k only when h is close enough to the true scale of the observed patterns. This by itself is not surprising, but the finding that the overall k is only slightly weaker than the individual L(h) at the true scale of clusters, coupled with its large advantage at other scales, makes it an attractive choice.

Table 2 shows the power of the overall statistic k for various scales of clustering and regularity. For the purpose of comparison, both k' and k are shown for the non-correction method. The overall k for the Ripley and toroidal methods appears to be relatively robust to the scale of patterns. Differences between the two methods are found at larger scales of h. The toroidal method outperforms the Ripley method for regularity detection, whereas the Ripley method is only slightly better than the toroidal method for clustering detection. As for the non-correction method, although the overall statistic k' adjusted as in equation (7) clearly improves on the original k, it still has lower power than the two methods above. The outer guard area method yields the least power with few exceptions. Comparisons between Table 2(a) and (b) and between Figures 6 and 8 imply that the power of the K-function method tends to be less sensitive to the scale of clustering than to that of regularity. This tendency may be explained in relation to both the nature of the K-function and a characteristic of the clustering patterns simulated in this paper. Since the K-function is basically the number of points within distance h of an arbitrary point, it can hardly distinguish clustering/regularity from randomness when h is appreciably larger than the true scale of observed patterns.
This scale effect may be more critical to regularity detection than to clustering detection because regularity patterns intuitively have a smaller range of possible scales than clustering patterns. Another possible explanation is that it could happen that some clusters overlap with one another to form denser clusters, because no restriction is imposed on relative locations of parents.
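The way the envelopes and the per-distance powers reported in this section fit together can be sketched in Python as follows. The sketch assumes each simulated pattern has already been reduced to a list of L(h) values on a common grid of distances; the function names and data layout are hypothetical, while the rank rule (100th and 1,901st largest of 2,000 at the 5% level) follows Section 3.

```python
def envelopes(simulated_l, alpha=0.05):
    """Upper and lower envelopes of L(h) from CSR simulations.
    simulated_l holds one list of L(h) values per simulated pattern."""
    n = len(simulated_l)
    rank = round(alpha * n)
    upper, lower = [], []
    for values in zip(*simulated_l):        # all simulated values at one distance h
        ordered = sorted(values, reverse=True)
        upper.append(ordered[rank - 1])     # e.g., 100th largest of 2,000
        lower.append(ordered[n - rank])     # e.g., 1,901st largest of 2,000
    return upper, lower


def power_by_distance(pattern_l, envelope, clustering=True):
    """Share of test patterns flagged at each distance: L(h) above the upper
    envelope for clustering, or below the lower envelope for regularity."""
    n = len(pattern_l)
    power = []
    for values, bound in zip(zip(*pattern_l), envelope):
        if clustering:
            hits = sum(1 for v in values if v > bound)
        else:
            hits = sum(1 for v in values if v < bound)
        power.append(hits / n)
    return power
```

Passing the upper envelope with clustering=True gives a curve of the kind plotted for cluster detection, and passing the lower envelope with clustering=False gives the corresponding curve for regularity detection.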

TABLE 2
The Statistical Power of the Overall Statistic k

(a) clustering patterns: probabilities of finding clustering (%)

Clustering radius      20      30      40      50      60      70
Non k'                100    99.4    93.1    78.5    60.2    46.9
Non k                80.2    30.3    10.7     3.9     2.5     4.4
Torus                 100    99.9    97.6    88.2    75.1    58.5
Ripley                100     100    97.4    89.1    75.8    62.3
Outer                 100    95.4    83.1      71    58.3    52.5

(b) regularity patterns: probabilities of finding regularity (%)

Scale of regularity     2       3       4       5       6       7
Non k'                3.7     6.7    28.1     100     100     100
Non k                 5.8     5.8     7.5    13.3    26.1    48.3
Torus                22.2    92.1     100     100     100     100
Ripley                8.3    44.9     100     100     100     100
Outer                 4.9     3.7     6.8    22.8    99.5     100

In summary, the Ripley and toroidal methods proved themselves to be the most effective edge correction methods, whereas the outer guard area method turned out to be less effective than even the non-correction method. This lower performance of the outer guard area method may result from the relatively higher variability of information used because of its reference to points within the guard area, while the other methods only use the information from the study area.

5. CONCLUSION

This paper compared four edge correction methods, including one without any correction, that are generally applied to the K-function method. The comparison focused on their statistical powers of clustering/regularity detection, which were obtained by means of Monte Carlo simulation. The main findings are summarized as follows.

First, it appears that the Ripley and toroidal methods perform better than the others in terms of both the individual L(h) and the overall k.

Second, as long as an analysis aims to detect patterns in an observed point pattern, and not to estimate K(h) (or L(h)) itself, the non-correction method does not have any inherent disadvantage. Although the original overall k with the non-correction method has obviously deficient power compared to the other methods, the modified overall statistic k' introduced in this paper can avoid this problem. As shown in Figures 6 and 8 and Table 2, the non-correction method has higher power than the outer guard area method. These two points imply that the outer guard area method is the least efficient method and that one would be better off not doing any edge correction than using the outer guard area method. However, it is obvious that the power of the guard area method depends on the relative size of the whole study area compared to the width of the guard area, which is determined by the maximum distance to be examined; if the study area is large enough, the outer guard area method might work no worse than the other methods.

Third, the overall k outperforms the individual L(h) at most scales h, except when h is very close to the true scale of the observed pattern. Even in that case, the excess power of the individual L(h) over the overall k is less than that of the overall k over L(h) at the rest of the scale range.

Although this paper concentrated on the comparison of edge correction methods, there are manifold factors that may affect the strength of edge effects: size and shape of the study area, intensity of points, relative location of clusters, etc. For example, clusters located close to the edge of the study area contribute strongly to edge effects. Moreover, the toroidal method, which the results in this paper suggest as one of the best edge correction methods, might not be suitable when all points in the study area form only one cluster existing at the edge, because the assumption that the outside of the study area still has the same pattern could hardly be satisfied. Since edge effects are a critical problem in analyzing spatial data, it will be worthwhile to investigate further how the other factors mentioned above would affect the results obtained in this paper.

LITERATURE CITED

Bailey, T. C., and A. C. Gatrell (1995). Interactive Spatial Data Analysis. Harlow: Longman.
Barot, S., J. Gignoux, and J-C. Menaut (1999). Demography of a Savanna Palm Tree: Predictions from Comprehensive Spatial Pattern Analysis. Ecology 80(6), 1987-2005.
Bartlett, M. S. (1964). The Spectral Analysis of Two-Dimensional Point Process. Biometrika 51(3/4), 299-311.
Besag, J. E. (1977). Comment on "Modelling Spatial Patterns" by B. D. Ripley. Journal of the Royal Statistical Society. Series B 39(2), 193-95.
Besag, J., and J. Newell (1991). The Detection of Clusters in Rare Diseases. Journal of Royal Statistical Society. Series A 154(1), 143-55.
Clark, P. J., and F. C. Evans (1954). Distance to Nearest Neighbor as a Measure of Spatial Relationships in Population. Ecology 35(4), 445-53.
Cliff, A. D., and J. K. Ord (1975). Model Building and the Analysis of Spatial Pattern in Human Geography. Journal of Royal Statistical Society 37(3), 297-348.
Coomes, D. A., M. Rees, and L. Turnbull (1999). Identifying Aggregation and Association in Fully Mapped Spatial Data. Ecology 80(2), 554-65.
Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley & Sons.
Diggle, P. J. (1979). On Parameter Estimation and Goodness-of-Fit Testing for Spatial Point Patterns. Biometrics 35, 87-101.
Getis, A. (1983). Second-Order Analysis of Point Patterns: The Case of Chicago as a Multi-center Urban Region. The Professional Geographer 35(1), 73-80.
Haining, R. (1990). Spatial Data Analysis in the Social and Environmental Sciences. Cambridge: Cambridge University Press.
Hasse, P. (1995). Spatial Pattern Analysis in Ecology Based on Ripley's K-function: Introduction and Methods of Edge Correction. Journal of Vegetation Science 6, 575-82.
He, F., and R. P. Duncan (2000). Density-dependent Effects on Tree Survival in an Old-Growth Douglas Fir Forest. The Journal of Ecology 88(4), 676-88.
Jones, A. P., I. H. Langford, and G. Bentham (1996). The Application of K-function Analysis to the Geographical Distribution of Road Traffic Accident Outcomes in Norfolk, England. Social Science and Medicine 42(6), 879-85.
Kaluzny, S. P., S. C. Vega, T. P. Cardoso, and A. A. Shelly (1997). S+SpatialStats User's Manual for Windows and UNIX. New York: Springer.
Levine, N. (1999). Crime Stat: A Spatial Program for the Analysis of Crime Incident Locations. Ned Levine & Associates, Annadale, Va., and the National Institute of Justice, Washington, D.C. August 1999. (http://www.icpsr.umich.edu/NACJD/crimestat.html)
Pellegrini, P. A., and S. Reader (1996). Duration Modeling of Spatial Point Patterns. Geographical Analysis 28(3), 219-43.
Rechel, J. L., and M. C. Nicholson (1994). Spatial Pattern Analysis of Mule Deer Locations in the San Bernardino Mountains, California. GIS/LIS '94, 643-57.
Ripley, B. D. (1976). The Second-Order Analysis of Stationary Point Process. Journal of Applied Probability 13, 255-66.
Ripley, B. D. (1977). Modelling Spatial Patterns. Journal of the Royal Statistical Society. Series B 39(2), 172-92.
Ripley, B. D. (1979). Tests of "Randomness" for Spatial Point Patterns. Journal of the Royal Statistical Society. Series B 41(3), 368-74.
Rogerson, P. A. (2001). Statistical Methods for Geography. London: Sage Publications.
Tango, T. (2000). A Test for Spatial Disease Clustering Adjusted for Multiple Testing. Statistics in Medicine 18, 191-204.