QUANTIFYING INFORMATION IN LINE GENERALIZATION

Sarah E. Battersby and Keith C. Clarke
Department of Geography, University of California at Santa Barbara, Santa Barbara, CA 93106 USA

Battersby, S. E. and Clarke, K. C. (2003) Quantifying information content in line generalization. Proceedings of the 21st International Cartographic Conference (ICC), Durban, South Africa, 10-16 August. pp. 118-126.

1. INTRODUCTION

The origins of information theory lie in the work of Hartley (1) and Nyquist (2). Further work was done by Shannon (3), whose seminal paper examined the theory of the flow of information in communications systems. In geographic research, information theory was being explored as early as 1955, with Rosenberg's examination of information flows in photogrammetric systems (4); later contributions came from Marchand (5), Batty (6), and Thomas (7). In the cartographic realm, application of information theory has primarily focused on the type, location, and number of graphic elements (8, 9). Unfortunately, with this type of analysis there is an issue with subjectivity of readers, and a lack of congruence between the elements on the map and the amount of information extracted from those elements (10). Recent reconsiderations of the value of the approach include Neumann (11), Tobler (12), Li and Huang (13), and Battersby and Clarke (14). In general, however, attention to the theory in cartography peaked in the 1970s and then languished until a resurgence beginning in the late 1990s.

Of the works mentioned above, the most critical to information theory is that of Shannon (3), who quantified the potential rate of communication across an imperfect communication channel. In this work, a value was assigned to the quantity of information transferred from a transmitter to a receiver; the quantity of information depends on the number of possible states for each unit of information, so when there are more possible states, the information potential is greater. While Shannon's work was based on quantification of the potential rate of communication, the study of information theory is not limited to this area. In this paper, we examine the Coordinate Digit Density (CDD) function as a method for applying principles of information theory to explore the effects of line generalization on geographic datasets.
This analysis highlights issues in the state-sequence dependency inherent in geographic datasets, examines changes in information content due to different generalization tolerances, and explores the differences between the two generalization methods included in ESRI's ArcInfo Workstation, Bendsimplify and Pointremove (Douglas-Peucker).

2. THE CDD DEFINED

The Coordinate Digit Density (CDD) function relies on the geographic primitive of location, as defined by a location's numeric coordinates¹. These coordinates can be reduced to a series of numeric variables, each possessing a finite number of possible states. The coordinate values, their frequencies of occurrence, and the number of possible states can be used to calculate a measure of information content for a geographic feature. Because the information content value is based on the probability of a particular state's occurrence, without knowledge of the randomness or redundancy of the digit distribution it must be assumed that the digits occur with uniform probability (i.e., with ten different states, each will occur exactly ten percent of the time). With this in mind, we operate under the following assumptions about geographic coordinates:

1. Distribution of digits will be perfectly random.
2. Each digit consists of a known, finite number of states.
3. There are no inter-digit dependencies.
4. Information content is contained entirely within the coordinates of the feature.
5. The degree of non-randomness is equivalent to information content.

Operating under our first assumption of randomness of digit distribution, the probability P of any individual character d occurring in digit place n can be measured as:

¹ We realize that not all location coordinates are numeric (e.g., Military Grid), nor are they restricted to only x and y coordinates; while the applications examined in this paper are based on standard x,y coordinates, the CDD is not restricted to this type of coordinate.

(1)    P(d_n) = 1 / s_n

with s_n being the number of states possible in digit place n. For example, in a standard decimal digit place (s = 10), the probability of any particular digit occurring would be 0.1. Through measurement of state occurrences across all points in a feature, O(d_n), the overall variation of the digit can be calculated. This amount can be attributed to each digit place n in a coordinate as a value D_n:

(2)    D_n = O(d_n) − P(d_n)

Summation across all possible states gives a total digit density for a digit place:

(3)    H(n) = Σ_{d=1}^{s} |D_n(d)|

In Shannon's formulae, H is a measure of information potential, or how much we do not know. Here, in our equation, the information value H measures redundancy, or how much we do know. This value varies over the n digit places in the coordinate. Note that negative contributions to the total are possible, for example when a digit is absent; thus we sum the magnitudes. Some contributions will be positive, when the proportion of the actual state frequencies is nonrandom across the coordinates.

For example, in the coordinates 6050153.86, 1983799.77; 6051217.62, 1981421.08; 6053435.88, 1983925.92, the values for the first digit in the Eastings are all the same. Out of the ten possible values for this digit place, only one occurs, so H is calculated as (1.0 − 0.1) + 9 × |0.0 − 0.1|, or 1.8. This is the sum of the absolute values of the differences between expected and actual occurrence: since the digit "6" occurs 100% of the time, its actual occurrence is 1.0 where the expected occurrence was 0.1 (1/10 of the possible digits), and the other nine possible digits had an actual occurrence rate of 0%. At the opposite extreme, if all digits occur at the same frequency, then H = 0; this would appear as 10 × |0.1 − 0.1| = 0. So for each decimal digit place n, there is a value between 0 and 1.8 that corresponds to the degree of non-randomness, and therefore to the information quantity.

As expressed above, this value is not truly independent of the adjacent digit values. For a longitude, for example, the first digit has two states and the second ten. However, if the longitude's first digit is "1", then there are only nine second-digit states, since 190 degrees is not possible. Similarly, if the first two digits are 18, then the third digit can only be 0. While Shannon did consider such dependencies, particularly with the intent of reducing data volume, in the case of coordinates it is the spatial co-dependence that we seek to measure.
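The worked example above can be reproduced in a few lines. The following is a minimal Python sketch of our own (the function name digit_place_H is ours, not part of any published implementation); it computes H(n) for one digit place from the observed digit frequencies:

```python
from collections import Counter

def digit_place_H(digits, s=10):
    """H(n): sum over all s possible states of |O(d_n) - P(d_n)|,
    with P(d_n) = 1/s under the uniform-randomness assumption."""
    counts = Counter(digits)
    total = len(digits)
    expected = 1.0 / s
    return sum(abs(counts.get(str(d), 0) / total - expected)
               for d in range(s))

# First digit of the Eastings in the worked example: all "6"
eastings = ["6050153.86", "6051217.62", "6053435.88"]
first_digits = [e[0] for e in eastings]
print(digit_place_H(first_digits))  # 1.8, the maximum for s = 10
```

A perfectly uniform digit place (each of the ten digits occurring 10% of the time) returns 0, matching the lower bound described above.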
Accordingly, we use only one simple value s for each n, given by the total possible number of states. One advantage of this expedience is that for any point set we have an absolute metric for H. Additionally, in terms of data storage and processing, from a computer's perspective any decimal digit place containing a digit has ten possible states, regardless of its geographic structure. With this function we can calculate a value for overall information content for all points in a coordinate set. Where traditional Shannonian information theory calculations revolve around the "surprise" or unexpected events in data, the CDD measures redundancy, or the unanticipated order and structure appearing in a coordinate set. Just as information content in Shannon's calculations relied on the probability of multiple events occurring in conjunction with each other, the calculation of total information content in a geographic dataset is a simple summation across digit places to attain a value for all points C in the set:

(4)    I(C) = Σ_n H(n)
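Equations (1) through (4) can be combined into a short computational sketch. The Python below is our own illustration, not the authors' program; it assumes the coordinates have already been written as equal-length strings of decimal digits (signs and decimal points stripped) so that digit places line up:

```python
from collections import Counter

def cdd_information(coord_strings, s=10):
    """I(C): sum of H(n) over every digit place n.
    Assumes all strings are equal-length runs of decimal digits."""
    expected = 1.0 / s
    total = 0.0
    for n in range(len(coord_strings[0])):
        counts = Counter(c[n] for c in coord_strings)
        # H(n) = sum of |observed frequency - expected frequency|
        total += sum(abs(counts.get(str(d), 0) / len(coord_strings) - expected)
                     for d in range(s))
    return total
```

For example, a set in which every coordinate string is identical contributes the maximum 1.8 per digit place, so three-digit strings yield I(C) = 5.4.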

This metric permits calculation of total information content for any set of geographic coordinates – points, lines, polygons, individual features, and entire maps or datasets. In the following sections, an example of application to line data is examined.

3. ABOUT THE DATASET

The dataset used to examine the Coordinate Digit Density (CDD) function is a Koch Island (a quadric Koch teragon; Figure 1). This fractal dataset was selected for two reasons: generalization of fractal shapes is similar to the simplification of features such as coastlines, and the dataset can be precisely controlled so that the intricacies of the generalization can be examined more easily than they could be with a less familiar dataset. The Koch Island used in this exploration was created so that the object was centered at (0,0), each line segment was exactly one unit in length, and the coordinates of each node were integers (stored to seven decimal digits of precision).
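The paper does not give its construction procedure, but one standard way to build a quadratic Koch island with unit segments and integer vertices is the classic L-system (axiom F+F+F+F, rule F → F+F−F−FF+F+F−F, 90° turns). The Python sketch below is our own and starts at the origin rather than centering the object at (0,0):

```python
def quadric_koch(iterations):
    """Quadratic Koch island via an L-system: rewrite each segment F,
    then walk the string with a turtle on the integer grid."""
    s = "F+F+F+F"  # axiom: a unit square
    for _ in range(iterations):
        s = s.replace("F", "F+F-F-FF+F+F-F")
    x, y, dx, dy = 0, 0, 1, 0
    pts = [(x, y)]
    for c in s:
        if c == "F":                 # step one unit forward
            x, y = x + dx, y + dy
            pts.append((x, y))
        elif c == "+":               # turn left 90 degrees
            dx, dy = -dy, dx
        elif c == "-":               # turn right 90 degrees
            dx, dy = dy, -dx
    return pts
```

Every iteration multiplies the segment count by eight while keeping each segment exactly one unit long and every node at integer coordinates, matching the properties of the test dataset.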

Figure 1. A Koch Island

One of the features of the CDD is that the exact distribution of values for each digit place can be examined; an example distribution for the Koch Island is shown in Figure 2. The character of the dataset is plainly visible in this digit distribution. The frequencies of positive and negative coordinates are equal, with only a small percentage of coordinates having no sign (i.e., equal to zero). There are only two possible values in the first digit place, 0 or 1, and by the second digit place the values have become almost random in distribution. In larger datasets, the distribution shows definite state-sequence dependencies and areas of true randomness, especially noticeable in the decimal places, as can be seen in Figure 3.

Figure 2. Distribution of digits in the Koch teragon. Since the object is symmetrical, eastings and northings show the same digit distribution. The digit place is given on the horizontal axis, and the digit value is given on the vertical axis. Each cell represents the frequency of occurrence (percent) of that value in a particular digit place.
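A digit-distribution table of the kind shown in Figure 2 can be computed directly. The following Python sketch is our own illustration (again assuming aligned, digits-only coordinate strings):

```python
from collections import Counter

def digit_distribution(coord_strings):
    """Percent frequency of each digit value (0-9) per digit place,
    analogous to the cells of the Figure 2 table."""
    rows = []
    for n in range(len(coord_strings[0])):
        counts = Counter(c[n] for c in coord_strings)
        rows.append([100.0 * counts.get(str(d), 0) / len(coord_strings)
                     for d in range(10)])
    return rows  # rows[n][d] = percent of coordinates with digit d in place n
```

Each row of the result corresponds to one digit place (one column of Figure 2), making state-sequence dependencies visible as strongly non-uniform rows.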

Figure 3. Distribution of digits in the eastings of a dataset containing roads for Santa Barbara County, California.

In the next section, we go beyond the use of the CDD for basic characterization of datasets and explore its use in examining the effects of line generalization.

4. APPLICATION TO MAP GENERALIZATION

Generalization is a process that removes extraneous detail while maintaining the characteristics of a line. As the Coordinate Digit Density (CDD) function measures redundancy, or how much we know, the removal of extraneous data should increase the information content by reducing redundancy in the coordinate set. This hypothesis was tested by applying the two most common generalization algorithms in GIS, the Douglas-Peucker (15) method and ESRI's Bendsimplify (16). The Douglas-Peucker algorithm was designed to reduce the number of vertices needed to represent a line, while the Bendsimplify algorithm was designed more to preserve the shape of the original line (17).

Figure 4 shows the results of the two generalization algorithms on a Koch Island. While the final results of the generalization process are similar, the methods used to reach those results are quite different. When the information content of these generalized datasets is compared using the CDD, a pattern of increasing information content is seen with both generalization algorithms. As can be seen in Figure 5, however, the information content does not increase at the same rate for the two methods. While both datasets show increasing information content, there are several extreme spikes in the Bendsimplify data. Closer inspection of the effects of the line simplification shows that the Bendsimplify algorithm adds nodes in locations where there were previously none. When this happens, the effect of the precision of the original coordinates (all nodes were integers to seven decimal places of precision) is lost, and the increasing randomness in the decimal places of the data becomes a factor.
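For reference, the Douglas-Peucker method can be sketched in a few lines. This is our own recursive Python implementation of the published algorithm (15), not ESRI's code: the point farthest from the chord joining the endpoints is kept if its offset exceeds the tolerance, and the two halves are simplified recursively; otherwise all interior points are dropped.

```python
import math

def douglas_peucker(points, tolerance):
    """Douglas-Peucker line simplification on a list of (x, y) tuples."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    best_i, best_d = 0, -1.0
    for i in range(1, len(points) - 1):
        x, y = points[i]
        if length == 0:  # degenerate chord: fall back to point distance
            d = math.hypot(x - x1, y - y1)
        else:            # perpendicular distance from the chord
            d = abs(dy * (x - x1) - dx * (y - y1)) / length
        if d > best_d:
            best_i, best_d = i, d
    if best_d > tolerance:
        left = douglas_peucker(points[:best_i + 1], tolerance)
        right = douglas_peucker(points[best_i:], tolerance)
        return left[:-1] + right  # drop the duplicated split point
    return [points[0], points[-1]]
```

With a large tolerance a gently bent three-point line collapses to its endpoints; with a small tolerance all vertices survive.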

Figure 4. Generalization of a Koch Island

Figure 5. Difference in information content for the eastings of the Koch Island. Ten different levels of generalization are shown; as the tolerance increases, so does the information content.

It is particularly interesting to examine the specific causes of the data spikes through the distribution-of-digits diagram. While Figure 2 represented the distribution of digits before any generalization took place, Figure 6 shows the difference between the original and the generalized datasets. In terms of the effects of this distribution on the overall information content of the dataset, the contribution of each digit place can be examined, as seen in Figure 7. From this chart, the general influence of each digit place can be seen: in the original ungeneralized dataset the sign "digit" is the main contributor, the first whole-number digit is the main cause of the increase in information content in the dataset generalized with the Douglas-Peucker algorithm, and the decimal places are the main cause of the increase in the Bendsimplify dataset.

Figure 6. A comparison of digit distribution after generalization with the Douglas-Peucker and Bendsimplify methods. In both instances, a tolerance of 2.5 was used.

Figure 7. H(n) value for individual digit places for the original dataset (in black) and the dataset generalized with a tolerance of 2.5 using both Bendsimplify (in brown) and the Douglas-Peucker algorithm (in blue).

4.1 Application to coastline data

We now demonstrate the application of the CDD to a 1:24,000 line dataset for the central coast region of California. Instead of generalizing with the same tolerance for each dataset, we generalized to specific levels of point reduction: 50%, 20%, 10%, 5%, and 2% of the original points remaining. The effects of these generalizations can be seen in Figure 8. It should be noted that in order to attain these levels of reduction, very different tolerances were necessary for each generalization method (e.g., to reduce to 20% of the original points, the tolerance was 7 m using Douglas-Peucker and 250 m using Bendsimplify).
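Reaching a fixed percentage of points requires searching for the tolerance itself. Under the assumption that the number of retained vertices decreases monotonically as tolerance grows, a simple bisection suffices. The Python sketch below is our own; simplify stands for any generalization routine that maps (points, tolerance) to a reduced point list:

```python
def tolerance_for_target(simplify, points, target_fraction,
                         lo=0.0, hi=1000.0, iters=50):
    """Bisect on tolerance until simplify keeps roughly
    target_fraction of the original vertices."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        kept = len(simplify(points, mid)) / len(points)
        if kept > target_fraction:
            lo = mid   # too many points kept: raise the tolerance
        else:
            hi = mid   # too few points kept: lower the tolerance
    return (lo + hi) / 2.0
```

Because the kept-point count is a step function of tolerance, the search converges to the threshold nearest the requested fraction rather than hitting it exactly, which is why two algorithms (here, Douglas-Peucker versus Bendsimplify) can need such different tolerances for the same reduction.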

Figure 8. Effects of generalization on a coastline dataset.

Just as in the Koch Island data, information content in the coastline data increased with increasing generalization (Figure 9). The increase was most dramatic with Douglas-Peucker. The pattern of increase was similar for both Eastings and Northings, though, as would be expected since the North/South complexity of the coastline is greater, the total information contained in the Northings was greater than that in the Eastings.

Figure 9. Change in information content in the 1:24,000 coastline with generalization.

5. CONCLUSION

These results represent a preliminary analysis of two line generalization methods using the Coordinate Digit Density (CDD) function. The CDD is a set of formulae designed to examine the information content contained within the geographic coordinates that define the structure of a geographic dataset. In the CDD, a function loosely based on Shannon's mathematical theory of information has been defined and illustrated. This paper shows some of the functionality of the CDD and the computer program that implements it. This research has shown the exact portions of the overall dataset that were generalized, the extent to which they were reduced by generalization, and the effect of each digit on the overall information content of the dataset. The results are promising, and lead us to believe that the CDD may be useful as a method to "score" and compare datasets for a particular area using information content values. We can also imagine that the CDD may have applications in other areas, such as data compression and encryption.

6. REFERENCES

1. Hartley, R.V.L., Transmission of information. Bell Systems Technical Journal, 1928. 7: p. 535-563.
2. Nyquist, H., Certain factors affecting telegraph speed. Bell Systems Technical Journal, 1924. 3: p. 324-346.
3. Shannon, C.E., A mathematical theory of communication. Bell Systems Technical Journal, 1948. 27: p. 379-423, 623-656.
4. Rosenberg, P., Information theory and electronic photogrammetry. Photogrammetric Engineering, 1955. 21(4): p. 534-555.
5. Marchand, B., Information theory and geography. Geographical Analysis, 1972. 4: p. 234-257.
6. Batty, M., Spatial entropy. Geographical Analysis, 1974. 6: p. 1-32.
7. Thomas, R., Information statistics in geography, in CATMOG #31. 1981: Norwich.
8. Sukhov, V.I., Application of information theory in generalization of map contents. International Yearbook of Cartography, 1970. X: p. 41-47.
9. Pipkin, J.S., The map as information channel: Ignorance before and after looking at a choropleth map. The Canadian Cartographer, 1975. 12(1): p. 80-82.
10. Salichtchev, K.A., Some reflections on the subject and method of cartography after the sixth international cartographic conference. The Canadian Cartographer, 1973. 10: p. 106-111.
11. Neumann, J., The topological information content of a map / An attempt at rehabilitation of information theory in cartography. Cartographica, 1994. 31(1): p. 26-34.
12. Tobler, W., Introductory comments on information theory and cartography. Cartographic Perspectives, 1997. 27: p. 4-7.
13. Li, Z. and P. Huang, Quantitative measures for spatial information of maps. International Journal of Geographic Information Science, 2002. 16(7): p. 699-709.
14. Battersby, S. and K. Clarke, Information content in map generalization. In review.
15. Douglas, D.H. and T.K. Peucker, Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. The Canadian Cartographer, 1973. 10: p. 112-122.
16. Wang, Z., Manual versus automated line generalization, in GIS/LIS '96. 1996: Denver, CO.
17. Environmental Systems Research Institute, Map generalization in GIS: Practical solutions with workstation ArcInfo. 2000, Redlands, CA: ESRI Press.
