THE CREATION OF A NATIONAL MULTISCALE DATABASE FOR THE UNITED STATES CENSUS

Robert B. McMaster, Martin Galanda, Jonathan Schroeder, and Ryan Koehnen
Department of Geography, University of Minnesota, Minneapolis, MN 55455
[email protected]

ABSTRACT

Although considerable developments in automated generalization have taken place over the last thirty years, it is still difficult to solve generalization problems with off-the-shelf software due to the limited capability of the algorithms and the complexity of the databases. At the National Historical Geographic Information System (NHGIS) project at the University of Minnesota (http://www.nhgis.org/), work is currently underway to design a multiple-scale database at 1:150,000, 1:400,000, and 1:1,000,000 through the application of data models and generalization algorithms. The NHGIS is taking a two-fold approach, using both specific algorithmic techniques and object-oriented data modeling. Early results from the application of a mixture of standard and custom-tailored algorithms, such as the Douglas and Visvalingam routines, have been promising, especially along coastal areas. Examples of coastal generalization at a variety of scales are provided. The project will continue to develop the needed generalization algorithms as specific geographical conditions are encountered. Current work is identifying the specific constraints, such as distance between points and/or objects, and the specific algorithms needed for generalization at the various scales, and applying these within a comprehensive generalization framework that goes beyond the earlier tract-boundary-specific approach.

INTRODUCTION

Over the past twenty years, there has been a rapid growth in the area of geographic information systems (GIS), powerful computer-based methods for the acquisition, storage, analysis, and display of spatial data, and in the related creation of spatially-addressable data sets. Many of these spatial data relate to population statistics, including the Bureau of the Census' TIGER files (geocoded street and enumeration-unit files that allow for spatial analysis and mapping). GIS has significantly broadened the scope of questions that can be asked with geospatial data, has enabled a rapid growth in spatial analysis, and has popularized the use of mapping techniques for the display of spatial information. By integrating the wealth of spatially-referenced population data with powerful spatial-analytic capabilities, GIS has facilitated the rapid growth of geodemographic analysis, including geomarketing and many forms of population analysis in general. Examples of these spatial analyses include the assessment of environmental justice/racism at multiple spatial scales (regional, urban, community); the calculation of segregation indices, the evaluation of urban poverty, and the identification of concentrated poverty; the development of neighborhood indicators, including a multitude of economic and social measures based on population data; and the spatial/temporal analysis of census data.

Increasingly, researchers are attempting to use the census geographic base files for historical geodemographic analyses. For instance, after the 1990 census it was possible to document the changes in geodemographics between 1980 and 1990 using the 1990 TIGER files. A common application was mapping the change in minority populations between the two periods. However, because digital files are lacking for pre-1970 censuses, researchers are mostly constrained to two or three decades of temporal analysis. The development of digital geographic base files for the period 1790 to 1990 will allow a detailed analysis of population change, at much finer levels of resolution (especially the tract level), for most urban areas. Many potential research projects and application areas would benefit from the availability of such boundary files.

A major part of the NHGIS project involves the creation of a multiple-scale version of the database. As explained below, the multiscale project first attempted to develop specific measurement and generalization techniques for the generalization of tract boundaries. Although somewhat successful, a second, more recent effort has taken a more

comprehensive approach, identifying a series of constraints of generalization and applying specific algorithms, driven by these constraints, within a comprehensive generalization framework.

Project Goals

The overall goal of the project is to create, at the NHGIS site, a multiple-scale database of historical US census boundaries that census researchers can use according to their needs.

The Need for a Multiple-Scale Database

Users of the United States Census have a variety of needs. Certain researchers need to analyze census data at the state scale using county-level or even tract-level data. Others need to look at patterns at the regional or urban scale. Certain research applications even call for neighborhood-level analysis, mostly using block-group or block-level data. Before the creation of the NHGIS database, researchers could only use one scale/resolution for all purposes. Thus, the project will recreate the NHGIS database at three scales: 1:150,000 (basically the scale of the original TIGER data), 1:400,000, and 1:1,000,000. The 1:150,000 version was designed primarily for urban analyses, such as mapping poverty in the Twin Cities over time. The 1:400,000 version allows for a regional assessment of, for instance, the growth of suburbia. Finally, the 1:1,000,000 version is appropriate for state-level or multi-state analyses of census data.

Although the European literature contains a number of conceptual frameworks for automated map generalization, few have had as significant an influence on American workers as the models of Kilpelainen (1995) and Brassel and Weibel (1988). Kilpelainen (1995) developed alternative frameworks for the representation of multi-scale databases. Assuming a master cartographic database, called the Digital Landscape Model (DLM), she proposed a series of methods for generating smaller-scale Digital Cartographic Models (DCMs). The master DLM is the largest-scale, most accurate database possible, whereas secondary DLMs are generated for smaller-scale applications. Digital Cartographic Models, on the other hand, are the actual graphical representations, derived through generalization and symbolization of a DLM. In her model, each DCM, labeled Generalized version 1, 2, ..., n, is generated directly from the initial master database. A separate DLM is created for each scale/resolution, and a DCM is generated directly from each DLM. The master DLM is used to generate smaller-scale DLMs (model generalization), which are then used to generate a DCM at that level. In certain instances, "secondary" DLMs are used to generate "tertiary" DLMs. The assumption is that DCMs are generated on an as-needed basis (cartographic generalization). The additional complexity for the NHGIS project is that the boundary changes between decades must also be incorporated into the model (Figure 1).

Figure 1. The Digital Landscape and Digital Cartographic models.
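To make the DLM/DCM organization and the added decade dimension concrete, the following minimal Python sketch shows one possible way to structure such a multi-scale, multi-decade database. The class and function names (DLM, DCM, model_generalize, cartographic_generalize) and the placeholder generalization steps are illustrative assumptions, not part of the NHGIS implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DLM:
    """Digital Landscape Model: a geometric database at a given scale and decade."""
    scale: int     # scale denominator, e.g. 150_000
    decade: int    # census year, e.g. 2000
    features: list # boundary geometries (placeholder)

@dataclass
class DCM:
    """Digital Cartographic Model: a symbolized graphic derived from a DLM."""
    source: DLM
    layers: list = field(default_factory=list)

def model_generalize(dlm: DLM, target_scale: int) -> DLM:
    """Derive a smaller-scale DLM from a larger-scale one (model generalization)."""
    # Placeholder: apply simplification/elimination appropriate to target_scale.
    return DLM(scale=target_scale, decade=dlm.decade, features=list(dlm.features))

def cartographic_generalize(dlm: DLM) -> DCM:
    """Derive a DCM from a DLM on an as-needed basis (cartographic generalization)."""
    return DCM(source=dlm)

# Master DLMs per decade at the base scale; smaller-scale DLMs and DCMs derived on demand.
master = {year: DLM(scale=150_000, decade=year, features=[]) for year in range(1790, 2001, 10)}
dlm_400k_2000 = model_generalize(master[2000], 400_000)
dcm_400k_2000 = cartographic_generalize(dlm_400k_2000)
```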

The Localized Approach for Generalization

As part of the NHGIS project, a series of experiments was conducted to generalize both the tract boundaries of Hennepin County and the coastline of Florida. This approach included both the application of existing ArcGIS generalization algorithms and the design of specific algorithms. One set of experiments involved the design of specific measurements to ascertain the complexity of tract boundaries.

As an example of this work, Figure 2 depicts the application of the Douglas algorithm (Douglas and Peucker, 1973) for simplifying Hennepin County tract boundaries. It should be noted that this algorithm often works very well on natural features (although it can cause crossovers at certain scales) but less well on straight-line boundaries with many right angles, such as those found in cities. Note how the Douglas algorithm cuts off these right-angle features, which is unacceptable in an urban setting.
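For readers unfamiliar with the routine, the following short Python sketch implements the standard recursive Douglas-Peucker procedure referenced above. It is an illustrative implementation, not the NHGIS production code, and the function names are our own.

```python
import math

def perpendicular_distance(pt, start, end):
    """Distance from pt to the line through start and end."""
    (x, y), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:                      # degenerate segment
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, tolerance):
    """Simplify a polyline, keeping points farther than `tolerance` from the trend segment."""
    if len(points) < 3:
        return list(points)
    # Find the vertex farthest from the segment joining the endpoints.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > max_dist:
            max_dist, index = d, i
    if max_dist > tolerance:
        # Keep that vertex and recurse on both halves.
        left = douglas_peucker(points[: index + 1], tolerance)
        right = douglas_peucker(points[index:], tolerance)
        return left[:-1] + right
    # Otherwise all intermediate vertices are dropped.
    return [points[0], points[-1]]

# Example: a jagged line simplified with a tolerance of 1.0 map units.
line = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(douglas_peucker(line, 1.0))
```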

Measurement. In generalization, procedural measures are those needed to invoke and control the process of generalization. Such measures might include those to: (1) select a simplification algorithm, given a certain feature class; (2) modify a tolerance value along a feature as the complexity changes; (3) assess the density of a set of polygons being considered for agglomeration; (4) determine whether a feature might undergo a type change (e.g., area to point) due to scale modification; and (5) compute the curvature of a line segment to invoke a smoothing operation. Quality assessment measures, by contrast, evaluate both individual operations, such as the effect of simplification, and the overall quality of the generalization (e.g., poor, average, excellent). Several studies have reported on such mathematical/geometric measures, including Buttenfield (1991), Plazanet (1993), and McMaster (1986).
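The sketch below illustrates the kind of procedural measure described above, using vertex density and average angular change, measures of the sort evaluated in studies such as McMaster (1986), to adjust a simplification tolerance. The function names and thresholds are illustrative assumptions, not values from the NHGIS project.

```python
import math

def segment_lengths(points):
    return [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]

def vertex_density(points):
    """Vertices per unit length; a simple procedural measure of line detail."""
    length = sum(segment_lengths(points))
    return len(points) / length if length else float("inf")

def average_angularity(points):
    """Mean absolute turning angle (radians) at interior vertices; higher = more complex."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        a1 = math.atan2(y1 - y0, x1 - x0)
        a2 = math.atan2(y2 - y1, x2 - x1)
        turn = (a2 - a1 + math.pi) % (2 * math.pi) - math.pi   # wrap to [-pi, pi]
        angles.append(abs(turn))
    return sum(angles) / len(angles) if angles else 0.0

def choose_tolerance(points, base_tolerance):
    """Illustrative procedural rule: relax the tolerance on smoother, sparser lines."""
    if average_angularity(points) < 0.2 and vertex_density(points) < 0.5:
        return base_tolerance * 2.0
    return base_tolerance
```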

Figure 2. An example of census tract boundary simplification.

In order to calculate the intrinsic complexity of a boundary, a simple trendline was calculated (Figure 3). The trendline is computed by connecting the inflection points along a curve; when its length is compared to the length of the original curve, the resulting measure correlates strongly with the fractal dimension, which has been used to evaluate linear complexity (Thibault, 2002). Attempts to then use this measure to classify line features by complexity (and adjust tolerance values accordingly) were not successful.

Figure 3. The calculation of a trendline along tract boundaries.
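The sketch below is a simplified reading of the trendline measure described above: inflection points are located where the turn direction of the line changes sign, the trendline connects them, and the ratio of original line length to trendline length serves as a complexity index. The detection rule and the name trendline_complexity are our assumptions, not the project's exact formulation.

```python
import math

def cross_z(p0, p1, p2):
    """Z component of the cross product of successive segment vectors at p1."""
    return (p1[0] - p0[0]) * (p2[1] - p1[1]) - (p1[1] - p0[1]) * (p2[0] - p1[0])

def inflection_points(points):
    """Vertices where the curve switches between left and right turns."""
    result = [points[0]]
    signs = [cross_z(a, b, c) for a, b, c in zip(points, points[1:], points[2:])]
    for i in range(1, len(signs)):
        if signs[i - 1] * signs[i] < 0:          # turn direction changed
            result.append(points[i + 1])
    result.append(points[-1])
    return result

def length(points):
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def trendline_complexity(points):
    """Ratio of original line length to trendline length; 1.0 = no excess sinuosity."""
    trend = inflection_points(points)
    return length(points) / length(trend) if length(trend) else 1.0
```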

The Leung-Koehnen Algorithm

The Leung-Koehnen algorithm was developed at the University of Minnesota by Kai-Chi Leung and Ryan Koehnen as part of the NHGIS project (Figure 4). The algorithm attempts to "jump across" small inlets along a coastline, yielding a generalized boundary.

Figure 4. Principle of the Leung-Koehnen algorithm.

The Leung-Koehnen algorithm takes a multiparameter approach, using two tolerance parameters: a distance tolerance dt and an area tolerance at. For each point p1…n on a line L, the following is computed:

1. For point p1 on line L, construct new line segments nls1…n to all other points p1…n that are not consecutive to p1.
2. Consider all new line segments nls1…n that are shorter than distance dt.
3. Choose the longest line segment nls1 from nls1…n.
4. Create a polygon Poly1 that consists of the endpoints of nls1 and all points between them on line L.
5. For point p1 on line L, create a circle c with center p1 and radius dt.
6. Select all line segments ls1…n of line L that intersect circle c and do not have p1 as an endpoint.
7. For all selected line segments ls1…n of line L, determine whether there is a perpendicular line plx that passes through point p1.
8. For all perpendicular lines pl1…n that pass through point p1, create line segments with endpoints p1 and the intersection ix of lsx and plx.
9. Consider all new perpendicular line segments npls1…n, constructed from points p1 and i1…n, that are shorter than distance dt.
10. Choose the longest line segment npls1 from npls1…n.
11. Create polygon Poly2 that consists of the endpoints of npls1 and all points between them on line L.
12. Measure the areas a1 and a2 of Poly1 and Poly2.
13. Choose the larger of a1 and a2 that is less than the area tolerance at.
14. Depending on whether a1 or a2 is chosen, the corresponding segment nls1 or npls1 is inserted into line L and all intermediary points are deleted.

Figure 4 depicts the raw TIGER vector data for the Tampa/St. Petersburg area of Florida, while Figure 5 shows the generalized version. These data, encoded at a scale of 1:150,000, show the complexity of both the natural and human-created coastline along the Florida coast. Figure 6 provides an enlargement of part of Figure 5. Two simplification approaches have been taken here. On the left is the Visvalingam algorithm (Visvalingam and Whyatt, 1993), which uses an areal tolerance (in this case 8,000 square meters) to select critical points and is considered robust in maintaining the original character of the line. This is a somewhat novel approach, in that most simplification algorithms use a linear distance to determine the proximity between the original feature and the simplified version. An areal tolerance (similar to that of the Leung-Koehnen algorithm) measures the "area" of change as points are eliminated and the two features are displaced. When this area is too large, based on the user-defined tolerance, no further points are eliminated. The enlargement on the right of Figure 6 is a two-pass method applying the Visvalingam algorithm, followed by

the Leung-Koehnen algorithm. Note in particular the performance of the dual approach along the complicated "canaled" coastline, where it becomes difficult to retain the rectangular nature of this human-created landscape.
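To illustrate the area-based point selection described above, the following Python sketch implements the basic Visvalingam-Whyatt procedure: the vertex whose triangle with its two neighbors has the smallest "effective area" is removed repeatedly until every remaining vertex encloses an area above the tolerance (e.g., 8,000 square meters in the example above). This is a straightforward textbook version, not the NHGIS implementation, and it omits the enhancements discussed later in the paper.

```python
def triangle_area(a, b, c):
    """Area of the triangle formed by a vertex and its two neighbors."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def visvalingam(points, area_tolerance):
    """Repeatedly drop the vertex with the smallest effective area below the tolerance."""
    pts = list(points)
    while len(pts) > 2:
        # Effective area of each interior vertex.
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1]) for i in range(1, len(pts) - 1)]
        smallest = min(range(len(areas)), key=areas.__getitem__)
        if areas[smallest] >= area_tolerance:
            break                                   # all remaining vertices are significant
        del pts[smallest + 1]                       # +1 because areas[] skips the first vertex
    return pts

# Example: simplify a coastline-like polyline with an areal tolerance of 8,000 square meters.
coast = [(0, 0), (50, 10), (100, 0), (150, 120), (200, 0), (250, 5), (300, 0)]
print(visvalingam(coast, 8000))
```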

Figure 4. The unmodified TIGER data for the Tampa-St. Petersburg coastline (original scale of 1:150,000).

Figure 5. A generalized version of Figure 4.

As noted above, the limited capability of off-the-shelf generalization software makes problems such as this coastline difficult to solve, so the NHGIS project is designing generalization software for specific problems of this kind. One such algorithm, designed by Kai-Chi Leung and programmed by Ryan Koehnen, is intended to retain the critical right-angle geometry of such landscapes while reducing the number of canals.

Figure 6. An enlargement of the generalization (From Inset Number 2 on Figure 5).

A COMPREHENSIVE GENERALIZATION FRAMEWORK

Although the localized approaches described above produced interesting results, they were viewed as problematic for several reasons, both technical and conceptual. Conceptual impediments included: (1) the need to process polygons, not only lines; (2) the need for topological and geometric consistency across time; and (3) the need for a production framework (knowledge engine). A critical technical impediment was the need to retain persistent topology while generalizing complex polygons (see Galanda et al., 2005, for a more detailed explanation). To overcome these problems, and to establish a comprehensive generalization framework, several basic generalization principles were developed. One principle involved the hierarchical organization of census units, in which units at one level form partitions within the units that enclose them (e.g., block groups nest within tracts). A second principle was the need for generalization to proceed backwards in time, similar to the tract editing process (McMaster and Lindberg, 2003), with 2000 as the master decade. The comprehensive approach also involved constraint-based generalization in conjunction with active objects (Ruas, 1999; Barrault et al., 2001; Galanda, 2003; Duchêne, 2004).
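The following minimal Python sketch illustrates the two organizing principles just described, processing decades backwards from a 2000 master and respecting the nesting of census units. The function names, the generalize_decade placeholder, and the idea of carrying already-generalized boundaries backward are illustrative assumptions about how such a pipeline could be organized, not a description of the NHGIS production system.

```python
# Census-unit hierarchy, from coarsest to finest nesting level.
HIERARCHY = ["county", "tract", "block group"]

# Decades processed backwards in time, with 2000 as the master decade.
DECADES = list(range(2000, 1780, -10))

def generalize_decade(decade, level, carried_boundaries):
    """Placeholder for generalizing one census level of one decade.

    `carried_boundaries` holds boundaries already generalized for later decades,
    so that boundaries shared across time keep identical geometry.
    """
    generalized = dict(carried_boundaries)          # reuse persistent boundaries
    # ... generalize only the boundaries unique to this decade and level ...
    return generalized

def run_pipeline():
    carried = {}
    for decade in DECADES:                          # 2000, 1990, ..., 1790
        for level in HIERARCHY:                     # coarser units constrain finer ones
            carried = generalize_decade(decade, level, carried)
    return carried
```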

Generalization Constraints

Ruas (1999) distinguished two groups of constraints with respect to the generalization process, namely constraints of generalization and constraints of maintenance. Constraints of generalization are the motor of generalization, as any violation of one of these constraints indicates a need for generalization. Constraints of maintenance relate to properties of map objects that should be preserved during the generalization process. These constraints can be either strict, i.e. they have to be respected, or flexible, i.e. they should be maintained as faithfully as possible. In other words, the generalization process is driven by the fulfillment of constraints of generalization, while constraints of maintenance restrict the space of potential solutions. Galanda (2003) identified four constraints of generalization at the level of individual polygons for the automated generalization of polygonal subdivisions, i.e. "Redundant points", "Outline granularity", "Narrowness", and "Minimal area" (Table 1).

Constraints of generalization:
1. Redundant points
2. Outline granularity
3. Narrowness of a feature
4. Minimal area

Constraints of maintenance:
1. Geometric consistency
2. Line intersection
3. Neighborhood relation
4. Co-existence
5. Solution space
6. Positional accuracy
7. Relative arrangement

Table 1. Constraints of generalization and constraints of maintenance.

This framework, discussed in more detail in Galanda et al. (2005), has proven successful in overcoming many of the challenges associated with a localized approach to the automated generalization of historical US census boundaries: the amount of data involved in the project, the need for a nearly automated solution, the need to establish a well-defined production system, the need to generalize polygons rather than lines, and, due to the temporal component, the resulting need to preserve both geometric and topological consistency across scales and decades. Early results from the project are promising as we move into an implementation and production stage.
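As an illustration of how constraint-driven control can be organized, the sketch below models the division of labor in Table 1: constraints of generalization detect violations and propose operations, while constraints of maintenance veto candidate solutions. The class names, the minimal-area threshold, and the assumed polygon attributes (area, is_valid) are illustrative assumptions, not interfaces or values from the NHGIS system.

```python
from abc import ABC, abstractmethod

class GeneralizationConstraint(ABC):
    """A constraint whose violation indicates that generalization is needed."""
    @abstractmethod
    def violated(self, polygon) -> bool: ...
    @abstractmethod
    def proposed_operation(self) -> str: ...

class MaintenanceConstraint(ABC):
    """A constraint that restricts the space of acceptable solutions."""
    @abstractmethod
    def satisfied(self, polygon) -> bool: ...

class MinimalArea(GeneralizationConstraint):
    def __init__(self, min_area=10_000.0):          # assumed threshold in square meters
        self.min_area = min_area
    def violated(self, polygon):
        return polygon.area < self.min_area
    def proposed_operation(self):
        return "eliminate"

class GeometricConsistency(MaintenanceConstraint):
    def satisfied(self, polygon):
        return polygon.is_valid                     # e.g. no self-intersections

def generalize(polygon, gen_constraints, maint_constraints, operations):
    """Apply operations triggered by generalization constraints, subject to maintenance constraints."""
    for constraint in gen_constraints:
        if constraint.violated(polygon):
            candidate = operations[constraint.proposed_operation()](polygon)
            if all(m.satisfied(candidate) for m in maint_constraints):
                polygon = candidate                 # accept only if maintenance constraints hold
    return polygon
```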

IMPLEMENTATION

The outlined framework is implemented in ESRI's ArcGIS using C# and ArcObjects, while the data are stored in an Oracle database that is accessed for automated generalization through ArcSDE. The implementation of model generalization has been completed and successfully tested for target scales up to 1:1,000,000 for different US counties. It involves (1) the elimination of sliver polygons from the base dataset (the union of all boundaries for all decades) and (2) the removal of redundant points using the Douglas algorithm (Douglas and Peucker, 1973) and an angle/area simplification algorithm that removes vertices where the enclosed angle and area are considered insignificant.
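The angle/area test can be pictured with a short sketch: a vertex is treated as redundant when the turn it introduces is nearly flat and the triangle it forms with its neighbors is tiny. The thresholds and function names below are illustrative assumptions; the paper does not give the actual parameters of the NHGIS routine.

```python
import math

def turn_angle(a, b, c):
    """Absolute deviation from a straight line at vertex b, in radians."""
    a1 = math.atan2(b[1] - a[1], b[0] - a[0])
    a2 = math.atan2(c[1] - b[1], c[0] - b[0])
    return abs((a2 - a1 + math.pi) % (2 * math.pi) - math.pi)

def triangle_area(a, b, c):
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def remove_redundant_vertices(points, max_angle=0.05, max_area=50.0):
    """Drop vertices whose turn angle and enclosed area are both insignificant."""
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        prev, cur, nxt = kept[-1], points[i], points[i + 1]
        if turn_angle(prev, cur, nxt) < max_angle and triangle_area(prev, cur, nxt) < max_area:
            continue                                # nearly collinear and tiny: redundant
        kept.append(cur)
    kept.append(points[-1])
    return kept
```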

Figure 7. 2000 census tracts on part of the Florida coastline, generalized to a scale of 1:400,000 using the comprehensive generalization framework. Original boundaries are shown in gray.

The implementation of cartographic generalization is still ongoing and has concentrated, until recently, on the two most common constraint violations. First, polygons violating the "Minimal area" constraint are removed if they belong to a tract that is represented by multiple polygons; such a tract most often has at least one part that does not violate the "Minimal area" constraint. Second, polygon outlines violating the "Outline granularity" constraint are simplified using an enhanced Visvalingam algorithm (Visvalingam and Whyatt, 1993) that allows the retention of vertices that enclose approximately perpendicular angles. Figure 7 shows the 2000 tract boundaries of Tampa Bay, Florida, at a scale of 1:400,000 after the completion of model generalization and the currently implemented version of cartographic generalization.
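One way to picture the "Minimal area" handling just described is the following sketch: parts of a multi-part tract that fall below the area threshold are dropped, as long as at least one part of the tract remains. The data structures and the threshold are illustrative assumptions rather than the NHGIS implementation.

```python
def enforce_minimal_area(tract_parts, min_area):
    """Remove undersized parts of multi-part tracts, keeping at least one part per tract.

    `tract_parts` maps a tract ID to a list of (part_id, area) tuples.
    """
    result = {}
    for tract_id, parts in tract_parts.items():
        if len(parts) > 1:
            kept = [p for p in parts if p[1] >= min_area]
            # Guard: never delete every part of a tract.
            result[tract_id] = kept if kept else [max(parts, key=lambda p: p[1])]
        else:
            result[tract_id] = parts                # single-part tracts are handled elsewhere
    return result

# Example: tract "A" has a tiny offshore sliver that is dropped at 1:400,000.
parts = {"A": [("A-1", 2_500_000.0), ("A-2", 900.0)], "B": [("B-1", 1_200_000.0)]}
print(enforce_minimal_area(parts, min_area=10_000.0))
```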

CONCLUSION

The National Historical Geographic Information System (NHGIS) is a five-year NSF-funded project to build a comprehensive census database, for both boundary files and attribute data, for the entire United States at both the

county and census tract level at multiple scales. Using both existing digital data and scanned census maps, a temporal database is being built. It is clear that the intrinsic needs of the myriad users will necessitate the availability of a multiple-scale geographic database for mapping from the neighborhood to the state level. Ongoing work based on a comprehensive generalization framework has applied a series of constraints to drive generalization decisions. Future work will apply, within this object framework, existing and newly developed algorithms to tackle difficult geographies.

ACKNOWLEDGEMENTS

This work is supported by the National Science Foundation under Grant No. BCS0094908 – an infrastructure grant provided for the social sciences.

REFERENCES

Barrault, M., Regnauld, N., Duchêne, C., Haire, K., Baeijs, C., Demazeau, Y., Hardy, P., Mackaness, W., Ruas, A., and Weibel, R., 2001, Integrating multi-agent, object-oriented and algorithmic techniques for improved automated map generalization. Proceedings 20th International Cartographic Conference, Beijing, China, 2110-2116.

Brassel, K., and Weibel, R., 1988, A Review and Conceptual Framework of Automated Map Generalization. International Journal of Geographical Information Systems, 2(3), 229-244.

Buttenfield, B.P., 1991, A Rule for Describing Line Feature Geometry. In Buttenfield, B.P., and McMaster, R.B. (Eds.), Map Generalization: Making Rules for Knowledge Representation. Longman, London, 150-171.

Douglas, D.H., and Peucker, T.K., 1973, Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or its Caricature. The Canadian Cartographer, 10(2), 112-123.

Duchêne, C., 2004, Généralisation cartographique par agents communicants: Le modèle CartACom. Application aux données topographiques en zone rurale. Ph.D. thesis, Université Paris VI Pierre et Marie Curie.

Galanda, M., 2003, Automated Polygon Generalization in a Multi Agent System. Ph.D. thesis, Department of Geography, University of Zurich.

Galanda, M., Koehnen, R., Schroeder, J., and McMaster, R.B., 2005, Automated Generalization of Historical U.S. Census Units. Proceedings 8th ICA Workshop on Generalisation and Multiple Representation, La Coruna, Spain. http://ica.ign.fr/

Kilpelainen, T., 1995, Requirements of a Multiple Representation Database for Topographical Data with Emphasis on Incremental Generalization. Proceedings 17th International Cartographic Conference, Vol. 2, Barcelona, Spain, 1815-1825.

McMaster, R.B., 1986, A Statistical Analysis of Mathematical Measures for Linear Simplification. The American Cartographer, 13(2), 330-346.

McMaster, R.B., and Lindberg, M., 2003, The National Historical Geographic Information System (NHGIS). Proceedings 21st International Cartographic Conference, Durban, South Africa, 821-828.

Plazanet, C., 1993, Measurement, Characterization and Classification for Automated Line Feature Generalization. Proceedings, Twelfth International Symposium on Computer-Assisted Cartography, Charlotte, NC, ASPRS/ACSM, Bethesda, MD, 59-68.

Ruas, A., 1999, Modèle de généralisation de données géographiques à base de contraintes et d'autonomie. Ph.D. thesis, Université de Marne-la-Vallée.

Thibault, P.A., 2002, Cartographic Generalization of Fluvial Features. Ph.D. dissertation, Department of Geography, University of Minnesota.

Visvalingam, M., and Whyatt, D., 1993, Line generalization by repeated elimination of points. The Cartographic Journal, 30(1), 46-51.
