ChoroWare: A Software Toolkit for Choropleth Map Classification

Ningchuan Xiao, Marc P. Armstrong, David A. Bennett
Department of Geography, 316 Jessup Hall, The University of Iowa, Iowa City, IA 52242
E-mail: {ningchuan-xiao; marc-armstrong; david-bennett}@uiowa.edu

Abstract

Choropleth mapping plays an important role in exploratory spatial data analysis and many objectives have been suggested to guide choropleth map construction. Because these objectives often conflict, a choropleth map that is best for one objective may not be best for others. Choosing a set of choropleth class intervals is thus a multiobjective problem. This paper describes a software toolkit, called ChoroWare, that is designed to help cartographers identify a set of class intervals that is suitable for a specific application. The software uses a genetic algorithm to generate a set of nondominated solutions to the choropleth class interval problem and uses a visualization tool to interactively display members of this classification set. Using ChoroWare, cartographers can examine trade-offs between alternatives and find choropleth classifications that meet their objectives.

1. Introduction

The importance of visualization in exploratory data analysis has long been recognized by researchers. Tukey asserts that, “except when learning the numerical part of a new technique, no problem of exploratory data analysis is ‘solved’ without something to look at” (Tukey, 1977: 56). In the domain of exploratory spatial data analysis (ESDA), visualization is also especially important (Anselin, 1998). Among various visualization techniques used in ESDA, choropleth mapping is often used to display 1) patterns in spatial observations of enumeration areas, such as counties or Census Tracts, and 2) resulting spatial statistical outcomes, such as measures of local spatial autocorrelation (see, for example, Anselin, 1995; Getis and Ord, 1992). To produce a choropleth map, spatial observations are grouped into classes and each class is then assigned a particular area symbol that is used to shade the enumeration areas. Classifications can be performed in a large number of ways, with each method focusing on a specific criterion, such as geographical structure (Monmonier 1972; Murray and Shyy 2000), statistical similarity (Jenks and Caspall 1971), and other statistical characteristics (e.g., quantiles) (Slocum 1999: 67).


Choropleth map producers may have several criteria that they wish to consider during the design process. It is unlikely, however, that a particular classification will satisfy all their criteria. In practice, a classification that is optimal for one criterion may turn out to be poor from other perspectives. Therefore, an appropriate design goal for a choropleth map is to consider classification as a multicriteria (or multiobjective) problem. That is, instead of looking for a single classification that is best for all criteria, it is more useful for cartographers to examine the classification alternatives along a Pareto-like front where no classification can be considered to be better than, or dominate, others. The purpose of this paper is to describe the design, implementation and use of a software toolkit, called ChoroWare, that can be used to help cartographers find a set of class intervals that is well-suited to a specific application. Using ChoroWare, users can explore a variety of alternative classifications and select one that they deem most suitable. Theoretical issues underlying the design of ChoroWare are discussed in Section 2. The design and implementation of ChoroWare are discussed in Section 3. A user’s guide is provided in Section 4. We conclude the paper with discussion about possible future development of ChoroWare and its connection with Open Source technology.

2. Theory

The major difference between solving a multiobjective optimization problem and solving one with a single objective is that, in a multiobjective context, there is more than one optimal solution. To elucidate this concept, let us consider a multiobjective optimization problem with the general form:

$$\min \mathbf{f} = [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_k(\mathbf{x})]^T \quad \text{subject to } \mathbf{x} \in S$$

where $\mathbf{x}$ is a vector of decision variables, and $S$ is the set of feasible solutions. For this problem, a solution $\mathbf{x}^{(1)}$ is said to dominate solution $\mathbf{x}^{(2)}$ if and only if both of the following conditions are satisfied: 1) $f_i(\mathbf{x}^{(1)}) \le f_i(\mathbf{x}^{(2)})$ for all $i$, and 2) $f_i(\mathbf{x}^{(1)}) < f_i(\mathbf{x}^{(2)})$ for at least one $i$. In other words, if no objective value of solution $\mathbf{x}^{(1)}$ is greater (worse) than that of solution $\mathbf{x}^{(2)}$, and there is at least one objective value of solution $\mathbf{x}^{(1)}$ that is less (better) than that of solution $\mathbf{x}^{(2)}$, then we say that solution $\mathbf{x}^{(1)}$ dominates solution $\mathbf{x}^{(2)}$. The relation between any two solutions in the solution space can assume one of the following two states: 1) dominated and dominating, where the conditions of domination are satisfied, or 2) nondominated, where the conditions of domination are violated. If we plot all feasible solutions of the problem in its solution space, there will be a certain number of solutions that form a particular set called the Pareto-optimal set. No solution within this set is dominated by any other feasible solution, while a solution outside this set is dominated by at least one solution in this set. The Pareto-optimal set is also called the Pareto front.

The goal of a multiobjective optimization technique is to find the Pareto front. Though conducting a search for nondominated alternatives is difficult for many multiobjective optimization problems (Cohon, 1978; Miettinen, 1999), recent developments have shown that genetic algorithms (GAs) are effective in generating alternatives for them (Deb, 2001; Zitzler et al. 2001; Xiao et al. 2002). In the rest of this section, we first discuss basic GA principles, in the context of single objective optimization, and then discuss their use in multiobjective optimization.

2.1 Genetic algorithms (GAs)

Genetic algorithms are developed using the general principles of adaptation and evolution observed in biological systems (Holland, 1975). In a GA, a population of individual solutions is randomly initialized and then manipulated by a set of iterative operations, including selection, recombination, and mutation. At the end of each iteration, solutions are evaluated according to a set of objectives. Figure 1 shows a general procedure for a GA.

[Figure 1 flowchart: t := 0; generate initial population P(t); evaluate each individual in P(t); Stop? If yes, end. If no, apply the genetic operations (selection, crossover, mutation) to form the new population P(t+1), set t := t+1, and evaluate again.]

Figure 1. The general procedure of a genetic algorithm.

2.2 Multiobjective GAs

Several types of genetic algorithms have been developed to solve multiobjective problems (see Deb, 2001). Our approach is based on one of these methods, in which the GA assigns a fitness value to each solution according to its dominance in the population. Because fitness is used to drive the selection process, solutions found closer to the Pareto front have a greater chance to be selected and manipulated to form the next generation.

2.3 Specialized island model for multiobjective genetic algorithms

Since the goal of a multiobjective GA is to search for a set of solutions, it is essential to ensure that the solutions found are evenly distributed along the Pareto front. This can be achieved by maintaining a diverse population in the GA, and several techniques have been suggested to maintain population diversity (see Deb, 2001). In ChoroWare, we extended the island model (see Cantú-Paz and Goldberg 2000) and designed a specialized island model. In our approach, the entire population is divided into several subpopulations (“islands”) and each subpopulation is processed using a local set of evolutionary operations. A mechanism, called migration, is used to exchange individuals, at a certain rate and interval, among the islands. Inside each subpopulation, an individual (i.e., a specific classification scheme) is evaluated against a (partial) set of objectives.

2.4 Using GAs to search for choroplethic class intervals

A solution to the multiobjective problem of choropleth class interval selection breaks a sorted array of observations into several sections, and observations within each section belong to one class. We use an integer string to represent a classification in our GA implementation, where each integer is the index of a break point. The search for choropleth class intervals is then equivalent to a search for a set of break points.
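As a rough illustration of the representation described in Section 2.4, the sketch below splits a sorted array of observations into classes from an integer string of break points. The function name split_by_breaks and the convention that each break index marks the last observation of its class are assumptions made for this example; MoGA/Choro may use a different convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of MoGA/Choro): split a sorted array of
// observations into classes using a chromosome of break-point indices.
// Assumed convention: breaks[k] is the index of the LAST observation in
// class k; breaks are strictly increasing and leave a non-empty last class.
std::vector<std::vector<double>> split_by_breaks(
    const std::vector<double>& sorted,
    const std::vector<std::size_t>& breaks) {
  std::vector<std::vector<double>> classes;
  std::size_t start = 0;
  for (std::size_t b : breaks) {
    assert(b >= start && b + 1 < sorted.size());
    classes.emplace_back(sorted.begin() + start, sorted.begin() + b + 1);
    start = b + 1;
  }
  // The remaining observations form the last class.
  classes.emplace_back(sorted.begin() + start, sorted.end());
  return classes;
}
```

Under this convention, a chromosome of length two such as {2, 4} applied to six sorted observations yields three classes holding three, two, and one observations.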

3. ChoroWare: Design, Implementation, and Functionality

As discussed above, ChoroWare has been developed to help cartographers find a set of class intervals that is suitable for their specific application. To achieve this goal, two critical tasks must be completed. First, ChoroWare must be able to generate the set of nondominated alternatives that are on, or close to, the Pareto front. Then, a visualization tool is needed to allow users to interactively display the trade-offs between criteria and the resulting choropleth map for each nondominated alternative. The overall architecture of ChoroWare is illustrated in Figure 2.

[Figure 2 diagram: ChoroWare and its four modules, ChoroData, MoGA/Choro, ChoroFront, and ChoroVisual.]

Figure 2. Overall structure of ChoroWare.

• ChoroData: This module reads in the raw spatial data (in vector or raster formats) and generates the data files needed by the genetic algorithm. Data generated by ChoroData include 1) an array of all observations (for vector format data), 2) a two-dimensional array of observations (for raster), 3) an array of non-repeated observations (for both vector and raster), and 4) an array of linked lists of neighbors for each polygon in a vector data structure. ChoroData requires a polygon shape file for vector data, and an ASCII grid file for raster data.
• MoGA/Choro (multiobjective GA for choroplethic classification): This module is used to generate solutions that are on, or close to, the Pareto front.
• ChoroFront: Solutions generated by MoGA/Choro include both dominated and nondominated solutions. This module is used to extract the nondominated solutions from the output of MoGA/Choro and write them to a file.
• ChoroVisual: This module is used to display the solution space and the choropleth map for a user-selected nondominated solution extracted by ChoroFront.
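The extraction step attributed to ChoroFront can be pictured with the dominance test from Section 2. The following is a minimal sketch, assuming all objectives are minimized; the names Objectives, dominates, and nondominated are hypothetical and the quadratic scan is chosen for clarity, not taken from the ChoroFront source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One value per objective; every objective is assumed to be minimized.
using Objectives = std::vector<double>;

// True if a dominates b: a is no worse in every objective and strictly
// better in at least one.
bool dominates(const Objectives& a, const Objectives& b) {
  assert(a.size() == b.size());
  bool strictly_better = false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] > b[i]) return false;
    if (a[i] < b[i]) strictly_better = true;
  }
  return strictly_better;
}

// Return the indices of the solutions that no other solution dominates.
// Simple O(n^2) scan, written for illustration only.
std::vector<std::size_t> nondominated(const std::vector<Objectives>& sols) {
  std::vector<std::size_t> front;
  for (std::size_t i = 0; i < sols.size(); ++i) {
    bool is_dominated = false;
    for (std::size_t j = 0; j < sols.size() && !is_dominated; ++j) {
      if (j != i && dominates(sols[j], sols[i])) is_dominated = true;
    }
    if (!is_dominated) front.push_back(i);
  }
  return front;
}
```

For example, among the two-objective solutions (0.1, 0.9), (0.5, 0.5), (0.6, 0.6), and (0.9, 0.1), only (0.6, 0.6) is dominated, by (0.5, 0.5); the other three are mutually nondominated and would be kept.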

3.1 MoGA/Choro

During the design of MoGA/Choro, code development was influenced by considerations related to code reuse as well as a recognition that changes would be needed:

1) Reusability. The code, or at least a large bulk of the code, must be reusable so that the program can be expanded in the future. More explicitly, we considered how the code, without major modification, could be migrated to a parallel programming environment.
2) Flexibility. There will always be new objectives to consider, and it should be easy for these objectives to be incorporated into MoGA/Choro.
3) Extendibility. Though many objectives may exist, it is impractical for developers to build all possible objectives into the program. Therefore, users must be able to incorporate their own objectives into the program.

Based on these concerns, we designed MoGA/Choro using object-oriented techniques (Figure 3). The object-oriented programming paradigm has been widely used in GIS applications and spatial analysis because it provides strong support for code reusability and problem modeling (Rumbaugh et al., 1991; Bennett, 1997).

[Figure 3 class diagram. Classes shown: population; subpopulation (select(), crossover()); geno_base; genotype (chromosome, init_chrom(), mutation(), fitness()); models; model_base; model_jenks; model_equal_area; model_autocorr; model_boundary_error; parameters; data_attr; data_attr_unique; data_neighbor_list; data_raster. Legend: three link types, "a contains many b's", "a contains one b", and "b derived from a".]

Figure 3. The design pattern of MoGA/Choro. The notation used here is adopted from Gamma et al. (1995).

We describe the main classes in Figure 3 here:



• The population class. An object of the population class is responsible for running the main loop of the program, which is consistent with the steps illustrated in Figure 1. Also, it contains many subpopulations, as objects of the subpopulation class, and maintains a record of the migration of individuals among these subpopulations.
• The subpopulation class. An object of the subpopulation class is essentially a GA, which contains a number of objects of the genotype class.
• The genotype class. An object of the genotype class represents an individual (a solution) in the GA. It has a member datum called chromosome, which stores the break points of the class intervals for a particular classification. Member function init_chrom() is used to initialize the classification, mutation() is used to mutate chromosome, and fitness() is used to assign a fitness value to chromosome. A genotype object also contains an aggregating object of models, which contains the models used to calculate the objective values.
• The models class. An object of the models class contains all the models used to compute objective values.
• The parameters class. An object of the parameters class contains all the data sets needed to compute objective values. Since these data sets are used by models, pointers to these data components are passed to each specific model (e.g., model_equal_area).

We have developed four “built-in” objectives and the evaluation of each objective is carried out by a class in Figure 3. For the sake of convenience, we minimize each objective.

• The model_jenks class. An object of this class is used to compute goodness of variance fit, or GVF (Robinson et al. 1984: 363), an operational form of the tabular accuracy index specified by Jenks and Caspall (1971). Originally, the objective was to maximize GVF. The objective here is to minimize EVF = 1 - GVF.
• The model_equal_area class. An object of this class computes the Gini coefficient of the areas of the classes (GEA). The objective is to minimize GEA.
• The model_autocorr class. An object of this class is used to maximize spatial autocorrelation based on Moran’s I of the resulting classification clusters on the map (MIC), and equivalently, in MoGA/Choro, we minimize MIC1 = 1 - MIC.
• The model_boundary_error class. An object of this class is used to maximize the boundary accuracy index, or BAI (Jenks and Caspall 1971), and equivalently, we minimize boundary error, BE = 1 - BAI, in MoGA/Choro.
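To make the first objective concrete, here is a small sketch of EVF = 1 - GVF. It assumes the commonly used definition GVF = (SDAM - SDCM) / SDAM, where SDAM is the sum of squared deviations from the mean of all observations and SDCM is the sum of squared deviations from the class means, so that EVF = SDCM / SDAM. The function names and the break-point convention (each break index marks the last observation of its class) are assumptions for this example and are not taken from model_jenks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squared deviations of values[first, last) from their own mean.
static double sum_sq_dev(const std::vector<double>& values,
                         std::size_t first, std::size_t last) {
  assert(last > first && last <= values.size());
  double mean = 0.0;
  for (std::size_t i = first; i < last; ++i) mean += values[i];
  mean /= static_cast<double>(last - first);
  double ss = 0.0;
  for (std::size_t i = first; i < last; ++i) {
    ss += (values[i] - mean) * (values[i] - mean);
  }
  return ss;
}

// Hypothetical sketch of EVF = 1 - GVF = SDCM / SDAM for a classification
// given as break-point indices into the sorted observations.
double evf(const std::vector<double>& sorted,
           const std::vector<std::size_t>& breaks) {
  assert(!sorted.empty());
  const double sdam = sum_sq_dev(sorted, 0, sorted.size());
  if (sdam == 0.0) return 0.0;  // all observations identical: no error
  double sdcm = 0.0;
  std::size_t start = 0;
  for (std::size_t b : breaks) {
    sdcm += sum_sq_dev(sorted, start, b + 1);  // class [start, b]
    start = b + 1;
  }
  sdcm += sum_sq_dev(sorted, start, sorted.size());  // last class
  return sdcm / sdam;
}
```

A classification that places identical observations together has EVF = 0, while poorly placed break points push EVF toward 1, which is why minimizing EVF is equivalent to maximizing GVF.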

Section 4.5 discusses how to add new objectives into the program.

3.2 ChoroVisual

The nondominated alternatives of the choropleth classifications extracted by ChoroFront are visualized using the module ChoroVisual. This module was developed using Gtk+ and Gdk, an Open Source technology for graphical user interface development.¹ The interface of ChoroVisual currently consists of six windows (see Figure 4, in clockwise order starting from the upper-left corner):

1) Front list: a list of all nondominated alternatives,
2) Map window, whose title shows the name of the polygon shape file,
3) Value path: a parallel coordinate plot,
4) Solution space: a multivariate plot,
5) Classifier: a legend editor (for manual classification), and
6) Database: a list of the attributes of polygons.

Figure 4. A screenshot of ChoroVisual.

¹ http://www.gtk.org

In the “Solution space” window, a user can choose to display a plot formed by any two variables listed in the left panel. Among these variables, the last four are the built-in objectives. Each classification alternative is displayed as a small white square in the solution space drawing area, and a red square is used to indicate the current classification being used to draw the map. The current classification is also highlighted in the front list and in the value path (the red line). Linkages among the map, multivariate plot, and front list enable a user to examine the alternatives on the fly. The legend editor is designed to help users further explore each selected alternative. In the classifier window, a user can examine the histogram of the observations and the break points for the classification, adjust the classification, and change the color of classes. In so doing, a user can start from an acceptable classification and then, based on this classification, search for a better one.


4. A User’s Guide

Compiling and running ChoroWare requires a Linux distribution (we developed it under RedHat 7.2), a GNU C++ compiler, and Gtk 1.2 or above.² Though we did not test all cases, we believe that as long as the software requirements are met, ChoroWare can be configured and compiled on most Linux distributions.

² RedHat Linux: http://www.redhat.com; GNU: http://www.gnu.org. More Linux distributions can be found at http://www.ibiblio.org (formerly http://www.sunsite.unc.edu).

4.1 ChoroData

ChoroData can be used to generate input data for MoGA/Choro from either an ESRI polygon shape file or an ASCII grid file. The command lines are:

chorodata poly in_shape field_name

or

chorodata raster in_ascii_grid

where poly and raster are key words, in_shape is the name of a shape file (without the file extension), field_name is the name of the field in the dBASE (.dbf) file used for choropleth mapping, and in_ascii_grid is the name of the ASCII grid file.

4.2 MoGA/Choro

A configuration file is needed to run MoGA/Choro. An example configuration file is shown in Figure 5.

[common]
data_type=polygon
# input data files
shape_file=5state/5state
poly_file=5state/5state.dense.data.poly
neighborlist_file=5state/5state.neighborlist
attribute_file=5state/5state.dense.data.attribute
# output files
running_file=5state.dense.running
solution_file=5state.dense.sol
num_objs=4
chrom_len=4
num_subpop=9
popsize=50
maxgen=150
model_mask=1111
report_freq=1
top_N=20
migration_frq=2
num_elite=20
rand_seed=-1
[sub0]
cross_meth=-1
gap=0.73
pcross=0.95
pmutation=0.76
mutat_meth=-1
sorting_mask=1001
migrate_to=-1
[sub1]
...

Figure 5. A configuration file for MoGA/Choro. Note that only the section of subpopulation 0 is given.

This file is divided into several sections. The first section, called [common], is used to define the type of input information, input and output files, general specifications of the GA, and the structure of subpopulations. Though most of the variables are self-explanatory, some require a description of their functions:

chromlen: the length of the chromosome, which is equal to the number of internal break points. For example, a five-class choropleth map has four internal break points and therefore chromlen = 4.
rand_seed: a seed used to initialize the random number generator. If rand_seed = -1, the program uses system time as the random number seed.
top_N: the total number of individuals that will be written into the output files (running file and solution file).
model_mask: this number determines which built-in objectives are actually used, because a user may not want to explore all available objectives. If we have a total of four built-in objectives available, then setting model_mask = 1111 means that all are used, and model_mask = 0101 means only the first and third objectives are used at runtime (counting from right to left).

Following the [common] section is a set of sections used for the subpopulations. Each section is marked as [subx] where x is the index (starting from 0) of a subpopulation. Figure 5 shows a complete section for the first subpopulation (i.e., sub0). Here is an explanation of some of the variables used in a subpopulation section:

cross_meth: specifies which crossover method is to be used at runtime. When this value is -1, the program will randomly select a method for each iteration.
gap: a real number between 0 and 1 to specify how large a gap is needed between two successive generations. The gap technique was developed by De Jong (1975) to increase the diversity of GA populations (islands in MoGA/Choro). When gap equals 0.9, about 10% of the individuals of the subpopulation will be randomly picked and then directly copied to the next generation without any modification.
pcross: a real number between 0 and 1 to specify the probability of crossover.
pmutation: a real number between 0 and 1 to specify the probability of mutation.
mutat_meth: specifies which mutation method is to be used at runtime. When this value is -1, the program will randomly select a mutation method.
sorting_mask: specifies which objectives are used to evaluate individuals in the subpopulation. This variable is used to achieve the specialization of a subpopulation. If sorting_mask = 1001, the subpopulation is specialized to an optimization problem consisting of the first and fourth objectives. The specification of this variable must be consistent with the setting of model_mask in the [common] section; otherwise, the program will print an error message and stop execution. If model_mask = 1011, for example, then setting sorting_mask to 1001 is consistent, but sorting_mask = 1100 is inconsistent because the third objective (counting from right to left) is not used. The design of sorting_mask allows users to specify the specialization of an island (subpopulation).
migrate_to: specifies the destination of migrating individuals in that subpopulation. If this value is -1, then individuals from this subpopulation will migrate to all other subpopulations.

After the configuration file is set up, the following command can be used to launch MoGA/Choro:

mogachoro configuration_file_name
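The mask semantics described for model_mask and sorting_mask can be sketched as follows. This is only an illustration of the right-to-left counting rule and the consistency requirement; the function names are hypothetical, the masks are treated as digit strings, and treating masks of different lengths as inconsistent is an assumption, not documented MoGA/Choro behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: list the 1-based objective numbers switched on by a
// mask such as "0101", counting from right to left as the text describes.
std::vector<std::size_t> objectives_in_mask(const std::string& mask) {
  std::vector<std::size_t> used;
  for (std::size_t k = 0; k < mask.size(); ++k) {
    if (mask[mask.size() - 1 - k] == '1') used.push_back(k + 1);
  }
  return used;
}

// A sorting_mask is taken to be consistent with model_mask when every
// objective it switches on is also switched on in model_mask.
// Assumption: masks of different lengths are treated as inconsistent.
bool masks_consistent(const std::string& model_mask,
                      const std::string& sorting_mask) {
  if (model_mask.size() != sorting_mask.size()) return false;
  for (std::size_t i = 0; i < model_mask.size(); ++i) {
    if (sorting_mask[i] == '1' && model_mask[i] != '1') return false;
  }
  return true;
}
```

With these definitions, the mask 0101 selects objectives 1 and 3, the pair (model_mask 1011, sorting_mask 1001) passes the check, and the pair (1011, 1100) fails it because objective 3 is switched off in model_mask.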

4.3 ChoroFront

The output files generated by MoGA/Choro are read by ChoroFront, which then extracts the nondominated solutions and writes them into a file called the “front file”. The name of the front file consists of the name of the running_file and the suffix “.front”. For the above example of 5state, the file name will be 5state.dense.running.front (see Figure 5). The command line of ChoroFront is:

chorofront configuration_file_name

where configuration_file_name is the same file used above with mogachoro.

4.4 ChoroVisual

Finally, the command line to run ChoroVisual is:

chorovisual cv_input

where cv_input is a configuration file for ChoroVisual. An example of this configuration is given in Figure 6. A variable called head_description in this file is used to specify another file that describes the names of each column in the front file (e.g., 5state.dense.running.front). An example head description file is given in Figure 7, where four objectives are used. The names of columns in the front file are used in the “Front list” and “Solution space” windows in ChoroVisual.

[common]
data=5state/5state
field=POP90_SQMI
num_objs=4
num_class=5
front=5state.dense.running.front
head_description=5state.head.description

Figure 6. Configuration for ChoroVisual.

Gen Subga No Rank EVF GEA MIC1 BE Min d1 d2 d3 d4 Max

Figure 7. A file that describes the name of each column in a “front file”.

4.5 Adding New Objectives

Adding a new objective into ChoroWare requires the user to complete four steps:

1) Write the code. Figures 8 and 9 show a template of the header and implementation, respectively, of a new objective called model_new.
2) Register the new objective in the models class. To do this, the init() function in the implementation of the models class must be modified (Figure 10).
3) Recompile the program. To do this, add the file name of the source code to the make file and rerun make.
4) Update the configurations of MoGA/Choro and ChoroVisual.

5. Conclusions and Discussion

We have discussed the design, implementation and use of ChoroWare. Using this software tool, cartographers and spatial analysts can have greater flexibility in choosing a suitable classification scheme for choropleth maps than current GIS and cartographic software provide. Future developments will focus on the following issues:

• Developing a more user-friendly interface for data preparation, and especially software configuration. Currently, users must manually write configuration files for MoGA/Choro and ChoroVisual. We are considering an integrated environment that includes both data preparation and software configuration procedures.
• To expedite the performance of GA search, especially for large data sets, it is necessary to incorporate MoGA/Choro into a parallel programming environment. Fortunately, this issue was considered during the design of MoGA/Choro. Given the new language-binding features of MPI-2,³ most of our code will be reusable.
• Integrating ChoroWare with other powerful visualization tools such as GGobi.⁴
• MoGA/Choro can generate a large number of nondominated alternative classifications (often > 1000), which may present a challenge to users who are trying to find desirable classifications. Data mining techniques, such as clustering, may help users discover the nature of trade-offs and be more informed when exploring alternatives.

³ http://www.mpi-forum.org
⁴ http://www.ggobi.org

#ifndef __H_MODEL__NEW__
#define __H_MODEL__NEW__

#include "gadefs.h"
#include "model_base.h"

class model_new : public model_base {
public:
    model_new();
    model_new(parameters*);
    virtual ~model_new();
    void init(parameters*);
    double run(chromosome);    // dispatches to run_v() or run_r()
    double run_v(chromosome);  // objective value for vector data
    double run_r(chromosome);  // objective value for raster data
private:
    parameters* var_set;       // data sets shared with the other models
};

#endif

Figure 8. The header file of a new objective model_new.

#include "headers.h"
#include "model_new.h"

model_new::model_new() {
    // default constructor
}

model_new::model_new(parameters *param) {
    // constructor
    init(param);
}

model_new::~model_new() {
    // delete allocated memories
}

void model_new::init(parameters* param) {
    var_set = param;
}

double model_new::run(chromosome chrom) {
    // data_type == 1 selects the raster version; otherwise use vector
    if (var_set->data_type == 1)
        return run_r(chrom);
    else
        return run_v(chrom);
}

double model_new::run_v(chromosome chrom) {
    // compute and return the value of the new objective, for vector
    return 0;  // placeholder until the objective is implemented
}

double model_new::run_r(chromosome chrom) {
    // compute and return the value of the new objective, for raster
    return 0;  // placeholder until the objective is implemented
}

Figure 9. A template of implementation of new objective model_new.

#include "headers.h"
...
#include "model_new.h"
...
void models::init(parameters* param) {
    ...
    models[4] = (model_base*) new model_new(var_set);
    ...
}

Figure 10. Register the new objective into the models class.

References

Anselin, L. 1995. Local indicators of spatial association – LISA. Geographical Analysis 27: 93-115.
Anselin, L. 1998. Exploratory spatial data analysis in a geocomputational environment. In P. A. Longley, S. M. Brooks, B. Macmillan and R. McDonnell (eds.), GeoComputation: A Primer. New York: Wiley, 77-94.
Bennett, D. 1997. A framework for the integration of geographical information and modelbase management. International Journal of Geographical Information Science 11 (4): 337-357.
Cantú-Paz, E., and D. E. Goldberg. 2000. Efficient parallel genetic algorithms: theory and practice. Computer Methods in Applied Mechanics and Engineering 186: 221-238.
Cohon, J. L. 1978. Multiobjective Programming and Planning. New York: Academic Press.
Deb, K. 2001. Multi-Objective Optimization Using Evolutionary Algorithms. Chichester: John Wiley & Sons.
De Jong, K. A. 1975. An Analysis of the Behavior of a Class of Genetic Adaptive Systems. Doctoral dissertation, Department of Computer and Communication Sciences, University of Michigan, Ann Arbor, MI.
Gamma, E., R. Helm, R. Johnson, and J. Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software. Reading, MA: Addison-Wesley.
Getis, A., and J. K. Ord. 1992. The analysis of spatial association by use of distance statistics. Geographical Analysis 24: 189-206.
Goldberg, D. E. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley.
Holland, J. H. 1975. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.
Jenks, G. F., and F. C. Caspall. 1971. Error on choroplethic maps: definition, measurement, reduction. Annals of the Association of American Geographers 61 (2): 217-244.
Miettinen, K. M. 1999. Nonlinear Multiobjective Optimization. Boston: Kluwer Academic Publishers.
Monmonier, M. S. 1972. Contiguity-based class-interval selection: a method for simplifying patterns on statistical maps. The Geographical Review 62 (2): 203-228.
Murray, A. T., and T.-K. Shyy. 2000. Integrating attribute and space characteristics in choropleth display and spatial data mining. International Journal of Geographical Information Science 14 (7): 649-667.
Robinson, A. H., R. D. Sale, J. L. Morrison, and P. C. Muehrcke. 1984. Elements of Cartography. 5th ed. New York, NY: John Wiley & Sons.
Rumbaugh, J., M. Blaha, W. Premerlani, F. Eddy, and W. Lorensen. 1991. Object-Oriented Modeling and Design. Englewood Cliffs: Prentice Hall.
Slocum, T. A. 1999. Thematic Cartography and Visualization. Upper Saddle River, NJ: Prentice Hall.
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Xiao, N., D. A. Bennett, and M. P. Armstrong. 2002. Using evolutionary algorithms to generate alternatives for multiobjective site search problems. Environment and Planning A 34 (4): 639-656.
Zitzler, E., K. Deb, L. Thiele, C. A. C. Coello, and D. Corne, eds. 2001. Evolutionary Multi-Criterion Optimization: Proceedings of the First International Conference. Berlin: Springer.
