BUILDING RESEARCH & INFORMATION (2009) 37(5-6), 520–532
RESEARCH PAPER
Urban data-mining: spatiotemporal exploration of multidimensional data
Martin Behnisch1 and Alfred Ultsch2
1 Institute of Historic Building Research and Conservation, ETH Zürich, CH-8093 Zürich, Switzerland
E-mail: [email protected]
2 Department of Mathematics and Computer Science, Philipps-University of Marburg, Hans-Meerwein-Straße, D-35032 Marburg, Germany
E-mail: [email protected]
‘Urban data-mining’ describes a methodological approach to reveal logical or mathematical and partly complex descriptions of patterns and regularities inside a set of geospatial data. The cyclical methodology is characterized by six main tasks following the initial step of data collection: data inspection, structure visualization, structure definition, structure control, operationalization, and knowledge conversion. Geovisualization and spatial analysis supplement the process of knowledge conversion and communication. The multidimensional mining approach is presented as a case study applied to 12 430 German communities to analyse multidynamic characteristics between 1994 and 2004. In particular, Emergent Self-Organizing Maps (ESOM) are applied as an appropriate method for clustering and classification. Their advantage is that they visualize the structure of the data and subsequently allow a number of feasible clusters to be defined. The spatiotemporal exploration of multidimensional data, leading to specific details, explanations and abstractions in the context of dynamic community behaviour, would provide a good evidence base for decision-makers and for the implementation of planning tools. The presented techniques are expected to be of increasing interest for the management and development of building stocks, as well as for urban and regional planning processes.
Keywords: building stock, data-mining, Geographic Information Science (GIS), spatiotemporal analysis, urban analysis
Building Research & Information ISSN 0961-3218 print/ISSN 1466-4321 online © 2009 Taylor & Francis http://www.informaworld.com/journals DOI: 10.1080/09613210903189343
Urban data-mining
Introduction
Most of the large databases currently available have a strong spatiotemporal component and potentially contain information that might be of value. Data-mining is commonly defined as the inspection of data. Mining implies a laborious process of searching for hidden information in a large amount of data (Han and Kamber, 2006). The ultimate goal of data-mining is to provide evidence-based insight through a deeper understanding of data (in the mind of the analyst) and to produce results that can be utilized at policy and strategy levels. Important requirements for ‘knowledge discovery’ are interpretability, novelty and the usefulness of results.
Since the use of the term ‘data-mining’ is quite diverse, the authors offer a short but more general definition of data-mining and knowledge discovery (Ultsch, 1987). Data-mining means the inspection of a large data set with the aim of knowledge discovery. Knowledge discovery is the discovery of new patterns in the data, i.e. knowledge that is unknown in this form so far. This knowledge has to be presented symbolically and should be understandable for human beings as well as useful in knowledge-based systems. An important goal of knowledge discovery is the search for patterns in data that can help explain the underlying process that generated the data.
A central issue of data-mining is the transition from data to knowledge. The conversion of sub-symbolic patterns and trends in data to a symbolic form is seen as the most difficult and most critical part of data analysis (Ultsch and Korus, 1995). Symbolically represented knowledge – as sought by data-mining – is a representation of facts in a formal language such that an interpreter with competence to process symbols can utilize this knowledge. In particular, human beings must be able to read, understand and evaluate this knowledge. The knowledge should be useful for analysis, diagnosis, simulation and/or prognosis of the process that generated the data set.
There have been several attempts to build content-based classifications of geospatial objects (e.g. buildings, building stocks, cities, regions) and their similarities. These approaches are summarized, and partly listed in chronological order, in Behnisch (2009). Clustering (i.e. unsupervised classification) is the process of finding intrinsic groups (i.e. empirical typologies or clusters) in a set of data without knowing a priori which object belongs to which class. Classification is the task of assigning class labels to a data set according to a model where classes are known. Typological grouping processes are systematic instruments to develop statistical scales and criteria. An early example is Harris (1943), a pioneer in city classification who ranked US cities according to industrial specialization data. Later, in the 1970s, studies were geared to measuring socio-economic properties and shifted more towards the goals of public policy. In recent years the evaluation of the performance of different cities has become increasingly important for sustainable development (Arlt et al., 2001). The patterns of demographic and economic change in Germany are also the subject of several investigations (Siedentop et al., 2003; Gatzweiler et al., 2003). Critical properties of geospatial objects are discussed and analysed by Demsar (2006).
It should be emphasized that most former classification studies rely on conventional clustering algorithms (e.g. hierarchical Ward clustering or partitional k-means). Especially in the field of urban and spatial planning as well as regional science, data are usually multidimensional, spatially correlated and heterogeneous. These properties often make former multivariate approaches inappropriate for such data, as their basic assumptions (e.g. independently generated and identically distributed observations) cease to be valid. Furthermore, several cluster algorithms are limited to finding clusters of a specific shape (e.g. spherical, ellipsoid). Extracting knowledge from geospatial data therefore requires special approaches.
The authors describe the term ‘urban data-mining’ as a methodological approach that discovers logical or mathematical and partly complex descriptions of urban patterns and regularities inside the data. This multidimensional mining approach is presented below step by step by finding groups (clusters) of German communities with the same multidynamic characteristics. The aim is to obtain a precise view of each detected pattern and to formulate spatial abstractions. The approach leads to the generation of hypotheses that might be valuable for further investigations.
‘Urban data-mining’ – exploring multidimensional data
The cyclical methodology is characterized by six main tasks (Figure 1) following the initial step of data collection. The main tasks on the far right of Figure 1 contain several aspects within the circle and are outlined below. The analysis always starts with a relevant problem or a specific research question. Several combinations and iterations of the steps within the circle are often necessary to find an appropriate solution or a possibly surprising answer. In particular, the cyclical approach provides the ability to identify hidden relationships and unusual patterns within a large amount of data. Human interaction is nevertheless important during the mining process to analyse and validate partial results as early as possible and to guide further processing steps.
Behnisch and Ultsch
Figure 1 ‘Urban data-mining’ approach. Source: Ultsch (2006) and adapted by Behnisch (2009)
Data inspection
The variables are examined to gain insight into the data and the relations between them, and are reformulated to make them compatible and comparable. Pre-processing is crucial for a successful outcome of the analysis. If the data are not cleansed and normalized, there is a danger of obtaining spurious and meaningless results. For many similarity measures, e.g. the commonly used Euclidean distance, normalization of data needs to be considered to avoid undesired emphasis of features with large ranges and variances.
Structure visualization
High-dimensional data are projected onto a low-dimensional grid through projection procedures such as multidimensional scaling (MDS), Sammon’s mapping (Sammon, 1969) and, in particular, emergent self-organizing maps (ESOM). Thus, the structures of the objects can be established in different forms (e.g. distance- and density-based approaches and their combinations).
Structure definition
Clustering is the process of finding intrinsic groups, called clusters, in a data set. Each cluster should be as homogeneous as possible and distinct from other clusters. A cluster can be defined based on distances or densities. Classification is the task of assigning class labels to a data set according to a model learned from training data where classes are known. Results can suggest a general typology and lead to the development of prediction models using subgroups instead of the total population (= number of objects, e.g. communities).
Structure control
Because the formation of clusters is open-ended, intermediate results need additional validation. Regression is the task of explicitly modelling variable dependencies to predict a subset of the variables from others (Hastie et al., 2009). Regression can also be used to replace missing values. Discriminant analysis is applicable to determine the class of an observation based on a set of variables. The structure control supports the explanation and description of a classification result.
Operationalization
New objects can be assigned to existing classes by classifiers representing a model in the form of rules or decision trees. A classifier is based on learning, testing and validation data sets. It is expressed in sub-symbolic or symbolic form; a symbolic classifier aids human comprehension.
Knowledge conversion
The most important step is the generation of useful, new and unsuspected knowledge. It is required to be representable in a linguistic form that is understandable to humans and automatically usable by knowledge-based systems. With extracted knowledge it is possible to diagnose unknown examples. Geovisualization (MacEachren, 1994) and spatial analysis support the interpretation.
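The effect of scale on the Euclidean distance, mentioned under data inspection, can be illustrated with a minimal sketch. The toy values and function names below are hypothetical and are not taken from the study; the sketch only assumes a plain z-transformation of each variable.

```python
import math
import statistics


def z_scores(values):
    """Standardize one variable to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical toy data: three objects, two variables on very different scales.
population_change = [-2.0, 1.0, 4.0]      # percent
tax_capacity = [100.0, 400.0, 250.0]      # currency units per person

raw = list(zip(population_change, tax_capacity))
scaled = list(zip(z_scores(population_change), z_scores(tax_capacity)))

# Without normalization the distance is dominated by the wide-ranging variable.
raw_distance = math.dist(raw[0], raw[1])
# After normalization both variables contribute on a comparable scale.
scaled_distance = math.dist(scaled[0], scaled[1])

print(f"raw: {raw_distance:.3f}, standardized: {scaled_distance:.3f}")
```

In the raw data the distance between the first two objects is almost entirely the tax difference of 300; after standardization the population change contributes as well.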
Inspection and transformation – raw data of the German dynamic behaviour
Six variables were selected for the classification analysis of 12 430 German communities. The data refer to the statistics of population (V1), migration (V2), tax capacity (Steuereinnahmekraft; V3), dwellings (V4), employment (V5), and commuters (V6). The variables have often been used in former approaches (e.g. Gatzweiler et al., 2003) and are based on statistical data from well-known and easily accessible institutions. Taken together, these variables are needed to obtain a deeper view into the dynamic behaviour of communities.
The dynamic processes are mostly characterized by positive or negative percentage changes between the years 1994 and 2004. Some variables are used to describe the present situation referring to one specific year. For example, tax capacity provides an indication of the economic and financial situation of communities (Statistisches Bundesamt, 2005). This composite and complex indicator refers exclusively to the German public finance system and is influenced by several values (e.g. income tax, property tax, value added tax, municipal collection rate).
Table 1 shows the calculation of variables. The inspection of data included visualization in the form of histograms, Q-Q plots (where ‘Q’ stands for quantile), Pareto density estimation (PDE) plots (Ultsch, 2003), and box plots. The first hypothesis on the distribution of each variable is a bimodal mixture of log-normally distributed data (data > 0, skewed to the right; data < 0, skewed to the left). It was decided to use transformations such as the ladder of powers to take into account the restrictions of statistics (Hand et al., 2001). All variables were transformed by using:
\[ y = \operatorname{sign}(x)\,\log(|x| + 1) \]
As a result of pre-processing, a mixture of two distributions is found with a decision boundary of zero in most cases. One unsuspected result during the pre-processing is the finding of a specific decision boundary in tax capacity (€170/person). It was therefore used for the optimized calculation of this variable (Table 1) and might be useful for further studies. All six variables are characterized by dichotomy. Figure 2 shows all transformed variables using histograms and scatter plots.
For the classification approach (e.g. U*-Matrix and subsequent U*C-algorithm), which relies on the Euclidean distance, the data need to be standardized. Figure 3 shows the dichotomized variables using PDE for the investigation of z-transformed data. A multidimensional dynamic implies a combined view of all these dichotomized variables. The question is now: ‘Are there communities that have similar problems with regard to dynamics?’ Under the assumption that all combinations of characteristics exist, 2^6 = 64 classes might describe the multidimensional dynamic of German communities. An equal distribution of objects to these classes gives a prior probability of 1/64 = 1.56% for each class.
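The pre-processing chain described above (signed logarithmic transformation, z-transformation, dichotomization at the decision boundaries, and the resulting 64 possible classes) can be sketched as follows. This is a minimal illustration under stated assumptions: the natural logarithm is assumed because the text does not name a base, the sample record is made up, and all function names are hypothetical.

```python
import math
import statistics
from itertools import product


def signed_log(x):
    """y = sign(x) * log(|x| + 1); natural log is an assumption of this sketch."""
    return math.copysign(math.log(abs(x) + 1.0), x)


def z_transform(values):
    """Standardize a variable to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def dichotomy_code(record, boundaries):
    """Tuple of 0/1 flags: 1 where a value lies above its decision boundary."""
    return tuple(int(v > b) for v, b in zip(record, boundaries))


# Decision boundaries for V1..V6: zero, except 170 for tax capacity (V3).
BOUNDARIES = (0.0, 0.0, 170.0, 0.0, 0.0, 0.0)

# Hypothetical raw record for one community (V1..V6); the values are made up.
community = (3.2, -15.0, 240.0, 5.1, -8.4, -0.3)
code = dichotomy_code(community, BOUNDARIES)

# Transform and standardize a hypothetical sample of one variable.
sample = [-12.0, -3.0, 0.5, 4.0, 30.0]
standardized = z_transform([signed_log(v) for v in sample])

# Six dichotomous variables span 2**6 = 64 possible classes.
all_classes = list(product((0, 1), repeat=6))
prior = 1.0 / len(all_classes)

print(code, len(all_classes), f"{prior:.2%}")
```

The transformation keeps the sign of each value, so the dichotomy around zero survives the compression of both tails.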
Structure visualization and clustering of similar dynamics – U*-Map and U*C-algorithm
The power of self-organization allows the emergence of structure in data and supports its visualization, clustering and labelling by means of a combined distance- and density-based approach. To visualize high-dimensional data, a projection from the high-dimensional space onto two dimensions (= a planar map) is needed. This projection onto a grid of neurons is called a self-organizing map (SOM). There are two different SOM usages. The first are SOM as introduced by Kohonen (1982). Neurons are
Table 1 Examination of all six distributions

| ID: Label | Calculation | Decision boundaries | Size of classes |
|---|---|---|---|
| V1: Population | Change of population as a percentage between 1994 and 2004 (%) | C1: Data ≤ 0; C2: Data > 0 | 4650, 37%; 7780, 63% |
| V2: Migration | Net balance of migration (into/away) between 1997 and 2004 (–) | C1: Data ≤ 0; C2: Data > 0 | 4460, 36%; 7970, 64% |
| V3: Tax capacity | Deviation to a defined tax value (€170/person), 2003 (%) | C1: Data ≤ 170; C2: Data > 170 | 2884, 23%; 9546, 77% |
| V4: Dwellings | Change in the number of dwellings, 1994 and 2004 (%) | C1: Data ≤ 0; C2: Data > 0 | 487, 4%; 11 943, 96% |
| V5: Employment | Change in employment, 1997–2004 (%) | C1: Data ≤ 0; C2: Data > 0 | 7333, 59%; 5097, 41% |
| V6: Commuters | Ratio of [(in-commuter) – (out-commuter)]/population, 2004 (%) | C1: Data ≤ 0; C2: Data > 0 | 10 830, 87%; 1600, 13% |
Figure 2 Scatter plots of the transformed variables (D = 6, N = 12 430)
identified with clusters in the data space (k-means SOM) and there are very few neurons. The second are SOM where the map space is regarded as a tool for the visualization of the otherwise high-dimensional data space. These SOM consist of thousands or tens of thousands of neurons. Such SOM allow the emergence of intrinsic structural features of the data space and are therefore called emergent SOM (ESOM) (Ultsch, 1999). The ESOM preserves the neighbourhood relationships of the high-dimensional data, and the weight vectors of the neurons can be thought of as sampling points of the data. The U-Matrix has become the canonical tool for displaying the distance structures of the input data on ESOM. The P-Matrix takes density information into account. The combination of a U-Matrix and a P-Matrix leads to the U*-Matrix. On this U*-Matrix a cluster structure in the data set can be detected directly. The examples in Figure 4 allow a comparison of both methods using the same data to see whether there are cluster structures.
Figure 3 Qualitative distribution of all variables (Pareto density estimation)
The often-used finite grid as a map has the disadvantage that neurons at the rim of the map have very different mapping qualities from neurons in the centre. This is important during the learning phase and structures the projection. In many applications important clusters appear in the corner of such a planar map.
Using ESOM as a basis for clustering has the advantage of a non-linear disentanglement of complex structures. The clustering of the ESOM can be performed at two different levels. First, the ‘best match’ visualization can be used to mark data points that represent a neuron with a defined characteristic. Best matches, and thus the corresponding data points, can be manually grouped into several clusters. Not all points need to be labelled; outliers are usually easily detected and can be removed. Secondly, the neurons can be clustered by using the clustering algorithm U*C, which is based on grid projections and uses distance and density information (Ultsch, 2005a). In most cases an aggregation of objects is necessary to build up a meaningful classification (Behnisch and Ultsch, 2008). Assigning a name to a cluster is one of the most important and crucial steps in defining the meaning of a cluster.
The U*-Matrix is displayed in tiled form to avoid border effects. Afterwards, a so-called island view is realized by a mask to reduce redundancies, so that nearly every neuron is visible exactly once. An ESOM with 50 × 150 neurons is trained with the pre-processed dynamic data. The corresponding U*-Map (island view; Figure 5) delivers a geographical landscape of the input data on a projected map (imaginary axes). The cluster boundaries are expressed by mountains: the height, displayed on the z-axis, represents the distance between different objects. A valley describes similar objects, characterized by small U-heights on the U*-Map. Data points found in coherent regions are assigned to one cluster. All local regions lying in the same cluster have the same spatial properties. The presented U*-Map already includes the results of the clustering algorithm (U*C) and offers a visualization of hidden and unsuspected structures (13 clusters). Six main classes are defined through the interpretation and aggregation of sub-clusters. Figure 6 shows the assigned label of each class.
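The idea that ridges separate clusters and valleys contain similar objects can be illustrated with a plain U-Matrix computation. The following is a minimal sketch and not the authors' implementation: it assumes an already trained map, a toroidal (wrap-around) grid with four neighbours per neuron, and the mean neighbour distance as the U-height, which is one common formulation. The density-based P-Matrix and its combination into the U*-Matrix are not shown, and the toy weight vectors are hypothetical.

```python
import math


def u_matrix(weights):
    """Return the U-height of every neuron: the mean Euclidean distance between
    its weight vector and those of its four grid neighbours (toroidal wrap)."""
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                weights[(r - 1) % rows][c],
                weights[(r + 1) % rows][c],
                weights[r][(c - 1) % cols],
                weights[r][(c + 1) % cols],
            ]
            distances = [math.dist(weights[r][c], n) for n in neighbours]
            heights[r][c] = sum(distances) / len(distances)
    return heights


# Hypothetical 4 x 6 map of 2-D weight vectors: columns 0-2 lie near (0, 0),
# columns 3-5 near (10, 10), i.e. two well-separated groups.
row = [(0.0, 0.0)] * 3 + [(10.0, 10.0)] * 3
grid = [list(row) for _ in range(4)]

heights = u_matrix(grid)
for line in heights:
    print(" ".join(f"{h:5.2f}" for h in line))
```

Neurons in the interior of each group obtain a height of zero (valleys), whereas neurons next to the other group obtain positive heights (ridges), including those at the wrap-around seam.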
Figure 4 k-means self-organizing map (SOM) by Kaski et al. (2002) (left) and U*-Matrix (right)
Figure 5 Island of multidimensional community dynamics (U*-Map). This particular landscape of numbered clusters reveals 13 clusters, each containing similar communities with characteristic properties
Structure control: explaining patterns of multidimensional dynamic
Techniques of knowledge discovery can be integrated to understand the structure in a complementary form. They support the finding of an appropriate cluster aggregation and denomination (Bishop, 2006). The clusters are described by a classification and regression tree (CART) (Breiman et al., 1984). In this manner, two or three variables are extracted to obtain a deeper view of the structure of the clusters (Table 2). The table presents the cluster label, the individual properties of the clusters and their size (number of communities).
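How a tree singles out a few explanatory variables can be sketched with a single node split. The example below is an illustration only: it assumes dichotomized (0/1) variables and the Gini impurity, a common CART splitting criterion; the text does not state which criterion or software was used. The toy records, labels and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, weighted_impurity) for the binary feature whose
    split leaves the lowest weighted Gini impurity in the two child nodes."""
    best = None
    for j in range(len(rows[0])):
        left = [lab for row, lab in zip(rows, labels) if row[j] == 0]
        right = [lab for row, lab in zip(rows, labels) if row[j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (j, score)
    return best


# Hypothetical toy data, one row per community:
# (migration > 0, employment > 0, dwellings > 0)
rows = [(1, 1, 1), (1, 1, 1), (1, 0, 1), (0, 0, 1), (0, 0, 0), (0, 1, 1)]
labels = ["A", "A", "A", "B", "B", "B"]  # made-up cluster labels

feature, impurity = best_split(rows, labels)
print(feature, impurity)
```

In this toy data the migration flag separates the two labels perfectly, so it is chosen for the first split; a full tree would apply the same search recursively to each child node.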
Figure 6 Observed patterns of multidimensional dynamics
So-called CART analysis consists of four basic steps. The first step consists of building trees, during which a tree is built using recursive splitting of nodes. Each resulting node is assigned to a predicted class, based on the distribution of classes in the learning data set which would occur in that node and the decision-cost matrix. The assignment of a predicted class to each node occurs whether or not that node is subsequently split into child nodes. The second step consists of stopping the tree-building process. At this point a ‘maximal’ tree has been produced that probably greatly overfits the information contained within the
Table 2 Machine-generated explanations of the structure of clusters
Cluster label
Cluster description based on the extracted variables
Growing Dynamic Communities
Commuting with growing jobs
migration positive employment positive between 1997–2004
two sub clusters are formed depending on population population positive
migration negative
population negative
Cluster size sub cluster A1 (3377 communities) sub cluster A2 (174 communities) sub cluster A3 (310 communities)
two sub clusters are formed depending on commuters
sub cluster A4 (678 communities) sub cluster A5 (127 communities) sub cluster B1 (3197 communities)
migration positive
Dormitory towns
Hard reachables, losers from German reunification
three sub clusters are formed depending on population and tax capacity
employment negative between 1997–2004
sub cluster B2 (324 communities) sub cluster B3 (276 communities)
migration negative
two sub clusters are formed depending on dwellings
Baby boomers losing jobs
population positive
Special cases
1300 communities, outliers
sub cluster B4 (2114 communities) sub cluster B5 (158 communities) sub cluster B6 (395 communities)
learning data set. The third step consists of tree ‘pruning’, which results in the creation of a sequence of simpler and simpler trees through the cutting off of increasingly important nodes. The fourth step consists of optimal tree selection, during which the tree that fits the information in the learning data set (but does not overfit it) is selected from among the sequence of pruned trees.
It is astonishing that just a few classes contain more than 50% of all 12 430 communities. In particular, there are six main multidimensional dynamics (Figure 6) and one class of outliers. The outlier class collects all minority classes, i.e. those that are only sparsely occupied (below 1%). The approach poses some questions. Why do some classes not exist? What factors have an effect on the observed distribution of communities?
Knowledge conversion – transition to knowledge and spatial abstraction
The aim of knowledge discovery implies the transition from data to knowledge. Such knowledge is required to be previously unknown, interesting and useful. Against this background, knowledge conversion should clarify the understanding of the observed multidimensional dynamics of German communities.
First, the investigation of distributions leads to the finding of a dichotomy in all six variables (positive or negative development). One assumption was that all combinations of characteristics might exist and describe the multidimensional dynamic of German communities. The authors therefore expected 64 classes based on a manual classification and an equal distribution of objects to these classes. It was surprising that the distribution of objects was totally different. A deeper understanding was obtained with the U*-Matrix and the U*C-algorithm.
Secondly, the dependencies between clusters and variables are discovered by the integration of knowledge-discovery techniques. Explanations are extracted by CART and capture the minimal information needed to describe each cluster. According to the requirements of knowledge discovery, such knowledge needs to be unsuspected and previously unknown. Due to the combined interpretation, it might be utilized by decision-makers within the field of spatial planning.
Thirdly, the structure and the machine-generated explanations were validated from the perspective of the spatial analyst and yielded a spatial abstraction. The localization of objects and selected spatial analyses are suitable for the continuous control of the corresponding interpretations. Results are therefore combined with official city hierarchies (e.g. low-level-centre or high-level-centre), spatial typologies (e.g. central area or periphery), transport infrastructure (e.g. highway and railway system), and isochrones. Finally, classes are
Figure 7 Localization of multidimensional dynamics
satisfactorily related to the recognized pressure factors for urban dynamic development (population and employment) and are represented by specific combinations of dynamic properties. Figure 7 shows the spatial distribution of classified communities. It further contains the relative proportion of communities in the eastern (the former German Democratic Republic without Berlin) and western parts of Germany. Needless to say, many former studies have already discovered growing and shrinking processes in Germany. It is therefore well known that the southern and western parts of Germany are growing and the
Figure 8 View of the sub-cluster ‘Losers from German reunification’. This map also contains a symbol (black dots) for each community called ‘Hard reachables’. Sub-symbolic patterns and data trends are represented by the cluster labels. The local urban properties of the communities should be examined in detail by further investigations
eastern part (the former German Democratic Republic) is shrinking. These studies typically looked at the sum of variables to illustrate growing or shrinking gradients. In contrast to these approaches, the authors want to identify multiple dynamic behaviours of communities as well as characteristic patterns. One advantage is that unexpected patterns emerge from the collected data, both in their dynamics and in their location.
For example, it was unexpected that one pattern of communities is characterized by a negative net balance of migration, but by a growing population. The development of population is able to compensate for the recent population outflow. Typical communities are further characterized by a negative job development. The characteristic label assigned to this cluster is ‘Baby boomers losing jobs’. This dynamic pattern is predominant in the western part of Germany and often found in small towns (e.g. Kleinstadt). Such communities are, on the one hand, near the urban areas and, on the other hand, in specific rural areas that might be attractive and desirable places of residence for young families. One assumption is that single and flexible persons move to nearby places with a diverse job market and other socio-cultural conditions.
Another cluster (‘Hard reachables and losers from the German reunification’) is also highlighted in Figure 8. Massive falls in population and job losses are shaking urban quarters, towns, cities, and regions in Germany. The dramatic developments in East Germany since 1989, which resulted in the vacancy of over 1 million flats and houses, the abandonment of innumerable industrial sites, and the transformation of established social and cultural institutions, are turning out to be a rising general pattern in the system of communities. The localized sub-cluster contains many small rural communities and is explicitly extracted in East Germany using this multidimensional approach (Table 2, Machine-generated explanations).
It is unfortunate that, a long time after German reunification, the summarized communities are still characterized by such dramatic developments. This development is explained by a massively negative job situation, a clearly negative net balance of migration, and a stationary or even decreasing number of dwellings. Furthermore, public transport is inconvenient.
Discussion and conclusions
In the presented ‘urban data-mining’ case study, a typical (straightforward) unsupervised classification procedure was applied to geospatial data (for 12 430 communities). It should be noted that procedures entailing data-mining and knowledge discovery are currently not employed in spatial and urban planning. The issue for this case study is therefore to apply typical techniques and already-known methods in an exemplary way to validate the process and outcomes. The approach successfully discovers multiple and unsuspected dynamic behaviours of German communities. In particular, multidimensional patterns are explored to trigger discussions in the application domain and to reveal insights about spatiotemporal phenomena. The authors expect that in the future the concept of data-mining in connection with knowledge-discovery techniques will become increasingly important for urban research and planning processes.
The presented ‘urban data-mining’ approach provides the ability to identify hidden relationships and unusual patterns within a large amount of community data. First, the pool of data is examined, and the importance of investigating the distributions is demonstrated by the multidimensional dichotomy. Afterwards, it is shown that the use of ESOMs is an appropriate method for clustering and classification. The advantage is that the structure of the data is visualized and a number of feasible clusters can subsequently be defined using the U*C-algorithm, whereas a typical hierarchical algorithm would fail to examine these 12 430 communities (for general clustering problems, see Ultsch, 2005). The presented approach leads to the identification and abstraction of multidimensional dynamics. Six main dynamics of communities are discovered. Structure control and interpretation were realized by generating trees with the aim of identifying a significant number of explanatory variables. Knowledge conversion provides the mentioned transition from data to knowledge and generates several hypotheses for further investigations. Specific clusters should be investigated in detail using other structural and temporal parameters (e.g. age of the population, buildings, infrastructure, etc.). Furthermore, field investigations in selected areas should be conducted to obtain more reliable statistical data in space and time.
In particular, long-term time courses serve as a basis for making decisions, as well as for controlling decisions that have been taken (Bätzing and Dickhörner, 2001). Another aspect deals with spatial outliers (Shekhar et al., 2003) and might be of specific interest for further research. The modifiable areal unit problem (Openshaw, 1984) should also be taken into account to optimize the analysis of German communities. How can urban data-mining be deployed to study shrinking and growing processes? With its regional planning policy, the German federal government aims to create equivalent social conditions in all subregions. Politicians and planners previously assumed that if enough public money were invested in infrastructure, then private investment would follow in deprived areas. However, other factors such as regional economic and demographic changes will foster regional and spatial disparities in the future. This complexity entails comprehensive and probably multidimensional
approaches. New types of communities are arising and precise individual concepts are needed for their urban development. There is a need for reinventing or rediscovering the properties that politicians and planners need to engage with. The development of communities requires new perspectives in many ways: economic, social, cultural, societal, and in the building stock (Banse and Effenberger, 2006). Beyond this, it is necessary to work on different scales (e.g. nationally, regionally and locally) and with efficient strategies on different administrative levels. It is therefore important to start with the local potential of the specific place and not simply rely upon a top-down approach. In particular, decisions about funding should be based on the real problems of local communities. In consideration of different multidimensional patterns, it is possible to provide a data-based approach to spatial relations and neighbourhoods (e.g. comparative strengths, interregional communication and cooperation). It is important to work with extracted knowledge when formulating strategies for the future development of communities. Therefore, there is a need for adjusted planning tools. A good basis for the implementation of such tools is the spatiotemporal exploration of multidimensional data leading to specific details and explanations in the context of dynamic behaviour. Common standards for the continuous observation of specific planning processes or mitigation measures should be established in consideration of possible multidimensional results. In particular, the integration of temporary multidimensional investigations might encourage the short- and long-term development of communities. Furthermore, procedures based on knowledge-based systems are currently not sufficiently developed for direct integration into regional and urban planning and development processes (Streich, 2005). 
Such approaches might lead to a benchmark system for regional policy or to other strategic instruments such as semi- or fully automated urban monitoring systems.
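The structure-control step summarized in these conclusions, in which trees are generated to identify explanatory variables for the discovered clusters, can be illustrated with a minimal sketch. It is not the authors' implementation: it searches only for the single best binary split by weighted Gini impurity, the criterion used in classification trees of the kind described by Breiman et al. (1984). The variable names, the toy values and the two class labels are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, names):
    """Return (weighted impurity, variable name, threshold) of the best split.

    Every midpoint between consecutive distinct values of every variable is
    tried; the split with the lowest weighted Gini impurity is kept.
    """
    n = len(rows)
    best = (gini(labels), None, None)  # baseline: no split at all
    for j, name in enumerate(names):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[0]:
                best = (score, name, threshold)
    return best


# Hypothetical toy data: percentage change per community, 1994-2004.
names = ["population_change", "dwelling_change"]
rows = [(-6.0, 1.0), (-4.5, 3.0), (-3.0, 2.0), (2.5, 2.5), (4.0, 1.5), (7.0, 3.5)]
labels = ["shrinking", "shrinking", "shrinking", "growing", "growing", "growing"]

impurity, variable, threshold = best_split(rows, labels, names)
print(f"split on {variable} <= {threshold:.2f} (weighted Gini {impurity:.3f})")
```

In this toy case the hypothetical population variable alone separates the two groups, which is the kind of compact, interpretable rule that a tree-based structure control aims to extract. A full tree would apply the same search recursively to each resulting subset.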
Acknowledgements
The data were created by the Federal Institute for Research on Building, Urban Affairs and Spatial Development (BBSR) within the Federal Office for Building and Regional Planning (BBR). This research was supported, in part, by the Landesgraduiertenförderung Baden-Württemberg (LGF) and by the subsidy award from the Rossmann Foundation, Karlsruhe. The authors would like to thank PD Dr. rer. nat. habil. Nguyen Xuan Thinh for the profitable discussions.
References
Arlt, G., Gössel, J., Heber, B., Hennersdorf, J., Lehmann, I. and Thinh, N.X. (2001) Auswirkungen städtischer Nutzungsstrukturen auf Bodenversiegelung und Bodenpreis. IÖR Schriften No. 34, Leibniz-Institut für ökologische Raumentwicklung e.V., Dresden.
Banse, J. and Effenberger, K.H. (2006) Deutschland 2050 – Auswirkungen des demographischen Wandels auf den Wohnungsbestand. IÖR Texte No. 152, Leibniz-Institut für ökologische Raumentwicklung e.V., Dresden.
Bätzing, W. and Dickhörner, Y. (2001) Die Typisierungen der Alpengemeinden nach ‘Entwicklungsverlaufsklassen’ für den Zeitraum 1870–1990. Mitteilungen der Fränkischen Geographischen Gesellschaft, 48, 273–303.
Behnisch, M. (2009) Urban Data Mining, Universitätsverlag, Karlsruhe.
Behnisch, M. and Ultsch, A. (2008) Urban data mining using emergent SOM, in Chr. Preisach, H. Burkhardt, L. Schmidt-Thieme and R. Decker (eds): Data Analysis, Machine Learning and Applications. Proceedings of the 31st Annual Conference of the German Classification Society, Springer, Berlin, pp. 311–318.
Bishop, Ch. (2006) Pattern Recognition and Machine Learning, Springer, Berlin.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees, CRC Press, Boca Raton, FL.
Demsar, U. (2006) Data Mining of Geospatial Data: Combining Visual and Automatic Methods, Urban Planning Department, KTH Stockholm.
Gatzweiler, H.P., Meyer, K. and Milbert, A. (2003) Schrumpfende Städte in Deutschland? Fakten und Trends, in BBR (ed.): Informationen zur Raumentwicklung, Selbstverlag, Bonn, 10/11, pp. 557–574.
Han, J. and Kamber, M. (2006) Data Mining: Concepts and Techniques, 2nd edn, Morgan Kaufmann, San Francisco.
Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining, MIT Press, Cambridge, MA.
Harris, C.D. (1943) A functional classification of cities in the United States. Geographical Review, 33(1), 86–99.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning, 2nd edn, Springer, Berlin.
Kaski, S., Nikkilä, J., Törönen, P., Venna, J., Castrén, E. and Wong, G. (2002) Analysis and visualization of gene expression data using self-organizing maps. Neural Networks, 15(8–9), 953–966.
Kohonen, T. (1982) Self-organizing formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
MacEachren, A.M. (1994) Visualization in modern cartography: setting the agenda, in A.M. MacEachren, D.R.F. Taylor (eds): Visualization in Modern Cartography, Pergamon, Oxford, pp. 1–12.
Openshaw, S. (1984) The Modifiable Areal Unit Problem. Geo Books, Norwich.
Sammon, J.W. (1969) A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5), 401–409.
Shekhar, S., Lu, C.-T. and Zhang, P. (2003) A unified approach to detecting spatial outliers. GeoInformatica, 7(2), 139–166.
Siedentop, St., Kausch, St., Einig, K. and Gössel, J. (2003) Siedlungsstrukturelle Veränderungen im Umland der Agglomerationsräume, Heft 114, Bundesamt für Bauwesen und Raumordnung, Bonn.
Statistisches Bundesamt (2005) Qualitätsbericht Realsteuervergleich. Statistisches Bundesamt (DESTATIS), Selbstverlag, Wiesbaden.
Streich, B. (2005) Stadtplanung in der Wissensgesellschaft, VS, Wiesbaden.
Ultsch, A. (1987) Control for Knowledge-based Information Retrieval, der Fachvereine, Zürich.
Ultsch, A. (1999) Data mining and knowledge discovery with emergent self organizing feature maps for multivariate time series, in E. Oja, S. Kaski (eds): Kohonen Maps, Elsevier, Amsterdam, pp. 33–46.
Ultsch, A. (2003) Pareto density estimation: a density estimation for knowledge discovery, in D. Baier, K.D. Wernecke (eds): Innovations in Classification, Data Science, and Information Systems. Proceedings of the 27th Annual Conference of the German Classification Society, Springer, Berlin, pp. 91–100.
Ultsch, A. (2005a) Clustering with SOM: U*C, in Proceedings of the Workshop on Self-Organizing Maps (WSOM 2005), Paris, France, pp. 75–82 (available at: http://www.uni-marburg.de/fb12/datenbionik/pdf/pubs/2005/ultsch05clustering) (accessed 6 July 2009).
Ultsch, A. (2005b) U*C self-organized clustering with emergent feature map, in Proceedings of Lernen, Wissensentdeckung und Adaptivität (LWA/FGML 2005), Saarbrücken, Germany, pp. 240–246.
Ultsch, A. (2006) Knowledge discovery, lecture notes. Unpublished MS, Department of Mathematics and Informatics, University of Marburg.
Ultsch, A. and Korus, D. (1995) Automatic acquisition of symbolic knowledge from subsymbolic neural networks, in Proceedings of the 3rd European Congress on Intelligent Techniques and Soft Computing (EUFIT’95), Aachen, Germany, 28–31 August 1995, Vol I, pp. 326–331.