International Journal of Medical Informatics (2003) 70, 79-94
www.elsevier.com/locate/ijmedinf
Integrating GIS components with knowledge discovery technology for environmental health decision support

Yvan Bédard a,c,*, Pierre Gosselin b,c, Sonia Rivest a, Marie-Josée Proulx a, Martin Nadeau a, Germain Lebel c, Marie-France Gagnon c

a Centre for Research in Geomatics, Université Laval, Pavillon Casault, Québec, Canada, G1K 7P4
b Institut national de santé publique du Québec (INSPQ), Beauport, Canada
c Centre hospitalier universitaire de Québec (CHUQ), 2705 boulevard Laurier, Sainte-Foy, Québec, Canada, G1V 4G2
Received 20 June 2002; received in revised form 25 August 2002; accepted 17 October 2002
KEYWORDS Decision-support; Environmental health; Geographic knowledge discovery (GKD); Geographic information systems (GIS); Spatial on-line analytical processing (SOLAP); Public health
Summary This paper presents a new category of decision-support tools that builds on today’s Geographic Information Systems (GIS) and On-Line Analytical Processing (OLAP) technologies to facilitate Geographic Knowledge Discovery (GKD). This new category, named Spatial OLAP (SOLAP), has been an R&D topic for about 5 years in a few university labs and is now being implemented by early adopters in different fields, including public health where it provides numerous advantages. In this paper, we present an example of a SOLAP application in the field of environmental health: the ICEM-SE project. After having presented this example, we describe the design of this system and explain how it provides fast and easy access to the detailed and aggregated data that are needed for GKD and decision-making in public health. The SOLAP concept is also described and a comparison is made with traditional GIS applications. © 2002 Elsevier Science Ireland Ltd. All rights reserved.
1. Introduction

Public health organizations collect significant volumes of data. Monitoring and assessing trends of environmental exposures and related health problems require health specialists to access appropriate information in a timely manner. This is true for public health planning, management and surveillance purposes in general. Quality information helps to identify and prioritize problems, to develop and evaluate policies and actions, to organize clinical health services delivery, to guide research and development, to contribute to standards and guidelines development as well as to monitor progress and to inform the public. These general needs can be fulfilled with a series of systems that can be grouped into the following classes, in order of increasing technical complexity:
*Corresponding author. E-mail addresses: [email protected] (Y. Bédard), [email protected] (P. Gosselin), [email protected] (S. Rivest), [email protected] (M.-J. Proulx), [email protected] (M. Nadeau), [email protected] (G. Lebel), [email protected] (M.-F. Gagnon).
1386-5056/03/$ - see front matter © 2002 Elsevier Science Ireland Ltd. All rights reserved. doi:10.1016/S1386-5056(02)00126-0
1) find what information exists and where it is stored (e.g. digital libraries, web portals, search engines, metadata engines, spatial data infrastructures);
2) access and query the data (e.g. Database Management Systems (DBMS), low-end Geographic Information Systems (GIS), spatial viewers);
3) visualize pre-built outputs (e.g. Executive Information Systems (EIS), dashboards, DBMS views, GIS result sets);
4) create new types of outputs (e.g. query builders, report builders, low-end GIS);
5) perform advanced analysis (e.g. statistical packages, high-end GIS);
6) perform interactive exploration of large amounts of data (e.g. On-Line Analytical Processing (OLAP), Spatial OLAP (SOLAP));
7) trigger automatic detection of patterns in the data (e.g. data mining).

A recent study of the needs of high-level health specialists, conducted across Canada in different organizations (federal, provincial, universities, private companies), showed that the above functions are already used extensively by the respondents, but mostly with the non-spatial component of the data [1]. Non-spatial knowledge discovery is already emerging at that level of decision-making. However, the picture is completely different in non-specialized local or regional public health agencies, as shown for instance by a large survey done for the province of Quebec [2], where these tools are only sparingly used although they are in demand for the mid-term. The pan-Canadian study also showed that the geospatial component is expected to be used in all of the functions listed above in the very short term. According to this study, this will happen because the geospatial component allows for better presentation and visualization of the data, improved dissemination and communication, enhanced analysis and better support for decision-making.
The study also indicated that specialists gradually implement geo-digital libraries, spatial viewers, low-end and high-end GIS packages to conduct the first five functions with the geospatial component of the data. However, the integration of the geospatial component in the last two functions is only starting, mostly in research groups. New technologies that better support these functions would give health specialists a new exploration and analysis potential known as Geographic Knowledge Discovery (GKD) [3,4]. GKD tools directly support decision-making involving geospatial data.
Although knowledge discovery in general may be conducted using tools such as OLAP and data mining (respectively, user-driven and software-driven knowledge discovery), today’s commercial packages rarely take into account the geospatial component of the data. To support GKD in an efficient manner, appropriate geospatial technology has to be developed. GIS alone is simply not an efficient solution because it is built on a transactional paradigm [5]. In order to better meet the above-mentioned needs with a fast and intuitive solution, one needs a system built on the multidimensional paradigm. Integrating GIS and OLAP to offer interactive data exploration in a hypertext-style manner with cartographic, statistical and tabular views represents an innovative solution. Their combination into so-called SOLAP [6-9] provides a new capability for GKD that goes well beyond traditional GIS. For example, SOLAP and spatial data mining [10,7,11] allow users to easily and rapidly navigate within geospatial datasets as well as within descriptive data. This better supports the creation and/or the first validation of hypotheses: implicit spatial relationships between phenomena rapidly become evident, and new relations are more likely to emerge in the mind of the user, who does not have to bother with Structured Query Language (SQL)-type commands or wait more than a few seconds between displays of detailed or general maps, detailed or aggregated histograms, detailed or summarized tables, etc. Such GKD tools help to discover new tracks for analysis and to better focus research, to rapidly eliminate irrelevant hypotheses and to access large volumes of data that are sometimes difficult to access for non-GIS and non-database specialists.
Furthermore, an improved and easier access to basic GIS functionalities for non-specialists facilitates the inclusion of data-based evidence in the decision-making processes for such generic tasks as locating service delivery units (on the basis of criteria such as population density, access by public transit, rates of disease in the community, etc.) and marketing health promotion programs where they are most needed (for instance, high-risk subpopulations). Today, SOLAP has become a viable solution to explore geospatial health data and the most recent research aims at improving the underlying fundamental concepts as well as the capabilities of emerging commercial offerings (see [4] for good examples). The goal of this paper is to present an example of a SOLAP-based system developed for environmental health purposes during the ICEM-SE project. We begin with a presentation of the
system that has been developed, including an overview of its content, its functions and its architecture. Then, we describe the SOLAP concept in more detail, with references to the ICEM-SE data and GIS technology where appropriate. This description is followed by a comparison between SOLAP and GIS.
2. The ICEM-SE project: a practical example of a SOLAP-based system for environmental health

This section describes a practical example of a system developed during the ICEM-SE project (Cartographic Interface for the Multidimensional Exploration of Environmental Health Indicators on the World Wide Web). This research and development project aimed at building a new type of cartographic interface for the multidimensional exploration of environmental and health indicators via the World Wide Web. It is one of the GEOIDE projects (Canada’s Network of Centers of Excellence in Geomatics) [12].
2.1. Objectives of the ICEM-SE project

The general objective of the project was to improve the capabilities of geomatics technologies for decision-support (user-driven GKD). More precisely, the project aimed at building and testing a new geomatics solution, a SOLAP-based user interface, to facilitate the access to and exploration of environmental and health geospatial data for health specialists [13]. The long-term goal of the ICEM-SE project was to help reduce health risks caused by an environmental source by providing quick and easy access to high-quality environmental and health data in order to improve decision-making and interventions, access to statistics and other information, and the discovery of new knowledge. The geomatics research objectives were to adapt the concepts coming from multidimensional databases and OLAP to a geospatial context. In parallel, the health research objectives were to develop meaningful indicators in environmental health for use in the SOLAP-based interface. The technical objective was to develop fully functional prototypes that would give users without any GIS background the capability to easily and rapidly explore their data in order to understand complex phenomena related to environmental health.
2.2. General description of the prototypes

The project strategy included the exploration of different combinations of technologies and the development of generic shells to build prototypes. Using commercial software components, our prototypes were designed as much as possible to facilitate their reuse for new types of data exploration or other public health applications. This strategy allows for:

- the addition of new health and environmental data;
- the addition of new cartographical layers;
- the customization of the different displays (number of classes used in a statistical chart or map, default parameters for map or chart semiology, selection of axes in tables, etc.).

Within the prototypes developed, users can easily navigate through:

- different levels of detail, for example from the local to the regional and provincial levels and conversely, or from all cancers to cancers of the respiratory system to lung cancers;
- different themes, for example from asthma to cancer and other diseases, or from industrial pollutants to drinking water quality and other environmental factors;
- different epochs;
- different subgroups of population (age groups and sex);
- different statistical measures.

Results are presented via displays with synchronized refreshing that can be used to navigate into the database with functions such as ‘drill-down’, ‘roll-up’, ‘pivot’ and ‘drill-across’. The synchronized displays may include several:

- thematic maps;
- statistical diagrams (bar charts, pie charts);
- tables.
2.3. Development of the prototypes

Two SOLAP prototypes have been developed: an entry-level prototype for simpler navigation and a high-end prototype with more GKD functions. Both prototypes are based on the multidimensional database structure as used in data warehousing, OLAP and data mining systems. The developed prototypes provide quick and easy access to environmental and health data for a temporal coverage varying from 5 to 15 years depending on the
topic and for a geographic coverage for the province of Quebec at the local level (community health centers (CLSC)), regional level (regional health authorities (RSS)) and provincial level. Currently, the prototypes contain data and metadata about the following indicators:

- Cancer (incidence and deaths);
- Respiratory diseases (hospitalizations and deaths);
- Notifiable diseases (incidence);
- Poisonings;
- Air quality monitoring;
- National pollutant release inventory;
- Greenhouse gases;
- Pesticide sales;
- Waste management;
- Environmental health team activities.

Other indicators can easily be added to either prototype, as this has become a straightforward operation. The current indicators are structured differently in the two prototypes, as explained in the following sections.
2.4. Details on the health data sources

The health data sources used in this project were the following:

- individual data on new cancer cases (incidence): the Quebec tumors file (‘Fichier des tumeurs du Québec’), Quebec Ministry of Health and Social Services;
- individual death data: the deaths file (‘Fichier des décès’), Quebec Ministry of Health and Social Services;
- individual hospitalization data: the Med-Echo registry (‘Registre Med-Écho’), Quebec Ministry of Health and Social Services.

For each case (incidence, death or hospitalization), the data collected at the time of the event were: the diagnosis or the cause of death (according to the International Classification of Diseases, Ninth Revision), the sex, the age, the event date, the municipal code and the postal code of the individual’s principal residence. The postal code was used to assign the correct CLSC code. This was done according to the territorial divisions (the M-22 system) effective 31 March 1999, as recommended by the Quebec Ministry of Health and Social Services. Due to the confidentiality of individual health data, these are currently only available to the public health director for the region under his responsibility. For this project, an agreement has
been made with the Quebec regional board, and only the postal codes for the Quebec region residents were available.

The population data per year, per sex, per 5-year age group and per community health center were obtained from the Quebec Ministry of Health and Social Services [14]. The standardized rates were calculated according to the direct standardization method. The weights used in the standardization process were calculated from the 1991 population data (men and women combined) for the whole province of Quebec. The comparative figure corresponds to the ratio of the standardized rate for a particular territory to the standardized rate for the province.
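The direct standardization and the comparative figure described above can be illustrated with a short sketch. The age groups, case counts and populations below are made-up values chosen for readability, not ICEM-SE data, and the function names are ours; the sketch only illustrates the arithmetic under those assumptions.

```python
def standardized_rate(cases, population, weights):
    """Directly standardized rate: age-specific rates weighted by a standard population."""
    total_weight = sum(weights.values())
    weighted = sum(weights[g] * cases[g] / population[g] for g in weights)
    return weighted / total_weight


def comparative_figure(territory_rate, province_rate):
    """Ratio of a territory's standardized rate to the provincial standardized rate."""
    return territory_rate / province_rate


# Hypothetical standard population (stands in for the 1991 Quebec population,
# men and women combined); three broad age groups instead of 5-year groups.
weights = {"0-14": 200, "15-64": 700, "65+": 100}

# Hypothetical counts for one community health center and for the province.
clsc_cases = {"0-14": 2, "15-64": 7, "65+": 3}
clsc_pop = {"0-14": 1000, "15-64": 7000, "65+": 1000}
prov_cases = {"0-14": 100, "15-64": 700, "65+": 200}
prov_pop = {"0-14": 100000, "15-64": 700000, "65+": 100000}

clsc_rate = standardized_rate(clsc_cases, clsc_pop, weights)
prov_rate = standardized_rate(prov_cases, prov_pop, weights)
figure = comparative_figure(clsc_rate, prov_rate)
print(f"{clsc_rate:.4f} {prov_rate:.4f} {figure:.2f}")
```

By construction, a comparative figure above 1 means that the territory's standardized rate is higher than the provincial one.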
2.5. Entry-level ICEM-SE prototype

First, to illustrate the operation of the entry-level prototype, a simple analysis is presented: which regional health authorities in the province of Quebec had a high comparative hospitalization figure from asthma in 1998? For the Quebec region (the region of Quebec City) in particular, what was the comparative hospitalization figure of each community health center? Did the comparative figures differ according to the sex of the affected persons? Conducting this analysis with a GIS would require several lines of SQL querying (if the user masters SQL and the database structure) or complex manipulation (if the GIS provides a graphical user interface). In addition, the response times could vary from several seconds to minutes. With the ICEM-SE entry-level SOLAP prototype, on the other hand, the user rapidly executes the following tasks with a few mouse clicks:

1) The user clicks on the desired information elements (called dimension members and measures in the multidimensional vocabulary), for example ‘Asthma’/‘Hospitalizations’/‘Comparative figure’/‘Regional health authorities’/‘1998’ in the selection trees of the navigation panel (see Fig. 1) and, after clicking on the appropriate button, the prototype always displays the corresponding thematic map within 10 s.
2) To have more details about the Quebec region, the user clicks on this region directly on the map and then executes a drill-down operation (by clicking on the appropriate button of the user interface).
3) To see the comparative figures for men and women, the user first selects ‘Women’ in the
Fig. 1 Interface of the ICEM-SE entry-level prototype application.

Fig. 2 Map of the comparative hospitalization figure from asthma, for the different regions of the province of Quebec, in 1998, for all the population.

Fig. 3 Map of the comparative hospitalization figure from asthma, in 1998, for the community health centers corresponding to the Quebec region, for all the population.

Fig. 4 Map of the comparative hospitalization figure from asthma, in 1998, for the community health centers corresponding to the Quebec region, for women.

Fig. 5 Map of the comparative hospitalization figure from asthma, in 1998, for the community health centers corresponding to the Quebec region, for men.

Fig. 6 Bar chart of the comparative hospitalization figure from asthma, in 1998, for the community health centers corresponding to the Quebec region, for men.
‘Population’ selection tree and displays the ‘Women’ map by clicking on the ‘Map’ button. Then, the user selects ‘Men’ in the ‘Population’ selection tree and displays a second map by clicking on the ‘Map’ button.
4) The user then changes the type of display to a bar chart by clicking on the ‘bar chart’ button of the interface.

This sequence of tasks is illustrated in Figs. 1-6. Note that this example does not take into account the level of statistical significance (this is the default option). However, it is possible to display the results using the 1 or 5% levels of statistical significance of the comparative hospitalization figure. This new type of interface is very easy to use and fast enough to support decision-making. It allows fast and easy navigation within the health and environmental data at whatever level of aggregation they are. The displays are drawn and updated very quickly (e.g. 3-10 s on a notebook PC).

The same analysis, conducted with a traditional GIS and transactional database, would have required the following steps (assuming that the health statistics are already calculated, inside a DBMS or a statistical package, for example).

To create the first thematic map (regional health authorities):

- to select the appropriate geographic layer;
- to select the type of thematic map (range map);
- to select the field to map;
- to create the ranges of the map;
- to modify the styles of the ranges.
To create the second thematic map (community health centers):

- to repeat all the same steps required to create the first map.

To create the third thematic map:

- to modify the field to map;
- to modify the ranges of the map;
- to modify the styles of the ranges.

To create the fourth thematic map:

- to modify the field to map;
- to modify the ranges of the map;
- to modify the styles of the ranges.

To create the bar chart:

- to select the data to graph;
- to select the type of graph;
- to modify the different parameters of the graph according to the preferences.

These steps may each require several mouse clicks and they are not straightforward for non-GIS specialists (e.g. doctors, epidemiologists). In addition, creating the different displays takes time, which does not help the user to maintain a train of thought when analyzing the data or when trying to find correlations and trends.

The entry-level prototype has been developed using a so-called ‘star data structure’ with Microsoft Access and the SoftMap cartographic engine (a cartographic visualization software from SoftMap Technologies Inc.) in a custom Visual Basic application. It aims to satisfy the needs of typical users (about 90% of the public health professional staff) and remains a low-cost solution. Currently, base data are imported from different government sources. Statistical data and their aggregates are calculated by specialists using the SAS statistical package before being integrated in the SOLAP database. Users can access these prepared data as they are, without any customization except for on-the-fly reclassifying. Nevertheless, users have the freedom to choose the desired analysis elements (called dimension members and measures in the multidimensional vocabulary), the display types and the graphical semiology. Different types of choropleth maps and point maps are possible on top of raster 1:50 000, 1:250 000 and 1:8 000 000 topographic base maps (satellite imagery is also available). Navigation is possible within the navigation menus and the thematic maps, but not in the statistical charts and tables. With this prototype, doctors, epidemiologists and other health professionals can now produce, within seconds, several hundreds of thousands of maps without even touching their keyboard. This entry-level prototype presently works on a local workstation.
In the next version, however, it will be extended for use on the Internet by replacing the SoftMap GIS engine with JMap (a Java-based web mapping solution that supports groupware, from Kheops Technologies Inc.).
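As an illustration of the ‘star data structure’ mentioned above, the following minimal sketch builds a small star schema with Python's built-in sqlite3 module and rolls a measure up from community health centers to regions. The table names, column names and counts are hypothetical and do not reproduce the ICEM-SE Access database. The sketch aggregates a simple case count because counts are additive; rates and comparative figures are not, which is consistent with their being calculated beforehand for each level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table surrounded by dimension tables (hypothetical schema).
conn.executescript("""
    CREATE TABLE dim_territory (territory_id INTEGER PRIMARY KEY, clsc TEXT, rss TEXT);
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_population (population_id INTEGER PRIMARY KEY, sex TEXT);
    CREATE TABLE fact_hospitalization (
        territory_id INTEGER REFERENCES dim_territory(territory_id),
        time_id INTEGER REFERENCES dim_time(time_id),
        population_id INTEGER REFERENCES dim_population(population_id),
        cases INTEGER
    );
""")

conn.executemany("INSERT INTO dim_territory VALUES (?, ?, ?)", [
    (1, "CLSC A", "Quebec region"),
    (2, "CLSC B", "Quebec region"),
    (3, "CLSC C", "Montreal region"),
])
conn.execute("INSERT INTO dim_time VALUES (1, 1998)")
conn.executemany("INSERT INTO dim_population VALUES (?, ?)", [(1, "Women"), (2, "Men")])

# Made-up asthma hospitalization counts: (territory, time, population, cases).
conn.executemany("INSERT INTO fact_hospitalization VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 12), (1, 1, 2, 9),
    (2, 1, 1, 5), (2, 1, 2, 7),
    (3, 1, 1, 20), (3, 1, 2, 18),
])

# Roll-up from the CLSC level to the regional (RSS) level for 1998.
rows = conn.execute("""
    SELECT t.rss, SUM(f.cases)
    FROM fact_hospitalization AS f
    JOIN dim_territory AS t ON t.territory_id = f.territory_id
    JOIN dim_time AS d ON d.time_id = f.time_id
    WHERE d.year = 1998
    GROUP BY t.rss
    ORDER BY t.rss
""").fetchall()
print(rows)
```

Storing both hierarchy levels (`clsc` and `rss`) in a single denormalized dimension table is what lets the roll-up be expressed as a simple GROUP BY.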
2.6. High-end ICEM-SE prototype

Below is another simple analysis to illustrate the operation of the high-end prototype. For example, the user is searching for possible causes of the high comparative hospitalization figure from asthma for the period covering 1994-1998. With the ICEM-SE high-end SOLAP prototype, the user rapidly executes the following tasks:
Fig. 7 Interface of the ICEM-SE high-end application prototype.

Fig. 8 Comparative hospitalization figure from asthma for the different regions of the province of Quebec, for 1994-1998, for all the population.

Fig. 9 Average SO2 concentration for 24 h periods (default), for the different regions of the province of Quebec, for 1994-1998.

Fig. 10 Average SO2 concentration for hour periods, for the different regions of the province of Quebec, for 1994-1998.

Fig. 11 Average O3 concentration for hour periods, for the different regions of the province of Quebec, for 1994-1998.

Fig. 12 Average O3 concentration for hour periods, for the different sampling stations of the Montreal region, for 1994-1998.
1) The user clicks on the desired dimension members and measures, for example ‘Hospitalizations from respiratory diseases’/‘Asthma’/‘Comparative figure’/‘Regional health authorities’/‘1994-1998’ in the selection trees of the navigation panel. The displays are updated automatically. The user finds that high comparative hospitalization figures are mostly located in the regions surrounding Montreal.
2) The user wants to look at the air quality monitoring results for these regions and selects ‘Air quality monitoring’/‘SO2’/‘Average concentration’/‘Regional health authorities’/‘1994-1998’ in the selection trees. The results for the 24-h periods (the default) do not seem to be related to the high comparative figures.
3) The user selects the ‘Hour’ period in the ‘Periods’ selection tree. The new results do not seem to be related to the asthma problems either.
4) The user selects ‘O3’ in the ‘Contaminants’ selection tree. The results seem more interesting than the SO2 results. More investigation could then be undertaken to verify whether there is a correlation between the high comparative hospitalization figures from asthma and the high average O3 concentrations.
5) To have more details about the average O3 concentration at the different sampling sites of the Montreal region, which has the highest average concentration among all the regions of the province of Quebec, the user executes a drill-down operation directly in the table (in the Montreal region cell).

This sequence of tasks is illustrated in Figs. 7-12. This example shows that a SOLAP-based interface allows users to concentrate on their analysis needs rather than on how to use the software or how to formulate queries.

This second prototype uses Microsoft SQL Server, Microsoft Analysis Services (Microsoft’s OLAP server), ProClarity (an OLAP client from ProClarity Inc.) and KMapX (a MapInfo MapX-based plug-in allowing basic cartographic visualization and manipulation of the geospatial data, also from ProClarity Inc.). It is developed using HyperText Markup Language (HTML) and VBScript and is accessible via the Internet for the clients that have installed the ProClarity plug-in. The cartographic component of this high-end prototype will also, in the short term, be replaced by JMap, as the latter is a more flexible cartographic visualization and manipulation engine with groupware functions, interoperability capabilities and a very efficient vector-based applet.

This high-end prototype is also very easy to use and very fast over the web. It aims at satisfying the needs of technically advanced users (about 10% of the public health professional staff). It is more flexible than the first prototype and allows users to create their own dimension members and measures from the data that are stored in the databases. The different health and environment indicators are here structured in multidimensional data cubes. Users have the freedom to select the desired dimension members, existing or new measures, the display types and the graphical semiology. Navigation is possible via the navigation menus and via all the display types. Navigation via the legend is also possible. This high-end prototype is intended for Internet use.
Table 1 Characteristics of OLTP and OLAP systems

| OLTP | OLAP |
|---|---|
| Original source | Copy or read-only data |
| Detailed data | Detailed and aggregated data |
| Current data | Historical and current data |
| Priority to data security and integrity | Priority to data exploration and analysis |
| Normalized data structure (no, or low, data redundancy) | Denormalized data structure (redundancy encouraged if it increases query performance) |
| Continually updated | No update, periodical addition of new data only |
| Query tool dependent on the data structure (a user must know the data structure to query it efficiently) | No query tool, the user interacts directly with the data |
| Non-aggregative queries (little data per transaction, mostly update operations) | Aggregative queries (lots of data per transaction, analysis operations) |
| Concepts: table, column, tuple, key | Concepts: dimension, member, measure, fact, cube |
3. The spatial on-line analytical processing (SOLAP) concept and its comparison with GIS

OLAP was first defined as ‘(. . .) the name given to the dynamic enterprise analysis required to create, manipulate, animate and synthesize information from exegetical, contemplative and formulaic data analysis models. This includes the ability to discern new or unanticipated relationships between variables, the ability to identify the parameters necessary to handle large amounts of data, to create an unlimited number of dimensions, and to specify cross-dimensional conditions and expressions’ [15]. The reader is referred to [15] for a detailed description of each data analysis model. Caron [16] proposed another OLAP definition: ‘A software category intended for the rapid exploration and analysis of data based on a multidimensional approach with several aggregation levels’. We must add to this latter definition the fact that the exploration and analysis of data is usually driven by the user with OLAP technology while it is usually automated with data mining technology (and the boundary between the two tends to blur over time).

OLAP technology relies on the multidimensional database approach, which introduces concepts that differ from the concepts found in the transactional database approach typical of GIS applications. These multidimensional concepts include: dimensions, members, measures, granularity, facts and data cubes. The dimensions represent the analysis themes, or the analysis axes (e.g. ‘time’, ‘cancer’, ‘territorial subdivisions’). A dimension contains members (e.g. ‘1998’, ‘stomach cancer’, ‘Quebec region’) that are organized hierarchically into levels of granularity (e.g. ‘province’, ‘regional health authorities’, ‘local health authorities’). The members of one level (e.g. months) can be aggregated to form the members of the next higher level (e.g. years).
The dimensions can be of different types: temporal, spatial (non-cartographic in the case of a conventional OLAP tool) and descriptive (or thematic). The measures (e.g. standardized rate) are the numerical attributes analyzed against the different dimensions. A measure can then be considered as the dependent variable while dimensions are the independent variables (e.g. the measure ‘standardized rate’ depends on the members of the ‘cancer’, ‘time’, ‘population’ and ‘territorial subdivisions’ dimensions). The different combinations of dimension members and measures represent facts (e.g. the standardized rate of death from stomach cancer
Fig. 13 Differences between typical GIS and OLAP applications with regard to three axes of requirements for spatial decision-support. After [20].
for the year 1998, for women and for the Quebec region is 4.079). A data cube is a set of measures aggregated according to a set of dimensions [17]. Inside a data cube, the possible aggregations of measures on all the possible combinations of dimension members (the facts) can be pre-computed to increase query performance. Several data cubes can be built from the same sources of data, as they are ‘read-only’ datasets (e.g. several SOLAP applications could import their data from the same GIS). Table 1 presents the differences between a transactional database (also called ‘On-Line Transaction Processing’ (OLTP)) in a relational server (typical of GIS applications) and an analytical database in a multidimensional server (typical of decision-support systems built with OLAP).

The general OLAP architecture comprises three components: the multidimensionally structured database, the OLAP server and the OLAP client that accesses the database via the OLAP server. The OLAP client allows the end user to visualize the data using different types of diagrams (e.g. bar charts and pie charts) and tables. It also allows the user to explore and analyze the data using different operators such as drill-down (show a more detailed level inside a dimension), roll-up (show a more general level inside a dimension), drill-across (show another theme at the same level of detail) and swap (interchange visible dimensions in the chart or table). Such a system is built especially to navigate within the data cube, i.e. to go from one fact to another in a simple manner and to obtain fast responses. It is commonly found in the literature, as our prototypes have also shown, that the multidimensional approach of analysis is more in agreement with the end user’s mental model of the data than the traditional transactional approach [18]. The
92
Y. Be ´dard et al.
Fig. 14 Current architecture of the ICEM-SE prototypes.
interface of a tool exploiting the multidimensional paradigm, such as OLAP, provides unique capabilities to explore data in an intuitive and interactive way (similar to web hyperlinks). The user can perform simple to complex analyses mostly by clicking on data that are organized in a meaningful way [19]. Such ease and speed are two essential conditions for an analyst to maintain a train of thought when exploring or validating hypotheses. Health users already report that OLAP provides timely information and assistance in decision-making, program evaluation, and analysis [1]. This will likely prove even more evident with SOLAP.
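The pre-computation of aggregates and the navigation operators described in this section can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each dimension has a single hierarchy level plus a hypothetical top-level member `ALL`, the measure is an additive case count, and the values are made up. It does not describe how any particular OLAP server stores its cubes.

```python
from collections import defaultdict
from itertools import product

ALL = "All"  # hypothetical top-level member present in every dimension


def build_cube(facts):
    """Pre-compute the measure for every combination of members and ALL."""
    cube = defaultdict(int)
    for members, value in facts.items():
        # Each fact contributes to its own cell and to every cell in which
        # one or more of its members is replaced by ALL.
        for key in product(*[(member, ALL) for member in members]):
            cube[key] += value
    return dict(cube)


# Detailed facts keyed by (disease, year, population); made-up case counts.
facts = {
    ("Asthma", 1997, "Women"): 4,
    ("Asthma", 1997, "Men"): 6,
    ("Asthma", 1998, "Women"): 5,
    ("Asthma", 1998, "Men"): 7,
    ("Bronchitis", 1998, "Women"): 3,
}
cube = build_cube(facts)

# Once the cube is built, each navigation step is a dictionary lookup.
start = cube[("Asthma", 1998, ALL)]                # asthma, 1998, whole population
rolled_up = cube[("Asthma", ALL, ALL)]             # roll-up on the time dimension
drilled_down = {sex: cube[("Asthma", 1998, sex)]   # drill-down on population
                for sex in ("Women", "Men")}
drilled_across = cube[("Bronchitis", 1998, ALL)]   # drill-across to another theme
print(start, rolled_up, drilled_down, drilled_across)
```

The trade-off is the one listed in Table 1: the cube stores redundant, denormalized aggregates so that moving from one fact to another requires no query formulation and no recomputation.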
GIS are known to be not well adapted to decision-making: their complex query interfaces and spatial operators are not intuitive for non-specialists (e.g. doctors, epidemiologists), they do not support aggregated data well, and processing times may be very long for the complex queries that are typical of strategic decision-making. However, they are very useful for the visualization and manipulation of cartographic data. Since data visualization facilitates the extraction of insight from the complexity of the spatio-temporal phenomena and processes being analyzed, some authors claim that GIS are decision-support tools. Nevertheless, one needs to fully harness the power of multi-scale maps with multi-level, multi-theme, multi-epoch data to reach a better understanding of the structure and relationships contained within the datasets. GIS alone cannot do it in a fast and intuitive manner; one needs SOLAP capabilities on top of, or in tandem with, GIS.

Fig. 15 Future architecture of the ICEM-SE applications.

In the context of GKD, maps and graphics do more than make data visible; they are active instruments in the end user’s thinking process [20] and as such must support spatial navigation operators like spatial drill-down and spatial roll-up as well as thematic operators. Such data manipulation allows access to the intelligence contained in the data. Fig. 13 shows the characteristics of typical SOLAP applications compared with the characteristics of typical GIS applications. Without spatial navigation operators and map visualization, conventional OLAP possesses only a limited potential to support GKD [16]. Commercial systems integrating OLAP with spatial display functionalities have recently appeared on the market, but they have many limitations. The ideal SOLAP tool must offer a level of flexibility that is not currently available to meet multidimensional spatio-temporal analysis needs [21].
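As an illustration of the spatial drill-down and spatial roll-up operators mentioned above, the following Python sketch navigates a small spatial hierarchy. The unit names, the parent links, the case counts and the function names are all hypothetical; a real SOLAP tool would also redraw the corresponding map layer, which is omitted here.

```python
# Hypothetical spatial hierarchy: each unit points to its parent unit.
PARENT = {
    "CLSC A": "Region 03",
    "CLSC B": "Region 03",
    "CLSC C": "Region 06",
    "Region 03": "Province",
    "Region 06": "Province",
}

# Hypothetical case counts at the most detailed spatial level.
CASES = {"CLSC A": 4, "CLSC B": 7, "CLSC C": 9}


def children(unit):
    """List the units located directly below `unit` in the hierarchy."""
    return sorted(child for child, parent in PARENT.items() if parent == unit)


def total(unit):
    """Aggregate the measure for a unit from its descendants."""
    if unit in CASES:
        return CASES[unit]
    return sum(total(child) for child in children(unit))


def spatial_drill_down(unit):
    """Return the layer one level below `unit`: child units and their values."""
    return {child: total(child) for child in children(unit)}


def spatial_roll_up(unit):
    """Return the parent unit of `unit` and its aggregated value."""
    parent = PARENT[unit]
    return parent, total(parent)
```

For instance, `spatial_drill_down("Province")` yields the values of the two regions, and `spatial_roll_up("CLSC A")` returns the enclosing region with its aggregated count. In this sketch the aggregates are recomputed on each call for brevity; a cube would normally hold them pre-computed.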
4. Discussion

Fig. 14 shows the current architecture of both prototypes, which are based on different technologies. For the entry-level prototype, the source data are imported into SAS, where the statistical data are calculated. These statistical data are then integrated into a Microsoft Access database and accessed by the prototype user interface. For the high-end prototype, the source data are imported directly from the sources into a temporary data warehouse stored in Microsoft SQL Server. Then, multidimensional data cubes are built, in Microsoft Analysis Services, from the data stored in the warehouse. These data cubes are accessed by the prototype user interface.

A future version of the system, to be implemented at the Quebec Ministry of Health and Social Services in 2002, will be based on the architecture presented in Fig. 15. For the high-end application, the source data are imported into a temporary data warehouse stored in Microsoft SQL Server. Then, multidimensional data cubes are built, in Microsoft Analysis Services, from the data stored in the warehouse. These data cubes are accessed by the application. For the entry-level application, the source data are imported directly from the multidimensional data cubes into the relational database. This database is accessed by the application. Both applications will use JMAP as the mapping and spatial navigation engine.

This approach provides a built-in quality control mechanism in the sense that methodological and organizational decisions are made only once, in the central unit in charge of the system, by specialists in epidemiology, statistics, computer science, geomatics and confidentiality protection. Crucial concerns such as data validation, choice of appropriate statistical tests and measures, statistical stability of data, restriction of access to personal data, warnings and other similar topics can be addressed in a uniform and state-of-the-art manner, before wider dissemination and use of the data for everyday interventions. This can best be done in the administrative unit responsible for data quality and confidentiality. In North America, this is likely to be the provincial or state public health agency and, at a more aggregate level, a federal agency such as Health Canada or the Centers for Disease Control and Prevention (CDC). One may also find enough expertise in large metropolitan areas (or even small areas for some categories of data) to apply such a quality-control approach, and it then becomes a matter of internal administrative agreements within an organization.

Very little training is necessary to use the above applications. Our tests with end users have shown that less than an hour of training is sufficient to use the software. The first results have been very well received by future users in Public Health and also by users in other fields of application where similar systems are being developed with new research challenges.
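The data flow planned for the future entry-level application in this section, in which values computed once in the central data cubes are copied into a relational database that the client only reads, might be sketched as follows. This is an assumption-laden illustration: SQLite stands in for the relational database, and the table name `indicator`, the cube cells and the functions `export_cube` and `lookup` are hypothetical.

```python
import sqlite3

# Hypothetical pre-aggregated cube cells: (year, region) -> cases.
# "ALL" marks a dimension rolled up to its most general level.
CUBE_CELLS = {
    (1998, "Region 03"): 11,
    (1998, "Region 06"): 9,
    (1998, "ALL"): 20,
}


def export_cube(cells, connection):
    """Copy cube cells into a flat relational table, one row per cell."""
    connection.execute(
        "CREATE TABLE indicator (year INTEGER, region TEXT, cases INTEGER)"
    )
    connection.executemany(
        "INSERT INTO indicator VALUES (?, ?, ?)",
        [(year, region, cases) for (year, region), cases in cells.items()],
    )
    connection.commit()


def lookup(connection, year, region):
    """Entry-level client query: read a stored value, never recompute it."""
    row = connection.execute(
        "SELECT cases FROM indicator WHERE year = ? AND region = ?",
        (year, region),
    ).fetchone()
    return None if row is None else row[0]


# In-memory database used as a stand-in for the entry-level relational store.
conn = sqlite3.connect(":memory:")
export_cube(CUBE_CELLS, conn)
```

Because the client only looks up values that were validated and aggregated centrally, both applications present the same figures, which is consistent with the quality control argument made in this section.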
5. Conclusion

This paper presented a tool that has been developed to support GKD in the field of environmental health. This tool enhances GIS software with Spatial OLAP (SOLAP) capabilities to better support decision-making for health users who need to analyze interconnections between risk factors, clusters, interventions and outcomes. It also better supports health users who need to rapidly discover or eliminate potential relations between health problems and environmental factors, to better target intervention efforts or the distribution of medical resources. These are only a few applications in Public Health that can benefit from a technology providing fast access to detailed and aggregated data, on maps, tables or charts, and providing database navigation capabilities without the need to learn a query language. Such a SOLAP application:

• aims at supporting, transparently, the way public health specialists think and analyze;
• allows them to focus on the results of the navigation rather than on the analysis process itself (i.e. focus on ‘what to obtain’ rather than on ‘how to obtain it’);
• is used without knowing any query language;
• provides practically instantaneous response times (the optimal response time for spatio-temporal exploration and analysis being less than 10 s [22]).

The two prototypes developed during this project have achieved these objectives. They have been described and an example analysis has been presented for both. The first results have been very well received by future users in Public Health. The final development and implementation of both prototypes for the Quebec Ministry of Health and Social Services should be completed by the beginning of 2003.
Acknowledgements

This research was carried out with the financial support of the GEOIDE Network of Centers of Excellence projects SOC#1 (Cartographic Interface for the Multidimensional Exploration of Environmental Health Indicators) and DEC#2 (Designing the Technological Foundations of Geospatial Decision-Making with the World Wide Web), the Quebec Ministry of Health and Social Services, and the Natural Sciences and Engineering Research Council of Canada individual research grant program.
References

[1] P. Gosselin, Y. Bédard, M. Jerrett, S.J. Elliott, R. Catelan, P. Poitras, A. Gingras, GIS and OLAP in Health Surveillance: Needs Analysis for Successful Integration, Report presented to Health Protection Branch, Health Canada, 2000.
[2] D. Bélanger, P. Gosselin, G. Lebel, Bilan et perspectives en matière de surveillance en protection de la santé publique, Rapport déposé au ministère de la Santé et des Services sociaux du Québec, réalisé avec la collaboration de l’INSPQ et du Centre de recherche du CHUQ, 2002.
[3] W.J. Frawley, G. Piatetsky-Shapiro, C.J. Matheus, Knowledge discovery in databases: an overview, in: G. Piatetsky-Shapiro, W.J. Frawley (Eds.), Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, 1991.
[4] H.J. Miller, J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor & Francis, London, 2001.
[5] Y. Bédard, T. Merrett, J. Han, Fundamentals of spatial data warehousing for geographic knowledge discovery, in: H. Miller, J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor & Francis, London, 2001.
[6] Y. Bédard, Spatial OLAP, Vidéoconférence, 2ème Forum annuel sur la R-D, Géomatique VI: Un monde accessible, Montréal, Canada, November, 1997.
[7] J. Han, Conference Tutorial Notes: Spatial Data Mining and Spatial Data Warehousing, Paper presented at the Fifth International Symposium on Spatial Databases (SSD’97), Berlin, Germany, 1997.
[8] N. Stefanovic, Design and Implementation of On-Line Analytical Processing (OLAP) of Spatial Data, M.Sc. Thesis, Simon Fraser University, Vancouver, Canada, 1997.
[9] M.L. Gonzales, Spatial OLAP: Conquering Geography, DB2 Magazine, Retrieved 23 November, 1999, from http://www.db2mag.com/db_area/archives/1999/q1/99sp_gonz.shtml, Spring, 1999.
[10] K. Koperski, J. Adhikary, J. Han, Spatial Data Mining: Progress and Challenges, Paper presented at the SIGMOD’96 Workshop on Research Issues on Data Mining and Knowledge Discovery, Montréal, Canada, June, 1996.
[11] M. Ester, H.P. Kriegel, J. Sander, Spatial data mining: a database approach, in: M. Scholl, A. Voisard (Eds.), Advances in Spatial Databases, Springer, Berlin, 1997, pp. 47–66.
[12] The GEOIDE Network of Centers of Excellence: Geomatics for Informed Decisions, Retrieved September 17, 2002 from http://www.geoide.ulaval.ca.
[13] Y. Bédard, M. Nadeau, M.J. Proulx, A New Tool for User-driven Geographic Knowledge Discovery with Application to Environment Health Indicators, Paper presented at the General Annual Meeting of the NCE GEOIDE, Fredericton, Canada, June 2001.
[14] G. Pelletier, La population du Québec par territoire de CLSC, de DSC et de RSS, pour la période 1981 à 2016, Rapport préparé par le Ministère de la Santé et des Services sociaux, 1996.
[15] E.F. Codd, S.B. Codd, C.T. Salley, Providing OLAP (On-Line Analytical Processing) to User-Analysts: an IT Mandate, Hyperion White Paper, 1993.
[16] P.-Y. Caron, Étude du potentiel OLAP pour supporter l’analyse spatio-temporelle, Mémoire de maîtrise, Université Laval, Sainte-Foy, Canada, 1998.
[17] E. Thomsen, G. Spofford, D. Chase, Microsoft OLAP Solutions, Wiley, New York, 1999.
[18] OLAP Council, OLAP and OLAP Server Definitions, Retrieved October 10, 1999 from http://www.olapcouncil.org/research/glossaryly.htm, 1995.
[19] P. Youngworth, OLAP Spells Success For Users and Developers, Data Based Advisor, December, 1995, pp. 38–49.
[20] A.M. MacEachren, M.-J. Kraak, Research challenges in geovisualization, Cartography and Geographic Information Science 28 (1) (2001) 3–12.
[21] S. Rivest, Y. Bédard, P. Marchand, Towards better support for spatial decision-making: defining the characteristics of Spatial On-Line Analytical Processing (SOLAP), Geomatica, Journal of the Canadian Institute of Geomatics 55 (4) (2001) 539–555.
[22] P. Marchand, Y. Bédard, G. Edwards, A hypercube-based method for spatio-temporal exploration and analysis, GeoInformatica, July, 2002, in press.