2015 11th International Conference on Natural Computation (ICNC'15)

A Data Mining Approach to Study the Air Pollution Induced by Urban Phenomena and the Association With Respiratory Diseases

Fabio T. Souza
Postgraduate Program in Urban Management (PPGTU), Pontifical Catholic University of Paraná (PUCPR), Curitiba, Brazil

William S. Rabelo
Course of Civil Engineering, Pontifical Catholic University of Paraná (PUCPR), Curitiba, Brazil

Abstract— Over the last two decades there has been an intensive urbanization process, and in Brazil almost nine out of ten people now live in urban centers. The high growth rate of cities induces imbalances in weather patterns, with negative consequences for public health. This paper proposes a data mining approach to help the local urban management team better understand the consequences of air pollution and its effects on public health in three cities of the Metropolitan Region of Curitiba, Brazil. The knowledge gained in this study, which is still in the early phase of data collection, can contribute to the debate on social issues and public policies that can benefit Brazilian urban management.

Keywords: urban management; air pollution; respiratory diseases; data mining

I. INTRODUCTION

The integration of urban management and public health is necessary to find solutions for many problems of our society in dense urban environments. The air pollution caused by the emission of gases from industries and motor vehicles has a negative impact on respiratory health. Among the integration studies, [1] analyzed the association between built environment characteristics and morbidity related to chronic diseases. Reference [2] analyzed public transportation and its negative impacts on public health. Reference [3] analyzed the decision-making process of managers regarding health practices and disease prevention measures. Reference [4] identified an association between daily air pollution data from 1996 to 1998 and pneumonia and flu care for the elderly. Reference [5] identified qualitative relationships between the respiratory morbidity of the local child population and air pollutants in Curitiba. This article describes research in its initial phase, which focuses on developing a methodology to elucidate the relationship between spatial patterns of air quality in urban centers and their impacts on public health. The identification of spatial patterns may provide relevant information for urban management in different urban spots, i.e. allowing separate regional action plans.

978-1-4673-7678-5 ©2015 IEEE


II. METHODOLOGY

A. Knowledge Extraction

This research emphasizes the extraction of knowledge about the management of urban phenomena in the context of air pollution and respiratory diseases. Different sources of information must be explored: the technical literature, experts, and current and historical data from large urban centers.

B. Data Acquisition

The methodology is based on data analysis using computational tools from the data mining field and geographic information systems (GIS), as well as the inclusion of expert knowledge in computer models. The study explores data from three cities in the Metropolitan Region of Curitiba (MRC): Curitiba, Araucaria and Colombo. These cities were selected because of the strong inter-institutional relations with the government agencies acting in them (see acknowledgments). Araucaria was selected, in particular, for its concentration of industrial plants and the known respiratory problems caused by the emission of pollutants from its industrial park. Colombo was selected because of the industrialization process the city has undergone in the last two decades, which has made it a new industrial center in the MRC, and because of the significant concentration of low-income population within its territory.

The Environmental Institute of Paraná (IAP) monitors seven air quality parameters and publishes the daily measurements on its website: total suspended particulates (TSP); smoke; inhalable particles (IP or PM10); sulfur dioxide (SO2); carbon monoxide (CO); ozone (O3); and nitrogen dioxide (NO2). Table I illustrates the report of air quality measurements for May 2014 at the Ouvidor Pardinho station. In addition to the measured parameters, the IAP also provides an air quality index (AQI) that categorizes the air quality and its impact on human health, as described in Table II.

TABLE I. REPORT WITH AIR QUALITY DATA MEASUREMENTS IN MAY 2014.
Source: Environmental Institute of Paraná (IAP, 2014)

TABLE II. AIR QUALITY INDEX (AQI) & IMPACTS ON HEALTH.

Air Quality Index Levels of Health Concern | Numerical Value | Meaning
Good | 0 to 50 | Air quality is considered satisfactory, and air pollution poses little or no risk.
Moderate | 51 to 100 | Air quality is acceptable; however, for some pollutants there may be a moderate health concern for a very small number of people who are unusually sensitive to air pollution.
Unhealthy for Sensitive Groups | 101 to 150 | Members of sensitive groups may experience health effects. The general public is not likely to be affected.
Unhealthy | 151 to 200 | Everyone may begin to experience health effects; members of sensitive groups may experience more serious health effects.
Very Unhealthy | 201 to 300 | Health warnings of emergency conditions. The entire population is more likely to be affected.
Hazardous | 301 to 500 | Health alert: everyone may experience more serious health effects.

Source: Air Quality Index (AQI) Basics <http://airnow.gov/index.cfm?action=aqibasics.aqi>
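As an illustration of how the ranges in Table II could be applied to a daily AQI value, the following minimal Python sketch maps a number to its level of health concern. The function name `aqi_level` and the decision to assign a fractional value lying between two ranges to the higher range are assumptions made for this example; they are not part of the IAP procedure.

```python
from bisect import bisect_left

# Upper bounds and labels copied from Table II.
AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300, 500]
AQI_LEVELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_level(value: float) -> str:
    """Return the Table II level of health concern for an AQI value."""
    if value < 0 or value > 500:
        raise ValueError(f"AQI value outside the 0-500 range: {value}")
    # bisect_left finds the first range whose upper bound is >= value.
    return AQI_LEVELS[bisect_left(AQI_UPPER_BOUNDS, value)]


if __name__ == "__main__":
    for sample in (42, 100, 101, 250):
        print(sample, aqi_level(sample))
```

A lookup over sorted upper bounds keeps the class limits in a single place, so a different classification scheme could be substituted by editing the two lists only.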

The present study will also explore climate variables. Temperature and relative humidity are parameters related to air pollution levels and respiratory diseases [5]. Reference [6] describes mortality from air pollutants and analyzes many atmospheric parameters, such as rainfall, air pressure, temperature, humidity, wind and radiation. Reference [7] related the urbanization process in Campina Grande to the incidence of respiratory diseases in children and the elderly by analyzing the association with climatological variables, including temperature, relative humidity and rainfall in that city.

The main focus of this study is on hospitalization data for respiratory diseases from different government levels: federal (DATASUS), state (SESA) and the municipal departments of Curitiba, Araucaria and Colombo. Moreover, the characterization of the cities with respect to soil use and occupation will also be included in the dataset by computing the proximity of industrial parks and mining activities, and by discriminating spatial features of motor vehicles.

C. Data Preparation

This is the most important step of the whole process and the one that consumes most of the time [8]. In this step, all inconsistencies should be corrected, including the treatment of false zero values, the treatment of outliers and the replacement of missing values (when possible), among others. Once the data are consistent, they should be inserted into a Geographical Information System (GIS). A preliminary analysis of the data in a georeferenced platform gives a clear view of the spatial relationships among the variables involved, and maps can clarify some patterns of urban phenomena, which helps in the modeling strategy.

Reference [9] used geostatistical models to detect "hot spots", i.e. areas of high health risk caused by poor air quality and high temperatures, in Aachen, Germany. A similar study was conducted by [10], connecting exposure to air pollution with mortality and identifying areas of high and low exposure. Reference [11] applied a georeferenced model called ARMOS to simulate the concentration of pollutants in urban Hangzhou, China. Reference [12] implemented a georeferenced model to calculate and display the geographical distribution of carbon monoxide (CO) near a road construction region. Reference [13] successfully assessed the environmental impact caused by the construction of a highway in the vicinity of Madrid, Spain. Reference [14] simulated the exposure of school children to pollutants during the journey between their homes and schools. The contribution of GIS techniques is therefore considered relevant to this research.

D. Data Modeling

This research proposes two modeling stages: multivariate analysis and quantitative models. The multivariate analysis consists mainly of identifying qualitative patterns among the variables involved [15], [16]. A clear identification of the relationships among these variables helps in the construction of quantitative models. Some of the quantitative models are association rules. An association rule is a simple model that easily explains a cause (IF) and effect (THEN) relationship [17], [18]. These rules may have a predictive character or may be classification rules [19], [20].
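The cleaning operations listed under Data Preparation (false zeros, outliers, replacement of missing values when possible) can be sketched in Python as follows. This is only an illustrative sketch: the function name `clean_series`, the plausibility limits `lower` and `upper`, the `max_gap` threshold and the use of linear interpolation are all assumptions of this example, since the paper does not specify how each treatment is performed.

```python
from typing import List, Optional, Sequence


def clean_series(
    values: Sequence[Optional[float]],
    lower: float,
    upper: float,
    max_gap: int = 2,
) -> List[Optional[float]]:
    """Clean one daily pollutant series (hypothetical helper).

    Zeros and readings outside [lower, upper] are treated as missing;
    interior gaps of at most `max_gap` days are filled by linear
    interpolation, and longer or boundary gaps stay missing (None).
    """
    # Step 1: mark false zeros and out-of-range readings as missing.
    cleaned: List[Optional[float]] = [
        v if v is not None and v != 0 and lower <= v <= upper else None
        for v in values
    ]
    # Step 2: fill short interior gaps by linear interpolation.
    i, n = 0, len(cleaned)
    while i < n:
        if cleaned[i] is not None:
            i += 1
            continue
        start = i
        while i < n and cleaned[i] is None:
            i += 1
        gap = i - start
        if start > 0 and i < n and gap <= max_gap:
            left, right = cleaned[start - 1], cleaned[i]
            step = (right - left) / (gap + 1)
            for k in range(gap):
                cleaned[start + k] = left + step * (k + 1)
    return cleaned


if __name__ == "__main__":
    # Synthetic readings, not IAP data.
    raw = [10, 0, 14, None, None, None, 20, 999]
    print(clean_series(raw, lower=1, upper=500))
```

Leaving long gaps unfilled reflects the "when possible" caveat in the text: interpolating across many consecutive missing days would invent a pattern rather than recover one.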

III. PRELIMINARY DATA ANALYSES

This research, as mentioned before, is in the early stages of data acquisition and preparation. This article presents a small contribution on how to prepare the air quality data for the following stages of the research. The measured variables and the daily computed AQI values can be seen in Table I, which also reveals a sparse pattern in the pollutant data. To the lower right of the figure, a sketch of the region shows the station highlighted in red with the symbol PAR, referring to the Ouvidor PARdinho station. The first taxonomy, therefore, includes the air quality indicators measured in the cities of Curitiba, Araucaria and Colombo.

Table III lists the thirteen stations distributed across the three cities of the MRC, gives the start and end of each historical series, and shows the percentage of time spent in each of the five air quality (AQI) classes used by IAP (2014): GOOD, REGULAR, INADEQUATE, POOR and VERY POOR/SEVERE. The initial letters "C" and "A" in the first column of Table III refer to the cities of Curitiba and Araucaria respectively. Although the Ouvidor Pardinho and Santa Casa air quality stations (second and fourth rows of the table respectively) are separated by only a few hundred meters, Table III shows a significant discrepancy between them in the percentages of GOOD and REGULAR quality. Such a marked difference between two nearby stations prompts reflection on which urban elements may influence these fluctuations, e.g. the intensity of traffic flow in the two regions.

TABLE III. AIR QUALITY STATIONS IN MRC

An example of a classification rule to predict landslides during rainfall is given below [16]:

IF cumulative rain in the last 6 hours > 43.7 mm (A) THEN LANDSLIDE OCCURRENCE (B) (90.6%, 106/117)

Among the 117 records in which the 6-hour cumulative rain (h_6) exceeded 43.7 mm (A), 106 were associated with a landslide (B), which gives the rule a confidence of 106/117 = 90.6%. The rule is easy to understand and to use; for example, it could be used by the government to issue alerts during rainfall events. Another advantage of association and classification rules is that expert knowledge can be inserted into the models [21]. The study should interview experts in urban management, public health and climatology in order to discover unexpected rules with a high degree of "interest".
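To make the confidence figure of such a rule concrete, the short Python sketch below counts how many records satisfy the IF part (A) and how many of those also satisfy the THEN part (B). The helper name `rule_stats` is hypothetical, and the records are synthetic: they are constructed only so that the counts match the 117 and 106 quoted from [16], and the total of 1000 records is an arbitrary assumption.

```python
from typing import Callable, Iterable, Tuple

# (6-hour cumulative rain in mm, landslide observed)
Record = Tuple[float, bool]


def rule_stats(
    records: Iterable[Record],
    antecedent: Callable[[Record], bool],
    consequent: Callable[[Record], bool],
) -> Tuple[int, int, float, float]:
    """Return (covered, hits, support, confidence) for an IF-THEN rule."""
    rows = list(records)
    covered = [r for r in rows if antecedent(r)]   # records matching A
    hits = [r for r in covered if consequent(r)]   # records matching A and B
    support = len(hits) / len(rows) if rows else 0.0
    confidence = len(hits) / len(covered) if covered else 0.0
    return len(covered), len(hits), support, confidence


# Synthetic records reproducing only the counts of the quoted rule.
records = (
    [(50.0, True)] * 106      # heavy rain followed by a landslide
    + [(50.0, False)] * 11    # heavy rain without a landslide
    + [(10.0, False)] * 883   # light rain, no landslide (arbitrary count)
)

covered, hits, support, confidence = rule_stats(
    records,
    antecedent=lambda r: r[0] > 43.7,
    consequent=lambda r: r[1],
)

if __name__ == "__main__":
    print(covered, hits, round(100 * confidence, 1))
```

Confidence depends only on the records covered by the IF part, whereas support also depends on the size of the whole dataset, which is why the arbitrary total affects the latter but not the 90.6% figure.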

Source: Environmental Institute of Paraná (IAP, 2014)

In order to compare AQI patterns among the three cities, three stations from Table III were chosen for further analysis. Low levels of GOOD quality were the criterion for choosing the three stations that represent the three cities (rows in gray in Table III). Figure I shows a real-time bulletin containing the map of the stations installed in Curitiba and Araucaria, together with a report on the spatial-temporal dynamics of the AQI (IQA in Portuguese), pollutants, wind direction and wind speed.

Figure I. Real-time air quality bulletin (Boletim de Qualidade do Ar em tempo real). Source: Environmental Institute of Paraná (IAP, 2014)

In order to investigate possible patterns of association among the variables involved in the study, some preliminary data analyses were carried out.

IV. PARTIAL RESULTS

To understand the temporal dynamics of the phenomenon and to evaluate its seasonal variability, the permanence percentages (% of the time) were calculated for two categories (GOOD and REGULAR) in Curitiba and Araucaria, and additionally for the INADEQUATE category in Colombo, as shown in Figures II, III and IV respectively.

Figure II shows the class frequencies for the Curitiba sample, taken from the station located in the Industrial City (CIC). The worst conditions are observed in February and September: both months present higher values of Quality_REGULAR and lower values of Quality_GOOD. The GOOD and REGULAR curves, however, do not intersect.

Figure II. Permanence in Curitiba. Source: Authors

Figure III illustrates the percentages for the ASSIS station located in Araucaria. The scenario is worse than in Curitiba; the worst period is October, when the REGULAR class exceeds half of the time (> 50%), and the GOOD and REGULAR curves intersect once.

Figure III. Permanence in Araucária. Axes: Months vs. Permanence (%). Series: A_ASSIS_QUALITY_GOOD, A_ASSIS_QUALITY_REGULAR. Source: Authors

Figure IV describes the worst scenario of the three cities, for the station located in Colombo, where the GOOD and REGULAR curves cross three times. The worst condition is observed in April, when a significant share of INADEQUATE quality occurs. This condition corresponds to the orange row of Table II ("Unhealthy for Sensitive Groups"), in which "members of sensitive groups may experience health effects".

Figure IV. Permanence in Colombo. Axes: Months vs. Permanence (%). Series: Colombo_QUALITY_GOOD, Colombo_QUALITY_REGULAR, Colombo_QUALITY_INADEQUATE. Source: Authors

The graphs show the high complexity of the phenomenon and different seasonal patterns for the three monitoring stations. From the urban management perspective, Colombo is characterized by atomized industrialization, low paving coverage in the urban road system and intense construction activity.
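The permanence percentages plotted in Figures II to IV can be understood as the share of days in each month that fall in each quality class. A minimal Python sketch of that calculation is given below, assuming the input is a list of (date, class) pairs with one entry per day; the function name `permanence_by_month` and the sample values are hypothetical and are not IAP data. Note that the sketch pools the same calendar month across years, which is also an assumption.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def permanence_by_month(
    daily: Iterable[Tuple[date, str]],
) -> Dict[int, Dict[str, float]]:
    """Return {month: {quality class: % of days in that class}}."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for day, category in daily:
        counts[day.month][category] += 1
    result: Dict[int, Dict[str, float]] = {}
    for month, counter in sorted(counts.items()):
        total = sum(counter.values())
        result[month] = {c: 100.0 * n / total for c, n in counter.items()}
    return result


# Synthetic daily classifications for illustration only.
sample = [
    (date(2014, 1, 1), "GOOD"),
    (date(2014, 1, 2), "GOOD"),
    (date(2014, 1, 3), "GOOD"),
    (date(2014, 1, 4), "REGULAR"),
    (date(2014, 2, 1), "GOOD"),
    (date(2014, 2, 2), "REGULAR"),
]

if __name__ == "__main__":
    for month, shares in permanence_by_month(sample).items():
        print(month, shares)
```

Days with no classification are simply absent from the input, so each percentage is relative to the days actually measured in that month; for stations with many missing values, such as Colombo, this choice should be kept in mind when reading the curves.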

Some multivariate analysis tools were also applied to analyze the patterns of the Curitiba and Araucaria stations (Colombo has a very high proportion of missing values). Figures V´ and V´´ illustrate the coordinates of the principal components calculated for the cities of Curitiba and Araucaria respectively. The results show some qualitative association among the measured variables and different patterns for the two cities. Figures VI´ and VI´´ illustrate the dendrograms for the cities of Curitiba and Araucaria respectively.
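A dendrogram of variables is built by repeatedly merging the two closest groups and recording the distance at which each merge happens. The Python sketch below shows this agglomerative procedure on a few synthetic series. The paper does not state which distance or linkage method was used, so the Euclidean distance and single linkage adopted here, as well as the function names, are assumptions of this example; only the variable labels (PM10, NO2, TEMP) are taken from the figures.

```python
import math
from typing import Dict, List, Sequence, Tuple

Merge = Tuple[Tuple[str, ...], Tuple[str, ...], float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(series: Dict[str, Sequence[float]]) -> List[Merge]:
    """Agglomerative clustering of variables with single linkage.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    which is the information a dendrogram displays.
    """
    clusters: List[Tuple[str, ...]] = [(name,) for name in series]
    merges: List[Merge] = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(
                    euclidean(series[a], series[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Synthetic series for illustration only.
series = {
    "PM10": [1.0, 2.0, 3.0],
    "NO2": [1.0, 2.0, 4.0],
    "TEMP": [10.0, 9.0, 8.0],
}
merges = single_linkage(series)

if __name__ == "__main__":
    for left, right, distance in merges:
        print(left, right, round(distance, 2))
```

With raw Euclidean distances, variables measured on larger scales dominate the linkage distance, so standardizing each series beforehand is a common design choice when the variables have different units.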

Figure (V´) Principal Component Analysis for Curitiba (Curitiba_CIC). Axes: Fator 1 vs. Fator 2.

Figure (V´´) Principal Component Analysis for Araucária (Araucária - ASSIS). Axes: Fator 1 vs. Fator 2.

Figure (VI´) Dendrogram for Curitiba (Curitiba_CIC). Vertical axis: Linkage Distance. Leaves: UMID, IQA, O3, CO, SO2, PM10, NO2, TEMP.

Figure (VI´´) Dendrogram for Araucária (Araucária - ASSIS). Vertical axis: Linkage Distance.

Source of the Figures V and VI: Authors

Again, different patterns of relationship can be observed between the cities, indicating that the phenomena may be explained by different urban features of those cities or regions.

V. PARTIAL CONCLUSIONS

This article described a methodology for air quality data preparation. This taxonomy is crucial for the study, and its data must be made consistent. So far, data acquisition has considered taxonomies such as air quality, respiratory diseases, urban planning and climatology. However, due to the complexity of urban phenomena, it is beneficial to increase the number of taxonomies involved, for example: control of soil use and occupation; urban mobility; mining activities; and the pollution classification degree of industries and their peculiarities, among others.

In addition, in order to better explain the urban phenomena, the air quality data series should be extended to the yearly period (IAP technical reports have been available since 2000).

The results at the end of this research (started in November 2014) may be used by the local government for actions and interventions to minimize the risks of air pollution and to improve air quality and people's wellbeing in urban centers. The results are also expected to contribute to the planning of urban healthcare and to improved public policies.

ACKNOWLEDGMENT

Special thanks to CNPq for the financial support of this research and to the institutes that provided the data: Environmental Institute of Paraná (IAP); State Health Secretary (SESA); Institute of Urban Planning of Curitiba (IPPUC); Mary's Protection Center of Children and Teenagers (CEDIN).

REFERENCES

[1] Hino, A.A.F. et al. Built environment and physical activity for transportation in adults from Curitiba, Brazil. Journal of Urban Health, v. 17, p. 1-17, 2013.
[2] Mosquera, J. et al. Transport and health: a look at three Latin American cities. Cadernos de Saúde Pública (ENSP. Impresso), v. 29, p. 654-666, 2013.
[3] Brownson, R.C. et al. Understanding Administrative Evidence-Based Practices. American Journal of Preventive Medicine, v. 46, p. 49-57, 2014.
[4] Martins, L. C. et al. Poluição atmosférica e atendimentos por pneumonia e gripe em São Paulo, Brasil. Revista de Saúde Pública, v. 36, n. 1, p. 88-94, 2002.
[5] Bakonyi, S. M. C. et al. Poluição atmosférica e doenças respiratórias em crianças na cidade de Curitiba, PR. Rev Saúde Pública, v. 38, n. 5, p. 695-700, 2004. http://www.scielo.br/pdf/rsp/v38n5/21758.pdf
[6] Duchiade, M. P. Poluição do ar e doenças respiratórias: uma revisão. Cad Saúde Pública, v. 8, n. 3, p. 311-30, 1992.
[7] De Moura, M. A. C. A Urbanização em Campina Grande e suas relações com a incidência de doenças respiratórias no município e o clima local. UFCG. Programa de Pós-Graduação em Recursos Naturais. 2009.
[8] Pyle, Dorian. Data preparation for data mining. Morgan Kaufmann, 1999.
[9] Merbitz, H. et al. GIS-based identification of spatial variable enhancing heat and poor air quality in urban areas. Applied Geography 33. 2012. 94-106.
[10] Scoggins, A. et al. Spatial analysis of annual air pollution exposure and mortality. Science of the Total Environment, 321. 2004. 71–85.
[11] Zhang, Q. et al. GIS-based emission inventories of urban scale: A case study of Hangzhou, China. Atmospheric Environment 42. 2008. 5150–5165.
[12] Wang, X. Integrating GIS, simulation models, and visualization in traffic impact analysis. Computers, Environment and Urban Systems 29. 2005. 471–496.
[13] Moragues, A. & Alcaide, T. The use of a geographical information system to assess the effect of traffic pollution. The Science of the Total Environment. 189/190. 1996. 267-273.
[14] Gulliver, J. and Briggs, D. J. Time–space modeling of journey-time exposure to traffic-related air pollution using GIS. Environmental Research 97. 2005. 10–25.
[15] Souza, F. T., Koerner, T. C., & Chlad, R. A data-based model for predicting wildfires in Chapada das Mesas National Park in the State of Maranhão. Environmental Earth Sciences, 1-9, 2015: http://link.springer.com/article/10.1007/s12665-015-4421-8.
[16] Souza, F.T.; Ebecken, N.F.F. A data based model to predict landslide induced by Rainfall in Rio de Janeiro City. Geotechnical and Geological Engineering, v. 30, n. 1, p. 85-94, 2012: http://link.springer.com/article/10.1007/s10706-011-9451-8.
[17] Agrawal & Srikant, Fast algorithms for mining association rules. In: Proc. 20th int. conf. very large data bases, VLDB. 1994. p. 487-499.
[18] Agrawal, R.; Imielinski, T.; Swami, A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM, 1993. p. 207-216.
[19] Liu, B., Hsu, W., Chen, S., Ma, Y. Integrating Classification and Association Rule Mining, KDD-98, August, New York, 1998.
[20] Souza, F. T. A data-based model to locate mass movements triggered by seismic events in Sichuan, China. Environmental monitoring and assessment, v. 186, n. 1, p. 575-587, 2014: http://link.springer.com/article/10.1007/s10661-013-3400-3.
[21] Liu, B. et al. Analyzing the subjective interestingness of association rules. Intelligent Systems and their Applications, IEEE, v. 15, n. 5, p. 47-55, 2000.