Computer Science • 13 (1) 2012
http://dx.doi.org/10.7494/csci.2012.13.1.5
Ondrej Habala ´ Ladislav Hluchy Viet Tran Peter Krammer ˇ Martin Seleng
USING ADVANCED DATA MINING AND INTEGRATION IN ENVIRONMENTAL PREDICTION SCENARIOS
Abstract
Keywords
We present one of the meteorological and hydrological experiments performed in the FP7 project ADMIRE. It serves as an experimental platform for hydrologists, and we have used it also as a testing platform for a suite of advanced data integration and data mining (DMI) tools, developed within ADMIRE. The idea of ADMIRE is to develop an advanced DMI platform accessible even to users who are not familiar with data mining techniques. To this end, we have designed a novel DMI architecture, supported by a set of software tools, managed by DMI process descriptions written in a specialized high-level DMI language called DISPEL, and controlled via several different user interfaces, each performing a different set of tasks and targeting different user group.
data mining, data integration, meteorology, hydrology
5
6
ˇ Ondrej Habala, Ladislav Hluch´ y, Viet Tran, Peter Krammer, Martin Seleng
1. Introduction Modern society with ever-evolving methods of fast transportation, expanding urban centers and increasing population density in previously unpopulated rural areas requires always accurate meteorological predictions not only of general weather conditions, but also of various significant meteorological phenomena [1]. For some of these, there are no accurate physical models, or if they are available, their customization to a particular target area is unfeasible because of its complexity and often missing prerequisites, like past observations or a detailed topological map. To overcome these difficulties, we have performed several experiments by applying data mining techniques to a set of carefully chosen meteorological and hydrological scenarios. While data mining has been used in meteorology for a long time, the scenarios we have chosen have not been previously covered, especially not in the target area we have chosen. They have been designed and evaluated by domain experts, and their design was driven by the current needs of these experts and their employers. In their design we have also used our previous experience in applying information technologies to environmental predictions [6]. These experiments are part of the FP7 project ADMIRE1 , and additionally to serve as an experimental platform for meteorologists and hydrologists, we have used them as a testing platform for a suite of advanced data integration and data mining (DMI) tools, developed within this project. The idea of the project ADMIRE is to develop an advanced DMI platform accessible even to users who are not familiar with data mining techniques. To this end, we have designed a novel DMI architecture [5], supported by a set of software tools, managed by DMI process descriptions written in a specialized high-level DMI language called DISPEL [2], and controlled via several different user interfaces, each performing a different set of tasks and targeting different user group. In this paper we present the results of the project ADMIRE from the point of view of our environmental pilot application. We describe the methods ADMIRE uses to integrate geographically distributed data sets, stream them through a series of filters and processing elements using the OGSADAI platform [9], and deliver the results to the end users who have requested them. The project has successfully finished with a final review in July 2011, and the final platform allows for the easy development of complex DMI scenarios using an existing library of processing elements.
2. Data Intensive Processes in the Environmental Domain Environmental risk management research is an established part of the Earth sciences domain, already known for using powerful computational resources to model physical phenomena in the atmosphere, oceans and rivers [15]. In this paper we show how state-of-the-art tools for data-intensive processes can be applied to the benefit of 1 ADMIRE
– Architectures for Data Intensive Research. http://www.admire-project.eu/
Using advanced data mining and integration (...)
7
meteorological and hydrological experts. We illustrate the possibilities on a simple scenario from the hydro-meteorological domain. The environmental domain makes extensive demands of data management as well as data processing. A significant portion of data input to weather prediction models (both physical and statistical) comes from observations. Only a fraction of these observations can be made remotely (satellite and radar observations are examples of remote weather sensing), and the bulk of the input data comes from local observations of air temperature, humidity, pressure, precipitation, and other parameters. Local observations tend to produce local data sources, and a large number of local observations leads to a large number of local data sources, and a significant level of distribution of the data. Even in a context where there is a national weather management authority, several geographically dispersed data sources are needed for any integrated analysis. For example, in Slovakia we might consider these data sources: • meteorological observation database owned by the Slovak Hydrometeorological Institute (SHMI)2 ; • weather radar observations conducted by the Slovak Hydrometeorological Institute, and stored in a separate data store; • hydrological observations conducted by branches of Slovak Water Enterprise (SWE)3 , and stored locally at the respective branches; • waterworks manipulation schedules, local to the management centre of the respective waterworks (there are four such centres); • local observations by specialized personnel at airports and other installations, where weather is a major operational concern. This list is not complete and the number of sources may be much larger. A traditional approach to data management in this environment is to establish a list of necessary data sources, perform negotiations with their owners, acquire the data (usually in a form of a static database image), assess its quality, prepare it, and then feed it to the model. This process is cumbersome, can take months and is not suited for day-to-day weather management operations. A different approach is to use modern methods of distributed data management: establish a Quality of Service agreement, an on-line data integration process including all necessary data preparation and filtering, and make the process as automated as possible. The main difference is in treating the input data not as a suite of static data sets but rather as a group of converging data streams. Weather does not cease to exist at the moment a snapshot ends. While this approach is still a novelty for domain experts trained in a different context and accustomed to data transfer channels with much smaller bandwidths, it has been recognized as the future of environmental data distribution, as can be seen in the pan-European INSPIRE Directive [4], man2 Slovensk´ y hydrometeorologick´ yu ´stav, Jes´ eniova 17, 833 15 Bratislava, Slovakia, http://www. shmu.sk 3 Slovensk´ y vodohospod´ arsky podnik, ˇst´ atny podnik, Radniˇ cn´ e n´ amestie 8, 969 55 Bansk´ a ˇ Stiavnica, Slovakia, http://www.svp.sk
8
ˇ Ondrej Habala, Ladislav Hluch´ y, Viet Tran, Peter Krammer, Martin Seleng
Table 1 Nature, current size and the approximate rates of increase of the various data sources used in the environmental risk management scenarios. While training the various predictive models makes use of historical snapshots, the live system is designed to work with incoming streams of data from the various sources, some of which change slowly but significantly over the course of a year Data set NWP data Synoptic stations Rainfall measurement Radar imagery Waterworks Hydrology stations †
Source SHMI SHMI SHMI SHMI SWE SHMI
Nature Simulation Sensor Manual Sensor Manual Sensor
Size (MB) arbitrary† 50 100 10,000 300 300
MB/year c. 20,000 1 gribFileSelector.expression;
The operators |- -| denote the beginning and end of a stream of literals – in this case just one literal, the SQL query used to select the proper GRIB file. The operator => then creates a connection of this stream in an input connector (called expression) of the processing element gribFileSelector. This way we feed a stream into a processing element (PE), and we can also connect an output connector of one PE to the input of another PE. So these two operators allow us to create a graph in which the data is streamed and processed by various PEs. For a more thorough explanation of the language please consult the User Manual [11]4 .
Figure 4. A sample of DISPEL code of the ORAVA scenario
This novel approach allows us easily to extend the hydro-meteorological infrastructure to new data providers, by deploying a gateway at the site of the new provider, 4 http://www.admire-project.eu/docs/DISPEL-manual.pdf
14
ˇ Ondrej Habala, Ladislav Hluch´ y, Viet Tran, Peter Krammer, Martin Seleng
and registering it with the other gateways. Then, when a data analysis expert creates a DISPEL document that makes use of one of the capabilities provided by this gateway, it can be accessed and integrated automatically into the overall knowledge discovery workflow. This model provides a clear separation of responsibilities between data-intensive engineers, data analysis experts, and the domain experts of the application. The underlying infrastructure and gateway network is managed by the data-intensive engineers. The data-analysis experts use DISPEL to create full knowledge discovery workflows which utilize the infrastructure without needing to understand it. In turn these workflows are used by the domain experts via specialized domain-specific portals. This approach also provides for a reasonable amount of fault tolerance. If one data centre becomes unavailable, it may conceivably be replaced transparently by a different one, without the final users of our product ever knowing it happened. Some centres and gateways are, of course, irreplaceable in a given network (primary data storage centres for instance), but data filters may be deployed at several locations to enhance redundancy. There may also be several HPC facilities available to a given user, so the temporary inaccessibility of one of them is no issue – the DISPEL description of the required data-oriented solution is entirely agnostic of such things.
5. Conclusion We have presented an approach to hydro-meteorological predictions, which utilizes state-of-the-art data integration and data mining technology developed in the EU FP7 project ADMIRE. The technology allows us to make the whole DMI process much more flexible, and especially the DISPEL language developed in the project makes the experience much more pleasant even for novices to data mining. Once the DMI process is described in a DISPEL program, it can be reused even when the underlying middleware and hardware configuration changes significantly. It is based on the popular framework OGSA-DAI, which enjoys good support by its authors and an extensive user base. We continue to use the results of ADMIRE even after the project has finished. The concept has proven its usefulness and we are already working on porting other application scenarios to the ADMIRE framework, as can be seen for example in [1]. Also there are other works which already utilize ADMIRE technology for different application domains [14, 12, 10].
Acknowledgements This work is supported by projects ADMIRE FP7-215024, APVV DO7RP-0006-08, DMM VMSP-P-0048-09, Project ITMS: 26240220029, ITMS: 26240120029, VEGA No. 2/0054/12.
Using advanced data mining and integration (...)
15
References [1] Bartok J., Habala O., Bednar P., Gazak M., Hluchy L.: Data Mining and Integration for Predicting Significant Meteorological Phenomena. [in:] Proc. of ICCS 2010 — International Conference on Computational Science, vol. 1 of Procedia Computer Science, pp. 37–46. Elsevier Science BV, 2010. [2] Brezany P., Aranda C. B., Corcho O., Janciak I., Woehrer A., Atkinson M.: ADMIRE – Report Defining the Final Iteration of the Model and Language. Deliverable report D1.9, The ADMIRE Project, May 2011. [3] Ciglan M., Habala O., Tran V., Hluchy L., Kremler M., Gera M.: Application of ADMIRE Data Mining and Integration Technologies in Environmental Scenarios. [in:] R. Wyrzykowski, J. Dongarra, K. Karczewski and J. Wasniewski (Eds.), Parallel Processing and Applied Mathematics, Part II, volume 6068 of Lecture Notes in Computer Science, pp. 165–173. Springer-Verlag Berlin, 2010. [4] EU Parliament: Directive 2007/2/EC of the European Parliament and of the Council of 14 March 2007 establishing an Infrastructure for Spatial Information in the European Community (INSPIRE). Official Journal of the European Union, 50(L108), April 2007. [5] Galea M., Atkinson M., Liew C. S., Martin P.: emphADMIRE – Final Report on the ADMIRE Architecture, May 2011. [6] Habala O., Maliˇska M., Hluch´ y L.: Service-based flood forecasting simulation cascade in k-wf grid. [in:] M. Bubak, S. Unger (Eds.), Cracow 06 Grid Workshop : K-Wf Grid, pp. 138–145. Academic Computer Centre CYFRONET AGH, 2007. [7] Hluchy L., Habala O., Tran V., Ciglan M.: Hydro-meteorological scenarios using advanced data mining and integration. [in:] Fuzzy Systems and Knowledge Discovery, 2009. FSKD ’09. Sixth International Conference on, volume 7, pp. 260–264, aug. 2009. ˇ [8] Hluch´ y L., Seleng M., Habala O., Krammer P.: Mining Environmental Data in Hydrological Scenarios. [in:] M. Li, Q. Liang, L. Wang, Y. Song (Eds.), Seventh International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2010, 10-12 August 2010, Yantai, Shandong, China, pp. 1988–2992. IEEE, 2010. [9] Jackson M. J., Antonioletti M., Dobrzelecki B., Hong N. C.: Distributed data management with ogsadai. [in:] S. Fiore, G. Aloisio (Eds.), Grid and Cloud Database Management, pp. 63–86. Springer Berlin Heidelberg, 2011. [10] Jarka M., Podraza R.: Architecture of distributed system for challenging data mining tasks. [in:] D. Ryzko, H. Rybinski, P. Gawrysiak, M. Kryszkiewicz (Eds.), Emerging Intelligent Technologies in Industry, volume 369 of Studies in Computational Intelligence, pp. 197–206. Springer Berlin / Heidelberg, 2011. [11] Martin P., Yaikhom G., et al.: Dispel: Data-intensive systems process engineering language, user manual. Technical report, August 2011. [12] Spinuso A., Trani L., Atkinson M., Galea M.: Infrastructure for data-intensive seismology: Cross-correlation of distributed seismic traces through the admire
16
ˇ Ondrej Habala, Ladislav Hluch´ y, Viet Tran, Peter Krammer, Martin Seleng
framework Geophysical Research Abstracts, 13, 2011. [13] Stackpole J.: The WMO format for the storage of weather product information and the exchange of weather product messages in gridded binary form. U.S. Dept. of Commerce, National Oceanic and Atmospheric Administration, National Weather Service, National Meteorological Center, 1994. [14] Trani L., Spinuso A., Galea M., Main I.: A novel automated approach to ambient noise data processing using the admire framework. Geophysical Research Abstracts, 13, 2011. [15] Yokokawa M., Itakura K., Uno A., Ishihara T., Kaneda Y.: 16.4-tflops direct numerical simulation of turbulence by a fourier spectral method on the earth simulator. SC Conference, 0:50, 2002.
Affiliations Ondrej Habala Institute of Informatics of the Slovak Academy of Sciences, Bratislava, Slovakia,
[email protected] Ladislav Hluch´ y Institute of Informatics of the Slovak Academy of Sciences, Bratislava, Slovakia Viet Tran Institute of Informatics of the Slovak Academy of Sciences, Bratislava, Slovakia Peter Krammer Institute of Informatics of the Slovak Academy of Sciences, Bratislava, Slovakia ˇ Martin Seleng Institute of Informatics of the Slovak Academy of Sciences, Bratislava, Slovakia
Received: 6.12.2011 Revised: 17.01.2012 Accepted: 30.01.2012