2.2.89 ORCHESTRATION SERVICES FOR CHEMICAL WEATHER FORECASTING MODELS IN THE FRAME OF THE PESCADO PROJECT

V. Epitropou (1), K. Karatzas (1), A. Karppinen (2), J. Kukkonen (2), and A. Bassoukos (1)
(1) Aristotle University, Dept. of Mech. Eng., Informatics Systems and Applications Group, Thessaloniki, Greece; (2) Finnish Meteorological Institute, Helsinki
Presenting author email: [email protected]

ABSTRACT

The PESCaDO project aims at providing personalized, environmentally derived information to European citizens. It accepts plain, human-language queries and uses them to infer the appropriate environmental, ontological, quality-of-life and spatial-temporal context on which to focus. This context is in turn used to provide information about pollutant concentrations and the health risks associated with the action inferred from the query, within the areas covered by the service's knowledge base. In order to provide this information, PESCaDO automatically orchestrates and combines several sources of environmental information, among which are Chemical Weather forecasts, which provide high-detail, high-volume pollutant concentration data over most of Europe. Harmonized and uniform access to such Chemical Weather (CW) data is achieved through PESCaDO's integration with the AirMerge CW image parsing engine, which also provides innovative services such as the ability to automatically generate ensemble forecasts and to automatically rank CW model providers by the reliability and accuracy of their results. In the current paper we report on the orchestration of services based on CW forecasting models, as achieved via the integration with AirMerge.

1. INTRODUCTION

In recent years, the emergence of social media and increased public awareness of the environmental factors that affect Quality of Life (QoL) have led to growing demand for personalized environmental information. Services providing such information can take into account projected activities (daily transportation, outdoor activities, traveling etc.) and their impact on health, in particular with respect to Air Quality (AQ), that is, the concentration and time evolution of air pollutants in a specified region (Karatzas, 2005; Karatzas and Kukkonen, 2009). In a strict technical context, these parameters are known as the Chemical Weather (CW), and the crossings of specific alert thresholds are referred to as “episodes”.

The scope of the PESCaDO project (Personalized Environmental Service Configuration and Delivery Orchestration; Wanner et al., 2011) potentially extends over the entire European continent, depending on the contents or implications of a user's query on air quality status. Providing an answer to a specific query therefore requires gathering data from a wide variety of sources (a process known as “orchestration”). PESCaDO includes functionality modules that perform the semantically separate tasks of discovering and retrieving data from external sources (the Data Retrieval Service), as well as a service responsible for using this data to assemble a response as relevant as possible to a particular query from all available heterogeneous data sources; this is called the “Fusion service” in PESCaDO's own terms. The nature, type and quality of these sources are a major consideration for the overall system's efficiency, as data sources may range from textual information harvested from arbitrary web pages to data downloaded from specific predetermined databases or repositories using appropriate connectors.
The degree of automation with which data sources can be gathered is generally inversely correlated with their relevance, reliability and accuracy, and both approaches have limits, advantages and disadvantages from a data fusion point of view. From this perspective, it is important to develop a service that allows PESCaDO to access and use CW forecasts in an orchestrated way, since they often represent the most information-rich and detailed form of disseminating pollutant concentration data to the public. This paper briefly presents AirMerge (Epitropou et al., 2010; Epitropou et al., 2011), a system developed to address a similar problem in the context of the European Chemical Weather Forecasting Portal (Karatzas et al., 2009; Kukkonen et al., 2009; Balk et al., 2011). It then discusses AirMerge's ongoing integration with PESCaDO's orchestration process, in particular with the Data Discovery and Fusion services, as well as the new types of services that it makes available to users.

2. OVERVIEW OF THE PESCADO ORCHESTRATION SERVICE

PESCaDO’s implementation of the concept of node orchestration is called the Fusion Service, and is tasked with using a pool of known data nodes. In the context of PESCaDO, a “data node” practically means any internet-based resource, such as a regular website or a dedicated data repository, whose contained knowledge can be extracted using the appropriate type of analysis (e.g. text mining and ontological analysis for textual information; optical character recognition (OCR) and reverse engineering for images such as charts and pseudocolor maps), and then indexed, categorized and synthesized in order to be used when assembling responses to personalized user queries. However, the majority of the data nodes considered are continuously operating websites in the field of weather, chemical and allergen forecasting, and the most common way of extracting information from them is automatic textual analysis. This approach has the advantage of being largely automated and achievable with textual analysis and screen scraping techniques; however, it is subject to the limitations imposed by the website’s structure and by the imprecise nature of the text content itself. Most importantly, the information contained in purely textual form is usually too generic and vague (e.g. pollution levels for an entire region may be summed up in a single phrase or number), while, at least in the field of CW forecasting, the preferred method of publishing model results is through pseudocolor image maps, each of which contains the equivalent of tens of thousands of data points with relative geographical precision. Integrating such information-rich data sources into PESCaDO is a challenge, since properly parsing them requires a different approach and different algorithmic techniques compared to text-based mining. A system for dealing specifically with this problem was designed (Epitropou et al., 2010), and is currently being integrated into PESCaDO.

3. OVERVIEW OF AIRMERGE

The AirMerge system is a new addition to PESCaDO, having been developed independently to address the separate (but related) problem of rendering heterogeneous chemical weather data accessible and intercomparable (Kukkonen et al., 2009; Epitropou et al., 2010), a procedure often described as harmonization. In the field of Chemical Weather Forecasting (CWF), it is common practice to disseminate the produced model results as pseudocolor images representing concentration maps with an arbitrary color scale. These images commonly have an explicit geographical registration, with spatial resolutions ranging from 1x1 km to 20x20 km and temporal resolutions ranging from one hour to an entire day (e.g., Kukkonen et al., 2012). The reported values are usually maximum or average air pollution concentration values for the selected integration time. A typical set of such CW models and the resulting images can be found in the European Open-access Chemical Weather Forecasting Portal (Balk et al., 2011), which has been developed in the frame of COST Action ES0602 (www.chemicalweather.eu) (Kukkonen et al., 2009). A detailed description of the physical and chemical basis of the CW models, their characteristics and differences can be found in Kukkonen et al. (2012). The principal problem with the web-based graphical form of dissemination is that, commonly, no direct access to the model’s numerical data is possible, so additional web applications cannot be built directly on the model results. Furthermore, the visualizations themselves do not follow a predefined norm, as they are produced by different modellers and providers. As a result, they are heterogeneous in presentation (geographic region, resolution, concentration/colour scale mapping etc.), which hinders any straightforward attempt at unified presentation, merging or data processing.
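The difficulty can be made concrete with a short sketch. Recovering numbers from such an image means inverting two provider-specific mappings: the colour scale (colour to concentration) and the geographical registration (coordinates to pixel). The Python below is illustrative only and is not AirMerge's actual algorithm; the names `MapSpec`, `color_to_value` and `lonlat_to_pixel`, the legend colours and values, and the assumption of a north-up equirectangular image are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapSpec:
    """Hypothetical per-provider description of one pseudocolor map."""
    legend: tuple   # ((r, g, b), concentration) pairs read off the colour bar
    west: float     # geographic bounds of the image, in degrees
    east: float
    south: float
    north: float
    width: int      # image size in pixels
    height: int


def color_to_value(rgb, legend):
    """Return the concentration of the legend colour nearest to `rgb`.

    Nearest-colour matching (squared RGB distance) is an assumed, simple way
    to tolerate mild compression noise; it is not taken from the paper.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(legend, key=lambda entry: dist2(entry[0], rgb))[1]


def lonlat_to_pixel(lon, lat, spec):
    """Map (lon, lat) to (col, row), assuming a north-up equirectangular image."""
    if not (spec.west <= lon <= spec.east and spec.south <= lat <= spec.north):
        raise ValueError("location outside the map's coverage")
    fx = (lon - spec.west) / (spec.east - spec.west)
    fy = (spec.north - lat) / (spec.north - spec.south)
    # int(x + 0.5) rounds half up for the non-negative values used here.
    return int(fx * (spec.width - 1) + 0.5), int(fy * (spec.height - 1) + 0.5)


# Made-up legend and bounds, purely for illustration.
spec = MapSpec(
    legend=(((0, 0, 255), 10.0), ((0, 255, 0), 40.0), ((255, 0, 0), 90.0)),
    west=-10.0, east=30.0, south=35.0, north=70.0, width=401, height=351,
)
pixel = lonlat_to_pixel(10.0, 52.5, spec)
value = color_to_value((10, 240, 20), spec.legend)  # a slightly noisy green
```

Every provider would need its own `MapSpec`, which is one way of seeing why heterogeneous presentation makes unified processing hard.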
Some of the images are also permanently altered with visible watermarks, compression artefacts, blurring, noise, symbols, text, lines etc. The AirMerge system overcomes many of these limitations by automatically harvesting large quantities of such image-based CW forecasts from different model providers, accounting for the technical differences between them, converting them to a common numerical data format, and storing them as separate, semantically indexed layers in an internal database (the AirMerge DB) for future processing. Such processing includes comparing predictions with past values and between models, as well as developing database-driven services. This procedure is transparent to the model providers themselves, as it does not require any action on their part other than the continued publishing of their models. Furthermore, AirMerge also collects real-time measurement data from a number of CW monitoring stations in Europe. Unlike CW image forecasts, which always represent larger-scale averages, these data are localized (point data) and highly non-uniform in distribution across the European continent, so their usefulness for querying pollutant concentrations at arbitrary locations is limited. However, in addition to model evaluation, they are useful for fine-tuning certain more advanced aspects of the orchestration services, explained later on.

4. CONNECTING AIRMERGE AND THE PESCADO FUSION SERVICE

From an architectural standpoint within PESCaDO, AirMerge is treated as a special component which performs its own data node discovery (its “nodes” are the models which are currently integrated into the system), so in that sense it does not require discovery by PESCaDO’s Node Discovery Service in the same manner as, e.g., newly crawled websites. Similarly, its backing database can be queried by the Fusion Service in order to provide direct numerical pollutant concentration data complete with timestamps and precise geographical coordinates. This represents a level of precision unattainable from text mining sources alone, and helps boost the accuracy of PESCaDO’s generated information considerably. However, in order to interface with the Fusion Service, and to allow itself to be queried regarding its contents by the Node Discovery Service without exposing the AirMerge DB directly, AirMerge has been provided with an external API (the AirMerge API). This is a RESTful web-based interface which allows performing basic queries such as fetching a list of available layers or pollutant types, or requesting point-based pollution concentration data. The use of a RESTful interface to interconnect modules providing different functionalities is a requirement within the PESCaDO project.

5. THE CW FORECAST ORCHESTRATION MODES OF AIRMERGE

The AirMerge DB can be used in several modes and coupled with different processing modules when responding to queries directed to the AirMerge API by the Fusion Service or another web application. Generally, in the context of PESCaDO, this is done once the user’s request has been narrowed down by the orchestration service to a precise location and time frame, as well as a particular pollutant. A particular combination of place, time and pollutant will henceforth be referred to as a “concern”. Once one or more such concerns are identified by PESCaDO’s front-end, the required information is extracted from the appropriate data nodes and “fused” together into a best-effort response to the user. AirMerge, in this context, can readily reply to queries about precise concerns involving CW data, provided they can be accommodated by its backing AirMerge DB. This is performed through a “best effort” strategy, according to which a query over a particular concern is honored if and only if a) at least one data layer or at least one monitoring station covers the required location for the requested pollutant and b) the requested timeframe is within that layer’s time coverage range or close enough to that station’s timestamp. If either condition fails, AirMerge returns a void query result. Based on the value of a pollutant’s concentration (if available), the Fusion service can determine whether, e.g., hazardous pollutant levels have been exceeded. Reporting on a particular concern is the simplest usage mode possible with AirMerge during the orchestration and fusion phases. Other modes include analyzing recent concentration trends (increasing, decreasing, steady, etc.) and taking into account the contribution of multiple models and the associated variance and uncertainty in possible results.
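The best-effort rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AirMerge API: the `Concern`, `Layer` and `Station` types, the `query` function, the numeric time axis, and the `max_deg` and `max_dt` tolerances for what counts as "covering" a location and being "close enough" in time are all hypothetical. Pooling station and model values into one mean is likewise a simplification made for the sketch.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Optional


@dataclass(frozen=True)
class Concern:
    """A combination of place, time and pollutant."""
    lat: float
    lon: float
    time: float          # assumed unit: hours on some common time axis
    pollutant: str


@dataclass(frozen=True)
class Layer:
    """One model's forecast layer with its spatial and temporal coverage."""
    model: str
    pollutant: str
    south: float
    north: float
    west: float
    east: float
    t_start: float
    t_end: float
    value_at: Callable[[float, float], float]  # concentration at (lat, lon)


@dataclass(frozen=True)
class Station:
    """A single point measurement from a monitoring station."""
    pollutant: str
    lat: float
    lon: float
    time: float
    value: float


def query(concern, layers, stations=(), max_deg=0.1, max_dt=1.0) -> Optional[dict]:
    """Return mean, spread and source count for a concern, or None if void."""
    values = []
    for layer in layers:
        covers_place = (layer.south <= concern.lat <= layer.north
                        and layer.west <= concern.lon <= layer.east)
        covers_time = layer.t_start <= concern.time <= layer.t_end
        if layer.pollutant == concern.pollutant and covers_place and covers_time:
            values.append(layer.value_at(concern.lat, concern.lon))
    for st in stations:
        near_place = (abs(st.lat - concern.lat) <= max_deg
                      and abs(st.lon - concern.lon) <= max_deg)
        near_time = abs(st.time - concern.time) <= max_dt
        if st.pollutant == concern.pollutant and near_place and near_time:
            values.append(st.value)
    if not values:
        return None  # void result: no layer or station satisfies both conditions
    return {"mean": mean(values), "spread": pstdev(values), "n": len(values)}


# Two made-up overlapping O3 layers with constant values, for illustration.
layers = [
    Layer("model_a", "O3", 35, 70, -10, 30, 0, 48, lambda lat, lon: 40.0),
    Layer("model_b", "O3", 55, 65, 10, 30, 0, 24, lambda lat, lon: 60.0),
]
result = query(Concern(60.17, 24.94, 12.0, "O3"), layers)
```

Returning the spread and the number of contributing sources alongside the mean gives a caller a rough indication of how much the available sources disagree for that concern.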
Because of estimation errors inherent in the models used to produce the CW forecasts, and because of possible geographical overlap between different models, a particular concern may be covered by more than one model. These models may also diverge considerably in their output values, usually enough to affect the risk assessment process. In that case, the AirMerge API allows specifying more advanced data retrieval modes that return not only a single concentration value for a particular area (e.g. the mean of the contributing models' values), but also a metric of the relative precision and reliability of this value, the number and type of models that contributed to the result, and their relative precision and relevance for this particular query. This information is reported back to the user as part of the extended information about the query. It is also fed back to AirMerge itself, where Machine Learning techniques (Epitropou et al., 2012) are used to create relevance maps and rank models by their accuracy in particular settings, thus fine-tuning future queries and results and improving AirMerge’s decision-making process in the case of ambiguous or missing data.

6. AIRMERGE: ENSEMBLE MODELING AND USE IN ORCHESTRATION

In order to reduce the inaccuracies and errors inherent in CW models, a custom form of ensemble modeling is employed to improve the accuracy of the results and to estimate the associated numerical ensemble averages. It should be noted that this type of uncertainty is qualitatively and quantitatively very different from the uncertainty introduced by text mining techniques on arbitrary environmental websites: the latter may end up generating ambiguous or spurious results, while AirMerge works on screened quantitative sources whose nature is well known, and all results are guaranteed to be numerical and relevant to a particular concern, as long as the data in the AirMerge DB supports them. According to the principles of ensemble modeling, and assuming independent model structures, a particular linear combination of several models yields a lower error (measured as the statistical variance versus a reference value) than any of the individual models that make up the ensemble (Potempski et al., 2009; see also the review by Kukkonen et al., 2012). Ensemble forecasting has been successfully applied in particular scenarios where different models were pre-harmonized by mutual agreement between model providers in order to minimize area, time step and spatial resolution differences (Galmarini et al., 2004). Since AirMerge has to operate without the benefit of such agreements, a “best effort” or “fallback” strategy is employed in this phase. The reference values against which to perform ensemble fine-tuning and comparisons may be obtained from real measurements for a region (if available in the AirMerge DB itself), or by assuming that a plain average of the available models is closer to the actual value than any single model, which is often an acceptable compromise, even if suboptimal (Potempski et al., 2009). Alternative ways of dealing with the paucity of fine-tuning reference values include time-series analysis, spatial-temporal interpolation, and analyzing the trend of previous ensemble weights (if available).

7. CONCLUSIONS

In this paper, an overview of the AirMerge system and its planned integration within the PESCaDO project was presented and discussed. The immediate benefit of bringing the two systems together will be to give PESCaDO access to a high-quality CW database, currently unobtainable by other means. The bridging requirements between the two platforms have stimulated the development of an external API for AirMerge and the inclusion of new functionalities that will better match the needs of PESCaDO’s orchestration and Fusion services, and will increase the usefulness of AirMerge itself and its availability to the CW modeling community.

8. ACKNOWLEDGEMENTS

The authors gratefully acknowledge the PESCaDO project (FP7-248594).

9. REFERENCES

Balk, T., Kukkonen, J., Karatzas, K., Bassoukos, A., and Epitropou, V., 2011. European Open Access Chemical Weather Forecasting Portal. Atmospheric Environment, 45(38), 6917-6922.
Epitropou, V., Karatzas, K., Bassoukos, A., Kukkonen, J., and Balk, T., 2011. A new environmental image processing method for chemical weather forecasts in Europe. In: P. Golinska, M. Fertsch and J. Marx-Gomez (eds.), Proceedings of the 5th International Symposium on Information Technologies in Environmental Engineering. Poznan: Springer Series: Environmental Science and Engineering, 781-791.
Epitropou, V., Karatzas, K., and Bassoukos, A., 2010. A method for the inverse reconstruction of environmental data applicable at the Chemical Weather portal. In: A. Car, G. Griesebner and J. Strobl (eds.), Geospatial Crossroads @ GI_Forum '10, 58-69, Wichmann Verlag, Berlin: ISBN 978-87907-496-9.
Epitropou, V., Karatzas, K., Kukkonen, J., and Vira, J., 2012. On the degradation of published Chemical Weather data after inverse image-based reconstruction. International Journal of Artificial Intelligence, accepted, under revision.
Galmarini, S. et al., 2004. ENSEMBLE dispersion forecasting, Part 1: Concept, approach and indicators. Atmospheric Environment, 38(28), 4607-4617.
Karatzas, K., 2005. A quality-of-urban-life ontology for human-centric, environmental information services. C21: Towntology, WG1: Ontologies and Information Systems, Brussels, 12-13 (http://www.towntology.net/Meetings/0512BXL/presentations/C21_towntology_karatzas_brussels.pdf, last accessed 27.01.2012).
Karatzas, K. and Kukkonen, J., 2009. COST Action ES0602: Quality of life information services towards a sustainable society for the atmospheric environment. ISBN: 978-960-6706-20-2, Thessaloniki: Sofia Publishers.
Kukkonen, J., Klein, T., Karatzas, K., Torseth, K., Fahre Vik, A., San Jose, R., Balk, T., and Sofiev, M., 2009. COST ES0602: Towards a European network on chemical weather forecasting and information systems. Advances in Science and Research Journal, 3, 27-33.
Kukkonen, J., Olsson, T., Schultz, D. M., Baklanov, A., Klein, T., Miranda, A. I., Monteiro, A., Hirtl, M., Tarvainen, V., Boy, M., Peuch, V.-H., Poupkou, A., Kioutsioukis, I., Finardi, S., Sofiev, M., Sokhi, R., Lehtinen, K. E. J., Karatzas, K., San José, R., Astitha, M., Kallos, G., Schaap, M., Reimer, E., Jakobs, H., and Eben, K., 2012. A review of operational, regional-scale, chemical weather forecasting models in Europe. Atmos. Chem. Phys., 12, 1-87, doi:10.5194/acp-12-12012
Potempski, S. and Galmarini, S., 2009. Est modus in rebus: analytical properties of multi-model ensembles. Atmos. Chem. Phys. Discuss., 9, 14263-14314, doi:10.5194/acpd-9-14263-2009
Wanner, L., Vrochidis, S., Tonelli, S., Mossgraber, J., Bosch, H., Karppinen, A., Myllynen, M., Rospocher, M., Bouayad-Agha, N., Bügel, U., Casamayor, G., Ertl, T., Kompatsiaris, I., Koskentalo, T., Mille, S., Moumtzidou, A., Pianta, E., Saggion, H., Serafini, L., and Tarvainen, V., 2011. Building an Environmental Information System for Personalized Content Delivery. Proceedings of the 9th IFIP WG 5.11 International Symposium on Environmental Software Systems Frameworks of eEnvironment (ISESS 2011), pp. 169-176, Springer.