Multimed Tools Appl DOI 10.1007/s11042-015-2604-7
Environmental data extraction from heatmaps using the AirMerge system Victor Epitropou 1 & Tassos Bassoukos 1 & Kostas Karatzas Ari Karppinen 2 & Leo Wanner 3,4 & Stefanos Vrochidis 5 & Ioannis Kompatsiaris 5 & Jaakko Kukkonen 2
Received: 18 August 2014 / Revised: 19 March 2015 / Accepted: 1 April 2015 © Springer Science+Business Media New York 2015
Abstract The AirMerge platform was designed and constructed to increase the availability and improve the interoperability of heatmap-based environmental data on the Internet. This platform allows data from multiple heterogeneous chemical weather data sources to be continuously collected and archived in a unified repository; all the data in this repository have a common data format and access scheme. In this paper, we address the technical structure and applicability of the AirMerge platform. The platform facilitates personalized information services, and can be used as an environmental information node for other web-based information systems. The results demonstrate the feasibility of this approach and its potential for being applied also in other areas, in which image-based environmental information retrieval will be needed.
Keywords Heatmaps · Data retrieval · Air quality · Image processing · Web services · GIS
* Victor Epitropou
1 Department of Mechanical Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
2 Finnish Meteorological Institute, Helsinki, Finland
3 Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
4 Catalan Institute for Research and Advanced Studies (ICREA), Barcelona, Catalonia, Spain
5 Centre for Research and Technology Hellas - Information Technologies Institute, Thessaloniki, Greece
1 Introduction
There are many types of environment-related images available on-line, broadly belonging to two categories: (a) captured images that are generated as the result of a monitoring activity (either in situ or remotely), and (b) synthetic (i.e., modeled) images that visualize the result of an environmental computation process. In the latter category, heatmaps may be considered a representative type of synthetic image, and are commonly produced with the aid of models. A special area of application for heatmap processing is air quality forecasting (AQF). At the regional and continental scales of AQF in particular, the term Chemical Weather (CW) is commonly used. In Chemical Weather Forecasting (CWF), the number of forecast providers has grown steadily, and some regions of Europe are covered by several forecasts at once [11, 16]. However, the advancements in the number and quality of AQF forecasts have not been matched by similar advancements in publishing those results on-line, or in making them available to added-value services [15] in interoperable forms.
On-line data publishing and dissemination is, in most cases, performed through simple heatmaps, while the numerical data used to construct the heatmaps is commonly either unavailable or severely restricted in access due to legal or technical constraints. There is no harmonization regarding the (on-line) publishing of heatmaps [19]; each CWF data provider has therefore adopted its own heatmap format and publishing parameters. A relatively recent solution for publishing maps, geographical coverages and associated data and metadata has materialized in the form of the various Open Geospatial Consortium (OGC) standards for the publishing and visualization of geospatial information.
However, the available implementations and their somewhat inflexible client-server tiered architecture, as well as limited support for time- and elevation-based data ordering [19], have limited their adoption. Map visualization (WMS) is the most popular of the OGC services, while the more complex and data-oriented ones (e.g., WFS, WCS, WPS) are lagging behind in terms of acceptance, both because of their relatively niche use (compared to map viewing) and, in the case of WPS, their relative difficulty of implementation. It can therefore be concluded that CWF data publishing via heatmaps is a field which has, so far, eluded web service integration and convergence of data interoperability, and thus does not provide fertile technical conditions for the development of personalized or other added-value services.
In this paper, an integrated platform named AirMerge is presented, which has been designed to address this problem. It uses heatmaps as the starting point for automatically collecting data from different CWF providers, converting them to a commonly interoperable format, storing them in and recalling them from a centralized database, and making them available to other value-added services through a common web API. Elements of AirMerge’s technology were first presented in [8], while ongoing developments, applications and extensions of the system have been presented in [6, 17]. Compared to existing systems with some similar functionality in the same AQF and CWF domain, like MyAir [1], AirMerge provides the unique functionality of image-based environmental information retrieval [7], accompanied by functions for image noise removal and feature extraction, map projection transformations, data fusion and mathematical analysis of extracted information, as well as external connectivity and component reuse in third-party projects such as PESCaDO [6]. AirMerge is thus an information hub for spatially defined environmental data, which provides a single access point to various CWF data. Although AirMerge is currently used only for chemical weather data, it can deal with any spatially distributed environmental information, provided that a suitable retrieval and parsing subsystem is implemented in the form of an extension or plug-in to the system.
Via AirMerge, numerical environmental data can be extracted from heatmaps and made available via a proper web-based API to be used by other services. Even though AirMerge’s focus is on heatmaps, other sources of information can be used as well, such as formatted text data, offline databases, data exchange files, remote data feeds and so on. These secondary data sources typically offer a much reduced areal coverage compared to heatmaps and require more interim interfacing to be integrated, but they may be more precise on a local scale (for example, if they are generated by real-time sensor readings) and can be used for fine-tuning or verification purposes.
The aims of this article are (i) to present the processing of CWF heatmaps in the AirMerge system in greater detail than previous publications, and (ii) to discuss AirMerge’s external connectivity and interaction with other systems. In addition, it is intended as a way to comprehensively present the totality of AirMerge’s components and sub-systems, which have so far only been presented separately in application-specific settings.
2 Materials and design requirements
2.1 Air quality heatmaps as input material
CWF models predict spatial and temporal concentration data [25]. Such data can be encoded as digital images in the form of visually interpretable heatmaps. Heatmaps are defined as 2D images with a discrete number of color levels representing different pollutant concentration value ranges over a geographical domain. They are accompanied by auxiliary information concerning the color scale used and their geographical context, as well as descriptions of the pollutants and units being used. The accompanying information may even contain secondary data elements such as line charts, auxiliary heatmaps or tables, which add further levels of complexity and information richness to CWF heatmaps.
A heatmap, by design, is intended as an intuitive, human-readable, one-way communication medium, conveying information to various groups of stakeholders. It is a one-way medium primarily because it is not normally meant to be exported, modified or processed by the intended target audience. Generally, heatmaps are not intended as a data interchange format. However, they are widely available, and they contain a large amount of usable data (relative to most other potential data sources available online under the same terms). As an intuitive example, a heatmap of a CWF representing a geographical area on a fixed grid, with grid dimensions of 300×200 elements and a grid resolution of 20×20 km, covers an area of 6000×4000 km (roughly a pan-European coverage), and contains 60,000 data points [16]. Considering that a forecast provider will typically offer at least 24-h coverage (same-day predictions), and that each data point evaluates to a real-valued number (after heatmap interpretation), it is obvious that the potential volume of recoverable environmental data is rather high. In addition, heatmaps carry their own geolocation information and have a large continuous coverage area, offering a combination of high area coverage, relatively high spatial resolution and a fair temporal resolution.
Similar amounts and typologies of data can be extracted from remote sensing imagery (RSI) [2, 26], but with the added complication of having to perform more advanced image processing and feature extraction before yielding usable results. Techniques relying on harvesting information from other sources like social media have to deal with high noise-to-signal content, low accuracy, and complications caused by bringing factors such as language semantics and ontologies into play [22]. In the case of heatmaps, the problem lies almost entirely within the signal processing domain, allowing for a more direct approach.
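The data volume in this example can be made explicit. The per-day figure below assumes one heatmap per forecast hour for a single pollutant, which is an illustrative assumption rather than a figure reported for a specific provider.

$$
N_{\text{grid}} = 300 \times 200 = 60{,}000, \qquad
300 \times 20\ \text{km} = 6000\ \text{km}, \qquad
200 \times 20\ \text{km} = 4000\ \text{km}
$$

$$
N_{\text{day}} = 24 \times N_{\text{grid}} = 1{,}440{,}000 \ \text{values per pollutant per day}
$$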
2.2 Availability and accessibility of CWF data
Initially, heatmaps were studied in terms of their availability, publishing patterns, data semantics and structural characteristics. Specifically, we addressed heatmaps that were within the European Open-Access Chemical Weather Portal [3, 14]. This preliminary analysis led to some important conclusions regarding the state of the art in CWF publishing. An important conclusion was that the existence of multiple CWF providers on the Internet can lead to the simultaneous existence of multiple contradictory forecasts, even if they refer to the same geographical areas, time spans and pollutant types [11, 13]. This means that the average user wishing to consult more than one CW forecast will have to evaluate their accuracy and reliability for themselves, by exploring different providers’ websites, and with very limited options when it comes to simultaneous display and comparison of different CWF sources.
To date, there have been only a few efforts between data providers to standardize the output format of their models and their spatial, temporal and qualitative coverages [10, 12]. In addition, access remains mostly non-numerical. The adoption of Open Geospatial Consortium standards for presenting maps and datasets, such as WMS and WPS, has also remained very limited in the domain of CWF. The MyAir project [9] does offer a direct data WCS (Web Coverage Service) access option [10], but this is the exception rather than the norm. This has resulted in the current situation, where a significant quantity of new CWF data is published on-line every day in Europe, but access to this data in its original resolution and precision is limited. In addition, there is no unified repository of regional CWF data focusing on the collection of recently published data.
2.3 Operational premises
The platform was created around the following premises, which are based on the extensive background research conducted (according to Subsection 2.2):
• New heatmaps representing CWF forecasts are published daily or at least fairly regularly with a predictable pattern and at predictable web URLs, to make them worthwhile harvesting, both for practical and for informational gain reasons.
• Heatmap formats and publishing patterns such as update hours, frequency, etc. may change without warning. It is the platform’s maintainers’ responsibility to keep up with them.
• It is possible to reconvert heatmaps (or, in general, any sufficiently processed remote sensing or synthetic image [2, 26]) back into numerical data, albeit with limitations, and it is possible to archive and post-process any data recovered by the reconversion process [7].
2.4 Heatmap characteristics
A typical example of a CWF heatmap image is presented in Fig. 1, with its characteristic elements highlighted. These characteristics and isolated elements are also coded in AirMerge’s parsing configuration with specific keywords, one associated with each characteristic. The parsing subsystem utilizes XML-based configuration files which contain descriptive fields for all of these elements. Some of those elements are fixed for a given provider’s heatmaps, while others indicate variable/mutable characteristics, such as the type of pollutant used. An example of such a configuration file is given in Fig. 2, which contains the instructions for parsing heatmaps with the structure of the one in Fig. 1.
2.5 Map region
Heatmaps contain at least one rasterized map region, indicated with the <region> tag in AirMerge’s XML parsing subsystem. Regions contain the color-coded data of interest, with a specific raster height and width. One pixel of this raster map corresponds to one data or grid point, though a single data point doesn’t necessarily uniquely map to a single geographic coordinate, due to map projection considerations. The map projection is described in the <projection> node.
2.6 Color legends and pollutants
Each CWF forecast provider usually publishes a series of pollutants using identical map layouts, but different color scales and value ranges. Hierarchically, color scales are considered a sub-feature associated with each pollutant. Color legends, color scales or simply “legends” are color look-up tables (LUTs) or palettes placed beside the map region, which indicate the relationships between the colors used in the heatmap and the numerical pollutant concentration value ranges that they represent. In the scripting subsystem, all of the pollutants offered by a CWF provider are grouped under the (unique) <pollutants> XML tag, while individual pollutants are found under the (multiple) <pollutant> XML tags, as shown in Fig. 3, which shows the description employed by AirMerge for parsing the CO pollutant from the heatmap in Fig. 1. The position of the color legend as well as the value ranges are entered manually, but the actual color sampling points are determined automatically. The order of color parsing is left-to-right for horizontal color scales, and bottom-to-top for vertical ones.
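The legend-as-LUT idea can be sketched as follows. This is a minimal illustration, not AirMerge’s code: the attribute names index, color, lowmark and highmark follow the legendrow entity of Fig. 9, while the helper names build_legend and value_range and all color and concentration values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegendRow:
    index: int       # zero-based position in the legend, in parsing order
    color: tuple     # (R, G, B), 8 bits per channel
    lowmark: float   # lower bound of the concentration range
    highmark: float  # upper bound of the concentration range


def build_legend(colors, bounds):
    """Pair k sampled colors with k value ranges given as k+1 ascending bounds."""
    if len(bounds) != len(colors) + 1:
        raise ValueError("expected exactly one more bound than colors")
    return [LegendRow(i, c, bounds[i], bounds[i + 1]) for i, c in enumerate(colors)]


def value_range(legend, index):
    """Return the (low, high) concentration range for a legend index."""
    row = legend[index]
    return row.lowmark, row.highmark


# Hypothetical three-step horizontal scale, sampled left-to-right.
legend = build_legend(
    colors=[(0, 0, 255), (0, 255, 0), (255, 0, 0)],
    bounds=[0.0, 50.0, 100.0, 200.0],
)
```

Storing only the legend index per pixel and resolving it to a range on demand keeps the raster compact and lets one legend be shared by many forecasts.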
[Fig. 1 labels: pixel origin; title text; auxiliary text; top-left boundary pixel; bottom-right boundary pixel; map region; non-data pixels; latitude references; longitude references; lon/lat origin; color-value ranges legend; secondary heatmaps]
Fig. 1 An example of a heatmap with its important areas and structures highlighted. This particular example also contains secondary heatmaps and text areas, which are normally not used
2.7 Determining map geometry and bounding box
AirMerge utilizes the following simple method for determining the relationship between image pixels and geographical coordinates: two points p1 = {x1, y1} and p2 = {x2, y2} are selected on the map region itself, and their geographical coordinates g1 = {λ1, φ1}, g2 = {λ2, φ2} are determined by using the (usually present) geographical grid of the image. The two selected points should be as far apart in longitude and latitude as the map allows, in order to minimize discretization and distortion errors. These points are specified in the <pixelpinning> and <geopinning> tags in AirMerge’s XML configuration scripts, as shown in Fig. 2. The pixel/longitude ratio rλ and the pixel/latitude ratio rφ can then be reconstructed with the following formulas:
[Fig. 2 listing; surviving values: FMI, http://silam.fmi.fi/AQ/operational/europe, equirectangular input, equirectangular output, …]
Fig. 2 Part of the XML-based configuration script used by AirMerge in order to parse heatmaps with a specific structure. In this case, instructions on how to isolate the map region from heatmaps of the type used in Fig. 1 are given
Fig. 3 Part of the XML-based configuration script used by AirMerge in order to parse the CO gas pollutant from heatmaps of the type used in Fig. 1
$$
r_\lambda = \frac{x_2 - x_1}{\lambda_2 - \lambda_1}, \qquad
r_\varphi = \frac{y_2 - y_1}{\varphi_2 - \varphi_1} \tag{1}
$$
The formulas can be applied verbatim under the following conditions:
• p1 is located South-West (SW) of p2.
• g1 and g2 are both located in the northern hemisphere (0°≤φ1,2≤90°).
• Distances are always computed as positive, moving from p1 to p2 in the North-East (NE) direction. If p2 is more than 180° East of p1, then the large arc is considered (0°≤λ1,2≤360°).
After these ratios have been computed, the geographical offsets of the map (position of the SW point) and its maximum geographical extension follow directly, and a complete, four-point geographical bounding box can be constructed. Selecting appropriate points is done manually by AirMerge’s operator, though it is possible to partially automate the process [18, 27]. If no usable geographical reference grid is supplied with the heatmap, it is still possible to use the known geographical coordinates of two easily identifiable landmarks.
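The step from the two pinned points to a bounding box can be sketched as below. This is a minimal sketch under stated assumptions, not AirMerge’s implementation: it assumes an equirectangular map region, a pixel origin at the top-left corner with y growing downward, and pixel distances taken as positive as required by the conditions above. The function name bounding_box and the sample numbers are hypothetical.

```python
def bounding_box(p1, p2, g1, g2, width, height):
    """Return (west, south, east, north) in degrees for a width x height map region.

    p1, p2: (x, y) pixel positions, origin at the top-left, y growing downward.
    g1, g2: matching (longitude, latitude) pairs in degrees, p1 south-west of p2.
    """
    (x1, y1), (x2, y2) = p1, p2
    (lon1, lat1), (lon2, lat2) = g1, g2
    # Eq. (1), with pixel distances taken as positive.
    r_lon = abs(x2 - x1) / (lon2 - lon1)  # pixels per degree of longitude
    r_lat = abs(y2 - y1) / (lat2 - lat1)  # pixels per degree of latitude
    west = lon1 - x1 / r_lon              # p1 lies x1 pixels east of the left edge
    east = west + width / r_lon
    north = lat1 + y1 / r_lat             # p1 lies y1 pixels south of the top edge
    south = north - height / r_lat
    return west, south, east, north


# Hypothetical 300 x 200 pixel map region pinned at two grid intersections.
box = bounding_box((50, 150), (250, 50), (0.0, 40.0), (40.0, 60.0), 300, 200)
```

With these sample values both ratios come out at 5 pixels per degree, so the 300×200 pixel region spans 60° of longitude and 40° of latitude.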
2.8 Non-data elements
Heatmaps almost always contain visual elements that do not represent pollutant concentrations, but instead mark boundaries, form longitude/latitude reference lines, signify land-water interfaces, indicate urban areas or simply represent geographical areas which, even if physically present on the map, are not part of the model’s output. Such areas are called “void areas”, and can for example be seen as the white, uncovered map areas in Fig. 4.
Fig. 4 An example of a heatmap with both numerous geomarkers and extended void regions
Some of those elements may be useful during configuration phases (for determining the bounding box and map projection parameters, for instance) but in general they are undesirable in processed data, and AirMerge uses several techniques to minimize their presence in the data it stores in its database. AirMerge automatically classifies any pixel with a color not among those described in the legend as a non-data element, and considers its position as noise in the data, forming a gap. Data gaps formed by boundary lines and geomarkers are usually thin (one or two pixels), and are dealt with by using simple interpolation and noise-removal algorithms [7], resulting in seamless, continuous images from which to recover data.
2.9 Handling of borderline cases
Ideally, heatmaps should contain only the colors appearing in their associated legends, and any extraneous color should clearly indicate a geomarker element to remove. Also, geomarkers and void/uncovered geographical areas in the map should use different colors than those used in areas containing valid data, and the heatmaps themselves should only be delivered in lossless image formats. However, in practice the following problems do arise:
• Though none of the CWF providers represented in AirMerge delivers its heatmaps in a lossy image format (e.g., JPEG), some heatmaps show signs of having been subjected to a lossy process at some point. This may create unwanted noise, visible patterns and color artifacts which, if treated indiscriminately as noise, would result in excessive data loss.
• Colors that differ slightly from those defined in the legend may appear in the map region, or there might be more shades and hues than those implied by the legend.
• The legends themselves may contain noise or off-key colors which differ slightly from those appearing in the map region, making their use as absolute color references problematic.
• Usually, geomarkers and void areas use different colors, and it is easy to distinguish between the two. However, sometimes the same color is used for both, making it impossible to distinguish them based on color/hue alone.
To counter these occurrences, the parsing subsystem has a built-in configurable tolerance factor when parsing colors. It also allows gap filling to be turned on and off, the colors to be treated as void to be specified, and a special gap filling mode to be used which takes into account the existence of ambiguous noise and void in the same heatmap. An example of how this subsystem is configured in AirMerge’s XML-based configuration script, in the <colorscalespecs> node, is given in Fig. 5.
2.10 The AirMerge system
AirMerge has been designed and implemented to process heatmaps by adopting a results-oriented approach. The AirMerge system was developed to include the following components, visually illustrated in Fig. 6:
[Fig. 5 listing; surviving values: rgb, true, 0, true]
Fig. 5 Part of the XML-based configuration script used by AirMerge in order to configure the color scale and map region color-based parsing subsystem
• A script-driven CWF heatmap fetching sub-system, which can be configured to fetch all heatmaps from a given set of CWF providers. This subsystem makes use of tags to describe all heatmap features that are required for fetching, processing and archiving.
• A scheduler subsystem, which initiates the fetching scripts daily at prefixed intervals and also handles networking errors such as connection failures, missing resources and script execution failures by the fetcher subsystem, notifying the system’s administrator in case non-automatic intervention is necessary.
• A heatmap-to-data conversion subsystem, which performs all the necessary image to data conversions, map projection transformations and image cleanup.
• A database back-end, which stores both the raw and processed data from the heatmap processing subsystem, according to a schema which allows searching and retrieval by several fine-grained criteria.
[Fig. 6 blocks: Scheduler; Config Script; Fetcher; Fetching from CWF providers; Heatmap Processing; Processed heatmaps; Database; Database connector; Visualization module; External API; CWF providers; Users; Third-party services]
Fig. 6 Structure of the AirMerge platform
• A RESTful API, which allows accessing the data stored in the database using simpler commands than accessing the database directly.
• Several post-processing modules designed to operate on processed data, either on a particular coordinate or on an area. These modules offer various statistical and geoprocessing functions such as computing concentration value averages, producing ensemble (composite) forecasts or comparing and combining with observations, even though the intention is to enable external services to implement such functionality through the use of the AirMerge APIs.
• Third party extensions or special linkage modules which allow accessing those modules through a simplified interface, such as the ones used for interconnecting with the PESCaDO project’s framework [6].
• A visualization module offering direct in-browser user interaction.
All items except the post-processing modules and third-party extensions form part of AirMerge’s core functionality, and are designed to be as generic and data-agnostic as possible, thus being applicable to any heatmap image processing task. By using all of these functionalities together, AirMerge constitutes an environmental data collection repository, which can be extended and used for the creation of third-party services, which can then extract environmental knowledge from AirMerge’s processed data.
2.11 CWF gathering workflow and sub-system
Since collecting and processing CWF heatmaps is the primary goal of AirMerge, the first step in its workflow is to gather the heatmaps themselves. In order for a particular provider’s heatmaps to be successfully parsed and classified by AirMerge, their URLs must follow a regular pattern, with a fixed base form and variable parts in their names which should be indicative of time, pollutant and other relevant parameters. In other words, it is a necessary precondition that the heatmap URLs themselves carry clearly structured classification metadata. Being able to uniquely identify heatmaps and infer some of their variable aspects via their URL patterns is the key to AirMerge’s functionality. An example of a URL and its structure can be seen in Fig. 7. The URL pattern, as well as its constituent elements, are defined in the XML code snippet shown in Fig. 8. The URL’s structure is encoded in the <formatString> node, while its variable parts such as the pollutant, elevation layer etc. are indicated by tokens surrounded by hashes. Some tokens have different names from the XML nodes they are iterated over, e.g., #prognosis# takes values from the <forecasts> node, while others are more self-explanatory. It has been shown that automating the extraction of some of a heatmap’s metadata and structural information is possible [6, 27], by using OCR and text/image processing techniques,
http://silam.fmi.fi/AQ/operational/europe/acid/000/CO_gas_srf_009.png
[Fig. 7 labels: URL base form; pollutant; elevation; hour of day]
Fig. 7 An example of the structural parts of a heatmap URL
[Fig. 8 listing; surviving values: urlbase#/#pollutant#_#layer#_#prognosis#.png ... prognosis layer pollutant urlbase yy/MM/dd]
Fig. 8 Part of the XML-based configuration script used by AirMerge in order to configure the URL sequencer
but the efficacy of such techniques is limited by the fact that individual heatmaps do not always contain all of the necessary information, and operating on individual heatmaps makes it hard to detect the existence of generalized/common schemas between a series of similar heatmaps produced by the same CWF provider.
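The token-based URL pattern described in this subsection can be sketched as a small sequencer. This is an illustrative sketch, not AirMerge’s fetcher. It assumes that the format string begins with a #urlbase# token, and that in the Fig. 7 URL the parts CO_gas, srf and 009 are the pollutant, elevation layer and forecast hour respectively. The function name expand_urls and the value lists are hypothetical.

```python
from itertools import product


def expand_urls(format_string, values):
    """Yield (url, metadata) for every combination of token values.

    `values` maps a token name to the list of strings it iterates over.
    """
    names = list(values)
    for combo in product(*(values[name] for name in names)):
        url = format_string
        for name, value in zip(names, combo):
            url = url.replace(f"#{name}#", value)
        # The metadata travels with the URL, so each fetched heatmap can be
        # classified without inspecting the image itself.
        yield url, dict(zip(names, combo))


# Assumed token layout; the values are illustrative only.
FORMAT = "#urlbase#/#pollutant#_#layer#_#prognosis#.png"
VALUES = {
    "urlbase": ["http://silam.fmi.fi/AQ/operational/europe/acid/000"],
    "pollutant": ["CO_gas"],
    "layer": ["srf"],
    "prognosis": ["009", "010"],
}

urls = [url for url, _ in expand_urls(FORMAT, VALUES)]
```

Iterating over the Cartesian product of the token values is one simple way to enumerate every heatmap a provider is expected to publish.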
2.12 Image processing module
After images have been fetched, they are transferred to the image processing module, whose task is to convert raw bitmap data into numerical data, by taking into account each heatmap’s format and characteristics. The map region portion of the heatmap is cropped, and each of its pixels is scanned individually. Depending on its color and on how closely it matches one of the colors already present in the legend, it is assigned to a specific classification bin, according to the following (simplified) pseudocode:
Inputs:
• an [m×n] image I containing RGB color 3-tuples
• a color legend C containing k unique RGB color 3-tuples
Outputs:
• an [m×n] array Q containing integer values.
for all pixels p∈I
    for all colors c∈C
        if p≅c then Q[p]=index(C,c)
    end for
    if p∉C then Q[p]=gap_marker
    if isTransparent(C,p) then Q[p]=void_marker
end for
where index(C,c) is an integer-valued function which returns the zero-based index of a color c∈C. Pixels that cannot be classified as one of the colors existing in the legend C are assigned the special gap_marker value, which means that they are considered invalid data/undesirable noise. An exception is made if they meet the criteria of the Boolean-valued function isTransparent(C,p), which determines whether a pixel is to be classified as transparent or “void”, according to the setup of the color legend C. Those “void” pixels are not considered as either valid or invalid data and will be ignored during any successive computations. The classification condition, indicated with “≅” (almost equal), represents the fact that color classification is performed by using a tolerance threshold function. Most heatmaps are parsed by using the three-dimensional RGB (Red, Green, Blue) color space for classification. In this color space, two colors c1 = {c1,r, c1,g, c1,b} and c2 = {c2,r, c2,g, c2,b} are considered equal for the purpose of classification when their Euclidean distance drgb is less than a set threshold ε:

$$
d_{rgb}(c_1, c_2) = \lVert c_1 - c_2 \rVert = \sqrt{(c_{1,r} - c_{2,r})^2 + (c_{1,g} - c_{2,g})^2 + (c_{1,b} - c_{2,b})^2} \tag{2}
$$

Usually this threshold ε is set to 10, corresponding to a sphere of radius 10 in the RGB color space (assuming 8 bits or an intensity range of [0,255] per color channel), which proved to be a good all-around empirical value after extensive testing. Using a threshold rather than an exact match adds robustness against off-color and noisy graphics.
It is also possible to use alternate color spaces such as HSV (Hue, Saturation, Value), which make it easier to remove certain kinds of semi-transparent geomarkers without further data losses, but make it more difficult to discern between certain hues, so using them depends on the characteristics of the heatmaps to process.
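The classification pseudocode and the RGB distance threshold can be put together in a short runnable sketch. It is not AirMerge’s code and rests on several assumptions: the marker values −1 and −2 are hypothetical sentinels, isTransparent is modelled as closeness to an explicit list of void colors, and a pixel that lies within ε of several legend colors is assigned to the nearest one (the pseudocode does not specify a tie-breaking rule). The legend is assumed to be non-empty.

```python
import math

GAP_MARKER = -1   # hypothetical sentinel: unclassified pixel, to be gap-filled
VOID_MARKER = -2  # hypothetical sentinel: pixel outside the model's coverage


def rgb_distance(c1, c2):
    """Euclidean distance between two RGB triples, as in Eq. (2)."""
    return math.dist(c1, c2)


def classify_pixel(pixel, legend, void_colors=(), eps=10.0):
    """Return the legend index of `pixel`, or GAP_MARKER / VOID_MARKER."""
    # Void takes precedence, mirroring the last test in the pseudocode.
    if any(rgb_distance(pixel, v) < eps for v in void_colors):
        return VOID_MARKER
    # Nearest legend color; accepted only if it lies inside the eps sphere.
    index, dist = min(
        ((i, rgb_distance(pixel, c)) for i, c in enumerate(legend)),
        key=lambda item: item[1],
    )
    return index if dist < eps else GAP_MARKER


def classify_image(image, legend, void_colors=(), eps=10.0):
    """Classify a row-major grid of RGB triples into a grid of integers."""
    return [[classify_pixel(p, legend, void_colors, eps) for p in row] for row in image]


# Illustrative 2 x 2 image: two slightly off-color data pixels,
# one grey geomarker pixel and one white void pixel.
legend = [(0, 0, 255), (0, 255, 0)]
image = [[(0, 0, 250), (3, 252, 1)], [(128, 128, 128), (255, 255, 255)]]
q = classify_image(image, legend, void_colors=[(255, 255, 255)])
```

Swapping rgb_distance for a distance in another color space such as HSV would leave the rest of the sketch unchanged.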
2.13 Cleanup module
Before being stored in the database or subjected to further processing, the indexed data generated from the image processing step is subjected to the gap-filling or cleanup procedure. Its goal is to substitute all pixels marked as “gaps” with valid values, taken from the legend C. The algorithms used to do this have been detailed in [8]. The quality of the data recoverable from heatmaps at the end of the cleanup procedure, compared to the original model’s data before web publishing by data providers, has been explored in [7].
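To illustrate what a gap-filling pass does, the sketch below replaces each gap with the most common valid value among its eight neighbours, repeating until no gap can be filled. This is a generic majority-vote fill chosen for illustration only; it is not the algorithm detailed in [8]. The marker values and the function name fill_gaps are hypothetical.

```python
from collections import Counter

GAP, VOID = -1, -2  # hypothetical markers; valid legend indices are >= 0


def fill_gaps(q, max_passes=10):
    """Return a copy of grid `q` with GAP cells replaced by a neighbouring value.

    Each pass fills every gap that has at least one valid 8-neighbour with the
    most common valid value around it. VOID cells are never read or written.
    """
    rows, cols = len(q), len(q[0])
    grid = [row[:] for row in q]
    for _ in range(max_passes):
        updates = {}
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] != GAP:
                    continue
                votes = Counter(
                    grid[rr][cc]
                    for rr in range(max(r - 1, 0), min(r + 2, rows))
                    for cc in range(max(c - 1, 0), min(c + 2, cols))
                    if grid[rr][cc] >= 0
                )
                if votes:
                    updates[(r, c)] = votes.most_common(1)[0][0]
        if not updates:
            break
        # Apply all updates at once so a pass does not depend on scan order.
        for (r, c), value in updates.items():
            grid[r][c] = value
    return grid
```

Because thin geomarker lines are only one or two pixels wide, a neighbourhood vote of this kind typically closes them in one or two passes, while large void regions are left untouched.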
2.14 Database persistence module
After parsing and gap filling, the cleaned-up forecast data is stored in the form of an indexed image, along with region, projection and color scale information derived from the scripting/configuration system, according to the UML database schema shown in Fig. 9. This schema associates each harvested forecast with a unique “layerimage” entity, which is the top entity in AirMerge’s database hierarchy, while other associated information such as geographical region, legend, pollutant etc., can be reused and shared between multiple layerimage entities through the “regionlayer” entity. As an illustration, the regionlayer entity models all possible parameter combinations and variants of a forecast that are available through a provider.
[Fig. 9 entities and attributes, as far as recoverable from the diagram]
providerfetcher: fetcherid (PK), providerid (FK), isactive, xmlparsingspec, dailystarttime, refetchinterval, lastexecution, dailyavailabilitytime
provider: providerid (PK), name, groupname, modelname, description, url, nwlat, nwlng, nelat, nelng, swlat, swlng, selat, selng
region: regionid (PK), providerid (FK), name, deprojectstring, width, height, north, south, east, west
regionlayer: regionlayerid (PK), legendid (FK), pollutantid (FK), predictiontypeid (FK), regionid (FK), elevation, offsethours
layerimage: imageid (PK, FK), regionlayerid (FK), dateretrieved, datepublished, daterepresentstart, daterepresentend, originalurl, originalmd5, lstmodifiedheader, etagheader
imagedata: imageid (PK), imagedata
legend: legendid (PK), providerid (FK), legendrowid (FK), name
legendrow: legendrowid (PK), legendid (FK), index, color, lowmark, highmark
pollutant: pollutantid (PK), categoryid (FK), name, fullname, units
pollutantcategory: categoryid (PK), name, description
predictiontype: predictiontypeid (PK), name, description, hours
Fig. 9 An exemplified version of AirMerge’s database schema, indicating how forecasts are stored and how their metadata and auxiliary information can be retrieved. Primary keys are indicated with the PK note, foreign keys with the FKn note
These variants can be numerous, but they are always finite in number and eventually tend to repeat, and so are not unique. The layerimage entities, on the other hand, always have unique timestamps, as they represent actual unique instances of a forecast issued at a particular moment in time. Since the schema exposes geographical coverage information for each regionlayer entity, it is possible to select multiple layerimage entities covering the same or overlapping geographical areas. It is also possible to filter layerimages according to time coverage as well as recentness/relevance criteria. For example, for a given time of the day, a forecast issued on that same day will contain more reliable and up-to-date information than one issued 24 or 48 h before, even if it nominally refers to the same time and day (the “represented time” attribute), hence the most recent forecast will also be more relevant. This way, conflicts between multiple forecasts covering the same time and space can be resolved.
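The recentness rule can be sketched as a simple in-memory selection. This is an illustrative sketch rather than AirMerge’s query logic: ForecastRecord is a hypothetical flattened record joining attributes of the layerimage and regionlayer entities of Fig. 9, and grouping by (regionid, pollutantid) is an assumption about what counts as “the same” forecast.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ForecastRecord:
    imageid: int
    regionid: int
    pollutantid: int
    datepublished: datetime
    daterepresentstart: datetime
    daterepresentend: datetime


def most_recent_covering(records, when):
    """Map (regionid, pollutantid) to the newest forecast whose represented
    interval [start, end) contains `when`."""
    best = {}
    for rec in records:
        if not (rec.daterepresentstart <= when < rec.daterepresentend):
            continue
        key = (rec.regionid, rec.pollutantid)
        current = best.get(key)
        if current is None or rec.datepublished > current.datepublished:
            best[key] = rec
    return best


# Two forecasts for the same hour, issued one day apart (illustrative values).
start, end = datetime(2015, 4, 2, 12), datetime(2015, 4, 2, 13)
records = [
    ForecastRecord(1, 7, 3, datetime(2015, 4, 1, 6), start, end),
    ForecastRecord(2, 7, 3, datetime(2015, 4, 2, 6), start, end),
]
chosen = most_recent_covering(records, datetime(2015, 4, 2, 12, 30))
```

In a database setting the same selection would normally be expressed as a query over the publication and represented-time columns.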
2.15 Geographical transformation modules

To avoid unnecessary data loss, AirMerge stores all forecasts in their original raster resolution and map projection, without any permanent alterations. However, it is often necessary to transform a forecast from one type of geographical projection to another, for example, during a computation or a visualization in a map projection other than its native one. For this reason, AirMerge contains a variety of map projection and transformation modules, which can be chained together to form a complete deprojection and reprojection workflow for any stored forecast. In general, a forecast is associated with up to four different projection rules:
- Input Pixel Projection: A projection from the forecast’s raster space (pixels) to the associated transformed (linear) coordinates’ space.
- Input Geographical Projection: A projection from the transformed coordinates’ space to a common geographical coordinates’ space.
- Output Geographical Projection: A projection from the forecast’s geographical coordinates’ space to a transformed linear coordinates’ space.
- Output Pixel Projection: A projection from the forecast’s transformed coordinates’ space into an image’s or data array’s raster space (pixels).
This apparent redundancy and complexity is required because there are three coordinate spaces to consider:
- The m,n coordinate space of the forecast’s raster itself: (m,n)∈ℕ², also called “pixel space”.
- The transformed x,y coordinate space of the forecast: (x,y)∈ℝ², also called “linear space”.
- The geographical λ,φ (longitude, latitude) coordinate space of the forecast: (λ,φ)∈[−π,π]², also called “geo space”.
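The four projection rules and the three coordinate spaces can be summarized compactly as maps and their compositions. The symbols P_in, G_in, G_out and P_out are introduced here purely for illustration and are not notation used by AirMerge itself; primes denote coordinates in the output projection.

```latex
\begin{aligned}
P_{\mathrm{in}}  &: (m,n) \mapsto (x,y), &
G_{\mathrm{in}}  &: (x,y) \mapsto (\lambda,\varphi), \\
G_{\mathrm{out}} &: (\lambda,\varphi) \mapsto (x',y'), &
P_{\mathrm{out}} &: (x',y') \mapsto (m',n'), \\
\text{deprojection} &= G_{\mathrm{in}} \circ P_{\mathrm{in}}, &
\text{reprojection} &= P_{\mathrm{out}} \circ G_{\mathrm{out}}.
\end{aligned}
```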
The linear space is an intermediate 2D real-valued space which appears when using nonlinear projections, e.g., conical, polar stereographic or spherical [24]. Its relationship with the pixel space will generally be linear (hence the name) and straightforward, but its relationship with actual geographical coordinates can be quite complex. An example of such a space is the UTM coordinate system, whose coordinates are expressed as linear distances (e.g., in km) from a set point of origin [23], and which cannot be converted to latitude and longitude through a simple linear relationship. In practice, pixel and geo projections are used in pairs, masking the existence of the transformed space, which therefore remains hidden and is not used directly in computations or projections. This way, all computations and queries only have to deal with pixels and absolute geographical coordinates. In Fig. 10, the typical transformation workflow is illustrated. During input, pixel or raster data is deprojected to the geographical coordinates’ space. The input/deprojection function is actually performed in two steps, but as far as AirMerge is concerned, the net result is a conversion from pixels to geographical coordinates. This allows queries about single geographical locations or areas to be expressed in intuitive, broadly supported and well understood geographical coordinates, while the deprojection and reprojection mechanisms take care of the complex transformations required. In Fig. 11, an example of a cleaned-up and projection-corrected heatmap is shown. Each type of projection function used in AirMerge is also implemented to be fully reversible, so it is also possible to convert geographical coordinates back into an array of pixels or other types of rasterized data.
This array may be of the same type and size as the original input (in which case the processing forms a closed loop), or of a different type (for example, when visualizing several different types of input heatmaps in a common map projection). The map projections in AirMerge are implemented in custom code, which allows for a more lightweight codebase, direct control and less dependence on external frameworks. Nevertheless, adapters to external map transformation engines, such as the one in GeoTools [20], can still be used to expedite the integration of new map projections.
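As an illustration of how a pixel projection and a geo projection can be paired and reversed, the following minimal Python sketch chains an affine pixel-to-linear map with a spherical Mercator projection. All class and function names are hypothetical, the Mercator projection is swapped in only because its formulas are short (AirMerge's own projection code is not reproduced in this paper), and the pixel index m is assumed to be the column and n the row, counted from the top.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PixelProjection:
    """Affine map between pixel space (m, n) and linear space (x, y)."""
    x0: float  # linear x of the centre of pixel column 0
    y0: float  # linear y of the centre of pixel row 0 (top row)
    dx: float  # linear units per pixel, horizontally
    dy: float  # linear units per pixel, vertically

    def to_linear(self, m: int, n: int) -> Tuple[float, float]:
        # Rows grow downwards in the raster, so y decreases with n.
        return self.x0 + m * self.dx, self.y0 - n * self.dy

    def to_pixel(self, x: float, y: float) -> Tuple[int, int]:
        return round((x - self.x0) / self.dx), round((self.y0 - y) / self.dy)


@dataclass(frozen=True)
class MercatorProjection:
    """Spherical Mercator between linear space (km) and geo space (radians)."""
    radius: float = 6371.0  # assumed mean Earth radius in km

    def to_geo(self, x: float, y: float) -> Tuple[float, float]:
        lon = x / self.radius
        lat = 2.0 * math.atan(math.exp(y / self.radius)) - math.pi / 2.0
        return lon, lat

    def to_linear(self, lon: float, lat: float) -> Tuple[float, float]:
        x = self.radius * lon
        y = self.radius * math.log(math.tan(math.pi / 4.0 + lat / 2.0))
        return x, y


def deproject(m: int, n: int, pixel: PixelProjection,
              geo: MercatorProjection) -> Tuple[float, float]:
    """Pixel space -> geo space; the linear space stays hidden inside."""
    return geo.to_geo(*pixel.to_linear(m, n))


def reproject(lon: float, lat: float, geo: MercatorProjection,
              pixel: PixelProjection) -> Tuple[int, int]:
    """Geo space -> pixel space of a (possibly different) output raster."""
    return pixel.to_pixel(*geo.to_linear(lon, lat))


# Closed-loop check: deprojecting a pixel and reprojecting it with the same
# rules should return the original pixel indices.
raster = PixelProjection(x0=-1000.0, y0=7000.0, dx=10.0, dy=10.0)
mercator = MercatorProjection()
lon, lat = deproject(30, 40, raster, mercator)
round_trip = reproject(lon, lat, mercator, raster)
```

Callers of `deproject` and `reproject` only ever see pixels and geographical coordinates, which is the pairing of projections described above.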
[Figure 10: workflow diagram. Deprojection (input): data array or image in pixel space (m,n) → pixel projection → linear space (x,y) → geo projection → geo space (λ,φ) → processing. Reprojection (output): geo space (λ,φ) → geo projection → linear space (x,y) → pixel projection → pixel space (m,n).]
Fig. 10 Representation of de- and re-projection workflow
2.16 Forecast ensembling module

In addition to retrieving, parsing and storing heatmap data, AirMerge can also perform several mathematical and statistical operations on groups of two or more layers retrieved from different forecasts. Such operations are very common in the domain of ensemble forecasting [21], and find applications in forecast model refinement and big data processing. In order to perform them, multiple heterogeneous layers must be translated to a common coordinate system and reference grid. This ability of AirMerge is displayed in Fig. 12, where three forecasts from different providers are averaged into one composite forecast, and differences in scale and resolution are leveled out. AirMerge can perform these operations either on a point-to-point basis (for a specific geographical coordinate), or on a geographically bounded area basis.
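The area-based averaging can be sketched as follows. This is a minimal illustration, not the AirMerge implementation: the `GeoGrid` class and `ensemble_mean` function are hypothetical, the input grids are assumed to be already deprojected to plain longitude/latitude with rows ordered from north to south, and nearest-neighbour sampling is assumed as the resampling method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeoGrid:
    """Hypothetical forecast layer on a regular longitude/latitude grid."""
    west: float
    south: float
    east: float
    north: float
    values: List[List[float]]  # rows ordered from north to south

    def sample(self, lon: float, lat: float) -> Optional[float]:
        """Nearest-neighbour value at a point, or None outside the coverage."""
        if not (self.west <= lon <= self.east and self.south <= lat <= self.north):
            return None
        rows, cols = len(self.values), len(self.values[0])
        col = min(int((lon - self.west) / (self.east - self.west) * cols), cols - 1)
        row = min(int((self.north - lat) / (self.north - self.south) * rows), rows - 1)
        return self.values[row][col]


def ensemble_mean(grids: List[GeoGrid], west: float, south: float, east: float,
                  north: float, rows: int, cols: int) -> List[List[Optional[float]]]:
    """Average all covering grids at the cell centres of a common target grid."""
    result = []
    for r in range(rows):
        lat = north - (r + 0.5) * (north - south) / rows
        line = []
        for c in range(cols):
            lon = west + (c + 0.5) * (east - west) / cols
            samples = []
            for grid in grids:
                value = grid.sample(lon, lat)
                if value is not None:
                    samples.append(value)
            # Cells covered by no forecast are left empty rather than guessed.
            line.append(sum(samples) / len(samples) if samples else None)
        result.append(line)
    return result


# Two layers over the same area but with different resolutions.
coarse = GeoGrid(0.0, 0.0, 2.0, 2.0, [[10.0]])
fine = GeoGrid(0.0, 0.0, 2.0, 2.0, [[20.0, 40.0], [60.0, 80.0]])
combined = ensemble_mean([coarse, fine], 0.0, 0.0, 2.0, 2.0, rows=2, cols=2)
```

Sampling every input at the cell centres of one target grid is what levels out differences in resolution; a production system would probably use a better interpolation scheme than nearest neighbour.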
Fig. 11 Cleanup and reprojection of a heatmap using a conical map projection
Fig. 12 Heatmap retrieval (A), cleanup (B), reprojection and combination (C) (ensembling) workflow
2.17 AirMerge public API

In order to make AirMerge’s harvested heatmaps and data available to third-party services and researchers, a public API was designed, available as a RESTful web service that responds to HTTP GET requests [4]. The API is currently in a testing phase, and is unauthenticated and publicly accessible. Its original purpose was to allow interfacing with the PESCaDO node orchestration service, allowing it to request chemical weather data for specific geographical coordinates (point queries) from one or multiple layers, and also to composite multiple source layers into a single result, according to the principles of ensemble forecasting [21].
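The kind of computation that sits behind a multi-layer point query can be sketched as below. This does not reproduce the actual API, whose endpoints and response format are documented in [4]: the `point_query` function, the dictionary keys and the layer callables are all hypothetical, and the population standard deviation is used only as an assumed stand-in for a spread measure.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Optional

# A layer is modelled as a function returning the value at (lon, lat),
# or None when the location lies outside its coverage.
Layer = Callable[[float, float], Optional[float]]


def point_query(layers: Dict[str, Layer], lon: float, lat: float) -> dict:
    """Collect per-layer values at a point and composite them into one result."""
    values = {}
    for name, layer in layers.items():
        value = layer(lon, lat)
        if value is not None:
            values[name] = value
    if not values:
        return {"lon": lon, "lat": lat, "values": {}, "ensemble": None, "spread": None}
    samples = list(values.values())
    return {
        "lon": lon,
        "lat": lat,
        "values": values,           # what each covering layer reports
        "ensemble": mean(samples),  # composite value (simple average)
        "spread": pstdev(samples),  # assumed spread measure, see lead-in
    }


# Illustrative layers: two cover the location, one does not.
layers = {
    "provider-a": lambda lon, lat: 40.0,
    "provider-b": lambda lon, lat: 60.0,
    "provider-c": lambda lon, lat: None,
}
result = point_query(layers, lon=22.95, lat=40.63)
```

With these inputs the composite value is 50.0 and the spread is 10.0; layers that do not cover the location are simply left out of the average.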
2.18 AirMerge visualization module

The visualization module is not considered an essential AirMerge component, as the system can continue functioning without it. It is, however, a convenient way of accessing the currently stored layers of CWF data and visualizing them in a harmonized way on a common Google Maps background. This module was used during development and is currently not actively maintained, as its replacement with an OGC WMS-based solution is planned.
2.19 Use of AirMerge in the PESCaDO project

The PESCaDO [28] service system has been developed with a purpose closely related to that of AirMerge, being oriented towards discovering new environmental data sources on the Web and integrating them in a centralized repository. In contrast to AirMerge, however, emphasis has been placed on the automatic discovery, retrieval and classification of informational nodes, including elements of Machine Learning and ontological data organization, while allowing for extensions through auxiliary external functionality. In this context, AirMerge has been interfaced to PESCaDO with two distinct roles: that of an environmental data node, and that of a provider of forecast ensembling and fusion.
In PESCaDO, the concept of an environmental data node encompasses every kind of usable online data source, adopting a philosophy similar to that of AirMerge, but focused on sources broader than heatmaps (i.e., websites providing weather, air quality and pollen forecasts and historical data). Normally, PESCaDO includes data discovery and fetching mechanisms designed to operate on readable text contents, data feeds, air quality bulletins and, more generally, on textual web resources and websites, inferring context and contents by the use of semantic-ontological text analysis techniques. However, in the early design phases of the PESCaDO system, it was realized that important environmental information is also contained in non-textual data (e.g., heatmaps), and that having access to AirMerge’s mechanisms and its already-harvested data would therefore be advantageous. For non-textual sources, such as heatmaps, PESCaDO can make use of specialized connectors that allow it either to access pre-organized external databases, or to utilize specialized computation and processing modules, thus adding even non-textual resources to its knowledge database. AirMerge has found use within the PESCaDO project in three separate ways. First, access to its already harvested data was granted through the AirMerge API, whose initial development was stimulated and shaped precisely by the needs of PESCaDO. Using the AirMerge API, the PESCaDO service was capable of performing point-based queries all over Europe, receiving precise numerical responses. Second, AirMerge performs a type of localized, point-based ensembling [21] when a particular geographic location is covered by more than one CWF provider, and also produces an uncertainty metric for the final reported result. This allows PESCaDO not only to supply its users with numerical, rather than qualitative, information, but also to provide an estimate of the underlying data’s reliability and precision.
This is achieved by keeping the AirMerge platform running as usual, with the PESCaDO platform performing its queries remotely through the API, without any implementation details of the two platforms being exposed [5]. A third and more direct involvement of AirMerge in the PESCaDO platform was achieved through the almost direct reuse of AirMerge’s heatmap parsing component by PESCaDO. This component can be used autonomously, and even off-line, provided that a suitable configuration script is supplied, describing how to parse a specific heatmap. The configuration script can be generated either manually or automatically using a dedicated annotation tool [17], and is similar to the examples already shown. The component utilized by PESCaDO is a stripped-down version of the parsing subsystem. It uses the same configuration scripting as the AirMerge system, minus some features such as the URL sequencer and harvester, support for provider-specific configurations and multiple map regions, and in general any features meant to process sequences of similar heatmaps. The scripting language is instead reduced to describing how to parse a single specific heatmap, rather than an entire class of heatmaps bound by some common characteristics. While this may seem restrictive at first, it is actually a customization that fits PESCaDO’s semi-automatic heatmap configuration subsystem, which attempts to automatically determine characteristics of a heatmap such as resolution, geographical bounds, color scale and values, based on OCR and text processing techniques. This information is then validated and further edited by an administrative user with the aid of a dedicated Annotation Tool, and is finally used to generate an AirMerge heatmap parsing XML script, which is fed to the AirMerge parsing component according to the scheme in Fig. 13, where the AirMerge component is indicated as the “Heatmap Processing” block [17]. In the future, this same subsystem could be made publicly available via WPS, thus eliminating the need to provide an implementation library for use in situ.
2.20 Conclusions and future developments

In order to design an information system that uses heatmaps as its input and produces high-quality environmental information as its output, precise knowledge of the heatmaps’ structure and characteristics is required, as well as a streamlined and coordinated process for data retrieval, handling, information extraction and system operation. AirMerge was designed from the ground up according to this knowledge, adopting a bottom-up approach and following a results-oriented development strategy, in order to deal with any sub-problems encountered when treating heatmaps. The AirMerge system is not meant to be static; it evolves and is upgraded along with changes in the CWF publishing scene where it is currently applied. New providers, heatmap formats and map projections are added as necessary, and changes in current providers’ publishing patterns are followed. AirMerge focuses on daily updates starting from published heatmaps, rather than on one-off exchanges of historical model or station data, though it can also function as a historical repository. AirMerge can be considered as filling a niche between the long-term historical and statistical presentation of regional air quality data and short-term CWF, allowing its database to grow as new CWF are published. AirMerge has been evaluated in the roles of both a CWF data repository and a supplier of specialized chemical weather processing services, both as a standalone research tool and as a component of a third-party value-added service (PESCaDO). To make AirMerge more interoperable and more readily accessible to third parties, as well as more readily usable as a base for building CWF-related services [28], the implementation of Open Geospatial Consortium (OGC) standards is being considered, to work alongside or even entirely supersede the custom AirMerge API for most tasks.
In particular, visualization of harvested heatmaps could be performed through the OGC Web Map Service (WMS), while downloadable numerical data could be supplied through on-the-fly generation of NetCDF files or other suitable formats by an OGC Web Coverage Service (WCS). In addition, certain extra processing functions offered through the AirMerge API could be better exposed as OGC WPS (Web Processing Service) processlets. In general, future efforts will be directed at making AirMerge more accessible to third parties through the use of well-established GIS standards, rather than providing custom access and visualization interfaces, in order to turn AirMerge into an attractive, standards-compliant and solid foundation for the development of CWF-related web services.
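To give an idea of what standards-based access could look like from a client’s side, the sketch below builds a WMS 1.3.0 GetMap request URL using the standard request parameters. The endpoint URL, the layer name and the `wms_getmap_url` helper are hypothetical placeholders, since no such AirMerge service is described in this paper; the CRS:84 reference system is chosen so that the bounding box is given in longitude/latitude order.

```python
from typing import Optional
from urllib.parse import urlencode


def wms_getmap_url(base_url: str, layer: str, west: float, south: float,
                   east: float, north: float, width: int, height: int,
                   time: Optional[str] = None) -> str:
    """Build a WMS 1.3.0 GetMap URL for one layer and bounding box."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",      # empty value requests the server's default style
        "CRS": "CRS:84",   # longitude/latitude axis order
        "BBOX": f"{west},{south},{east},{north}",
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    if time is not None:
        # Optional TIME dimension, relevant for time-dependent forecast layers.
        params["TIME"] = time
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name, for illustration only.
url = wms_getmap_url("https://example.org/wms", "o3_surface",
                     -10, 35, 30, 60, 800, 500, time="2014-07-31T12:00:00Z")
```

Because such a request follows a published standard, any generic WMS client or web mapping library could display the layer without knowledge of a custom API.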
Fig. 13 Overall heatmap content distillation architecture in PESCaDO
Acknowledgments AirMerge was developed in the frame of COST Action ES0602, and was financially supported by the FMI during the years 2010–2012 and co-funded by the PESCaDO project during 2012–2013. This publication was supported by the “IKY Fellowships of Excellence for Postgraduate Studies in Greece - Siemens Program” at the time of writing.
References

1. Aalto A (2012) Scalability of Complex Event Processing as a part of a distributed Enterprise Service Bus. [Internet]. Espoo, Finland [cited 2014 Jul 31]. Available from: http://www.cleen.fi/en/SitePages/Public%20deliverables.aspx?fileId=780&webpartid=g_4859b5f8_884d_4432_8aab_2e4c3e4f17dd
2. Armenakis C, Savopol F (2014) Image processing and GIS tools for feature and change extraction. In: Proc. of the XXth ISPRS Congress. Istanbul, p. 605–610
3. Balk T, Kukkonen J, Karatzas K, Bassoukos A, Epitropou V (2011) A European open access chemical weather forecasting portal. Atmos Environ 45:6917–6922
4. Bassoukos A (2013) AirMerge Remote API Overview. [Internet]. [cited 2014 Jul 31]. Available from: https://docs.google.com/document/d/10z_B-Vxd1YJbKADVdsM30OuSoXszpfyRnuVNf4-qUio/edit?usp=sharing
5. Epitropou V, Johansson L, Karatzas K, Bassoukos A, Karppinen A, Kukkonen J, Haakana M (2012) Fusion Of Environmental Information For The Delivery Of Orchestrated Services For The Atmospheric Environment In The PESCaDO Project. [Internet]. Leipzig, Germany [cited 2014 Aug 4]. Available from: http://www.iemss.org/sites/iemss2012//proceedings/D2_1012_Johansson_et_al.pdf
6. Epitropou V, Karatzas K, Karppinen A, Kukkonen J, Bassoukos A (2012) Orchestration services for chemical weather forecasting models in the frame of the PESCaDO project. In: 8th International Conference on Air Quality - Science and Application; Athens, 19–23
7. Epitropou V, Karatzas K, Kukkonen J, Vira J (2012) Evaluation of the accuracy of an inverse image-based reconstruction method for chemical weather data. Int J Artif Intell 9(12):152–171
8. Epitropou V, Karatzas K, Bassoukos A (2010) A method for the inverse reconstruction of environmental data applicable at the Chemical Weather portal. In: Proceedings of the GI-Forum Symposium and exhibit on applied Geoinformatics; p. 58–68
9. European Earth Observation Programme (2012) PASODOBLE project. MyAir PASODOBLE project homepage. [Internet]. [cited 2014 Aug 4]. Available from: http://www.myair.eu/airsheds/
10. European Environment Agency (2014) AirBase - The European air quality database. [Internet]. [cited 2014 Aug 4]. Available from: http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-7
11. Galmarini S, Kioutsoukis I, Solazzo E (2013) E pluribus unum: ensemble air quality predictions. Atmos Chem Phys 29:7153–7182
12. Horálek J, Tarrasón L, de Smet P, Malherbe L, Schneider P, Ung A, Corbet L, Denby B (2013) Evaluation of Copernicus MACC-II ensemble products in the ETC/ACM spatial air quality mapping. Technical Paper 2013/9. European Topic Centre on Air Pollution and Climate Change Mitigation
13. Karatzas K, Kukkonen J (2009) COST Action ES0602: Quality of life information services towards a sustainable society for the atmospheric environment. Sofia Publishers, Thessaloniki
14. Karatzas K, Kukkonen J, Bassoukos A, Epitropou V, Balk T (2011) A European chemical weather forecasting portal. In: Steyn GD, Trini SC (eds) Air pollution modeling and its applications XXI. Springer, NATO Science for Peace and Security Series C: Environmental Security, p. 239–243
15. Khan FH, Javed MY, Bashir S, Khan A, Sikandar M, Khiyal H (2010) QoS based dynamic web services composition & execution. Int J Comput Sci Inf Secur
16. Kukkonen J, Olsson T, Schultz DM, Baklanov A, Klein T, Miranda AI, Monteiro A, Hirtl M, Tarvainen V, Boy M et al (2012) A review of operational, regional-scale, chemical weather forecasting models in Europe. Atmos Chem Phys 12:1–87
17. Moumtzidou A, Epitropou V, Vrochidis S, Karatzas K, Voth S, Bassoukos A, Mossgraber J, Karppinen A, Kukkonen J, Kompatsiaris I (2014) A model for environmental data extraction from multimedia and its evaluation against various chemical weather forecasting datasets. J Ecol Inf 23:69–82
18. Moumtzidou A, Epitropou V, Vrochidis S, Voth S, Bassoukos A, Karatzas K, Moßgraber J, Kompatsiaris I, Karppinen A, Kukkonen J (2012) Environmental data extraction from multimedia resources. In: Proceedings of the 1st ACM international workshop on Multimedia analysis for ecological data (MAED 2012); Nara, Japan. p. 13–18
19. Open Geospatial Consortium (2013) OGC best practice for using web map services (WMS) with time-dependent or elevation-dependent data [Internet]. [cited 2014 Jul 30]. Available from: http://external.opengeospatial.org/twiki_public/pub/MetOceanDWG/MetOceanWMSBPOnGoingDrafts/12-111r1_Best_Practices_for_WMS_with_Time_or_Elevation_dependent_data.pdf
20. OSGeo Foundation (2014) GeoTools Infosheet. [Internet]. [cited 2014 Jul 31]. Available from: http://www.osgeo.org/geotools
21. Potempski S, Galmarini S (2009) Est modus in rebus: analytical properties of multi-model ensembles. Atmos Chem Phys 9(24):9471–9489
22. Riga M, Karatzas K (2014) Investigating the relationship between social media content and real-time observations for urban air quality and public health. In: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS '14); New York, USA. p. 59:1–7
23. Snyder J (1987) Map Projections: A Working Manual. Professional Paper: 1395. USGS Publications Warehouse
24. Snyder J (1989) An album of map projections. Professional Paper: 1453. USGS Publications Warehouse
25. Sofiev M, Siljamo P, Valkama I, Ilvonen M, Kukkonen J (2006) A dispersion modelling system SILAM and its evaluation against ETEX data. Atmos Environ 44(4):674–685
26. Verstraete MM, Pinty B (2013) Environmental information extraction from satellite remote sensing data. In: Kasibhatla P (ed) Inverse methods in global biogeochemical cycles, vol 1, pp 125–137
27. Vrochidis S, Epitropou V, Bassoukos A, Voth S, Karatzas K, Moumtzidou A, Moßgraber J, Kompatsiaris I, Karppinen A, Kukkonen J (2012) Extraction of environmental data from on-line environmental information sources. In: IFIP Advances in Information and Communication Technology; p. 361–370
28. Wanner L, Vrochidis S, Rospocher M, Mossgraber J, Bosch H, Karppinen A, Myllynen M, Tonelli S, Bouayad-Agha N, Bugel U, et al (2012) Personalized environmental service orchestration for quality of life improvement. In: Artificial intelligence applications and innovations, IFIP advances in information and communication technology, 3rd Intelligent Systems for Quality of Life information Services (ISQL 2012); Halkidiki, Greece. p. 351–360
Victor Epitropou received his MEng degree in Electrical Engineering and Computer Science from the Democritus University of Thrace (DUTh) in 2007 with a thesis on the parallelization of algorithms in an object-oriented context, and his MSc degree in Digital Image and Signal Processing from the DUTh in 2011, after serving in the Hellenic Army as a reserve officer of the Research & IT corps from 2007 to 2009. He has worked at the Informatics Systems & Applications Group of the Aristotle University of Thessaloniki since late 2009 on several European research projects in the domain of air quality and personalized web services, and is currently a PhD candidate at the Aristotle University of Thessaloniki. His research and work interests also include embedded software development, neural networks, wireless sensor networks, and desktop applications development.
Tassos Bassoukos is a Senior Software Engineer with the Informatics Systems & Applications Group of the Aristotle University of Thessaloniki, handling the group’s Software Engineering needs. He has been carrying the main software development burden of the group for several years. He has been working with Web technologies since HTML 2.0 and Java web applications since 1999. He has participated in several European research projects and has seen them to successful completion. Additional areas of professional interest include Open Source frameworks, Web-based Content management systems, inter-language integration, web mapping systems, information representation and Human-Computer Interfaces.
Kostas Karatzas is an Asst. Professor for Informatics Systems & Applications at the Dept. of Mechanical Engineering, Aristotle University of Thessaloniki (AUTh), Greece, where he leads the Informatics Applications and Systems Group (ISAG). The Group specializes in Environmental Informatics, and conducts data-oriented analysis and modelling at a raw, processed, and e-service level, for citizens, authorities and industry, employing Computational Intelligence. ISAG has long-standing expertise in web-based applications, web portals and services, as well as in applications for mobile devices and smart phones, with an emphasis on participatory environmental sensing. Kostas Karatzas is the author and co-author of more than 150 scientific publications, and serves as a scientific committee member for international conferences and as an advisory board member of international publications, with an emphasis on Environmental Informatics and Computational Intelligence.
Dr. Ari Karppinen has worked as a research scientist at the Finnish Meteorological Institute since 1984. His expertise is in mathematical modeling and atmospheric physics and chemistry, particularly the evaluation of urban air quality and the dispersion of pollution from traffic. His MSc thesis (1987) dealt with the description and application of a system for calculating radiation doses due to long-range transport of radioactive releases, and his Licentiate’s thesis (1998) studied the effective choice of NOx emission control measures. His doctoral thesis (2001) dealt with the meteorological pre-processing and atmospheric dispersion modeling of urban air quality and applications in the Helsinki metropolitan area. Doc. Karppinen is the author of approximately 200 scientific publications, 37 of these in refereed international journals. He has given over 50 lectures and presentations at scientific conferences.
Leo Wanner is an ICREA Research Professor in the Department of Information and Communication Technologies at Universitat Pompeu Fabra (UPF). He earned his Diploma degree in Computer Science from the University of Karlsruhe, Germany, and his Ph.D. in Linguistics from the University of The Saarland, Saarbrücken, Germany. Leo works in the field of computational linguistics. His research foci include automatic multilingual report generation, automatic summarization of written material and paraphrasing, computational lexicology and lexicography. Throughout his career, Leo has been involved in various large-scale national, European, and transatlantic research projects. He has published five books and over 100 refereed journal and conference articles.
Stefanos Vrochidis received the Diploma degree in electrical engineering from Aristotle University of Thessaloniki, the MSc degree in radio frequency communication systems from University of Southampton, and the PhD degree in electronic engineering from Queen Mary, University of London. He is a postdoctoral researcher with ITI-CERTH. His research interests include semantic multimedia analysis, indexing and retrieval, semantic search, multimedia search engines and human interaction, as well as environmental applications and patent search.
Ioannis Kompatsiaris received the Diploma degree in electrical engineering and the Ph.D. degree in 3-D model based image sequence coding from Aristotle University of Thessaloniki in 1996 and 2001, respectively. He is a Senior Researcher with ITI-CERTH and director of its Multimedia Knowledge Laboratory. His research interests include multimedia content processing, multimodal techniques, multimedia and Semantic Web, multimedia ontologies, knowledge-based analysis, and context aware inference for semantic multimedia analysis, personalisation and retrieval.
Jaakko Kukkonen received the Ph.D. degree in physics from the University of Helsinki in 1990. He is currently Research Professor at the Finnish Meteorological Institute. He is also Docent (Adj. Prof.) of Physics at the University of Helsinki and Visiting Professor at the University of Hertfordshire (U.K.). He has worked on atmospheric physics and chemistry, including especially the development, evaluation and applications of mathematical atmospheric models.