ECOINF-00416; No of Pages 14 Ecological Informatics xxx (2013) xxx–xxx

Contents lists available at ScienceDirect

Ecological Informatics journal homepage: www.elsevier.com/locate/ecolinf

A model for environmental data extraction from multimedia and its evaluation against various chemical weather forecasting datasets

Anastasia Moumtzidou a,⁎, Victor Epitropou b, Stefanos Vrochidis a, Kostas Karatzas b, Sascha Voth c, Anastasios Bassoukos b, Jürgen Moßgraber c, Ari Karppinen d, Jaakko Kukkonen d, Ioannis Kompatsiaris a

a Information Technologies Institute, Centre for Research and Technology Hellas, Greece
b Informatics Systems and Applications Group, Aristotle University of Thessaloniki, Greece
c Fraunhofer Institute of Optronics, System Technologies and Image Exploitation, Germany
d Finnish Meteorological Institute, Helsinki, Finland

Article info

Article history: Received 31 January 2013; Received in revised form 12 July 2013; Accepted 20 August 2013; Available online xxxx

Keywords: Air quality; Heatmap; Image processing; OCR; Environmental; Multimedia

Abstract

Environmental data analysis and information provision are considered of great importance for people, since environmental conditions are strongly related to health issues and directly affect a variety of everyday activities. Nowadays, there are several free web-based services that provide environmental information in several formats, with map images being the most commonly used to present air quality and pollen forecasts. This format, despite being intuitive for humans, complicates the extraction and processing of the underlying data. Typical examples of this case are the chemical weather forecasts, which are usually encoded as heatmaps (i.e. graphical representations of matrix data with colors), while the forecasted numerical pollutant concentrations are commonly unavailable. This work presents a model for the semi-automatic extraction of such information based on a template configuration tool, on methodologies for data reconstruction from images, as well as on text processing and Optical Character Recognition (OCR). The aforementioned modules are integrated in a standalone framework, which is extensively evaluated by comparing data extracted from a variety of chemical weather heat maps against the real numerical values produced by chemical weather forecasting models. The results demonstrate a satisfactory performance in terms of data recovery and positional accuracy.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

The analysis of environmental data and the generation, combination and reuse of related information, such as air pollutant concentrations, is of particular interest for people. Environmental status information (in particular, the concentration of certain pollutants in the air) is considered to be correlated with a series of health issues, such as cardiovascular and respiratory diseases; it directly affects several outdoor activities (e.g. commuting, sports, trip planning, agriculture) and is therefore strongly related to the overall quality of life. In addition, the analysis of environmental information is often a prerequisite for the fulfillment of legal mandates on the management and preservation of environmental quality, according to the EU's and other legal frameworks (Karatzas and Moussiopoulos, 2000). With a view to offering personalized decision support services for people based on environmental information regarding their everyday activities (Wanner et al., 2012) and supporting environmental experts in air quality preservation tasks, there is a need to extract, combine and compare complementary and competing environmental information from several resources, in order to generate more reliable and cross-validated information on the environmental conditions.

One of the main steps towards this goal is environmental information extraction from heterogeneous resources. Environmental observations are automatically performed by specialized instruments, hosted in stations established by environmental organizations, while the forecasts, which are used to foretell weather conditions, the levels of pollution or pollen concentration in areas of interest, are provided by environmental prediction models, the output of which is gridded numerical data, henceforth referred to as 'actual' or 'original' data. In practice, only a few of the data providers make available to the public some means of access to their actual (numerical) forecast data, while the majority publishes the results in the form of preprocessed images that address specific environmental pressures (like air pollution concentrations), for specific temporal scales (usually in the order of hours or days), and for specific geographical areas of interest. However, even if the original data values of environmental information had been available, these would commonly be presented in various technical formats, using various coordinates and spatial resolutions, different units, and several other choices (e.g., Kukkonen et al., 2012). It can therefore be a laborious task to convert these data files to the same harmonized format for inter-comparison purposes.

Consequently, the main sources of environmental information for everyday use are web portals and sites, which provide a variety of information of diverse spatial and temporal nature. Although the weather forecasts are usually presented in textual format (Moumtzidou et al., 2012b), important environmental information such as the air quality and pollen forecasts is encoded in multimedia formats (Karatzas, 2005). Specifically, the vast majority of such environmental data are published as static heatmaps (i.e. graphical representations of matrix data with colors), or as sequences of heatmaps (time-lapse animations). A characteristic example of a heatmap is presented in Fig. 1 (generated by the SILAM model, courtesy of FMI). However, since this information comes from different providers and is presented in a variety of mutually incompatible visual forms, it is not possible to directly combine it and compile a synthetic service that takes into account all available data sources. In order to deal with this problem, it is necessary to design and develop a model that is capable of extracting environmental information from heatmaps and translating it into a structured numerical format. The processing of images for their conversion into numerical data would comprise the core of environmental data recovery techniques, at least in the air pollution and the pollen concentration domains.

⁎ Corresponding author at: Centre for Research and Technology Hellas, Information Technologies Institute, 6th km Charilaou-Thermi Road, P.O. Box 60361, 57001 Thermi, Thessaloniki, Greece. Tel.: +30 2311257746.
E-mail addresses: [email protected] (A. Moumtzidou), [email protected] (V. Epitropou), [email protected] (S. Vrochidis), [email protected] (K. Karatzas), [email protected] (S. Voth), [email protected] (A. Bassoukos), [email protected] (J. Moßgraber), ari.karppinen@fmi.fi (A. Karppinen), jaakko.kukkonen@fmi.fi (J. Kukkonen), [email protected] (I. Kompatsiaris).

1574-9541/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.ecoinf.2013.08.003
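To make this colour-to-value recovery concrete, the following is a minimal sketch (not the actual AirMerge implementation; the legend colours and concentration ranges are invented for illustration) of classifying a heatmap pixel by its nearest legend colour:

```python
# Illustrative sketch: recover a concentration range from a heatmap pixel
# by matching it to the nearest legend colour. The legend below is an
# invented example, not a real provider's colour scale.

# Legend: RGB colour -> (lower, upper) concentration bound in ug/m3
LEGEND = {
    (0, 0, 255): (0.0, 10.0),     # blue   -> low
    (0, 255, 0): (10.0, 25.0),    # green  -> moderate
    (255, 255, 0): (25.0, 50.0),  # yellow -> elevated
    (255, 0, 0): (50.0, 100.0),   # red    -> high
}

def classify_pixel(rgb):
    """Return the value range of the legend colour closest to `rgb`
    (squared Euclidean distance in RGB space)."""
    nearest = min(LEGEND, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, rgb)))
    return LEGEND[nearest]

print(classify_pixel((250, 10, 5)))  # -> (50.0, 100.0)
```

Real heatmaps complicate this picture with anti-aliased edges, watermarks and overlaid lines, which is why the full pipeline described in this paper also needs noise removal and gap interpolation.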
In this context, this paper addresses the extraction of air quality and pollen forecasts from heatmaps, by proposing a semi-automatic framework which consists of three main components: an annotation tool for administrative user intervention, used for generating configuration templates for each heatmap; an Optical Character Recognition (OCR) and text processing module, used for fetching text information embedded in the image and making the necessary corrections; and the AirMerge heatmap processing module (Epitropou et al., 2011), which allows for the automatic harvesting, annotation, harmonization and reconversion of heatmaps into numerical data. The framework is evaluated against the AirMerge system and various chemical weather forecast datasets. It should be highlighted that the AirMerge system per se does not include an automated annotation process; therefore any heatmap harvesting and parsing procedure must be manually scripted, even though the programmatic generation of certain types of highly repetitive scripts (e.g. to handle series of images from one same provider) is possible. In contrast, the proposed framework aims at automating this scripting process, by generating the configuration scripts required by AirMerge on a per-case basis via optical heatmap analysis, the use of graphical templates and machine automation. The results of the generated scripts are then compared to those obtained by using the best manually configured AirMerge scripts for a given heatmap template, and the differences in their setup and final data extraction results are discussed.

The contribution of this work is a novel framework that integrates multimedia annotation and processing modules, in order to allow for the semi-automatic extraction of air quality and/or pollen forecast data presented in heatmaps. Specifically, this framework integrates multimedia configuration components (annotation tool), advanced systems for heatmap image processing (AirMerge) and optimized OCR techniques. These modules are integrated in a standalone, user-based interface that allows for template-based customization of heatmaps and thus assists in handling several formats of heatmaps. This paper substantially extends the works presented in Moumtzidou et al. (2012a) and Vrochidis et al. (2012), which demonstrated the initial results of this framework, by providing an extensive evaluation that includes a comparative study of the proposed framework against the manually configured AirMerge system and real numerical data provided by forecast models for a variety of providers.
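For illustration, a per-heatmap configuration template of the kind such a scripting process produces could carry information like the following. All field names, region rectangles and the JSON encoding are invented for this sketch; they are not AirMerge's actual script format.

```python
# Purely illustrative: a per-provider configuration template of the kind
# an annotation step could emit for the heatmap-processing stage. Field
# names and pixel rectangles are invented, not AirMerge's real schema.
import json

template = {
    "provider": "example-provider",     # hypothetical provider id
    "rois": {                           # pixel rectangles: (left, top, right, bottom)
        "heatmap": (41, 60, 521, 420),  # map area to classify
        "legend":  (41, 440, 521, 470), # colour scale
        "title":   (41, 10, 521, 40),   # text region for OCR
        "x_axis":  (41, 420, 521, 440),
        "y_axis":  (10, 60, 41, 420),
    },
    "projection": "lonlat",             # assumed simple lon/lat grid
}

config = json.dumps(template, indent=2)  # what would be written to disk
print(config)
```

The point of such a template is that, once produced (manually or semi-automatically), every subsequent image from the same provider can be processed without further user input.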
This paper is structured as follows: Section 2 presents the previous research on heatmap analysis and content extraction, Section 3 describes the results of studies on the presentation format of environmental

Fig. 1. An example of an air quality heatmap: the forecast of NO2 concentrations (μg/m3) at 08 UTC on 6 December 2012, using the SILAM chemical transport model.


information, as well as the structure of a typical heatmap. Section 4 presents the problem and its requirements, while Section 5 describes the overall architecture, the involved modules (i.e. annotation tool, text extraction and processing module, and heatmap processing module) and a short comparison of the proposed system and AirMerge. The evaluation results are presented in Section 6 and, finally, Section 7 concludes the paper.

2. Previous research

The task of map analysis strongly depends on the map type and the information we need to extract. Depending on the application, a straightforward requirement would be to perform semantic image segmentation (e.g. rivers, forests, etc.), while in the case of heatmaps it is to transform color into numerical data. In general, the discriminating factors between map types are reflected in their scale, colorization, quality, accuracy, topology and many other aspects. In the case of air quality (or chemical weather) maps, there are mainly two types of information covered by the map data:

• Geographical information: points and lines describing country frontiers or other well-known points of interest or structures (e.g. sea, land, capitals) in a given coordinate system. These features can often be used as cues for manually or automatically identifying the geographical registration of a map.
• Feature information: forecasted or measured data of any kind (e.g. average temperature or pollutant concentration), which are coded via a color scale representing the measured values. Single values are referenced geographically by the color value at the corresponding geographical point.

Chemical weather maps often use raster map images to represent measured or forecasted data. There are several approaches to extract and digitize this image information automatically. Musavi et al. (1988) describe the process of the vectorization of digital image data, whereby the geographical information, in the form of lines, is extracted and converted to storable digital vector data. In another work (Desai et al., 2005), the authors describe an approach to efficiently identify street maps among several other images by applying image processing techniques to identify unique patterns, such as street lines, which differentiate them from all other images. For the identification of street maps, the Laws texture classification algorithm is applied in order to recognize the unique image patterns such as street lines and street labels. Finally, the authors use GEOPPM, an algorithm for automatically determining the geocoordinates and scale of the maps. In another similar work (Henderson and Linton, 2009), the authors use the specific knowledge of the known colorization in USGS maps to automatically segment these maps based on their semantic contents (e.g. roads, rivers). Chiang and Knoblock (2006) propose an algorithm using 2-D Discrete Cosine Transformation (DCT) coefficients and Support Vector Machines (SVM) to classify the pixels of lines and characters on raster maps. In Michelson et al. (2008), the authors present an automatic approach to mine collections of maps from the web. This method harvests images from the web and then classifies them as maps or non-maps by comparing them to previously classified map and non-map images using methods from Content-Based Image Retrieval (CBIR). Specifically, a voting k-Nearest Neighbor classifier is used, as it allows exploiting image similarities without explicitly modeling them, compared to other traditional machine learning techniques such as Support Vector Machines. Finally, Cao and Tan (2002) improve the segmentation quality of text and graphics in color map images, to enhance the results of the subsequent analysis processes (e.g. OCR), by selecting black or dark pixels from color maps and cleaning them up from possible errors or known unwanted structures (e.g. dashed lines), to obtain cleaner text structures.

In addition, a specific attempt at map recognition was realized within the context of the TRECVID workshops (Smeaton et al., 2006). Specifically,


the 'maps' concept was evaluated in the high-level concept feature extraction task of TRECVID 2007 (Kraaij et al., 2007). The best performing system for the map concept was that of Yuan et al. (2007), which is based on supervised machine learning techniques over several fused visual descriptors. In another approach evaluated in TRECVID 2007 (Ngo et al., 2007), the authors explore the upper limit of the bag-of-visual-words (BoW) approach based upon local appearance features and evaluate several factors which could impact its performance. The proposed system is based on the fusion of Support Vector Machine classifiers that use BoW, spatial layout of keypoints, edge histogram, grid-based color moment and wavelet texture features. In this context, Chang et al. (2007) developed a cross-domain SVM (CDSVM) algorithm for adapting previously learned support vectors from one domain to help the classification in another domain. However, these algorithms were tested on maps in general and no testing was performed on heatmaps.

Although research work has been conducted towards the automatic extraction of information from maps, very few works address the automatic extraction of information from chemical weather maps, or environmental maps in general. However, such an extraction method has been included in the European Open Access Chemical Weather Forecasting Portal (ECWFP1), while an overview of the first version of this portal has been presented by Balk et al. (2011). In Epitropou et al. (2011, 2012), a method to reconstruct environmental data from chemical weather images is described and developed (the AirMerge system). First, the relevant map section is scraped from the chemical weather image. Then, disturbances are removed and a color classification is used to classify every single data point (pixel), in order to recover the measured data. With the aid of the known geographical boundaries, given by the coordinate axes and the map projection type, the geographical position of each measured data point can be retrieved. In the case of missing data points, a special interpolation algorithm (based on a novel Artificial Neural Network algorithm developed by the authors) is used to close these gaps. The authors in Moumtzidou et al. (2012a) and Vrochidis et al. (2012) propose a framework that integrates the system of Epitropou et al. (i.e. the AirMerge system) and aims at automating, and thus facilitating, its use. In both works the proposed system is evaluated only against the AirMerge system (semi-automated versus manual configuration), while in the current work a more extensive evaluation is performed, using the original numerical values that were generated by the corresponding forecast models as the ground truth.
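The gap-closing step can be illustrated with a much simpler stand-in than AirMerge's neural-network-based interpolation: filling each missing grid cell with the value of its nearest known cell. This sketch is only a conceptual substitute, not the algorithm used by the system.

```python
# Simple stand-in for gap filling (the actual AirMerge interpolation is a
# neural-network-based method): replace each missing cell with the value
# of its nearest non-missing cell, by squared Euclidean grid distance.

def fill_gaps(grid, missing=None):
    """Return a copy of `grid` with `missing` cells filled from the
    nearest known cell."""
    known = [(r, c, v) for r, row in enumerate(grid)
             for c, v in enumerate(row) if v is not missing]
    out = [row[:] for row in grid]
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            if v is missing:
                _, _, nearest = min(
                    known, key=lambda k: (k[0] - r) ** 2 + (k[1] - c) ** 2)
                out[r][c] = nearest
    return out

grid = [[1.0, None, None, 2.0]]
print(fill_gaps(grid))  # -> [[1.0, 1.0, 2.0, 2.0]]
```

Nearest-neighbour filling preserves the quantized colour classes, which matters here because each recovered value is a class range rather than a continuous measurement.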

3. Study and description of forecasted chemical weather heatmaps

In this section we present insights into the presentation of environmental information, focusing on air quality and pollen forecasts. The results of a study we have conducted on more than 60 environmental websites (dealing with weather, air quality and pollen), as well as the findings of previous works (Karatzas, 2005), revealed that a considerable share of environmental content, almost 60%, is encoded in images, and specifically heatmaps. Overall, it can be said (Balk et al., 2011) that chemical weather forecasting information is usually presented in the form of images representing pollutant concentrations over a geographically bounded region, typically in terms of maximum or average concentration values for the time scale of reference, which is usually the hour or day (Epitropou et al., 2011). These providers present air quality forecasts almost exclusively in the form of preprocessed images with a color index scale indicating the concentration of pollutants. In addition, they individually choose the image resolution and the color scale employed for visualizing pollution loadings, the covered region, as well as the geographical map projection. The mode of presentation varies from simple web images to AJAX, Java or Adobe Flash viewers (Kukkonen et al., 2009) and, while this representation is informative for the casual user (e.g. compared to a table with numerical values), it

1 http://www.chemicalweather.eu/Domains.


Fig. 2. Problem statement and steps involved.

has the drawback that the data are presented in a wide range of highly heterogeneous forms, which makes it very complicated to extract and compare their results. Moreover, some of the images are permanently marked with visible watermarks, text, lines etc. that make the extraction phase even more challenging.

In general, the heatmaps that contain chemical weather information are commonly static bitmap images, which represent the coverage data (e.g. concentrations) in terms of a color-coded scale over a geographical map. A characteristic example of such a heatmap, obtained from the SILAM FMI2 website, is depicted in Fig. 1. In general, the information that can be embedded in a heatmap image is the geographical coordinates of the map, the type of environmental aspect (e.g. ozone, birch pollen), the date/time of the meaningful information and the color scale. After a careful observation of numerous heatmaps, we conclude that the information that is considered of importance, besides the geographical coordinates and the concentrations, is the type of physical property (i.e. concentration of NO2), the date/time information (i.e. 2012-12-06 08:00) and the color scale. Summarizing, the main parts of information that need to be extracted and/or processed from all images are the following:

• Heatmap: map depicting a geographical region with colors representing the values of an environmental quantity.
• Color scale: range indicating the correspondence between value and color.
• Coordinate axes (x, y): indicate the geographical longitude and latitude of every map point for a specific geographic projection. On some heatmaps, the coordinates and their scale are explicit, while for others they must be deduced differently, e.g., by using known landmarks.
• Title: contains information such as the type of measured physical property, the time and date of the forecast, and additional information such as the type of measurement procedure (e.g. hourly average or daily maximum).
• Additional information: watermarks, border and coastal lines, wind fields superimposed on concentration maps and any other information that can be useful for visual interpretation and geographical

2 http://silam.fmi.fi/AQ_forecasts/Regional_v4_9/index.html.

registration purposes. However, this type of information is categorized as "noise" in terms of influencing the information content and representation value of the specific heatmap.

4. Problem statement and requirements

Having described the format of heatmaps and the type of the encoded information (i.e. geographical and color information), we will briefly describe the problem we address and the steps towards its solution. The problem can be summarized as follows: retrieve, from a heatmap, the numerical values and geographical coordinates of the concentrations of air pollutants (or of other environmental aspects, such as birch pollen concentration), taking into consideration that the original values have been quantized in order to allow their visualization, and thus no one-to-one mapping is possible. The proposed procedure towards the solution of this problem is a four-step process and is depicted in Fig. 2. The steps that reflect the requirements of the proposed framework are the following:

1) Removal of noisy elements (e.g. border and coastal lines)
2) Retrieval of the heatmap's raster grid coordinates and their mapping to actual geographical coordinates
3) Mapping of each heatmap pixel color to a range of values according to the color scale
4) Retrieval of the final result, i.e. coordinates and pollutant values.

5. Overall architecture of the heatmap processing model

The architecture of the proposed framework draws upon the requirements that were set in the previous section. The idea is to employ image analysis and processing techniques to map the color variations in the images onto specific categories, which directly correspond to ranges of values, in order to further automate the process supported by the AirMerge system. Normally, the latter relies on manually or programmatically prepared scripts to perform this task, but its modular architecture allows for automating it, making AirMerge suitable for use in an automated service.
Such automation is crucially needed for the use


Fig. 3. Overall heatmap content distillation architecture.

of this system in an open access portal, such as the ECWFP. To this end, optical character recognition techniques need to be employed for recognizing text encoded in image format, such as image titles, dates, environmental information and coordinates, while an annotation tool is required to support the intervention of an administrative user. Due to the fact that there is a large variation of images and many different representations, there is a need for optimizing and configuring the algorithms involved. Specifically, the intervention of an administrative user is required in order to annotate and manually segment the different parts of a new image type (like data, legend, etc.), which need to be processed by the content extraction algorithms. Ideally, the goal is to construct a complete configuration template – with metadata – for AirMerge in a more automatic way, thus limiting the user input. The proposed system workflow and the involved modules are depicted in Fig. 3.

In order to facilitate this configuration through a graphical user interface, as already discussed, we have implemented the "annotation tool" (AnT), which is tailored for dealing with heat maps. The output of this tool is a configuration file that holds the static information of the image. The second module is the "text extraction and processing" module, which uses the information of the configuration file to extract data from the corresponding image. More specifically, it retrieves and analyzes the information captured in text format using text processing techniques and OCR. The third module is the "heatmap processing" module, which uses information both from the output of the "text processing" module and from the configuration file to process the heatmap located inside the image. The input of the framework is an image containing a heatmap and the output is an XML file, in which each geographical coordinate of the initial heatmap is associated with a value (e.g., pollutant concentration or air quality index).

5.1. Annotation tool

To facilitate the annotation process for the user, an annotation tool (AnT) was developed which can be used to annotate heat maps interactively. The annotation tool was realized in C++ and based on the Qt framework, which allowed for creating a platform-independent tool. To ensure expandability, the MVC (model/view/controller) pattern was used in the software design. Based on this pattern, two different interaction methods were implemented on two different data views, both of which are derived from one data structure (the loaded XML template). The first data view was implemented as a simple TreeView, which represents the underlying XML data structure and its entries as a traversable tree. The second data view was implemented as a GraphicsView, which is capable of interpreting and viewing the selected datasets graphically. This view is used to draw regions of interest (ROIs) or points of interest (POIs) as overlays over the heatmap. The initial drawing of the data is triggered by the selection of the data element (e.g. the ROI element for the legend) in the TreeView.

Fig. 4 shows the AnT tool with an already loaded heatmap from the SILAM FMI website (Fig. 1). The left section of the AnT user interface shows the loaded heatmap. The loaded image consists of the following elements: a) the dyed map (i.e. heatmap), b) the x and y axes of the map, c) the color legend

with its corresponding d) measurement values, and e) the title and description of the heatmap. The smaller heatmaps on the bottom left and right are secondary information heatmaps present in this particular instance of a published chemical weather image, which, however, are not considered in this particular example. After selecting a ROI element from the pre-defined basic template, the ROI is drawn over the heatmap as a red rectangle. Then, the user has the ability to manipulate the ROI directly by moving the rectangle boundaries with the mouse, or alternatively by manipulating the values in the TreeView through direct text input. Both input methods record their changes to the same XML template data structure and update the other data views.

5.2. Text extraction and processing module

This module is driven by the configuration file produced by the AnT tool and focuses on retrieving the textual information captured in the image using text extraction and processing techniques, through a two-step procedure. The first step (i.e. text extraction) includes the application of OCR on the following parts of the input image: title, color scale, and map x and y axes, searching for potential text strings containing information relevant to the heatmap itself. The OCR software that is used is Abbyy FineReader,3 though any text processing module could, in theory, be plugged in. It should be noted that the OCR step is not expected to be error free and thus a second step (i.e. text processing) for text correction is required. In this step, we apply text processing based on heuristic rules, in order to correct to a certain extent, extract and understand the semantic information encoded in the aforementioned locations. It should be noted that each of these locations was treated in a different way.

The module produces two output files: the first one is used as input for the heatmap processing and holds information concerning the color scale and the map geographical coordinates, while the second captures general information, such as the date and the type of environmental aspect. In the sequel, we describe these two steps by applying them on the characteristic heatmap example of Fig. 1 and present the results. It should be noted that this example is very demanding, since the text that describes the x and y axes is of especially low resolution.

5.2.1. OCR on title, color scale, axes

Based on the study on heatmaps (see Section 3), a considerable part of the meaningful information can be extracted from the text surrounding the image. More specifically, the color scale and the map axes are essential elements that provide information about the values and the geographical area covered. On the other hand, the title (usually) contains information about the environmental physical property measured and the corresponding date/time. The location of the aforementioned image parts needs to be captured in the configuration template. Therefore, we apply OCR on the aforementioned parts of the heatmap depicted in Fig. 1. Tables 1, 2, 3 and 4 contain the input and output of

3 http://www.abbyy.com.gr/.


Fig. 4. Annotation tool (AnT) user interface.

OCR for the title, the color scale, and the x and y axes, respectively. The values in bold indicate the errors produced by the OCR. It should be noted that, for the color scale and the x and y axes, we also retrieved the exact position of the text, in order to relate the latter with the corresponding colors and geographical coordinates. This is done on the grounds that it is reasonable to assume that, e.g., a number under a horizontal line in the image most likely represents a longitude value, while a number located under the beginning of a color region in the color scale (or at its side, in the case of vertical scales) most likely represents the minimum or starting value for that color. A careful observation of Tables 1 and 2 shows that the text in the title and the color scale was identified accurately compared to that of the axes. Especially the results after processing the text on the y axis contain a lot of errors. This is due to the fact that the resolution of the figures along the y axis is particularly low, which makes it difficult even for the human eye to recognize them successfully. We will attempt to correct these errors as much as possible in the second step.

5.2.2. Text processing on OCR results

The next step includes the application of heuristic rules that accrue from the study of the sites containing heatmaps and aim at correcting and understanding the semantic information encoded in the aforementioned places. Each of these segments is treated in a different way, since the type of the semantic information included is different.
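The association between the OCR'd axis tick labels and geographical coordinates can be sketched as a least-squares pixel-to-longitude fit. A linear lon/lat axis is assumed, and the tick positions and values below are invented for illustration; real templates would derive them from the recognized text and its bounding boxes.

```python
# Sketch (assumed linear lon/lat axis): fit pixel column -> longitude
# from OCR'd tick labels and their pixel centres, so that every map
# column can be georeferenced. Tick data below are invented.

def pixel_to_lon(ticks):
    """ticks: list of (pixel_x_center, longitude). Returns a function
    px -> longitude via an ordinary least-squares line."""
    n = len(ticks)
    sx = sum(x for x, _ in ticks); sy = sum(l for _, l in ticks)
    sxx = sum(x * x for x, _ in ticks); sxy = sum(x * l for x, l in ticks)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda px: slope * px + intercept

# e.g. tick "0" centred at pixel 100, "10" at pixel 300, "20" at pixel 500
to_lon = pixel_to_lon([(100, 0.0), (300, 10.0), (500, 20.0)])
print(to_lon(200))  # ~5.0 (up to float rounding)
```

Using a least-squares fit over all recognized ticks, rather than just two of them, gives some robustness against individual OCR misreads of tick labels.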

Table 1
Title — input image (top) and OCR output (bottom).

Input:  Forecast for NO2. Last analysis time: 20121206_00 Concentration, μg N/m3, 08Z06DEC2012
Output: Forecast for NO2. Last analysis time: 20121206_00 Concentration, μg N/m3, 08Z06DEC2012
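Titles such as the one in Table 1 embed the date/time in provider-specific formats (e.g. "20121206_00" and "08Z06DEC2012"). The trial-and-error strategy described in this section can be sketched as follows; the candidate format list is an assumption for illustration, since each provider may require its own entries:

```python
from datetime import datetime

# Candidate date/time formats observed in heatmap titles; this list is
# illustrative -- a real deployment would extend it per provider.
CANDIDATE_FORMATS = [
    "%Y%m%d_%H",       # e.g. 20121206_00
    "%HZ%d%b%Y",       # e.g. 08Z06DEC2012
    "%Y-%m-%d %H:%M",  # e.g. 2012-12-06 08:00
]

def parse_title_datetime(token):
    """Try each known format in turn; return the first successful parse."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(token, fmt)
        except ValueError:
            continue
    return None  # no known format matched

print(parse_title_datetime("08Z06DEC2012"))  # 2012-12-06 08:00:00
print(parse_title_datetime("20121206_00"))   # 2012-12-06 00:00:00
```

The first format that parses without error wins, which is a simple form of the "trial-by-error and maximum likelihood" strategy mentioned below.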

5.2.2.1. Title. The title usually contains the name of the environmental aspect, the measurement units and the date/time. The measurement units are usually standard for a given environmental aspect, and therefore we do not attempt to extract them. The date/time is considered the most complex element, given that it appears in several different formats, which need to be handled separately using a trial-and-error and maximum likelihood strategy. In order to correct possible errors in the textual format of the month, day and aspect, we apply the following procedure:

1) Construct manually three English vocabularies, which are used as ground truth datasets. These vocabularies hold all the possible values of the aforementioned elements, that is the month (e.g. January, Jan.), the day (e.g. Monday, Mon.) and the environmental aspect (e.g. O3, ozone);
2) Split the text returned by OCR into words;
3) Compare the words returned from OCR with each one of the manually constructed English ground truth sets using the Levenshtein

Table 2
Color scale — input image (top) and OCR with text position output (bottom), expressed in pixel coordinates (horizontal and vertical positions, with upper-left origin (0,0)).

Input values: 0.1  0.2  0.4  0.8  1.5  2.5  4  7  15  25

Position (left, top, right, bottom): 47, 5, 90, 30 — value: 0,1
Position (left, top, right, bottom): 149, 5, 195, 30 — value: 0,2
Position (left, top, right, bottom): 248, 5, 294, 30 — value: 0.4
Position (left, top, right, bottom): 347, 5, 393, 30 — value: 0,8
Position (left, top, right, bottom): 452, 5, 495, 30 — value: 1,5
Position (left, top, right, bottom): 548, 5, 594, 30 — value: 2,5
Position (left, top, right, bottom): 662, 5, 681, 24 — value: 4
Position (left, top, right, bottom): 764, 5, 780, 30 — value: 7
Position (left, top, right, bottom): 857, 5, 891, 31 — value: 15
Position (left, top, right, bottom): 953, 5, 990, 30 — value: 25
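The vocabulary matching used in this correction procedure can be sketched as follows; this is a minimal illustration with abbreviated vocabularies (the entries shown are examples, not the full ground truth sets, which also hold abbreviations such as "Jan." and aliases such as "ozone"):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Abbreviated ground-truth vocabularies (illustrative subsets only).
MONTHS = ["January", "February", "March", "December"]
ASPECTS = ["O3", "NO2", "SO2", "PM10"]

def correct(word, vocabulary):
    """Replace an OCR token with the closest vocabulary entry."""
    return min(vocabulary, key=lambda v: levenshtein(word.lower(), v.lower()))

print(correct("Decemher", MONTHS))  # December
print(correct("N02", ASPECTS))      # NO2
```

Choosing the vocabulary entry with the minimum edit distance implements the final correction step; ties are broken by vocabulary order in this sketch.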


Table 3
Coordinates of x axis — input image (top) and OCR (with position) output (bottom), expressed in pixel coordinates (horizontal and vertical positions, with upper-left origin (0,0)).

Input values: 6E  9E  12E  15E  18E  21E  24E  27E  30E

Position (left, top, right, bottom): 18, 3, 44, 24 — value: 6E
Position (left, top, right, bottom): 147, 3, 173, 24 — value: 9E
Position (left, top, right, bottom): 273, 3, 311, 24 — value: 1iE
Position (left, top, right, bottom): 401, 3, 441, 24 — value: 16E
Position (left, top, right, bottom): 531, 3, 570, 24 — value: 1AE
Position (left, top, right, bottom): 659, 2, 684, 24 — value: 21
Position (left, top, right, bottom): 689, 2, 702, 24 — value: E
Position (left, top, right, bottom): 788, 2, 831, 24 — value: 24E
Position (left, top, right, bottom): 918, 2, 960, 24 — value: 2?E
Position (left, top, right, bottom): 1050, 3, 1088, 24 — value: 3u£

distance metric (Levenshtein, 1966); the Levenshtein distance is a string metric for measuring the difference (distance) between two sequences (words in our case); specifically, it is calculated as the minimum number of single-character edits required to change one word into the other;
4) Correct the initial OCR result by adopting the word from the ground truth dataset that has the minimum distance from it.

In the specific example of Fig. 1, the OCR module recognized the date/time and aspect parameters correctly and thus no corrections were required. The information obtained from the title is the following: date/time: 2012-12-06 08:00:00, aspect: NO2.

5.2.2.2. Color scale. The color scale holds the mapping between color variations in the map and aspect values. The extraction of information from the color scale is a two-step procedure. During the first step, the results of OCR (i.e. values and positions) are corrected, while in the second we associate values with colors. Regarding the first step, it should be noted that in case the scale values change in a linear way, the most common difference among them is calculated and the scale values are adapted accordingly. Otherwise, we do not proceed with such adaptations, since it is possible that the resulting values would not be correct. The information regarding the linearity of the color scale values is provided by the administrative user through the AnT tool. Then, the correlation of values to colors is achieved by taking into consideration the orientation of the color scale and by using the pixel coordinates given by OCR. In the specific example, the values 0.8–1.5 are mapped to the color found at the (268, 447) coordinates of the initial image. It should be noted that since the scale values do not increase in a linear way, no attempt is made to modify them.

5.2.2.3. X and Y axes. Similar processing techniques are applied to the x and y axes, since they both represent the geographical coordinates of the map. Specifically, at least two points of the map (giving two distinct geographical coordinates), as well as their positions with respect to the map's raster (giving two distinct pixel coordinates), need to be resolved in order to successfully identify all the point coordinates through a geographical bearing extrapolation procedure. The procedure

Table 4
Coordinates of y axis — input image (left) and OCR with position output (right), expressed in pixel coordinates (horizontal and vertical positions, with upper-left origin (0,0)).

Input values: 70N  68N  66N  64N  62N  60N  58N  56N  54N

Position (left, top, right, bottom): 36, 69, 77, 86 — value: TON
Position (left, top, right, bottom): 39, 168, 77, 189 — value: WN
Position (left, top, right, bottom): 39, 270, 78, 287 — value: E4N
Position (left, top, right, bottom): 39, 369, 78, 387 — value: G4M
Position (left, top, right, bottom): 38, 467, 78, 489 — value: &2N
Position (left, top, right, bottom): 39, 570, 77, 587 — value: 60N
Position (left, top, right, bottom): 36, 668, 78, 690 — value: & N
Position (left, top, right, bottom): 35, 770, 78, 788 — value: 56N
Position (left, top, right, bottom): 36, 885, 77, 891 — value: −c4fl
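Once two reference points have been resolved (for the Fig. 1 example, the text reports that pixels (162, 130) and (292, 164) correspond to (9°E, 70°N) and (18°E, 68°N)), every other pixel's coordinates follow by linear extrapolation. A minimal sketch, assuming a linear (equirectangular) relation between pixels and degrees:

```python
def make_pixel_to_geo(p1, g1, p2, g2):
    """Build a pixel -> (lon, lat) mapping from two resolved keypoints.

    p1, p2: (x, y) pixel coordinates; g1, g2: (lon, lat) in degrees.
    Assumes a linear (equirectangular) projection, as in the example map.
    """
    (x1, y1), (x2, y2) = p1, p2
    (lon1, lat1), (lon2, lat2) = g1, g2
    deg_per_px_x = (lon2 - lon1) / (x2 - x1)
    deg_per_px_y = (lat2 - lat1) / (y2 - y1)  # negative: y grows downward

    def pixel_to_geo(x, y):
        return (lon1 + (x - x1) * deg_per_px_x,
                lat1 + (y - y1) * deg_per_px_y)

    return pixel_to_geo

# Keypoints taken from the Fig. 1 example discussed in the text.
to_geo = make_pixel_to_geo((162, 130), (9.0, 70.0), (292, 164), (18.0, 68.0))
print(to_geo(162, 130))  # (9.0, 70.0)
print(to_geo(292, 164))  # approximately (18.0, 68.0)
```

Non-linear projections (conical, polar stereographic) would need the deprojection machinery described in Section 5.3 instead of this linear sketch.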


followed includes again two steps: a) correction of the errors produced by OCR and b) use of the coordinates' elements. In a similar way to the color scale processing, in order to correct the values on both axes we estimate the most common difference among the axis values and adjust the others accordingly, since the values in this case change in a linear way. For the specific example of Fig. 1, after correcting the OCR results, we associated the geographical coordinates (9°, 70°) and (18°, 68°) with the image map pixels (162, 130) and (292, 164) respectively. It should be noted that for this specific site a lot of processing and several assumptions were required, since the OCR results for the coordinate axes (especially for the y axis) were not satisfactory.

5.3. Heatmap processing module

In this section, we present the heatmap processing module, which extracts data from different models and coordinate systems. This is realized by the AirMerge engine, a complex processing framework whose primary purpose is the extraction of environmental data from heatmaps, using image segmentation, scraping and processing algorithms. Even though it was initially designed for the extraction of chemical weather forecasting data, its methodology is generalizable to any type of heatmap, provided that it can be algorithmically processed. In addition, AirMerge implements auxiliary functionality such as automatic harvesting of heatmaps, batch processing of large numbers of heatmaps, and persistence of processing results (database storage). The most important component of AirMerge, a derivative of which is also reused (under license) by the proposed framework, is the AirMerge Core Engine, which performs the conversion from image data (heatmaps) to numerical gridded data. The functionality and performance of this engine have been described in Epitropou et al. (2011, 2012) and Karatzas et al. (2011).
The Core Engine extracts data from heatmaps using a processing chain that consists of two main procedures: a) the screen scraping procedure, where raw RGB pixel data are extracted from heatmaps and classified according to a color scale in order to be mapped to ranges of numerical values; this procedure concludes with a linear deprojection phase, where the image's raster is interpreted as a geographical grid in a specified geographical projection, centered on reference keypoints; and b) the reconstruction of missing values and data gaps procedure, which deals with noisy elements on heatmaps.

5.3.1. Screen scraping procedure

This step handles the cropping of the original image to a region of interest and its parsing into a 2D data array directly mapped to the original image's pixels. It also deals with the association of each color with the minimum/maximum value ranges of the air pollutant concentration levels, which is often implied by the color scale associated with the original image. It should be noted that the information about where to crop, where each color on the legend is, to which index it should correspond, etc. is provided by the configuration template of the AnT tool in the proposed system. In this phase, the mapping of the image's raster to a specific geographical grid is performed, since the images themselves represent a geographical region. The configuration options of AirMerge allow for choosing between the most commonly encountered geographical projections (equirectangular, conical, polar stereographic etc.) and for choosing keypoints in the image to allow precise pixel-coordinate mapping. While the selection of keypoints is performed manually when using AirMerge as a standalone tool, in the proposed work this functionality is realized automatically with the aid of the "text processing and extraction" module.

5.3.2. "Reconstruction of missing values and data gaps" procedure

This step is introduced to deal with unwanted elements such as legends, text, geomarkings and watermarks, as well as regions that are


not part of the forecast area, which might be present after the screen scraping phase. The image pixels are classified into three main categories: valid data (with colors that satisfy the color scale's classification), invalid data (with colors not present in the color scale), and regions containing colors that are explicitly marked for exclusion, which are considered void during processing. Such marked regions are not considered part of the forecast, and thus do not undergo data correction. However, regions containing unmarked invalid data are considered regions with correctable errors or "data gaps", which can be filled in. This distinction is due to their different appearance patterns: void regions are usually extended and continuous (e.g. sea regions not covered by the forecast, but present on the map), while invalid data regions are usually smaller but more noticeable (e.g. lines, text, watermarks etc.) and exhibit more noise-like patterns, and thus it is more compelling to remove them by using gap-filling techniques. These techniques include traditional grid and pattern-based interpolation techniques using neural networks.

In order for the Core Engine module to function, it must be guided through all the relevant details of the heatmap (position, dimension, colors, geographical projection etc.). Normally, this is achieved via an XML scripting subsystem, which is used as AirMerge's configuration template. Each distinct type of heatmap needs its own scripting/configuration file, although similar heatmaps can use the same configuration with no or only minor variations. Generally speaking, whenever a new source of environmental heatmaps is added to AirMerge's list of tasks, a new configuration template/script (using XML syntax) must be created by hand, though it is possible to partially customize this template so that a series of templates will be automatically produced from it.
For example, the pattern of the URLs used by a model provider to publish their heatmaps can be encoded in the template and used to automatically produce variations of the template for only the parts that vary: e.g. the resolution may be constant for all images from a given provider, but the color scale may differ for every available pollutant, and there might be several different time series available (e.g. 48 or 72 h) for the same pollutant and region. The proposed framework aims at automating the creation of these configuration scripts, which can be quite time consuming and requires technical skills. Thus, any comparisons are drawn primarily between the accuracy achievable by a technically skilled human operator, who knows how to classify heatmaps and create appropriate scripts, and a semi-automated system, which instead relies on cues contained in the heatmaps themselves and on guidance by environmental operators who do not possess technical skills.

5.4. Comparison of the proposed system and AirMerge

Given that both the proposed framework and the AirMerge component can be employed to perform the same task, it is useful to list their advantages and disadvantages, in order to clarify which limitations of AirMerge the proposed architecture attempts to overcome and which errors are introduced when limiting user intervention. The advantages of using a manually configured system (i.e. AirMerge) are that, in general, it is a very accurate system if spot-on information

Table 6
OCR error in websites.

             Longitude                                        Latitude
Site         Original degrees  Estimation  Absolute error     Original degrees  Estimation  Absolute error
FMI Pollen   5                 4.98775     0.01225            5                 4.98404     0.01596
FMI          5                 4.97516     0.02484            5                 4.96523     0.03477
GEMs         5                 5.0634      0.0125             5                 4.9776      0.0045
LAPS         2                 1.9924      0.004              1                 1.0236      0.023
AOPG         2                 1.9937      0.003              1                 0.9958      0.004

(i.e. latitude and longitude lines and their values) is available, and that it allows a skilled operator to detect optimizations and cues that are difficult for an automated system to realize, e.g. template redundancy and reuse ("master templates"), the use of unusual map projections, images with little or no geographical cues etc. However, the main disadvantage is that the manual configuration of the system is a laborious, time consuming and error prone task, while specific expertise and technical skills are required. On the other hand, the proposed framework further automates the data extraction procedure from heatmaps by relieving human operators from the tedious task of manual configuration, and allows usage by administrative users (i.e. environmental experts) who do not have technical skills. However, this automation does not come without a cost, since error may be introduced during the second module, which includes the OCR and coordinate mapping step. Although both systems have pros and cons, they could serve different application needs. For instance, the proposed framework could be more useful for administrative environmental experts without technical skills, while AirMerge could certainly be used by technically qualified personnel to provide quality measurements. Table 5 contains a brief overview of the advantages and disadvantages of the manually and the semi-automatically configured systems.

6. Results

The evaluation of the framework is carried out in two steps with different focuses. The first step evaluates the text extraction and processing module (i.e. OCR and text processing using heuristic rules) by presenting a visual assessment of the output. The final XML output of the system (i.e. mapping of geographic coordinates to forecast values) is not presented, since its visual presentation is not informative compared to the reconstructed image, which derives from this representation and is more appropriate for visual inspection. The second step presents a direct comparison of the results of the proposed framework with those of the AirMerge system, as well as with the numerical values obtained from the corresponding forecast models.

6.1. Qualitative evaluation

The tests performed during this step focus on the recognition of the x and y axes and evaluate the mapping of pixels to

Table 5
Advantages and disadvantages of manually and semi-automatically configured systems.

Manually configured system (AirMerge)

Advantages:
• Potentially very accurate, if spot-on information is available
• A technically skilled operator can detect optimizations and cues that are difficult for an automated system to realize, e.g. template redundancy and reuse ("master templates")

Disadvantages:
• Creating proper templates is laborious and error prone
• Incorrect assumptions by part of the operator can lead to sub-optimal templates
• The template configuration requires technical skills and cannot easily be used by environmental experts

Semi-automatic configured system (framework)

Advantages:
• Relieves human operators from a potentially tedious task
• Significant step towards the creation of completely automated systems
• Can automatically deal with unknown/unlisted types of heatmaps
• Usable in a completely automated service

Disadvantages:
• Certain types of heatmaps do not contain enough cues for an automated system to completely analyze without manual intervention
• Introduction of error during the geographical mapping procedure


Fig. 5. Original image retrieved from Pollen FMI site representing the fraction of birch.

geographical coordinates. Given the fact that in this case we are not aware of the original forecast data that were used for constructing the heatmap values, we assess the results by visual comparison of the original image and the one produced by the proposed framework. The images tested are extracted from the following sites:

• FMI Pollen, Pollen Finnish Meteorological Institute site (http://pollen.fmi.fi). It contains forecast measurements for several types of pollen, such as birch and grass, for Europe in general.
• FMI, SILAM Finnish Meteorological Institute site (http://silam.fmi.fi/). It contains forecast measurements for several air pollutants, such as nitrogen oxides and fine particles, for Europe and the Northern European countries.

Fig. 6. Reconstructed image.

• GEMS, Global and regional Earth-system Monitoring using Satellite and in-situ data project site (http://gems.ecmwf.int/d/products/raq/). It contains outputs from several state-of-the-art chemistry and transport models for Europe.
• LAPS, Laboratory of Atmospheric Physics of the Aristotle University of Thessaloniki site (http://lap.physics.auth.gr/forecasting/airquality.htm). It contains regional air quality forecasts for Greece.
• AOPG, Atmospheric and Oceanic Physics Group site (http://www.fisica.unige.it/atmosfera/bolchem/MAPS/). It presents the results of the BOLCHEM numerical model, which simulates the composition of the atmosphere over Italy.

Fig. 7. Original image retrieved from FMI site representing the NO2 forecast concentration from 500 m using SILAM model.


position of the coordinates on the heatmap to define the pixel coordinate matching. In the following, we present the results for each website in detail.

6.1.1. FMI Pollen website

Fig. 5 is the original image retrieved from the site and represents the fraction of birch (%). Fig. 6 depicts the reconstructed image produced by the proposed system after visualizing the XML output. The reconstructed figure is almost identical and, in addition, any noise (e.g. black lines) was removed. Moreover, the absolute error for both the latitude and the longitude step is very low (around 0.3%).

Fig. 8. Reconstructed image.

Table 6 contains the error introduced by the text extraction and processing module (called "absolute error") during the process of recognizing the values and the positions of the horizontal (latitude) and vertical (longitude) axes. This error is introduced mostly due to the inability of OCR to perfectly identify the position of each coordinate on the map axes. It is calculated as the average difference between the OCR estimation (e.g. 4.98775 in the first line) and the initial degree range (e.g. 5 in the first line) for two consecutive lines, and represents the error in the latitude and longitude step (i.e. the difference between two subsequent degrees on the map). It should be noted that the pixel coordinate matching depends on how well the OCR recognizes the position of each coordinate axis value and on how well this value is aligned with the coordinate lines (or ticks). Since several heatmaps do not include grid lines, this approach relies only on the

6.1.2. FMI website

In the case of the FMI site, based on visual assessment, the reconstructed image (Fig. 8) is almost identical to the initial one (Fig. 7). The original image depicts NO2 forecast concentrations for a height of 500 m, as estimated by the SILAM model. The absolute geo-coordinate error is very low (around 0.6%) for both latitude and longitude, and thus the error introduced by OCR is not significant.

6.1.3. GEMs website

The images capture O3 forecast concentrations using the EURAD-IM model. Figs. 9 and 10 depict the original image and the image reconstructed by the AirMerge system, which are almost identical; any noise (e.g. black lines) is removed. In both cases the error is very low.

6.1.4. LAPS site

In a similar way, we present the initial and the reconstructed image of this website in Figs. 11 and 12. The results are reported in Table 6 and the error is again very small. The original image was produced using the Fifth Generation Penn State/NCAR Mesoscale Model, MM5, and the Eulerian photochemical air quality model CAMx, and represents the maximum concentration of NO2.

Fig. 9. Original image from GEMS site representing the O3 forecast concentration using the EURAD-IM model.


Fig. 10. Reconstructed image.

6.1.5. AOPG site

The results for the last provider report an average error of 0.35%. The initial and the reconstructed maps are illustrated in Figs. 13 and 14. It should be noted that the white region in Fig. 13 is treated as "void space" in Fig. 14, and considered a distinct case from the national border

Fig. 12. Reconstructed image.

Fig. 11. Original image from LAPS site representing the maximum forecast concentration of NO2 using the Fifth Generation Penn State/NCAR Mesoscale Model, MM5, and the Eulerian photochemical air quality model CAMx.


Fig. 13. Original image from AOPG site representing the forecast concentration of PM10 using the BOLCHEM model.

lines, which are instead treated as unwanted noise and filled in. Regarding the original image, it represents the concentration of the PM10 pollutant as predicted by the BOLCHEM model.

6.2. Quantitative evaluation

The quantitative evaluation focuses on comparing the performance of the AirMerge system and the proposed framework against the real numerical data. This is realized by comparing the data reconstructed from the published images by both systems with the original forecast

data as produced by the forecast model. In this way, we can calculate more accurately, compared to the first evaluation step, how significant the error introduced by OCR is, as well as the quality of the final results. The tests are performed on a set of 108 images extracted from the FMI site4, and the data reconstructed from these images are compared with the original data provided by FMI in a NetCDF format file. The images are selected so that multiple air pollutants and times/dates are covered. The selection of diverse input data aims at retrieving a variety of images and thus testing the systems with input images that are as different as possible. Specifically, the dataset is created using the following restrictions:

• 6 pollutants are handled (i.e. CO, NO, NO2, PM10, PM2.5 and SO2)
• 3 h per day (i.e. 8:00, 16:00, 24:00)
• 6 days, a weekend and 4 weekdays, were selected
• Surface height was used exclusively.
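These restrictions determine the dataset size: 6 pollutants × 3 times of day × 6 days = 108 heatmaps, i.e. 18 images per pollutant as reported in Table 7. A trivial sanity check:

```python
pollutants = ["CO", "NO", "NO2", "PM10", "PM2.5", "SO2"]
hours = ["08:00", "16:00", "24:00"]
days = 6  # a weekend (2 days) plus 4 weekdays

total_images = len(pollutants) * len(hours) * days
print(total_images)                      # 108 images in total
print(total_images // len(pollutants))   # 18 images per pollutant
```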

Table 7 contains the following results for each pollutant separately: a) the number of images, b) the absolute average latitude and longitude step differences, which indicate the error introduced by the proposed framework per 5° on each axis, c) the average percentage of pixels with correct values (i.e. compared with the original numerical values produced by the SILAM model), d) the average error (er) introduced in each pixel by AirMerge (AM) during data extraction from the heatmaps (the mathematical formula for er, which is presented later, is based on the formula used for estimating the relative error), e) the average error (er) introduced in each pixel by the framework (FW) due to OCR and the resulting misalignment of the coordinates, f) the mean squared error per pixel for

Fig. 14. Reconstructed image.

4

http://silam.fmi.fi/AQ_forecasts/Regional_v4_9/index.html.


Table 7
Results comparing the proposed framework and AirMerge system with the original numerical values produced by SILAM model.

Pollutant                                                  CO       NO2      NO       PM10     PM2.5    SO2      Total
Number of images                                           18       18       18       18       18       18       108
Latitude step difference between AM and FW                 8.72 · 10−4 (common to all pollutants)
Longitude step difference between AM and FW                1.33 · 10−4 (common to all pollutants)
Average percentage of pixels without error in value (AM)   74.9%    83.2%    89.7%    85.6%    86.6%    77.2%    82.9%
Average percentage of pixels without error in value (FW)   72.1%    76.4%    89.6%    80.3%    81.5%    69.5%    78.3%
Average error per pixel (AM)                               19.857   0.283    0.025    0.712    0.622    0.188    3.490
Average error per pixel (FW)                               20.566   0.3426   0.029    0.831    0.717    0.238    3.657
RMSE per pixel (AM)                                        36.218   0.473    0.219    1.156    0.922    0.454    6.574
RMSE per pixel (FW)                                        38.497   0.638    0.250    1.520    1.193    0.618    7.120

the AirMerge (AM) system and g) the root mean squared error of AM and FW, respectively, which has the same units as the quantity being estimated. The error er is calculated as:

er = (1/n) · Σ_{i=0}^{n} |(v_i − ev_i) / v_i|,

where n is the total number of pixels, v_i is the original datum at the specific geographical coordinates, and ev_i is the value produced by AirMerge with manual configuration, or by the proposed framework, for those coordinates. The mathematical formula for the Mean Squared Error (MSE) is:

MSE = (1/n) · Σ_{i=0}^{n} (v_i − ev_i)²,

where n, v_i and ev_i stand for the same parameters as in the error er. Finally, the Root Mean Squared Error (RMSE) is defined as the square root of the MSE:

RMSE = √MSE = √[ (1/n) · Σ_{i=0}^{n} (v_i − ev_i)² ].

Based on Table 7, it is evident that the latitude and longitude errors are common to all pollutants, since the map types considered are similar. They are both quite low, and thus the proposed framework could identify rather well the positions of the horizontal and vertical lines on the map. The percentage of pixels with correct values is satisfactory for both systems, with that of the framework being slightly lower due to the OCR error. Regarding the error introduced in each pixel value, it is in general quite low, except for the case of CO, where the error is higher. This is probably due to the fact that the values between sequential pixels varied more coarsely compared to the other cases, a phenomenon which was also observed in Epitropou et al. (2012) and is attributable to the use of a linearly spaced but coarse and sparse color scale, as well as to the higher average magnitude of the observed values. The same applies to the MSE and RMSE errors. In general, it is evident that the proposed framework introduces an additional error to the original values compared to AirMerge. However, the error introduced is not significant, and shows that a manually configured extraction system could be substituted by a semi-automatic one, which could facilitate the tasks of environmental administrators.
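The three metrics follow directly from their definitions; a minimal sketch with plain Python lists standing in for the per-pixel grids (toy values, not the paper's data):

```python
from math import sqrt

def relative_error(v, ev):
    """Average per-pixel relative error er, as defined above."""
    return sum(abs((vi - evi) / vi) for vi, evi in zip(v, ev)) / len(v)

def mse(v, ev):
    """Mean squared error over the pixel grid."""
    return sum((vi - evi) ** 2 for vi, evi in zip(v, ev)) / len(v)

def rmse(v, ev):
    """Root mean squared error: same units as the estimated quantity."""
    return sqrt(mse(v, ev))

# Toy example: original model values v and values ev recovered from a heatmap.
v = [10.0, 20.0, 40.0, 80.0]
ev = [10.0, 25.0, 30.0, 80.0]
print(relative_error(v, ev))  # 0.125
print(rmse(v, ev))
```

Note that er is undefined where v_i = 0; in practice such pixels would need to be excluded or handled separately, a detail the sketch omits.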


7. Conclusions

In this paper, we have proposed a framework for environmental information extraction from air quality and pollen forecast heatmaps, combining image processing, template configuration and textual recognition components. The proposed framework overcomes the limitation of not having access to the raw data, since it only considers information in the form of heatmaps that are publicly available on the Internet, and estimates the original numerical forecast data by using the data reconstructed from the heatmaps. The evaluation revealed that the proposed semi-automatically configured system produces results very similar to those of the manually configured one (i.e. the estimated values are rather close to the original ones), since in most cases no significant error is introduced by the OCR. Potential uses of the proposed framework include supporting environmental systems that provide air quality information from several providers, for direct comparison or orchestration purposes, or decision support on everyday issues (e.g. travel planning) (Wanner et al., 2012). More generally, it provides a way to access sufficiently usable numerical environmental data for a host of applications, without requiring explicit data publishing policy changes on the part of environmental data providers, thus creating more flexibility. Future work includes evaluation with images in different projections (such as conical) and an effort to further automate the procedure. This can be achieved by applying segmentation techniques to the original image, which will result in the automatic recognition of the boundaries of its elements (heatmap, color scale, axes). In this direction, we plan to investigate and apply segmentation techniques based on rough image features (Hoenes and Lichter, 1994), on Voronoi diagrams (Kise et al., 1998) and on connected components (Bukhari et al., 2010).

Acknowledgments

This work was supported by the PESCaDO project (FP7-248594).
References

Balk, T., Kukkonen, J., Karatzas, K., Bassoukos, A., Epitropou, V., 2011. A European open access chemical weather forecasting portal. Atmos. Environ. 45, 6917–6922.

Bukhari, S., Al Azawi, M.I.A., Shafait, F., Breuel, T.M., 2010. Document image segmentation using discriminative learning over connected components. Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS '10). ACM, New York, NY, USA, pp. 183–190.

Cao, R., Tan, C., 2002. Text/graphics separation in maps. In: Blostein, D., Kwon, Y.-B. (Eds.), Fourth IAPR Workshop on Graphics Recognition. Lecture Notes in Computer Science, vol. 2390. Springer, Berlin, pp. 167–177.

Chang, S., Jiang, W., Yanagawa, A., Zavesky, E., 2007. Columbia University TRECVID 2007 high-level feature extraction. Proceedings of the TREC Video Retrieval Workshop (TRECVID 07).

Chiang, Y.-Y., Knoblock, C.A., 2006. Classification of line and character pixels on raster maps using discrete cosine transformation coefficients and support vector machine. Proceedings of the 18th International Conference on Pattern Recognition, pp. 1034–1037.

Desai, S., Knoblock, C.A., Chiang, Y.-Y., Desai, K., Chen, C.-C., 2005. Automatically identifying and georeferencing street maps on the web. Proceedings of the 2005 Workshop on Geographic Information Retrieval (GIR '05). ACM, New York, NY, USA, pp. 35–38.

Epitropou, V., Karatzas, K.D., Bassoukos, A., Kukkonen, J., Balk, T., 2011. A new environmental image processing method for chemical weather forecasts in Europe. Proceedings of the 5th International Symposium on Information Technologies in Environmental Engineering, Poznan.

Epitropou, V., Karatzas, K., Kukkonen, J., Vira, J., 2012. Evaluation of the accuracy of an inverse image-based reconstruction method for chemical weather data. International Journal of Artificial Intelligence 9 (S12), 152–171.

Henderson, T.C., Linton, T., 2009. Raster map image analysis. Proceedings of the 2009 10th International Conference on Document Analysis and Recognition (ICDAR '09). IEEE Computer Society, Washington, DC, USA, pp. 376–380.

Hoenes, F., Lichter, J., 1994. Layout extraction of mixed mode documents. Mach. Vis. Appl. 7, 237–246.

Karatzas, K., 2005. Internet-based management of environmental simulation tasks. In: Farago, I., Georgiev, K., Havasi, A. (Eds.), Advances in Air Pollution Modelling for Environmental Security, pp. 253–262 (NATO Reference EST.ARW980503, 406 p.).

Please cite this article as: Moumtzidou, A., et al., A model for environmental data extraction from multimedia and its evaluation against various chemical weather forecasting datasets, Ecological Informatics (2013), http://dx.doi.org/10.1016/j.ecoinf.2013.08.003



Karatzas, K., Moussiopoulos, N., 2000. Urban air quality management and information systems in Europe: legal framework and information access. J. Environ. Assess. Policy Manag. 2 (2), 263–272.

Karatzas, K., Kukkonen, J., Bassoukos, A., Epitropou, V., Balk, T., 2011. A European chemical weather forecasting portal. In: Steyn, D.G., Trini Castelli, S. (Eds.), 31st NATO/SPS International Technical Meeting on Air Pollution Modelling and Its Application, Torino, 28 Sept. 2010. Published in Air Pollution Modeling and Its Applications XXI. Springer, NATO Science for Peace and Security Series C: Environmental Security, pp. 239–243.

Kise, K., Sato, A., Iwata, M., 1998. Segmentation of page images using the area Voronoi diagram. Comput. Vis. Image Underst. 70 (3), 370–382.

Kraaij, W., Over, P., Awad, G., 2007. TRECVID-2007 high-level feature task: overview. Online Proceedings of the TRECVID Video Retrieval Evaluation Workshop.

Kukkonen, J., Klein, T., Karatzas, K., Torseth, K., Fahre Vik, A., San José, R., Balk, T., Sofiev, M., 2009. COST ES0602: towards a European network on chemical weather forecasting and information systems. Adv. Sci. Res. J. 1, 1–7.

Kukkonen, J., Olsson, T., Schultz, D.M., Baklanov, A., Klein, T., Miranda, A.I., Monteiro, A., Hirtl, M., Tarvainen, V., Boy, M., Peuch, V.-H., Poupkou, A., Kioutsioukis, I., Finardi, S., Sofiev, M., Sokhi, R., Lehtinen, K.E.J., Karatzas, K., San José, R., Astitha, M., Kallos, G., Schaap, M., Reimer, E., Jakobs, H., Eben, K., 2012. A review of operational, regional-scale, chemical weather forecasting models in Europe. Atmos. Chem. Phys. 12, 1–87.

Levenshtein, V.I., 1966. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10, 707–710.

Michelson, M., Goel, A., Knoblock, C.A., 2008. Identifying maps on the World Wide Web. In: Cova, T.J., Miller, H.J., Beard, K., Frank, A.U., Goodchild, M.F. (Eds.), Proceedings of the 5th International Conference on Geographic Information Science (GIScience '08). Springer-Verlag, Berlin, Heidelberg, pp. 249–260.

Moumtzidou, A., Epitropou, V., Vrochidis, S., Voth, S., Bassoukos, A., Karatzas, K., Mossgraber, J., Kompatsiaris, I., Karppinen, A., Kukkonen, J., 2012a. Environmental data extraction from multimedia resources. Proceedings of the 1st ACM International Workshop on Multimedia Analysis for Ecological Data (MAED 2012), November 2, Nara, Japan, pp. 13–18.

Moumtzidou, A., Vrochidis, S., Tonelli, S., Kompatsiaris, I., Pianta, E., 2012b. Discovery of environmental nodes in the web. Proceedings of the 5th IRF Conference, Vienna, Austria, July 2–3.

Musavi, M.T., Shirvaikar, M.V., Ramanathan, E., Nekovei, A.R., 1988. Map processing methods: an automated alternative. Proceedings of the Twentieth Southeastern Symposium on System Theory. IEEE Computer Society, pp. 300–303.

Ngo, C., et al., 2007. Experimenting VIREO-374: bag-of-visual-words and visual-based ontology for semantic video indexing and search. Proceedings of the TREC Video Retrieval Workshop (TRECVID 07).

Smeaton, A.F., Over, P., Kraaij, W., 2006. Evaluation campaigns and TRECVid. Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, California, USA, pp. 321–330.

Vrochidis, S., Epitropou, V., Bassoukos, A., Voth, S., Karatzas, K., Moumtzidou, A., Mossgraber, J., Kompatsiaris, I., Karppinen, A., Kukkonen, J., 2012. Extraction of environmental data from on-line environmental information sources. Artificial Intelligence Applications and Innovations. IFIP Advances in Information and Communication Technology, vol. 382, pp. 361–370.

Wanner, L., Rospocher, M., Vrochidis, S., Bosch, H., Bouayad-Agha, N., Bugel, U., Casamayor, G., Ertl, T., Hilbring, D., Karppinen, A., Kompatsiaris, I., Koskentalo, T., Mille, S., Mossgraber, J., Moumtzidou, A., Myllynen, M., Pianta, E., Saggion, H., Serafini, L., Tarvainen, V., Tonelli, S., 2012. Personalized environmental service configuration and delivery orchestration: the PESCaDO demonstrator. Proceedings of the 9th Extended Semantic Web Conference (ESWC 2012), Heraklion, Crete, Greece.

Yuan, Y., et al., 2007. THU and ICRC at TRECVID 2007. Proceedings of the TREC Video Retrieval Workshop (TRECVID 07).
