
A Real Time Data Extraction, Transformation and Loading Solution for Semi-structured Text Files

Nuno Viana¹, Ricardo Raminhos¹, and João Moura-Pires²

¹ UNINOVA, Quinta da Torre, 2829-516 Caparica, Portugal
{nv,rfr}@uninova.pt
http://www.uninova.pt/ca3
² CENTRIA/FCT, Quinta da Torre, 2829-516 Caparica, Portugal
[email protected]
http://centria.di.fct.unl.pt/~jmp

Abstract. For the past decades, users of space applications have relied on custom-developed software tools capable of addressing short-term needs during critical Spacecraft control periods. Advances in computing power and storage solutions have made possible the development of innovative decision support systems. These systems are capable of providing high quality integrated data to both near real-time and historical data analysis applications. This paper describes the implementation of a new approach for a distributed and loosely coupled data extraction and transformation solution capable of extracting, transforming and loading relevant real-time and historical Space Weather and Spacecraft data from semi-structured text files into an integrated space-domain decision support system. The described solution takes advantage of XML and Web Service technologies and is currently running in an operational environment at the European Space Agency as part of the Space Environment Information System for Mission Control Purposes (SEIS) project.

1 Introduction

The term “Space Weather” (S/W) [1, 2] represents the combination of conditions on the sun, solar wind, magnetosphere, ionosphere and thermosphere. Space Weather is not only a driver for earth’s environmental changes but also plays a main role in the performance and reliability of orbiting Spacecraft (S/C) systems. Moreover, degradation of sensors and solar arrays or unpredicted changes in the on-board memories can often be associated with S/W events. An integrated solution containing Space Weather data and specific S/C onboard measurements would allow online and post-event analysis, thus increasing the S/C Flight Controllers’ ability to react to unexpected critical situations and, indirectly, enhancing the knowledge about the dynamics of the S/C itself. Although important, this integrated data service is currently unavailable. At best, some sparse data sub-sets exist on public Internet sites hosted at different locations and using distinct data formats. Therefore, collecting all the relevant information, transforming it and interpreting it correctly is a time-consuming task for a S/C Flight Controller.

C. Bento, A. Cardoso, and G. Dias (Eds.): EPIA 2005, LNAI 3808, pp. 383–394, 2005. © Springer-Verlag Berlin Heidelberg 2005


To provide such capabilities, a decision support system architecture was envisaged: the Space Environment Information System for Mission Control Purposes (SEIS) [3, 4], sponsored by the European Space Agency (ESA). The main goal of SEIS is to provide accurate real-time information about the ongoing Space Weather conditions and Spacecraft onboard measurements, along with Space Weather predictions (e.g. radiation level predictions). This platform provides distinct application services based on historical and near real-time data, supported by a common database infrastructure.

This paper details the Data Processing Module (DPM) used in SEIS, with a special focus on its extraction and transformation component, the Uniform Data Extractor and Transformer (UDET). The DPM is responsible for the retrieval of all source files (semi-structured text files) from external data service providers, the extraction of “raw” data and its further transformation into a usable format.

Extensive research work has also been accomplished in both conceptual Extraction, Transformation and Loading (ETL) modeling [5] and demonstrative prototypes [6, 7]. Given the number of existing commercial¹ and open source² ETL tools, the first approach towards solving the specific data processing problem in SEIS was to identify which ETL tools could potentially be re-used. Unfortunately, after careful assessment, it became clear that the existing solutions usually required the development of custom code in order to define the specificities of the extraction, transformation and loading procedures in near real-time. Due to the high number of Provided Files (please refer to Table 1 for a list of data service providers and files) and their heterogeneity in terms of format, it was not feasible to develop custom code to address all files.
In addition, gathering the entire file processing logic at implementation level would raise severe maintainability issues (any maintenance task would require modifying source code). Also, the analyzed tools did not provide cache capabilities for data that, although received from different files, referred to the same parameter (for these files, duplicate entries must be removed, not propagated forward as the analyzed solutions suggested). The only option was therefore to develop a custom but generic data processing mechanism to solve the problem of processing data from remote data service providers into the target databases, while also taking into account scalability, maintainability and the possible reuse of the resulting solution in other projects.

This paper addresses the design and development of the data processing solution (with a special focus on the extractor and transformer component), which fulfils the previously mentioned requisites. The paper is organized in five sections. The first section (the current one) describes the motivation behind the data processing problem in the frame of the SEIS project, as well as the paper’s focus and contents. Section two highlights the SEIS architecture, focusing mainly on the Data Processing Module. The third section is dedicated to the UDET component and presents a comprehensive description of how files are effectively processed. Section four provides the reader with an insight into the technically innovative aspects of UDET and, finally, section five provides a short summary of the achieved results and guidelines for future improvements to the UDET component.

¹ IBM WebSphere DataStage (http://www.ascential.com/products/datastage.html)
  SAS Enterprise ETL Server (http://www.sas.com/technologies/dw/etl/index.html)
  Data Transformation Services (http://www.microsoft.com/sql/evaluation/features/datatran.asp)
  Informatica PowerCenter (http://www.informatica.com/products/powercenter/default.htm)
  Sunopsis ELT (http://www.sunopsis.com/corporate/us/products/sunopsis/snps_etl.htm)
² Enhydra Octopus (http://www.octopus.objectweb.org/)
  BEE Project (http://www.bee.insightstrategy.cz/en/index.html)
  OpenDigger (http://www.opendigger.org/)

Table 1. List of available data service providers, number of provided files and parameters

Data Service Provider                                                             Type           Provided Files  Provided Parameters
Wilcox Solar Observatory                                                          Space Weather               1                    1
Space Weather Technologies                                                        Space Weather               1                    2
SOHO Proton Monitor data (University of Maryland)                                 Space Weather               2                    6
Solar Influences Data analysis Center                                             Space Weather               2                    5
Lomnicky Peak’s Neutron Monitor                                                   Space Weather               1                    2
National Oceanic and Atmospheric Administration/National Geophysical Data Centre  Space Weather               1                    1
National Oceanic and Atmospheric Administration/Space Environment Centre          Space Weather              35                  541
US Naval Research Laboratory                                                      Space Weather               1                    1
World Data Centre for Geomagnetism                                                Space Weather               1                    1
European Space Operations Centre                                                  Spacecraft                 19                  271
Multi Mission Module                                                              Space Weather              13                  118
Total                                                                                                        77                  949

2 Data Processing in the Space Environment Information System

This section first provides a global view of the SEIS system and then focuses on the Data Processing Module. The Uniform Data Extractor and Transformer component is thoroughly addressed in section 3.

2.1 SEIS System Architecture

SEIS is a multi-mission decision support system capable of providing near real-time monitoring [8] and visualization, in addition to offline historical analysis [3], of Space Weather and Spacecraft data, events and alarms to the Flight Control Teams (FCT) responsible for the Integral, Envisat and XMM satellites. Since the Integral S/C has been selected as the reference mission, all SEIS services, offline and online, will be available for it, while the Envisat and XMM teams will only benefit from a fraction of the services available for the Integral³ mission. The following list outlines the core SEIS services:

- Reliable Space Weather and Spacecraft data integration.
- Inclusion of Space Weather and Space Weather effects estimations generated by a widely accepted collection of physical Space Weather models.
- Plug-in functionalities for any external “black-box” data generator model (e.g. models based on Artificial Neural Networks - ANN).
- Near real-time alarm triggered events, based on rules extracted from the Flight Operations’ Plan (FOP) [9], which capture users’ domain knowledge.
- Near real-time visualization of ongoing Space Weather and Spacecraft conditions through the SEIS Monitoring Tool [10].
- Historical data visualization and correlation analysis (including automatic report design, generation and browsing) using state-of-the-art Online Analytical Processing (OLAP) client/server technology - SEIS Reporting and Analysis Tool [3].

³ Following preliminary feedback after system deployment, it is expected that other missions (XMM and Envisat), in addition to the reference one (Integral), would like to contribute additional data and therefore have access to the complete set of SEIS services.

In order to provide users with the previously mentioned set of services, the system architecture depicted in Fig. 1 was envisaged.

[Figure 1: block diagram. Numbered modules: 1. Data Processing Module, 2. Data Integration Module, 3. Forecasting Module, 4. Client Tools, 5. Metadata Module. Other labelled blocks: External Data Service Providers (a), UDAP (b), File Cache (c), UDET (d), UDOB (e), Metadata Repository, Operational Data Storage, Data Warehouse, Data Marts, 3M Engine, ANN Engine, Alarm Engine, Alarm Editor, Monitoring Tool, Reporting And Analysis Tool.]

Fig. 1. SEIS system architecture modular breakdown, including the Data Processing Module which is formed by several components: (a) External Data Service Providers, (b) Uniform Data Access Proxy (UDAP), (c) File Cache, (d) Uniform Data Extractor and Transformer - UDET (the focus of this paper) and (e) Uniform Data Output Buffer (UDOB).

As shown in Fig. 1, the SEIS architecture is divided into several modules according to their specific roles:

- Data Processing Module: Responsible for file retrieval, parameter extraction and the further transformations applied to all identified data, ensuring that it meets the online and offline availability constraints while keeping reusability and maintainability in mind (further detailed in section 2.2).
- Data Integration Module: Acts as the system’s supporting database infrastructure, providing high quality integrated data services to the SEIS client applications using three multi-purpose databases: Data Warehouse (DW) [11], Operational Data Storage (ODS) and Data Mart.
- Forecasting Module: A collection of forecast and estimation model components capable of generating Space Weather [12] and Spacecraft data estimations. Interaction with any of these models is accomplished through remote Web Service invocation, which relies on Extensible Markup Language (XML) message-passing mechanisms.
- Metadata Module: SEIS is a metadata-driven system, incorporating a central metadata repository that provides all SEIS applications with a means of accessing shared information and configuration files.
- Client Tools: The SEIS system comprises two client tools, the SEIS Monitoring Tool and the SEIS Reporting and Analysis Tool, which take advantage of the collected real-time and historical data, respectively.

2.2 Data Processing Module

As previously highlighted, one of the objectives of SEIS is to provide reliable Space Weather and Spacecraft data integration. This is not a trivial task due to the numerous data formats (from “raw” text to structured tagged formats such as HTML, the Hyper Text Markup Language) and the communication protocols involved (e.g. Hyper Text Transfer Protocol, HTTP, and File Transfer Protocol, FTP). Since SEIS has near real-time data availability requirements, the whole processing mechanism should not take longer than 5 minutes to output its results into the UDOB (the system has explicit knowledge, according to Metadata, of the data refresh interval of each remote Data Service Provider). Several factors may interfere with this time restriction: available network bandwidth, Round Trip Times (RTT), Internet Service Provider (ISP) availability, network status on the SEIS and Data Service Provider sides, the load on the remote data service providers’ services and the number of concurrent file processing requests.

Since Data Service Providers are not controlled within SEIS but by external organizations according to their internal priorities, funding allocation and even scientists’ “good-will”, server unavailability information is not accessible in advance (i.e. detection occurs only when the service actually fails and data stops being “pumped” into the data repositories). For similar reasons, the text files comprising relevant parameters contain structured data whose arrangement may evolve: as time passes, parameters may be added, deleted or updated in the file, thus making the format vary. Once again, notification about format changes is nonexistent, so changes have to be inferred by our system and/or users. To address this issue, the DPM incorporates knowledge of the active File Format Definition (FFD) applied to a given file within a specific time window.

2.3 UDAP, UDET and UDOB

As depicted in Fig. 1, the Data Processing Module is composed of three subcomponents: UDAP, UDET and UDOB. The UDAP component is responsible for the retrieval of all identified files from the different data service providers’ locations. It is able to handle remote service availability failures and to recover (whenever possible) data lost due to access unavailability. UDAP is also in charge of dealing with the Space Weather and Spacecraft data estimation outputs generated by the estimation and forecasting blocks, namely the Mission Modeling Module (3M) block and the ANN models, through data files that capture the models’ outputs. Communication with these components is achieved using Web Service interfacing layers developed between UDAP and each of the models.



All retrieved data is afterwards stored in a local file cache repository (to ease cached file management, a simple MS Windows NT File System (NTFS) compressed file system was used), from which it is later sent for processing. By moving all data files to a local cache before performing any actual file processing, not only is a virtual file access service provided (minimizing possible problems caused by failures of external services), but the required storage space is also reduced. Since all Data Processing Module components are Metadata driven, the UDAP configuration and file scheduling definitions are stored in the centralized Metadata Repository. In addition, UDAP provides a Human Machine Interface (HMI), which allows users to issue commands such as thread “start”/“stop”, to configure UDET server instances (further discussed in the next section) and to manage the request load on external data service providers and UDET engines.

Once data has been moved locally (into UDAP’s cache), the preparation tasks needed to extract and transform the identified parameters contained in the files may be performed. After being processed by UDET, all the data is finally loaded into the UDOB temporary storage area (implemented as relational tables) and thus made available to both the ODS and the DW.
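Section 2.2 notes that the DPM knows which File Format Definition is active for a given file within a specific time window. The following is a minimal sketch of such a lookup, not the SEIS implementation: the class `FFDVersion`, the function `select_ffd` and the file names are hypothetical, and the half-open validity interval is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FFDVersion:
    """One File Format Definition and the period in which it applies (hypothetical)."""
    path: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means the format is still active


def select_ffd(versions, file_date):
    """Return the FFD whose validity window contains file_date, or None."""
    for version in versions:
        started = version.valid_from <= file_date
        # Assumption: the window is half-open, [valid_from, valid_to).
        not_ended = version.valid_to is None or file_date < version.valid_to
        if started and not_ended:
            return version
    return None


# Invented example: the provider changed its file layout on 2005-03-01.
versions = [
    FFDVersion("provider_file_v1.xml", datetime(2004, 1, 1), datetime(2005, 3, 1)),
    FFDVersion("provider_file_v2.xml", datetime(2005, 3, 1)),
]
```

With this arrangement, a file dated before the format change is still processed with the older definition, which matters when lost data is recovered after a provider outage.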

3 The Uniform Data Extractor and Transformer Component

The main goal of UDET is to process all provided files received from UDAP. These files hold textual information structured in a human-readable way. Each provided file has two associated temporal tags, start and end dates, which determine the temporal range to which the parameter values refer. These temporal tags may exist either explicitly in the file header or implicitly, being inferred from the parameter entries. Three types of parameters are available in the input files: numerical, categorical and plain text. Most of these parameter values have an associated temporal tag, although some are time independent and contain general information only.

Provided files can also be classified as real-time or summary (both types contain a temporal sliding window of data). While real-time files (e.g. available every 5 minutes) offer near real-time / estimation data for a very limited time window, summary files (e.g. available daily) offer a summary of all measures registered during that day (discrepancies between the contents of real-time and summary files are possible). Since summary data is more accurate than real-time data, summary values, whenever available, replace the real-time values previously received.

Fig. 2 presents the DPM processing pipeline from a high-level perspective, with special focus on the UDET component. After receiving a semi-structured text file from UDAP, UDET applies a set of ETL operations to that file, according to definitions stored in an external FFD file, producing a set of data chunks as a result. Each data chunk is characterized as a triplet containing a global identifier for a parameter, a temporal tag and the parameter value. The size of a data chunk varies and is closely related to the nature of the data available in the file (e.g. Space Weather and Spacecraft data are stored in different data chunks). Depending on the UDET settings, these data chunks can be delivered to different containers (e.g. in SEIS, data chunks are delivered to UDOB, a set of relational tables).
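The triplet structure and the rule that summary values replace real-time ones can be illustrated with a small sketch. It is not the UDET code: `Triplet`, `ChunkStore` and the identifier `P001` are hypothetical names, the container is an in-memory stand-in for a target such as UDOB, and keeping the first value when a real-time entry repeats an existing key is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Triplet:
    parameter_id: str    # global identifier of the parameter
    timestamp: datetime  # temporal tag
    value: object        # numerical, categorical or plain-text value


class ChunkStore:
    """In-memory stand-in for a delivery container (hypothetical)."""

    def __init__(self):
        # (parameter_id, timestamp) -> (value, came_from_summary)
        self._entries = {}

    def deliver(self, entry, from_summary=False):
        """Store an entry; return True only if the stored data changed."""
        key = (entry.parameter_id, entry.timestamp)
        current = self._entries.get(key)
        if current is None:
            self._entries[key] = (entry.value, from_summary)
            return True
        if from_summary and current != (entry.value, True):
            # Summary data is considered more accurate, so it overrides.
            self._entries[key] = (entry.value, True)
            return True
        # Repeated entry, or real-time data arriving for an existing key.
        return False

    def value(self, parameter_id, timestamp):
        return self._entries[(parameter_id, timestamp)][0]


store = ChunkStore()
stamp = datetime(2005, 1, 1, 12, 0)
store.deliver(Triplet("P001", stamp, 1.5))                     # new entry
store.deliver(Triplet("P001", stamp, 1.5))                     # repeated, ignored
store.deliver(Triplet("P001", stamp, 1.7), from_summary=True)  # summary overrides
```

Keying the store on the parameter identifier and temporal tag means a repeated entry costs one dictionary lookup and never reaches the target container.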



[Figure 2: diagram. Labelled elements: Semi Structured Text File, File Format Definition (FFD), UDET, Chunk generation, Data Chunks, Loading.]

Fig. 2. UDET’s processing model

The following sub-sections highlight UDET’s main requirements, the model employed in SEIS and how the ETL process is applied to the input files received from UDAP. Finally, UDET’s architecture is described in detail, unfolding its main components and the relations between them.

3.1 Main Requirements

Since real-time files mainly hold repeated data (when compared with the previously retrieved real-time file), only the newly added entries are stored after each file is processed. An output cache mechanism is therefore required, which considerably reduces the load on UDOB by avoiding duplicate entries from near real-time files.

In order to meet the SEIS near real-time requirement, data should not take more than 5 minutes to be processed (from the moment it is made available at the Data Provider until it reaches UDOB). The performance of the DPM is therefore fundamental, especially that of the UDET component, which is responsible for most of the computational effort within the DPM. Due to the high number of simultaneous file transfers, it is not feasible to process files sequentially. Thus, a parallel architecture is required in order to process several input files simultaneously.

As previously mentioned, Data Service Providers do not provide a notification mechanism to report changes in the format of Provided Files. Thus, UDET needs to include data quality logic, which describes the parameter data types and, possibly, ranges of valid values. Furthermore, maintenance tasks for the correction of format changes must have a minimum impact on the system architecture and code, so as not to compromise maintainability.

Finally, data delivery should be configurable so that data resulting from the extraction and transformation process can be exported into different formats (e.g. XML, Comma-Separated Values (CSV), relational tables) without being tied to implementation details that may restrict the solution’s reusability (e.g. if a solution is based on a scheme of relational tables, it should not rely directly on a specific communication protocol).

3.2 Designed Model and Developed Solution

The developed solution relies on declarative definitions that identify which operations are necessary during the ETL process, instead of implementing this logic directly at code level. These declarative definitions are stored in FFD files and their contents depend directly on the format, data and nature of the Provided File. It is therefore necessary to create a dedicated FFD for each Provided File, holding the specific ETL logic necessary to process any file belonging to that Provided File class (file format detection is currently not implemented, but is considered under the Future Work section). FFDs are stored in XML format, since this is a well-known World Wide Web Consortium (W3C) standard for which a wide range of computationally efficient tools exists. In addition, the format is human readable, enabling easy validation without resorting to a software translation tool to understand the content logic.

File Format Definition files hold six distinct types of information:

(1) General Information – Global data required to process an input file, such as the end of line character, the start and end dates for which the file format is valid (for versioning purposes), the decimal and thousand separator characters and any user comment.
(2) Section Identification – Gathers the properties responsible for composing each of the sections present in an input file (e.g. headers, user comments, data). A section can be defined according to specific properties such as absolute line delimiters (e.g. line number) or sequential and contiguous lines sharing a common property (e.g. lines that “start”, “end” or “contain” a given string). In addition, it is possible to define a relative section (“enclosed section”) as the one enclosed by two other sections, using the “Start Section After Previous Section End” and “End Section Before Next Section Start” properties.
(3) Field Identification – Contains the definitions of all existing extractable fields, where each field parameter is associated with a given file section. Fields can be of two types: “single fields” and “table fields”.
Single fields can be specified by defining upper and lower enclosing character delimiters or, alternatively, through a regular expression. Additionally, meta-information related to the field is included, such as the field name, its format and its global identifier. Table fields are specified by capturing table columns using one of several definition types, according to the file’s intrinsic format (typically, the most generic definition that best extracts the data from the columns should be chosen). Columns can be extracted based on column separators (with the option of treating consecutive separator characters as a new column or not), based on fixed column positions (definition of column breaks) or based on a regular expression. As with single fields, meta-information about the table columns is also available, such as global identifiers, column data formats, column names and missing value representations. The use of regular expressions should be limited as much as possible to advanced users acquainted with their definition (although their direct use usually results in considerable speed gains).
(4) Transformation Operations – Hold a set of sequences containing definitions of transformation operations to be applied to single and table fields, transforming the original raw data into a suitable format. A large collection of transformation operations, valid for both single and table fields, is available (e.g. date convert, column create from field, column join, column append).



(5) Data Quality – Contains information required for the validation of the data produced as a result of the extraction and transformation process. Validation is accomplished through the association of data types, data thresholds and validation rules (e.g. 0
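The declarative approach behind items (2) to (4) can be sketched as follows. This is an illustration under stated assumptions, not the UDET implementation: real FFDs are XML files whose schema is not shown in this text, so a Python dictionary with hypothetical keys (`starts_with`, `after_previous`, `separator`, `missing`) stands in for one, and the sample file and the identifier `SW_FLUX` are invented.

```python
import re
from datetime import datetime

# Hypothetical declarative definition standing in for an XML FFD.
FFD = {
    "sections": {
        "header": {"starts_with": "#"},    # contiguous lines sharing a property
        "data": {"after_previous": True},  # relative section: the remainder
    },
    "table": {
        "separator": r"\s+",               # consecutive separators count as one
        "columns": ["date", "time", "flux"],
        "missing": "-999",                 # missing value representation
    },
}

SAMPLE = """\
# Example provider file (invented)
# date time flux
2005-01-01 00:00 1.5
2005-01-01 00:05   -999
"""


def split_sections(lines, section_defs):
    """Walk the definitions in order; each section starts where the last ended."""
    sections, position = {}, 0
    for name, rule in section_defs.items():
        start = position
        if "starts_with" in rule:
            while position < len(lines) and lines[position].startswith(rule["starts_with"]):
                position += 1
        else:
            position = len(lines)  # take every remaining line
        sections[name] = lines[start:position]
    return sections


def extract_table(lines, table_def):
    """Split each non-empty line into named columns, mapping missing values to None."""
    rows = []
    for line in lines:
        if not line.strip():
            continue
        cells = re.split(table_def["separator"], line.strip())
        row = dict(zip(table_def["columns"], cells))
        rows.append({k: (None if v == table_def["missing"] else v) for k, v in row.items()})
    return rows


def to_triplets(rows, parameter_id):
    """Join the date and time columns, convert them, and emit (id, tag, value)."""
    triplets = []
    for row in rows:
        stamp = datetime.strptime(row["date"] + " " + row["time"], "%Y-%m-%d %H:%M")
        value = None if row["flux"] is None else float(row["flux"])
        triplets.append((parameter_id, stamp, value))
    return triplets


sections = split_sections(SAMPLE.splitlines(), FFD["sections"])
rows = extract_table(sections["data"], FFD["table"])
triplets = to_triplets(rows, "SW_FLUX")
```

Because the file-specific knowledge lives in the `FFD` dictionary, a change in the provider's layout would be handled by editing the definition rather than the three functions, which is the maintainability argument made in section 3.1.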
