ANNALES UMCS INFORMATICA

AI XV, 1 (2015) 7 - 11

DOI: 10.1515/umcsinfo-2015-0001

Heterogeneous Data Integration Architecture - Challenging Integration Issues

Michal Chromiak1*, Marcin Grabowiecki1

1 Institute of Computer Science, Maria Curie-Sklodowska University, pl. M. Curie-Sklodowskiej 5, 20-031 Lublin, Poland

Abstract. As of today, most data processing systems have to deal with large amounts of data originating from numerous sources. Data sources almost always differ regarding their purpose of existence; thus their model, data processing engine and technology differ intensely. Due to the current trend for systems fusion, there is a growing demand for data to be presented in a common way regardless of its legacy. Many systems have been devised as an answer to such integration needs. However, the present data integration systems are mostly dedicated solutions that bring constraints and issues when considered in general. In this paper we focus on the present solutions for data integration, their flaws originating from their architecture or design concepts, and we present an abstract and general approach that could be introduced as an answer to the existing issues. System integration is considered out of scope for this paper; we focus particularly on efficient data integration.

Keywords: grid integration model, heterogeneous integration, distributed architecture, data integration, big data, distributed transaction, warehouse, ETL, OLAP

1 Data Integration Role

Integrating distributed data and service resources is an ultimate goal of many current technologies, including distributed and federated databases, brokers based on the CORBA standard [1], Sun's RMI, P2P technologies, grid technologies, Web Services [2], Sun's JINI [3], virtual repositories [4], metacomputing federations [5, 6] and others. The distribution of resources has desirable features such as autonomic maintenance and administration of local data and services, unlimited scalability due to many servers, avoidance of global failures, support for security and privacy, etc. On the other hand, there is a need for global processing of distributed resources that treats them as a centralized repository with resource location and implementation transparency. Such integration is needed especially in domains (e.g. spatial) with large amounts of data that are required for further analysis [7] and optimisation [8] but originate from many sources. Distributed resources are often developed independently (with no central management), so with high probability they are heterogeneous, that is, incompatible concerning, in particular, local database schemas, naming of resource entities, coding of values and access methods. There are methods to deal with heterogeneity, in particular federated databases, brokers based on the CORBA standard and virtual repositories.

The integrated data context is essential for understanding the domain: the more the data is isolated from colligated data, the less informative it is. In other words, data integration plays a significant role, as a piece of the data puzzle is meaningful only when considered in a particular context. Therefore integration is crucial for modern systems. Present information systems must deal with large amounts of data; thus the big data concepts are becoming particularly important. The data itself is in most cases scattered, uses different models or query engines, is stored in different locations and is administered by independent administrators. In the scope of data integration this situation is undesirable. At present, despite many dedicated solutions for data integration (see Section 3) or global caching [9], most of them are not general enough to provide an abstract layer that could unify all of the data integration within one general solution.

The rest of the paper is organized as follows. We focus on problems arising during data integration and common models that address them in Section 2, and on existing data integration solutions and their issues in Section 3. In Section 4 we propose an abstract approach that would solve the presented issues. Section 5 concludes.

2 The Models of Integration (Enterprise Design Patterns)

Integration may be carried out using multiple techniques, utilizing each of their particular advantages where needed. However, it is believed that the asynchronous messaging technique plays an increasingly important role among the other styles. Apart from asynchronous messaging, there are other approaches that solve the same problem; each has its distinct advantages and disadvantages.


In general, the approaches to integration can be considered in order of evolution:

• shared file - one party writes to a file and another reads from it. The file location, format and access timing must be agreed between the parties using it.
• shared database - one database serves as the data source and store for many applications.
• Remote Method Invocation - one party exposes to the world an interface to its inner procedures that can later be executed from outside of this party. The communication is real-time and synchronous.
• asynchronous messaging - one party publishes messages1 to a common message channel and the other parties can read those messages from the channel at a later time (see the sketch after this list). However, the channel and the message format must be agreed.

1 A message is a data structure - such as a string, a byte array, a record, or an object. It can be interpreted simply as data, as the description of a command to be invoked on the receiver, or as the description of an event that occurred in the sender.
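As an illustration of the last style, consider the following minimal Java sketch (our own illustration, not part of the original paper; the channel and message format are hypothetical). It models a shared message channel from which a consumer reads at a later time, decoupled from the publisher:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the asynchronous messaging style: the parties agree
// only on the channel and the message format (here: a plain String).
public class MessageChannelDemo {

    // The shared channel; a real system would use a message broker.
    static final BlockingQueue<String> channel = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Publisher: enqueues a message and immediately continues its work.
        channel.put("INSERT customer #42");

        // Consumer: reads the message at a later time, independently.
        Thread consumer = new Thread(() -> {
            try {
                String message = channel.take(); // blocks until available
                System.out.println("processing: " + message);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        consumer.join();
    }
}
```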

3 The Collation of Issues With Data Integration Solutions

Current research in the field of database integration is focused on detailed areas, without a general approach that would be a complete (i.e. independent, extensible by design) and tested concept. The solutions that we propose in this paper bring the flexibility and the abstraction layer enabling unrestricted design for networking and schema customization.

The existing research in the domain of integration of heterogeneous databases has provided a diverse array of solutions. The main heterogeneous DB integration issues addressed in this research field were briefly classified back in 1995 in [10] and [11]; however, some attempts took place even earlier, in 1983 [12]. At present, the main area of interest regarding integration techniques is based on XML solutions. It is believed, as mentioned in [13], that XML has become the indisputable standard both for data exchange and content management, and moreover that it is about to become the lingua franca of data interchange. This point of view is justified by an immense number of research papers, like [14, 15, 16] and many more, trying to utilize XML as the tool for database schema and data representation. XML has also become part of many commercial products supported by giants of the software industry like Microsoft [17], Oracle [18] or IBM.

Another aspect of the significance of integration research should be considered in the field of enterprise. Since 1992 [19] the concept of the data warehouse has been developed, and database vendors have rushed to implement functionalities for constructing data warehouses. That made on-line analytical processing (OLAP) emerge as a technology for decision support systems. However, problems may arise in building a data warehouse with pre-existing data, since such data exhibits various types of heterogeneity. This integration scheme for data warehouses has to deal with several kinds of conflicts, namely value-to-value, attribute-to-attribute and table-to-table. These conflicts are not exclusive; they may occur in any pair of relations in two distinct pre-existing databases at the same time. Such heterogeneity occurs frequently when different databases are designed by different designers or driven by different assumptions. These conflicts are resolved only partially or are considered out of scope. The ideas of the presented projects are aimed at solving all of the issues and preparing a monolithic solution. However, while published achievements can be accessed and classified freely, there is still a considerable amount of closed, enterprise solutions that are mainly dedicated to commercial DBMSs. In this paper we try to challenge and solve the problems we elaborate in the following sections.

OLAP. OLAP is decision supporting software that gathers data in multidimensional structures (hypercubes). It also enables the analysis (statistical, sales, financial, etc.) of the collected data. OLAP solutions are populated with data from heterogeneous sources, i.e. multiple vendors and models (databases, files, services, etc.). Data acquisition and transformation is done by ETL tools. The design concept of OLAP is to deliver the analyzed data fast and effectively. Unfortunately, there is no formally unified query language for OLAP implementations. There are three types of OLAP systems - ROLAP (relational), MOLAP (multidimensional) and HOLAP (hybrid). The indisputable advantage of OLAP is fast analysis of historical data, collected periodically from data sources and aggregated in the form of data storing structures.
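To make the hypercube idea concrete, here is a small illustrative Java sketch (our assumption, not from the paper): fact records are rolled up into cube cells keyed by two hypothetical dimensions, region and year.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: rolling up fact rows into a two-dimensional
// "cube" of aggregated measures (total amount per region x year cell).
public class CubeRollup {

    // A hypothetical fact record: two dimensions and one measure.
    record Sale(String region, int year, double amount) {}

    public static void main(String[] args) {
        List<Sale> facts = List.of(
                new Sale("EU", 2014, 120.0),
                new Sale("EU", 2014, 80.0),
                new Sale("US", 2015, 200.0));

        // Each (region, year) pair identifies one cell of the hypercube.
        Map<String, Double> cube = facts.stream().collect(
                Collectors.groupingBy(
                        s -> s.region() + "/" + s.year(),
                        Collectors.summingDouble(Sale::amount)));

        cube.forEach((cell, total) ->
                System.out.println(cell + " -> " + total)); // EU/2014 -> 200.0
    }
}
```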

However, it must not be forgotten that data acquisition and transformation is a tedious and time-consuming process. It utilizes hardware resources intensively, regardless of the actual needs of the analytical process client. The frequent, often unnecessary collection of data from the sources brings overhead to the network and to the contributing systems, especially when big data is considered. Another problem is the fact that OLAP analytical processes are almost always based on outdated data. This is due to continuous data changes in the contributory systems and only periodical data acquisitions into OLAP. Therefore the OLAP client is presented with historical results. This is not a problem in all systems, but it is a serious flaw when current data is requested.

ETL. Modern enterprise solutions mostly base on ETL tools. Modern information systems often require input from multiple data sources (databases, XML files, CSV files, web services, etc.). What is more, the integration of data brings possibilities of data transformation and analysis. Therefore, the data needs to be extracted, transformed and loaded into the destination system, which usually is a data warehouse. ETL (Extract, Transform, Load) meets those needs. There are numerous ETL solutions originating from enterprise2 and open source3 vendors, all sharing the same model. The Extract phase is challenging, as it needs to handle multiple data sources. The databases themselves differ regarding their model: relational, object, document, graph, column-oriented, etc. Relational databases alone include many SQL dialects and, what is more, NewSQL is coming. Finally, web applications can sometimes also be the data source (web spidering). The Transform phase modifies the extracted data. Simple tasks include selection of specific columns, column joining, data filtration (selecting data from formula-based ranges), formula-based data transformations, aggregation, table joining, sorting, etc. More complicated tasks are pivoting or disaggregation of repeating columns into a separate detail table. At this stage the data types are also processed (e.g. data interpretation from string values). The last phase is the Load of the data into the data warehouse. Depending on the strategy, new data can periodically overwrite the old, or new data is written along with a timestamp of the data aggregation.
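The three phases can be summarized in a compact sketch. The following Java example is a minimal illustration of ours (the Order record and the inline CSV payload are hypothetical stand-ins for a real source and warehouse) wiring the Extract, Transform and Load steps of the shared ETL model:

```java
import java.util.List;
import java.util.stream.Collectors;

// Minimal ETL sketch: extract rows from a (hypothetical) CSV payload,
// transform them (type interpretation + filtering), load into a target.
public class EtlSketch {

    record Order(String customer, double total) {}

    // Extract: parse raw CSV lines into string-valued rows.
    static List<String[]> extract(String csv) {
        return csv.lines().map(l -> l.split(",")).collect(Collectors.toList());
    }

    // Transform: interpret types from strings and filter small orders.
    static List<Order> transform(List<String[]> rows) {
        return rows.stream()
                .map(r -> new Order(r[0], Double.parseDouble(r[1])))
                .filter(o -> o.total() >= 100.0)
                .collect(Collectors.toList());
    }

    // Load: write into the destination store (here: printed with the
    // timestamp of the data aggregation).
    static void load(List<Order> orders) {
        long stamp = System.currentTimeMillis();
        orders.forEach(o -> System.out.println(stamp + " " + o));
    }

    public static void main(String[] args) {
        load(transform(extract("alice,250.0\nbob,40.0\ncarol,120.0")));
    }
}
```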

ETL plays a great role in the specific architecture that it represents. However, ETL is not general or particularly abstract and, moreover, has some quite significant issues. Let us focus on the conceptual problems of ETL. The fundamental issue with ETL is its model, in which data collecting is periodical, which brings the danger that when the data reaches the data warehouse it is already outdated or will soon become such. ETL ignores the fact that the data sources keep changing on a regular basis. Collecting data takes place only at particular time moments, i.e. once a day, once a week, etc. As a result, the data warehouse most of the time contains historical data. For applications where the data can be historical this is not an issue. However, when online data processing is required, ETL is unacceptable.

Another architectural concept of ETL is unavoidable data pulls, regardless of whether the data are to be viewed or analyzed at all. This can generate redundant network chatter and increase the load on the data source systems, often unnecessarily. One also needs to remember that the data load is often transactional and therefore reliable and consistent, but it requires a large resource request per transaction.

2 Oracle Data Integrator, SQL Server Transformation Services, IBM InfoSphere DataStage
3 Talend Open Studio, Pentaho Data Integration

4 Proposed Abstract Integration Solutions

Answering the need for a general and abstract architecture for data integration, we introduce some of the techniques that we have used to overcome the issues mentioned in the previous section, while still preserving the key functionalities of the exemplary solutions. The proposed architecture, depicted in Figure 1, is based on processing metadata instead of the actual data itself. The architecture brings the following advantages by design:

(1) efficiency - metadata instead of real data
(2) manageability - easy to manage, as only structures are manipulated
(3) traffic optimization - fewer network connections
(4) reliability - less data to transfer; not relying on network reliability
(5) optimization - uses only native queries; brings most of the native data source optimizations
(6) no distributed transactions

Below we discuss the reasons for applying these assumptions and justify their usage in the architecture depicted in Figure 1.

4.1 Solution Proposal for the ETL issues

The ETL issues discussed above can be overcome by a general architecture design thanks to its data-less model. The key is to provide data that is up to date with the source state. To achieve this, i.e. to obtain the same data as the data stored in the data source, we have devised an architecture that provides not the data itself but a fast access method for the requested data. The Fast Access Method (FAM) is the data source's native query that pulls the data in the fastest possible way the data source can provide. Such a query differs across different models. For instance, in the relational model the fastest method for reaching the data is calling it by its primary key; in an object database it would be the OID (object ID), etc. Basing on this idea, we utilize a dedicated metamodel [20] that is built not on the data itself but on complex metadata that also includes the FAM. The metamodel for representing the integration patterns is presented in [21] and is out of scope of this paper; however, it provides the means to obtain a pure native query (FAM) from a metadata model to pull the actual data from the data source. Such data is not only current but is also pulled effectively.


Figure 1. Proposed architecture for the heterogeneous data integration grid

For example, let us assume the goal of extracting about a billion records using a non-optimized query. This would bring major latency issues and could even cause a system crash, which is not acceptable.

A native query based on the best row ID (primary key, object ID, etc.) is the most effective way to query an arbitrary data source. This is due to the native nature of the query. The model utilizing native queries can use all of the optimization techniques that the data source provides. This happens transparently, as the native optimization methods (e.g. indexing) are managed by the database engine that processes each native query. The general architecture actually has no knowledge of the actual optimization happening at the data source side. The native query is simply sent to the data source and the result is the requested data. Such an approach results in up-to-date and reliable data on demand. This is particularly important in online applications that require the current data state. The metamodel's advantage over the data model of, e.g., ETL is that it holds only metadata about the data. No data dependent constraints are necessary. In fact, such issues are going to be considered and handled at the target data source, with the best optimization its native query engine offers and without any effort at the global integration scale.
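A minimal sketch of the FAM idea follows. This is our own illustrative Java code, not the metamodel of [20, 21]; the SourceDescriptor fields and the query templates are assumptions. The integrator stores, per source, only metadata plus a native-query template keyed by the source's best row identifier:

```java
import java.util.Map;

// Sketch: the integrator keeps metadata and a Fast Access Method (FAM)
// per data source, instead of keeping the data itself.
public class FamRegistry {

    // Hypothetical metadata entry: model name plus a native query template
    // parameterized by the source's fastest identifier (PK, OID, ...).
    record SourceDescriptor(String model, String famTemplate) {}

    static final Map<String, SourceDescriptor> sources = Map.of(
            "crm", new SourceDescriptor("relational",
                    "SELECT * FROM customer WHERE id = %s"),   // primary key
            "cad", new SourceDescriptor("object",
                    "bind:%s"));                               // OID lookup

    // Resolve the native query for a given source and row identifier;
    // the query is executed by the source's own engine, so all of its
    // local optimizations (indexes, etc.) apply transparently.
    static String nativeQuery(String source, String rowId) {
        return sources.get(source).famTemplate().formatted(rowId);
    }

    public static void main(String[] args) {
        System.out.println(nativeQuery("crm", "42"));
        System.out.println(nativeQuery("cad", "OID#7f3a"));
    }
}
```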

and high availability. The queuing system takes care of everything by utilizing reliable messaging. The target sys-

4.2 Solution Proposal for the OLAP issues

As OLAP systems are based on ETL populating the OLAP store with source data, the downsides of ETL discussed above have also become part of OLAP systems. Given this drawback inherited from ETL with respect to current data acquisition, the proposed solution solves the issue by delivering up-to-date content that is pulled on demand.

4.3 No distributed transactions

The discussed architecture, due to metadata utilization, can also handle simple updates apart from selecting queries. However, this requires further elaboration. In the proposed data integration solution, the data stored at the data sources can be updated with extra records. This brings us to the point where data replication in a distributed grid environment must be handled. Let us assume then that the system is going to add one record that needs to be replicated in three integrated databases controlling the same schema. The integrator sends a native updating request to all three sources. Now, the problem arises when one of the three sources is not able to update its state and thus causes potential inconsistency in the integrated grid. In a regular distributed system, in such a condition, one would have to apply one of the following solutions. The first approach is a distributed two-phase commit, which means all actions being a part of the same transaction. This implies tying up all of the systems in the voting process until the unsuccessful update on the failed source is handled or until the remaining two sources roll back the distributed update. Both cases bring major inefficiency, forcing the entire system to halt until the problem is resolved. Another, more advanced solution would be dividing the distributed transaction into distinct transactions in an asynchronous queue. This way one can send the database update by enqueuing the update message in one transaction and forget about it. This is actually not really distributed, due to the fact that a close location of the queue broker and the message sender is required. This means that such a solution requires the queue broker and the database to be collocated in the same datacenter, with fast connectivity and high availability. The queuing system takes care of everything by utilizing reliable messaging. The target systems then process the update message from the queue locally, in a transaction. So each dequeue of a message and update to a data source is a separate transaction. Now, in case one of the data sources does not make an update from the queue, there is no way of instantaneous rollback on the remaining sources that have already accomplished this task with success. A compensation process is required in such a case. It is done by sending a message from the system that failed to process the transaction to the remaining systems, either to roll back or to handle this situation in another arbitrary way. The big advantage of this solution is that transactions happen quickly, asynchronously, and do not tie up the entire system over one transaction. On the other hand, it is not really distributed due to the collocation requirement, and under intense system load it can bring a major system slowdown due to the need for compensation, even with fast connectivity.

In our solution a transaction is not required. In the exemplary case, when two out of three data sources accept the update and the third one does not, we simply change the metamodel of the integrator to store access methods pointing only to the first two parties, where the update has succeeded. Now, when asking the integrator for all updated records, only the data sources that have made the updates with success (i.e. data sources one and two) will be made available to pull the data. The third data source will be considered only while querying for data that it had managed to store before this update message. This can be done thanks to the metamodel presented in [20].
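The following Java sketch (our illustration; the class and method names are hypothetical, not taken from [20]) shows the essence of this transaction-free approach: after a fan-out update, access methods are kept only for the sources that confirmed success, so the failed source is simply excluded for the new record.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: instead of a distributed transaction, the integrator records
// per-record access methods only for sources that confirmed the update.
public class TransactionFreeUpdate {

    // recordId -> sources holding a valid copy (a metamodel entry).
    static final Map<String, List<String>> accessMethods = new HashMap<>();

    // Hypothetical native update; returns false when the source fails.
    static boolean sendNativeUpdate(String source, String recordId) {
        return !source.equals("db3"); // simulate db3 failing to update
    }

    static void replicate(String recordId, List<String> sources) {
        // Keep only the sources where the update succeeded; no rollback,
        // no voting - the failed source is excluded for this record.
        List<String> succeeded = sources.stream()
                .filter(s -> sendNativeUpdate(s, recordId))
                .toList();
        accessMethods.put(recordId, succeeded);
    }

    public static void main(String[] args) {
        replicate("rec-1", List.of("db1", "db2", "db3"));
        System.out.println(accessMethods); // {rec-1=[db1, db2]}
    }
}
```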

4.4 Architecture open to optimization

The key aspect of integration relies on networking. High availability is key to integrating resources efficiently. Considering the intensive network traffic, the optimization of retrieving replicated data is crucial for effective and responsive communication. Therefore, we have prepared the architecture for replaceable load balancing algorithms [22] covering access to replicated data. At the level of the integrator, the architecture is also prepared for applying optimizations that have already been used in ORM solutions like Hibernate [23, 24, 25, 26]. The results of the mentioned optimizations are published in a separate paper. In [27] the authors have proven that an exemplary optimization technique based on query rewriting and order dependencies can be applied to an arbitrary data source without the need to interfere with the database engine, and thus regardless of its origin. This is due to the effective design of the proposed architecture and its flexibility and ease of application.
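"Replaceable load balancing algorithms" can be pictured as a strategy interface. The sketch below is our own illustration (the interface and the round-robin policy are assumptions, not the algorithms of [22]): the integrator asks a pluggable policy which replica to query.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the integrator delegates replica selection to a replaceable
// load-balancing strategy, so the algorithm can be swapped freely.
public class ReplicaBalancing {

    interface LoadBalancer {
        String pick(List<String> replicas);
    }

    // One possible policy: simple round-robin over the replica list.
    static class RoundRobin implements LoadBalancer {
        private final AtomicInteger next = new AtomicInteger();
        public String pick(List<String> replicas) {
            return replicas.get(next.getAndIncrement() % replicas.size());
        }
    }

    public static void main(String[] args) {
        LoadBalancer lb = new RoundRobin(); // replaceable by any policy
        List<String> replicas = List.of("node-a", "node-b", "node-c");
        for (int i = 0; i < 4; i++) {
            System.out.println("query -> " + lb.pick(replicas));
        }
    }
}
```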
5 Conclusions and Future Work

Existing solutions that have been widely used tend to solve specific problems without even trying to generalize, due to their constrained applications. In this paper we have presented the ways that, we believe, can challenge the problems that the existing solutions do not aim to solve. On the ground of an abstract model, which is out of scope for this paper, our solutions can be combined into a fully abstract and general architecture for heterogeneous data integration. In future papers we will introduce the entire architecture that has been devised to challenge the general problems of data integration in a heterogeneous and distributed environment.

References

[1] http://www.omg.org/technology/documents/corba_spec_catalog.htm, 2010.
[2] http://en.wikipedia.org/wiki/list_of_web_service_specifications, 2010.
[3] http://www.jini.org/wiki/jini_architecture_specification, 2009.
[4] K. Kuliberda, R. Adamus, J. Wislicki, K. Kaczmarski, T. M. Kowalski, and K. Subieta. A generic proposal for a transparent integration of distributed data by an autonomous layer in a virtual repository. Multiagent and Grid Systems - An International Journal (MAGS), 3(4):393-410, 2007.
[5] M. W. Sobolewski. Computing and metacomputing intergrid. In Proc. 10th International Conference on Enterprise Information Systems, Barcelona, Spain, 2008.
[6] M. W. Sobolewski. Federated collaborations with exertions. pages 127-132, 2008.
[7] R. Grycuk et al. Content-based image indexing by data clustering and inverse document frequency. Communications in Computer and Information Science, pages 374-383, 2014.
[8] M. Lupa and A. Piórkowski. Spatial query optimization based on transformation of constraints. Man-Machine Interactions, 242, 2014.
[9] P. Leszczynski and K. Stencel. Update propagator for joint scalable storage. Fundam. Inform., 119(3-4), 2012.
[10] P. Hepner. Integrating heterogeneous databases: An overview. School of Computing and Mathematics, Deakin University, Geelong, Victoria, Australia, 1995.
[11] V. D. Gilgor and G. L. Luckenbaugh. Interconnecting heterogeneous database management system. IEEE Comp. Soc. Press, 17(1), 1983.
[12] S. E. Madnick. A taxonomy for classifying commercial approaches to information integration in heterogeneous environments. Database Engineering - Special Issues on Database Connectivity, 13(2), 1999.
[13] V. Rajeswari et al. Heterogeneous database integration for web applications. International Journal on Computer Science and Engineering, 1(13):227-234, 2009.
[14] Frank S. C. Tseng. Heterogeneous database integration using XML.
[15] S. Wei-Jung and H. Minng-Ying. An interactive tool based on XML technology for data exchange between heterogeneous ERP systems. Journal of CIIE, 22(4):273-278, 2005.
[16] P. Rodriguez-Gianolli and J. Mylopoulos. A semantic approach to XML-based data integration. Conceptual Modeling.
[17] Microsoft TechNet. http://technet.microsoft.com/en-us/library/ms151835.aspx.
[18] J. Basu and Nirav Chanchani. Heterogeneous XML-based data integration.
[19] B. Inmon. Building the Data Warehouse.
[20] M. Chromiak and K. Stencel. A data model for heterogeneous data integration architecture. 424, 2014.
[21] M. Chromiak and K. Stencel. The linkup data structure for heterogeneous data integration platform. 7709:263-274, 2012.
[22] A. Piorkowski et al. Load balancing for heterogeneous web servers. 79:189-198, 2010.
[23] P. Wisniewski and K. Stencel. Query rewriting based on meta-granular aggregation. pages 457-468, 2013.
[24] M. Gawarkiewicz and P. Wisniewski. Partial aggregation using Hibernate. pages 90-99, 2011.
[25] A. Boniewicz et al. On materializing paths for faster recursive querying. pages 105-112, 2013.
[26] M. Burzanska et al. Recursive queries using object relational mapping.
[27] M. Chromiak et al. Exploiting order dependencies on primary keys for optimization. 2014.
