ANNALES UMCS INFORMATICA
AI XV, 1 (2015) 7 - 11
DOI: 10.1515/umcsinfo-2015-0001
Heterogeneous Data Integration Architecture – Challenging Integration Issues

Michal Chromiak¹*, Marcin Grabowiecki¹

¹ Institute of Computer Science, Maria Curie-Sklodowska University, pl. M. Curie-Sklodowskiej 5, 20-031 Lublin, Poland
* [email protected]
Abstract. As of today, most data processing systems have to deal with large amounts of data originating from numerous sources. Data sources almost always differ with regard to their purpose of existence; thus their model, data processing engine and technology differ intensely. Due to the current trend towards systems fusion, there is a growing demand for data to be presented in a common way regardless of its legacy. Many systems have been devised as an answer to such integration needs. However, present data integration systems are mostly dedicated solutions that bring constraints and issues when considered in general. In this paper we focus on the present solutions for data integration and their flaws originating in their architecture or design concepts, and we present an abstract and general approach that could be introduced as an answer to the existing issues. System integration is considered out of scope for this paper; we focus particularly on efficient data integration.

Keywords: grid integration model, heterogeneous integration, distributed architecture, data integration, big data, distributed transaction, warehouse, ETL, OLAP
1 Data Integration Role

Integrating distributed data and service resources is an ultimate goal of many current technologies, including distributed and federated databases, brokers based on the CORBA standard [1], Sun's RMI, P2P technologies, grid technologies, Web Services [2], Sun's JINI [3], virtual repositories [4], metacomputing federations [5, 6] and perhaps others. The distribution of resources has desirable features such as autonomic maintenance and administration of local data and services, unlimited scalability due to many servers, avoidance of global failures, support for security and privacy, etc. On the other hand, there is a need for global processing of distributed resources that treats them as a centralized repository with resource location and implementation transparency. Such integration is needed especially when considering domains (e.g. spatial ones) with large amounts of data that are required for further analysis [7] and optimisation [8] but originate from many sources. Distributed resources are often developed independently (with no central management), so with high probability they are heterogeneous, that is, incompatible concerning, in particular, local database schemas, naming of resource entities, coding of values and access methods. There are methods to deal with this heterogeneity, in particular federated databases, brokers based on the CORBA standard and virtual repositories.

The integrated data context is essential in terms of understanding the domain. The more the data is isolated from colligated data, the less informative it is. In other words, data integration plays a significant role, as each piece of the data puzzle is meaningful only while considered in a particular context. Therefore integration is crucial for modern systems. Present information systems must deal with large amounts of data; thus, big data concepts are becoming particularly important. The data itself is in most cases scattered, uses different models or query engines, is stored in different locations and is administered by independent administrators. In the scope of data integration this situation is undesirable. At present, despite many dedicated solutions for data integration (see Section 3) or global caching [9], most of them are not considered general enough to provide an abstract layer that could unify all of the data integration on the ground of a general solution.

The rest of the paper is organized as follows. In Section 2 we focus on the problems arising during data integration and the common models that address them, and in Section 3 on existing data integration solutions and their issues. In Section 4 we propose an abstract approach that would solve these issues. Section 5 concludes.

2 The Models of Integration (Enterprise Design Patterns)

Integration may proceed using multiple techniques, utilizing each of their particular advantages where needed. However, it is believed that the asynchronous messaging technique plays an increasingly important role among the other styles. Apart from asynchronous messaging, there are other approaches that solve the same problem, each with its distinct advantages and disadvantages.
In general, the approaches to integration can be considered in order of their evolution:

• shared file – one party writes to a file and another reads from it. The file location, format and access timing must be agreed between the parties using it.
• shared database – one database serves as the data source and store for many applications.
• Remote Method Invocation – one party exposes to the world an interface to its inner procedures, which can later be executed from outside of this party. The communication is real-time and synchronous.
• asynchronous messaging – one party publishes messages¹ to a common message channel, and the other parties can read those messages from the channel at a later time (a minimal sketch of this style follows below). However, the channel and the message format must be agreed.

¹ A message is a data structure – such as a string, a byte array, a record, or an object. It can be interpreted simply as data, as the description of a command to be invoked on the receiver, or as the description of an event that occurred in the sender.
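To make the asynchronous messaging style concrete, here is a minimal, self-contained sketch that uses an in-process java.util.concurrent.BlockingQueue as a stand-in for the agreed message channel; a real integration would use a dedicated message broker, and the class and message names below are invented for the illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical in-process stand-in for a shared message channel.
// A real deployment would use a broker (e.g. a JMS provider) instead.
public class ChannelDemo {
    public static void main(String[] args) throws InterruptedException {
        // The agreed-upon channel; the agreed message format here is a plain String.
        BlockingQueue<String> channel = new LinkedBlockingQueue<>();

        // Publisher: enqueues messages and moves on without waiting for a reader.
        Thread publisher = new Thread(() -> {
            for (int i = 1; i <= 3; i++) {
                channel.offer("order-created:" + i);
            }
        });

        // Consumer: reads messages at its own pace, possibly much later.
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 3; i++) {
                    String msg = channel.take(); // blocks until a message arrives
                    System.out.println("processed " + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        publisher.start();
        consumer.start();
        publisher.join();
        consumer.join();
    }
}
```

The essential property is visible even in this toy form: the publisher does not wait for the consumer, so the two parties only need to agree on the channel and the message format.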
3 The Collation of Issues With Data Integration Solutions

Current research in the field of database integration is focused on detailed areas, without a general approach that would be a complete (i.e. independent, extensible by design) and tested concept. The solutions that we propose in this paper bring the flexibility and the abstraction layer enabling unrestricted design for networking and schema customization.

The existing research in the domain of integration of heterogeneous databases has provided a diverse array of solutions. The main heterogeneous DB integration issues addressed in this research field were briefly classified in 1995 in [10] and [11]; however, some attempts took place even earlier, in 1983 [12]. At present, the main area of interest regarding integration techniques is based on XML solutions. It is believed, as mentioned in [13], that XML has become the indisputable standard both for data exchange and content management, and moreover that it is about to become the lingua franca of data interchange. This point of view can be justified by the immense number of research papers, like [14, 15, 16] and many more, trying to utilize XML technology as a tool for database schema and data representation. XML has also become part of many commercial products supported by giants of the software industry like Microsoft [17], Oracle [18] or IBM.

Another aspect of the significance of integration research should be considered in the field of the enterprise. Since 1992, when the concept of the data warehouse was proposed [19], database vendors have rushed to implement functionalities for constructing data warehouses. That made on-line analytical processing (OLAP) emerge as a technology for decision support systems. Problems may arise when building a data warehouse with pre-existing data, since such data exhibit various types of heterogeneity. This integration scheme for the data warehouse had to address certain conflicts, namely value-to-value, attribute-to-attribute and table-to-table conflicts. These conflicts are not exclusive; they may occur in any pair of relations in two distinct pre-existing databases at the same time. Such heterogeneity occurs frequently when different databases are designed by different designers or driven by different assumptions. These conflicts are resolved only partially or are considered out of scope. The idea of the presented projects is to solve all of the issues and prepare a monolithic solution. However, while in the case of published achievements the results can be accessed and classified freely, there is still a considerable number of closed, enterprise solutions that are mainly dedicated to commercial DBMSs. In this paper we try to challenge and solve the problems on which we elaborate in the following sections.

OLAP. OLAP is decision-supporting software that gathers data in multidimensional structures (hypercubes). It also enables analysis (statistical, sales, financial, etc.) of the collected data. OLAP solutions are populated with data from heterogeneous sources, i.e. multiple vendors and models (databases, files, services, etc.). Data acquisition and transformation is done by ETL tools. The design concept of OLAP is to collect the analyzed data fast and effectively. Unfortunately, there is no formally unified query language across OLAP implementations. There are three types of OLAP systems – ROLAP (relational), MOLAP (multidimensional) and HOLAP (hybrid). The indisputable advantage of OLAP is fast analysis of historical data, collected periodically from the data sources and aggregated in the form of data storing structures.

However, it must not be forgotten that data acquisition and transformation is a tedious and time-consuming process. It utilizes hardware resources intensively, regardless of the actual needs of the analytical process client. The often unnecessary collection of data from the data sources brings overhead to the network and to the contributing systems when big data is considered. Another problem is the fact that OLAP analytical processes are almost always based on outdated data. This is due to continuous data changes in the contributory systems and only periodical data acquisitions into OLAP. Therefore the OLAP client is presented with historical results. This is not a problem in all systems, but it is a serious flaw when current-data requests are considered.
ETL. Modern enterprise solutions are mostly based on ETL tools. Modern information systems often require input from multiple data sources (databases, XML files, CSV files, web services, etc.). What is more, the integration of data brings possibilities of data transformation and analysis. Therefore, the data needs to be extracted, transformed and loaded into the destination system, which usually is a data warehouse. ETL (Extract, Transform, Load) meets those needs. There are numerous ETL solutions originating from enterprise² and open source³ vendors, all sharing the same model. The Extract phase is challenging, as it needs to handle multiple data sources. The databases themselves differ regarding their model: relational, object, document, graph, column-oriented, etc. Relational databases alone include many SQL dialects and, what is more, NewSQL is coming. Finally, web applications can sometimes also be a data source (web spidering). The Transform phase modifies the extracted data. Simple tasks include selection of specific columns, column joining, data filtration (selecting data from formula-based ranges), formula-based data transformations, aggregation, table joining, sorting, etc. Some more complicated tasks are pivoting or disaggregation of repeating columns into a separate detail table. At this stage the data types are also processed (e.g. interpretation of data from string values). The last phase is the Load of the data into the data warehouse. Depending on the strategy, new data can periodically overwrite the old data, or new data is written along with a timestamp of the data aggregation.

² Oracle Data Integrator, SQL Server Transformation Services, IBM InfoSphere DataStage.
³ Talend Open Studio, Pentaho Data Integration.
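As a minimal illustration of the three phases just described, the sketch below runs a toy extract-transform-load pass over an in-memory data set. The source rows, column layout, filtering rule and target are all invented for the example; a production ETL tool would extract from real sources (JDBC connections, files, services) and load into a warehouse.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical three-phase ETL over an in-memory "CSV" source.
public class MiniEtl {
    public static void main(String[] args) {
        // Extract: pull raw rows from a source (here: hard-coded CSV lines).
        List<String> raw = List.of("alice,42.5,PL", "bob,17.0,DE", "carol,99.9,PL");

        // Transform: select columns, filter on a formula-based range,
        // and interpret string values into proper data types.
        record Fact(String name, double amount) {}
        List<Fact> facts = new ArrayList<>();
        for (String line : raw) {
            String[] cols = line.split(",");
            double amount = Double.parseDouble(cols[1]); // type interpretation
            if (amount > 20.0 && "PL".equals(cols[2])) { // filtration
                facts.add(new Fact(cols[0].toUpperCase(Locale.ROOT), amount));
            }
        }

        // Load: write to the destination (stdout stands in for a warehouse),
        // tagging each row with the acquisition timestamp.
        long loadedAt = System.currentTimeMillis();
        for (Fact f : facts) {
            System.out.printf("load %s %.2f (ts=%d)%n", f.name(), f.amount(), loadedAt);
        }
    }
}
```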
ETL plays a great role for the specific architecture that it represents. However, ETL is not general or abstract and, moreover, it has some quite significant issues. Let us focus on the conceptual problems of ETL. The fundamental issue with ETL is that its data collection is periodical, which brings the danger that by the time the data reach the data warehouse they are already outdated or will soon become such. ETL ignores the fact that the data sources keep changing on a regular basis. Collecting data takes place only at particular moments in time, i.e. once a day, once a week, etc. As a result, the data warehouse most of the time contains historical data. For applications where the data can be historical this is not an issue. However, when online data processing is required, ETL is unacceptable.

Another architectural trait of ETL is its unavoidable data pulls, regardless of whether the data are ever going to be viewed or analyzed. This can generate redundant network chatter and increase the load on the data source systems, often unnecessarily. One also needs to remember that a data load is often transactional, and therefore reliable and consistent, but it also requires a large resource request per transaction.

4 Proposed Abstract Integration Solutions

Answering the need for a general and abstract architecture for data integration, we introduce some of the techniques that we have used to overcome the issues mentioned in the previous section, while still preserving the key functionalities of the exemplary solutions. The proposed architecture, depicted in Figure 1, is based on processing metadata instead of the actual data itself. The architecture brings the following advantages by design:

(1) efficiency – metadata instead of real data;
(2) manageability – easy to manage, as only structures are manipulated;
(3) traffic optimization – fewer network connections;
(4) reliability – less data to transfer; no reliance on network reliability;
(5) optimization – uses only native queries, bringing out the most of the native data source optimizations;
(6) no distributed transactions.

Below we discuss the reasons for applying these assumptions and justify their usage in the architecture depicted in Figure 1.

4.1 Solution Proposal for the ETL issues

The ETL issues discussed above can be overcome thanks to a general architecture design, in particular thanks to its data-less model. The key is to provide data that are up to date with the source state. To achieve this, i.e. to obtain the same data as stored in the data source, we have devised an architecture that provides not the data itself but a fast access method for the requested data. The Fast Access Method (FAM) is the data source's native query that pulls the data in the fastest possible way the data source can provide. Such a query differs across different models. For instance, in the relational model the fastest method for reaching the data is calling it by its primary key; in an object database it would be the OID (object ID), etc. Basing on this idea, we utilize a dedicated metamodel [20] that is based not on the data itself but on complex metadata that also include the FAM. The metamodel for representing the integration patterns is presented in [21] and is out of scope of this paper; however, it provides the means to obtain a pure native query (FAM) from a metadata model to pull the actual data from the data source. Such data are not only current but are also pulled effectively.
Figure 1. Proposed architecture for the heterogeneous data integration grid
For example, let us assume the goal of extracting about a billion records using a non-optimized query. This would bring major latency issues and could even cause a system crash, which is not acceptable. A native query based on the best row ID (primary key, object ID, etc.) is the most effective way to query an arbitrary data source. This is due to the native nature of the query. A model utilizing native queries can use all of the optimization techniques that the data source provides. This happens transparently, as the native optimization methods (e.g. indexing) are managed by the database engine that processes each native query. The general architecture actually has no knowledge of the actual optimization happening at the data source side. The native query is simply sent to the data source and the result is the requested data. Such an approach results in up-to-date and reliable data on demand. This is particularly important in online applications that require the current data state. The metamodel's advantage over the data model of, for example, ETL is that it holds only metadata about the data. No data-dependent constraints are necessary. In fact, such issues are considered and handled at the target data source, with the best optimization of its native query engine and without any effort at the global integration scale.
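The following sketch renders the FAM idea in code under stated assumptions: the FastAccessMethod record, the query templates and the source identifiers are hypothetical and do not reflect the actual metamodel of [20, 21]. The point is only that the integrator keeps a native query template per source and renders it on demand, instead of keeping the data.

```java
import java.util.Map;

// Sketch of the metadata-only integrator idea: the integrator stores a
// Fast Access Method (a native query template) per source, never the data.
// All names here are illustrative, not the paper's actual metamodel API.
public class FamDemo {

    // One metadata entry: where the data lives and how to reach it fastest.
    record FastAccessMethod(String sourceId, String nativeQueryTemplate) {
        String bind(Object key) {
            return nativeQueryTemplate.replace("{key}", key.toString());
        }
    }

    public static void main(String[] args) {
        // Per-model FAMs: primary key lookup for a relational source,
        // OID lookup for an object database.
        Map<String, FastAccessMethod> metamodel = Map.of(
            "customers", new FastAccessMethod(
                "rdbms-1", "SELECT * FROM customers WHERE id = {key}"),
            "invoices", new FastAccessMethod(
                "odb-1", "objects.get(OID({key}))"));

        // On demand, the integrator only renders the native query and ships it
        // to the source engine, which applies its own optimizations (indexes etc.).
        String q = metamodel.get("customers").bind(12345);
        System.out.println("send to rdbms-1: " + q);
    }
}
```

Because the rendered query is native, the source engine is free to apply its own optimizations (index lookups, OID dereferencing), which is exactly what the architecture relies on.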
4.2 Solution Proposal to OLAP issues

As OLAP systems are based on ETL populating the OLAP store with source data, the downsides of ETL discussed above have also become part of OLAP systems. Due to this drawback, inherited from ETL on the ground of current data acquisition, the proposed solution resolves the issue by delivering up-to-date content that is pulled on demand.
4.3 No distributed transactions

The discussed architecture, due to its metadata utilization, can also store simple updates apart from selecting queries. However, this requires further elaboration. In the case of the proposed data integration solution, the data stored at the data sources can be updated with extra records. This brings us to the point where data replication in a distributed grid environment must be handled. Let us assume, then, that the system is going to add one record that needs to be replicated in three integrated databases controlling the same schema. The integrator sends a native updating request to all three sources. Now, the problem arises when one of the three sources is not able to update its state and thus causes potential inconsistency in the integrated grid. In a regular distributed system, in such a condition, one would have to apply one of the following solutions.

The first approach is a distributed two-phase commit, which means all actions being part of the same transaction. This implies tying up all of the systems in a voting process until the unsuccessful update on the failed source is handled or until the remaining two sources roll back the distributed update. Both cases bring major inefficiency, forcing the entire system to halt until the problem is resolved.

Another, more advanced solution would be dividing the distributed transaction into distinct transactions in an asynchronous queue. This way one can send the database update message by enqueuing it in one transaction and forget about it. This is actually not really distributed, due to the fact that a close location of the queue broker and the message sender is required. This means that such a solution requires the queue broker and the database to be collocated in the same datacenter with fast connectivity and high availability. The queuing system takes care of everything by utilizing reliable messaging. The target systems then process the update message locally, in a transaction, from the queue. So each dequeue of a message and update to a data source is a separate transaction. Now, in the case when one of the data sources does not make an update from the queue, there is no way of instantaneous rollback on the remaining sources that have already accomplished this task with success. A compensation process is required in such a case. This is done by sending a message from the system that failed to process the transaction to the remaining systems, either to roll back or to handle the situation in another arbitrary way. The big advantage of this solution is that transactions happen quickly, asynchronously, and do not tie up the entire system over one transaction. On the other hand, it is not really distributed due to the collocation requirement, and under intense system load it can bring a major system slowdown due to the need for compensation, even with fast connectivity.
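Before contrasting it with our approach, the sketch below condenses the queue-based variant just described into sequential code: each source applies the dequeued update as its own local transaction, and a failure triggers compensation messages to the sources that have already committed. The classes and the failure flags are invented for the illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of the queue-based alternative to two-phase commit: the update is
// enqueued once, each source applies it in its own local transaction, and a
// failure triggers compensation messages to the sources that already
// succeeded. All names are illustrative.
public class QueuedReplication {

    record SourceSim(String name, boolean willSucceed) {}

    public static void main(String[] args) {
        List<SourceSim> sources = List.of(
            new SourceSim("db-1", true),
            new SourceSim("db-2", true),
            new SourceSim("db-3", false)); // this one fails to apply the update

        // A durable queue; enqueuing is the sender's only transaction.
        Queue<String> updates = new ArrayDeque<>(List.of("INSERT record#1"));
        String update = updates.poll();

        List<SourceSim> succeeded = new ArrayList<>();
        for (SourceSim s : sources) {
            // Each dequeue+apply would run as a separate local transaction.
            if (s.willSucceed()) {
                succeeded.add(s);
                System.out.println(s.name() + " committed locally: " + update);
            } else {
                // No instantaneous global rollback is possible; instead the
                // failed source asks the others to compensate.
                for (SourceSim ok : succeeded) {
                    System.out.println(ok.name() + " <- compensate: " + update);
                }
            }
        }
    }
}
```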
In our solution, a transaction is not required. In the exemplary case, when two out of three data sources accept the update and the third one does not, we simply change the metamodel of the integrator to store access methods pointing only to the first two parties, where the update has succeeded. Now, in the case of asking the integrator for all updated records, only the data sources that have made the updates successfully (i.e. data sources one and two) will be made available to pull the data from. The third data source will be considered only while querying for data that it had managed to store before this update message. This can be done owing to the metamodel presented in [20].
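A minimal sketch of this transaction-free behaviour, assuming a hypothetical map-based metamodel rather than the actual structure of [20]: after a partial update, the integrator records access methods only for the sources where the update succeeded, so subsequent queries are routed around the failed source and nothing needs to be rolled back.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the paper's transaction-free idea: after a partial update, the
// integrator's metamodel simply stops routing queries for the new record to
// the source that failed, instead of rolling anything back. Illustrative only.
public class MetamodelRepair {

    // recordKey -> sources whose access method may be used to pull that record
    static final Map<String, List<String>> accessMethods = new HashMap<>();

    static void replicate(String recordKey, Map<String, Boolean> outcome) {
        // Keep access methods only for the sources where the update succeeded.
        List<String> ok = outcome.entrySet().stream()
            .filter(Map.Entry::getValue)
            .map(Map.Entry::getKey)
            .toList();
        accessMethods.put(recordKey, ok);
    }

    public static void main(String[] args) {
        // Two of three sources accept the new record; the third does not.
        replicate("record#1", Map.of("db-1", true, "db-2", true, "db-3", false));

        // Queries for record#1 are now served by db-1 and db-2 only; db-3 is
        // still consulted for the data it stored before this update.
        System.out.println("pull record#1 from: " + accessMethods.get("record#1"));
    }
}
```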
4.4 Architecture open to optimization

The key aspect of integration relies on networking, and high availability is key to integrating resources efficiently. Considering the intensive network traffic, optimizing the retrieval of replicated data is crucial to effective and responsive communication. Therefore, we have prepared the architecture for replaceable load balancing algorithms [22] covering access to replicated data (a sketch follows below). The architecture at the level of the integrator is also prepared for applying optimizations that have already been used for ORM solutions like Hibernate [23, 24, 25, 26]. The results of the mentioned optimizations are published in a separate paper. In [27] the authors have proven that an exemplary optimization technique based on query rewriting and order dependencies can be applied to an arbitrary data source without the need to interfere with the database engine, and thus regardless of its origin. This is due to the effective design of the proposed architecture and its flexibility and agility of application.
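As a sketch of what replaceable load balancing algorithms can look like at the integrator level, the fragment below hides the choice of a replica behind a strategy interface, so the algorithm can be swapped without touching the rest of the architecture. The interface and both strategies are illustrative and are not the algorithms of [22].

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Replaceable load balancing: the integrator picks a replica through a
// strategy interface, so the algorithm can be swapped freely. Names are
// illustrative only.
public class ReplicaBalancing {

    interface BalancingStrategy {
        String pick(List<String> replicas);
    }

    static class RoundRobin implements BalancingStrategy {
        private final AtomicInteger next = new AtomicInteger();
        public String pick(List<String> replicas) {
            return replicas.get(Math.floorMod(next.getAndIncrement(), replicas.size()));
        }
    }

    static class RandomChoice implements BalancingStrategy {
        public String pick(List<String> replicas) {
            return replicas.get(ThreadLocalRandom.current().nextInt(replicas.size()));
        }
    }

    public static void main(String[] args) {
        List<String> replicas = List.of("replica-A", "replica-B", "replica-C");
        BalancingStrategy strategy = new RoundRobin(); // swap in RandomChoice freely
        for (int i = 0; i < 4; i++) {
            System.out.println("route read to " + strategy.pick(replicas));
        }
    }
}
```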
5 Conclusions and Future Work

Existing solutions that have been widely used tend to solve specific problems without even trying to generalize, due to their constrained applications. In this paper we have presented the ways that, we believe, can challenge the problems that the existing solutions do not aim to solve. On the ground of an abstract model, which is out of scope for this paper, our solutions can be combined into a fully abstract and general architecture for heterogeneous data integration. In future papers we will introduce the entire architecture that has been devised to challenge the general problems of data integration in a heterogeneous and distributed environment.

References

[1] http://www.omg.org/technology/documents/corba_spec_catalog.htm, 2010.
[2] http://en.wikipedia.org/wiki/list_of_web_service_specifications, 2010.
[3] http://www.jini.org/wiki/jini_architecture_specification, 2009.
[4] K. Kuliberda, R. Adamus, J. Wislicki, K. Kaczmarski, T. M. Kowalski, and K. Subieta. A generic proposal for a transparent integration of distributed data by an autonomous layer in a virtual repository. Multiagent and Grid Systems – An International Journal (MAGS), 3(4):393–410, 2007.
[5] M. W. Sobolewski. Computing and metacomputing intergrid. In Proc. 10th International Conference on Enterprise Information Systems, Barcelona, Spain, 2008.
[6] M. W. Sobolewski. Federated collaborations with exertions. pages 127–132, 2008.
[7] R. Grycuk et al. Content-based image indexing by data clustering and inverse document frequency. Communications in Computer and Information Science, pages 374–383, 2014.
[8] M. Lupa and A. Piórkowski. Spatial query optimization based on transformation of constraints. Man-Machine Interactions, 242, 2014.
[9] P. Leszczynski and K. Stencel. Update propagator for joint scalable storage. Fundam. Inform., 119(3-4), 2012.
[10] P. Hepner. Integrating heterogeneous databases: An overview. School of Computing and Mathematics, Deakin University, Geelong, Victoria, Australia, 1995.
[11] V. D. Gilgor and G. L. Luckenbaugh. Interconnecting heterogeneous database management systems. IEEE Comp. Soc. Press, 17(1), 1983.
[12] S. E. Madnick. A taxonomy for classifying commercial approaches to information integration in heterogeneous environments. Database Engineering – Special Issue on Database Connectivity, 13(2), 1999.
[13] V. Rajeswari et al. Heterogeneous database integration for web applications. International Journal on Computer Science and Engineering, 1(13):227–234, 2009.
[14] Frank S. C. Tseng. Heterogeneous database integration using XML.
[15] S. Wei-Jung and H. Minng-Ying. An interactive tool based on XML technology for data exchange between heterogeneous ERP systems. Journal of CIIE, 22(4):273–278, 2005.
[16] P. Rodriguez-Gianolli and J. Mylopoulos. A semantic approach to XML-based data integration. In Conceptual Modeling.
[17] Microsoft TechNet. http://technet.microsoft.com/en-us/library/ms151835.aspx.
[18] J. Basu and N. Chanchani. Heterogeneous XML-based data integration.
[19] B. Inmon. Building the Data Warehouse. 1992.
[20] M. Chromiak and K. Stencel. A data model for heterogeneous data integration architecture. 424, 2014.
[21] M. Chromiak and K. Stencel. The linkup data structure for heterogeneous data integration platform. 7709:263–274, 2012.
[22] A. Piorkowski et al. Load balancing for heterogeneous web servers. 79:189–198, 2010.
[23] P. Wisniewski and K. Stencel. Query rewriting based on meta-granular aggregation. pages 457–468, 2013.
[24] M. Gawarkiewicz and P. Wisniewski. Partial aggregation using Hibernate. pages 90–99, 2011.
[25] A. Boniewicz et al. On materializing paths for faster recursive querying. pages 105–112, 2013.
[26] M. Burzanska et al. Recursive queries using object relational mapping.
[27] M. Chromiak et al. Exploiting order dependencies on primary keys for optimization. 2014.