The Nimble XML Data Integration System

Denise Draper

Alon Y. HaLevy

Daniel S. Weld

{ddraper, alon, dweld}@nimble.com
Nimble Technology, Inc.
1938 Fairview Avenue East, Suite 100
Seattle, WA 98102, USA

Abstract

For better or for worse, XML has emerged as a de facto standard for data interchange. This consensus is likely to lead to increased demand for technology that allows users to integrate data from a variety of applications, repositories, and partners located across the corporate intranet or on the Internet. Nimble Technology has spent two years developing a product to serve this market. Originally conceived after decades of person-years of research on data integration, the product is now being deployed at several Fortune-500 beta-customer sites. This abstract reports on the key challenges we faced in the design of our product and highlights some issues we think require more attention from the research community. In particular, we address architectural issues arising from designing a product to support XML as its core representation, choices in the design of the underlying algebra, on-the-fly data cleaning, and caching and materialization policies.

1. Introduction

XML's attributes of simplicity and flexibility have caused it to become the de facto standard for data interchange (see www.xml.org for a list of XML-based interchange standards). Nearly every vendor of data-management tools has delivered some degree of XML support, with more on the way. Finally, it is becoming possible to build a comprehensive data integration system without first devoting significant resources to the construction and maintenance of wrappers for each data source. Although data integration has been studied by the research community for some time, we believe that the market for sophisticated data integration products is just beginning. Nimble Technology has spent two years extending and transforming a University of Washington research prototype into a commercial product for XML data integration. Our product provides a common interface to a corporation's myriad data sources, whether they are modern relational databases, legacy systems or structured files. A beta version of the product is now being deployed at several Fortune-500 customer sites, and a formal product launch should precede this ICDE conference. We begin by presenting the product goals and architecture (Section 2), and then highlight several issues we think require more attention from the research community: choices in the design of the underlying algebra, on-the-fly data cleaning, caching and materialization policies, and issues related to source unavailability (Section 3).

2. Overview of the Product

Our prime target customers include enterprises that need to conveniently query across multiple internal and external data sources. Market research suggests that the typical Fortune 1000 company has over one hundred enterprise applications running on fifteen platforms and eight data storage architectures (data from a Gartner Group report). While many of these sources are modern relational databases, legacy systems (e.g., hierarchical IMS installations) and structured files are surprisingly prevalent. This diversity of data sources is exacerbated by the current e-business trend, which is driving companies to connect their online systems to those of their suppliers and channels. The key feature distinguishing our product is the use of an XML-like data model at the system's very core. There are two key reasons for building the system based on XML.

• First, XML provides much greater flexibility in the kinds of data that can be handled by our system. In addition to the common case of relational data, we can naturally handle data from hierarchical stores (e.g., IMS) and data in structured files. Furthermore, XML enables one to more naturally model differences between representations of data in different sources. In the research community, the idea of using a semi-structured data model for data integration was first proposed in the Tsimmis Project [8].

• The second reason for using XML is more marketing-driven than technical: users find data integration more compelling when XML is involved. Since XML is touted as a standard for data exchange within and across organizations, IT personnel find it easier to imagine applications of data integration when XML is the data transport format.

Our product is currently being deployed at several beta-customer sites of very large companies. In many discussions with potential customers we found numerous application areas where users require data integration functionality. In one common scenario, the need for data integration arises when information about the customers of a company is scattered across multiple databases in the organization, and the company would like to learn more about its customers (by integrating all the data into one view) and to ensure that the data about customers is consistent across the databases. In some cases the data sources have existed for a long time, and in others they have resulted from continuous activities of mergers and acquisitions. It is important to emphasize that in many of these cases, the option of creating a new unified database that stores all the information is not possible, because of operational constraints, the cost of doing so, or the political implications within an organization. Another class of applications comes from companies that need to build large-scale web sites serving information from multiple internal sources. The task of building the web site itself is an enormous one in these cases, and it is very important to the customer to be able to separate the task of building the web site from the task of integrating the underlying data. Hence, they would like to provide the designers of the web site with an already integrated view of their data sources.

2.1. System architecture

A diagram of the architecture of our system is shown in Figure 1. It is similar in spirit to the architecture of research prototypes (e.g., [8, 14, 18, 9, 1]; see [12, 6, 5] for surveys), with the distinguishing factor that our system is built upon an XML data model. The query language supported by our system is XML-QL [4], which was the only existing expressive query language for XML when we started designing our system. Ultimately, we plan to adopt the standard query language recommended by the W3C Query Working Group.

Users and applications interact with the system using a set of mediated schemas. These schemas are essentially definitions of views over the schemas of the data sources (similar to the global-as-view approach [6]). It is important to note that these schemas can be built in a hierarchical fashion; that is, we can define successive schemas as views over other underlying schemas. This provides important flexibility when multiple classes of users and applications will use the system, and it also facilitates defining the integration of the data sources, because the integration can be done incrementally (possibly in different parts of an organization).

The system front end is flexible, offering multiple layers of access. For example, a lens is an object that contains a set of XML queries, parameters, XSL formatting, and authentication information. Result formatting can be targeted to specific devices (e.g., a web interface or a wireless device). Customers who wish to use a lower-level interface to the integration engine are also supported.

When an XML-QL query is posed to the integration engine, it is parsed and broken into multiple fragments based on the target data sources. The compiler translates each fragment into the appropriate query language for the destination source; for example, if an RDB is being queried, the compiler generates SQL. The compiler considers the type of the underlying source, information concerning the layout of the data within the sources, and the presence of indices on the data. The metadata server contains the mappings that allow XML-QL to be split apart and translated appropriately; mappings are set via the management tools. Load balancing is provided: multiple instances of the integration engine can be run simultaneously on one or more servers. (A minimal sketch of this decomposition step appears after the feature list below.)

Even though our main architecture is built on a federated integration model, this alone is not always sufficient for all needs. Thus we support a compound architecture that also includes offline data manipulation and replication, using our data administrator sub-system. In summary, our product has the following features:

• high-performance, scalable query processing of data from multiple sources;
• dynamic data mapping and cleaning;
• the ability for users to specify new queries and views;
• support for third-party applications and devices to query and display results;
• robust system management;
• XML as the unifying model underlying the system.
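To illustrate the query decomposition step mentioned above, the toy sketch below shows an engine splitting a query into per-source fragments, compiling a relational fragment into SQL, and merging the partial results. It is our own illustration with hypothetical names (translate_to_sql, execute, crm_stub) and a deliberately simplified fragment representation; it is not the product's actual API.

```python
# Illustrative sketch only, not the product's API: a toy mediator that splits a
# query into per-source fragments, translates relational fragments into SQL,
# and concatenates the partial results.

def translate_to_sql(table, columns, predicate):
    """Tiny 'compiler': emit SQL for a fragment aimed at a relational source.
    A real compiler would also target IMS, structured files, etc."""
    where = f" WHERE {predicate}" if predicate else ""
    return f"SELECT {', '.join(columns)} FROM {table}{where}"

def execute(fragments, connectors):
    """Dispatch each fragment to its source and gather the partial results.

    `fragments` is a list of dicts with keys: source, table, columns, predicate.
    `connectors` maps a source name to a callable that runs a native query.
    A real engine would also perform cross-source joins and tag results as XML.
    """
    results = []
    for frag in fragments:
        native = translate_to_sql(frag["table"], frag["columns"], frag["predicate"])
        results.extend(connectors[frag["source"]](native))
    return results

# Toy usage: one "relational source" backed by an in-memory stub.
def crm_stub(sql):
    print("would send to the CRM database:", sql)   # in reality, over ODBC/JDBC
    return [{"cust_id": 1, "name": "Acme"}]

rows = execute(
    [{"source": "crm", "table": "customers",
      "columns": ["cust_id", "name"], "predicate": "region = 'WA'"}],
    {"crm": crm_stub},
)
print(rows)
```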

Figure 1. Architecture of product and ancillary tools. (The diagram, omitted here, shows user applications and a front end comprising lens files, an info browser, an SDK, and APIs; an integration layer containing the integration engine with its compiler, executor, cache, and metadata server; a tool suite with management tools, a lens builder, an integration builder, a concordance developer, and security tools; the data administrator sub-system; and a common XML view over the underlying sources: relational DBs, data marts/warehouses, legacy systems, flat files, and web pages.)

3. Challenges

In this section we describe the key challenges we faced in designing our system, and point out several issues that deserve more attention from the research community.

3.1. Data model and algebra design

To build a system with XML at its core, we first needed to decide on a data model for XML and an algebra over which our query processor would operate. The W3C's XML Query Working Group was considering the same issues at the same time, but since their interests were geared more towards semantic definition, and ours towards performance, we found that we needed a different model. The research community has also discussed models for XML, and more generally, semi-structured data []. Some have argued for modeling XML as graphs, while others have advocated a simpler model of trees. However, most of these discussions focused on designing a data model for pure XML (and for good reasons). One of the key lessons from our design is that we wanted a data model that can certainly accommodate XML, but would also let us deal efficiently with the types of data we expected to see from users most frequently (e.g., relational, hierarchical). Hence, our data model allows for the semi-structured aspects of XML, but is slightly more structured than models described for XML, thus accommodating relational and hierarchical data more naturally.

The same philosophy extended into the design of our algebra. We wanted to ensure that the algebra supported operations on standard data models efficiently, and also supported operations that combine data from multiple models efficiently. In developing the algebra, we also realized that we need to distinguish two roles for an algebra: the first is that an algebra serves as an abstraction of the query language, and the second is that the algebra models a set of physical operators implemented by the query processor. In the context of the relational model, relational algebra to a large extent serves both of these purposes (and hence the distinction is often forgotten). However, in our context and in the context of pure XML, we do not yet have the equivalent of relational algebra. Consequently, it is important to keep in mind the purpose for which one is designing an algebra. In our work we focused on designing a physical algebra, because it had a direct impact on the design and implementation of our system. We did not design a logical algebra for several reasons, chief among them being that we expect the query language we support to be a moving target for a while. In our system we translate a query into an internal representation, and from there directly to query execution plans in the physical algebra.
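To make the idea of a physical algebra concrete, the following is a generic iterator-style sketch of composable physical operators (Scan, Select, NestedLoopJoin). It is our own illustration of the general technique and not the operator set or interfaces of the Nimble engine, whose algebra, as noted above, is tailored to mixed relational, hierarchical and XML data.

```python
# Generic sketch of iterator-style physical operators; plain dicts stand in
# for XML-like items. A plan is a tree of operators, and the executor simply
# iterates the root.

class Scan:
    """Leaf operator: produce items from an in-memory collection."""
    def __init__(self, items):
        self.items = items
    def __iter__(self):
        return iter(self.items)

class Select:
    """Filter items by a predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def __iter__(self):
        return (item for item in self.child if self.predicate(item))

class NestedLoopJoin:
    """Join two inputs on an equality key; a real engine would also offer
    hash or merge variants and order-preserving operators."""
    def __init__(self, left, right, key):
        self.left, self.right, self.key = left, right, key
    def __iter__(self):
        right_items = list(self.right)
        for l in self.left:
            for r in right_items:
                if l[self.key] == r[self.key]:
                    yield {**l, **r}

plan = NestedLoopJoin(
    Select(Scan([{"id": 1, "name": "Acme"}, {"id": 2, "name": "Zeta"}]),
           lambda x: x["id"] == 1),
    Scan([{"id": 1, "city": "Seattle"}]),
    key="id",
)
print(list(plan))   # [{'id': 1, 'name': 'Acme', 'city': 'Seattle'}]
```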

3.2. Dynamic Data Cleaning

Although data cleaning and ETL capabilities are a crucial aspect of data warehousing and data integration, they have received only limited attention from the research community to date (e.g., [7, 15, 3, 10, 11, 16]). Data cleaning is difficult for the following reasons:

Data anomalies: Values may be truncated, abbreviated, incorrect or missing. Corresponding records may contain inconsistent values for some fields. Keys, attributes, structures and encoding conventions may differ across applications. In addition, we need to consider the following well-known problems: (1) the object identity problem: different tokens may be used for the same object in different sources, and large amounts of human effort may be required to develop a concordance database that records determinations of equivalent objects; (2) the translation problem: source A may use several fields (e.g., city, state, ...) to describe what source B models with a single field (address), so translation between sources may require parsing or information extraction techniques; and (3) representational inadequacy: important data relationships might be hidden because a legacy system did not provide a key structure linking related documents (e.g., multiple account numbers might mask the fact that all are from subsidiaries of a parent company).

Data changes over time: For example, a company might use a standard chart of accounts before 1997 but SAP (with new account codes) afterwards. Furthermore, as new data is entered it may uncover previously hidden anomalies.

Management: A user interface is required to manage the creation, execution and maintenance of data cleaning flows, and it may require complex features (e.g., programming by demonstration) for ease of use.

Data cleaning in the context of data integration is different from that of data warehousing. When a data mart or warehouse is created, data cleaning can easily be incorporated as part of the data import process. In contrast, with data integration the source data is unchanged, and at least some of the cleansing and matching needs to be performed dynamically. One of the features we have found essential in most practical situations is a separate data store created to match records from two or more different original data sources. We call this a concordance database, and we provide special tools within our data cleaning architecture for creating and using concordance databases (a small illustrative sketch of concordance lookup appears after the feature list below). Overall, our data cleaning system (still under construction) will have the following features:

• The framework is extensible, handling immediate needs (e.g., name and address standardization) and allowing for future enhancements as they are demanded by customers. Domain-specific and customer-provided normalization and matching functions are supported.

• To the greatest extent possible, the system runs autonomously while incorporating human input for disambiguation when necessary. This necessitates breaking the cleansing process into two phases: data mining and extraction. Support for the data mining phase involves human-centered tools for interactively analyzing data, testing transforms, resolving ambiguities, looking for duplicates and anomalies, finding legacy data encoded in text fields, etc. During the extraction phase, past human decisions are reapplied via a concordance database, and exceptions are trapped to allow extraction to continue, with cleanup applied post hoc when a human is available.

• The system supports a data lineage mechanism, recording data ancestry and human decisions, and supporting roll-back whenever possible.

• The system is designed to be robust and efficient, working on large quantities of data, and facilitating efficient query processing of virtually-clean data whenever possible.

• The system facilitates quick creation and maintenance of data cleaning flows. It will be easy to add new data sources to an existing flow. We use a declarative representation of the flow [7].
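As promised above, here is a small sketch of how a concordance lookup might behave during extraction: known matches are reapplied automatically, and unmatched records are queued as exceptions for later human review. The function and field names are hypothetical, and the normalization is deliberately naive; this is not the product's cleaning API.

```python
# Illustrative sketch of a concordance lookup step (not the product's actual
# cleaning pipeline).

def normalize(name):
    """Toy normalization; real systems use domain-specific functions."""
    return " ".join(name.lower().replace(",", " ").split())

def match_with_concordance(record, concordance, exceptions):
    """Return a canonical id for `record`, or None and queue an exception.

    `concordance` maps a normalized key to a canonical id and encodes past
    human decisions; `exceptions` collects records for post-hoc cleanup.
    """
    key = normalize(record["name"])
    if key in concordance:
        return concordance[key]
    exceptions.append(record)   # extraction continues; a human resolves later
    return None

concordance = {"acme corp": "C-0001"}   # built up from prior human decisions
exceptions = []
print(match_with_concordance({"name": "ACME, Corp"}, concordance, exceptions))  # C-0001
print(match_with_concordance({"name": "Acme Inc"},  concordance, exceptions))   # None
```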

3.3. Warehousing vs. virtual integration

Data integration solutions have largely been based on two opposing approaches: warehousing and virtual querying (see [13, 12] for a discussion of the two approaches). The warehousing solution requires building a data warehouse and writing programs that periodically load the data from the data sources into the warehouse. The main advantage of the warehousing approach is the performance of query processing. The main disadvantages are that the data may not be fresh, and that if the design of the warehouse schema is static, it is harder to accommodate new data sources or variations among existing sources. In the virtual integration solution we define one or more mediated schemas, which are not used for storing data but only for querying it. When a query is issued to the system, it is translated into a set of queries over the data sources, the answers to which are combined by the data integration system. With this solution we get fresh data, and with the semi-structured aspects of XML we can more easily accommodate differences between the data sources. However, we may pay a considerable performance penalty because we need to contact the sources for every query. Furthermore, some sources may not always be accessible, and therefore we may not even be able to process certain queries at certain times.

A cornerstone of our architecture is that the system should be configurable to query on demand as well as to materialize some data locally. Our management tools enable specification of which data sources (or queries over data sources) should be materialized in a local store and refreshed on demand. The query processor knows to make use of local copies of data when they are available. Note that our materialization strategy differs from warehousing in that one does not design a warehouse schema; instead, one materializes views over the mediated schema. This capability is also very important from the customer acquisition perspective. In particular, IT managers often do not want to take on a warehousing effort because of the 6-18 month lead time it requires. On the other hand, it is very attractive to be able to set up a data integration application immediately, and optimize its performance over time by devising caching strategies.

Our architecture raises several research challenges. Most importantly, there is a need for algorithms that decide which data (and over which sources) should be materialized. The problem we face is an extension of the problem of selecting views (and indexes) to materialize in a database system or data warehouse [17, 2]. The further complications in our context are that (1) the data sources are autonomous and may contain overlapping data, (2) we may need to adjust the set of materialized views over time depending on the query load, and (3) we do not have good cost estimates for querying remote data sources (and therefore it is hard to compare those costs with the alternative of materialization).
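A minimal sketch of the routing decision described above, under our own assumptions (a simple staleness threshold and hypothetical names such as MaterializedView and answer): a query over a mediated view is served from a local materialization when one exists and is fresh enough, and otherwise falls back to the virtual approach. The actual policy and interfaces in the product are not shown here.

```python
# Sketch only (hypothetical policy, not the product's optimizer): prefer a
# locally materialized view when it is fresh enough, otherwise go remote.

import time

class MaterializedView:
    def __init__(self, rows, refreshed_at):
        self.rows = rows
        self.refreshed_at = refreshed_at

def answer(view_name, local_store, query_sources, max_staleness_s=3600):
    """Serve from a local copy refreshed within `max_staleness_s` seconds."""
    mv = local_store.get(view_name)
    if mv is not None and time.time() - mv.refreshed_at <= max_staleness_s:
        return mv.rows                      # fast path: local materialization
    return query_sources(view_name)         # fall back to the virtual approach

# Toy usage with a stub standing in for remote query processing.
store = {"customers_by_region": MaterializedView([{"region": "WA", "n": 42}], time.time())}
print(answer("customers_by_region", store, lambda v: []))                 # served locally
print(answer("orders_last_week", store, lambda v: [{"remote": True}]))    # goes remote
```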

3.4. Source availability

In many applications it is never the case that all sources are available: they may be offline, or network connectivity may be lacking. In this situation it is often not acceptable to simply return an error or an empty result. In the worst case, there may be so many data sources that the probability that they are all available simultaneously is nearly zero. We are designing our system to behave intelligently in this situation by providing partial results and indicating to the user that the results are not complete. Some of the challenges include whether and how to allow the query to specify behavior when data sources are unavailable, and what the default behavior should be.
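The sketch below illustrates the partial-result behavior described above with a deliberately simple policy: skip unreachable sources and flag the answer as incomplete. The names and the interface are hypothetical, not the system's actual behavior specification.

```python
# Sketch of partial results under source unavailability (hypothetical interface).

def query_with_partial_results(sources):
    """`sources` maps a source name to a callable that returns rows or raises."""
    rows, unavailable = [], []
    for name, fetch in sources.items():
        try:
            rows.extend(fetch())
        except ConnectionError:
            unavailable.append(name)     # source offline or network down
    return {"rows": rows, "complete": not unavailable, "unavailable": unavailable}

def down():
    raise ConnectionError("source offline")

result = query_with_partial_results({"crm": lambda: [{"id": 1}], "erp": down})
print(result)   # {'rows': [{'id': 1}], 'complete': False, 'unavailable': ['erp']}
```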

4. Conclusions

As this abstract points out, the general data integration problem is a challenging one, because such a wide variety of needs come into play at once. One of the key challenges in building such a system is prioritizing the feature schedule to address our customers' needs as effectively as possible. During the design and implementation of the Nimble integration product, we addressed the following needs:

• General query language features (data types, operators) equivalent to a "standard" SQL query engine.

• XML-related features, particularly those related to document order (XML documents are intrinsically ordered), "navigation"-style access (which includes navigating the XML document structure up, down and sideways), and recursion.

• Robust and reasonably efficient access to a wide variety of data source systems.

• An internal query optimizer that can address the varying query capabilities of different data sources.

• Caching and other performance tuning capabilities.

• Configuration and management tools that make it possible for administrators to set up, monitor, and understand the system.

References

[1] S. Adali, K. S. Candan, Y. Papakonstantinou, and V. S. Subrahmanian. Query caching and optimization in distributed mediator systems. In Proc. of SIGMOD, pages 137-148, 1996.
[2] S. Agrawal, S. Chaudhuri, and V. Narasayya. Automated selection of materialized views and indexes in Microsoft SQL Server. In Proc. of VLDB, Cairo, Egypt, 2000.
[3] W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proc. of SIGMOD, Seattle, WA, 1998.
[4] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. http://www.research.att.com/mff/xml/w3c-note.html, 1998.
[5] R. Domenig and K. Dittrich. An overview and classification of mediated query systems. SIGMOD Record, 28(3):63-72, September 1999.
[6] D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59-74, September 1998.
[7] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. An extensible framework for data cleaning. In Proc. of ICDE, 2000.
[8] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. The TSIMMIS project: Integration of heterogeneous information sources. Journal of Intelligent Information Systems, 8(2):117-132, March 1997.
[9] L. Haas, D. Kossmann, E. Wimmers, and J. Yang. Optimizing queries across diverse data sources. In Proc. of VLDB, Athens, Greece, 1997.
[10] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large datasets. In Proc. of SIGMOD, May 1995.
[11] M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 2:9-37, 1998.
[12] R. Hull. Managing semantic heterogeneity in databases: A theoretical perspective. In Proc. of PODS, pages 51-61, Tucson, Arizona, 1997.
[13] R. Hull and G. Zhou. A framework for supporting data integration using the materialized and virtual approaches. In Proc. of SIGMOD, pages 481-492, Montreal, Canada, 1996.
[14] A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of VLDB, Bombay, India, 1996.
[15] V. Raman and J. Hellerstein. An interactive framework for data cleaning. Technical report, University of California, Berkeley, 2000.
[16] A. Roychoudhury, I. Ramakrishnan, and T. Swift. Rule-based data standardizer for enterprise data bases. In International Conference on Practical Applications of Prolog, 1997.
[17] D. Theodoratos and T. Sellis. Data warehouse design. In Proc. of VLDB, Athens, Greece, 1997.
[18] A. Tomasic, L. Raschid, and P. Valduriez. Scaling access to distributed heterogeneous data sources with Disco. IEEE Transactions on Knowledge and Data Engineering (to appear), 1998.
