REVEALING REAL PROBLEMS IN REAL DATA WAREHOUSE APPLICATIONS

Wolfgang Lehner*, Thomas Ruf+

*) Martin-Luther-Universität Halle-Wittenberg, Institut für Informatik, Kurt-Mothes-Str. 1, D-06120 Halle (Saale), [email protected]

+) Marketing Services Europe (MSE), Nordwestring 101, D-90319 Nürnberg, [email protected]

Abstract

Data warehouse systems offer novel techniques and services for data-rich applications both from a data modeling and a data processing point of view. In this paper, we investigate how well state-of-the-art concepts fulfill the requirements of real-world data warehouse applications. Referring to a concrete example from the market research area, key problems with current approaches are identified in the areas of dimensional modeling, aggregation management, metric definitions, versioning and duality of both master and tracking data, context-sensitive fact calculations, derived attributes, heterogeneous reports, and data security. For some of these problems, solutions are shown both within and beyond the framework of the data warehouse platform used for building GfK's market research data warehouse system. The paper concludes with a list of requirements for extensions of current data warehouse and OLAP systems.


1 Introduction

Following the promises of various OLAP ('Online Analytical Processing'; [CoCS93]) tool and relational database vendors, building a data warehouse is an easy and straightforward task. The purpose of this paper is to share with the reader some observations made and lessons learned from applying state-of-the-art data warehouse technology to a real-world application scenario. It is assumed that the reader is familiar with the basic concepts of data warehousing and On-Line Analytical Processing (OLAP), in particular with dimensional data modeling using classification hierarchies. By discussing how these concepts are used for building the GfK data warehouse system, some general requirements for next-generation data warehouse systems are derived, along with some proposals of how to overcome specific problems in current systems.

Our presentation is based on the experiences in building a data warehouse for GfK Marketing Services. The GfK group (http://www.gfk.de/) is a worldwide operating market research company with more than 3,300 employees and more than $350 million in revenue in 1998. The headquarters is based in Nuremberg, Germany. Besides media, ad-hoc and consumer panel research, GfK offers a bundle of services in the non-food retail panel area. The non-food retail panel, which is run by the organizational unit "GfK Marketing Services", monitors basic market information (e.g. prices, stocks, sales units) from a selected sample of shops at regular time intervals. The data monitored from the sample shops are transformed into a common format, identified (i.e. mapped to the GfK product master data), cleansed, extrapolated to the total markets, and transformed into information on key market factors, e.g. market share and model distribution information. This information helps GfK's customers to measure their performance in the markets they operate in and to optimize their marketing and logistics efforts.

Since 1998, GfK has been developing a new data production and reporting system for its non-food business. The aim is to replace the outdated, host-based system currently in use by a modern, client/server-based solution and thus to substantially extend the data analysis capabilities. The new data production and reporting system is based on data warehouse technology offered by MicroStrategy, which may be classified as a ROLAP (relational OLAP) approach. The underlying database system is Oracle 8 running on an HP Unix server system. To give an impression of the data volume, sales, stock, and price figures are monitored from over 3,000 shops and for more than 250,000 individual articles for Germany alone on a bi-monthly basis. To be able to perform trend-analysis operations, historical data for at least three years must be kept online. The application will be described in more detail throughout the remainder of this paper.

The GfK data warehouse application will be used to reveal general requirements as well as to show specific problems in building the data warehouse system for GfK Marketing Services. Section 2 starts the discussion by describing all modeling-oriented perspectives, with a special focus on comprehensive support of dynamic classification mechanisms and a versioning scheme for master as well as tracking data. Thereafter, we address various requirements
from a data warehouse user and administrator point of view in Section 3. Section 4 outlines the general requirements for an efficient aggregation process; we discuss several perspectives which are, from our practical point of view, essential for implementing an efficient data warehouse system. Section 5 picks up the management of summary tables, which is crucial in the physical database design for a data warehouse. The paper closes with a summary and a conclusion.

2 Modeling the World of Business

Data modeling in data warehouse systems is based on a distinction between quantifying and qualifying data. The former, often also called tracking data or fact data, describe time-variant, typically numerical attribute values (e.g. price or stock data). The latter span a multidimensional descriptive context which is necessary to assign a meaning to the quantifying data (e.g. product, shop, and time information). The typical data warehouse architecture is characterized by a central fact table surrounded by descriptive dimensions. In our application, there are three basic dimensions (product, segment, and time) and one auxiliary dimension (price classes). The basic dimensions directly correspond to the data collection parameters in the field, i.e. measured data are collected in an article/shop/period context; the price class dimension will be described in Sections 2.2 and 2.3.

In a proper dimensional model, dimensions are orthogonal to one another, meaning that values in the dimensions can be selected and combined in an arbitrary manner. If dimensions are modeled properly, they can also be discussed independently from one another. In the following, we will concentrate on the product dimension to elaborate on modeling dimensions internally.
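
To make this structure concrete, the following SQL sketch shows one possible relational layout of such a constellation. All table and column names are illustrative assumptions and do not reproduce the actual GfK schema; later sketches in this paper reuse these hypothetical names.

  -- Illustrative star schema: a central fact table surrounded by dimension tables.
  CREATE TABLE article (                -- product dimension
    article_id     INTEGER PRIMARY KEY,
    product_group  VARCHAR(40),
    product_cat    VARCHAR(40),
    product_sector VARCHAR(40),
    brand          VARCHAR(40)
  );

  CREATE TABLE shop (                   -- segment dimension
    shop_id        INTEGER PRIMARY KEY,
    segment        VARCHAR(40),
    city           VARCHAR(40),
    country        VARCHAR(40)
  );

  CREATE TABLE period (                 -- time dimension (bi-monthly reporting periods)
    period_id      INTEGER PRIMARY KEY,
    period_start   DATE,
    period_end     DATE
  );

  CREATE TABLE sales_fact (             -- tracking (fact) data
    article_id     INTEGER REFERENCES article(article_id),
    shop_id        INTEGER REFERENCES shop(shop_id),
    period_id      INTEGER REFERENCES period(period_id),
    sales_units    INTEGER,
    stock_units    INTEGER,
    price          DECIMAL(10,2),
    PRIMARY KEY (article_id, shop_id, period_id)
  );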

2.1 Context-Sensitive Dimensional Attributes

As already outlined in the introduction, the data schema for our data warehouse project seems to fit very seamlessly into what is commonly known as a star or snowflake schema. As we will discuss in this section, however, there are a lot of requirements which cannot be directly supported by such a simple approach. Nevertheless, we believe that our list of defects is not specific to the market research application domain but applies to a wide range of applications.

The product dimension is hierarchically organized; single articles are grouped into product groups, these in turn are recursively grouped into product categories and those into product sectors. Each product group holds a set of features describing the properties of the products within that specific group.


According to the general multidimensional modeling idea, dimensional attributes are used to further describe elements of a dimension. For the product dimension, typical dimensional attributes are brand, color, or packaging type, which may be assigned globally to every article. However, our product dimension additionally holds local dimensional attributes. For example, the feature 'video system' is only valid for video-related articles, whereas the average water usage may be applicable only to the product group of 'dish washers'. Conceptually, the dimensional hierarchy represents an inheritance hierarchy, where each node within that hierarchy reflects a specialization of its parent node.

Although local dimensional attributes are a very natural phenomenon, we could not find sufficient support for them in any commercial product. Therefore, we demand further support from a modeling perspective ([Lehn98]) as well as from an architectural point of view, exploiting the already existing object-relational features of commercial database systems. We currently solve the problem of attributes specific to a certain product group by introducing an additional layer, which maps locally valid properties to generic global properties. A generic attribute 'M02' may therefore reflect the property 'video system' in the product group of camcorders and, within the same data schema, denote the property 'water usage' for all dish washers. Unfortunately, no OLAP vendor was able to provide such a resolution, so that we are currently implementing this layer in our end-user applications.
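
A minimal sketch of such a mapping layer, assuming the hypothetical tables introduced above, might look as follows; the slot name 'M02' is taken from the example, everything else is invented for illustration.

  -- Dictionary giving the generic feature slots a product-group-specific meaning.
  CREATE TABLE feature_map (
    product_group VARCHAR(40),
    generic_slot  CHAR(3),              -- e.g. 'M02'
    feature_name  VARCHAR(40),          -- e.g. 'video system' or 'water usage'
    PRIMARY KEY (product_group, generic_slot)
  );

  -- Feature values stored per article against the generic slots.
  CREATE TABLE article_feature (
    article_id    INTEGER REFERENCES article(article_id),
    generic_slot  CHAR(3),
    feature_value VARCHAR(40),
    PRIMARY KEY (article_id, generic_slot)
  );

  -- Resolving the generic slots back to their local meaning, here for camcorders:
  SELECT a.article_id, m.feature_name, f.feature_value
  FROM   article a
  JOIN   article_feature f ON f.article_id   = a.article_id
  JOIN   feature_map m     ON m.product_group = a.product_group
                          AND m.generic_slot  = f.generic_slot
  WHERE  a.product_group = 'camcorder';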

2.2 Secondary and Ad-Hoc Classifications

Tightly coupled with the aforementioned specialized properties of dimensional elements is the requirement to define secondary classifications. Many OLAP tools provide that mechanism as a parallel classification ('week' and 'month' classifications are parallel under the 'year' node). In our context, the existence of a secondary classification depends on a specific product group and primarily refers to a dimensional attribute. Moreover, classifications based on values of dimensional attributes are not necessarily balanced. A typical example might be 'sales of camcorders split by VideoSystemGroups', where 'VideoSystemGroup' is defined as a secondary classification based on the specific VideoSystem of a single camcorder. For example, the VideoSystem instances 'VHS', 'VHS-C' and 'S-VHS' may be classified into the VideoSystemGroup 'VHS' (see the sketch below).

In combination with this requirement, our typical power users, who work intensively with the data warehouse system, want to be able to create classifications 'on the fly' to test whether the baskets of products created in this way yield classes of higher value for further analysis. This is important from a modeling point of view, from an end-user program design point of view, and especially with regard to the physical support of the new aggregation combinations that come along with such 'ad-hoc' classifications.
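
The following query is a hedged sketch of such a secondary classification, formulated against the hypothetical tables from above; the grouping of video systems other than VHS is invented for illustration.

  -- Camcorder sales split by VideoSystemGroup, derived from the article-level
  -- 'video system' feature (stored in generic slot 'M02' for this product group).
  SELECT CASE WHEN f.feature_value IN ('VHS', 'VHS-C', 'S-VHS') THEN 'VHS'
              ELSE 'other'
         END                AS video_system_group,
         SUM(s.sales_units) AS total_sales
  FROM   sales_fact s
  JOIN   article a         ON a.article_id = s.article_id
  JOIN   article_feature f ON f.article_id = a.article_id
                          AND f.generic_slot = 'M02'
  WHERE  a.product_group = 'camcorder'
  GROUP  BY CASE WHEN f.feature_value IN ('VHS', 'VHS-C', 'S-VHS') THEN 'VHS'
                 ELSE 'other'
            END;

An ad-hoc classification then corresponds to the user editing the CASE expression (or an equivalent mapping table) interactively, which is exactly the operation for which we miss modeling and aggregation support.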


The third observation in this context may be denoted as 'functionally determined classifications'. For example, price classifications are defined so that each class of the classification subsumes the same number of articles. Since the price of an article is recorded in the context of tracking data, this implies that tracking data have an influence on the design of master data! Unfortunately, we have not found any OLAP product supporting these requirements to our full satisfaction, resulting again in a high additional implementation overhead for our data warehouse project.
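
A minimal sketch of such a functionally determined classification, assuming the hypothetical schema from above and a windowed NTILE function (available in current SQL dialects, though not on the platform described here):

  -- Assign the articles of one product group to four price classes of roughly
  -- equal size, based on the average price observed in the tracking data.
  SELECT s.article_id,
         AVG(s.price)                          AS avg_price,
         NTILE(4) OVER (ORDER BY AVG(s.price)) AS price_class
  FROM   sales_fact s
  JOIN   article a ON a.article_id = s.article_id
  WHERE  a.product_group = 'dish washer'
  AND    s.period_id = 199806                  -- hypothetical period identifier
  GROUP  BY s.article_id;

The point of the example is that the class boundaries, and hence the master data, shift whenever the underlying tracking data change.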

2.3 Dynamic Classifications

A further major issue from a modeling perspective is that classifications within our data warehouse schema are not static but change dynamically. In general, master data changes must be considered under different perspectives. We will outline the problems and resulting requirements again referring to the product dimension of our concrete scenario. Since each market is changing rapidly, it is obvious that new articles are added to the master database nearly every day. The assignment of an article to a product group is extremely critical, since the presence or absence of specific articles has a great semantic impact on the market shares within a single segment. On the other hand, single items of a dimensional structure must never be deleted. Instead, each item must be accompanied by a valid-time indicator denoting the time frame during which the specific item has been valid. The third case reflects the situation when a single item moves from one product group to another. This may happen explicitly or implicitly. The explicit move refers to the situation when a new class is introduced and populated with already existing products. The implicit move of a single article corresponds to the situation when the corresponding classification is functionally determined. For example, since the assignment of a product to a specific price class depends on the (average) price of that product, a product may move from one class to another if the price of that product changes. Consider a range query over multiple periods: in this case, the system has to decide which price (and which price classification) the query has to refer to.

The more general problem, which from our point of view needs a detailed investigation, is known as the 'duality of master and tracking data' ([Shos82]). On the one hand, price figures, for example, are gathered periodically in the context of tracking data, and minimum/average/maximum prices are derived from these detailed data. On the other hand, price information is used to define the classifications and therefore should be considered and treated like master data.

These characteristics require a comprehensive management of the validity and versioning of dimensional structures (or of single items within a dimension). Especially in combination with ad-hoc classifications, attention has to be paid to the notion of different variants of dimensional structures. To validate former query results, each valid state of a dimensional structure must be easily retrievable and multidimensional data must be consistently queryable.


Technically, versioning may be implemented with explicit time stamps ('valid_from', 'valid_through') or implicitly by the use of surrogate keys. In the latter case, the multi-column key consisting of the identifiers of all dimensional attributes which have to be managed under version control is replaced by a single-column surrogate key, which is also used as the foreign key to the fact table. Whenever one of the underlying dimensional attributes changes, a new surrogate key is created. This allows the data to be accessed alternatively in a version-aware (key-based access) and a version-free (ID-based access) manner; validity information may be derived from the first and last occurrences of a specific key in the fact table.

Without going into detail, we note at this point that not only the versioning of master data but also the versioning of tracking data is of fundamental importance. As an example for that requirement, consider that a single shop reports incorrect sales figures (which, in the real world, happens all the time!), and derived information has already been delivered to the customer when the sales figures in the warehouse database are corrected. To provide report consistency (for example w.r.t. yearly summaries), it must always be possible to refer to the old and wrong, but already delivered tracking data.
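
A sketch of a versioned product dimension, combining explicit validity timestamps with a surrogate key for illustration (all names hypothetical):

  CREATE TABLE article_version (
    article_key   INTEGER PRIMARY KEY,   -- surrogate key, referenced by the fact table
    article_id    INTEGER NOT NULL,      -- stable business identifier
    product_group VARCHAR(40),
    price_class   INTEGER,
    valid_from    DATE NOT NULL,
    valid_through DATE                   -- NULL marks the current version
  );

  -- Version-aware access: the state of the dimension as of a given date.
  SELECT * FROM article_version
  WHERE  valid_from <= DATE '1998-06-30'
  AND   (valid_through IS NULL OR valid_through >= DATE '1998-06-30');

  -- Version-free access: all versions of one article, newest first.
  SELECT * FROM article_version
  WHERE  article_id = 4711
  ORDER  BY valid_from DESC;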

3 User-Friendly Warehouse Access

In this section, we will briefly summarize the problems and requirements of data warehouse access from a user point of view. We will omit classical requirements like 'as fast as the old system' or 'as flexible as my current spreadsheet program', since such requirements are of fundamental importance for the acceptance of every software system and need not be repeated here.

3.1 Report Management, Design, and Access

To outline the requirements for a comprehensive report management, we again refer to our scenario, where for each reporting period thousands of predefined reports and charts have to be generated. It is easy to imagine that many persons from different departments are involved in developing and designing those reports and charts. Therefore, the possibility of concurrent access to a library of such reports is crucial. Moreover, reports again may be classified according to specific characteristics ('ranking reports', 'running reports', 'distributions', ...). Furthermore, reports have to be subject to a versioning process, since report definitions evolve over time. A comprehensive report library must consider general report definitions as well as definitions specific to single customers and groups of customers. As a final point, we would like to mention that report definitions must be subject to user access restrictions. We will elaborate on this point in more detail in Section 3.4.


From a report management point of view, we could not find any commercial solution which provides all of the requirements mentioned above. Our current implementation basis (DSS Suite; [MSI98]) allows templates to be grouped and reports to be organized, at least in a quite restrictive manner. Versioning of reports is not supported. From the design and access point of view, we experienced that current products are quite restrictive in generating complex OLAP reports. In our case, single reports do not consist of a single homogeneous spreadsheet but of a combination of several spreadsheets, which are designed following the wishes of our customers (and not some algorithmic or schematic principle). Unfortunately, current products support complex reports defined only on a schematic level (i.e. by giving attribute combinations). As explained in full detail in [RuGR99], we were forced to implement our own way of defining composite reports at the instance level.

3.2 Proactive Information Delivery

As the concept of data warehousing becomes more and more attractive in everyday business, there is a growing need for proactive information delivery as an alternative way to retrieve knowledge from the data warehouse database. The supported scenario should be as follows: In a first step, a user poses an Inquire() call to retrieve a list of channels or sources of information. In a second step, the user may submit a Subscription() call to place an order for incoming information. This 'order' is registered in the data warehouse system. As soon as new information arrives in the system and the corresponding delivery property is satisfied, the subscription is evaluated and the result is delivered to the appropriate user. These pieces of information may be delivered in various ways: small and urgent information may be delivered using mobile communication techniques (like SMS messages to cell phones); customized information may be translated into different data formats (like Excel) and prepared for download (laptop computers) or synchronization (handheld computers). It should be noted here that the process of proactive information delivery is complementary to the query-driven analysis methods mentioned before. The technique targets a preliminary step for getting informed about important and interesting changes in the data warehouse.
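
As a rough sketch of how such a subscription mechanism could be registered on the relational side (the Inquire()/Subscription() calls themselves belong to the application layer; all names, including the load log, are assumptions):

  CREATE TABLE channel (                  -- catalogue returned by Inquire()
    channel_id  INTEGER PRIMARY KEY,
    description VARCHAR(200)              -- e.g. 'market share camcorders, Germany'
  );

  CREATE TABLE subscription (             -- standing orders placed via Subscription()
    user_id      INTEGER,
    channel_id   INTEGER REFERENCES channel(channel_id),
    delivery_via VARCHAR(10),             -- e.g. 'SMS' or 'EXCEL'
    threshold    DECIMAL(5,2),            -- deliver only if the change exceeds this value
    PRIMARY KEY (user_id, channel_id)
  );

  -- Evaluated after each data load: which subscriptions are due for delivery?
  SELECT s.user_id, s.channel_id, s.delivery_via
  FROM   subscription s
  JOIN   channel_update u ON u.channel_id = s.channel_id   -- hypothetical load log
  WHERE  u.change_pct >= s.threshold;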

3.3 Programming Interface

Since a data warehouse in general targets the information supply for a whole company, we learned that a data warehouse has to be open even for department-specific extensions which are not covered directly by the data warehouse project team (for financial or political reasons). Therefore, we believe that a data warehouse system has to provide efficient access to its data for the end user as well as for application programmers in other departments or organizational units.


With an underlying relational database system (see Section 4.1 for a discussion of the system architecture), SQL is one way, but certainly not the right way, to access multidimensional information. Our current implementation of end-user tools ([RuGR99]) is based on the object-oriented interface provided by our commercial OLAP engine (DSS Objects). This programming interface provides easy-to-use, high-level access to existing data warehouse 'objects' (like report definitions). On the other hand, this interface is highly proprietary w.r.t. the specific OLAP solution. On this point, we demand standardized, high-level access to multidimensional data. One promising approach is the current state of OLE DB, especially in conjunction with the 'MultiDimensional eXpressions' language MDX ([MSC98]).

3.4 User-Friendliness Unlimited?

When talking about fast and user-friendly access to data, we also have to discuss the other side of that coin - the security of detailed tracking data. As we have already mentioned, shops deliver their very critical business data (like sales and turnover figures) to GfK down to the individual article level. Thus, it is extremely important to keep those detailed tracking data secret from the public and especially from the main competitors of a specific data provider. Unfortunately, this cannot be achieved simply by rejecting queries asking for specific data. It is much more challenging (and unfortunately we see nobody who is currently dealing with that problem) to recognize and deny access to so-called tracker queries dynamically at run-time. Tracker queries are mathematically well understood ([Mich91]) and may be seen as a set of queries retrieving just regular-looking aggregated information. However, combining the results of a sequence of non-critical queries may narrow down the data to the single data item level, especially if some knowledge can be derived outside of the database (e.g. about trade brands that are only sold in specific shops). This problem becomes even more critical when we think about connecting our warehouse database to the WWW. We demand (further) research efforts in this area and believe that this is a general problem and not only specific to our scenario.
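
To illustrate the threat, the following pair of queries (against the hypothetical schema from Section 2, with invented shop and city values) is harmless in isolation but reveals individual tracking data in combination:

  -- Q1: total sales of all sample shops in one city
  SELECT SUM(s.sales_units)
  FROM   sales_fact s JOIN shop sh ON sh.shop_id = s.shop_id
  WHERE  sh.city = 'Nuremberg' AND s.period_id = 199806;

  -- Q2: the same aggregate with one shop excluded
  SELECT SUM(s.sales_units)
  FROM   sales_fact s JOIN shop sh ON sh.shop_id = s.shop_id
  WHERE  sh.city = 'Nuremberg' AND sh.shop_id <> 4711 AND s.period_id = 199806;

  -- The difference Q1 - Q2 is the exact sales figure of shop 4711, although
  -- neither query touches individual tracking data directly.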

4 Ways of Optimizing Aggregation Processing

Aggregation processing is the most challenging and mission-critical point in building an efficient data warehouse system. In this section, we will outline some ways of how to optimize aggregation processing in a data warehouse system. We will enumerate the most interesting points and detail the current state of research and commercial systems. Moreover, we will attach a list of requirements and demands which are, from our practical point of view, most beneficial for a successful data warehouse project.


4.1 The System Architecture

With the advent of the idea of data warehousing and multidimensional data exploration, there has been a long discussion about the right way of bringing multidimensional data 'down to the metal', i.e. mapping data cubes to main memory and to hard disk ([Coll96]). In the meantime, it has turned out that storing the high volumes of detailed tracking data in a relational system is most promising, especially from a scalability point of view. For evaluation purposes, we initially performed a case study in the context of our data warehouse project, in which we compared a relational with a multidimensional database system for a pre-defined and as far as possible representative set of queries ([LeRT95]). From our investigations, we strongly support the recommendation that multidimensional data structures are suited mainly for organizing data warehouse information in main memory and for storing data of small 'Desktop Online Analytical Processing' applications on disk (e.g. Plato [MSC99]). Hybrid architectures try to combine both techniques: they are either derived from the relational approach and provide a multidimensional caching mechanism at the client side, or they come with a truly multidimensionally organized data cube enabling a 'drill-through' to a relational database system for the very detailed tracking data.

Although relational database technology has proven to be reliable as well as scalable up to multiple terabytes, running a relational database system in a data warehouse mode shows different usage patterns compared to running a database system in a transactional mode. Instead of single-row access, large volumes of data are touched by a single query and mostly aggregated along pre-defined hierarchies (see Section 2). To speed up data warehouse queries by aggregation support, several independent as well as combined techniques have been proposed in the literature and are being implemented step by step in commercial relational database products. The current state of the art in commercially available systems is that aggregates are registered in the data warehouse system and automatically used if they match the query predicate. Little work is known so far in the area of self-adapting aggregation management techniques and partial-match predicate usage.

4.2 Support of Specialized Index Structures

Whereas traditional B-tree-like index structures are useful for queries with high selectivity (e.g. "give me the address of Mr. Miller"), the old idea of bit-wise index structures ([ONei87]) has seen a rejuvenation in the context of data warehouse applications ([ONQu97]). In contrast to B-tree index structures, bit-wise indexes are designed to speed up queries ranging over attributes with a low cardinality (e.g. sales by gender). Nowadays, every major relational database system has implemented a flavor of those index structures, varying mostly in compression methods and extensions to join indexes. Pointing in the same direction as summary tables, join indexes are basically tables holding tuple identifiers of a precomputed join operation between the fact table and a dimension table of a specific star or snowflake schema.


In our application scenario, all of these techniques may be applied, as the cardinality of frequently needed data warehouse attributes ranges from 2 (e.g. 'yes'/'no' or 'with'/'without') to a couple of hundred (e.g. brands).
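
As a sketch, bitmap indexes on two of the low-cardinality attributes mentioned above could be declared as follows; the statements use Oracle-style syntax, and the bitmap join index shown last is only supported by later releases of some systems:

  CREATE BITMAP INDEX idx_article_brand ON article (brand);
  CREATE BITMAP INDEX idx_shop_segment  ON shop (segment);

  -- A bitmap join index: the precomputed join between fact and dimension table
  -- is indexed by a dimension attribute.
  CREATE BITMAP INDEX idx_fact_brand ON sales_fact (article.brand)
    FROM sales_fact, article
    WHERE sales_fact.article_id = article.article_id;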

4.3 Sampling

Tracing its roots back to the area of Statistical Databases ([Olke93]), Informix (http://www.informix.com/) was the first major database vendor to implement a sampling mechanism to speed up queries, especially in the development and testing phase of a database application. New approaches like [HeHW97] try to apply sampling techniques iteratively, yielding a result as exact as the user wants it to be. Sampling techniques in general show a high potential especially in the context of risk analyses, trend explorations, or trend forecast applications. In our application, the concept must not be confused with sampling during data collection: whilst the data collection sample defines the universe of data to be evaluated, sampling during query execution relates to subsets of data within the universe of discourse. Since we believe that sampling serves a wide class of applications, we demand that sampling techniques in relational databases be exploited much more.
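
A hedged sketch of query-time sampling over the hypothetical fact table (the TABLESAMPLE clause shown follows the SQL-standard flavor; individual systems use their own variants, such as Oracle's SAMPLE clause):

  -- Estimate total sales for one period from a 1 percent row sample;
  -- the sum is scaled up by the inverse sampling fraction.
  SELECT SUM(sales_units) * 100 AS estimated_total_sales
  FROM   sales_fact TABLESAMPLE SYSTEM (1)
  WHERE  period_id = 199806;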

4.4 Materialized Views

Based on the concept of database snapshots ([AdLi80]), summary tables reflect a special form of materialized views whose defining queries make heavy use of aggregation operations. To provide efficient access to highly aggregated data, using, maintaining, and selecting the appropriate set of summary tables in a data warehouse application is of paramount importance. Although the concept of precomputing summary data is a well-known technique in the SSDB ('Statistical & Scientific Databases', [Shos82], [ChMc89]) area, adequate support for summary tables is a hot topic in relational database research and within the commercial database community. Since summary tables are crucial to a successful data warehouse project, we will elaborate on the requirements coming along with this technique in Section 5.
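
As a sketch, a summary table at the product-group/segment/period level could be declared as a materialized view roughly as follows (Oracle-flavored syntax; other systems use dedicated summary-table DDL or plain tables refreshed by load scripts):

  CREATE MATERIALIZED VIEW sales_by_group_segment
    BUILD IMMEDIATE
    ENABLE QUERY REWRITE
  AS
  SELECT a.product_group, sh.segment, s.period_id,
         SUM(s.sales_units) AS sales_units,
         COUNT(*)           AS row_cnt      -- helpful for later incremental refresh
  FROM   sales_fact s
  JOIN   article a ON a.article_id = s.article_id
  JOIN   shop sh   ON sh.shop_id   = s.shop_id
  GROUP  BY a.product_group, sh.segment, s.period_id;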

4.5 Multiple Query Optimization

Current database technology is primarily designed to isolate single queries from one another (isolation as part of the ACID concept for transaction processing). In the context of data warehousing, however, it is quite common to know a priori a set of queries whose results have to be computed on the same set of tables, i.e. a single star schema. This characteristic provides a sound basis for the application of multiple query optimization. Prior work in this area ([Sell88]) focused on finding common join operations and predicates within a set of queries. This approach failed mainly due to the complexity of handling general predicates. From a data warehouse point of view, the focus of multiple query optimization is on exploiting common aggregation levels.


The work of [YaKl97] reflects the current state of research in this area. Reinforced by the experiences from our concrete application scenario, where thousands of predefined reports have to be computed in every reporting period, we believe that the technique of multiple query optimization in some kind of batch mode would provide an enormous optimization potential.
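
A simple form of this idea can already be emulated by hand: compute one shared intermediate aggregate per batch run and derive the individual reports from it instead of scanning the fact table once per report (names again hypothetical):

  -- Shared intermediate aggregate for one batch of reports.
  CREATE TABLE batch_base AS
  SELECT a.product_group, sh.segment, s.period_id,
         SUM(s.sales_units) AS sales_units
  FROM   sales_fact s
  JOIN   article a ON a.article_id = s.article_id
  JOIN   shop sh   ON sh.shop_id   = s.shop_id
  GROUP  BY a.product_group, sh.segment, s.period_id;

  -- Report 1: sales per product group
  SELECT product_group, SUM(sales_units) FROM batch_base GROUP BY product_group;

  -- Report 2: sales per segment and period
  SELECT segment, period_id, SUM(sales_units) FROM batch_base GROUP BY segment, period_id;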

4.6 Handling Historical Data

The last point in our (by no means complete) list of optimization potentials for efficient aggregation processing is dealing with historical data. According to the widespread definition of a 'data warehouse' ([Kimb96]), keeping historical data is one of the four distinctive characteristics of a data warehouse. Unfortunately, as we had to learn in our application scenario, historical data causes problems in several respects, for which we demand extensive support from the underlying database system.

As outlined earlier, the raw data of the current period may change slightly during a reporting period, e.g. when some shops delay the reporting of their sales figures or correct them afterwards. Therefore, the data of the current period has to be stored physically in a different way than the data from older periods, e.g. with a higher RAID level. As the data becomes more and more stable, the physical organization of the data may be changed to less cost-intensive forms. Migrating stable data away from a high RAID level is not only an issue of saving money; moving it to cheaper storage formats, which are often optimized for read-only access, may even increase the data access rate.

As we have seen in our application, it is often the case that very detailed tracking data from former reporting periods need not necessarily be kept online. It should be possible for such data to be transparently migrated to a near-line tape archive, for example, whenever possible. However, the data still has to be accessible by the database system in an application-transparent way. This migration policy has to be combined with a comprehensive summary table management. Up to now, we see nobody spending effort in this area, either on the commercial side or within the research community.
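
Range partitioning of the fact table by reporting period is one way to prepare for such a migration policy; the sketch below uses Oracle-style partitioning syntax and invented period identifiers:

  CREATE TABLE sales_fact_part (
    article_id  INTEGER,
    shop_id     INTEGER,
    period_id   INTEGER,
    sales_units INTEGER,
    price       DECIMAL(10,2)
  )
  PARTITION BY RANGE (period_id) (
    PARTITION p_1997   VALUES LESS THAN (199801),
    PARTITION p_1998_1 VALUES LESS THAN (199807),
    PARTITION p_1998_2 VALUES LESS THAN (199901)
  );

  -- Example administrative step: move a stable partition to a cheaper,
  -- read-mostly tablespace (details are vendor-specific).
  ALTER TABLE sales_fact_part MOVE PARTITION p_1997 TABLESPACE archive_ts;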

5 The Magic Triangle of Summary Tables

The concept of summary tables addresses three different functional perspectives which have to be adequately supported within an efficient data warehouse system. The requirements of using, maintaining, and selecting the 'best' set of summary tables define our so-called 'Magic Triangle of Summary Tables'. In addition to these functional perspectives, we will address the question of which architectural component has to provide a sophisticated management of summary tables, yielding the greatest benefit at the lowest additional cost.


5.1 Transparent Use of Summary Tables

Assuming that precomputed summary tables already exist, the first step is to take advantage of their existence and provide a speed-up in answering incoming queries. This could be done either at the query specification level, where the user has to be aware of existing summary tables, or transparently within the ROLAP server or inside the relational database engine. An explicit use of summary data in end-user queries shifts the management of the set of summary tables to the application level and should not be taken into consideration. Internal query re-routing at the level of the ROLAP server implies, on the one hand, that the server has to have appropriate knowledge of the existence of summary tables. On the other hand, the OLAP engine usually has detailed knowledge of the functional dependencies prevailing in the dimensional structures (cf. the article - product group - product category - product sector hierarchy mentioned earlier in this paper). As long as the underlying relational schema is not normalized (which is the normal case in data warehouse applications), information about functional dependencies inside a table is not available within a relational engine. To overcome these restrictions and to use that knowledge for testing the derivability of an incoming query, SQL extensions like the 'create hierarchy' statement in RiSQL ([RBS98]) are required.

Current derivability algorithms (e.g. [ScSV96]) are limited to 'equal match' or 'query containment' situations, where the query and the summary table must either match exactly, or the query must be completely derivable from the underlying summary table. A first proposal removing this limitation and deriving a single query from a set of summary tables ('set-derivability') is presented in [AlGL99]. Since the transparent use of summary data is state-of-the-art in modern database systems, we demand the implementation of extended derivability techniques and a more convenient way to formulate user-defined aggregation operations, reflecting the increasing need for complex and application-oriented statistical analyses.
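
The following sketch illustrates the kind of rewrite we would like the system to perform automatically: a query at the product category level is answered from the product-group-level summary table of Section 4.4, exploiting the functional dependency from product group to product category (shown here as a manual rewrite against the hypothetical schema):

  -- Original query against the fact table:
  SELECT a.product_cat, SUM(s.sales_units)
  FROM   sales_fact s JOIN article a ON a.article_id = s.article_id
  GROUP  BY a.product_cat;

  -- Equivalent query against the much smaller summary table:
  SELECT g.product_cat, SUM(v.sales_units)
  FROM   sales_by_group_segment v
  JOIN   (SELECT DISTINCT product_group, product_cat FROM article) g
         ON g.product_group = v.product_group
  GROUP  BY g.product_cat;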

5.2 Maintaining Summary Data

As already outlined in the modeling section, tracking data in data warehouse applications is only stable from a theoretical point of view. Most applications require changes to the fact data after the production of generic summary data or specific pre-defined end-user reports. As a consequence, derived data have to be maintained whenever the base data change. The research community has been tackling the aggregate maintenance problem for the last few years, resulting in highly sophisticated algorithms (see [GuMa95] and [MuQM97] for an excellent overview). In contrast, commercial database systems have hardly started to implement those features. Although a relational database system seems to be the right place for such 'repair' operations (the changes are made to the base data residing in the relational engine), many derived data, like the results of complex distribution or derivation analyses, are typically computed inside the OLAP server and therefore have to be maintained under the direct control of the OLAP server.
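
To illustrate the kind of 'repair' operation involved, the following sketch applies a set of corrections incrementally to a summary table kept as an ordinary table (for a database-managed materialized view the built-in refresh mechanism would be used instead); both table names are pure assumptions.

  -- sales_summary(product_group, segment, period_id, sales_units) holds the
  -- pre-aggregated figures; correction_delta(article_id, shop_id, period_id,
  -- delta_units) holds the corrections reported by the shops.
  UPDATE sales_summary v
  SET    sales_units = sales_units +
         (SELECT SUM(d.delta_units)
          FROM   correction_delta d
          JOIN   article a ON a.article_id = d.article_id
          JOIN   shop sh   ON sh.shop_id   = d.shop_id
          WHERE  a.product_group = v.product_group
          AND    sh.segment      = v.segment
          AND    d.period_id     = v.period_id)
  WHERE  EXISTS (SELECT 1
                 FROM   correction_delta d
                 JOIN   article a ON a.article_id = d.article_id
                 JOIN   shop sh   ON sh.shop_id   = d.shop_id
                 WHERE  a.product_group = v.product_group
                 AND    sh.segment      = v.segment
                 AND    d.period_id     = v.period_id);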


In general, we see that a lot of work still needs to be done in this important area. Especially in the context of user-defined aggregation functions, it becomes crucial for relational database systems to fulfill the ever-increasing requirements with regard to complex aggregation functions. We see the need for a way to define an aggregation function in combination with a buddy function describing an (as far as possible incremental) maintenance algorithm for the original aggregation function.

5.3 Selection of the 'Best' Set of Summary Tables

The third perspective in the context of summary data management is the question of which attribute combination yields the best performance gain in query processing and should therefore become a materialized summary table ([Pend98]). Optimality in this case is determined by the (estimated) size of the summary data, by the reference frequency of the attribute combination (either directly or indirectly through queries referring to a combination which is derivable from that specific combination), and by the savings potential in relation to the raw data or to the next summary table from which this specific combination is derivable. Current research proposals (e.g. [GHRU97], [BaPT97], and [Gupt97]) address that problem only with respect to attribute combinations. Therefore, these algorithms lack appropriate support for the analysis of hot spots like 'the last two periods', because they do not take data partitions into consideration. The only work known to us in this area is [DRSN98] and [AlGL99].

This perspective comes along with the requirement of adapting the set of summary tables to the users' reference behavior. Ideally, the database system should automatically determine the best set of summary tables. The reality on the commercial side is that the database administrator has to explicitly define and populate summary tables. Current research has already picked up this topic in the context of automatically determining the set of appropriate indexes. Commercial products like Red Brick Vista ([RBS98]) or OLAP Services ([MSC99]) also provide a first solution to that problem. From our data warehouse application point of view, content-based and dynamically organized sets of summary tables are of fundamental importance to ensure adequate query response times, and we demand further development in this area. The current state of the art is to provide an administrator with some hints about good summary tables ([MSI98]), which is far from being optimal.

6 Summary and Conclusion

In this paper, we have presented our experiences from building a market research data warehouse for GfK Marketing Services. We have addressed the problems from a modeling point of view (especially modeling a dimensional hierarchy as an inheritance hierarchy), from a user's
access point of view (management of report libraries and data access denial for tracker queries), and we stated our requirements regarding physical database design considerations in the context of data warehousing. As far as we can see, existing products only cover some core data warehouse functions, but lack sophisticated modeling, administration, and operation support for many real-world problems. Unfortunately, it seems that the vendors prefer to pursue new opportunities (e.g. broadcasting services over the Web) rather than strengthening their core systems with badly needed functionality. The worst thing that could happen would be that the responsibility for finding proper solutions to challenging requirements is shipped back to the end user.


References

AdLi80   Adiba, M.E.; Lindsay, B.G.: Database Snapshots. In: Proceedings of the 6th International Conference on Very Large Data Bases (VLDB'80, Montreal, Canada, Oct. 1-3), 1980, pp. 86-91

AlGL99   Albrecht, J.; Guenzel, H.; Lehner, W.: Foundations for the Derivability of Multidimensional Aggregates, submitted to DAWAK'99

BaPT97   Baralis, E.; Paraboschi, S.; Teniente, E.: Materialized Views Selection in a Multidimensional Database. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97, Athens, Greece, Aug. 25-29), 1997, pp. 156-165

ChMc89   Chen, M.C.; McNamee, L.P.: On the Data Model and Access Method of Summary Data Management. In: IEEE Transactions on Knowledge and Data Engineering 1(1989)4, pp. 519-529

CoCS93   Codd, E.F.; Codd, S.B.; Salley, C.T.: Providing OLAP (On-line Analytical Processing) to User Analysts: An IT Mandate. White Paper, Arbor Software Corporation, 1993

Coll96   Colliat, G.: OLAP, Relational, and Multidimensional Database Systems. In: ACM SIGMOD Record 25(1996)3, pp. 64-69

DRSN98   Deshpande, P.M.; Ramasamy, K.; Shukla, A.; Naughton, J.F.: Caching Multidimensional Queries Using Chunks. In: Proceedings of the 27th International Conference on Management of Data (SIGMOD'98, Seattle (WA), June 2-4), 1998, pp. 259-270

GHRU97   Gupta, H.; Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Index Selection for OLAP. In: Proceedings of the 13th International Conference on Data Engineering (ICDE'97, Birmingham, Great Britain, April 7-11), 1997, pp. 208-219

GuMa95   Gupta, A.; Mumick, I.: Maintenance of Materialized Views: Problems, Techniques, and Applications. In: IEEE Data Engineering Bulletin 18(1995)2, pp. 3-18

Gupt97   Gupta, H.: Selection of Views to Materialize in a Data Warehouse. In: Proceedings of the 6th International Conference on Database Theory (ICDT'97, Delphi, Greece, Jan. 8-10), 1997, pp. 98-112

HeHW97   Hellerstein, J.M.; Haas, P.J.; Wang, H.J.: Online Aggregation. In: Proceedings of the 26th International Conference on Management of Data (SIGMOD'97, Tucson (AZ), May 13-15), 1997, pp. 171-182

Kimb96   Kimball, R.: The Data Warehouse Toolkit, 2nd edition. New York, Chichester, Brisbane, Toronto, Singapore: John Wiley & Sons, Inc., 1996

LeRT95   Lehner, W.; Ruf, T.; Teschke, M.: Data Management in Scientific Computing: A Study in Market Research. In: Proceedings of the International Conference on Applications of Databases (ADB'95, Santa Clara (CA), Dec. 13-15), 1995, pp. 31-35

Lehn98   Lehner, W.: Modeling Large Scale OLAP Scenarios. In: Proceedings of the 6th International Conference on Extending Database Technology (EDBT'98, Valencia, Spain, March 23-27), 1998, pp. 153-167

Mich91   Michalewicz, Z. (Ed.): Statistical and Scientific Databases. Chichester, West Sussex, England: Ellis Horwood Limited, 1991

MSC98    Microsoft Corporation: OLE DB and OLE DB for OLAP Specification, 1999 (http://www.microsoft.com/data/oledb/)

MSC99    Microsoft Corporation: SQL Server 7.0 OLAP Services, 1999

MSI98    MicroStrategy, Inc.: DSS Suite, 1998

MuQM97   Mumick, I.; Quass, D.; Mumick, B.: Maintenance of Data Cubes and Summary Tables in a Warehouse. In: Proceedings of the 26th International Conference on Management of Data (SIGMOD'97, Tucson (AZ), May 13-15), 1997, pp. 100-111

Olke93   Olken, F.: Random Sampling from Databases. Technical Report 32883, University of California Berkeley; Lawrence Berkeley Laboratory, Berkeley (CA), April 1993

ONei87   O'Neil, P.: Model 204: Architecture and Performance. In: Gawlick, D.; Haynie, M.; Reuter, A. (Eds.): High Performance Transaction Systems. Lecture Notes in Computer Science 359, Springer, 1987

ONQu97   O'Neil, P.; Quass, D.: Improved Query Performance with Variant Indexes. In: Proceedings of the 26th International Conference on Management of Data (SIGMOD'97, Tucson (AZ), May 13-15), 1997, pp. 38-49

Pend98   Pendse, N.: Database Explosion. Business Intelligence Ltd., 1998 (http://www.olapreport.com/DatabaseExplosion.htm)

RBS98    Red Brick Systems, Inc.: Red Brick Vista Aggregate Computation and Management. White Paper, 1998

RuGR99   Ruf, T.; Goerlich, J.; Reinfels, I.: Complex Report Support in Data Warehouse and OLAP Environments, submitted to DAWAK'99

ScSV96   Scheuermann, P.; Shim, J.; Vingralek, R.: WATCHMAN: A Data Warehouse Intelligent Cache Manager. In: Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB'96, Bombay, India, Sept. 3-6), 1996, pp. 51-62

Sell88   Sellis, T.: Multiple Query Optimization. In: Transactions on Database Systems 13(1988)1, pp. 23-51

Shos82   Shoshani, A.: Statistical Databases: Characteristics, Problems, and Some Solutions. In: Proceedings of the 8th International Conference on Very Large Data Bases (VLDB'82, Mexico City, Mexico, Sept. 8-10), 1982, pp. 208-222

YaKl97   Yang, J.; Karlapalem, K.; Li, Q.: Algorithms for Materialized View Design in Data Warehousing Environment. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97, Athens, Greece, Aug. 25-29), 1997, pp. 136-145
