Enterprise Data Management in Mixed Workload Environments

Jens Krueger, Martin Grund, Christian Tinnefeld, Jan Schaffner, Stephan Mueller and Alexander Zeier
Hasso Plattner Institute for IT Systems Engineering, University of Potsdam,
August-Bebel-Str. 88, D-14482 Potsdam, Germany
[email protected]
WWW Home Page: http://epic.hpi.uni-potsdam.de

Abstract—Enterprise applications are presently built on a 20-year-old data management infrastructure that was designed to meet a specific set of requirements for OLTP systems. In the meantime, enterprise applications have become more sophisticated, data set sizes have increased, requirements on the freshness of input data have been strengthened, and the time allotted for completing business processes has been reduced. To meet these challenges, enterprise applications have become increasingly complicated to make up for shortcomings in the data management infrastructure. This paper outlines the characteristics of enterprise applications with regard to the underlying data management layer. We also propose a database design tailored to the requirements of enterprise applications.
I. INTRODUCTION
Nowadays, enterprise applications for large and mid-size companies operate under tight constraints. Hardly any company can manage its daily business and offer its services at a consistent level of quality without the extensive use of comprehensive software systems throughout all of its departments. As described in [19], enterprise applications are presently built on a 20-year-old data management infrastructure that was designed to meet a certain set of requirements for transaction processing systems. In the meantime, enterprise applications have become more sophisticated, addressing for example legal regulations, governmental compliance, new accounting principles, and global supply chains. In addition, data set sizes have increased, requirements on the freshness of input data have been strengthened, and the time allotted for completing business processes has been reduced. To meet these challenges, enterprise applications have become increasingly complicated to make up for shortcomings in the data management infrastructure. These complications increase the total cost of ownership of the applications and make them harder to use. Companies like SAP offer standard software solutions for a wide range of industries and application domains, and the system landscape in most companies is very heterogeneous. Different operating systems and database systems have to be taken into account when developing enterprise applications for a wide range of potential customers. Often, the decision for a certain database, operating system, or combination of both is of political or historical origin, leading
to a mismatch between the data management and the actual requirements of the applications. The extended life cycle of enterprise applications, with their evolutionary changes, reinforces this effect.
The remainder of this paper is structured as follows: Section II addresses the requirements for data management derived from enterprise applications, such as the mixed workload, and describes the results of analyzing customer data. Section III reviews database technologies that impact data management for enterprise applications. Section IV points out the feasibility and the improvements of applying these technologies in combination. Section V surveys related work, and section VI concludes.

II. ERP DESIGN CONSIDERATIONS
This section presents the issues to consider when deriving the requirements for enterprise-application-specific data management.

A. Mixed Workload
In the context of enterprise data management, database systems are classified as being optimized either for online transaction processing (OLTP) or for online analytical processing (OLAP). In fact, enterprise applications today are primarily focused on the day-to-day transaction processing needed to run the business, while the analytical processing necessary to understand and manage the business is added on after the fact. In contrast to this classification, single applications such as Available-To-Promise (ATP) or Demand Planning exist that cannot be assigned exclusively to one or the other workload category [20]. These applications generate a mixed workload in the sense that they process small sets of transactional data at a time, including write operations and simple read queries, as well as complex, unpredictable, mostly-read operations on large sets of data that project only a few columns. A mixed workload as such is nothing new and was analyzed at the database level a decade ago by French [9]; the insight that it originates from a single application, however, is new. Given this, and the fact that databases are built either for OLTP or for OLAP,
it is evident that no database management system adequately addresses the characteristics required by these complex enterprise applications. For example, within sales order processing, the decision whether a product can be delivered at the requested time relies on the ATP check. Its execution results in a confirmation for the sales order containing the product quantity and the delivery date. The checking operation therefore issues a database request that sums up all available resources for the specific product. Materialized aggregates could be seen as one solution to avoid this expensive on-the-fly aggregation. However, they fail for real-time order rescheduling, where incoming high-priority orders lead to a reallocation of all products. This check operation, an essential part of the ATP application, exhibits the characteristics of analytical workloads with regard to low selectivity and low projectivity, uses aggregation functionality, and issues read-only queries. Alongside the check operation, the write operations that declare products as promised to customers work on a fine-granular transactional level; these writes clearly belong to the OLTP category.
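To make the mixed workload concrete, the following minimal C++ sketch contrasts the two access patterns the ATP check combines: an analytical aggregation over many stock-movement rows and a fine-granular transactional write that records a promise to a customer. It is illustrative only, not the implementation of an actual ATP engine; all names and data are hypothetical.

// Minimal sketch (not the authors' implementation) of the two access patterns an
// ATP check combines: an analytical aggregation over many rows of one attribute,
// and fine-granular transactional writes. All names and data are illustrative.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct StockMovement {          // one row of a (simplified) transactional fact table
    std::string product;
    int64_t     quantity;       // positive: goods receipt, negative: promised/shipped
};

// OLAP-style part of the check: scan and aggregate on the fly instead of
// maintaining a materialized aggregate that would have to be invalidated
// whenever high-priority orders force a reallocation.
int64_t available(const std::vector<StockMovement>& movements, const std::string& product) {
    int64_t sum = 0;
    for (const auto& m : movements)
        if (m.product == product) sum += m.quantity;
    return sum;
}

// OLTP-style part: a single, fine-granular write that records a promise to a customer.
bool promise(std::vector<StockMovement>& movements, const std::string& product, int64_t qty) {
    if (available(movements, product) < qty) return false;   // cannot confirm the order
    movements.push_back({product, -qty});
    return true;
}

int main() {
    std::vector<StockMovement> movements{{"P-100", 500}, {"P-100", 200}, {"P-200", 50}};
    std::cout << "available P-100: " << available(movements, "P-100") << "\n";            // 700
    std::cout << "promise 600: " << std::boolalpha << promise(movements, "P-100", 600) << "\n";
    std::cout << "available P-100: " << available(movements, "P-100") << "\n";            // 100
}

The point of the sketch is that both patterns operate on the same transactional data, which is why neither a pure OLTP nor a pure OLAP system serves the application well.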
B. The Mismatch
The aforementioned, simplified example of a complex enterprise application shows workload characteristics that match those associated with both OLTP and OLAP. As a consequence, today's database management systems cannot fulfill the requirements of such enterprise applications: they are optimized for one or the other category, leading to a mismatch between enterprise applications and the underlying data management layer, mainly because conventional RDBMSs cannot execute certain important complex operations in a timely manner. While this problem is widely recognized for analytical applications, it also pertains to sophisticated transactional applications. To cope with it, enterprise applications have become increasingly complicated to make up for shortcomings in the data management infrastructure. One of these workarounds is packaging the operations as long-running batch jobs; this slows down the rate at which business processes can be completed, possibly beyond external requirements. Maintaining pre-computed, materialized results of the operations is another solution. Materialized views in data warehouses for analytical applications are an example of this approach, which makes applications less flexible, harder to use, and more expensive to maintain. To address the mixed workload, the data management layer has to be aware of it and has to be optimized towards these contradicting workloads by leveraging recent advances in hardware, such as the availability of huge amounts of main memory. Additionally, as presented by Schaffner et al. in [16], recent trends in data management, such as storing data column-wise and light-weight compression algorithms, support the feasibility of building an enterprise-application-specific data management layer.

C. Data Characteristics
Given the mixed workload characteristics, data management systems optimized for analytic-style queries seem to be the best match. However, fast reconstruction of complete tuples is still an essential requirement of OLTP workloads, and since ERP data schemas consist of very wide relations due to their inherent complexity, this operation can be expensive. For example, in a large enterprise system the accounting document has 98 attributes, while the corresponding line item contains 301 attributes. Consequently, in order to assess whether a row- or column-oriented database is better suited for an ERP system, the usage of each attribute of a table is explored. The main focus of this evaluation is the number of distinct values per column. Financial accounting and sales order processing, the most common applications in enterprises, have been analyzed; it is assumed that these data characteristics and structures carry over to other application domains as well.
Fig. 1. Distinct values (relative frequency of distinct values per column, 0%–100%).
Figure 1 shows ten out of the 98 columns of an accounting document header, in descending order of the relative frequency of their distinct values. All columns that are not shown, 90% of all columns, have a very small relative frequency of distinct values. The attribute with the highest relative frequency is the document number, which is also the key of the table. The other columns depicted are the reference number, description text, and date and time columns. Strikingly, 72% of all columns contain only a single distinct value, which is either a null value or a default value. It is therefore obvious that only a certain fraction of the attributes is ever used and thus of interest to the application. These attributes vary from company to company, however, and cannot simply be removed; instead, the data management layer has to be aware of these characteristics.
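The per-column analysis underlying Figure 1 can be expressed in a few lines. The following C++ sketch (table layout and values are purely illustrative, not the analyzed customer data) counts the distinct values of each column and reports their relative frequency, which is the measure plotted in the figure.

// Minimal sketch of the per-column analysis described above: for each column of a
// table, count its distinct values and report the relative frequency (distinct
// values divided by row count). Table layout and data are illustrative only.
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    // A tiny stand-in for a wide ERP relation, stored column-wise for convenience.
    std::vector<std::pair<std::string, std::vector<std::string>>> table = {
        {"document_number", {"1001", "1002", "1003", "1004"}},   // key: all values distinct
        {"currency",        {"EUR", "EUR", "USD", "EUR"}},
        {"unused_flag",     {"", "", "", ""}},                   // single (default) value
    };

    for (const auto& [name, values] : table) {
        std::unordered_set<std::string> distinct(values.begin(), values.end());
        double relative = 100.0 * distinct.size() / values.size();
        std::cout << name << ": " << distinct.size() << " distinct ("
                  << relative << "% of rows)\n";
    }
}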
III. DATABASE TECHNOLOGIES
Traditionally, enterprise applications store their data in a conventional RDBMS. These systems originated from System R [3] in the 1970s. Row-oriented database management systems were developed to fulfill the requirements of the OLTP applications of that time. The data is organized logically and physically in tables containing rows: each table represents all entities of a certain type, each row represents a single entity, and the columns represent the attributes of each entity. Given that most of the described data access patterns of OLTP applications access complete tuples, and that storage is assumed to be disk-based, this storage layout is preferable.

A. Column Database
While conventional RDBMSs store data in a row-wise fashion, which has advantages for OLTP applications, this physical data representation is not optimized for read-mostly, analytic-style queries, which typically project only a few columns. Following the trend towards specialized databases [19], data management systems have been developed that organize data along columns [18][4]. The logical data layout remains the same, meaning that the data is still organized in tables, but the physical layout differs. This storage model is also known as the decomposition storage model [6] or as vertically partitioned storage [21][13]. Such a storage model has several advantages for the aforementioned analytical queries, since these queries typically require the values of few columns but many rows. In contrast to a row store, a column store can, because it stores each column individually, read only the values actually required for processing the query. Memory and I/O bandwidth consumption can thus be reduced. Another advantage of storing data by columns is the locality of similar values, which enables light-weight compression techniques. However, the columnar layout also has disadvantages, especially when disk-based: since the data has to be fetched from several locations, reconstructing complete tuples is expensive [11], and, as stated by Copeland and Khoshafian [6], writing is slower than in a row store for the same reason.
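The following minimal C++ sketch (illustrative field names, not a real storage engine) shows the decomposition storage model described above: the same logical relation once as an array of records and once as one array per attribute, together with the cheap single-column scan and the more expensive full-tuple reconstruction.

// Minimal sketch (illustrative, not a real DBMS) of the decomposition storage model:
// the same logical relation stored row-wise and column-wise. A scan that projects one
// attribute touches only that column's array in the columnar layout, while
// reconstructing a full tuple has to gather one value from every column.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct RowStoreRecord {              // row-oriented: one contiguous record per tuple
    int64_t     id;
    std::string customer;
    int64_t     amount;
};

struct ColumnStoreTable {            // column-oriented: one array per attribute
    std::vector<int64_t>     id;
    std::vector<std::string> customer;
    std::vector<int64_t>     amount;

    RowStoreRecord reconstruct(size_t pos) const {   // expensive: touches every column
        return {id[pos], customer[pos], amount[pos]};
    }
};

int64_t sumAmounts(const ColumnStoreTable& t) {      // cheap: touches a single column
    int64_t sum = 0;
    for (int64_t a : t.amount) sum += a;
    return sum;
}

int main() {
    ColumnStoreTable t;
    t.id = {1, 2, 3};
    t.customer = {"ACME", "Globex", "Initech"};
    t.amount = {100, 250, 75};

    std::cout << "sum(amount) = " << sumAmounts(t) << "\n";                     // column scan
    std::cout << "customer at position 1 = " << t.reconstruct(1).customer << "\n";  // tuple reconstruction
}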
B. In-Memory Data Management
For several years, main memory prices have been falling while the amount of memory available per node has constantly increased. It therefore seems feasible to run the complete database of an enterprise application entirely in main memory. By directing data access to memory instead of disk and eliminating the need for disk I/O, main memory databases (MMDBs) can achieve substantial improvements in performance, processing time, and throughput over conventional database systems. In contrast, even when the complete data fits in memory, a disk-based database whose data is merely cached must still execute and manage data movements, define block sizes, and carry much of the same processing overhead as if the data resided on disk; such systems remain optimized for the particular characteristics of disk storage. Since the properties of main memory differ considerably from those of magnetic disks, these differences change the design and performance of the database system. First of all, this concerns the data structures of an MMDB. Conventional data structures are usually designed to reduce the number of disk accesses: the idea is to find the block containing the data with as few accesses as possible and to bring that block into main memory. Since main memory is randomly accessible without seek delays, however, data does not need to be organized in blocks. In MMDB systems the objective shifts from reducing I/O to efficiently utilizing both the central processing unit (CPU) and main memory space. Due to the still-growing gap between CPU speed and memory access speed, memory latency is the bottleneck of MMDBs. To narrow this gap, cache memories have been introduced, consisting of fast but expensive and therefore much smaller static random access memory chips. These caches are organized in multiple cascading levels between main memory and the CPU. In such hierarchical memory systems, the time that passes from issuing a data access until the requested data is available in the CPU, the latency, increases with the distance from the CPU. Reducing cache misses through data locality is therefore the major optimization in MMDBs, and attention has to be paid to cache-conscious data structures that reduce CPU data stalls during query execution.

C. Compression
Moore's Law [12], which states that the complexity, and with it the speed, of circuits doubles roughly every two years, has held for CPUs for more than 35 years. Nothing similar is true for hard drive or memory speed, so that for many queries I/O has increasingly become the bottleneck [4]. There is thus an abundance of CPU power, much of which is spent waiting for I/O operations to complete, and trading CPU power for I/O savings becomes more and more desirable. Compression is one way to do so: it reduces the amount of data moved and therefore the amount of I/O required to read certain data, while requiring additional CPU power for decompression. In general, compression works best on similar values; the fewer distinct values, the better the data can be compressed. Compressing column-wise is particularly effective, since all values of a column share the same data type and domain, which enables efficient light-weight compression. For example, dictionary-based compression stores each unique value of an attribute in a separate dictionary and assigns it a reference; every occurrence of a value in the attribute is then replaced by that reference. Storage efficiency is achieved because the reference is in most cases smaller than the actual value. This can be improved further by trimming the reference to the smallest possible length needed to encode the number of distinct values of the attribute stored in the dictionary. A description of the most common techniques can be found in [1].
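As an illustration of the dictionary encoding just described, the following C++ sketch (again illustrative, not a production implementation) builds a dictionary for a column, replaces the values by integer references, and reports how many bits per reference would suffice once the references are trimmed to the number of distinct values.

// Minimal sketch of dictionary encoding: each distinct value of a column is stored
// once in a dictionary, and the column itself becomes a sequence of integer
// references. The number of bits needed per reference depends only on the number of
// distinct values (reported here; a real store would additionally bit-pack).
#include <cmath>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct DictionaryColumn {
    std::vector<std::string> dictionary;   // distinct values, index = reference
    std::vector<uint32_t>    references;   // one reference per row
};

DictionaryColumn encode(const std::vector<std::string>& column) {
    DictionaryColumn out;
    std::unordered_map<std::string, uint32_t> lookup;
    for (const auto& value : column) {
        auto it = lookup.find(value);
        if (it == lookup.end()) {          // first occurrence: extend the dictionary
            it = lookup.emplace(value, static_cast<uint32_t>(out.dictionary.size())).first;
            out.dictionary.push_back(value);
        }
        out.references.push_back(it->second);
    }
    return out;
}

int main() {
    std::vector<std::string> currency{"EUR", "EUR", "USD", "EUR", "GBP", "EUR"};
    DictionaryColumn encoded = encode(currency);

    // With 3 distinct values, ceil(log2(3)) = 2 bits per reference suffice.
    unsigned bits = static_cast<unsigned>(std::ceil(std::log2(encoded.dictionary.size())));
    std::cout << encoded.dictionary.size() << " distinct values, "
              << bits << " bits per reference\n";
    std::cout << "decoded row 2: " << encoded.dictionary[encoded.references[2]] << "\n"; // USD
}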
IV. PUTTING IT ALL TOGETHER
The feasibility of MMDBs in enterprise application environments has been greatly facilitated by technical innovations. Until the 1990s, MMDBs were used for specialized databases where high throughput on a small set of data was necessary. Nowadays, the amount of data that can be kept in main memory is no longer the critical issue, because 64-bit architectures are not limited to addressing four gigabytes of main memory; theoretically, up to 16 exabytes can be addressed. The falling prices of main memory have also contributed to the success of MMDBs, as 128 gigabytes or more can be provided with a moderate financial investment. Because the data is kept in main memory rather than on disk, significant performance improvements can be achieved: the costly joins of normalized tables can be calculated on the fly. Pre-aggregation and de-normalization can be reconsidered, since they impose problems such as inflexibility for reads and a high maintenance effort. Moreover, in systems with real-time requirements, the instant update of materialized views slows down insert and update queries. Given that on-the-fly calculation is nearly as fast as using materialized aggregates, storing this redundant data becomes obsolete. Furthermore, a main memory database profits from a column-oriented data layout in that the opportunities to compress the data increase. Collocating data of the same type side by side provides many advantages, not only for processing but also for applying different compression techniques. As Plattner points out in [14], a compression factor of at least 10 can be achieved on enterprise data. Hence, today's systems with 128 gigabytes of main memory can fit a substantially large enterprise database.
Considering in-memory processing and column-wise data storage, the characteristics identified in the conducted analysis of customer data support the concept of combining the aforementioned database technologies. The fact that only 10 percent of the attributes contain a significant number of distinct values leads to the conclusion that only these attributes are of interest for processing; queries therefore take only a small part of the whole relation into account. Consequently, organizing the data column-wise seems to be a perfect match. This is underlined further by the mixed workload that enterprise data management layers have to handle: besides conventional, simple OLTP-style queries, more complex OLAP-style queries occupy the DBMS. The latter mainly operate on a certain set of columns but on many rows. This column-oriented access pattern provides another argument for main memory column stores, which rely on good cache locality to achieve substantial performance gains. For example, an aggregate query on a very wide table, as is typical for enterprise applications, processes a significant amount of data. In a row-oriented environment, complete tuples would have to be transferred into the CPU's cache in order to process the query, but only a small fraction of each tuple is actually needed to calculate the sum; much of the cache is wasted on attributes that are of no interest to the specific query. With a table of 200 attributes, efficient utilization of the CPU cache is clearly not at hand. Because a column-oriented database stores the values of an attribute sequentially, they are also transferred sequentially into the processor's cache. The CPU experiences fewer cache misses and can compute the desired operation faster and, in terms of memory usage, more efficiently.
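The cache-locality argument can be illustrated with a small C++ sketch; the table width, row count, and cache-line size below are assumptions chosen purely for illustration. Summing one attribute of a wide table accesses memory with a large stride in the row layout, so that each fetched cache line carries only one useful value, whereas the column layout is scanned sequentially and every fetched byte is used.

// Minimal sketch of the cache-locality argument above: summing a single attribute of a
// wide (here 200-attribute) table. With the row layout the scan is strided, so each
// fetched 64-byte cache line carries only one useful 8-byte value; with the column
// layout the scan is sequential. Sizes are illustrative assumptions.
#include <array>
#include <cstdint>
#include <iostream>
#include <vector>

constexpr size_t kAttributes = 200;                    // width of the simplified relation
constexpr size_t kCacheLine  = 64;                     // typical cache line size in bytes
using WideRow = std::array<int64_t, kAttributes>;      // row layout: one contiguous tuple

int64_t sumRowWise(const std::vector<WideRow>& rows, size_t attribute) {
    int64_t sum = 0;
    for (const auto& row : rows) sum += row[attribute]; // strided access, poor cache utilization
    return sum;
}

int64_t sumColumnWise(const std::vector<int64_t>& column) {
    int64_t sum = 0;
    for (int64_t v : column) sum += v;                  // sequential access, full lines used
    return sum;
}

int main() {
    const size_t rows = 10000;
    std::vector<WideRow> rowTable(rows);
    std::vector<int64_t> column(rows, 1);
    for (auto& row : rowTable) row.fill(1);

    std::cout << "row-wise sum:    " << sumRowWise(rowTable, 42) << "\n";
    std::cout << "column-wise sum: " << sumColumnWise(column) << "\n";
    // Rough cache-line counts for the scan: one line per row vs. one line per 8 values.
    std::cout << "cache lines fetched row-wise (approx.):    " << rows << "\n";
    std::cout << "cache lines fetched column-wise (approx.): "
              << rows * sizeof(int64_t) / kCacheLine << "\n";
}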
V. RELATED WORK
Much of the existing research on mixed workloads focuses on conventional RDBMSs. The authors of [5] assume different runtimes for certain queries, categorize queries into classes with certain performance goals, and propose an algorithm for a dynamically self-adjusting DBMS. Ramamurthy et al. describe in [15] a DBMS design that handles workloads of different characteristics by introducing mirrors with different physical storage layouts. While early work on compression, such as [2][7], focused on reducing I/O by reducing the size of the data, newer research has examined the effects of compression on query execution [10][22]. It has been shown that write performance can be hurt by compression but that, on the other hand, read performance can be improved. With such light-weight compression techniques, decompression at the first database operator is not necessary for most read queries and can be delayed; these materialization strategies improve query response times by leveraging the advantages of compressed data. A significant amount of research has gone into compression adapted to the properties of column stores, as described in [1]. Another line of related work addresses how database systems can take advantage of the characteristics of main memory, especially compared to disks; this was investigated with a focus on data structures in [8][17].

VI. CONCLUSION
Due to the growing complexity of enterprise applications, the requirements for database systems in this environment have changed, and the workloads that such systems need to handle diversify more and more. Enterprise applications of earlier days mostly had to perform OLTP-style queries. The trend has moved towards a mix of OLTP- and OLAP-style queries, with an increasing share of the latter. The problem is that the most common enterprise platforms, and especially their underlying data management layers, incorporate conventional RDBMSs. These DBMSs are optimized for the OLTP-style applications that dominated the market when the decision was made to use them for enterprise platforms. In this paper, we presented requirements for enterprise-application-specific data management derived from analyzing workloads and data. We then described recent trends in database technologies that have been applied in certain areas. By putting the requirements and the database technologies together, we pointed out the feasibility of using a main-memory-based column store that makes use of compression techniques.

REFERENCES
[1] Daniel Abadi, Samuel Madden, and Miguel Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 671–682, New York, NY, USA, 2006. ACM.
[2] Peter Alsberg. The management of a large data base in iris. Journal of Chemical Information and Computer Sciences, 15(1):23, 1975.
[3] Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim Gray, Patricia P. Griffiths, W. Frank King III, Raymond A. Lorie, Paul R. McJones, James W. Mehl, Gianfranco R. Putzolu, Irving L. Traiger, Bradford W. Wade, and Vera Watson. System R: Relational approach to database management. ACM Trans. Database Syst., 1(2):97–137, 1976.
[4] Peter A. Boncz, Stefan Manegold, and Martin L. Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access. In VLDB’99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 54–65. Morgan Kaufmann, 1999.
[5] Kurt P. Brown, Manish Mehta, Michael J. Carey, and Miron Livny. Towards automated performance tuning for complex workloads. In VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 72–84, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[6] George P. Copeland and Setrag Khoshafian. A Decomposition Storage Model. In Proceedings of the 1985 ACM SIGMOD International Conference on Management of Data, Austin, Texas, May 28-31, 1985, pages 268–279. ACM Press, 1985.
[7] Gordon V. Cormack and R. Nigel Horspool. Data compression using dynamic Markov modelling. The Computer Journal, 30:541–550, 1986.
[8] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David Wood. Implementation techniques for main memory database systems. SIGMOD Rec., 14(2):1–8, 1984.
[9] Clark D. French. “One size fits all” database architectures do not work for DSS. SIGMOD Rec., 24(2):449–450, 1995.
[10] Goetz Graefe and Leonard D. Shapiro. Data compression and database performance. In Proc. ACM/IEEE-CS Symp. On Applied Computing, pages 22–27, 1991.
[11] Stavros Harizopoulos, Velen Liang, Daniel J. Abadi, and Samuel Madden. Performance tradeoffs in read-optimized databases. In VLDB ’06: Proceedings of the 32nd international conference on Very large data bases, pages 487–498. VLDB Endowment, 2006.
[12] G. E. Moore. Cramming More Components Onto Integrated Circuits. Electronics, 38(8), 1965.
[13] Shamkant B. Navathe, Stefano Ceri, Gio Wiederhold, and Jinglie Dou. Vertical partitioning algorithms for database design. ACM Trans. Database Syst., 9(4):680–710, 1984.
[14] Hasso Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 1–2, New York, NY, USA, 2009. ACM.
[15] Ravishankar Ramamurthy, David J. DeWitt, and Qi Su. A case for fractured mirrors. In VLDB ’02: Proceedings of the 28th international conference on Very Large Data Bases, pages 430–441. VLDB Endowment, 2002.
[16] Jan Schaffner, Anja Bog, Jens Krueger, and Alexander Zeier. A hybrid row-column OLTP database architecture for operational reporting. In BIRTE 2008, 2008.
[17] Joachim W. Schmidt, Stefano Ceri, and Michele Missikoff, editors. Advances in Database Technology - EDBT’88, Proceedings of the International Conference on Extending Database Technology, Venice, Italy, March 14-18, 1988, volume 303 of Lecture Notes in Computer Science. Springer, 1988.
[18] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O’Neil, Patrick E. O’Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. C-Store: A Column-oriented DBMS.
In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, pages 553–564. ACM, 2005.
[19] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The end of an architectural era (it’s time for a complete rewrite). In VLDB ’07: Proceedings of the 33rd international conference on Very large data bases, pages 1150–1160. VLDB Endowment, 2007.
[20] Christian Tinnefeld, Jens Krueger, Jan Schaffner, and Anja Bog. A database engine for flexible real-time available-to-promise. In IEEE Symposium on Advanced Management of Information for Globalized Enterprises (AMIGE’08), pages 1–5, 2008.
[21] P. J. Titman. An Experimental Data Base System Using Binary Relations. In IFIP Working Conference Data Base Management, 1974.
[22] Till Westmann, Donald Kossmann, Sven Helmer, and Guido Moerkotte. The implementation and performance of compressed databases. SIGMOD Rec., 29(3):55–67, 2000.