An Algebraic Compression Framework for Query Results
Zhiyuan Chen and Praveen Seshadri
Cornell University
zhychen, [email protected]; contact: (607) 255-1045; fax: (607) 255-4428

Abstract
Decision-support applications in emerging environments require that SQL query results or intermediate results be shipped to clients for further analysis and presentation. These clients may use low-bandwidth connections or have severe memory restrictions. Consequently, there is a need to compress the results of a query for efficient transfer and client-side access. This paper explores a variety of techniques that address this issue. We present a framework to represent "compression plans" formed by composing primitive compression operators. We also present optimization algorithms that enumerate valid compression plans and choose an optimal plan. The factors that influence this choice include statistical and semantic information about the query result. Using queries adapted from the TPC-D benchmark, we demonstrate that our techniques result in, on average, 86% greater compression than standard tools like WinZip.

1 Introduction
Most database applications have multi-tier client-server architectures. The database back-end server runs queries over the stored data. The query results are shipped across a hierarchy of clients, with possible transformations along the way. Increasingly, these "clients" reside on a desktop machine, a laptop, or a palmtop computer. In this paper, we apply compression to the results of SQL queries.
1. In OLAP applications, query results need to be moved to the clients for ready visualization. The clients may download the query results across an expensive and/or slow connection, such as a wireless network.
2. In mobile computing applications, the clients are usually disconnected, and consequently need to download large query results for offline use. For the emerging class of palmtop computers, storage memory is severely constrained. For instance, the PalmPilot III has only 2 MB of RAM for both applications and data.
Without data compression, such a client cannot hold more than 2 MB of data! Therefore, for efficient transfer and client-side memory conservation, the query results need to be compressed.
3. Recent research [Mayr99] has demonstrated the use of client-side functions (UDFs) within SQL queries. This requires that partial results be shipped from the server to the client. A slightly different motivation underlies the research on semantic caching [DFJ96], which also expects the results of queries to be cached at clients.
This paper is motivated by all these environments, where the ability to compress query results enhances the usability, functionality, and/or performance of the application.
1.1 Summary of Contributions
Data compression has traditionally been applied to database indexing structures [WAG73, COM79, GRS98]. While there has been some work on compression techniques for query evaluation [RH93, RHS95, SNG93, GRS98, GS91], this activity is typically restricted to the internals of decision-support systems. There is also much existing research in the data compression community, mostly focused on compression algorithms for specific data types (like text and multimedia). Three issues make it non-trivial to apply compression in database systems:
1. General compression methods work well only on large chunks of data [RHS95], while database applications usually need to access data in small pieces (tuples, attributes, etc.).
2. A DBMS maintains different types of data, and these require a combination of compression methods rather than a single method. The choice of an appropriate combination is not obvious.
3. Compression and decompression are CPU-intensive operations, which matters especially at low-end clients.
Our research makes the following contributions:
1. We analyze SQL queries and demonstrate how the semantics of the queries, the statistics computed during query processing or query optimization, and the underlying tables can identify compression opportunities.
2. We present an algebraic framework to represent the compression of an entire query result using a combination of many methods. We define primitive compression operators, and define a compression plan as a sequence of compression operators.
3. We demonstrate through implementation that well-chosen compression plans can achieve significantly greater compression ratios (86% greater on average in our experiments) than standard compression methods like LZ77 (used in gzip and WinZip). We apply compression plans to modified versions of all the queries in the TPC-D benchmark and report the resulting compression ratios. We also present a decompression experiment on handheld devices showing that lightweight plans that use information about query results can improve performance.
4. Choosing an appropriate compression plan requires "compression optimization", analogous to query optimization. We present an exhaustive algorithm that enumerates compression plans and chooses a good plan, as well as a faster heuristic algorithm.
1.2 Brief Summary of Related Work
A number of researchers have considered compression methods for text and multimedia data. These methods can be roughly divided into two categories: statistical and dictionary methods.
Statistical methods include Huffman encoding [Huf52], arithmetic encoding [WNC87], etc. Dictionary methods include LZ78 [WEL84], LZ77 [LZ77, LZ76], etc. A comprehensive survey of compression methods can be found in [SAL98]. Most related work on database compression has focused on the development of new algorithms and the application of existing techniques within the storage layer [ALM96, COR85, GOY83, GRA93, LB81, SEV83]. [GRS98] discusses page-level offset encoding for indexes and numerical data. The tuple differential algorithm is presented in [NR95]. COLA (Column-Based Attribute Level Non-Adaptive Arithmetic Coding) is presented in [RHS95]. Considerable research has dealt with the compression of scientific and statistical databases [BAS85, EOS81]. Among commercial database products, SYBASE IQ [SYB98] uses a compression technique similar to gzip, and DB2 [IW94] uses the Ziv-Lempel compression method in the storage layer. Other researchers have investigated the effect of compression on the performance of database systems [RH93, RHS95, SNG93, GRS98]. [GS91] discusses the benefits of keeping data compressed as long as possible during query processing. The performance of the TPC-D benchmark on a compressed database is presented in [WKHM98]. Our research is complementary to the above work because we focus on the compression of query results leaving the DBMS; to the best of our knowledge, no prior research directly addresses this topic. Another related research area is mobile databases. Much work has been done on caching [BM95, LC98, CSL98], data replication and data management [HSW95, DK98, HL96], and transaction processing [DL98, VB98, PR98, R98]. However, our focus is on the compression of query results shipped to handheld devices.
2 Compression Methods
2.1 Data Compression Methods
While we intend to leverage well-known data compression algorithms, we also use two database-specific compression methods, described below: normalization and grouping.
2.1.1 Normalization as a Compression Method
Stored tables are usually normalized to eliminate data redundancy. However, when queries involve foreign-key joins, the results become unnormalized. For instance, assume we have tables R1(A, B) and R2(A, C), where A is a primary key in R1 and a foreign key in R2. Therefore, we have the functional dependency A -> B. Assume we run the query

select * from R1, R2 where R1.A = R2.A and R1.A < 1000 order by R1.A

Suppose there are 1,000 distinct A values and 100,000 tuples in the result. Although the A and B combinations have only 1,000 distinct values, they are stored 100,000 times in the result! We can reduce this redundancy by normalizing the result into two relations, AB and AC, so that B values are stored only 1,000 times. This simple normalization technique has two problems: (a) the values of attribute A still have redundancy (A values are stored 101,000 times), and (b) we need a join to regenerate the result. However, when the data is already sorted on A (the left-hand side of the functional dependency), there is a more efficient normalization algorithm that partitions the table into equality partitions on A (and thereby on B as well). A and B are stored just once for each partition, along with the matching C values.
This physical representation of the normalized tables has little redundancy (and hence occupies fewer bytes). Figure 2.1 shows the procedure of compression and decompression using sorted normalization.
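As an illustrative sketch (ours, not the paper's implementation), the sorted-normalization procedure for tuples (A, B, C) sorted on A with the FD A -> B might look like this; function and variable names are our own:

```python
# Hypothetical sketch of the sorted-normalization method of Section 2.1.1.
# Input: result tuples (A, B, C) sorted on A, with the FD A -> B.
# Output: one (A, B, [C...]) equality partition per distinct A value.

def normalize_sorted(tuples):
    """Compress (A, B, C) tuples sorted on A using the FD A -> B."""
    partitions = []
    for a, b, c in tuples:
        if partitions and partitions[-1][0] == a:
            partitions[-1][2].append(c)      # A (and hence B) repeats: store only C
        else:
            partitions.append((a, b, [c]))   # start a new equality partition on A
    return partitions

def denormalize(partitions):
    """Decompression: regenerate the original sorted result."""
    return [(a, b, c) for a, b, cs in partitions for c in cs]

rows = [(1, 2, 6), (1, 2, 5), (1, 2, 4)]
packed = normalize_sorted(rows)   # [(1, 2, [6, 5, 4])]
assert denormalize(packed) == rows
```

Each A and B value is stored once per partition, which is exactly the physical representation described above.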
    Original result (3 tuples)            Compressed form
    A  B  C
    1  2  6      -- Compression -->       A: 1
    1  2  5                               B: 2
    1  2  4     <-- Decompression --      C: 6 5 4
Figure 2.1 Application of the normalization method on the FD A -> B.

2.1.2 Grouping as a Compression Method
Even if there is no functional dependency, we may use a similar compression technique when the data is sorted. We can group together duplicate attribute values and record each value only once, along with the matching values of the other attributes. Since query results are often sorted, the grouping method can be applied on the sorted attributes without significant cost.
2.1.3 Standard Compression Methods
We now list the standard methods used in this paper. Each method is described briefly; detailed algorithmic explanations can be found in [RH93]. While this list is not exhaustive, any specific database system will implement a finite number of compression algorithms, and our work assumes this finite list.

Name                               Description
Differential encoding              Compute the difference between adjacent values (same column or same tuple)
Offset encoding                    Encode each value as an offset from a base value
Null suppression                   Omit leading zero bytes of numerical values and leading or trailing spaces of strings
Non-adaptive dictionary encoding   Use a fixed dictionary to encode data
LZ77, LZ78                         Adaptive dictionary encoding
Huffman                            Assign fewer bits to more frequent characters
Arithmetic encoding                Represent a string by an interval according to the probabilities of each character
Table 2.1 Standard compression methods
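To make two of these methods concrete, here is a minimal sketch (ours, not the paper's code) of offset encoding and null suppression; the helper names and the 4-byte integer width are illustrative assumptions:

```python
# Illustrative sketches of two methods from Table 2.1 (not the paper's code).

def offset_encode(values, base=None):
    """Offset encoding: store each value as its offset from a base value."""
    base = min(values) if base is None else base
    return base, [v - base for v in values]

def offset_decode(base, offsets):
    return [base + o for o in offsets]

def null_suppress(n):
    """Null suppression for a 4-byte unsigned integer: drop leading zero bytes."""
    raw = n.to_bytes(4, "big")
    return raw.lstrip(b"\x00") or b"\x00"

base, offs = offset_encode([1001, 1003, 1002])
assert (base, offs) == (1001, [0, 2, 1])   # small offsets fit in fewer bits
assert null_suppress(7) == b"\x07"         # three leading zero bytes dropped
```

The offsets can then be stored in just enough bits to cover the value range, which is how range statistics translate into savings.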
2.2 Using Compression Methods
Clearly, it is possible to apply a general-purpose compression technique like LZ77 to the result of any query. However, this approach has drawbacks: (a) the entire data set must be decompressed in order to access individual records or attributes, which may defeat the purpose of compression in some cases (as on a palmtop computer); (b) it uses no semantic or statistical knowledge about the query result, although such knowledge may be available; (c) since each table holds a different type of data in each column, there is no reason to believe that a single compression method is ideal for the whole table. Consequently, we face the non-trivial issue of composing multiple compression methods in a consistent and efficient way. We discuss the consistency issue in Section 3.3.2 (what is a valid compression plan?) and the efficiency issue in Section 5 (how is a good compression plan chosen?). In the rest of this section, we further motivate our work by demonstrating that database information can be used effectively to achieve high compression ratios on query results.
2.3 Sources of Information for Compression
A database system can acquire information about the result of a query from:
• The semantics of the query.
• The statistics computed during query optimization. Many techniques [JKM98, HS92, PIHS96], such as histograms and sampling, have been proposed to maintain such information. [1]
• The statistical and semantic information kept in the catalog.
Example 2.1
select S_SUPPKEY, N_NAME, R_NAME, S_PHONE, O_ORDERDATE, L_SHIPDATE,
       SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)) AS REVENUE
from LINEITEM, SUPPLIER, NATION, REGION, ORDER
where O_ORDERDATE >= '1998-01-01' AND O_ORDERDATE < '1999-01-01'
  AND L_SHIPDATE < O_ORDERDATE + 3 months
  AND S_SUPPKEY = L_SUPPKEY AND S_NATIONKEY = N_NATIONKEY
  AND N_REGIONKEY = R_REGIONKEY AND L_ORDERKEY = O_ORDERKEY
group by S_SUPPKEY, N_NAME, R_NAME, S_PHONE, O_ORDERDATE, L_SHIPDATE
having REVENUE >= 10,000 AND REVENUE < 100,000
order by S_SUPPKEY, O_ORDERDATE

We use an example to illustrate the use of such semantic and statistical information. Example 2.1 selects, for each supplier, the supplier's nation, region, phone number, and revenue on every possible order date and ship date in the year 1998. Only the revenue for parts delivered within 3 months of the order date is included, and the revenue ranges from 10,000 to 100,000. The underlying database is a 100 MB scaled database from the TPC-D benchmark [TPC95]. Useful information is available at three granularities: (a) about individual attributes, (b) about each tuple, and (c) about the entire relation. For each individual attribute, it is useful to know:
• Range of values: If the range is small, we may encode each value by its offset from the lower bound of the value range. We can determine the range of values from statistics computed during query optimization or query processing. For instance, we can obtain S_SUPPKEY's range as [1, 1000], so offset encoding can be applied. Further, WHERE and HAVING clauses often limit the range of values. For example, in Example 2.1, the HAVING clause implies that REVENUE is at least 10,000 and smaller than 100,000. Moreover, we know from the schema that REVENUE has only two digits after the decimal point. Therefore, the ASCII form of REVENUE requires no more than 8 characters, and we can use four bits for each numeric character. As another example, the WHERE clause contains the condition O_ORDERDATE >= '1998-01-01' AND O_ORDERDATE < '1999-01-01'; the year is therefore fixed, and we can encode the 12 possible month values and 31 possible day values within 2 bytes.
[1] If we assume that the query result is materialized, it is possible to obtain much of the necessary statistical information during or after query processing. However, we do not make this simplifying assumption; instead, we use estimates from query optimization so that we can decide the compression strategy before query processing. Either approach is valid.
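The range-driven encodings sketched above for Example 2.1 (four bits per numeric character of REVENUE; month and day packed into two bytes) might look as follows. This is an illustrative sketch under our own naming, not the paper's implementation:

```python
# Sketch of the range-driven encodings discussed above (illustrative only).

DIGITS = "0123456789."          # characters REVENUE can contain; 4 bits each

def pack_revenue(s):
    """Pack each numeric character of REVENUE into 4 bits (two per byte)."""
    nibbles = [DIGITS.index(ch) for ch in s]
    if len(nibbles) % 2:
        nibbles.append(15)      # padding nibble outside the DIGITS alphabet
    return bytes(hi << 4 | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

def pack_month_day(month, day):
    """With the year fixed by the WHERE clause, 12 months and 31 days
    each fit comfortably in one byte."""
    return bytes([month, day])

assert len(pack_revenue("99999.99")) == 4   # 8 ASCII bytes -> 4 bytes
assert pack_month_day(12, 31) == b"\x0c\x1f"
```

The same idea generalizes: any constraint that shrinks the value domain shrinks the number of bits per value.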
• Number of distinct values: If this is small, we can use non-adaptive dictionary compression. This information is usually maintained in the catalogs for all indexed attributes, and the query optimizer computes similar estimates for attributes of intermediate results.
• Character distribution of strings: This information is useful for string encoding methods (e.g., Huffman). We can obtain it from query optimization or from the catalog. For instance, in Example 2.1, the catalog records the semantic constraint that S_PHONE can only hold characters corresponding to digits and '-', so we can use a Huffman encoding that encodes each character in four bits.
For each tuple, it is useful to know the following "cross-attribute" information:
• Value constraints: We may get this information from the WHERE clause. In Example 2.1, the WHERE clause constrains L_SHIPDATE to lie within 3 months of O_ORDERDATE, so L_SHIPDATE can be encoded compactly relative to O_ORDERDATE.
• Functional dependencies: An FD of the form "key -> other attributes in this table" is a trivial FD. In Example 2.1, the FD "S_SUPPKEY -> S_NAME, N_NAME, R_NAME, S_PHONE" indicates the opportunity for normalization compression.
At the level of the entire relation, it is useful to know the order in which tuples are produced. Depending on the order, certain grouping compressions are naturally suggested. In Example 2.1, the ORDER BY clause defines the physical order of the result, thereby providing information for the normalization and grouping compression methods. For instance, the data is sorted on S_SUPPKEY and we have the FD "S_SUPPKEY -> S_NAME, N_NAME, R_NAME, S_ADDRESS, S_PHONE". Therefore, we use normalization on these fields.
2.4 Example of Performance Improvement
To motivate the rest of the paper, we present a quantitative example of the effects of query result compression. Consider compressing the result (approximately 10 MB) of the SQL query in Example 2.1. We use combinations of the various methods discussed above to form suitable "compression plans" (details in Appendix A). Plan A is the plan of choice; it allows individual tuples and attribute values to be decompressed. Plan B extends Plan A by applying LZ77 to each column of the relation. While this gains additional compression, the entire relation must be decompressed to access any tuple or attribute value. Plan C differs from Plan A in that it does not use the normalization method described in Section 2.1.1, leading to a lower level of compression. We compare these plans with WinZip (file-level LZ77), which of course requires that the entire relation be decompressed. We report the compression ratio (original result size divided by compressed result size) and the compression and decompression throughput (uncompressed result size divided by compression or decompression time). The experiment is run on a machine with a Pentium II 447 MHz processor and 256 MB of RAM.
Strategy                   WinZip      Plan A      Plan B      Plan C
Compression ratio          4.25        5.57        10.68       2.87
Compression throughput     1.25 MB/s   1.76 MB/s   1.39 MB/s   1.64 MB/s
Decompression throughput   4.60 MB/s   3.46 MB/s   2.93 MB/s   3.6 MB/s
Table 2.2 Comparison of compression ratio, compression throughput, and decompression throughput

Plan A achieves a higher compression ratio than WinZip at a lower compression cost. More importantly, Plan A allows the client to decompress tuples and attributes individually, while WinZip has to decompress the whole file. The compression ratio is high because we use information about the query result for compression. The decompression throughput is lower than that of LZ77 because of the overhead of combining multiple decompression methods. Plan B shows that if we do not care about decompression granularity, we can achieve a compression ratio more than twice that of LZ77. Finally, Plan C shows that normalization plays an important role as a compression method for such a query result. We should view query result compression in the context of an end-to-end system. Assume a PC client communicates with the database server over a 56.6 kbps modem. The "cost" of accessing the query result is some combination of compression, decompression, and transmission times; assume that we simply add these times to compute the cost. Using no compression requires 1,408 seconds; WinZip requires 341 seconds; Plan B requires 143 seconds. Therefore, in this environment, Plan B requires only 42% of the end-to-end time of WinZip.
3 A Framework for Query Result Compression
We now present an algebraic framework to represent a composite compression strategy for a relation produced as the result of a query execution. The framework has four components: (a) a compressed relation representation called a compressed granule set (CGS); (b) compression operators, each defining an output CGS from a single input CGS; (c) a compression plan formed by composing a sequence of operators; and (d) an operational semantics for such an algebraic compression plan. We view an uncompressed relation A (in this case, the result of a query) as a set of N tuples.
Each tuple has M attribute values. We assume that every tuple has a unique identifier and that the tuple identifiers range from zero through N-1. Consequently, A[i, j] uniquely identifies the j'th attribute of tuple i, and A[i] identifies tuple i.
3.1 Compressed Granule Set (CGS)
A CGS is a set of compressed granules, each of which represents one or more attributes of the corresponding uncompressed relation. Each granule has a granule identifier and a reference through which it can be accessed. Each granule contains:
• The compressed representation of the set of attributes A[i, j] contained in the granule.
• The N-to-1 attribute-granule mapping from the set of A[i, j] contained in the granule to the granule identifier. Every attribute is contained in exactly one granule (we use the term "contains" for the reverse mapping).
• Meta-information, including:
  • The data type of the compressed representation.
  • The decompression method that can be applied to the granule to regenerate the individual attributes.
  • Any additional information needed by the decompression method (e.g., a dictionary).
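A granule as just described might be rendered as the following illustrative data structure; the field names are our own, not the paper's:

```python
# Hypothetical rendering of a compressed granule and a CGS (Section 3.1).
from dataclasses import dataclass, field

@dataclass
class Granule:
    gid: str                    # granule identifier
    data: object                # compressed representation
    attrs: list                 # attributes A[i, j] contained in the granule
    dtype: str                  # data type of the compressed representation
    decompress: str             # name of the decompression method
    meta: dict = field(default_factory=dict)   # e.g. a dictionary, an FD

# Granule g1 of Figure 3.1: value 1 covering A[1,1] and A[2,1].
g1 = Granule("g1", 1, [(1, 1), (2, 1)], "integer",
             "denormalize", {"FD": "field 1 -> field 2"})

# A CGS is a set of granules plus the N-to-1 attribute-to-granule mapping.
cgs = {g.gid: g for g in [g1]}
attr_to_granule = {a: g.gid for g in cgs.values() for a in g.attrs}
assert attr_to_granule[(2, 1)] == "g1"
```

The `attr_to_granule` mapping is what lets a client locate and decompress a single attribute without touching the rest of the relation.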
For instance, Figure 3.1 shows two CGSs, r0 and r1. r0 is on the left and contains nine granules, each containing one attribute. r1 is the output of applying normalization compression using the FD field 1 -> field 2; it contains seven granules. gi denotes a granule identifier. Granule g1 contains: (a) the compressed data value "1" with integer data type, (b) the attribute-granule mapping {A[1,1], A[2,1]} -> g1, (c) the decompression method, which is the reverse of normalization, and (d) the additional information FD field 1 -> field 2.
    Input CGS r0 (nine granules, one per attribute):

        Field:    1  2  3
        Tuple 1:  1  2  7
        Tuple 2:  1  2  5
        Tuple 3:  2  1  6

    Output CGS r1 (seven granules), one granule group per field:

        Group 1 (granularity G1, field 1):  g1: 1 (tuples 1-2)   g2: 2 (tuple 3)
        Group 2 (granularity G2, field 2):  g3: 2 (tuples 1-2)   g4: 1 (tuple 3)
        Group 3 (granularity G3, field 3):  g5: 7   g6: 5   g7: 6
Figure 3.1 Compression of a CGS r0 into r1 using normalization with the FD field 1 -> field 2.

Based on the attribute-granule mapping, we define the granularity of a CGS as follows:
Definition 1: A granularity G is a binary equivalence relation on attribute values such that for every granule g, tuples i1, i2, and attributes j1, j2, g contains A[i1, j1] iff it contains every other attribute A[i2, j2] such that A[i2, j2] G A[i1, j1].
3.2 Regular CGS
While this definition of CGS allows arbitrary granularities, we usually expect all data in a particular column to be compressed in a homogeneous manner. We define a regular CGS as a CGS whose granularity is specified by a two-dimensional partition of the relation. The vertical partition defines disjoint sets of columns; attributes of the same field set belong to the same granule group. The horizontal partition defines disjoint sets of rows; these further divide each granule group into individual granules. For instance, r1 in Figure 3.1 contains 3 granule groups, one for each field. For a regular CGS, the meta-information of the granules in the same group must be identical.
Definition 2: A regular granularity G = <F, P>, where F is a field set and P is a predicate, such that for all tuples i1, i2 and attributes j1, j2, A[i1, j1] G A[i2, j2] iff j1 ∈ F, j2 ∈ F, and P(A[i1], A[i2]) is true.
Intuitively, F represents the vertical partition and P defines the horizontal partition. For instance, the granularity of group 1 in Figure 3.1 is <{field 1}, equality on field 1> (attributes with the same value form each granule). In the rest of this discussion and the paper, we assume that all CGSs and granularities are regular. Since the granules in the same granule group of a CGS share the same data type and compression method, the common meta-information does not need to be stored with each granule. Instead, it can be stored once per granule group. The group meta-information of all granule groups is represented as a compression schema, which holds the following information for each field of the relation:
• The granularity of the group containing this field.
• The common data type of the granules containing this field.
• The sequence of compression operations executed on this field. The decompression method is the reverse of this sequence of compression operations.
We use the following notation for a compression schema: (field i: <granularity, data type, operator sequence>, …). For instance, the compression schema of CGS r1 in Figure 3.1 is (1: <G1, integer, normalization>, 2: <G2, integer, normalization>, 3: <G3, integer, none>).
Definition 3: Partial order on regular granularities. A granularity G1 = <F1, P1> ≤ G2 = <F2, P2> iff F1 ⊆ F2 and P1 -> P2. This partial order corresponds to an intuitive notion of "coarseness" (G2 is coarser than G1).
Theorem 1: If G1 = <F1, P1> ≤ G2 = <F2, P2>, then A[i1, j1] G1 A[i2, j2] ⇒ A[i1, j1] G2 A[i2, j2]. Intuitively, if two attributes are in the same granule of a CGS, they are in the same granule of any coarser CGS. The proof is straightforward and omitted for reasons of space.
3.3 Compression Algebra
Definition 4: A compression algebra expression is defined recursively as (a) a CGS, or (b) a unary algebra operator validly applied to a single algebra expression. The result of every compression algebra expression is a CGS. An expression is valid if the operator is compatible (i.e., type-checks) with the input CGS, as defined in Section 3.3.2.
3.3.1 Compression Algebra Operators
A compression operator defines a mapping from an input CGS to an output CGS. Four properties of such operators can be used to reason about them (or to specify them).
Granule scope: The granule scope is a partition of the input granules. The data in each partition of input granules is collectively represented by one output granule. Consequently, the output CGS granularity can be defined completely in terms of the input CGS granularity and the granule scope. For instance, in Figure 3.1, the granule scope of the normalization operator partitions the granules of fields 1 and 2 by equality on field 1.
Reference scope: For each partition of input granules defined by the granule scope (and hence for each output granule), the reference scope defines a set of input granules (the reference granules) that are used to compress the partition (although not all of them may be contained in the output granule).
Using output granule g3 in Figure 3.1 as an example, the reference scope includes four granules, containing A[1,1], A[2,1], A[1,2], and A[2,2] respectively. The granules containing A[1,2] and A[2,2] are directly used to compute the compressed output granule. The granules containing A[1,1] and A[2,1] are used to check the condition A[1,1] = A[2,1], which determines that A[1,2] and A[2,2] are contained in g3. Clearly, the input granules defined by the granule scope are also reference granules.
Compression method: This defines the procedure that generates the compressed value of an output granule from the input granules (defined by the granule scope) and the reference granules (defined by the reference scope). It consists of a compression function and the meta-information used in compression (this is existing information, not information derived from the reference granules). The function expects a certain input data type and generates output of a certain output data type. The specification is: <method name, { input data type -> output data type }, meta-information>. For instance, in Figure 3.1, the compression method of the normalization operator is <normalization, { any input type -> the same data type as the input type }, FD field 1 -> field 2>.
Decompression method: This is the reverse of the compression method, and its specification is identical.
Based on these operator properties, the output CGS (compression schema, granularity) can easily be defined in terms of the input CGS, as follows:
1) Define the output granularity based on the granule scope. The granularities of fields not compressed by this operator remain unchanged.
2) In the compression schema, update the data type of the fields compressed by the operator to the output data type.
3) In the compression schema, append this operator to the tail of the operator sequence of the compressed fields.
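Steps 1)-3) can be sketched as a small schema-rewriting function; the dictionary layout, field numbers, and operator description used here are our own illustrative assumptions:

```python
# Sketch of steps 1)-3) above: how applying an operator rewrites the
# per-field compression schema (names and types are illustrative).

def apply_operator(schema, op):
    """schema: {field: (granularity, dtype, [operator names])}."""
    out = dict(schema)
    for f in op["fields"]:
        gran, _, ops = out[f]
        out[f] = (op["granularity"],          # 1) output granularity
                  op["out_dtype"],            # 2) output data type
                  ops + [op["name"]])         # 3) operator appended to the tail
    return out                                # untouched fields keep their entry

schema = {1: ("tuple", "integer", []), 2: ("tuple", "integer", [])}
offset = {"name": "offset", "fields": [2], "granularity": "tuple",
          "out_dtype": "small int"}
schema2 = apply_operator(schema, offset)
assert schema2[2] == ("tuple", "small int", ["offset"])
assert schema2[1] == ("tuple", "integer", [])   # field 1 unchanged
```

The recorded operator sequence is exactly what the client later reverses to decompress each field.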
3.3.2 Type Checking Algebra Expressions
If the compression schema of a CGS type-checks with an algebra operator, we can generate a valid expression by applying the operator to the CGS. The following constraints must be checked:
1) The input data type of the compression method must agree with the data types of the input granules.
2) Since a granule is the minimal access unit, we cannot access a subset of its attributes without decompression, and we assume a granule is never decompressed during compression. Therefore, the granule scope must include each input granule as a whole; by Theorem 1, the granule scope must be coarser than the granularities of the input granules. Similarly, reference granules must also be included as a whole.
3.4 Plan Execution
We treat compression algebra expressions as compression plans that direct the actual compression of query results. Towards this goal, we now provide an operational semantics for plan execution. The result of a plan execution is the CGS generated by a serial execution, as described below. Any plan execution that produces the same result is considered equivalent. [2] A plan is a sequence of operators. Due to space constraints, we focus on the compression phase of a compression plan; the decompression phase is its dual.
[2] We refrain from using the term "serializable", despite the strong analogy.
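Abstractly, this operational semantics, a plan as a sequence of operators where each operator consumes the previous CGS, can be sketched as follows; the two toy operators and the list-of-rows stand-in for a CGS are our own illustrative assumptions:

```python
# A minimal sketch of serial plan execution: the query result is a
# trivial CGS, and each operator's output CGS feeds the next operator.

def execute_plan(result_rows, operators):
    cgs = result_rows                 # trivial CGS: one granule per attribute
    for op in operators:              # serial, one operator at a time
        cgs = op(cgs)
    return cgs

# Two toy operators on a list-of-rows "CGS" (illustrative only):
# group the common first attribute, then offset-encode the second column.
group_first = lambda rows: [(rows[0][0], [r[1] for r in rows])]
offset_rest = lambda gs: [(a, [c - cs[0] for c in cs]) for a, cs in gs]

out = execute_plan([(1, 10), (1, 12), (1, 15)], [group_first, offset_rest])
assert out == [(1, [0, 2, 5])]
```

Pipelined execution differs only in that each operator consumes a chunk of granules at a time instead of a whole CGS.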
3.4.1 Serial Execution
A serial execution executes the operators one by one. The initial query result relation is a trivial CGS. Each operator execution generates an output CGS, which acts as the input of the next operator.
3.4.2 Pipelined Serial Execution
To reduce the storage of each intermediate CGS, we can pipeline the execution of many operators in the plan. Each pipelined operator processes a chunk of input granules at a time, called the pipelining unit. For every output granule, each operator must be able to access the corresponding input granules.
3.4.3 Parallel Execution
The order of some operators in a plan is exchangeable. For instance, in Figure 3.1, the following two plans produce equivalent results: Offset encoding on field 3 (Normalization with field 1 -> field 2 (r0)) and Normalization with field 1 -> field 2 (Offset encoding on field 3 (r0)). If the input granules of one operator intersect the input or reference granules of another operator, there is a partial order between them. This is similar to read/write and write/write conflicts in concurrency control. A compression plan is a sequence of operators, i.e., a particular total order of operators consistent with this partial order; any other total order consistent with the partial order is an equivalent compression plan. We can use the partial-order information to execute the plan in parallel by partitioning the operators into layers. Each layer consists of operators with no partial orders between them, and operators in a higher layer have edges only to the next lower layer. We then replace the operators in each layer with a parallel operator, which executes the operators of that layer simultaneously and combines their outputs back into a single CGS.

[Figure 3.2: Each layer N of the operator partial order is replaced by a parallel operator N.] Figure 3.2 Operators in each layer can be executed in parallel.

4 Experimental Results
This section demonstrates the effect of compression plans that use semantic and statistical information about queries, compared with a traditional compression method (WinZip). We first compare plans optimized for network transmission (where minimizing data size is critical). We also run decompression code on a handheld device to explore realistic decompression costs. All the compression plans are manually constructed; we derive intuition from these plans to guide our later choice of a heuristic optimization algorithm. The compression code is implemented on top of the Cornell Predator ORDBMS. The server runs on a Pentium II 447 MHz processor with 256 MB of RAM.
4.1 Plans Optimized for Network Transmission vs. WinZip

We need a set of queries that are realistic and have large result sizes (and hence need compression). We accomplish this by adapting queries from the TPC-D benchmark [TPC95]. We justify this choice as follows:
• Queries in the TPC-D benchmark are characteristic of a wide range of decision-support applications. Most of the queries include grouping and aggregation, making their result sizes small. As mentioned in the introduction, in the environment we consider, aggregation and analysis are performed locally at the clients rather than within the database server. Therefore, we adapt the queries by removing the grouping and aggregation.
• The data in the TPC-D benchmark contains various data types such as string, integer, decimal and character. The character distribution of string fields varies; some fields are easy to compress while others are not.

Our goal in this set of experiments is to minimize the compressed size of the query result. Therefore, we assume the compression operators can have any granule and reference scope. These manually optimized plans have similar forms:
• They all apply the normalization or grouping method first, if such methods can be applied.
• Then appropriate field-level compression methods such as offset encoding, across-column value constraints, dictionary, and Huffman encoding are applied on each field. More than one such operator may be applied on the same field if the combination achieves greater compression.
• Finally, LZ77 operators with a per-field reference scope are applied on each field, because we expect more similarity among values of the same field than across fields [RHS95].

The database size is 100 MB. The average result size is 2.5 MB. Each tuple has 52 bytes and 5 fields on average. Figure 4.1 shows the results for the 17 adapted queries. Except for query 17, using statistical and semantic information improves the compression ratio.
Figure 4.1 Compression Ratios of WinZip and Our Plans on the Results of the 17 Adapted TPC-D Queries. The x axis is the query; the y axis is the compression ratio.
                            WinZip    Our Plans
Average Compression Ratio   3.43      6.40
Table 4.1

Table 4.1 shows that using WinZip achieves an average compression ratio (CR) of 3.43. Our plans achieve an
average CR of 6.40, which is 86% higher than that of WinZip. Figure 4.2 shows the relative improvement in CR of our plans over WinZip ((CR of our plan / CR of WinZip) - 1). For eleven of the seventeen queries, the CR improves by 40% to 80%; for three queries, the improvement is over 80%. Only three queries (queries 4, 10 and 17) show less than 40% improvement. Query 17 shows no improvement at all because its result contains only two floating-point fields; since there is no additional information, our plan does not outperform WinZip. For queries 4 and 10, there are two large string fields whose values are generated from a uniform distribution, so we cannot achieve better compression than WinZip on these two fields, which form a substantial proportion of the data.
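As a quick sanity check on the averages in Table 4.1, the relative-improvement metric used for Figure 4.2 can be computed directly:

```python
def relative_improvement(cr_ours, cr_winzip):
    # Metric from Figure 4.2: (CR of our plan / CR of WinZip) - 1
    return cr_ours / cr_winzip - 1

# Average CRs from Table 4.1: 6.40 (our plans) vs. 3.43 (WinZip),
# giving roughly 0.86, i.e. the 86% average improvement reported.
avg_improvement = relative_improvement(6.40, 3.43)
```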
Figure 4.2 Compression Ratio Improvement of Our Plans over WinZip. The x axis is the query; the y axis is (CR of our plans / CR of WinZip) - 1.
Figure 4.3 shows the compression and decompression cost of our plans vs. WinZip.
Figure 4.3 Comparison of Compression and Decompression Throughput of Our Plans and WinZip. The vertical axis is the throughput of our plans divided by that of WinZip.

Table 4.2 shows that our average compression (decompression) throughput is 72% (52%) of that of WinZip. This is the price we pay for the higher compression achieved.

            Avg. Compression Throughput   Avg. Decompression Throughput
Our plans   0.90 MB/s                     2.40 MB/s
WinZip      1.25 MB/s                     4.60 MB/s
Table 4.2
4.2 Decompression on Handheld Devices

In contrast to the assumptions of the previous experiments, we now consider handheld client devices. We run an experiment on a palm-size CASSIOPEIA E-10 device, running Windows CE on a MIPS 4000 processor with 4 MB RAM (2 MB of persistent data storage and 2 MB of program memory). The query is the same as example 2.1 and the result size is 3.72 MB, making it too large to store without compression. We also assume that the client needs random access to tuples; clearly it is not feasible to decompress the entire result in an initial step. Instead, we store the result in compressed form and only decompress the data being accessed (decompression on demand). Windows CE uses a default page-level compression for all persistent storage (with a 4 KB page size). Therefore, the compression plans should meet three (possibly conflicting) requirements: (a) they should exploit redundancies that Windows CE's compression method does not exploit; (b) they should allow tuple-level access; (c) they should be lightweight, since a handheld device has a very slow processor and limited battery power. Specifically, we compare the behavior of:
• Plan A, presented in section 2.4.
• Plan C, presented in section 2.4 (plan A excluding the normalization operator). Both plan A and plan C allow tuple-level decompression; plan C uses less information on query results than plan A.
• Plan D: the empty plan, using the default Windows CE compression.
• Plan E: page-level LZ77.
Table 4.3 shows the storage usage (both persistent and program memory) and the time needed to randomly access 1000 tuples of the result. The average tuple size is 52 bytes.

                 Plan A       Plan C       Plan D       Plan E
Storage          0.75 MB      0.95 MB      1.42 MB      1.43 MB
Accessing Time   2.85 sec     2.77 sec     2.48 sec     5.75 sec
Table 4.3

Plan E shows no improvement in storage usage because LZ77 does not exploit redundancies different from those exploited by Windows CE's method. Plans A and C use only 53% and 67%, respectively, of the space used by Windows CE's compression method, because they use knowledge of the query result that Windows CE's method does not consider. Further, their access times are just 15% and 11.7% more than that of the empty plan, because they use lightweight methods and allow tuple-level decompression. Plan A is also better than plan C: it uses only 79% of the space of plan C with just 3% more access time. These results show that lightweight plans exploiting knowledge of query results can improve performance on handheld devices.
5 Optimization Issues
Compression plan optimization generates a valid compression plan that minimizes the overhead and maximizes the benefit of compression. The inputs to the optimization algorithm include the network bandwidth, the CPU speed, information about the query results, client resource constraints, and client access patterns. We first present an exhaustive but inefficient naïve algorithm, then discuss cost estimation. Finally, we present a polynomial heuristic algorithm that is based on intuition derived from our experimental results.

5.1 A Naïve Algorithm

A simple way to construct a plan is to add a type-consistent operator to an existing plan that does not already contain this operator. The naïve algorithm starts with an empty plan. It repeatedly enumerates new plans by adding new operators to existing plans, and stops when no new plans are generated. It evaluates the cost of each plan (as described in the next section) and returns the minimal-cost plan. Clearly, this algorithm is exhaustive. It also halts if the number of possible methods, granule scopes and reference scopes is limited, because: 1) the total number of possible plans is finite, since there is a finite number of valid operators (defined by combinations of possible methods, granule scopes and reference scopes) and we never include the same operator twice in any plan; 2) the naïve algorithm enumerates new plans in each round and stops when no new plans are generated. As a simple enhancement, we keep the set of operators and the partial orders between them for each plan, so that equivalent plans are considered the same. However, the naïve algorithm is not efficient because we must check all valid sequences of compression operators. Suppose there are M fields and, for each field, we can choose exactly one operator that compresses this field only. Every combination of such operators is a valid sequence, so there are 2^M such combinations, which is a lower bound on the total search space.
For M = 20, this lower bound already exceeds one million!

5.2 Cost Estimation

The cost of a plan equals the compression overhead minus the compression benefit. The compression overhead is: Weight1 * Compression Cost + Weight2 * Decompression Cost. The compression benefit is: Weight3 * Saving on Network Transfer + Weight4 * Saving on Client-Side Storage. This formula can fit various optimization goals, such as minimizing the transmission cost (set Weight1, Weight2, and Weight4 to 0) or minimizing the client-side cost (set Weight1 and Weight3 to 0).

5.2.1 Compression and Decompression Cost

The compression cost of each operator includes CPU cost and I/O cost. The CPU cost depends on the CPU speed, the compression method, and the input data size. The I/O cost includes the cost of reading input and writing output, which can be saved if the plan is pipelined with a unit that can be held in memory. Moreover, some compression methods may incur extra I/O cost during compression; for instance, if the data is not in order, normalization requires sorting the input and pays extra I/O cost. For a serially executed plan, the compression cost equals the sum of the cost of each
operator. For a plan executed in parallel, the cost equals the sum of the costs of the parallel operators, each of which consists of the cost of the simultaneous execution of the operators in that layer plus overheads such as combining the output. The decompression cost depends on the compression plan and the client's access pattern on the result. If the client decompresses the query result immediately, as in section 4.1, the decompression cost is just the cost of decompressing the whole result; we compute it by adding up the costs of the individual operators. If the client stores the result in compressed form and needs N random accesses to certain units (tuples, attributes, etc.), the decompression cost equals N times the cost of decompressing the granules containing each accessed unit. If the client instead needs N sequential accesses, we can assume that the client caches the decompressed units, and the cost is reduced to the cost of decompressing all granules containing the attributes being accessed.

5.2.2 Compression Benefit

The compression benefit includes the savings on network transmission and on client-side storage. The saving on transmission is the transmission cost of the uncompressed result minus that of the compressed result; the transmission cost is a function of the result size that depends on network properties. The saving on client-side storage depends on whether on-demand random-access decompression is required. If not, the saving is zero. Otherwise, the saving is: uncompressed result size – compressed result size – uncompressed size of the granules accessed each time – extra program memory used for decompression. We compute the compressed result size based on the compression methods, the uncompressed result size, and other information such as the average attribute size and the number of attributes. The details are specific to each method and are omitted for reasons of space. For some methods, it is certainly non-trivial to make accurate estimates.
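The weighted cost formula above can be written down directly. The sketch below is our own paraphrase of section 5.2's formulas; the numeric arguments in the example are made-up illustrations, not measured values:

```python
def plan_cost(compression_cost, decompression_cost,
              network_saving, storage_saving,
              w1, w2, w3, w4):
    """Cost of a plan = compression overhead - compression benefit,
    with the weights chosen to match the optimization goal."""
    overhead = w1 * compression_cost + w2 * decompression_cost
    benefit = w3 * network_saving + w4 * storage_saving
    return overhead - benefit

# Minimizing transmission cost only: w1 = w2 = w4 = 0.
transmission_only = plan_cost(5.0, 3.0, 10.0, 2.0, w1=0, w2=0, w3=1, w4=0)
```

The optimizer then prefers the plan with the lowest (most negative) cost under the chosen weights.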
5.3 A Heuristic Algorithm

To reduce the search space, we derive the following heuristics from the experiments in section 4:
1. Apply methods that use information on query results before general methods. Such information is only valid for the uncompressed data; after such "semantic" compression, general methods may further enhance compression by exploiting redundancies not captured by the semantic or statistical information. For instance, LZ77 can exploit repetition of bytes, while grouping and normalization can only exploit repetition of whole attribute values.
2. Use local per-field optimization to choose the operators that use information on query results. The compression overhead and benefit of the whole plan can be seen as the sum of the overheads and benefits on the individual fields, so we can combine plans optimized for each field into a low-cost global plan.
3. For general compression methods, prefer column-wise operators (whose granule scope and reference scope are per-field) to row-wise operators (whose granule and reference scope are all fields), because there are usually more similarities down a column than across columns [RHS95].
4. If the client applies on-demand decompression and needs random access at a certain granularity, choose a granule scope close to this granularity to reduce unnecessary decompression cost.
5. If we want to pipeline the compression on the server side, the granule scope should be finer than the pipelining unit, so that we can access input granules within the unit.
6. For general methods, prefer coarser reference scopes, so that more information can be collected to enhance compression.

The heuristic algorithm is as follows:
1. First consider methods using semantic/statistical information on query results: (a) For each field, apply the naïve algorithm to enumerate all valid combinations of operators that compress this field using information on query results. (b) Choose the minimal-cost plan for each field and combine these plans into a global plan, which contains all operators of the individual plans. The partial orders of the combined plan include the union of the partial orders of each individual plan and the partial orders between operators in different plans. This plan must also type check.
2. Then consider general methods for each field: (a) Choose an appropriate method based on the available information. For instance, if a field is not compressed and we have its character distribution table, arithmetic or Huffman encoding is the choice; otherwise, methods like LZ77 are favored. (b) Choose an appropriate granule scope and reference scope by heuristics 4, 5, and 6. For type-consistency, the granule scope should also be coarser than the granularity of the output CGS of the global plan generated by step 1. (c) Add the chosen operators to the global plan generated by step 1, accepting an operator only if it reduces the cost. The resulting plan is the final plan.
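The two steps of the heuristic can be sketched as follows. This is our own modeling of the algorithm, not the paper's implementation; `semantic_ops`, `general_ops`, and `cost` are hypothetical inputs, and operator validity and type checking are assumed to be handled elsewhere:

```python
def heuristic_plan(fields, semantic_ops, general_ops, cost):
    """Sketch of the section 5.3 heuristic.
    semantic_ops[f]: candidate operator combinations for field f that
                     use semantic/statistical information (step 1a);
    general_ops[f]:  general-method operators for field f (step 2);
    cost(plan):      estimated cost of a plan (lower is better)."""
    # Step 1: local per-field optimization (heuristics 1 and 2), then
    # combine the per-field winners into one global plan.
    plan = []
    for f in fields:
        combos = semantic_ops.get(f, [])
        if combos:
            plan.extend(min(combos, key=cost))
    # Step 2: greedily add a general operator for a field only if it
    # reduces the estimated total cost of the plan (step 2c).
    for f in fields:
        for op in general_ops.get(f, []):
            if cost(plan + [op]) < cost(plan):
                plan.append(op)
    return plan
```

With X candidate combinations per field, step 1 costs O(M*X) and step 2 costs O(M), matching the complexity analysis below.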
For instance, suppose we want to compress CGS r0 in Figure 3.1 for two clients: client 1 is a handheld device that decompresses the result on demand and needs random access to tuples; client 2 downloads the result through a slow network and only cares about the result size. The available compression methods are offset encoding, normalization, and LZ77. We also require that the plan can be pipelined with a unit no bigger than a page. For fields 1 and 2, we enumerate four combinations: normalization; normalization + offset encoding; offset encoding + normalization; and offset encoding. Assume "normalization + offset encoding" is the minimal-cost plan for fields 1 and 2, and offset encoding is the minimal-cost plan for field 3. The global plan is: Plan 1: Offset encoding on fld 1 ( Offset encoding on fld 2 ( Offset encoding on fld 3 ( Normalization on fld 1 and fld 2 ( r0 ) ) ) ). Then we choose LZ77 for each field because no further information is available. For client 1, the random access granularity is per tuple, so for fields 1 and 2, the reference scope and granule scope should be "tuples with the same field 1 value".
For field 3, both should be "one tuple". For client 2, the granule scope should be finer than page level by heuristic 5, and the reference scope should be file level by heuristic 6. For client 1, all LZ77 operators are rejected because, as shown in [RHS95], the reference scope is so small that we do not expect LZ77 to work well. Therefore, the final plan is just plan 1, which is similar to plan A in section 4.2. For client 2, these operators are all accepted, since we believe LZ77 with a file-level reference scope can improve the compression (as shown in example 2.1); the final plan is similar to plan B in section 2.4. Assume the number of fields is M and, for each field, there are X valid combinations of operators using information on query results. The cost of step 1 is O(M*X) and the cost of step 2 is O(M), so the algorithm costs O(M*X). Although it is not exhaustive, the plans it generates are similar to the best plans in our experiments.

6 Conclusion and Future Work

This paper addresses the issue of compressing query results and makes the following contributions:
• We present a framework to model compression strategies. The framework models a compression plan as a sequence of compression operators, similar in spirit to a relational query plan.
• We describe the use of semantic and statistical information about the query to compress its results.
• We present experiments on queries adapted from the TPC-D benchmark. They show that, on average, our handcrafted compression plans achieve an 86% improvement in compression ratio over a standard compression tool (WinZip).
• We also present an experiment with a handheld device acting as the client, which demonstrates the importance of lightweight compression plans in such resource-constrained environments.
• We present naïve and heuristic optimization algorithms that choose low-cost compression plans for a variety of different requirements. The heuristic algorithm reflects the same intuition that was used in our handcrafted plans.

There are two open research problems that we have identified and are currently exploring:
• Our experiments have shown the potential benefits of compression plans based on the proposed framework, and we derived intuition from our handcrafted plans to devise a heuristic plan optimization algorithm. However, we do not yet have a fully implemented and tested compression plan optimizer. The optimization algorithm will need to be evaluated against a significantly broader workload, and the cost formulas will need tuning. We are adding these features as an extension to the Predator database system.
• Another interesting idea is to blur the boundary between query processing and compression processing. By performing some compression operators "early", the performance of both query processing and compression could improve.

Acknowledgements

This work on the Cornell Jaguar project was funded in part through an IBM Faculty Development award and a Microsoft research grant to Praveen Seshadri, through a contract with Rome Air Force Labs (F30602-98-C-0266), and through a grant from the National Science Foundation (IIS-9812020).
References

[ALM96] G. Antoshenkov, D. Lomet, and J. Murray. Order Preserving Compression. In Proc. IEEE Conf. on Data Engineering, pages 655-663, New Orleans, LA, USA, 1996.
[BAS85] M. A. Bassiouni. Data Compression in Scientific and Statistical Databases. IEEE Trans. on Software Engineering, October 1985.
[BM95] D. Barbara, T. Imielinski. Sleepers and Workaholics: Caching Strategies in Mobile Environments. VLDB Journal, December 1995.
[COM79] D. Comer. The Ubiquitous B-tree. ACM Computing Surveys, 11(2):121-137, 1979.
[COR85] G. Cormack. Data Compression in a Database System. Communications of the ACM, Dec. 1985.
[CSL98] Boris Y. L. Chan, Antonio Si, Hong Va Leong. Cache Management for Mobile Databases: Design and Evaluation. ICDE 1998: 54-63.
[DFJ96] Shaul Dar, Michael J. Franklin, Björn Þór Jónsson, Divesh Srivastava, Michael Tan. Semantic Data Caching and Replacement. VLDB 1996: 330-341.
[DK98] Margaret H. Dunham, Vijay Kumar. Location Dependent Data and its Management in Mobile Databases. Ninth International Workshop on Database and Expert Systems Applications: 414-419, 1998.
[DL98] Ravi A. Dirckze, Le Gruenwald. A Toggle Transaction Management Technique for Mobile Multidatabases. CIKM 1998: 371-377, Bethesda, Maryland, USA, November 3-7, 1998.
[EOS81] S. J. Eggers, F. Olken, and A. Shoshani. A Compression Technique for Large Statistical Databases. In VLDB, pages 424-434, 1981.
[GOY83] P. Goyal. Coding Methods for Text String Search on Compressed Databases. Information Systems, 8(3), 1983.
[GRA93] G. Graefe. Options in Physical Database Design. SIGMOD Record, September 1993.
[GRS98] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing Relations and Indexes. In Proc. IEEE Conf. on Data Engineering, Orlando, FL, USA, 1998.
[GS91] G. Graefe and L. Shapiro. Data Compression and Database Performance. In Proc. ACM/IEEE-CS Symp. on Applied Computing, Kansas City, MO, April 1991.
[HL96] Andreas Heuer, Astrid Lubinski. Database Access in Mobile Environments. Database and Expert Systems Applications, 7th International Conference: 544-553, Zurich, Switzerland, September 9-13, 1996.
[HS92] Peter J. Haas, Arun N. Swami. Sequential Sampling Procedures for Query Size Estimation. SIGMOD Conference 1992: 341-350.
[HSW94] Yixiu Huang, A. Prasad Sistla, Ouri Wolfson. Data Replication for Mobile Computers. SIGMOD Conference 1994: 13-24.
[Huf52] D. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE, 40(9):1098-1101, September 1952.
[IW94] B. Iyer and D. Wilhite. Data Compression Support in Databases. In Proc. of the Conf. on Very Large Databases, pages 695-704, Santiago, Chile, Sept. 1994.
[JKM98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, Torsten Suel. Optimal Histograms with Quality Guarantees. VLDB 1998: 275-286.
[LB81] L. Lynch and E. Brownrigg. Application of Data Compression to a Large Bibliographic Data Base. VLDB 1981: 435.
[LC98] Susan Weissman Lauzac, Panos K. Chrysanthis. Programming Views for Mobile Database Clients. Ninth International Workshop on Database and Expert Systems Applications: 408-413, Vienna, Austria, August 24-28, 1998.
[LZ76] A. Lempel and J. Ziv. On the Complexity of Finite Sequences. IEEE Transactions on Information Theory, 22(1):75-81, 1976.
[LZ77] J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.
[Mayr99] Tobias Mayr and Praveen Seshadri. Client-Site Query Extensions. Accepted by SIGMOD 99.
[NR95] W. K. Ng, C. V. Ravishankar. Block-Oriented Compression Techniques for Large Statistical Databases. IEEE Trans. on Knowledge and Data Engineering, 1995.
[PIHS96] Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, Eugene J. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conf. 1996: 294-305.
[PR98] Anjaneyulu Pasala, D. Janaki Ram. New Object Models for Seamless Transition across Heterogeneous Mobile Environments. Ninth International Workshop on Database and Expert Systems Applications: 446-451, 1998.
[R98] Andry Rakotonirainy. Adaptable Transaction Consistency for Mobile Environments. Ninth International Workshop on Database and Expert Systems Applications: 440-445, 1998.
[RH93] M. Roth and S. Van Horn. Database Compression. ACM SIGMOD Record, 22(3):31-39, Sept. 1993.
[RHS95] Gautam Ray, Jayant R. Haritsa, S. Seshadri. Database Compression: A Performance Enhancement Tool. Advances in Data Management '95, Proceedings of the 7th International Conference on Management of Data (COMAD), 1995, Pune, India.
[SAL98] David Salomon. Data Compression: The Complete Reference. Springer-Verlag New York, Inc., 1998.
[SEV83] D. Severance. A Practitioner's Guide to Database Compression. Information Systems, 1983.
[SNG93] Leonard Shapiro, Shengsong Ni, Goetz Graefe. Full-Time Data Compression: An ADT for Database Performance. Portland State Univ., OR, USA, 1993.
[SYB98] Sybase IQ White Paper. http://www.sybase.com/products/dataware/iqwpaper.html.
[TPC95] Transaction Processing Performance Council (TPC). TPC Benchmark D (Decision Support). Standard Specification 1.0, May 1995. http://www.tpc.org.
[VB98] Hartmut Vogler, Alejandro P. Buchmann. Using Multiple Mobile Agents for Distributed Transactions. Proceedings of the 3rd IFCIS International Conference on Cooperative Information Systems: 114-121, August 20-22, 1998.
[WAG73] R. Wagner. Indexing Design Considerations. IBM Systems Journal, 12(4):351-367, 1973.
[WEL84] T. A. Welch. A Technique for High Performance Data Compression. IEEE Computer, June 1984.
[WKHM98] Till Westmann, Donald Kossmann, Sven Helmer, Guido Moerkotte. The Implementation and Performance of Compressed Databases. 1998.
[WNC87] I. Witten, R. Neal, and J. Cleary. Arithmetic Coding for Data Compression. Communications of the ACM, 30(6):520-540, June 1987.
Appendix A. Compression plan A in example 2.1.

Method                                                    Fields
Normalization                                             S_SUPPKEY, N_NAME, R_NAME, S_PHONE
Offset encoding                                           S_SUPPKEY
NULL suppression for each string attribute                Each string attribute
Dictionary                                                R_NAME
Dictionary                                                N_NAME
Four-bit encoding for characters including digits, '-'    S_PHONE
Use two bytes to encode month and day                     O_SHIPDATE
Use one byte to encode difference with O_SHIPDATE         L_SHIPDATE
Encode ASCII form of float value                          REVENUE