Mining the Data Cube for Improving its Scheme (Extended Abstract)
Stijn Dekeyser
Bart Kuijpers
Universiteit Antwerpen (UIA)
Jan Paredaens
Jef Wijsen†
Vrije Universiteit Brussel (VUB)‡
Abstract
The integration of data mining, data warehousing, and OLAP is an important research direction; Han [4] uses the term OLAP mining in this respect. Data warehouses facilitate data mining, and conversely, the desire to mine knowledge is an incentive to maintain large, integrated data warehouses. The data structures of data warehouses are typically "flat" multidimensional data cubes. In this paper, we present the notion of nested data cube, which we believe is a natural paradigm for perceiving large data cubes. Flat data cubes can be nested in numerous ways. Three measures for characterizing the complexity/simplicity of a particular nesting are presented. An interesting data mining problem then is: given threshold values for each measure, find a nesting that satisfies each threshold. The computational complexity and the applicability of this problem are investigated.
1 Motivation for Nested Data Cubes
The integration of data mining, data warehousing, and OLAP is considered a promising research area. Data warehouses facilitate data mining, and conversely, the desire to mine knowledge is an incentive to maintain large, integrated data warehouses. OLAP users typically view the data as "flat" multidimensional data cubes. The top table of figure 1 shows the non-empty entries of such a data cube. The first row, for example, indicates that at Allen's, the price of yellow hinges amounted to 10 dollars in January. The data cube may be called incomplete, as certain values are missing; for example, at Allen's there were no red hinges in January. This tabular representation has also been called an f-table [1]. Such tables are "flat," and consequently can cause some perceptual difficulties:
The dimensionality can be too high to be practically visualizable in a "cubic" format. The example table of figure 1 (top) has dimensionality=4.
The cardinality can be large. The example table of figure 1 (top) has cardinality=11. In practice, the cardinality is typically in the order of thousands or millions.
Clearly, such large tables or cubes are not fit for human perception. Therefore, semi-automated "cube viewers" are needed to help users browse through the data cube.
Universiteit Antwerpen (UIA), Informatica, Universiteitsplein 1, B-2610 Antwerpen, Belgium. Email: {kuijpers, pareda}@uia.ua.ac.be.
† Person handling correspondence.
‡ Vrije Universiteit Brussel (VUB), Informatica, Pleinlaan 2, B-1050 Brussel, Belgium. Email: {dekeyser, [email protected].
Item    Store    Color   Month  →
hinge   Allen's  yellow  Jan    →  10
hinge   Allen's  yellow  Feb    →  11
hinge   Allen's  yellow  Mar    →  12
hinge   Allen's  red     Feb    →  8
hinge   Allen's  red     Mar    →  8
handle  Allen's  yellow  Jan    →  7
handle  Allen's  yellow  Feb    →  7
handle  Allen's  yellow  Mar    →  9
hinge   Smith's  red     Jan    →  12
hinge   Smith's  red     Feb    →  13
hinge   Smith's  red     Mar    →  13

Item    Store    →  Color   Month  →
hinge   Allen's  →  yellow  Jan    →  10
                    yellow  Feb    →  11
                    yellow  Mar    →  12
                    red     Feb    →  8
                    red     Mar    →  8
handle  Allen's  →  yellow  Jan    →  7
                    yellow  Feb    →  7
                    yellow  Mar    →  9
hinge   Smith's  →  red     Jan    →  12
                    red     Feb    →  13
                    red     Mar    →  13
Figure 1: Nested data cube.
A natural way to cope with huge data cubes is to use nesting. The bottom table in figure 1 is a nested table. We will refer to this representation as a nested data cube. We assume it is intuitively clear how this table is to be interpreted; a formal definition can be found in [2], which also contains an algebra to manipulate such nested data cubes. Intuitively, nested data cubes can be obtained from flat ones by applying GROUP-BY operations at different depths. The scheme of the top table is denoted [Item Store Color Month → numeric], and that of the bottom table [Item Store → [Color Month → numeric]]. Note that nested data cubes adopt the functional nature (implied by the →-symbol) of their flat counterparts.
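The GROUP-BY intuition can be made concrete with a small sketch. It is not the algebra of [2]; it merely assumes that a flat cube is stored as a Python dictionary from value tuples to numbers, and that a scheme is given as a list of dimension groups, one per depth. The names DIMS, FLAT, and nest are hypothetical and introduced only for this illustration.

```python
from collections import defaultdict

# Flat cube of figure 1 (top): (Item, Store, Color, Month) -> price.
DIMS = ("Item", "Store", "Color", "Month")
FLAT = {
    ("hinge", "Allen's", "yellow", "Jan"): 10,
    ("hinge", "Allen's", "yellow", "Feb"): 11,
    ("hinge", "Allen's", "yellow", "Mar"): 12,
    ("hinge", "Allen's", "red", "Feb"): 8,
    ("hinge", "Allen's", "red", "Mar"): 8,
    ("handle", "Allen's", "yellow", "Jan"): 7,
    ("handle", "Allen's", "yellow", "Feb"): 7,
    ("handle", "Allen's", "yellow", "Mar"): 9,
    ("hinge", "Smith's", "red", "Jan"): 12,
    ("hinge", "Smith's", "red", "Feb"): 13,
    ("hinge", "Smith's", "red", "Mar"): 13,
}


def nest(cube, dims, levels):
    """Nest `cube` by `levels`, a list of tuples of dimension names.

    `levels` is assumed to partition `dims`; for example,
    [("Item", "Store"), ("Color", "Month")] stands for the scheme
    [Item Store -> [Color Month -> numeric]].
    """
    idx = [dims.index(name) for name in levels[0]]
    if len(levels) == 1:
        # Innermost level: project each full key onto the remaining dimensions.
        return {tuple(key[i] for i in idx): value for key, value in cube.items()}
    # GROUP-BY on this level's dimensions, then nest each group recursively.
    groups = defaultdict(dict)
    for key, value in cube.items():
        groups[tuple(key[i] for i in idx)][key] = value
    return {group: nest(sub, dims, levels[1:]) for group, sub in groups.items()}


# The bottom table of figure 1, and a deeper nesting of the same data.
figure1_bottom = nest(FLAT, DIMS, [("Item", "Store"), ("Color", "Month")])
store_first = nest(FLAT, DIMS, [("Store",), ("Item", "Color"), ("Month",)])
```

Because every level is again a dictionary, the functional nature of the flat cube carries over to each nested subcube in this representation.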
2 Characterizing Measures
A flat data cube can be nested in a large number of ways. Figure 2 shows a different nesting of the same data; the scheme of this cube is [Store → [Item Color → [Month → numeric]]]. It is not clear a priori which nesting is preferable. Following [6], we can make a distinction between two types of appropriateness measures for a particular nesting:
Store    →  Item    Color   →  Month  →
Allen's  →  hinge   yellow  →  Jan    →  10
                               Feb    →  11
                               Mar    →  12
            hinge   red     →  Feb    →  8
                               Mar    →  8
            handle  yellow  →  Jan    →  7
                               Feb    →  7
                               Mar    →  9
Smith's  →  hinge   red     →  Jan    →  12
                               Feb    →  13
                               Mar    →  13
Figure 2: Nested data cube representing the same information.
Subjective measures: those that depend on the class of users who examine the data cube. For example, the nested data cube of figure 2 will be preferred by users wishing a store-at-a-time look.
Objective measures: those that depend solely on properties of the underlying data.
In this study, our main interest is in the following objective measures:
Depth: The number of nested groupings in a nested data cube. By convention, a flat cube has depth=1. The nested data cube of figure 1 (bottom) has depth=2, and the one of figure 2 has depth=3.
Dimensionality: Every nesting depth has its associated dimensionality. The nested data cube of figure 1 (bottom) has dimensionality=2 at depth 1 (dimensions Item and Store), and dimensionality=2 at depth 2 (dimensions Color and Month).
Cardinality: A nested data cube is a recursive structure, and contains several smaller nested data subcubes. For the nested data cube of figure 1 (bottom), we find three non-first-normal-form (NF²) rows at depth 1 (about hinges at Allen's, handles at Allen's, and hinges at Smith's, respectively). The smaller subcubes at depth 2 have cardinalities 5, 3, and 3, respectively.
It can be readily seen that these three objective measures are not independent. For example, as the total number of dimensions is fixed, increasing the depth will decrease the dimensionality at some depth. Intuitively, a lower value for a certain measure implies a simpler data cube (supposing that the other measures are fixed). For example, if we remove a number of NF² rows from a nested data cube, thereby decreasing the cardinality, then the new data cube is easier to perceive than the original one.
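The three objective measures can be computed directly from the flat cube and a candidate scheme. The following is a minimal sketch under the assumption that a scheme is given as a list of dimension groups, one per depth; the function name measures and the choice to report cardinalities as a sorted list per depth are illustrative conventions of this sketch, not definitions taken from [2].

```python
from collections import Counter

# Flat cube of figure 1 (top): (Item, Store, Color, Month) -> price.
DIMS = ("Item", "Store", "Color", "Month")
FLAT = {
    ("hinge", "Allen's", "yellow", "Jan"): 10,
    ("hinge", "Allen's", "yellow", "Feb"): 11,
    ("hinge", "Allen's", "yellow", "Mar"): 12,
    ("hinge", "Allen's", "red", "Feb"): 8,
    ("hinge", "Allen's", "red", "Mar"): 8,
    ("handle", "Allen's", "yellow", "Jan"): 7,
    ("handle", "Allen's", "yellow", "Feb"): 7,
    ("handle", "Allen's", "yellow", "Mar"): 9,
    ("hinge", "Smith's", "red", "Jan"): 12,
    ("hinge", "Smith's", "red", "Feb"): 13,
    ("hinge", "Smith's", "red", "Mar"): 13,
}


def measures(cube, dims, levels):
    """Return (depth, dimensionality per depth, subcube cardinalities per depth)."""
    depth = len(levels)
    dimensionality = [len(level) for level in levels]
    cardinalities = []
    prefix = []  # dimensions already used at shallower depths
    for level in levels:
        outer_idx = [dims.index(name) for name in prefix]
        inner_idx = outer_idx + [dims.index(name) for name in level]
        # Distinct rows at this depth, each tagged with its enclosing outer rows.
        rows = {tuple(key[i] for i in inner_idx) for key in cube}
        # Count the rows of each subcube, i.e. per combination of outer values.
        counts = Counter(row[:len(outer_idx)] for row in rows)
        cardinalities.append(sorted(counts.values(), reverse=True))
        prefix.extend(level)
    return depth, dimensionality, cardinalities


fig1 = measures(FLAT, DIMS, [("Item", "Store"), ("Color", "Month")])
fig2 = measures(FLAT, DIMS, [("Store",), ("Item", "Color"), ("Month",)])
```

For the nesting of figure 1 (bottom), the sketch reports depth 2, dimensionality 2 at both depths, three rows at depth 1, and subcubes of cardinality 5, 3, and 3 at depth 2, in agreement with the figures quoted above.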
3 Scheme Mining
The inter-dependency of our three proposed objective measures leads to some interesting, and surprisingly complex, optimization problems. In particular, one can put upper bounds on two measures, and ask to nest the data cube in such a way that (i) the upper bounds are satisfied, and (ii) the third measure is optimized. Solutions to these optimization problems are of practical importance: the cube with the optimized scheme will be easier for an end-user to perceive. For example, we may constrain the depth ≤ 2 and, for visualization purposes, the dimensionality ≤ 3 at each depth. For our example data cube, this still leaves several possible nestings; two possibilities are [Item Store Month → [Color → numeric]] and [Item → [Store Month Color → numeric]]. The question then is to determine, given these depth and dimensionality bounds, which scheme gives rise to the lowest cardinalities at each depth.
These optimization problems are data mining problems: the optimal scheme is derived ("mined") from the data content of the cube. Here, schemes play the role of rules in classical data mining problems. The "optimal" scheme can (and usually will) change if the data content changes.
For complexity analysis, the optimization problems can be transformed into roughly equivalent decision problems. For complete data cubes¹, these problems tend to have simple solutions. Nevertheless, for incomplete data cubes, we found that the problems can become intractable. An example is given next. The following problem asks to nest a flat data cube once in the presence of dimensionality restrictions at depth 1 and cardinality restrictions at depth 2. We can prove that the problem is NP-complete.
¹ By complete data cubes, we mean that there is an entry for every combination of values from the underlying domain.
MINIMIZING SUBCUBE CARDINALITY
INSTANCE: A flat (i.e., depth=1) data cube T with dimensionality=D and cardinality=C; positive integers d ≤ D and c ≤ C.
QUESTION: Can T be nested once (i.e., increasing the depth to 2) such that (i) dimensionality ≤ d at depth 1, and (ii) cardinality ≤ c for each nested data subcube at depth 2?
Comment: Transformation from VERTEX COVER. Remains NP-complete even if c = 1.
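To make the decision problem concrete, the following sketch decides small instances by exhaustive search. It is not the authors' method: it simply tries every set of at most d dimensions for depth 1, so its running time grows exponentially with the dimensionality, which is in line with the NP-completeness claim. The function name nest_once is hypothetical, and the sketch assumes that at least one dimension must remain at depth 2.

```python
from collections import Counter
from itertools import combinations

# Flat cube of figure 1 (top): (Item, Store, Color, Month) -> price.
DIMS = ("Item", "Store", "Color", "Month")
FLAT = {
    ("hinge", "Allen's", "yellow", "Jan"): 10,
    ("hinge", "Allen's", "yellow", "Feb"): 11,
    ("hinge", "Allen's", "yellow", "Mar"): 12,
    ("hinge", "Allen's", "red", "Feb"): 8,
    ("hinge", "Allen's", "red", "Mar"): 8,
    ("handle", "Allen's", "yellow", "Jan"): 7,
    ("handle", "Allen's", "yellow", "Feb"): 7,
    ("handle", "Allen's", "yellow", "Mar"): 9,
    ("hinge", "Smith's", "red", "Jan"): 12,
    ("hinge", "Smith's", "red", "Feb"): 13,
    ("hinge", "Smith's", "red", "Mar"): 13,
}


def nest_once(cube, dims, d, c):
    """Return depth-1 dimensions witnessing a yes-answer, or None.

    A witness uses at most `d` dimensions at depth 1 and leaves every
    depth-2 subcube with at most `c` entries. Assumption of this sketch:
    at least one dimension must remain at depth 2.
    """
    n = len(dims)
    for size in range(1, min(d, n - 1) + 1):
        for outer in combinations(range(n), size):
            # Each distinct combination of outer values owns one subcube;
            # its cardinality is the number of flat entries sharing it.
            sizes = Counter(tuple(key[i] for i in outer) for key in cube)
            if max(sizes.values()) <= c:
                return tuple(dims[i] for i in outer)
    return None


witness = nest_once(FLAT, DIMS, d=2, c=3)  # a yes-instance for the example cube
no_key = nest_once(FLAT, DIMS, d=2, c=1)   # None: no two dimensions form a key
```

For c = 1 the question amounts to asking whether some set of at most d dimensions identifies every entry of the cube, which gives some intuition for why the problem remains hard in that case.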
Importantly, we intend that in practical applications, our scheme mining problems will be augmented with user-specified, subjective restrictions. For example, a user interested in store-by-store information will fix Store at depth 1. A better understanding of these scheme mining problems will be useful in the development and the improvement of "semi-autonomous" cube viewers. What we have in mind is a viewer (also called browser) which first tries to meet user-specified wishes. These wishes typically lay down only part of the scheme; a semi-autonomous viewer can then find a completion of the scheme that optimizes certain objective quality measures, and return the resulting view to the user. Since this task turns out to be intractable in many cases, good heuristics need to be developed.
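As an illustration of what such a heuristic might look like, the sketch below greedily completes a partially specified two-level scheme. It is a hypothetical heuristic chosen for this example, not one proposed in this paper, and it carries no optimality guarantee: starting from the user-fixed depth-1 dimensions, it repeatedly moves to depth 1 the dimension that most reduces the largest depth-2 subcube. The names max_subcube and greedy_completion are assumptions of the sketch.

```python
from collections import Counter

# Flat cube of figure 1 (top): (Item, Store, Color, Month) -> price.
DIMS = ("Item", "Store", "Color", "Month")
FLAT = {
    ("hinge", "Allen's", "yellow", "Jan"): 10,
    ("hinge", "Allen's", "yellow", "Feb"): 11,
    ("hinge", "Allen's", "yellow", "Mar"): 12,
    ("hinge", "Allen's", "red", "Feb"): 8,
    ("hinge", "Allen's", "red", "Mar"): 8,
    ("handle", "Allen's", "yellow", "Jan"): 7,
    ("handle", "Allen's", "yellow", "Feb"): 7,
    ("handle", "Allen's", "yellow", "Mar"): 9,
    ("hinge", "Smith's", "red", "Jan"): 12,
    ("hinge", "Smith's", "red", "Feb"): 13,
    ("hinge", "Smith's", "red", "Mar"): 13,
}


def max_subcube(cube, outer):
    """Largest depth-2 subcube when the dimensions at indices `outer` are at depth 1."""
    sizes = Counter(tuple(key[i] for i in outer) for key in cube)
    return max(sizes.values())


def greedy_completion(cube, dims, fixed, d):
    """Greedily extend the user-fixed depth-1 dimensions to at most `d` dimensions."""
    outer = [dims.index(name) for name in fixed]
    free = [i for i in range(len(dims)) if i not in outer]
    # Keep at least one dimension at depth 2.
    while len(outer) < d and len(free) > 1:
        best = min(free, key=lambda i: max_subcube(cube, outer + [i]))
        if max_subcube(cube, outer + [best]) >= max_subcube(cube, outer):
            break  # no remaining dimension shrinks the largest subcube
        outer.append(best)
        free.remove(best)
    depth1 = tuple(dims[i] for i in outer)
    depth2 = tuple(dims[i] for i in free)
    return depth1, depth2, max_subcube(cube, outer)


# A user interested in store-by-store information fixes Store at depth 1.
view = greedy_completion(FLAT, DIMS, fixed=("Store",), d=2)
```

On the example cube, fixing Store and allowing two dimensions at depth 1 leads the sketch to add Month, giving the scheme [Store Month → [Item Color → numeric]] with subcubes of at most three entries.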
4 Concluding Remarks
In this paper we first presented nested data cubes, which provide a useful paradigm for perceiving data sets with high dimensionality and cardinality. A major difference between nested data cubes and nested relations [5] is that the nesting of data cubes is restricted to the right-hand side of the →-symbol, which indicates a functional determinacy. We then introduced "scheme mining." The aim is to restructure an OLAP cube so as to make it simpler to perceive. We established a theoretical framework for scheme mining, and indicated some hard problems, which deserve further attention. Solutions to these problems can be applied in the development of semi-autonomous cube viewers.
Our way of mining schemes is different from scheme design using (nested) normalization theory [5]. Whereas normalization theory starts from a given set of functional and multivalued dependencies, our theory of scheme improvement is solely based on the content of the current data cube. It is true, however, that the fulfillment of cardinality thresholds can be translated into the satisfaction of numerical functional dependencies [3]; a detailed discussion is beyond the scope of this abstract.
References
[1] L. Cabibbo and R. Torlone. Querying multidimensional databases. In Sixth Int. Workshop on Database Programming Languages, pages 253–269, 1997.
[2] S. Dekeyser, B. Kuijpers, J. Paredaens, and J. Wijsen. Nested data cubes for OLAP. Technical Report 9804, University of Antwerp, 1998.
[3] J. Grant and J. Minker. Inferences for numerical dependencies. Theoretical Computer Science, 41:271–287, 1985.
[4] J. Han. OLAP mining: An integration of OLAP with data mining. In Proceedings of the 7th IFIP 2.6 Working Conference on Database Semantics (DS-7), pages 1–9, 1997.
[5] W. Y. Mok, Y.-K. Ng, and D. W. Embley. A normal form for precisely characterizing redundancy in nested relations. ACM Trans. on Database Systems, 21(1):77–106, 1996.
[6] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8(6):970–974, 1996.