Methods and Interpretation of Database Summarisation

John F. Roddick¹, Mukesh K. Mohania¹ and Sanjay Kumar Madria²

¹ School of Computer and Information Science, University of South Australia, The Levels Campus, Mawson Lakes, South Australia 5095, Australia. Email: froddick, [email protected]
² Department of Computer Science, Purdue University, West Lafayette, Indiana, IN 47907, USA. Email: [email protected]
Abstract. The summarisation of large quantities of linear, tabular and multidimensional data to a more manageable quantity has found application in a number of areas including mobile databases, data warehouses, executive reporting and summary (commonly main memory) databases. The method of summarisation used has a significant bearing on the manner in which the resultant data can be employed and the way in which the information content can be subsequently interpreted. This paper surveys various summarisation procedures and provides a categorisation of the mechanisms by which information can be summarised and the nature and conditions under which valid interpretation of the summarised data can be made.
Keywords: Data Summarisation, Inference, Information Capacity, Existential Domain Values, Relation Decomposition.
1 Introduction

The summarisation of data is becoming increasingly important for a number of reasons. Firstly, the increasing quantities of data being collected demand efficient methods for storing and analysing that data. This demand is resulting, inter alia, in the development of more or less autonomous data mining techniques, which aim to extract useful knowledge from structured and semi-structured data. Secondly, Moore’s Law is continuing to apply to the affordability of disk space but is not currently applicable to I/O channel speed improvements, resulting in a greater proportion of processes involving large datasets becoming I/O-bound. Offline analysis of data and data summarisation are thus being offered as solutions to reduce the volume of data that needs to be read in real time. Thirdly, increasing business competitiveness is requiring an information-based industry to make more use of its accumulated data, and thus techniques for presenting useful data to decision makers in a timely manner are becoming crucial.

However, while data summarisation is becoming increasingly used in data mining, data warehouses, data visualisation and main memory databases, to date the mechanisms for data summarisation have been designed for each specific task and this has frequently led to a summarisation procedure producing data which cannot be reused for other purposes. For example, a summarisation procedure that provides synoptic data
for management decision-making may or may not be usable for query optimisation or data distribution. While the cost of summarisation remains low this is not a problem, but as the volume of data increases, summaries that can be used for a number of purposes become extremely useful. Moreover, applications such as mobile databases (which have limited and intermittent access to data on the main server) increasingly rely on making the most of transportable summaries.

This work builds on earlier work which outlined a process for utilising summary data as an alternative to main database access [10]. The concept was to decompose a query such that some of the fine-grained segments of that query might be answered by the summary. The query segments are executed in parallel, with the faster, local database either providing an answer in a shorter time or failing, in which case the main database is used, if available. The advantage of this method is that on the occasions when both summary and main database fail to provide an answer, an overcomplete answer is often possible. For further details please refer to the previous paper.

[Figure: User, Query Processor, Summary DB, Main DB]
Fig. 1. Parallel Summary and Main DB Access
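The architecture of Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two dictionaries, the function names `ask_summary`, `ask_main` and `answer`, and the convention that a source returns `None` when it cannot answer are all hypothetical and are not taken from [10].

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two data sources in Fig. 1.
# Each lookup returns an answer, or None when it cannot answer the query segment.
SUMMARY_DB = {"Full Professor": 80500, "Senior Lecturer": 55000}
MAIN_DB = {"Full Professor": 80500, "Senior Lecturer": 55000, "Programmer": 36000}


def ask_summary(position):
    return SUMMARY_DB.get(position)


def ask_main(position):
    return MAIN_DB.get(position)


def answer(position):
    """Submit both lookups in parallel and prefer the summary's answer if it has one."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        from_summary = pool.submit(ask_summary, position)
        from_main = pool.submit(ask_main, position)
        result = from_summary.result()
        if result is not None:
            return result, "summary"
        result = from_main.result()
        if result is not None:
            return result, "main"
    return None, "unanswered"


print(answer("Full Professor"))
print(answer("Programmer"))
```

In this simplified form the executor still waits for the main database lookup before returning; a fuller implementation would abandon that request once the summary had answered.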
This paper thus investigates the ways in which databases can be summarised with a view to providing maximal support for such architectures given the space restrictions. Summarisation techniques discussed include horizontal and vertical decomposition, domain lattices and existential and statistical attributes. Section 2 provides a discussion on the scope of the techniques available while Section 3 discusses in more depth the semantics of concepts as domain values in a summary database. Section 4 provides a formal categorisation of these techniques and Section 5 provides a further discussion of the field including some ideas for future research.

Previous work in this area includes studies towards the development of induction in databases, which included some discussion of the ways of summarising data [8, 9]; the effects of accommodating hierarchical domains on a query language are discussed in [11]. Other early work in this area includes [2, 6, 12].
2 Data Summarisation Techniques - Discussion and Scope

The various methods of summarisation can be categorised in many different ways. For example, as data summarisation is rarely lossless¹, summarisation techniques could be categorised by the information lost during the process (both explicitly and implicitly) or the structural manner in which the data is eliminated. Alternatively, since inference will be applied to the resultant data, techniques could be categorised by the classes of inference that can be performed, perhaps weighting the classes according to business criteria.

¹ In that there is normally some inference that could have been made from the original data that cannot be made from the summary.

The manner in which the method of summarisation is chosen is highly application dependent and can be complex but includes factors such as the space available, the query hit rate required, the ability to update the summary, the expected or historical query profile, etc. In its simplest form, given a query \(Q\) on a database \(D\) composed of relations \(R_1, R_2, \ldots, R_n\) and a series of summarisation functions \(S_1, S_2, \ldots, S_n\), the ideal summarisation process is such that

\[ Q(D) = Q(S_1(R_1), S_2(R_2), \ldots, S_n(R_n)) \qquad (1) \]
That is, the answer provided by the summarised relations is identical to that which would have been obtained from the original database. For this to happen the summarisation techniques must be chosen carefully. In all of the following examples, we assume that, together with the resultant summary relation, the mechanism used for its construction (and therefore its relationship with the source relation) is known and available.

As an example of summarisation for discussion, consider the relations below, which show an original Employee relation and four successive summarisations.
Employee
Name            Id   Position                Level  PositionType  Salary
Smith, Jane     872  Full Professor          1      Academic      $80,500
Brown, Len      773  Associate Professor     2      Academic      $65,000
Wong, Anne      876  Associate Professor     3      Academic      $68,000
Black, John     992  Senior Lecturer         1      Academic      $55,000
Grey, Kim       090  Senior Lecturer         1      Academic      $55,000
Green, Mike     138  Lecturer                2      Academic      $43,000
Long, Angela    094  Lecturer                2      Academic      $43,000
Brittain, Jack  138  Lecturer                3      Academic      $45,500
Tan, Jim        873  Programmer              1      Non-Academic  $36,000
Lee, Leslie     590  Administrative Officer  2      Non-Academic  $27,500

Employee2
Position                Level  PositionType  Salary
Full Professor          1      Academic      $80,500
Associate Professor     2      Academic      $65,000
Associate Professor     3      Academic      $68,000
Senior Lecturer         1      Academic      $55,000
Lecturer                2      Academic      $43,000
Lecturer                3      Academic      $45,500
Programmer              1      Non-Academic  $36,000
Administrative Officer  2      Non-Academic  $27,500

Employee3
Position             Level  Salary
Full Professor       1      $80,500
Associate Professor  2      $65,000
Associate Professor  3      $68,000
Senior Lecturer      1      $55,000
Lecturer             2      $43,000
Lecturer             3      $45,500

Employee4
Position             Salary Range
Full Professor       $80,500
Associate Professor  $65,000 - $68,000
Senior Lecturer      $55,000
Lecturer             $43,000 - $45,500

Employee5
Position               Salary Range
Senior Academic Staff  $65,000 - $80,500
Junior Academic Staff  $43,000 - $55,000
Each of these summarisations has reduced the size of the previous relation and represents a different method of summarisation. Moreover, at each point a number of alternative steps in the summarisation process were available. Employee2 is constructed from Employee by deleting the two attributes with the largest spread of domain values and eliminating duplicates. Employee3 is constructed by retaining only tuples with a PositionType of Academic and removing the attribute PositionType. Employee4 then introduces range values into the Salary domain and collapses a number of tuples as a consequence, and Employee5 then utilises concept hierarchies (qv. section 3) to reduce the Position field to fewer values.

Each type of reduction also results in a different semantic interpretation. For example, a query that requires merely the attributes that remain can be answered from Employee2 as easily as it can from Employee (and possibly more quickly if Employee2 can be stored on a faster access device, at a local site or in main memory). The same query of Employee3 can only be answered if it is known that only those tuples with a PositionType of Academic are to be included, for instance, by an explicit reference in the query or by virtue of an interim result. By the time summarisation reduces to Employee5 the available information is restricted severely, particularly if the rules by which the data has been summarised are not available. Consider the following five queries:

– What is Professor Smith’s Salary?
– Given that Leslie Lee is an Administrative Officer, what is her salary?
– Given that Jane Smith is a Professor, what is her salary?
– Does the Department employ any Associate Professors?
– What is the minimum salary paid to a Senior Academic Staff member?

Assuming we know the summarisation mechanisms, Employee can provide answers to all five of these questions, Employee2 to all but the first question, Employee3 to all but the first two, etc.²
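The first two reduction steps, and the way a summary can or cannot answer the salary queries, can be sketched as follows. This is a minimal illustration only: the helper names `project` and `salary_for` are hypothetical, salaries are held as integers, and the rule "answer only when the position determines exactly one salary" is an assumption made for the sketch.

```python
COLUMNS = ("Name", "Id", "Position", "Level", "PositionType", "Salary")

# The source Employee relation, transcribed from the table in this section.
EMPLOYEE = [dict(zip(COLUMNS, row)) for row in [
    ("Smith, Jane", "872", "Full Professor", 1, "Academic", 80500),
    ("Brown, Len", "773", "Associate Professor", 2, "Academic", 65000),
    ("Wong, Anne", "876", "Associate Professor", 3, "Academic", 68000),
    ("Black, John", "992", "Senior Lecturer", 1, "Academic", 55000),
    ("Grey, Kim", "090", "Senior Lecturer", 1, "Academic", 55000),
    ("Green, Mike", "138", "Lecturer", 2, "Academic", 43000),
    ("Long, Angela", "094", "Lecturer", 2, "Academic", 43000),
    ("Brittain, Jack", "138", "Lecturer", 3, "Academic", 45500),
    ("Tan, Jim", "873", "Programmer", 1, "Non-Academic", 36000),
    ("Lee, Leslie", "590", "Administrative Officer", 2, "Non-Academic", 27500),
]]


def project(rows, attrs):
    """Keep only `attrs` and eliminate duplicate tuples, preserving order."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


# Employee2: delete Name and Id, then eliminate duplicates.
EMPLOYEE2 = project(EMPLOYEE, ["Position", "Level", "PositionType", "Salary"])

# Employee3: retain Academic tuples only, then remove PositionType.
EMPLOYEE3 = project(
    [r for r in EMPLOYEE2 if r["PositionType"] == "Academic"],
    ["Position", "Level", "Salary"],
)


def salary_for(rows, position):
    """Return the salary if `position` determines exactly one salary, else None."""
    salaries = {r["Salary"] for r in rows if r["Position"] == position}
    return salaries.pop() if len(salaries) == 1 else None


print(len(EMPLOYEE), len(EMPLOYEE2), len(EMPLOYEE3))    # 10 8 6
print(salary_for(EMPLOYEE2, "Administrative Officer"))  # 27500
print(salary_for(EMPLOYEE3, "Administrative Officer"))  # None
```

Under these assumptions Employee2 can answer the question about Leslie Lee while Employee3 cannot, which matches the pattern described above.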
² In some circumstances, the ability to answer a closely related question or provide an answer which is a generalisation of the correct answer (for instance in real-time applications, mobile databases, etc.) may be useful and could be performed. This aspect is not covered here and readers are directed to other work, such as [5, 10], for further details.

There is an important issue here: not only the answer, but also the ability to answer, is as dependent on the data as on the construction of the relation. This differs from unsummarised relations in which the ability to answer the question is dependent wholly on the database structure. For example, the question Given that Mike Green is a Lecturer, what is his salary? cannot be answered by reference to Employee2 (unlike the example regarding Leslie Lee’s salary above).

The interpretation of the absence of information in an unsummarised (i.e. a source) relation is commonly governed by the closed world assumption: put simply, if it is not recorded, it is considered false. For summary relations a tri-state logic approach is necessary and this must be incorporated into the mechanism for interpreting summarised relations. For example, again using Employee2, consider the following questions:

– Given that Jane Smith is a Full Professor, what is her Salary?
– Given that Anne Wong is an Associate Professor, what is her Salary?
– Given that Chris Trent is an Assistant Professor, what is his Salary?

The first can be answered; the second and third cannot. The second cannot be answered because the summarisation process has introduced ambiguity in the salary level of an Associate Professor. The third cannot be answered because Employee2 shows that there are no Assistant Professors recorded. Moreover, assuming we know the summarisation method, Employee2 can also state that the main database will not be able to answer the query either.

2.1 Augmenting the Summary Relation

Note that some augmentation, in the form of additional attributes, could also be performed during summarisation. For example, the number of original tuples represented by each summarised tuple might be held, which could assist in answering some questions. Employee might, for instance, be reduced to Employee2B as below.

Employee2B
Position                Level  PositionType  Salary   Source Cardinality
Full Professor          1      Academic      $80,500  1
Associate Professor     2      Academic      $65,000  1
Associate Professor     3      Academic      $68,000  1
Senior Lecturer         1      Academic      $55,000  2
Lecturer                2      Academic      $43,000  2
Lecturer                3      Academic      $45,500  1
Programmer              1      Non-Academic  $36,000  1
Administrative Officer  2      Non-Academic  $27,500  1

A question such as How many Senior Lecturers are in the Department? could then be answered. Furthermore, if we were to impose the rule that all source tuples were to be represented in some form in the summarised relation, perhaps through the introduction of additional existential domain values (such as Other values, Many values, etc.), then further questions could still be answered. Consider the reduction of Employee2B to Employee3B as below.

Employee3B
Position             Level  Salary   Source Cardinality
Full Professor       1      $80,500  1
Associate Professor  2      $65,000  1
Associate Professor  3      $68,000  1
Senior Lecturer      1      $55,000  2
Lecturer             2      $43,000  2
Lecturer             3      $45,500  1
OTHER                MANY   OTHER    2
In this example, a question such as How many members of the Department are there? could continue to be answered. Note that the space taken for a source cardinality attribute or the existential domain values would normally be relatively small. The semantics of existential domain values is considered formally in section 4.5. Clearly, it is important if we are to reuse summarised data for a number of purposes that there is a general understanding of the semantics of generalised data and of the rules of reduction, particularly if summarised data is to be shared between systems.
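The reduction of Employee2B to Employee3B, including the existential tuple and the use of source cardinality for counting, can be sketched as below. The rule used to choose between OTHER and MANY (OTHER when none of the removed values also appears in a retained tuple, MANY otherwise) is an assumption inferred from the Employee3B table and the definitions in section 4.5, and the function names are hypothetical.

```python
# Employee2B from the text: (Position, Level, PositionType, Salary, SourceCardinality)
EMPLOYEE2B = [
    ("Full Professor", 1, "Academic", 80500, 1),
    ("Associate Professor", 2, "Academic", 65000, 1),
    ("Associate Professor", 3, "Academic", 68000, 1),
    ("Senior Lecturer", 1, "Academic", 55000, 2),
    ("Lecturer", 2, "Academic", 43000, 2),
    ("Lecturer", 3, "Academic", 45500, 1),
    ("Programmer", 1, "Non-Academic", 36000, 1),
    ("Administrative Officer", 2, "Non-Academic", 27500, 1),
]


def existential(dropped, kept):
    """Assumed rule: OTHER if no dropped value also appears in a kept tuple, else MANY."""
    return "OTHER" if set(dropped).isdisjoint(kept) else "MANY"


def reduce_with_other(rows):
    """Keep Academic tuples without PositionType; fold the rest into one existential tuple."""
    kept = [(p, lvl, sal, n) for p, lvl, kind, sal, n in rows if kind == "Academic"]
    dropped = [(p, lvl, sal, n) for p, lvl, kind, sal, n in rows if kind != "Academic"]
    if dropped:
        summary_values = [
            existential([d[col] for d in dropped], [k[col] for k in kept])
            for col in range(3)
        ]
        kept.append((*summary_values, sum(d[3] for d in dropped)))
    return kept


def head_count(rows, position=None):
    """Total the source cardinalities, optionally for a single position."""
    return sum(r[-1] for r in rows if position is None or r[0] == position)


EMPLOYEE3B = reduce_with_other(EMPLOYEE2B)
print(EMPLOYEE3B[-1])                             # ('OTHER', 'MANY', 'OTHER', 2)
print(head_count(EMPLOYEE3B))                     # 10
print(head_count(EMPLOYEE3B, "Senior Lecturer"))  # 2
```

The extra tuple costs one row, yet the total head count of the Department remains recoverable from the summary.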
3 Concepts as Domain Values

3.1 Semantics

It is essential to define carefully the semantics of summarised data, including the use of any higher level concepts that will be used as domain values. The definition of null, for example, in classical relational database theory had a number of meanings, the two most significant being [15]:

– The attribute value was unknown;
– The attribute was inapplicable.

The misinterpretation of this value has led, on occasions, to problems³. Concepts used as domain values must be defined with similar care (indeed, perhaps more so), particularly if range queries are to be processed correctly. There are at least three possible interpretations when a concept is used as a domain value such as in Employee5 (remembering that a tuple in a summary database may represent a number of instances):

– The attribute takes zero or more of the values represented by the concept;
– The attribute takes one or more of the values represented by the concept;
– The attribute takes all of the values represented by the concept.

³ Note that as well as the misinterpretation of null, the use of other special purpose values has also been problematic. Consider, for example, the oft used code of 99 which has contributed to problems of non-year 2000 compliance when used in date fields.
Summary database construction and query language processing depend critically on which of these definitions is adopted. For example, take the following two queries of Employee5 earlier:

– Do we employ any Senior Academic Staff members?
– Do we employ any Associate Professors?

Each interpretation of Senior Academic Staff would yield different answers to these questions. To the first question, the first interpretation would have to answer Unknown while the latter two interpretations would be able to answer Yes. To the second question, the first two interpretations would have to answer Unknown while the last interpretation would answer Yes.

We take the pragmatic view that the second interpretation is the most useful for three reasons:

– The method of summary database construction leads naturally to this interpretation;
– The utility of the first interpretation is limited;
– The chances of a successful summarisation (including the maintenance of those summaries) with the third interpretation are small.
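The effect of the three interpretations on the two queries can be written down directly. In the sketch below the membership of Senior Academic Staff is an assumption (inferred from the salary ranges in Employee5), and the names `Interpretation`, `CONCEPTS` and `do_we_employ` are hypothetical.

```python
from enum import Enum


class Interpretation(Enum):
    ZERO_OR_MORE = "zero or more"
    ONE_OR_MORE = "one or more"
    ALL = "all"


# Assumed membership of the concept, inferred from the salary ranges in Employee5.
CONCEPTS = {"Senior Academic Staff": {"Full Professor", "Associate Professor"}}

# The Position values actually stored in Employee5.
EMPLOYEE5_POSITIONS = ["Senior Academic Staff", "Junior Academic Staff"]


def do_we_employ(term, summary_concepts, interpretation):
    """Answer 'Do we employ any <term>?' from concepts stored in a summary relation."""
    for concept in summary_concepts:
        if term == concept:
            # The concept itself is recorded: only 'zero or more' leaves doubt.
            return "Unknown" if interpretation is Interpretation.ZERO_OR_MORE else "Yes"
        if term in CONCEPTS.get(concept, set()):
            # A specific value under a recorded concept: certain only under 'all'.
            return "Yes" if interpretation is Interpretation.ALL else "Unknown"
    return "Unknown"


for interpretation in Interpretation:
    print(
        interpretation.value,
        do_we_employ("Senior Academic Staff", EMPLOYEE5_POSITIONS, interpretation),
        do_we_employ("Associate Professor", EMPLOYEE5_POSITIONS, interpretation),
    )
```

Under the second (one or more) interpretation the sketch answers Yes to the first query and Unknown to the second.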
4 A Formal Categorisation of Summarisation Strategies

As stated at the start of Section 2, the categorisation of summarisation methods can be performed in a variety of ways. In the taxonomy below, we adopt the following metric of utility for a summarisation technique:

\[ W = \frac{\psi(R)}{\psi(S)} \cdot \frac{\iota(S)}{\iota(R)} \qquad (2) \]

where \(\psi\) returns the space requirements of a relation, in this case either the complete relation \(R\) or the summary relation \(S\), and \(\iota\) returns the Information Capacity of a relation (see Section 4.1 below).
Thus, using various summarisation strategies, a summarisation weighting W can be determined which weighs the advantage of the space saving against the disadvantage of the loss of information capacity of the resultant relation. As these weightings are constructed using the same elements, they can be directly compared and the most beneficial strategies adopted. A number of summarisation techniques are identified below. On top of these, compression techniques can be adopted to reduce space usage; as they do not affect the information capacity of the resulting dataset, they will not be investigated here (although, interestingly, the space saving can be directly compared with that of the summarisation techniques here through equation 2).

4.1 Information Capacity

Information Capacity (as a measure of the usefulness or inferencing capability) of relations has been discussed in other research and will not be explored in depth here [7].
However, it is important to note that the calculation of the information capacity function \(\iota\) is likely to be highly context sensitive and may depend on the historical use of the attributes or the participation of attributes in foreign key dependencies and constraints. Moreover, while the first term in the calculation of W above is fairly accurate in terms of the space saving, the second term is highly dependent on the data actually held and will be, in many cases, an estimated value. Note that while a value for \(\iota(S)\) or \(\iota(R)\) may be hard to determine, a value for their ratio (i.e. \(\iota(S)/\iota(R)\)) will be easier.

4.2 Vertical Reduction by Attribute Projection

This method involves the selection and deletion of one or more attributes and the subsequent elimination of duplicate tuples. It is a commonly adopted method of summarisation. For each attribute, the summarisation weighting W is determined as follows:
\[ W(t) = \frac{\psi(R)}{\psi(R - t)} \cdot \frac{K(R)}{K(R - t)} \cdot \frac{\iota(R - t)}{\iota(R)} \qquad (3) \]

where \(\psi\) returns the space requirements of a tuple, in this case either the complete tuple of the relation \(R\) or a tuple without attribute \(t\); \(K\) returns the cardinality of a relation, in this case the cardinality of \(R\) with attribute \(t\) projected out and of \(R\) itself; and \(\iota\) returns the Information Capacity of a relation.
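As a worked illustration, the sketch below reads the weighting for vertical reduction as the product of three ratios: tuple size, cardinality and information capacity. That reading, the character-count proxy for tuple size, the cut-down relation `R` and the function names are all assumptions made for the example; in particular the information capacity ratio is simply supplied by the caller, since its calculation is context sensitive.

```python
# A cut-down copy of Employee used only for illustration: (Name, Position, Level, Salary)
R = [
    ("Smith, Jane", "Full Professor", 1, 80500),
    ("Brown, Len", "Associate Professor", 2, 65000),
    ("Wong, Anne", "Associate Professor", 3, 68000),
    ("Black, John", "Senior Lecturer", 1, 55000),
    ("Grey, Kim", "Senior Lecturer", 1, 55000),
    ("Green, Mike", "Lecturer", 2, 43000),
    ("Long, Angela", "Lecturer", 2, 43000),
    ("Brittain, Jack", "Lecturer", 3, 45500),
    ("Tan, Jim", "Programmer", 1, 36000),
    ("Lee, Leslie", "Administrative Officer", 2, 27500),
]


def tuple_size(row):
    """Hypothetical space proxy: characters needed to write out the tuple's values."""
    return sum(len(str(value)) for value in row)


def mean_tuple_size(rows):
    return sum(tuple_size(r) for r in rows) / len(rows)


def drop_attribute(rows, index):
    """Remove the attribute at `index`, then eliminate duplicate tuples."""
    return list(dict.fromkeys(
        tuple(v for i, v in enumerate(row) if i != index) for row in rows
    ))


def weighting(rows, index, capacity_ratio):
    """W(t) as a product of ratios; capacity_ratio stands in for the information capacity term."""
    reduced = drop_attribute(rows, index)
    space_term = mean_tuple_size(rows) / mean_tuple_size(reduced)
    cardinality_term = len(rows) / len(reduced)
    return space_term * cardinality_term * capacity_ratio


# Compare removing Name (index 0) with removing Level (index 2),
# assuming for the sake of the example that each halves the information capacity.
print(weighting(R, 0, 0.5), weighting(R, 2, 0.5))
```

Removing Name shortens every tuple and also collapses duplicate tuples, so it scores higher than removing Level for the same assumed capacity ratio.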
For semantic query optimisation purposes, the summary relation \(S\) (\(= R - A\), where \(A\) is the set of attributes removed) can be used whenever no attributes in \(A\) will be used directly or indirectly (through use as a foreign key, for example) in the query process⁴.

4.3 Horizontal Reduction by Tuple Selection

This method involves the retention or deletion of tuples according to one or more selection criteria. As with vertical reduction, it is a commonly adopted method of summarisation. For each selection criterion, the summarisation weighting W can be determined as follows:
\[ W(c) = \frac{K(R)}{K(\sigma_c R)} \cdot \frac{\iota(\sigma_c R)}{\iota(R)} \qquad (4) \]

where \(K\) returns the cardinality of a relation, in this case the cardinality of \(R\) over selection criterion \(c\) and of \(R\) itself, and \(\iota\) returns the information capacity of a relation as for vertical reduction.

For semantic query optimisation purposes, the summary relation \(S\) (\(= \sigma_c R\), where \(c\) is the conjunction of the selection criteria applied to either remove/retain tuples) can be used instead of \(R\) whenever \(c\) is exclusive of/subsumes the selection criterion specified in the query process. For example, if S only contains details of the staff in Science and the query applies only to Computer Science staff, then the query can be executed on S instead of R with no loss of accuracy.

⁴ Note that in practice, the use of S may result in the suppression of duplicate lines due to the elimination of duplicate tuples. In pure relational databases this should happen anyway but few implementations adhere to this.
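The Science and Computer Science example can be checked with a short sketch. It borrows the EmplDept relation that appears in section 4.4; the function names are hypothetical, and the weighting is read as the cardinality ratio multiplied by a caller-supplied information capacity ratio, which is an assumption about the form of the horizontal weighting.

```python
# EmplDept as given in section 4.4: (EmplId, Rank, Department, Faculty)
EMPL_DEPT = [
    ("091", "Assoc. Professor", "Computer Science", "Science"),
    ("100", "Professor", "History", "Humanities"),
    ("114", "Senior Lecturer", "Languages", "Humanities"),
    ("117", "Professor", "Mathematics", "Science"),
    ("134", "Senior Lecturer", "Computer Science", "Science"),
    ("383", "Professor", "Languages", "Humanities"),
    ("763", "Assoc. Professor", "Electronic Engineering", "Engineering"),
    ("873", "Professor", "Computer Science", "Science"),
    ("889", "Assoc. Professor", "Computer Science", "Science"),
    ("927", "Senior Lecturer", "History", "Humanities"),
]


def in_science(row):
    return row[3] == "Science"


def in_computer_science(row):
    return row[2] == "Computer Science"


def select(rows, criterion):
    """Horizontal reduction: retain only the tuples that satisfy the criterion."""
    return [row for row in rows if criterion(row)]


def horizontal_weighting(source, summary, capacity_ratio):
    """Cardinality ratio times an assumed information capacity ratio."""
    return (len(source) / len(summary)) * capacity_ratio


# The summary S holds Science staff only.
S = select(EMPL_DEPT, in_science)

# A Computer Science query is subsumed by the Science criterion,
# so it returns the same tuples from S as from the source relation.
print(select(S, in_computer_science) == select(EMPL_DEPT, in_computer_science))  # True
print(horizontal_weighting(EMPL_DEPT, S, 1.0))                                   # 2.0
```

A query about History staff, by contrast, would find nothing in S and would have to be passed to the source relation.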
4.4 Concept Ascension and Ranges

Horizontal Reduction by Concept Ascension. The last two methods are relatively simple to understand and accommodate but suffer from the problem that the information capacity reduces rapidly as attributes and tuples are removed. The idea of accommodating hierarchical domains has been discussed elsewhere [3–5, 10] and provides a mechanism whereby the information capacity of a summary dataset may degrade more slowly for a similar reduction in space. Briefly, the idea is to provide higher level concepts (which in earlier work [5] we referred to as Domain-Value Hierarchies or DVH), commonly through user-supplied hierarchies although they may also be generated a priori by autonomous procedures. When applied to one or more attributes, these concepts may result in duplicate tuples that may then be coalesced. This method can be especially useful with temporal, spatio-temporal and other implicitly hierarchical data. The summarisation weighting calculation is similar to that for horizontal reduction:
\[ W(c) = \frac{K(R)}{K(A(R))} \cdot \frac{\iota(A(R))}{\iota(R)} \qquad (5) \]

where \(A(R)\) returns the relation after applying various concept ascension procedures. The definition of the information capacity function \(\iota\) is difficult to determine and depends largely on the manner in which hierarchies are used in query optimisation and query processing. Clearly, if the query processor is able (and is allowed) to provide sound but possibly overcomplete answers⁵, the information capacity will be higher. Nevertheless, it has been shown elsewhere that summarised data can sometimes produce (knowingly) correct answers [10], and thus even in this limited case, the information capacity may be higher.
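Concept ascension on the Position attribute can be sketched as follows, taking Employee3 from section 2 directly to the two tuples of Employee5. The hierarchy `DVH` is an assumption: the paper does not list it, and the mapping here is inferred from the salary ranges shown in Employee5. The function name `ascend` is likewise hypothetical.

```python
# Employee3 from section 2: (Position, Level, Salary)
EMPLOYEE3 = [
    ("Full Professor", 1, 80500),
    ("Associate Professor", 2, 65000),
    ("Associate Professor", 3, 68000),
    ("Senior Lecturer", 1, 55000),
    ("Lecturer", 2, 43000),
    ("Lecturer", 3, 45500),
]

# Assumed Domain-Value Hierarchy, inferred from the ranges in Employee5.
DVH = {
    "Full Professor": "Senior Academic Staff",
    "Associate Professor": "Senior Academic Staff",
    "Senior Lecturer": "Junior Academic Staff",
    "Lecturer": "Junior Academic Staff",
}


def ascend(rows, hierarchy):
    """Replace each Position by its parent concept and coalesce tuples into salary ranges."""
    ranges = {}
    for position, _level, salary in rows:
        concept = hierarchy.get(position, position)
        low, high = ranges.get(concept, (salary, salary))
        ranges[concept] = (min(low, salary), max(high, salary))
    return ranges


EMPLOYEE5 = ascend(EMPLOYEE3, DVH)
print(EMPLOYEE5)
# {'Senior Academic Staff': (65000, 80500), 'Junior Academic Staff': (43000, 55000)}

# Cardinality term of the weighting: six source tuples are represented by two.
print(len(EMPLOYEE3) / len(EMPLOYEE5))  # 3.0
```

The cardinality ratio here is 3, and the overall weighting would then depend on how much information capacity the two range tuples are judged to retain.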
Vertical Reduction by Concept Ascension. In rarer cases, a concept hierarchy may exist between attributes (an Inter-Attribute Hierarchy or IAH). It should be noted that these will only occur if the relation is deliberately unnormalised (such as commonly occurs in a data warehouse) or if an induced dependency exists. In this case, either the attribute with the higher or the lower concept can be deleted. Deletion of the attribute with the higher concept would result in no information loss (assuming the hierarchy is held elsewhere) but would generally result in a lower space saving. Deletion of the attribute with the lower concept would result in some information loss but would generally allow greater compression of the relation. See, for example, the relations below, in which EmplId is removed followed by either Faculty (as in EmplDept2A) or Department (EmplDept2B):

⁵ i.e. an answer which contains all of the requested tuples but may also contain additional answers. For example, the return of a range within which all answers lie might also include other answers which do not fit the criteria.
EmplDept
EmplId  Rank              Department              Faculty
091     Assoc. Professor  Computer Science        Science
100     Professor         History                 Humanities
114     Senior Lecturer   Languages               Humanities
117     Professor         Mathematics             Science
134     Senior Lecturer   Computer Science        Science
383     Professor         Languages               Humanities
763     Assoc. Professor  Electronic Engineering  Engineering
873     Professor         Computer Science        Science
889     Assoc. Professor  Computer Science        Science
927     Senior Lecturer   History                 Humanities

EmplDept2A
Rank              Department
Professor         Computer Science
Professor         History
Professor         Languages
Professor         Mathematics
Assoc. Professor  Computer Science
Assoc. Professor  Electronic Engineering
Senior Lecturer   Computer Science
Senior Lecturer   History
Senior Lecturer   Languages

EmplDept2B
Rank              Faculty
Professor         Humanities
Professor         Science
Assoc. Professor  Engineering
Assoc. Professor  Science
Senior Lecturer   Humanities
Senior Lecturer   Science
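The two alternatives can be reproduced with a small sketch. The helper name `project` and the dictionary `FACULTY_OF` (standing for the inter-attribute hierarchy "held elsewhere") are hypothetical names chosen for this example.

```python
# EmplDept from the table above: (EmplId, Rank, Department, Faculty)
EMPL_DEPT = [
    ("091", "Assoc. Professor", "Computer Science", "Science"),
    ("100", "Professor", "History", "Humanities"),
    ("114", "Senior Lecturer", "Languages", "Humanities"),
    ("117", "Professor", "Mathematics", "Science"),
    ("134", "Senior Lecturer", "Computer Science", "Science"),
    ("383", "Professor", "Languages", "Humanities"),
    ("763", "Assoc. Professor", "Electronic Engineering", "Engineering"),
    ("873", "Professor", "Computer Science", "Science"),
    ("889", "Assoc. Professor", "Computer Science", "Science"),
    ("927", "Senior Lecturer", "History", "Humanities"),
]


def project(rows, keep):
    """Keep the columns in `keep` and eliminate duplicate tuples, preserving order."""
    return list(dict.fromkeys(tuple(row[i] for i in keep) for row in rows))


# Remove EmplId, then either Faculty (2A) or Department (2B).
EMPL_DEPT_2A = project(EMPL_DEPT, [1, 2])  # (Rank, Department)
EMPL_DEPT_2B = project(EMPL_DEPT, [1, 3])  # (Rank, Faculty)

# The inter-attribute hierarchy, held separately from the summary.
FACULTY_OF = dict(project(EMPL_DEPT, [2, 3]))

# With the hierarchy available, Faculty can be restored from EmplDept2A ...
restored = [(rank, dept, FACULTY_OF[dept]) for rank, dept in EMPL_DEPT_2A]

print(len(EMPL_DEPT_2A), len(EMPL_DEPT_2B))               # 9 6
print(set(restored) == set(project(EMPL_DEPT, [1, 2, 3])))  # True
```

Department cannot be restored from EmplDept2B in the same way, since several departments share a faculty; that is the information loss traded for the smaller relation.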
The summarisation weighting can be calculated in the same way as for Vertical Reduction by Attribute Projection.

For the purposes of summarisation and the summarisation weighting calculation, the introduction of ranges can be considered a special case of concept ascension. For example, the reduction of Employee3 to Employee4 resulted in the same form of tuple aggregation as would have happened if concepts equivalent to each range had existed instead.

4.5 Existential Domain Values

Some Existential Domain Values (ESDs) can be considered a special case of the definition of concepts, notably the root nodes; however, even in these cases it remains important to define carefully the semantics of such a value. Three ESDs are considered and defined here:

– MANY - The attribute can take one or more of the allowable values from the domain. Note that this is effectively the root node of the concept hierarchy for the domain. Note also that the term MANY is used in preference to ANY or ALL due to the semantics described in section 3.1.
– OTHER - The attribute can take one or more of the allowable values from the domain except those already used by that attribute in other explicitly enumerated tuples. This value was termed REST in some earlier work [8] but that term has been discarded for the same reasons as for ANY and ALL.
– NULL - The attribute is inapplicable or the value is unknown.

Other ESDs could, of course, be defined and while much research remains to be done, these are considered the most useful at this stage. Consider the earlier example, Employee3B. The advantage of the additional tuple becomes clear with questions such as:

– Which ranks are paid $80,500? and
– How many staff are recorded in the database?

In the first case, the OTHER indicates that whatever tuples there were that were deleted, none had a salary of $80,500 and thus the question can be answered exactly. In the second case, the source cardinalities can be totalled to give the correct answer.

Note that one of the significant benefits from the introduction of ESDs is in improvements in query language optimisation, a field that has increasingly encountered problems with the introduction of data mining and warehousing technology. For example, using ESDs, a complete mapping can be enforced between tuples in a relation and its summarised counterpart, as follows. Given a source relation \(R\) with tuples \(r_1, r_2, \ldots, r_n\) and a summarised relation \(S\) with tuples \(s_1, s_2, \ldots, s_m\), we have a complete mapping:

\[ \exists i, j : \forall r_i \in R \rightarrow \Gamma(s_j \in S) \qquad (6) \]

where \(\Gamma\) is a function that takes each attribute value and translates it to itself or to a higher level concept of the same concept (including ESDs). That is, every source tuple is represented in some form by a tuple of the summary.
A converse mapping also exists. Note that this indicates that if a fact is not represented in the summary relation in some form, it will also be absent from the source relation and thus some query processing may be completed using summary relations (qv. [10]).

4.6 Source Cardinality, Statistical Summarisation and other Summarisation Techniques

The introduction of additional attributes, such as the number of original tuples represented by each summary tuple, is another strategy to improve the information capacity of the resultant, summarised relation. Source cardinality, for example, could be used to answer questions requiring the count of tuples (although care must be taken with averages if ranges have also been introduced). However, the source cardinality is also of use in summary database maintenance, as an addition, amendment or deletion in the main database may, or may not, result in a corresponding change to the summarised data.
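The complete mapping of section 4.5 can be checked mechanically. The sketch below is one possible reading, not a definitive one: MANY is taken to match any value, OTHER to match any value not used in that column by an explicitly enumerated tuple, and concept hierarchies are left out for brevity. The function names and the rule for deciding when a salary question can be answered exactly are assumptions made for the example.

```python
# Distinct (Position, Level, Salary) combinations present in the source Employee relation.
SOURCE = [
    ("Full Professor", 1, 80500),
    ("Associate Professor", 2, 65000),
    ("Associate Professor", 3, 68000),
    ("Senior Lecturer", 1, 55000),
    ("Lecturer", 2, 43000),
    ("Lecturer", 3, 45500),
    ("Programmer", 1, 36000),
    ("Administrative Officer", 2, 27500),
]

# Employee3B without its Source Cardinality column.
SUMMARY = SOURCE[:6] + [("OTHER", "MANY", "OTHER")]

ESDS = {"MANY", "OTHER"}


def covers(summary_row, source_row, enumerated):
    """True if every attribute of source_row is represented by summary_row."""
    for col, (s_val, r_val) in enumerate(zip(summary_row, source_row)):
        if s_val == "MANY":
            continue  # MANY stands for any value of the domain
        if s_val == "OTHER":
            if r_val in enumerated[col]:
                return False  # OTHER excludes explicitly enumerated values
            continue
        if s_val != r_val:
            return False
    return True


def complete_mapping(source, summary):
    """Check that each source tuple is represented by at least one summary tuple."""
    width = len(summary[0])
    enumerated = [{row[c] for row in summary} - ESDS for c in range(width)]
    return all(any(covers(s, r, enumerated) for s in summary) for r in source)


def ranks_paid(summary, salary):
    """Ranks paid `salary`, or None when hidden tuples might also match."""
    exact = sorted({row[0] for row in summary if row[2] == salary})
    hidden = any(
        row[2] == "MANY" or (row[2] == "OTHER" and not exact) for row in summary
    )
    return None if hidden else exact


print(complete_mapping(SOURCE, SUMMARY))       # True
print(complete_mapping(SOURCE, SUMMARY[:-1]))  # False
print(ranks_paid(SUMMARY, 80500))              # ['Full Professor']
print(ranks_paid(SUMMARY, 36000))              # None
```

Without the existential tuple the mapping is incomplete, and the summary could no longer establish that a fact missing from it is also missing from the source.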
5 Further Discussion and Future Research

This research was originally conceived to handle the increasing disparity between the fast decreasing costs of disk storage and main memory and the slower improvements in the performance of I/O access [9]. However, we have also applied these ideas to mobile databases [5] and we believe they may also find applicability in distributed systems, including internet-based distributed data sources.

Interest in the development of models for semi-structured data [13] has suggested both the embedding of the structural component of the schema and the introduction of (largely structural) Document Type Descriptors [14]. Both remove the domain descriptions that have traditionally resided in the database schema and which are useful for data integration. It is thus necessary for any integration mechanism to be moved to the mediator [1]; however, it is not yet clear how this might be achieved.

While this paper does not discuss the details of relational operations on summary relations, it is clear that some forms of induction are being introduced into the normally “deductively” correct area of query language processing and it should be expected that a change in the interpretation of answers may be required. In [10] we discussed knowingly correct answers and outlined a mechanism for providing a graceful degradation of responses. This becomes more significant when complex operations are performed over a number of relations, some of them summaries.
References

1. Cluet, S., Delobel, C., Simeon, J. and Smaga, K.: Your mediators need data conversion! In Proc. ACM SIGMOD International Conference on the Management of Data, 177-188. 1998.
2. Chen, M.C. and McNamee, L.P.: On the data model and access method of summary data management. IEEE Trans. Knowl. and Data Eng., 1(4):519-529. 1989.
3. Han, J. and Fu, Y.: Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases. In Proc. AAAI’94 Workshop on Knowledge Discovery in Databases, 157-168. 1994.
4. Hu, X.: Conceptual Clustering and Concept Hierarchies in Knowledge Discovery. Masters Thesis, Simon Fraser University, 1993.
5. Madria, S.K., Mohania, M.K. and Roddick, J.F.: A query processing model for mobile computing. In Proc. Foundations of Database Organisation, 147-157. 1998.
6. Malvestuto, F.: The derivation problem for summary data. SIGMOD Rec., 17(3):82-89. 1988.
7. Miller, R.J., Ioannidis, Y.E. and Ramakrishnan, R.: The use of information capacity in schema integration and translation. In Proc. Nineteenth International Conference on Very Large Databases, 120-133. 1993.
8. Roddick, J.F.: A model for temporal inductive inference and schema evolution in relational database systems. Ph.D. Thesis, Department of Computer Science and Computer Engineering, La Trobe University, 1994.
9. Roddick, J.F., Craske, N.G. and Richards, T.J.: Handling discovered structure in database systems. IEEE Trans. Knowl. and Data Eng., 8(2 (April)):227-240. 1996.
10. Roddick, J.F.: The use of overcomplete logics in summary data management. In Proc. Eighth Australasian Conference on Information Systems, 288-298. 1997.
11. Roddick, J.F. and Rice, S.: Towards induction in databases. In Proc. Ninth Australasian Information Systems Conference, 2, 534-542. 1998.
12. Sato, H.: Handling summary information in a database. SIGMOD Rec., 1981.
13. Suciu, D.: Semistructured Data and XML. In Proc. 5th International Conference on Foundations of Data Organisation, 1-12. 1998.
14. World Wide Web Consortium: Extensible Markup Language (xml). Version 1.0. 1998.
15. Zaniolo, C.: Database relations with null values. J. Comput. Syst. Sci., 28(1):142-166. 1984.