Using lattice-structured domains to represent

1 downloads 0 Views 205KB Size Report
rice@cis.unisa.edu.au http://www.cis.unisa.edu.au/people/spr/. John F. Roddick. School of Informatics and Engineering, Flinders University of South Australia,.
Using lattice-structured domains to represent imperfect information in databases Sally Rice



School of Computer and Information Science, Mawson Lakes Campus, Mawson Lakes 5095 South Australia. [email protected] http://www.cis.unisa.edu.au/people/spr/ John F. Roddick

School of Informatics and Engineering, Flinders University of South Australia, GPO Box 2100, Adelaide 5001, South Australia. roddick@cs. inders.edu.au http://www.cs.flinders.edu.au/People/John Roddick/

Abstract

The relational model, as proposed by Codd, contained the concept of relations as tables composed of tuples of single valued attributes taken from a domain. In most of the early literature this domain was assumed to consist of elementary items such as simple (atomic) values, de ned complex data types or arbitrary length binary objects. Subsequently the nested relational or non- rst normal form model allowing set-valued or relation-valued attributes was proposed. Within this model an attribute could take multiple values or complete relations as values. This paper presents a further extension to the relational model which allows domains to be de ned as a hierarchy (speci cally a lattice) of concepts, shows how di erent types of imperfect knowledge can be represented in attributes de ned over such domains, and demonstrates how lattices allow the accommodation of some forms of inductive queries. While our model is applied to at relations, many of the results given are applicable also to nested relations. Necessary extensions to the relational algebra and the query language SQL are also discussed.

Keywords: Inductive queries, concept lattices, imperfect information in databases, incomplete information, inconsistency, imprecision.

1 Introduction Attribute values in relational databases which are not numeric, spatial or temporal are commonly interpreted as labels with no semantic connection between  Some of the work for this paper was undertaken while on sabbatical at Flinders University of South Australia

1

them beyond that of simple ordering. For example, consider the attribute domain fMonday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, weekday, weekendg. As well as the order of the days of the week, there is an inclusion relationship between Sunday and weekend. This type of relationship can be expressed in a concept lattice, which can be represented in the database as a domain concept relation { a simple child-parent binary relation between pairs of labels from the domain [16, 18, 19]. Kedad and Metais [10] have looked at hierarchical domains. Their approach for implementing queries on the structured domain is to develop queries using special semantic operators for joins and the set operations, and translate these into classical relational algebra queries. They restrict the hierarchies under consideration to trees. This means that only one path can exist from any element in the hierarchy to the top of the structure, which simpli es algorithms for making inductive queries on the structure at the expense of limiting the type of structure that can be represented. We address the added complexity incurred by allowing a lattice-structured hierarchy in Section 4. The work by Hilderman, Hamilton and others on domain generalization graphs [7, 8, 9, 15] is also relevant. These graphs are used to describe hierarchical domains which are used in attribute-oriented generalization techniques for data mining. The hierarchies are used to generate summaries to obtain knowledge about higher level concepts. Our approach di ers by attempting to adapt the database paradigm and the query language to permit the de nition and use of hierarchical domains, and by widening the scope of data imperfection under consideration to include interval data and inconsistent information. Mannila [12] de nes inductive databases as databases that contain inductive generalisations about data. The purpose of these inductive generalisations is to improve the e ectiveness of data mining algorithms. The inductive queries discussed in this paper, although relevant to these inductive generalisations (for example [16, 17, 18, 19]), are not restricted to inductive databases, but can be used in any database which has the ability to represent hierarchical domains and store imperfect data. There are many terms used in the literature for di erent types of imperfect information. Parsons [13] o ers a classi cation of imperfect information into ve categories: uncertainty (lack of information), imprecision (lack of granularity), vagueness (the fuzziness implied by terms such as `young' or `short'), incompleteness (lack of relevant information) and inconsistency (contradictory information). We wish to consider three cases of imperfection which we present in detail in the next section of this paper. The type of imperfect information captured by the use of weekend instead of Sunday we term incomplete. Our rst example using a concept lattice built on geographic locations gives rise to imperfect information of this type. Our second example uses a lattice of temporal intervals, with intervals at a higher level of the lattice containing those at lower levels. This type of information we term imprecise. Our third example is of inconsistent information, where one piece of data contradicts another. The term uncertainty in this paper is reserved for describing the e ect of matching domain values at di erent levels of the hierarchy with the attribute value in an inductive query. Parson's uncertainty and vagueness are not addressed explicitly in this paper. Vagueness could be modelled in a hierarchical domain, at least in part, by using a combination of numeric intervals and labels on the lattice. And perhaps uncertainty could be modelled hierarchically as a restricted type 2

of inconsistency, if information with 70% certainty was interpreted as weights of 0.7 on the information and 0.3 on its negation. The rest of this paper is organised in the following way. Section 2 introduces lattice-structured domains via our three examples, and shows how the structure information can be stored in the database. Section 3 introduces inductive queries (queries on attributes de ned on lattice-structured domains) and discusses the uncertainty inherent in these queries. Truth tables for a logic containing the 3 values true, uncertain and false are given so that truth values can be calculated for combining conditions involving uncertainty [3, 6]. Section 4 de nes some functions that are needed to evaluate inductive queries, and the computational complexity of these algorithms is brie y discussed. Section 5 investigates the implications of inductive queries for relational algebra operations, and proposes some extensions to the algebra. These extensions, and necessary changes to the de nition and data update facilities of query languages for inductive queries, are discussed within the framework of an implemented query language (SQL) in Section 6. Section 8 closes the paper with a discussion of application areas and future directions for this work.

2 Lattice-structured domains Temporal or spatial information at a coarse granularity can be considered as incomplete knowledge. For example \I will arrive on the weekend" is less informative than \I will arrive on Sunday afternoon". Attribute values of di erent granularity form a domain hierarchy, whose structure can be described by a domain concept relation DS = (element, concept). Here each tuple describes a relation between two elements of the domain { the rst element belongs to the concept that the second element describes. Consider a domain D with the twelve elements fAmericas, North America, Central America, South America, USA, Canada, Mexico, Nicaragua, Brazil, Chile, northern hemisphere, southern hemisphereg. Figure 1 shows a tree-structured domain built from ten of them,

describing the geographical relationship between the elements. Each element at the lower levels of the structure has a single parent at the previous level, leading to unique paths from any node to the root of the structure to which it belongs. Allowing children to have multiple parents in a structure changes Americas

North America

USA

Central America

Canada

Mexico

Nicaragua

Brazil

South America

Chile

Figure 1: Tree-structured domain the structure from a tree to a lattice: a simple example is that shown in Figure 3

2, where one child, Brazil, has two parents. Table 1 shows a domain concept northern hemisphere

USA

Canada

Mexico

southern hemisphere

Nicaragua

Brazil

Chile

Figure 2: Lattice-structured domains relation combining the hierarchical information from Figures 1 and 2. Top level Table 1: domain concept relation element USA Canada Mexico Nicaragua Brazil Chile North America Central America South America USA Canada Mexico Nicaragua Brazil Brazil Chile

concept North America North America Central America Central America South America South America Americas Americas Americas northern hemisphere northern hemisphere northern hemisphere northern hemisphere northern hemisphere southern hemisphere southern hemisphere

concepts (Americas, northern hemisphere and southern hemisphere) are identi ed by never appearing in the element column, base level concepts (the names of the individual countries) by never appearing in the concept column. Middle level concepts appear in both columns. In our second example, a lattice has been used to describe a domain of imprecise information represented by intervals. For the lattice of time intervals shown in Figure 3, the nodes represent intervals and the edges represent (sometimes partial) inclusion relationships between those intervals. Intervals at lower 4

levels are smaller (more precise) than those at higher levels. The idea of representing interval data in lattices is not new, see [4] for example. Intervals could be represented in the database and in the domain either by de ning both the end points, or by de ning a start value and a length. The semantic information contained in the lattice is inherent in the intervals. It would not be necessary to explicitly create the domain concept relation { the necessary entries to describe the relationship between di erent elements in the domain could be built from the elements as they were de ned. (9am,5pm] (2pm,3pm]

(1pm,2pm]

(2:10pm,2:20pm]

(1:55pm,2:05pm]

Figure 3: Lattice of intervals Our third example shows how inconsistent information can be represented in a hierarchical domain. The lattice shown in Figure 4 shows sets of attributes. If the same object is described as being both red, small and square, and blue, small and square, we can represent this inconsistency by the node marked fcolourg, since its size and shape are known, but there is doubt about its colour. Once again it would not be necessary to explicitly create the domain concept relation. A lattice to describe inconsistencies in a relation could either be de ned automatically to include every possible inconsistency type (as in this example) or be built from the inconsistencies that exist in tuples in the relation as they are de ned. Representing the inconsistent data itself is another issue. One method would be to include a set containing the di erent values (e.g. fblue, redg) as the attribute value. Retaining information about the origin of each con icting value would also be useful. Alagar, Sadri and Said [1] discuss the use of information source vectors for maintaining source information at the individual attribute level. Retaining information about the origin of con icting values held in a set within an attribute would require the vectors to be attached to each element in the set.

3 Uncertainty in inductive queries For the domain shown in Figure 2, the semantic interpretation appears to be that Canada, USA, Nicaragua and Mexico are wholly within the northern hemisphere, Chile is wholly within the southern hemisphere, and Brazil is split between both hemispheres. However, the way the domain is used may change the meaning, introducing uncertainty into relations using this domain. Consider relation R = (name, location) shown in Table 2, where location is de ned on the latticestructured domain described in Table 1. If the locations refer to a speci c, if unknown, address, then this relation says that John lives somewhere in Brazil, and almost certainly either in the northern hemisphere or the southern hemisphere, 5

{colour, size, shape}

{colour, shape}

{colour, size}

{colour}

{size, shape}

{shape} {size}

{}

Figure 4: Lattice of attribute sets Table 2: Relation R using a lattice-structured domain name John Susan Peter Betty

location Brazil Chile southern hemisphere Nicaragua

but not both. This introduces a degree of uncertainty into whether or not conditions on these elements hold. Given any condition C on an attribute de ned on lattice-structured domain D, that condition can be categorised as known to be true, uncertain (may or may not be true), or known to be false. For example, the condition location = 'southern hemisphere' is known true for Susan and Peter, known false for Betty, and uncertain for John. These categories create three mutually exclusive sets T, U and F for the elements e 2 D, where T = fejC trueg, U = fejC uncertaing and F = fejC falseg. In addition to these sets, we can de ne another three (useful) groupings by combining them in pairs. This leads to the de nition of the six inductive conditions shown in Table 3. The symbol ? indicates that condition C may be true (i.e. true or uncertain), and the symbol ! that the value of C is not uncertain (i.e. true or false). The inductive conditions degenerate as expected where no uncertainty exists. In the case of unstructured domains U = , so ?C and :(?C) are equivalent to C and :C respectively, and !C is always true and :(!C) always false. As well as the uncertainty introduced in relation R by saying that John lives somewhere in Brazil, there is another type of uncertainty in saying that Peter lives in the southern hemisphere. This means that Peter might live in either Chile or Brazil, but not both. One type of uncertainty is introduced because of the use of an element in the tuple at a higher level than that in the query (type 1 uncertainty), and another because of the use of an element in the tuple that is a member of more than one higher level concept (type 2 uncertainty { this 6

Table 3: Inductive conditions condition C :C :(!C) ?C :(?C) !C

meaning included elements C true T C false F C uncertain U C true or uncertain T [U C uncertain or false U [F C true or false T [F

does not arise if structures are limited to trees). Consider the conditions C1 (location = 'Brazil') and C2 (location = 'northern hemisphere'), where C1 leads to type 1 uncertainty, and C2 to type 2. Table 4 shows the names in sets T, U and F for each condition. Only T Table 4: Sets T, U and F for conditions C1 and C2 C1 T fJohng U fPeterg F fSusan, Bettyg

C2 fBettyg fJohng fSusan, Peterg

and U need to be calculated by examining the concept lattice de ned by relation DS, since F = ; T ; U. Once T and U have been determined, the type of uncertainty is no longer of importance. Consider a simple condition of the type a = e, where a is an attribute de ned on a structured domain, e is an element from that domain, R is the relation in which a is de ned, and r is a tuple from R. For domains with type 2 uncertainty, T = frjr 2 R ^ (r:a = e _ r:a must be a descendent of e)g and U = frjr 2 R ^ (r:a is an ancestor of e _ r:a may be a descendent of e)g. For domains with only type 1 uncertainty, or if e has no descendent concepts with multiple parents, the algorithm needed to construct sets T and U can be simpli ed. Structurally, the di erence between type 1 and type 2 uncertainty for domain value e is whether any descendents of the node e in the concept lattice have a path to the top of the lattice which does not go through e. Algorithms to construct these sets are de ned in the next Section. In the case of compound conditions such as C ^ C or C _ C , the set to which a given tuple belongs can be determined by considering the 3-valued logic truth tables shown in Table 5. Here t, u and f represent the truth values true, uncertain and false respectively. It should be noted that in the special case of disjunction when C _ C includes all possible values for an uncertain attribute for a tuple t that the table is not correct. Although t will be in U for both C and C , it will be in T for their disjunction C _ C [11]. For example, for the a

a

b

a

b

b

a

b

a

7

b

conditions C (location = 'Brazil') and C (location = 'Chile'), C and C are both uncertain for Peter, but C _ C is true (assuming domain D is a closed world). a

b

b

a

a

b

Table 5: 3-valued logic truth tables C :C t f u u f t

C _C t u f a

b

t t t t

u t u u

f t u f

C ^C t u f a

b

t t u f

u u u f

f f f f

4 Determining truth values for inductive queries In order to construct sets T and U, we need to determine truth values associated with elements from the structured domain, which means that we rst need to identify the higher and lower level concepts for particular elements. Let e be an elemental value from a lattice-structured domain D whose structure is de ned by domain concept relation DS. H(e) is the set of higher-level concepts of e, i.e. H(e) = fcjc 2 D ^ c is an ancestor of eg, and L(e) is the set of lowerlevel concepts of e, i.e. L(e) = fcjc 2 D ^ c is a descendent of eg. If e is a base concept, then L(e) = , and if e is a top-level concept, then H(e) = . If D is an unstructured domain, then L(e) = H(e) =  for all e. For our example, H(South America) = fAmericasg, L(South America) = fBrazil, Chileg, H(Brazil) = fSouth America, Americas, southern hemisphere, northern hemisphereg and L(Brazil) = . Another useful hierarchical function A(e1 ; e2 ) is de ned as the set of all elements from D which are ancestors of e2 which are not related to e1 . If fe2 je2 2 L(e1 ) ^ A(e1 ; e2 ) 6= g is not empty, then type 2 uncertainty exists for a query using e1 . The function ancestor is used to formally de ne the set-valued functions H, L and A, and thus determine the sets T and U for a given domain element value e. These functions are de ned below. ancestor(x; y) = (x; y) 2 DS _ ((x; z) 2 DS ^ ancestor(z; y)) H(x) = fyj ancestor(x; y)g L(x) = fyj ancestor(y; x)g A(x; y) = fz jz 2 H(y) ^ z 62 fxg [ H(x) [ L(x)g T = fxjx = e _ (x 2 L(e) ^ A(x; e) = )g U = fxjx 2 H(e) _ (x 2 L(e) ^ A(x; e) 6= )g Evaluating functions H, L and A involves the recursive function ancestor. Determining the sets T and U involves determining A(e; x) for all x 2 L(e). To determine T, U and F for a query with type 2 uncertainty involving domain element e: 1. Evaluate H(e) and L(e). 2. Evaluate A(e; x)8x 2 L(e). 8

3. Calculate T = feg [ fxjx 2 L(e) ^ A(e; x) = g. 4. Calculate U = H(e) [ fxjx 2 L(e) ^ A(e; x) 6= g. 5. Calculate F = ; T ; U. For a domain with only type 1 uncertainty, A(e; x) =  for all x 2 L(e). In this case the process can be simpli ed to: 1. Evaluate H(e) and L(e). 2. Calculate T = feg [ L(e). 3. Calculate U = H(e). 4. Calculate F = ; T ; U. The complexity of these algorithms can be related to the number of elements n in the domain. Consider the following non-recursive algorithms for functions H(e), L(e) and A(e; x):

Algorithm calc H to calculate H(e)

BEGIN Initialise H to  Initialise elements to feg WHILE (elements 6= ) Initialise concepts to  FOR (each element s in elements) FOR (each row r in DS) IF (r.element = s AND r:concept 2= H) Add r.concept to H Add r.concept to concepts ENDIF ENDFOR ENDFOR elements = concepts ENDWHILE END

Algorithm calc L to calculate L(e)

BEGIN Initialise L to  Initialise concepts to feg WHILE (concepts 6= ) Initialise elements to  FOR (each concept s in concepts) FOR (each row r in DS) IF (r.concept = s AND r:element 2= L) Add r.element to L Add r.element to elements ENDIF 9

ENDFOR ENDFOR concepts = elements ENDWHILE END

Algorithm calc A to calculate A(e; x)

BEGIN Initialise A to  Call calc H to calculate Hx = H(x) Calculate related to e = feg [ H(e) [ L(e) FOR (each element h in Hx) Set inA to TRUE FOR (each element r in related to e) IF (h = r) Set inA to FALSE ENDIF ENDFOR IF (inA) add h to A ENDIF ENDFOR END The complexity of calc H is controlled by the number of levels above e in the concept lattice (the WHILE loop), the size of the set elements at each level (the outer FOR loop) and the number of rows m in DS (the inner FOR loop). The number of levels and the size of elements are both at most n, and in fact each element can only appear in elements once. This means that the inner FOR loop is executed at most n times, so the complexity of calc H is O(nm). The upper limit on m is n2 , when every element is related to every other element, so calc H is at most O(n3 ). (Strictly speaking, this is the upper limit for a graph { a lattice must be less completely connected than this.) Since an upper limit on the number of elements in Hx and related to e is also n, the complexity of calc A is dominated by the calculation of Hx, i.e. it is also O(n3 ). From the similarity of the algorithms, it is clear that the complexity of calc L is the same as that of calc H. Hence determining the sets T , U and F is at most O(n3) if type 2 uncertainty is guaranteed not to be involved, and O(n4) otherwise (since calc A is called for every element in L(e)).

5 Relational algebra operations Introducing lattice-structured domains necessitates some extensions to the relational languages underpinning the relational model. Consider the ve basic operations of the relational algebra:  (selection),  (projection),  (cartesian product), [ (set union) and ; (set di erence). Working with these domains will not a ect  and , but will a ect , [ and ;. All other operations, such as \ (set intersection) and ./ (join), can be constructed from the basic operations. 10

For simplicity, the relations in this section all contain exactly one attribute de ned on a lattice-structured domain. The examples use the relations R and S de ned in Tables 2 and 7 (primary key name), and relation Q de ned in Table 6 (primary key (location, language). The attribute location in Q is de ned on a at subset of domain D which has been restricted to countries. As is usual for SQL, results of queries are allowed to include duplicate tuples. The changes required for  are due to the e ect of structured domains on Table 6: relation Q using a lattice-structured domain location Canada Canada USA Mexico Nicaragua Brazil Chile

language English French English Spanish Spanish Portugese Spanish

Table 7: Relation S using a lattice-structured domain name location Susan South America Peter Brazil conditions as previously described. This can be indicated in the algebra by using the inductive condition notation developed in Section 3. For example, ? 2 (R) would return all tuples from R where C2 is true or uncertain, i.e. f(Betty, Nicaragua), (John, Brazil)g. Join operations where join-columns are de ned on lattice-structured domains can be expressed using inductive conditions in a similar way. Consider relation Q which shows the ocial languages for all six countries. Table 8 shows some joins involving relations Q and R. Note that the last two joins require both conditions to be expressed using the ? operator so that southern hemisphere will match with Brazil. R ./?( location = location) Q would return all tuples from R  Q where the location values are or may be equal, whereas R ./ location = location Q only includes the tuples where the location values are known to be equal. Inductive outer joins can be de ned which will include the unmatched tuples determined by the appropriate inner join. See Table 9 for the results of the inductive left outer join of R with Q for two di erent join conditions. For the rst of these, the unmatched tuples are those from R that are not known for certain to match locations in Q, and for the second, the unmatched tuples are those from R which de nitely are not in the northern hemisphere. C

R:

Q:

R:

11

Q:

Table 8: Joins involving relations R and Q name R.location Q.location language R ./ location = location Q Betty Nicaragua Nicaragua Spanish John Brazil Brazil Portugese Susan Chile Chile Spanish R ./?( location = location) Q Betty Nicaragua Nicaragua Spanish John Brazil Brazil Portugese Peter southern hemisphere Brazil Portugese Peter southern hemisphere Chile Spanish Susan Chile Chile Spanish R ./?( location = location)^?( location ='Brazil') Q John Brazil Brazil Portugese Peter southern hemisphere Brazil Portugese R ./?( location = location)^?( location ='northern hemisphere') Q Betty Nicaragua Nicaragua Spanish John Brazil Brazil Portugese R:

Q:

R:

R:

R:

Q:

Q:

Q:

R:

R:

For the set operations, uncertainty can be introduced when relations with di erent levels of re nement for the same information are combined. Table 10 shows R [ S, R \ S and R ; S. Without further knowledge, it is reasonable to choose the information with the nest granularity when combining these relations, but sometimes this involves a loss of information. Susan has di erent locations in each relation, Chile and South America, only one of which can appear in R [ S and R \ S. Chile is the most speci c, so it is chosen. Peter on the other hand has the locations southern hemisphere and Brazil. Choosing either location causes the loss of some information, either the hemisphere, or the country. Without allowing attributes to take on multiple values, the best we can do is to make a decision about which information is the most useful. In cases such as this where one of the values is a descendent of the other in the concept lattice, the best method is likely to be to choose the descendent, and include or exclude the tuple depending on the set operation being performed. Where neither of two di erent values representing the same information is the descendent of the other, they represent inconsistent information.

12

Table 9: Inductive outer joins of R and Q name R.location

Q.location language R:location = Q:location Betty Nicaragua Nicaragua Spanish John Brazil Brazil Portugese Peter southern hemisphere NULL NULL Susan Chile Chile Spanish ?(R:location = Q:location)^?(R:location = 'northern hemisphere') Betty Nicaragua Nicaragua Spanish Susan Chile NULL NULL Peter southern hemisphere NULL NULL John Brazil Brazil Portugese Table 10: Results of set operations using R and S R[S R\S R;S name location name location name location John Brazil Susan Chile John Brazil Susan Chile Peter Brazil Betty Nicaragua Peter Brazil Betty Nicaragua

6 Query Language Implications Query languages such as SQL include statements for data de nition and data update as well as querying the database. In this section we use SQL to demonstrate how lattice-structured domains can be included in query languages. First we suggest some clauses to be added to facilitate de nition of concept lattices and an extension to the WHERE clause for inductive queries, then we investigate how the syntax of multi-table queries using the JOIN keyword might be adapted. To accommodate lattice-structured domains, the CREATE DOMAIN and ALTER DOMAIN commands need to be adapted so that the lattice structure can be de ned as well as the elements of the domain. This could be achieved by adding a DEFINE HIERARCHY clause such as the following to both commands: [DEFINE HIERARCHY [[hierarchy_name] {element | concept_definition [, ...] } ], ...

13

where concept de nition is a concept name and a set of element names and/or other concept de nitions: concept_name = {element | concept_definition [, ...] }

For the example in Table 1, a suitable CREATE DOMAIN command would be CREATE DOMAIN location_domain AS CHAR(20) DEFINE HIERARCHY continent {'Americas' = {'North America' = {'USA', 'Canada'}, 'Central America' = {'Mexico', 'Nicaragua'}, 'South America' = {'Brazil', 'Chile'} }, hemispheres {'northern hemisphere' = {'USA', 'Canada', 'Mexico', 'Nicaragua', 'Brazil'}, 'southern hemisphere' = {'Brazil', 'Chile'} };

Clauses are also required in ALTER DOMAIN to add, drop or change hierarchies for a domain, such as the DROP HIERARCHY, ADD HIERARCHY and CHANGE HIERARCHY clauses de ned below: [DROP HIERARCHY hierarchy_name] [ADD HIERARCHY | CHANGE HIERARCHY [hierarchy_name {element | concept_definition [, ...] } ], ...]

For example, to drop the hemispheres hierarchy from the location domain domain: ALTER DOMAIN location_domain DROP HIERARCHY hemispheres;

The WHERE clause needs to be adapted to incorporate inductive queries using the ? and ! symbols. The keyword MAYBE can be used as a condition modi er to represent ? with no confusion, since ? and : are associative (i.e. :(?C) = ?(:C)). Since !C =!(:C) care needs to be taken with choosing a keyword for !. TRUEORFALSE, although clumsy, has the advantage that the equivalence between TRUEORFALSE C and TRUEORFALSE NOT C is clear, and NOT TRUEORFALSE C can be used for :(!C). UNCERTAIN C could be used instead of NOT TRUEORFALSE C, but this might lead to semantic confusion between UNCERTAIN and MAYBE. Using just MAYBE and TRUEORFALSE leads to WHERE clauses of the type shown below. ... ... ... ...

WHERE WHERE WHERE WHERE

NOT TRUEORFALSE location = 'northern hemisphere' ... MAYBE location = 'northern hemisphere' ... MAYBE NOT location = 'northern hemisphere' ... TRUEORFALSE location = 'northern hemisphere' ...

The keyword JOIN in SQL incorporates a hidden condition based on equality of the named join columns in both relations. The keyword MAYBE can be used to indicate an inductive join, where rows with possibly equal attributes should be included. Its omission means that only exact matches will be returned. It is unlikely that a TRUEORFALSE JOIN will be very useful: joins requiring conditions of this type are more clearly expressed using a WHERE clause. The following examples show the commands for the joins in Table 8. 14

SELECT * SELECT * SELECT * WHERE SELECT * WHERE

FROM R JOIN Q ON location; FROM R MAYBE JOIN Q ON location; FROM R MAYBE JOIN Q ON location MAYBE R.location = 'Brazil'; FROM R MAYBE JOIN Q ON location MAYBE R.location = 'northern hemisphere';

With the NATURAL JOIN, only one copy of the join column is included in the result. The least speci c value should be chosen when the join column value is di erent in each table so that knowledge of the uncertainty is not lost. The results for the following command are shown in Table 11. SELECT * FROM R MAYBE NATURAL JOIN Q;

Table 11: Inductive natural join of R and Q name Betty John Peter Peter Susan

location Nicaragua Brazil southern hemisphere southern hemisphere Chile

language Spanish Portugese Portugese Spanish Spanish

Inductive outer joins using the JOIN keyword can be expressed as shown below. SELECT * FROM R MAYBE LEFT JOIN Q ON location; SELECT * FROM R MAYBE RIGHT JOIN Q ON location; SELECT * FROM R MAYBE FULL JOIN Q ON location;

The rst of these has the same result as the equivalent inner join, since this has no excluded tuples from R, and similarly the last has the same results as the second (shown in Table 12). The SQL commands for the outer joins in Table 9 are SELECT * FROM R LEFT JOIN Q ON location; (SELECT * FROM R MAYBE JOIN Q ON location WHERE MAYBE R.location = 'northern hemisphere') UNION (SELECT name, location, NULL, NULL FROM R WHERE R.location NOT = 'northern hemisphere' AND R.location NOT IN (SELECT location FROM Q);

The second of these has been written without using LEFT JOIN because the WHERE clause would eliminate the unmatched southern hemisphere tuples from R.

15

Table 12: Result of right and full outer joins of R with Q name Betty John Peter Peter Susan NULL NULL NULL NULL

R.location Nicaragua Brazil southern hemisphere southern hemisphere Chile NULL NULL NULL NULL

Q.location Nicaragua Brazil Brazil Chile Chile Canada Canada USA Mexico

language Spanish Portugese Portugese Spanish Spanish English French English Spanish

7 Application Areas and Further Research Hierarchical domains are useful whenever there is a need to combine data from relations with di erent schemas, whether that di erence arises from schema evolution over time in the same database or from relations from di erent sources. One important area of application is likely to be queries of web databases. Concept hierarchies used for web-based queries are discussed by Davulcu et al in [5]. Incomplete and imprecise information of the type that can be represented in concept and interval lattices can arise in many ways, such as by summarising or sampling data, from space restrictions imposed by mobile equipment or main memory databases, by needing to infer values for points not explicitly stored, and performance considerations in multimedia databases. The relational operators ? and ! introduced in this paper can be used to express the six inductive conditions that arise from the inclusion of a third truth value `uncertain'. Algorithms to evaluate these queries are at worst O(n4 ) for simple conditions. Where conditions are combined with disjunction, an anomaly can arise when the combined conditions cover all possibilities for elements from a structured domain: even though the truth value for each individual condition is uncertain for a tuple t, the value for their disjunction is true. The a ect of this anomaly on inductive queries of lattice-structured domains needs to be investigated. Allowing domain hierarchies to be lattice-structured rather than simple trees does not add much complexity to hierarchical domains: their de nition and representation in the database is no more complex, and the additional (type 2) uncertainty introduced does not greatly increase the complexity of the algorithm for determining the results of inductive queries. Further work needs to be done to investigate the use of lattices for imprecise and inconsistent information, and how best to represent and reason with inconsistent information in databases. It is also important to test the behaviour of the hierarchical algorithms on realistic examples, and improve them where necessary. 16

References [1] V.S. Alagar, F. Sadri, and J.N. Said. Semantics of an extended relational model for managing uncertain information. In Fourth International Conference on Information and Knowledge Management, 1995. [2] P. Besnard and T.H. Schaub. Circumscribing inconsistency. In Fifteenth International Joint Conference on Arti cial Intelligence, 1997. [3] E.F. Codd. Extending the database relational model to capture more meaning. ACM Trans. On Database Systems, 4(4):397{434, 1979. [4] D. Corbett and R. Woodbury. Uni cation over constraints in conceptual graphs. In W. Tepfenhat and W. Cyre, editors, Seventh International Conference on Conceptual Structures. Springer, 1999. [5] H. Davulcu, J. Freire, M. Kifer, and I.V. Ramakrishnan. A layered architecture for querying dynamic web content. In ACM SIGMOD, pages 491{502, Philadelphia, USA, 1999. ACM. [6] M. Fitting. A kripke-kleene semantics for logic programs. The Journal of Logic Programming, 2(4):295{312, 1985. [7] H.J. Hamilton, R.J. Hilderman, and N. Cercone. Attribute-oriented induction using domain generalization graphs. In Proc. 8th IEEE International Conference on Tools with Arti cial Intelligence, pages 246{253. IEEE Computer Society, 1996. [8] R.J. Hilderman, H.J. Hamilton, and N. Cercone. Data mining in large databases using domain generalization graphs. Journal of Intelligent Information Systems, 13(3):195{234, 1999. [9] R.J. Hilderman, H.J. Hamilton, R.J. Kowalchuk, and N. Cercone. Parallel knowledge discovery using domain generalization graphs. In J. Komorowski and J. Zytkow, editors, Proc. First European Symposium on the Principles of Data Mining and Knowledge Discovery, PKDD 97, pages 25{35, Paris, 1997. Springer. [10] Z. Kedad and E. Metais. Dealing with semantic heterogeneity during data integration. In Jacky Akoka, Mokrane Bouzeghoub, Isabelle ComynWattiau, and Elisabeth Metais, editors, 18th International Conference on Conceptual Modeling, volume 1, pages 325{339, Paris, 1999. Springer. [11] D. Maier. Chapter 12: Null values, partial information and database semantics. In The Theory of Relational Databases, pages 371{438. Computer Science Press, 1983. [12] H. Mannila. Inductive databases and condensed representations for data mining. In Jan Maluszynski, editor, International Logic Programming Symposium. MIT Press, 1997. [13] S. Parsons. Current approaches to handling imperfect information in data and knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 8(3):353{372, 1996. 17

[14] D. Perlis. Sources of, and exploiting, inconsistency: preliminary report. Applied Non-Classical Logics, 7(1):13{24, 1997. [15] D.J. Randall, H.J. Hamilton, and R.J. Hilderman. Generalization for calendar attributes using domain generalization graphs. In L. Khatib and R. Morris, editors, Proc. 5th International Workshop on Temporal Representation and Reasoning, TIME 98, pages 177{184. IEEE Computer Society, 1998. [16] J.F. Roddick. A Model for Temporal Inductive Inference and Schema Evolution in Relational Database Systems. Doctor of philosophy, La Trobe University, 1994. [17] J.F. Roddick. The use of overcomplete logics in summary data management. In Eighth Australasian Conference on Information Systems, pages 288{298, 1997. [18] J.F. Roddick, N.G. Craske, and T.J. Richards. Handling discovered structure in database systems. IEEE Transactions on Knowledge and Data Engineering, 8(2):227{240, 1996. [19] J.F. Roddick and S.P. Rice. Towards inductive queries. In Ninth Australasian Conference on Information Systems, volume 2, pages 534{542, Sydney, Australia, 1998. University of New South Wales. [20] P. Smets. Probability, possibility and belief: which and where? In Dov M. Gabbay and Philippe Smets, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems: Quanti ed Representation of Uncertainty and Imprecision, volume 1, pages 1{24. Kluwer Academic Pub-

lishers, 1998. [21] C. Zaniolo, S. Ceri, C. Faloutsos, R.T. Snodgrass, V.S. Subrahmanian, and R. Zicari. Part v: Uncertainty in databases and knowledge bases. In Advanced Database Systems, pages 315{409. Morgan Kaufmann Publishers Inc., 1997.

18

Suggest Documents