On an Indexing Mechanism for Imprecise Numerical Data for Fuzzy Object-Relational Database Management Systems

Carlos D. Barranco

School of Engineering, Pablo de Olavide University, Utrera Rd. Km. 1, 41013 Sevilla (Spain)

Jesús R. Campaña

Dept. of Computer Science and Artificial Intelligence, University of Granada, Daniel Saucedo Aranda s/n, 18071 Granada (Spain)

Juan M. Medina

Dept. of Computer Science and Artificial Intelligence, University of Granada, Daniel Saucedo Aranda s/n, 18071 Granada (Spain)

Abstract

This paper proposes an indexing mechanism for imprecise numerical data, that is, fuzzy data defined on an ordered numerical domain. The proposal is based on b+trees, a classical indexing mechanism for crisp numerical data that is included in most recent DBMS. This makes the proposed mechanism more suitable than others for integration in a Fuzzy Object-Relational DBMS, where it will improve the performance of queries on imprecise data.

Keywords: Fuzzy data indexing, Object-Relational Fuzzy Databases.

1 Introduction

The database world has taken advantage of fuzzy set theory, using it as a way to represent, store and manage imprecise data. As a result of this trend, there is a significant number of database models and implementations of Database Management Systems (DBMS) [3, 5, 6, 7, 8, 10] in which imprecise data can be represented and managed.

Nowadays, the emergence of Fuzzy DBMS (FDBMS) means the availability of a new tool for developing novel applications which process very large imprecise data sets using fuzzy methods. As these novel applications prove their potential as prototypes, they are integrated in real-world environments. In this kind of environment, high performance, scalability and availability are required of applications, and therefore of each of their components, particularly the underlying FDBMS. Most existing FDBMS are research prototypes, and they do not yet offer the high performance, scalability and availability required for real-world applications.

A solution to this problem can come from a recent proposal of a Fuzzy Object-Relational DBMS (FORDBMS) model [4]. This model proposes to take advantage of the extension mechanisms of recent commercial ORDBMS in order to build user-defined datatypes that seamlessly represent, store and manage imprecise data by modeling it as native database objects. As a result of this architecture, the resulting FORDBMS combines the high performance, scalability and availability of a commercial DBMS with the imprecise data handling power of an FDBMS.

Indexing mechanisms are the key to high performance in databases. Although a great research effort has been spent on database models, far less has been spent on indexing mechanisms for efficient access to imprecise data. Most existing proposals of indexing mechanisms for imprecise data [2, 9] require specific data structures and algorithms which are not suitable for incorporation into an ORDBMS through its extension mechanisms.

This paper proposes an indexing mechanism for numerical imprecise data based on [1]. The data structure and search algorithm of the proposed indexing mechanism are based on b+trees, a classical indexing structure for crisp numerical data which is available in most commercial DBMS. As the basis of the proposed indexing mechanism is efficiently implemented in recent DBMS, including it in a FORDBMS will be possible and less complicated than with other proposals. The inclusion of the proposed indexing mechanism in a FORDBMS will improve its performance when processing queries on imprecise numerical data, which directly improves the performance of applications using the imprecise data management services offered by the FORDBMS.

This paper is organized as follows. Section 2 briefly introduces the FORDBMS for which the proposed indexing mechanism is initially designed. The ordered atomic fuzzy datatype is presented in Section 3. Section 4 describes the proposed indexing mechanism. An example of query processing using the proposed indexing mechanism is shown in Section 5. Finally, Section 6 presents some concluding remarks and future work.

2 Fuzzy object-relational DBMS

The FORDBMS on which the proposed indexing mechanism will be implemented is designed to represent and handle a wide variety of fuzzy datatypes, which make it easy to model any sort of fuzzy data. The supported fuzzy datatypes are briefly described as follows:

• Atomic fuzzy types (AFT): this group includes datatypes for representing possibility distributions over ordered (OAFT datatype) and non-ordered (NOAFT datatype) domains.
• Fuzzy collections (FC): this group is composed of datatypes for representing fuzzy collections as fuzzy sets of objects with conjunctive (CFC datatype) or disjunctive (DFC datatype) semantics.
• Fuzzy objects (FO): this datatype is the framework for defining user classes whose attribute types can be crisp or fuzzy, and where each attribute is associated with a degree that measures its importance in object comparison.

Figure 1: Trapezoidal possibility distribution

A wide variety of operators is associated with all the previous datatypes to ease their creation, storage and manipulation, and the definition of flexible conditions on them. This FORDBMS is being implemented on a commercial crisp ORDBMS by taking advantage of its extension features, such as user-defined datatypes, operators and indexing mechanisms. This use of extended datatypes, operators and indexing mechanisms to add fuzzy data management capabilities to classical databases conforms to the SQL:1999 standard. Thanks to this standard compliance, a fuzzy query can be expressed, by means of the implemented user extensions, as a fully SQL-compliant sentence, which avoids the incompatible proprietary SQL extensions used in other fuzzy DBMS implementations.

3 Ordered atomic fuzzy data

This paper focuses on describing an indexing mechanism and structure for atomic data of OAFT type. This type provides the data infrastructure and, by means of its associated procedures, the behavior for representing, storing, querying and managing atomic fuzzy data represented as a possibility distribution defined on an ordered domain. This kind of data is modeled in the described FORDBMS by means of a trapezoidal possibility distribution, similar to the one shown in Figure 1, and denoted [a, b, c, d]. Beyond general trapezoidal possibility distributions, and by means of them, the OAFT datatype is able to represent and handle a variety of particular imprecise values, such as:

• Intervals: when the trapezoidal possibility distribution is defined as [a, a, b, b], where a and b are the lower and upper bounds of the interval.
• Approximate values: when the trapezoidal possibility distribution is defined as [n − m, n, n, n + m], where n is the value which is approximated, and m is the margin defining the looseness of the approximation.
• Precise number: a particular case in which the possibility distribution is defined as [n, n, n, n], where n is the precise number.
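The particular cases listed above can be written as small constructors over a single trapezoid type. The following Python sketch is only an illustration: the names `Trapezoid`, `interval`, `approximate` and `precise` are hypothetical and do not come from the FORDBMS, and the membership function assumes the usual piecewise-linear reading of [a, b, c, d].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal possibility distribution [a, b, c, d] with a <= b <= c <= d."""
    a: float
    b: float
    c: float
    d: float

    def __post_init__(self):
        if not (self.a <= self.b <= self.c <= self.d):
            raise ValueError("expected a <= b <= c <= d")

    def membership(self, x: float) -> float:
        """Possibility degree of x: 1 on the core [b, c], linear on both slopes."""
        if self.b <= x <= self.c:
            return 1.0
        if self.a < x < self.b:
            return (x - self.a) / (self.b - self.a)
        if self.c < x < self.d:
            return (self.d - x) / (self.d - self.c)
        return 0.0


def interval(lower: float, upper: float) -> Trapezoid:
    """Interval [a, a, b, b]."""
    return Trapezoid(lower, lower, upper, upper)


def approximate(n: float, m: float) -> Trapezoid:
    """Approximate value [n - m, n, n, n + m] with margin m."""
    return Trapezoid(n - m, n, n, n + m)


def precise(n: float) -> Trapezoid:
    """Precise number [n, n, n, n]."""
    return Trapezoid(n, n, n, n)
```

With these helpers, `approximate(10, 2)` is the trapezoid [8, 10, 10, 12], and its membership degree at 9 is halfway up the rising slope.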

4 Index for OAFT

Fuzzy queries very commonly contain conditions based on the fuzzy equality comparator (FEQ) defined in equation 1.

$$\mathrm{FEQ}(A, B) = \sup_{d \in U} \min(A(d), B(d)) \tag{1}$$

where A and B are two OAFT values (i.e. fuzzy numbers defined as trapezoidal possibility distributions) and U is the underlying domain on which A and B are defined. The FEQ comparator returns a value in [0, 1] representing the possibility degree to which A is equal to B. In the rest of the paper, this degree is called the fulfilment degree of the condition.

Conditions of this kind are expressed by setting a minimum (i.e. a threshold) for the fulfilment degree of each FEQ condition. In general, they take the form shown in equation 2.

$$\mathrm{FEQ}(A, B) > T \tag{2}$$

where A and B are two fuzzy numbers and T is the threshold defined by the user.

When conditions of this kind are applied to select data from a fuzzy database, A and/or B are OAFT attributes. When only one of them is an OAFT attribute, the other must be a constant value representing a fuzzy number defined as a trapezoidal possibility distribution. An example of a query using this kind of condition is the following:

SELECT id,att FROM table WHERE FEQ(att,[a,b,c,d])>T

where id is the primary key of the table table, att is an OAFT attribute, and [a,b,c,d] is a constant which represents a fuzzy number defined like the trapezoidal possibility distribution shown in Figure 1.

4.1 Preselection mechanism

In the least restrictive case, when the threshold T is set to zero, the fulfilment degree must be greater than zero for the FEQ condition to be satisfied to some degree. In fact, regardless of the threshold, a fulfilment degree equal to zero guarantees that the FEQ condition is not satisfied. An indexing mechanism must therefore preselect the rows which can satisfy the FEQ condition, filtering out those which certainly do not. To accomplish that goal, the mechanism is based on the rule shown in equation 5, which is founded on the remarks shown in equations 3 and 4.

$$\forall d \notin \mathrm{supp}(A) \Rightarrow A(d) = 0 \Rightarrow \min(A(d), B(d)) = 0 \tag{3}$$

$$\forall d \notin \mathrm{supp}(B) \Rightarrow B(d) = 0 \Rightarrow \min(A(d), B(d)) = 0 \tag{4}$$

$$\mathrm{FEQ}(A, B) > 0 \Leftrightarrow \exists d,\ d \in \mathrm{supp}(A),\ d \in \mathrm{supp}(B) \tag{5}$$

Equation 5 can be reformulated as shown in equation 6.

$$\mathrm{FEQ}(A, B) > 0 \Leftrightarrow \mathrm{supp}(A) \cap \mathrm{supp}(B) \neq \emptyset \tag{6}$$

Taking into account that A and B are two fuzzy numbers defined respectively by the trapezoidal possibility distributions $[\alpha_A, \beta_A, \gamma_A, \delta_A]$ and $[\alpha_B, \beta_B, \gamma_B, \delta_B]$, it is clear that the supports of A and B correspond respectively to the open intervals $(\alpha_A, \delta_A)$ and $(\alpha_B, \delta_B)$. In this case, the condition shown in equation 6 can be reformulated as shown in equation 7.

$$\mathrm{FEQ}(A, B) > 0 \Leftrightarrow (\alpha_A, \delta_A) \cap (\alpha_B, \delta_B) \neq \emptyset \Leftrightarrow \delta_A > \alpha_B \wedge \alpha_A < \delta_B \tag{7}$$

4.2 Processing queries using the preselection mechanism

The rule shown in equation 7 can be used, together with some kind of index structure, to obtain a preselection row set of query result candidates, each of which satisfies the condition to some degree. Once the preselection row set has been obtained, the FEQ condition is applied to each row of this set to discard those rows whose fulfilment degree is less than or equal to the threshold in the query. This procedure for calculating the query result set from the preselection set is shown in Figure 2, in which the rows that do not satisfy the FEQ condition to any degree are shown in white, the rows that satisfy it with a fulfilment degree greater than zero but not over the threshold are shown in grey, and the rows with a fulfilment degree over the threshold are shown in black.

Figure 2: Query processing by using the preselection mechanism

The described procedure can save a great number of disk reads and some computing time. Disk reads are reduced because the preselection row set is calculated using an index structure, which avoids a sequential disk reading process and therefore the disk reads of rows that do not satisfy the condition to any degree. Computing time is saved because the fulfilment degree of the condition is only calculated for the rows in the preselection set.

4.3 Increasing preselection accuracy

The accuracy of the preselection mechanism described earlier is very high when the threshold of the condition is near zero. In this case, assuming the data are not unusually distributed, the number of rows in the preselection row set that are discarded when checking whether their fulfilment degree is greater than the threshold is relatively low. The problem arises when the condition threshold is far from zero. In this case, a significant number of rows in the preselection row set must be discarded because their fulfilment degree is below the threshold. To solve this problem, the rule shown in equation 7 can be given a new formulation. This new formulation, defined in equation 8, is based on the use of alpha-cuts, and it brings the preselection row set much closer to the query result set.

$$\mathrm{FEQ}(A, B) \geq T \Leftrightarrow A_T \cap B_T \neq \emptyset \Leftrightarrow \delta(A_T) \geq \alpha(B_T) \wedge \alpha(A_T) \leq \delta(B_T) \tag{8}$$

where $A_T$ and $B_T$ are the T-cuts of A and B, and $\alpha(X)$ and $\delta(X)$, in which X is a crisp set, are defined as shown in equations 9 and 10 respectively.

$$\alpha(X) = \inf(X) \tag{9}$$

$$\delta(X) = \sup(X) \tag{10}$$

4.4 Determining the preselection row set

As remarked earlier, the preselection mechanism for obtaining query results is useful only if it can reduce disk reads and computing time compared with a sequential read of the table rows. The key to accomplishing this goal is the process by which the preselection row set is calculated.

Let us consider the scenario in which the rows in the result set must satisfy the condition $\mathrm{FEQ}(A, B) > T$, where A is an OAFT attribute, defined as a trapezoidal possibility distribution $[\alpha_A, \beta_A, \gamma_A, \delta_A]$, and B is a constant possibility distribution $[\alpha_B, \beta_B, \gamma_B, \delta_B]$.

In the basic case, T = 0, the condition shown in equation 7 is used to obtain the preselection row set. This condition can be divided into two basic conditions, $\delta_A > \alpha_B$ and $\alpha_A < \delta_B$. Commercial DBMS indexing techniques, i.e. b+tree indexes, are very efficient at calculating which rows have an attribute value within a range of values. Using this kind of indexing technique, we propose a b+tree index defined on $\alpha_A$, which allows a fast calculation of the row set $R_{\alpha_A}$ in which every row satisfies the condition $\alpha_A < \delta_B$. Likewise, by using a b+tree index defined on $\delta_A$, a row set $R_{\delta_A}$ of rows satisfying the condition $\delta_A > \alpha_B$ can be quickly calculated. Once $R_{\alpha_A}$ and $R_{\delta_A}$ are obtained, the preselection row set $P_{\mathrm{FEQ}(A,B)}$ is calculated as shown in equation 11.

$$P_{\mathrm{FEQ}(A,B)} = R_{\alpha_A} \cap R_{\delta_A} \tag{11}$$
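The calculation in equation 11 can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: a sorted list searched with `bisect` stands in for the leaf level of a b+tree, the names `SortedIndex` and `preselect` are hypothetical, and the three row supports are invented for the example. The strict comparisons follow equation 7 as written.

```python
from bisect import bisect_left, bisect_right


class SortedIndex:
    """Keys kept in sorted order, standing in for the leaf level of a b+tree."""

    def __init__(self, key_by_row):
        pairs = sorted((key, row) for row, key in key_by_row.items())
        self.keys = [key for key, _ in pairs]
        self.rows = [row for _, row in pairs]

    def rows_less_than(self, bound):
        """Rows whose key is strictly below bound (binary search, then a scan)."""
        return set(self.rows[:bisect_left(self.keys, bound)])

    def rows_greater_than(self, bound):
        """Rows whose key is strictly above bound."""
        return set(self.rows[bisect_right(self.keys, bound):])


def preselect(alpha_index, delta_index, alpha_b, delta_b):
    """Equation 11: rows with alpha_A < delta_B intersected with rows with delta_A > alpha_B."""
    r_alpha = alpha_index.rows_less_than(delta_b)
    r_delta = delta_index.rows_greater_than(alpha_b)
    return r_alpha & r_delta


# Hypothetical supports (alpha_A, delta_A) for three rows.
supports = {1: (0, 10), 2: (5, 15), 3: (20, 30)}
alpha_index = SortedIndex({row: lo for row, (lo, _) in supports.items()})
delta_index = SortedIndex({row: hi for row, (_, hi) in supports.items()})

# Constant B with support (12, 25): row 1 ends before B starts, so it is filtered out.
candidates = preselect(alpha_index, delta_index, alpha_b=12, delta_b=25)
```

Each of the two range scans touches only the index entries on one side of a bound, which is the property that lets the mechanism avoid a sequential read of the table.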

As shown in equation 8, it is possible to adjust the preselection mechanism in order to obtain a preselection row set that is, in fact, equal to the result set. This way of calculating the preselection row set is similar to the one shown earlier, but it uses b+tree indexing structures defined on $\alpha(A_T)$ and $\delta(A_T)$ to calculate $R_{\alpha_A}$ and $R_{\delta_A}$ respectively. Although this method yields a more accurate preselection set, and therefore a faster calculation of the query result set, it requires maintaining an index structure for every possible threshold value used in queries. If the number of possible threshold values is large, this solution can be very space demanding, due to the high number of indexing structures, and it demands a great amount of computing time when data is inserted, updated or removed, due to the index maintenance processes.

To overcome this drawback, we propose that the user select the different threshold values for which an index structure will be defined. This solution reduces the space and computing time required for index maintenance. When it is applied, the preselection row set is calculated by using the index structures whose associated threshold value is the nearest one below the query threshold value. If the threshold values on which the indexes are built are selected correctly, so that the thresholds of most queries are near the threshold values selected for the indexes, the preselection row set will be very similar to the final query result set, which means a significant reduction in query processing time. A good threshold selection can be achieved by determining the threshold values most frequently employed by users in queries.
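The choice of index structures for a given query threshold can be sketched in a few lines of Python. The function name `select_index_threshold` is hypothetical, and one detail is an assumption of this sketch: an indexed threshold equal to the query threshold is also accepted, so that the basic case T = 0 can be served by the index built for threshold 0.

```python
from bisect import bisect_right


def select_index_threshold(indexed_thresholds, query_threshold):
    """Return the largest indexed threshold that does not exceed the query threshold."""
    ordered = sorted(indexed_thresholds)
    position = bisect_right(ordered, query_threshold)
    if position == 0:
        raise ValueError("no index is defined at or below the query threshold")
    return ordered[position - 1]


# With indexes defined for thresholds 0 and 0.5 (illustrative values):
low = select_index_threshold([0, 0.5], 0.3)
high = select_index_threshold([0, 0.5], 0.75)
```

Here `low` is 0 and `high` is 0.5, so a query with threshold 0.3 falls back to the support-based index, while a query with threshold 0.75 uses the tighter 0.5 index.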

4.5 Index data structures

For each threshold value considered in the index, the proposed indexing mechanism requires a pair of b+trees, one for $\alpha(A_T)$ (or $\alpha_A$ if the threshold is zero) and the other for $\delta(A_T)$ (or $\delta_A$ if the threshold is zero), where A is the indexed attribute and T ranges over the selected threshold values. The index data structure is therefore a collection of b+trees. This paper proposes to organize this collection as a matrix whose cells are b+trees. This matrix has two rows, one for $\alpha(A_T)$ and one for $\delta(A_T)$, and $n_T$ columns, where $n_T$ is the number of user-selected threshold values of the index.

In this paper, each b+tree in this matrix is labeled with the letter $\alpha$ or $\delta$, depending on the row in which it is stored, and with the threshold value associated with the column in which it is stored. For instance, the index built on $\alpha(A_{0.6})$ is called the $\alpha$-0.6 index for the attribute A, and the index built on $\delta_A$ is called the $\delta$-0 index for the attribute A. Note that in this special case, when the threshold is zero, building an index on $\alpha(A_0)$ or $\delta(A_0)$ does not make sense, because the 0-cut of any fuzzy number is the whole domain; the support bounds $\alpha_A$ and $\delta_A$ are indexed instead.

Table 1: Sample data table

rowID  att
1      [8,10,10,20]
2      [42,46,47,51]
3      [1,9,15,19]
4      [40,50,55,57]
5      [38,46,47,47]
6      [10,12,12,14]
7      [26,28,30,32]
8      [51,57,60,60]
9      [18,22,27,37]
10     [25,35,39,45]

Table 2: $\alpha_{att}$ and $\delta_{att}$ values for attribute att

rowID  $\alpha_{att}$  $\delta_{att}$
1      8               20
2      42              51
3      1               19
4      40              57
5      38              47
6      10              14
7      26              32
8      51              60
9      18              37
10     25              45

Figure 3: $\alpha$-0 b+tree index
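The two-row matrix of b+trees described in Section 4.5 can be sketched as a dictionary with one column per selected threshold. Everything named here is hypothetical (`cut_bounds`, `build_index_matrix`, the `"alpha"` and `"delta"` keys), sorted lists again stand in for b+trees, and the bounds of a T-cut are obtained by linear interpolation along the slopes of the trapezoid, a standard property that the text does not spell out. At threshold 0 the same formula returns the support bounds.

```python
def cut_bounds(trapezoid, threshold):
    """Lower and upper bounds of the T-cut of a trapezoid [a, b, c, d]."""
    a, b, c, d = trapezoid
    lower = a + threshold * (b - a)   # point where the rising slope reaches T
    upper = d - threshold * (d - c)   # point where the falling slope reaches T
    return lower, upper


def build_index_matrix(table, thresholds):
    """One column per threshold, each holding an alpha and a delta sorted index."""
    matrix = {}
    for threshold in thresholds:
        bounds = {row: cut_bounds(trap, threshold) for row, trap in table.items()}
        matrix[threshold] = {
            "alpha": sorted((lower, row) for row, (lower, _) in bounds.items()),
            "delta": sorted((upper, row) for row, (_, upper) in bounds.items()),
        }
    return matrix


# Two illustrative rows and two selected thresholds.
table = {1: (8, 10, 10, 20), 2: (1, 9, 15, 19)}
matrix = build_index_matrix(table, thresholds=(0, 0.5))
```

Every insert, update or delete has to touch both lists in every column, which is the maintenance cost that motivates keeping the number of selected thresholds small.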

5 Example of query processing

This example shows how the result sets of two fuzzy queries are calculated using the indexing technique described above.

5.1 Example data and indexes

For the purposes of this example, a table is defined whose data is shown in Table 1. This example table is composed of two columns: rowID and att. The rowID column is the primary key of the table and stores integer values. The att attribute is a column of OAFT type, and hence stores fuzzy numbers defined by trapezoidal possibility distributions.

In order to show how the proposed indexing mechanism works, an index for OAFT data is defined on the attribute att of the example table. This index is defined for the threshold values 0 and 0.5. Table 2 shows the $\alpha_{att}$ and $\delta_{att}$ values used to build the index structures corresponding to the threshold value 0. The two ($\alpha$ and $\delta$) b+trees of order 5 built on these values (this order is chosen arbitrarily, for example purposes only) are shown in Figures 3 and 4.

Figure 4: $\delta$-0 b+tree index

Table 3 shows the $\alpha(att_{0.5})$ and $\delta(att_{0.5})$ values on which the $\alpha$-0.5 and $\delta$-0.5 b+trees of order 5 are defined. These b+trees are shown in Figures 5 and 6 respectively.

Table 3: $\alpha(att_{0.5})$ and $\delta(att_{0.5})$ values

rowID  $\alpha(att_{0.5})$  $\delta(att_{0.5})$
1      9                    15
2      44                   49
3      5                    17
4      45                   56
5      42                   47
6      11                   13
7      27                   31
8      54                   60
9      20                   32
10     30                   42

Figure 5: $\alpha$-0.5 b+tree index

Figure 6: $\delta$-0.5 b+tree index

5.2 Query processing

The first example query, for which the steps of the query processing mechanism are described, is the following:

SELECT rowID,att FROM table WHERE FEQ(att,[20,25,35,40])>0.3

This query includes the condition FEQ(att,[20,25,35,40])>0.3, which is defined on attribute att and whose threshold value is 0.3. When the query processor detects that an index is defined on attribute att, it starts the query processing mechanism of the proposed indexing method, following these steps:

1. B+tree selection: the threshold value is checked in order to select the most appropriate $\alpha$ and $\delta$ index pair. The index on attribute att is defined for the thresholds 0 and 0.5. The nearest threshold value below 0.3 is 0, and therefore the $\alpha$-0 and $\delta$-0 indexes are selected.
2. Search range calculation: the $\alpha_B$ and $\delta_B$ values are extracted from the second argument of the fuzzy condition operator. This second operand is the trapezoidal possibility distribution [20,25,35,40], therefore $\alpha_B = 20$ and $\delta_B = 40$.
3. $R_{\alpha_{att}}$ calculation: $R_{\alpha_{att}}$ is quickly calculated by using the $\alpha$-0 b+tree, obtaining all rows which satisfy the condition $\alpha_{att} < 40$. In this example, the $R_{\alpha_{att}}$ set is composed of the rows {3, 1, 6, 9, 10, 7, 5}.
4. $R_{\delta_{att}}$ calculation: the row set $R_{\delta_{att}}$ is calculated by means of the $\delta$-0 b+tree. This set contains all rows satisfying the condition $\delta_{att} > 20$. The calculated $R_{\delta_{att}}$ is the row set {7, 9, 10, 5, 2, 4, 8}.
5. Preselection row set calculation: this set is calculated as the intersection of $R_{\alpha_{att}}$ and $R_{\delta_{att}}$. In this example the preselection row set is {9, 10, 7, 5}.
6. Query result calculation: the preselection set is filtered in order to select only those rows whose fulfilment degree satisfies the query threshold. The fulfilment degree of each row in the preselection set is shown in Table 4. The query result is composed of the rows {7, 9, 10}; row 5 is excluded because its fulfilment degree is under the threshold.

Table 4: Fulfilment degree for rows in the preselection set

rowID  FEQ(att, [20,25,35,40])
5      0.1538
7      1
9      1
10     1

The second query in the example is the following:

SELECT rowID,att FROM table WHERE FEQ(att,[30,34,35,49])>0.75

The condition included in this query is FEQ(att,[30,34,35,49])>0.75, which is defined on the attribute att and whose threshold value is 0.75. In this case, because the condition threshold is 0.75 (0.5 is the nearest lower threshold value among those available in the index structure), the b+trees $\alpha$-0.5 and $\delta$-0.5 are selected from the matrix in the index data structure.

By means of the $\alpha$-0.5 b+tree, $R_{\alpha_{att}}$ is calculated using the condition $\alpha(att_{0.5}) < \delta([30, 34, 35, 49]_{0.5})$, equivalent to $\alpha(att_{0.5}) < 42$. In this case, $R_{\alpha_{att}}$ = {3, 1, 6, 9, 7, 10}.

$R_{\delta_{att}}$ is calculated by means of the $\delta$-0.5 b+tree. The rows in $R_{\delta_{att}}$ must satisfy the condition $\delta(att_{0.5}) > \alpha([30, 34, 35, 49]_{0.5})$, or the equivalent $\delta(att_{0.5}) > 32$. After searching in the $\delta$-0.5 b+tree, $R_{\delta_{att}}$ is {10, 5, 2, 4, 8}.

The preselection row set is obtained by intersecting $R_{\alpha_{att}}$ and $R_{\delta_{att}}$. After the intersection, the preselection row set is {10}. Finally, the preselection set is filtered to exclude all candidates whose fulfilment degree does not exceed the threshold. The fulfilment degree of row 10 is 1, so this row is the only member of the query result set.

As a final remark, note that when the final result set has relatively few rows compared to the data table, as happens in this example, the proposed indexing mechanism drastically reduces the number of disk reads and the CPU usage.
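Both example queries can be replayed with a short Python sketch, which is offered as a check of the arithmetic rather than as the paper's implementation. The function names are hypothetical, set comprehensions stand in for the b+tree range scans, T-cut bounds are obtained by linear interpolation along the trapezoid slopes, and `feq` uses a closed form for the sup-min of two trapezoids (1 when the cores overlap, otherwise the height at which the facing slopes cross), which is an assumption of this sketch.

```python
TABLE = {
    1: (8, 10, 10, 20), 2: (42, 46, 47, 51), 3: (1, 9, 15, 19),
    4: (40, 50, 55, 57), 5: (38, 46, 47, 47), 6: (10, 12, 12, 14),
    7: (26, 28, 30, 32), 8: (51, 57, 60, 60), 9: (18, 22, 27, 37),
    10: (25, 35, 39, 45),
}
INDEXED_THRESHOLDS = (0, 0.5)


def cut_bounds(trapezoid, threshold):
    """Lower and upper bounds of the T-cut of [a, b, c, d]."""
    a, b, c, d = trapezoid
    return a + threshold * (b - a), d - threshold * (d - c)


def feq(x, y):
    """Sup-min degree of two trapezoids, using a closed form for this shape."""
    left, right = (x, y) if x[1] <= y[1] else (y, x)
    if right[1] <= left[2]:      # the cores overlap
        return 1.0
    if left[3] <= right[0]:      # the supports do not overlap
        return 0.0
    # Height where the falling slope of `left` crosses the rising slope of `right`.
    return (left[3] - right[0]) / ((left[3] - left[2]) + (right[1] - right[0]))


def run_query(constant, threshold):
    """Return (preselection row set, result row set) for FEQ(att, constant) > threshold."""
    index_threshold = max(t for t in INDEXED_THRESHOLDS if t <= threshold)
    alpha_b, delta_b = cut_bounds(constant, index_threshold)
    cuts = {row: cut_bounds(trap, index_threshold) for row, trap in TABLE.items()}
    # Each set below is what a range scan on the corresponding b+tree would return.
    r_alpha = {row for row, (lower, _) in cuts.items() if lower < delta_b}
    r_delta = {row for row, (_, upper) in cuts.items() if upper > alpha_b}
    preselection = r_alpha & r_delta
    result = {row for row in preselection if feq(TABLE[row], constant) > threshold}
    return preselection, result


first = run_query((20, 25, 35, 40), 0.3)
second = run_query((30, 34, 35, 49), 0.75)
```

In this sketch `first` is the pair ({5, 7, 9, 10}, {7, 9, 10}) and `second` is ({10}, {10}), so the fulfilment degree is computed for four rows and one row respectively instead of all ten.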

6 Conclusions and future work

This paper proposes an indexing mechanism for ordered atomic fuzzy data, starting from the ideas proposed in [1]. This indexing mechanism is based on a classical indexing mechanism for crisp numerical data, and therefore does not require special data structures or search algorithms that are not included in recent DBMS. Because of this, the proposal is more suitable than others for use in a FORDBMS built on a recent ORDBMS by taking advantage of its extension mechanisms.

The proposed indexing mechanism reduces the number of disk reads and the CPU time for queries on imprecise numerical data when they return a relatively low number of table rows, in contrast to a sequential read, which is the usual way in which current FDBMS prototypes process queries including flexible conditions on fuzzy numerical data.

Future work should focus on determining theoretically, and quantifying empirically, the significance of the reduction in disk reads and CPU cost achieved by the proposed indexing mechanism. Additionally, indexing mechanisms for other kinds of imprecise data, such as non-ordered atomic fuzzy data, fuzzy collections and fuzzy objects, should be studied.

Acknowledgements

This work has been partially supported by the Spanish Ministerio de Ciencia y Tecnología and the European Social Fund under grants TIC2002-00480 and TIC2003-08687-C02-2.

References

[1] P. Bosc, M. Galibourg, Indexing principles for a fuzzy data base, Inf. Syst. 14(6), pp. 493-499, 1989.
[2] P. Bosc, O. Pivert, Fuzzy Querying in Conventional Databases, in Fuzzy Logic for Management of Uncertainty, eds. L.A. Zadeh and J. Kacprzyk, John Wiley and Sons Inc, pp. 645-671, 1992.
[3] R. D. Caluwe, Fuzzy and Uncertain Object-Oriented Databases: Concepts and Models, Advances in Fuzzy Systems - Applications and Theory, Vol. 13, World Scientific, 1997.
[4] J.C. Cubero, N. Marín, J.M. Medina, O. Pons, M.A. Vila, Fuzzy Object Management in an Object-Relational Framework, X Intl. Conf. of Information Processing and Management of Uncertainty in Knowledge-Based Systems, pp. 1767-1774, Perugia (Italy), 2004.
[5] S. Fukami, M. Umano, M. Mizumoto, H. Tanaka, Fuzzy Database Retrieval and Manipulation Language, IEICE Technical Reports, Vol. 78, N. 233, pp. 65-72, AL-78-85 (Automata and Language), 1979.
[6] J. Galindo, J.M. Medina, O. Pons, J.C. Cubero, A Server for Fuzzy SQL Queries, in Flexible Query Answering Systems, eds. T. Andreasen, H. Christiansen and H.L. Larsen, Lecture Notes in Artificial Intelligence (LNAI) 1495, pp. 164-174, Springer, 1998.
[7] H. Prade, C. Testemale, Generalizing Database Relational Algebra for the Treatment of Incomplete or Uncertain Information and Vague Queries, Information Sciences, Vol. 34, pp. 115-143, 1984.
[8] M. Umano, Freedom-O: A Fuzzy Database System, in Fuzzy Information and Decision Processes, eds. Gupta and Sanchez, North-Holland Pub. Comp., 1982.
[9] A. Yazici, R. George, Fuzzy Database Modeling, Physica-Verlag, 1999, Chapter 2.3.
[10] M. Zemankova-Leech, A. Kandel, Fuzzy Relational Databases - A Key to Expert Systems, TÜV Rheinland, Köln, Germany, 1984.
