A Prototype System for Rule Generation in Lipski’s Incomplete Information Databases

Hiroshi Sakai (1), Michinori Nakata (2), and Dominik Ślęzak (3,4)

1 Mathematical Sciences Section, Department of Basic Sciences, Faculty of Engineering, Kyushu Institute of Technology, Tobata, Kitakyushu 804, Japan
2 Faculty of Management and Information Science, Josai International University, Gumyo, Togane, Chiba 283, Japan
3 Institute of Mathematics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
4 Infobright Inc., Krzywickiego 34 pok. 219, 02-078 Warsaw, Poland

Abstract. This paper advances rule generation in Lipski’s incomplete information databases and develops a software tool for it. We focus on three kinds of information incompleteness: non-deterministic information, missing values, and intervals. For intervals, we introduce the concept of a resolution. All three kinds of information incompleteness are handled uniformly by the NIS-Apriori algorithm. An overview of a prototype system in Prolog is presented.

Keywords: Lipski’s incomplete information databases, Rule generation, Apriori algorithm, Rough sets, Prolog.

1 Introduction

In our previous research, we addressed rule generation in Non-deterministic Information Systems (NISs) [9]. In contrast to Deterministic Information Systems (DISs) [8,12], NISs were proposed by Pawlak [8] and Orłowska [7] to better handle information incompleteness in data. Recently, we focused on Lipski’s Incomplete Information Databases (IIDs) [5,6], and proposed rule generation in IIDs [11]. We treat the obtained methodology as a step toward more general rule-based data analysis, where both data values and descriptors take various forms of incompleteness, vagueness or non-determinism.

In this paper, we advance the previous rule generation in IIDs, and develop a prototype system that can handle three kinds of information incompleteness. The first kind is non-deterministic information [8,7], the second is missing values [3,4], and the third is intervals.

S.O. Kuznetsov et al. (Eds.): RSFDGrC 2011, LNAI 6743, pp. 175–182, 2011. © Springer-Verlag Berlin Heidelberg 2011


Fig. 1. A NIS and 24 derived DISs. The number of derived DISs is finite. However, it usually grows exponentially with the level of incompleteness of the NIS’s values.

The paper is organized as follows: Section 2 recalls data representation and rule generation in DISs and NISs. Section 3 introduces the same for IIDs, and presents implementation and execution. Section 4 concludes the paper.

2 Rule Generation in DISs and NISs

We omit formal definitions of DISs and NISs. Instead, we show an example in Figure 1. We identify a DIS with a standard table. In a NIS, each attribute value is a set. If the value is a singleton, there is no incompleteness. Otherwise, we interpret it as a set of possible values, i.e., each set includes the actual value but we do not know which of them is the actual one.

A rule (more correctly, a candidate for a rule) is an implication τ in the form of Condition_part ⇒ Decision_part. We employ support(τ) and accuracy(τ) to express the rule’s appropriateness [1,8] (see also Figure 2).

Specification of the rule generation task in a DIS
For threshold values α and β (0 < α, β ≤ 1), find each implication τ satisfying support(τ) ≥ α and accuracy(τ) ≥ β.

In NISs, the same τ may be generated by different tuples, so we use the notation τ^x to express that τ is generated by an object x. Let DD(τ^x) denote {ψ | ψ is a derived DIS and τ^x occurs in ψ}, and we define the next task.

Specification of the rule generation task in a NIS
(The lower system) Find each implication τ such that support(τ^x) ≥ α and accuracy(τ^x) ≥ β (for an object x) hold in each ψ ∈ DD(τ^x).
(The upper system) Find each implication τ such that support(τ^x) ≥ α and accuracy(τ^x) ≥ β (for an object x) hold in some ψ ∈ DD(τ^x).

Both systems depend on |DD(τ^x)|. In [10], we proved some simplifying results illustrated by Figure 3. We also showed how to effectively compute support(τ^x) and accuracy(τ^x) for ψ_min and ψ_max independently of |DD(τ^x)|.
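To make the two specifications concrete, the following Python sketch evaluates them by brute force on a small NIS. It is only an illustration under stated assumptions: the toy table, the attribute names and all function names are hypothetical, and support and accuracy are taken in the usual association-rule sense (the fraction of objects matching both parts of τ, and the fraction of objects matching the condition part that also match the decision part). The enumeration visits every derived DIS, so it is usable only for very small tables.

```python
from itertools import product

def support_accuracy(table, cond, dec):
    """Return (support, accuracy) of cond => dec in a DIS given as a list of dict rows."""
    def matches(row, descriptors):
        return all(row[a] == v for a, v in descriptors.items())
    n_cond = sum(matches(row, cond) for row in table)
    n_both = sum(matches(row, cond) and matches(row, dec) for row in table)
    return n_both / len(table), (n_both / n_cond if n_cond else 0.0)

def derived_diss(nis):
    """Yield every derived DIS of a NIS whose cells are sets of possible values."""
    attrs = list(nis[0])
    cells = [sorted(row[a]) for row in nis for a in attrs]
    for choice in product(*cells):
        values = iter(choice)
        yield [{a: next(values) for a in attrs} for _ in nis]

def min_max_by_enumeration(nis, x, cond, dec):
    """Brute-force minimum and maximum of (support, accuracy) over DD(tau^x)."""
    tau = {**cond, **dec}
    pairs = [support_accuracy(psi, cond, dec)
             for psi in derived_diss(nis)
             if all(psi[x][a] == v for a, v in tau.items())]  # tau^x occurs in psi
    if not pairs:
        raise ValueError("tau is not generated by object x in any derived DIS")
    supports, accuracies = zip(*pairs)
    return (min(supports), min(accuracies)), (max(supports), max(accuracies))

# Hypothetical toy NIS (not taken from the paper); object indices start at 0.
TOY_NIS = [
    {"color": {"red"}, "size": {"s", "m"}},
    {"color": {"red", "blue"}, "size": {"s"}},
    {"color": {"red"}, "size": {"m"}},
    {"color": {"blue"}, "size": {"m"}},
]

lower, upper = min_max_by_enumeration(TOY_NIS, 0, {"color": "red"}, {"size": "m"})
print("lower system passes iff", lower, ">= (alpha, beta)")
print("upper system passes iff", upper, ">= (alpha, beta)")
```

In this sketch the lower system accepts τ when the minimum pair meets both thresholds, and the upper system accepts it when the maximum pair does.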


Fig. 2. A pair (support,accuracy) corresponding to the implication τ

By the results illustrated in Figure 3, we have the following equivalent specification.

Equivalent specification of the rule generation task in a NIS
(The lower system) Find each implication τ such that minsupp(τ^x) ≥ α and minacc(τ^x) ≥ β for an object x (see Figure 3).
(The upper system) Find each implication τ such that maxsupp(τ^x) ≥ α and maxacc(τ^x) ≥ β for an object x (see Figure 3).

In [10], we extended rule generation to NISs and implemented a software tool called NIS-Apriori. NIS-Apriori does not depend upon the number of derived DISs. This paper extends this software tool to Lipski’s Incomplete Information Databases.

3 Rule Generation in Incomplete Information Databases

Now, we advance from NISs to IIDs. We introduce an example of an IID and examine it. The formal definitions of an IID are in [11].

3.1 An Example of an Incomplete Information Database

In Table 1, we have Domain_Age = {20, 21, ..., 70}, Domain_Sex = {male, female}, Domain_Department = {dp1, dp2, dp3} and Domain_Salary = {400, 401, 402, ..., 2000}. For handling information incompleteness, the attribute values of Age and Salary are intervals, and the attribute values of Sex and Department are either a value, a subset of the domain or a missing value ∗. Missing values ∗ and intervals are often employed for handling information incompleteness.

3.2 Non-deterministic Information and Missing Values

In Table 1, we have two missing values, i.e., two ∗ symbols. In rough sets, the domain DOM is usually a finite set, therefore we identify ∗ with the non-deterministic information DOM. Namely, we replace the two ∗ symbols with {dp1, dp2, dp3} and {male, female}, and we obtain a NIS for the set of attributes {Sex, Department}. Thus, we consider 96 (= 2^5 × 3) derived DISs as in Figure 1, and the actual DIS is one of these 96 derived DISs.

Fig. 3. A distribution of pairs (support, accuracy) for τ^x. There exists ψ_min ∈ DD(τ^x) which makes both support(τ^x) and accuracy(τ^x) the minimum. There exists ψ_max ∈ DD(τ^x) which makes both support(τ^x) and accuracy(τ^x) the maximum. We denote such quantities as minsupp, minacc, maxsupp and maxacc, respectively.

3.3 Information Incompleteness about Intervals and Derived DISs

Now, we consider information incompleteness for intervals. We usually interpret an interval [lower, upper] as meaning that the actual value lies between lower and upper. Information incompleteness for intervals is a relative concept. For example, let us consider the number π. The interval [3.14, 3.15] will be precise enough for students, but it will be too coarse for researchers. This example is also related to granularity and granular computing in general [13]. Consider the following definition.

Definition 1. [11] For an attribute A whose values are intervals, let us fix a threshold value γ_A > 0. We say that an interval [lower, upper] is “definite” if its length (upper − lower) is not greater than γ_A. Otherwise, we say that it is “indefinite”. We call γ_A a resolution of VAL_A.

Example 1. In Table 1, consider γ_Age = 1. Then information about x4, x6 and x8 is definite, and information about the other objects is indefinite. For x1, there are three possible intervals: [22, 23], [23, 24], [24, 25]. For γ_Age = 10, information about all objects is definite, and there is no information incompleteness.

According to Definition 1 and Example 1, we can re-define derived DISs (depending upon the resolution) for intervals. We can also consider a figure analogous to Figure 1 for Table 1. However, the number of derived DISs may not be finite. For example, for an interval [0, 1] = {x : real_number | 0 ≤ x ≤ 1} and γ = 0.1, the number of definite intervals is infinite.
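Definition 1 and Example 1 can be checked with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the enumeration of possible intervals assumes integer endpoints, which matches the three intervals listed for x1 in Example 1 but is not stated as a general rule. Over the reals the enumeration would not terminate, as noted for [0, 1] with γ = 0.1.

```python
# Age intervals of Table 1, keyed by object name.
AGE = {
    "x1": (22, 25), "x2": (20, 25), "x3": (25, 30), "x4": (36, 36), "x5": (37, 40),
    "x6": (43, 43), "x7": (45, 50), "x8": (52, 52), "x9": (53, 57), "x10": (60, 70),
}

def is_definite(interval, gamma):
    """Definition 1: an interval is definite if its length does not exceed gamma."""
    lower, upper = interval
    return (upper - lower) <= gamma

def integer_subintervals(interval, gamma):
    """List the length-gamma sub-intervals with integer endpoints (an assumption
    of this sketch); a definite interval is returned unchanged."""
    lower, upper = interval
    if is_definite(interval, gamma):
        return [interval]
    return [(lo, lo + gamma) for lo in range(lower, upper - gamma + 1)]

definite_at_1 = [x for x, iv in AGE.items() if is_definite(iv, 1)]
print(definite_at_1)
print(integer_subintervals(AGE["x1"], 1))
print(all(is_definite(iv, 10) for iv in AGE.values()))
```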


Table 1. An example of an Incomplete Information Database (IID)

OB    Age       Sex     Department  Salary
x1    [22, 25]  female  ∗           [400, 500]
x2    [20, 25]  female  {dp2, dp3}  [500, 600]
x3    [25, 30]  male    {dp1, dp2}  [400, 700]
x4    [36, 36]  ∗       dp2         [700, 750]
x5    [37, 40]  male    {dp1, dp2}  [500, 800]
x6    [43, 43]  female  {dp2, dp3}  [600, 800]
x7    [45, 50]  male    dp3         [700, 900]
x8    [52, 52]  male    dp3         [800, 900]
x9    [53, 57]  male    dp3         [1000, 1500]
x10   [60, 70]  male    dp3         [1100, 2000]

3.4 Descriptors and Rule Generation in IIDs

In a DIS, we consider each implication τ from a table. If τ satisfies support(τ) ≥ α and accuracy(τ) ≥ β, we pick up this τ as a candidate for a rule. In a NIS, we followed this strategy, and defined DD(τ^x) in Section 2. For handling categorical data in rough sets, we usually suppose that each domain of attribute values is finite. So, we implicitly handled a finite number of descriptors, and we did not specify any descriptor for rule generation. However, we may need to specify descriptors in an IID, because there may be an infinite number of them. Also, each rule is expressed by descriptors, so the selection of descriptors is very important. We see this as the next important issue for rule generation in IIDs. In the current prototype system in Prolog, we explicitly specify each descriptor in a data set.

Our rule generation basically depends upon consistency in rough sets, and we also need to consider the Dominance-based Rough Set Approach (DRSA) [2]. By using the properties of ordered sets, we will be able to develop software with more general functionality. This is another important issue for future work. The following is the tentative rule generation task in IIDs.

Specification of the tentative rule generation task in an IID
(Assumption) Descriptors are given, and each implication τ is defined by the given descriptors. Each DD(τ^x) is a set of derived DISs with definite intervals.
(The lower system) The same definition as in NISs.
(The upper system) The same definition as in NISs.

3.5 Data Expression and Equivalence Classes

The following is the real data for Table 1. The prototype system in Prolog can handle any data set in the following syntax.


object(10,4).     /* #object=10, #attribute=4 */
support(0.2).     /* constraint: support is more than 0.2 */
accuracy(0.5).    /* constraint: accuracy is more than 0.5 */
decision(4).      /* decision attribute */
attrib(1,age,5,[[25,30],[30,40],[40,50],[50,60],[60,100]]).
resolution(1,interval,5).     /* resolution of age */
attrib(2,sex,2,[male,female]).
resolution(2,set,1).
attrib(3,department,3,[dp1,dp2,dp3]).
resolution(3,set,1).
attrib(4,salary,4,[[300,600],[600,800],[800,1000],[1000,2000]]).
resolution(4,interval,100).
data(1,[[22,25],female,nil,[400,500]]).
data(2,[[20,25],female,[dp2,dp3],[500,600]]).
data(3,[[25,30],male,[dp1,dp2],[400,700]]).
data(4,[[36,36],nil,dp2,[700,750]]).
data(5,[[37,40],male,[dp1,dp2],[500,800]]).
data(6,[[43,43],female,[dp2,dp3],[600,800]]).
data(7,[[45,50],male,dp3,[700,900]]).
data(8,[[52,52],male,dp3,[800,900]]).
data(9,[[53,57],male,dp3,[1000,1500]]).
data(10,[[60,70],male,dp3,[1100,2000]]).

In this data set, five descriptors for the attribute Age and four descriptors for the attribute Salary are specified. For the attributes Sex and Department, the descriptors [sex, male], [sex, female], [department, dp1], [department, dp2] and [department, dp3] are specified. According to the values of support and resolution, this data set is first translated into internal data. The following is a part of it:

upper(3,1,[department,dp1],[],[1,3,5]).
upper(3,2,[department,dp2],[4],[1,2,3,4,5,6]).
lower(3,3,[department,dp3],[7,8,9,10],[1,2,6,7,8,9,10]).
lower(4,1,[salary,[300.0,600.0]],[1,2],[1,2,3,5]).
lower(4,2,[salary,[600.0,800.0]],[4,6],[3,4,5,6,7]).
upper(4,3,[salary,[800.0,1000.0]],[8],[7,8]).
lower(4,4,[salary,[1000.0,2000.0]],[9,10],[9,10]).
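The last two arguments of these facts can be reproduced from Table 1 with a short Python sketch. It is not the prototype's Prolog code. It assumes the membership rule of this section: an object whose value is definite for a descriptor (a single matching value, or an interval contained in the descriptor's interval) enters both lists, while an object that only possibly matches (an indefinite set containing the value, or a non-empty overlap of length at least γ) enters the second list only. The function and variable names are hypothetical, and a missing value is represented by the whole domain.

```python
# Salary intervals and Department value sets of Table 1, keyed by object number.
SALARY = {
    1: (400, 500), 2: (500, 600), 3: (400, 700), 4: (700, 750), 5: (500, 800),
    6: (600, 800), 7: (700, 900), 8: (800, 900), 9: (1000, 1500), 10: (1100, 2000),
}
DOMAIN = {"dp1", "dp2", "dp3"}
DEPARTMENT = {
    1: DOMAIN, 2: {"dp2", "dp3"}, 3: {"dp1", "dp2"}, 4: {"dp2"}, 5: {"dp1", "dp2"},
    6: {"dp2", "dp3"}, 7: {"dp3"}, 8: {"dp3"}, 9: {"dp3"}, 10: {"dp3"},
}

def interval_classes(values, desc, gamma):
    """Return (inf, sup) object lists for an interval descriptor desc = (lower, upper)."""
    inf, sup = [], []
    d_lo, d_hi = desc
    for x, (lo, hi) in sorted(values.items()):
        if d_lo <= lo and hi <= d_hi:                  # interval of x inside the descriptor
            inf.append(x)
            sup.append(x)
        else:
            o_lo, o_hi = max(lo, d_lo), min(hi, d_hi)  # intersection of the two intervals
            if o_lo <= o_hi and (o_hi - o_lo) >= gamma:
                sup.append(x)
    return inf, sup

def set_classes(values, desc):
    """Return (inf, sup) object lists for a categorical descriptor value desc."""
    inf, sup = [], []
    for x, possible in sorted(values.items()):
        if possible == {desc}:                         # definite value
            inf.append(x)
        if desc in possible:                           # definite or merely possible
            sup.append(x)
    return inf, sup

for desc in [(300, 600), (600, 800), (800, 1000), (1000, 2000)]:
    print("salary", desc, interval_classes(SALARY, desc, 100))
for desc in ["dp1", "dp2", "dp3"]:
    print("department", desc, set_classes(DEPARTMENT, desc))
```

Under these assumptions the sketch returns the same object lists as the seven facts listed above.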

The fourth and fifth arguments are the minimum equivalence class and the maximum equivalence class for a descriptor. For Sex and Department, if the attribute value of an object x is definite, x is added to the fourth and fifth arguments of the descriptor. If the attribute value is indefinite, x is added to the fifth argument of each related descriptor. For Age and Salary, we consider the interval INT_x of an object x and the interval INT_desc of a descriptor. If INT_x ⊆ INT_desc, x is added to the fourth and fifth arguments of the descriptor. Otherwise, if [lower, upper] = INT_x ∩ INT_desc ≠ ∅ and upper − lower ≥ γ, x is added to the fifth argument of the related descriptor. By using the fourth argument inf and the fifth argument sup, we can easily obtain minsupp(τ^x), minacc(τ^x), maxsupp(τ^x) and maxacc(τ^x), and we can apply the NIS-Apriori algorithm [10], which uses these four criterion values, to rule generation for Table 1.

3.6 Execution for Table 1

Now, we show an example of real execution for Table 1. By the step1 command, we obtain rules in the form of [Attribute_A, val_A] ⇒ [Salary, val_Salary]. In the lower system, we obtained one rule (minsupp(τ)=0.2, minacc(τ)=0.5), which we call a certain rule. This rule τ satisfies the constraints of support and accuracy in each derived DIS where τ occurs. Objects 1 and 2 support this τ. In the upper system, we obtained 11 rules, which we call possible rules.

----- 1st STEP ------------------------------------------------------
File=tlip.pl, Support=0.2, Accuracy=0.5
===== Lower System ==================================================
[13] [sex,female]=>[salary,[300,600]] (0.2,0.5) [1,2]
The Rest Candidates:[[[2,1],[4,4]],[[3,3],[4,4]]]
(Next Candidates are Remained)
===== Upper System ==================================================
[2] [age,[30,40]]=>[salary,[600,800]] (0.2,1.0) [4,5] IGC [5]
[6] [age,[40,50]]=>[salary,[600,800]] (0.2,1.0) [6,7] IGC [7]
[14] [sex,male]=>[salary,[600,800]] (0.4,0.5714285714) [3,4,5,7]
[17] [sex,female]=>[salary,[300,600]] (0.2,0.6666666667) [1,2]
[18] [sex,female]=>[salary,[600,800]] (0.2,0.5) [4,6] IGC [4]
: : :
[32] [department,dp3]=>[salary,[1000,2000]] (0.2,0.5) [9,10]
The Rest Candidates:[[[2,1],[4,1]],[[2,1],[4,3]],[[2,1],[4,4]],:::
(Next Candidates are Remained)
EXEC_TIME=0.0(sec)

In order to obtain rules in the form of [Attribute_A, val_A] ∧ [Attribute_B, val_B] ⇒ [Salary, val_Salary], we execute step2, and we have the following:

----- 2nd STEP ------------------------------------------------------
===== Lower System ==================================================
[1] [sex,male]&[department,dp3]=>[salary,[1000,2000]] (0.2,0.5) [9,10]
The Rest Candidates:[]
(Lower System Terminated)
===== Upper System ==================================================
[3] [sex,male]&[dep::,dp3]=>[salary,[800,1000]] (0.2,0.5) [7,8] IGC [7]
[4] [sex,male]&[department,dp3]=>[salary,[1000,2000]] (0.2,0.5) [9,10]
The Rest Candidates:[]
(Upper System Terminated)
EXEC_TIME=0.0(sec)
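As a cross-check on the printed (support, accuracy) pairs, the following Python sketch recomputes them from the minimum (inf) and maximum (sup) equivalence classes of Section 3.5. The formulas are not stated in this paper. They are an assumed reading of the approach in [10], adopted here because they reproduce the values printed above: boundary objects of the condition (sup minus inf) are counted against the rule for the minimum and in its favour for the maximum. The conjunction of two descriptors is assumed to intersect their classes. All names are hypothetical, and the Sex classes are derived from Table 1 rather than copied from the internal data excerpt.

```python
N_OBJECTS = 10

# (inf, sup) pairs; Department and Salary come from the internal data of Section 3.5.
FEMALE = ({1, 2, 6}, {1, 2, 4, 6})
MALE = ({3, 5, 7, 8, 9, 10}, {3, 4, 5, 7, 8, 9, 10})
DP3 = ({7, 8, 9, 10}, {1, 2, 6, 7, 8, 9, 10})
SAL_300_600 = ({1, 2}, {1, 2, 3, 5})
SAL_600_800 = ({4, 6}, {3, 4, 5, 6, 7})
SAL_1000_2000 = ({9, 10}, {9, 10})

def conjunction(a, b):
    """Assumed rule for A & B: intersect the inf classes and the sup classes."""
    return a[0] & b[0], a[1] & b[1]

def criteria(cond, dec, n_objects=N_OBJECTS):
    """Return ((minsupp, minacc), (maxsupp, maxacc)) under the assumed formulas."""
    (c_inf, c_sup), (d_inf, d_sup) = cond, dec
    boundary = c_sup - c_inf            # objects that only possibly match the condition
    certain = c_inf & d_inf             # objects supporting the rule in every derived DIS
    possible = c_sup & d_sup            # objects supporting the rule in some derived DIS
    outacc = boundary - d_inf           # boundary objects that may count against the rule
    inacc = boundary & d_sup            # boundary objects that may count for the rule
    minacc = len(certain) / (len(c_inf) + len(outacc)) if certain else 0.0
    maxacc = len(possible) / (len(c_inf) + len(inacc)) if possible else 0.0
    return (len(certain) / n_objects, minacc), (len(possible) / n_objects, maxacc)

print("[sex,female]=>[salary,[300,600]]", criteria(FEMALE, SAL_300_600))
print("[sex,male]=>[salary,[600,800]]", criteria(MALE, SAL_600_800))
print("[sex,male]&[department,dp3]=>[salary,[1000,2000]]",
      criteria(conjunction(MALE, DP3), SAL_1000_2000))
```

The first element of each result corresponds to the lower system and the second to the upper system, so a rule is reported by a system only when the corresponding pair meets both thresholds.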

4 Concluding Remarks

In this paper, we showed how to formulate and solve the rule generation problem for Incomplete Information Databases. Our prototype was examined on an exemplary practical data set (mammographic.csv, with 150 objects and 6 attributes; the number of derived DISs is about 10^46).

Acknowledgment. The first author is supported by the Grant-in-Aid for Scientific Research (C) (No.18500214, No.22500204), Japan Society for the Promotion of Science. The third author is supported by the grant N N516 077837 from the Ministry of Science and Higher Education of the Republic of Poland.

References

1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc. of VLDB, pp. 487–499 (1994)
2. Dembczyński, K., Greco, S., Słowiński, R.: Rough Set Approach to Multiple Criteria Classification with Imprecise Evaluations and Assignments. European J. Operational Research 198, 626–636 (2009)
3. Grzymała-Busse, J.: Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction. Transactions on Rough Sets 1, 78–95 (2004)
4. Kryszkiewicz, M.: Rules in Incomplete Information Systems. Information Sciences 113, 271–292 (1999)
5. Lipski, W.: On Semantic Issues Connected with Incomplete Information Data Base. ACM Trans. DBS. 4, 269–296 (1979)
6. Lipski, W.: On Databases with Incomplete Information. Journal of the ACM 28, 41–70 (1981)
7. Orłowska, E., Pawlak, Z.: Representation of Nondeterministic Information. Theoretical Computer Science 29, 27–39 (1984)
8. Pawlak, Z.: Rough Sets. Kluwer Academic Publishers, Dordrecht (1991)
9. Sakai, H., Okuma, A.: Basic Algorithms and Tools for Rough Non-deterministic Information Analysis. Transactions on Rough Sets 1, 209–231 (2004)
10. Sakai, H., Ishibashi, R., Nakata, M.: On Rules and Apriori Algorithm in Nondeterministic Information Systems. Transactions on Rough Sets 9, 328–350 (2008)
11. Sakai, H., Nakata, M., Ślęzak, D.: Rule Generation in Lipski’s Incomplete Information Databases. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds.) RSCTC 2010. LNCS, vol. 6086, pp. 376–385. Springer, Heidelberg (2010)
12. Skowron, A., Rauszer, C.: The Discernibility Matrices and Functions in Information Systems. In: Intelligent Decision Support - Handbook of Advances and Applications of the Rough Set Theory, pp. 331–362. Kluwer Academic Publishers, Dordrecht (1992)
13. Zadeh, L.A.: Toward a Theory of Fuzzy Information Granulation and its Centrality in Human Reasoning and Fuzzy Logic. Fuzzy Sets and Systems 90, 111–127 (1997)