On Granular Rough Computing: Factoring Classifiers through Granulated Decision Systems

Lech Polkowski(1,2), Piotr Artiemjew(2)

(1) Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland
(2) Department of Mathematics and Computer Science, University of Warmia and Mazury, Olsztyn, Poland

email: [email protected]; [email protected]
Abstract. The paradigm of Granular Computing has quite recently emerged as an area of research in its own right; in particular, it is pursued within rough set theory, initiated by Zdzisław Pawlak. Granules of knowledge consist of entities with, in a sense, similar information content. The idea of a granular counterpart to a decision/information system has been put forth, along with its consequence in the form of the hypothesis that various operators aimed at dealing with information should factorize sufficiently faithfully through granular structures [7], [8]. The most important such operators are algorithms for inducing classifiers. We show results of testing a few well-known algorithms for classifier induction on frequently used data sets from the UCI (Irvine) Repository in order to verify the hypothesis. The results confirm the hypothesis in the case of the selected representative algorithms and open a new prospective area of research.

Keywords: rough inclusion, similarity, granulation of knowledge, granular systems and classifiers
1 Rough Computing
Knowledge is represented as a pair (U, A), called an information system [4], where U is a set of objects and A is a collection of attributes, each a ∈ A construed as a mapping a : U → V_a from U into the value set V_a. The collection IND = {ind(a) : a ∈ A} of a-indiscernibility relations, where ind(a) = {(u, v) : u, v ∈ U, a(u) = a(v)} for a ∈ A, can be restricted to any set B ⊆ A, yielding the B-indiscernibility relation

ind(B) = ⋂_{a ∈ B} ind(a).

A concept is any subset of the set U. By a proper rough entity we mean any entity e constructed from objects in U and relations in IND such that its action e·u on each object u ∈ U satisfies the condition: if (u, v) ∈ r then e·u = e·v, for each r ∈ IND; in particular, proper concepts are called exact and improper concepts are called rough. A particular case of an information system is a decision system, i.e., a pair (U, A ∪ {d}) in which d is a singled-out attribute called the decision. Basic primitives in any reasoning based on rough set theory are descriptors, see, e.g., [4], of the form (a = v), with semantics [(a = v)] = {u ∈ U : a(u) = v}, extended to the set of formulae by means of sentential connectives, with appropriately extended semantics. In order to relate the conditional knowledge (U, IND) to the world knowledge (U, {ind(d)}), decision rules are in use; a decision rule is an implication of the form

⋀_{a ∈ A} (a = v_a) ⇒ (d = w).    (1)
A classifier is a set of decision rules.
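The definitions above translate directly into simple set operations on a data table. The following minimal Python sketch (the toy table, attribute names, and helper functions are ours, introduced only for illustration) computes B-indiscernibility classes and the meaning [(a = v)] of a descriptor.

```python
# A minimal sketch; the toy data and helper names are illustrative, not from the paper.
# An information system: U is a list of objects, each a dict attribute -> value.
U = [
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 0, "a2": 1, "d": "no"},
]
A = ["a1", "a2"]

def ind_classes(U, B):
    """Partition U into B-indiscernibility classes: u ~ v iff a(u) = a(v) for all a in B."""
    classes = {}
    for i, u in enumerate(U):
        key = tuple(u[a] for a in B)
        classes.setdefault(key, []).append(i)
    return list(classes.values())

def descriptor_meaning(U, a, v):
    """Semantics of the descriptor (a = v): the set of objects u with a(u) = v."""
    return [i for i, u in enumerate(U) if u[a] == v]

print(ind_classes(U, A))               # [[0, 1], [2]]
print(descriptor_meaning(U, "a1", 1))  # [0, 1]
```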
2 Rough Mereology. Rough Inclusions
We outline rough mereology here as a basis for the discussion of granules, following [7], [8]. Rough mereology is concerned with the theory of the predicate of rough inclusion.

2.1 Rough Inclusions
A rough inclusion µ_π(x, y, r), where x, y are individual objects and r ∈ [0, 1], satisfies the following requirements, relative to a given part relation π on a set U of individual objects, see [6], [7], [8], [9]:

1. µ_π(x, y, 1) ⇔ x ing_π y;
2. µ_π(x, y, 1) ⇒ [µ_π(z, x, r) ⇒ µ_π(z, y, r)];
3. µ_π(x, y, r) ∧ s < r ⇒ µ_π(x, y, s).    (2)
These requirements seem intuitively clear: 1. demands that the predicate µ_π be an extension of the relation ing_π of the underlying system of mereology; 2. expresses monotonicity of µ_π; and 3. assures the reading "to degree at least r". We use here only one rough inclusion, albeit a fundamental one, viz. (see [6], [7] for its derivation),

µ_L(u, v, r) ⇔ |IND(u, v)| / |A| ≥ r,    (3)

where IND(u, v) = {a ∈ A : a(u) = a(v)}.
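In the finite attribute-value setting of this paper, µ_L can be evaluated directly from the data table. The sketch below (function names are ours) follows formula (3), with objects represented as dicts as in the earlier snippet.

```python
def IND(u, v, A):
    """IND(u, v): the set of attributes on which objects u and v agree."""
    return {a for a in A if u[a] == v[a]}

def mu_L(u, v, r, A):
    """The rough inclusion of formula (3): mu_L(u, v, r) holds iff |IND(u, v)| / |A| >= r."""
    return len(IND(u, v, A)) / len(A) >= r
```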
3 Granules
A granule g_µ(u, r) about u ∈ U of the radius r, relative to µ, is defined by letting

g_µ(u, r) is Cls F(u, r),    (4)

where the property F(u, r) is satisfied by an object v if and only if µ(v, u, r) holds, and Cls is the class operator, see, e.g., [6]. In practice, in the case of µ_L, the granule g(u, r) collects all v ∈ U such that |IND(v, u)| ≥ r · |A|. For a given granulation radius r, we form the collection U_{r,µ} = {g_µ(u, r) : u ∈ U}.
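For µ_L, the class operator reduces to collecting objects, so g(u, r) can be computed in a single pass over U. A short sketch under the same conventions as above (U a list of dicts, A a list of attribute names; the names are our own):

```python
def granule(u, U, A, r):
    """g(u, r) under mu_L: all v in U agreeing with u on at least r * |A| attributes."""
    need = r * len(A)
    return [v for v in U if sum(u[a] == v[a] for a in A) >= need]

def all_granules(U, A, r):
    """The collection U_{r,mu}: one granule about each object of U."""
    return [granule(u, U, A, r) for u in U]
```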
3.1 Granular decision systems
The idea of a granular decision system was posed in [7]: for a given information system (U, A), a rough inclusion µ, and r ∈ [0, 1], the new universe U_{r,µ} is given. We apply a strategy G to choose a covering Cov^G_{r,µ} of the universe U by granules from U_{r,µ}. We apply a strategy S in order to assign the value a*(g) of each attribute a ∈ A to each granule g ∈ Cov^G_{r,µ}: a*(g) = S({a(u) : u ∈ g}). The granular counterpart to the information system (U, A) is the tuple (U_{r,µ}, G, S, {a* : a ∈ A}); analogously, we define granular counterparts to decision systems by adding the factored decision d*. The heuristic principle that objects similar with respect to conditional attributes in the set A should also reveal similar (i.e., close) decision values, and that, therefore, granular counterparts to decision systems should lead to classifiers satisfactorily close in quality to those induced from the original decision systems, was stated in [7] and borne out by simple hand examples. In this work we verify this hypothesis on real data sets.
4 Classifiers: Rough set methods
Classifiers are evaluated by total accuracy, i.e., the ratio of the number of correctly classified test objects to the number of recognized test objects, and by total coverage, rec/test, where rec is the number of recognized test cases and test is the number of test cases. We test the LEM2 algorithm due to Grzymala-Busse, see, e.g., [2], as well as the covering and exhaustive algorithms of the RSES package [12], see [1], [13], [16], [17].
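For concreteness, the two quality measures can be computed as follows (a small sketch of ours; the convention that a prediction of None marks an unrecognized test object is an assumption, not part of the paper or of RSES):

```python
def accuracy_and_coverage(predicted, actual):
    """Total accuracy = correct / recognized; total coverage = recognized / all test cases.
    A prediction equal to None means the classifier did not recognize the test object."""
    recognized = [(p, a) for p, a in zip(predicted, actual) if p is not None]
    coverage = len(recognized) / len(actual)
    accuracy = (sum(p == a for p, a in recognized) / len(recognized)) if recognized else 0.0
    return accuracy, coverage
```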
4.1 On the approach in this work
For g(u, r) with r fixed and an attribute a ∈ A ∪ {d}, the factored value a*(g) is defined as S({a(u) : u ∈ g}) for a strategy S; each granule g thus produces a new object g*, with attribute values a(g*) = a*(g) for a ∈ A, possibly not present in the data set universe U. From the set U_{r,µ} (see Sect. 3.1) of all granules of the form g_µ(u, r), we choose, by means of a strategy G, a covering Cov^G_{r,µ} of the universe U. Thus, a decision system D* = ({g* : g ∈ Cov^G_{r,µ}}, {a* : a ∈ A} ∪ {d*}) is formed, called the granular counterpart, relative to the strategies G, S, of the decision system D = (U, A ∪ {d}); this new system is substantially smaller in size for intermediate values of r, hence classifiers induced from it have a correspondingly smaller number of rules. As stated above, the hypothesis is that the granular counterpart D* at sufficiently large granulation radii r preserves the knowledge encoded in the decision system D to a satisfactory degree, so that, given an algorithm A for rule induction, classifiers obtained from the training set D(trn) and from its granular counterpart D*(trn) should agree with a small error on the test set D(tst).
5 Experiments
In experiments with real data sets, we accept the total accuracy and total coverage coefficients as the quality measures for comparing the classifiers considered in this work. We make use of some well-known real-life data sets often used in testing classifiers. Due to shortage of space, we include only a very few results. The following data sets have been used: the credit card application approval data set (Australian credit), see [14], and the Pima Indians diabetes data set [14]. As representative and well-established algorithms for rule induction in the public domain, we have selected:

– the exhaustive algorithm of RSES, see [12];
– the covering algorithm of RSES with p = .1 [12];
– the LEM2 algorithm with p = .5, see [2], [12].
Table 1 shows a comparison of these algorithms on the Australian credit data set split into training and test sets at the ratio 1:1.

Table 1. Comparison of algorithms on Australian credit data; 345 training objects, 345 test objects

algorithm           accuracy  coverage  rule number
covering (p = .1)   0.670     0.783     589
covering (p = .5)   0.670     0.783     589
covering (p = 1.0)  0.670     0.783     589
exhaustive          0.872     1.0       5597
LEM2 (p = .1)       0.810     0.061     6
LEM2 (p = .5)       0.906     0.368     39
LEM2 (p = 1.0)      0.869     0.643     126
In the rough set literature there are results of tests with other algorithms on the Australian credit data set; we recall some of the best of them in Table 2, and we also include the best granular cases from this work.

Table 2. Best results for Australian credit by some rough set based algorithms; in case *, the reduction in object size is 40.6 percent and the reduction in rule number is 43.6 percent; in case **, resp., 10.5 and 5.9; in case ***, resp., 3.6 and 1.9

source              method                                        accuracy       coverage
Bazan [1]           SNAPM(0.9)                                    error = 0.130  −
S.H. Nguyen [13]    simple.templates                              0.929          0.623
S.H. Nguyen [13]    general.templates                             0.886          0.905
S.H. Nguyen [13]    closest.simple.templates                      0.821          1.0
S.H. Nguyen [13]    closest.gen.templates                         0.855          1.0
S.H. Nguyen [13]    tolerance.simple.templ.                       0.842          1.0
S.H. Nguyen [13]    tolerance.gen.templ.                          0.875          1.0
J. Wroblewski [17]  adaptive.classifier                           0.863          −
this work           granular*, r = 0.642857                       0.867          1.0
this work           granular**, r = 0.714286                      0.875          1.0
this work           granular***, concept dependent, r = 0.785714  0.9970         0.9985
For any granule g and any attribute b in the set A ∪ {d} of attributes, the factored value b*(g) at the granule g has been estimated by means of the majority voting strategy, with ties resolved at random; majority voting is one of the most popular strategies and has been frequently applied within rough set theory, see, e.g., [13], [16]. We also use the simplest strategy for finding a covering: we order the objects in the set U and choose sequentially the granules about them so as to obtain an irreducible covering; a random choice of granules is applied in the sections where this is specifically mentioned. The only enhancement of this simple granulation is discussed in Sect. 6, where concept-dependent granules are considered; this approach yields even better classification results. A sketch of the covering and voting procedure is given below.
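The sketch below is our reconstruction of the procedure just described, not the authors' implementation: granules are taken about objects in their given order, a granule is added only when it covers a not-yet-covered object (a greedy approximation of the irreducible-covering selection), and factored attribute values are obtained by majority voting with random tie-breaking.

```python
import random

def granule(u, U, A, r):
    """g(u, r) under mu_L: indices of objects agreeing with u on at least r * |A| attributes."""
    need = r * len(A)
    return [i for i, v in enumerate(U) if sum(u[a] == v[a] for a in A) >= need]

def granular_counterpart(U, A, d, r):
    """Build the objects of D*: sequential (greedy) covering by granules plus
    majority voting over each attribute, including the decision d; ties broken at random."""
    covered, new_objects = set(), []
    for u in U:
        g = granule(u, U, A, r)
        if set(g) <= covered:          # granule adds no new object: skip it
            continue
        covered.update(g)
        g_star = {}
        for attr in A + [d]:           # factor every attribute, including the decision
            values = [U[i][attr] for i in g]
            top_count = max(values.count(v) for v in set(values))
            ties = [v for v in set(values) if values.count(v) == top_count]
            g_star[attr] = random.choice(ties)
        new_objects.append(g_star)
    return new_objects
```

A classifier (e.g., one of the RSES algorithms) would then be induced from the returned objects instead of from the original training set.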
5.1 Train-and-test at 1:1 ratio for Australian credit
We include here results for Australian credit. Table 3 shows the sizes of the training and test sets in the non-granular and granular cases, as well as the classification results versus the radius of granulation. Table 4 shows the absolute differences between the non-granular case (r = nil) and the granular cases, as well as the sizes of the training and rule sets in the granular cases as fractions of those in the non-granular case.
Table 3. Australian credit data set: r = granule radius, tst = test sample size, trn = training sample size, rulcov = number of rules with the covering algorithm, rulex = number of rules with the exhaustive algorithm, rullem = number of rules with LEM2, acov = total accuracy with the covering algorithm, ccov = total coverage with the covering algorithm, aex = total accuracy with the exhaustive algorithm, cex = total coverage with the exhaustive algorithm, alem = total accuracy with LEM2, clem = total coverage with LEM2

r          tst  trn  rulcov  rulex  rullem  acov   ccov   aex    cex    alem   clem
nil        345  345  571     5597   49      0.634  0.791  0.872  0.994  0.943  0.354
0.0        345  1    14      0      0       1.0    0.557  0.0    0.0    0.0    0.0
0.0714286  345  1    14      0      0       1.0    0.557  0.0    0.0    0.0    0.0
0.142857   345  2    16      0      1       1.0    0.557  0.0    0.0    1.0    0.383
0.214286   345  3    7       7      1       0.641  1.0    0.641  1.0    0.600  0.014
0.285714   345  4    10      10     1       0.812  1.0    0.812  1.0    0.0    0.0
0.357143   345  8    18      23     2       0.820  1.0    0.786  1.0    0.805  0.252
0.428571   345  20   29      96     2       0.779  0.826  0.791  1.0    0.913  0.301
0.5        345  51   88      293    2       0.825  0.843  0.838  1.0    0.719  0.093
0.571429   345  105  230     933    2       0.835  0.930  0.855  1.0    0.918  0.777
0.642857   345  205  427     3157   20      0.686  0.757  0.867  1.0    0.929  0.449
0.714286   345  309  536     5271   45      0.629  0.774  0.875  1.0    0.938  0.328
0.785714   345  340  569     5563   48      0.629  0.797  0.870  1.0    0.951  0.357
0.857143   345  340  570     5574   48      0.626  0.791  0.864  1.0    0.951  0.357
0.928571   345  342  570     5595   48      0.628  0.794  0.867  1.0    0.951  0.357
1.0        345  345  571     5597   49      0.634  0.791  0.872  0.994  0.943  0.354
Table 4. Australian credit data set, comparison: r = granule radius, acerr = abs. total accuracy error with the covering algorithm, ccerr = abs. total coverage error with the covering algorithm, aexerr = abs. total accuracy error with the exhaustive algorithm, cexerr = abs. total coverage error with the exhaustive algorithm, alemerr = abs. total accuracy error with LEM2, clemerr = abs. total coverage error with LEM2, sper = training sample size as a fraction of the original size, rper = maximal rule set size as a fraction of the original size; a value marked "+" indicates that the granular case is better than the non-granular one

r          acerr   ccerr   aexerr  cexerr  alemerr  clemerr  sper    rper
nil        0.0     0.0     0.0     0.0     0.0      0.0      1.0     1.0
0.0        0.366+  0.234   0.872   0.994   0.943    0.354    0.003   0.024
0.0714286  0.366+  0.234   0.872   0.994   0.943    0.354    0.003   0.024
0.142857   0.366+  0.234   0.872   0.994   0.057+   0.029+   0.0058  0.028
0.214286   0.007+  0.209+  0.231   0.006+  0.343    0.340    0.009   0.02
0.285714   0.178+  0.209+  0.06    0.006+  0.943    0.354    0.012   0.02
0.357143   0.186+  0.209+  0.086   0.006+  0.138    0.102    0.023   0.04
0.428571   0.145+  0.035+  0.081   0.006+  0.03     0.053    0.058   0.05
0.5        0.191+  0.052+  0.034   0.006+  0.224    0.261    0.148   0.154
0.571429   0.201+  0.139+  0.017   0.006+  0.025    0.423+   0.304   0.403
0.642857   0.052+  0.034   0.005   0.006+  0.014    0.095+   0.594   0.748
0.714286   0.005   0.017   0.003+  0.006+  0.005    0.026    0.896   0.942
0.785714   0.005   0.006+  0.002   0.006+  0.008+   0.003+   0.985   0.994
0.857143   0.008   0.0     0.008   0.006+  0.008+   0.003+   0.985   0.998
0.928571   0.006   0.003+  0.005   0.006+  0.008+   0.003+   0.991   0.999
1.0        0.0     0.0     0.0     0.0     0.0      0.0      1.0     1.0
With the covering algorithm, accuracy is better or within an error of 1 percent for all radii; coverage is better or within an error of 4.5 percent from the radius of 0.214286 on, where the training set size reduction is 99 percent and the reduction in rule set size is 98 percent. With the exhaustive algorithm, accuracy is within an error of 10 percent from the radius of 0.285714 on, and it is better or within an error of 4 percent from the radius of 0.5 on, where the reduction in training set size is 85 percent and the reduction in rule set size is 95 percent. The result of .875 at r = .714286 is among the best overall (see Table 2). Coverage is better in the granular case from r = .214286 on, where the reduction in objects is 99 percent and the reduction in rule set size is almost 100 percent. LEM2 gives accuracy better or within a 2.6 percent error from the radius of 0.5 on, where the training set size reduction is 85 percent and the rule set size reduction is 96 percent; coverage is better or within an error of 7.3 percent from the radius of .571429 on, where the reduction in training set size is 69.6 percent and the rule set size is reduced by 96 percent.
5.2 CV-10 with Pima
We have experimented with the Pima Indians diabetes data set using 10-fold cross-validation and a random choice of a covering for the exhaustive and LEM2 algorithms. The results are in Tables 5 and 6.

Table 5. 10-fold CV; Pima; exhaustive algorithm. r = radius, macc = mean accuracy, mcov = mean coverage, mrules = mean rule number, mtrn = mean size of the training set

r      macc    mcov    mrules  mtrn
nil    0.6864  0.9987  7629    692
0.125  0.0618  0.0895  5.9     22.5
0.250  0.6627  0.9948  450.1   120.6
0.375  0.6536  0.9987  3593.6  358.7
0.500  0.6645  1.0     6517.6  579.4
0.625  0.6877  0.9987  7583.6  683.1
0.750  0.6864  0.9987  7629.2  692
0.875  0.6864  0.9987  7629.2  692
Table 6. 10-fold CV; Pima; LEM2 algorithm. r = radius, macc = mean accuracy, mcov = mean coverage, mrules = mean rule number, mtrn = mean size of the training set

r      macc    mcov    mrules  mtrn
nil    0.7054  0.1644  227.0   692
0.125  0.900   0.2172  1.0     22.5
0.250  0.7001  0.1250  12.0    120.6
0.375  0.6884  0.2935  74.7    358.7
0.500  0.7334  0.1856  176.1   579.4
0.625  0.7093  0.1711  223.1   683.1
0.750  0.7071  0.1671  225.9   692
0.875  0.7213  0.1712  227.8   692
For the exhaustive algorithm, accuracy in the granular case is 95.4 percent of the accuracy in the non-granular case from the radius of .25 on, with a reduction in the size of the training set of 82.5 percent, and from the radius of .5 on the difference is less than 3 percent, with a reduction in the size of the training set of about 16.3 percent. The difference in coverage is less than .4 percent from r = .25 on, where the reduction in training set size is 82.5 percent. For LEM2, accuracy in the two cases differs by less than 1 percent from r = .25 on, and it is better in the granular case from r = .5 on, with a reduction in the size of the training set of 16.3 percent; coverage is better in the granular case from r = .375 on, with the training set size reduced by 48.2 percent.
5.3 A validation by a statistical test
We have also carried out a test with the Pima Indians diabetes data set [14] and a random choice of coverings, taking a sample of 30 granular classifiers at the radius of .5, with train-and-test at the ratio 1:1, against the matched sample of classification results without granulation, with the covering algorithm at p = .1. The Wilcoxon [15] signed rank test for matched pairs has in this case given a p-value of .14 for coverage, so the null hypothesis of identical means should not be rejected, whereas for accuracy the hypothesis that the mean in the granular case equals .99 of the mean in the non-granular case may be rejected (the p-value is .009), and the hypothesis that the mean in the granular case is greater than .98 of the mean in the non-granular case is accepted (the p-value is .035) at the confidence level of .03.
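A matched-pairs test of this kind can be reproduced with standard tooling; the sketch below uses scipy (our choice of library, not mentioned in the paper), and the accuracy arrays are placeholders standing for the 30 measured values. The 0.99 and 0.98 factors follow the hypotheses stated above.

```python
from scipy.stats import wilcoxon

# Placeholder matched samples: accuracies of granular vs. non-granular classifiers.
# Replace with the 30 measured values from the experiment.
acc_gran = [0.66, 0.67, 0.65, 0.68, 0.66]
acc_std  = [0.67, 0.66, 0.67, 0.68, 0.65]

# H0: the granular mean equals 0.99 of the non-granular mean (two-sided test on differences).
stat, p_two_sided = wilcoxon([g - 0.99 * s for g, s in zip(acc_gran, acc_std)])

# One-sided alternative: the granular mean exceeds 0.98 of the non-granular mean.
stat, p_greater = wilcoxon([g - 0.98 * s for g, s in zip(acc_gran, acc_std)],
                           alternative="greater")

print(p_two_sided, p_greater)
```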
6 Concept-dependent granulation
A modification of the approach presented in the results shown above is concept-dependent granulation; a concept in the narrow sense is a decision/classification class, cf., e.g., [2]. Granulation in this sense consists in computing granules for objects in the universe U and for all distinct granulation radii as previously, with the only restriction that, given any object u ∈ U and r ∈ [0, 1], the new concept-dependent granule g^cd(u, r) is computed taking into account only objects v ∈ U with d(v) = d(u), i.e.,

g^cd(u, r) = g(u, r) ∩ {v ∈ U : d(v) = d(u)}.

This method increases the number of granules in coverings, but it is also expected to increase the quality of classification, as expressed by accuracy and coverage.
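In code, the concept-dependent granule differs from the standard one only by an additional filter on the decision value; a minimal sketch in the conventions of the earlier snippets (U a list of dicts, A a list of attribute names, d the name of the decision attribute, all notational assumptions of ours):

```python
def granule_cd(u, U, A, d, r):
    """Concept-dependent granule: g_cd(u, r) = g(u, r) ∩ {v : d(v) = d(u)}."""
    need = r * len(A)
    return [v for v in U
            if v[d] == u[d] and sum(u[a] == v[a] for a in A) >= need]
```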
We show that this is indeed the case by including the results of a test in which the exhaustive algorithm and a random choice of coverings were applied tenfold to the Australian credit data set, once with the by now "standard" granular approach and then with the concept-dependent approach. The averaged results are shown in Table 7.

Table 7. Standard and concept dependent granular systems for the Australian credit data set; exhaustive RSES algorithm: r = granule radius, macc = mean accuracy, mcov = mean coverage, mrules = mean number of rules, mtrn = mean training sample size; in each column the first value is for the standard approach, the second for the concept dependent one

r          macc            mcov            mrules            mtrn
nil        1.0; 1.0        1.0; 1.0        12025; 12025      690; 690
0.0        0.0; 0.8068     0.0; 1.0        0; 8              1; 2
0.0714286  0.0; 0.7959     0.0; 1.0        0; 8.2            1.2; 2.4
0.142857   0.0; 0.8067     0.0; 1.0        0; 8.9            2.4; 3.6
0.214286   0.1409; 0.8151  0.2; 1.0        1.3; 11.4         2.6; 5.8
0.285714   0.7049; 0.8353  0.9; 1.0        8.1; 14.8         5.2; 9.6
0.357143   0.7872; 0.8297  1.0; 0.9848     22.6; 32.9        10.1; 17
0.428571   0.8099; 0.8512  1.0; 0.9986     79.6; 134         22.9; 35.4
0.5        0.8319; 0.8466  1.0; 0.9984     407.6; 598.7      59.7; 77.1
0.571429   0.8607; 0.8865  0.9999; 0.9997  1541.6; 2024.4    149.8; 175.5
0.642857   0.8988; 0.9466  1.0; 0.9998     5462.5; 6255.2    345.7; 374.9
0.714286   0.9641; 0.9880  1.0; 0.9988     9956.4; 10344.0   554.1; 572.5
0.785714   0.9900; 0.9970  1.0; 0.9995     11755.5; 11802.7  662.7; 665.7
0.857143   0.9940; 0.9970  1.0; 0.9985     11992.7; 11990.2  682; 683
0.928571   0.9970; 1.0     1.0; 0.9993     12023.5; 12002.4  684; 685
1.0        1.0; 1.0        1.0; 1.0        12025.0; 12025.0  690; 690
Conclusions for concept dependent granulation. Concept dependent granulation, as expected, involves a greater number of granules in a covering, hence a greater number of rules; this is clearly perceptible up to the radius of .714286, and for greater radii the difference is negligible. Accuracy in the case of concept dependent granulation is always better than in the standard case; the difference becomes negligible at the radius of .857143, when granules become almost single indiscernibility classes. Coverage in the concept dependent case is almost the same as in the standard case, the difference between the two being not greater than .15 percent from the radius of .428571 on, where the average number of granules in coverings is 5 percent of the number of objects. Accuracy at that radius is better by .04, i.e., by about 5 percent, in the concept dependent case. It follows that concept dependent granulation yields better accuracy, whereas coverage is practically the same as in the standard case.
7 Conclusions
The results shown in this work confirm the hypothesis put forth in [7], [8] that granular counterparts to data sets preserve the encoded information to a very high degree. The search for a theoretical explanation of this fact, as well as work aimed at developing original algorithms for rule induction based on the discovered phenomenon, is in progress and will be reported.
References

1. J. G. Bazan, A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision tables, in: Rough Sets in Knowledge Discovery 1, L. Polkowski, A. Skowron, Eds., Physica Verlag, Heidelberg, 1998, 321–365.
2. J. W. Grzymala-Busse, Data with missing attribute values: Generalization of indiscernibility relation and rule induction, Transactions on Rough Sets I, Springer Verlag, Berlin, 2004, 78–95.
3. S. Leśniewski, On the foundations of set theory, Topoi 2, 1982, 7–52.
4. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer, Dordrecht, 1991.
5. L. Polkowski, Rough Sets. Mathematical Foundations, Physica Verlag, Heidelberg, 2002.
6. L. Polkowski, Toward rough set foundations. Mereological approach (a plenary lecture), in: Proceedings RSCTC 2004, Uppsala, Sweden, LNAI vol. 3066, Springer Verlag, Berlin, 2004, 8–25.
7. L. Polkowski, Formal granular calculi based on rough inclusions (a feature talk), in: [10], 57–62.
8. L. Polkowski, Formal granular calculi based on rough inclusions (a feature talk), in: [11], 9–16.
9. L. Polkowski, A. Skowron, Rough mereology: a new paradigm for approximate reasoning, International Journal of Approximate Reasoning 15(4), 1997, 333–365.
10. Proceedings of the IEEE 2005 Conference on Granular Computing, GrC05, Beijing, China, July 2005, IEEE Press, 2005.
11. Proceedings of the IEEE 2006 Conference on Granular Computing, GrC06, Atlanta, USA, May 2006, IEEE Press, 2006.
12. A. Skowron et al., RSES: A system for data analysis; available at http://logic.mimuw.edu.pl/~rses
13. Sinh Hoa Nguyen, Regularity analysis and its applications in Data Mining, in: Rough Set Methods and Applications, L. Polkowski, S. Tsumoto, T. Y. Lin, Eds., Physica Verlag, Heidelberg, 2000, 289–378.
14. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/databases/
15. F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin 1, 1945, 80–83.
16. A. Wojna, Analogy-based reasoning in classifier construction, Transactions on Rough Sets IV, LNCS 3700, Springer Verlag, Berlin, 2005, 277–374.
17. J. Wróblewski, Adaptive aspects of combining approximation spaces, in: Rough Neural Computing, S. K. Pal, L. Polkowski, A. Skowron, Eds., Springer Verlag, 2004, 139–156.