A Divide-and-Conquer Discretization Algorithm

Fan Min, Lijun Xie, Qihe Liu, and Hongbin Cai

College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610051, China
{minfan, xielj, qiheliu, caihb}@uestc.edu.cn
Abstract. The problem of real-value attribute discretization can be converted into the reduct problem in rough set theory, which is NP-hard and is usually solved by heuristic algorithms. In this paper we show that the straightforward conversion is not scalable, and we propose a divide-and-conquer algorithm. The algorithm is fully scalable and reduces the time complexity dramatically, especially when integrated with the tournament discretization algorithm. Parallel versions of the algorithm are easy to write, and their complexity depends on the number of objects in each subtable rather than on the number of objects in the initial decision table. There is a tradeoff between the time complexity and the quality of the resulting discretization scheme; this tradeoff can be tuned by adjusting the number of subtables, or equivalently, the number of objects in each subtable. Experimental results confirm our analysis and indicate appropriate parameter settings.
1 Introduction
The majority of machine learning algorithms can be applied only to data described by discrete numerical or nominal attributes (features). In the case of continuous attributes, a discretization algorithm is needed to transform them into discrete ones [1], and the discretization step determines how coarsely we want to view the world [2]. The problem of real-value attribute discretization can be converted into the reduct problem in rough set theory [3][5][6], but some existing algorithms are not scalable in practice [2][6] when the decision table has many continuous attributes and/or many possible attribute values. Recently we proposed an algorithm called the tournament discretization algorithm for situations where the number of attributes is large. In this paper we propose a divide-and-conquer algorithm that dramatically reduces the time complexity and is applicable especially when the number of objects in the decision table is large. By integrating these two algorithms, we can cope with decision tables of essentially any size. A parallel version of the algorithm can easily be written and run to further reduce the time complexity. There is a tradeoff between the time complexity and the quality of the resulting discretization scheme; this tradeoff can be tuned by adjusting the number of subtables, or equivalently, the number of objects in each subtable. Experimental results confirm our analysis and indicate appropriate parameter settings.

L. Wang and Y. Jin (Eds.): FSKD 2005, LNAI 3613, pp. 1277-1286, 2005. (c) Springer-Verlag Berlin Heidelberg 2005
The rest of this paper is organized as follows: in Section 2 we enumerate the relevant concepts of decision tables, discretization schemes and discernibility. In Section 3 we analyze existing rough set approaches to discretization and point out their scalability problem. In Section 4 we present and analyze our divide-and-conquer discretization algorithm. Some experimental results are given in Section 5. Finally, we conclude and point out further research directions in Section 6.
2 Preliminaries
In this section we enumerate the relevant concepts following Nguyen [3] and Komorowski [2], and propose the definition of discernibility for attributes and cuts.

2.1 Decision Tables
Formally, a decision table is a triple S = (U, A, {d}) where d ∉ A is called the decision attribute and the elements of A are called conditional attributes. a : U → Va for any a ∈ A ∪ {d}, where Va, the set of all values of a, is called the domain of a. For the sake of simplicity, throughout this paper we assume that U = {x1, ..., x|U|}, A = {a1, ..., a|A|} and d : U → {1, ..., r(d)}. Table 1 lists a decision table.

Table 1. A decision table S

U    a1   a2   a3   d
x1   1.1  0.2  0.4  1
x2   1.3  0.4  0.2  1
x3   1.5  0.4  0.5  1
x4   1.5  0.2  0.2  2
x5   1.7  0.4  0.3  2

2.2 Discretization Schemes
We assume Va = [la, ra) ⊂ ℝ to be a real interval for any a ∈ A. Any pair (a, c), where a ∈ A and c ∈ ℝ, is called a cut on Va. Let Pa be a partition of Va (for a ∈ A) into subintervals

Pa = {[c^a_0, c^a_1), [c^a_1, c^a_2), ..., [c^a_{ka}, c^a_{ka+1})},

where la = c^a_0 < c^a_1 < ... < c^a_{ka} < c^a_{ka+1} = ra and Va = [c^a_0, c^a_1) ∪ [c^a_1, c^a_2) ∪ ... ∪ [c^a_{ka}, c^a_{ka+1}). Hence any partition Pa is uniquely defined by, and often identified with, the set of cuts {(a, c^a_1), (a, c^a_2), ..., (a, c^a_{ka})} ⊂ A × ℝ. Any set of cuts P = ∪_{a∈A} Pa defines from S = (U, A, {d}) a new decision table S^P = (U, A^P, {d}), called the P-discretization of S, where A^P = {a^P : a^P(x) = i ⇔ a(x) ∈ [c^a_i, c^a_{i+1}) for x ∈ U and i ∈ {0, ..., ka}}. P is called a discretization scheme of S. When selecting cut points, we consider only midpoints of adjacent attribute values; e.g., in the decision table listed in Table 1, (a1, 1.2) is a possible cut while (a1, 1.25) is not.
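As a concrete illustration, the candidate cuts of Table 1 are the midpoints of adjacent distinct attribute values. The following sketch is ours, not the paper's implementation; the row-dictionary representation is an assumption made for brevity.

```python
# Illustrative sketch: enumerate candidate cuts as midpoints of adjacent
# distinct attribute values (Section 2.2). The table mirrors Table 1.

def candidate_cuts(table, attributes):
    """Return {attribute: sorted midpoints of adjacent distinct values}."""
    cuts = {}
    for a in attributes:
        values = sorted({row[a] for row in table})
        cuts[a] = [round((lo + hi) / 2, 6) for lo, hi in zip(values, values[1:])]
    return cuts

table = [
    {"a1": 1.1, "a2": 0.2, "a3": 0.4, "d": 1},
    {"a1": 1.3, "a2": 0.4, "a3": 0.2, "d": 1},
    {"a1": 1.5, "a2": 0.4, "a3": 0.5, "d": 1},
    {"a1": 1.5, "a2": 0.2, "a3": 0.2, "d": 2},
    {"a1": 1.7, "a2": 0.4, "a3": 0.3, "d": 2},
]

print(candidate_cuts(table, ["a1", "a2", "a3"]))
# {'a1': [1.2, 1.4, 1.6], 'a2': [0.3], 'a3': [0.25, 0.35, 0.45]}
```

Note that (a1, 1.2) appears among the candidates while (a1, 1.25) does not, matching the example above.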
2.3 Discernibility
The ability to discern between perceived objects is important for constructing many entities such as reducts, decision rules and decision algorithms [8]. Discernibility is often addressed through the discernibility matrix and the discernibility function. In this subsection we propose definitions of discernibility for both attributes and cuts from another point of view, to facilitate further analysis. Given a decision table S = (U, A, {d}), the discernibility of any attribute a ∈ A is defined as the set of object pairs discerned by a; formally,

DP(a) = {(xi, xj) ∈ U × U | d(xi) ≠ d(xj), i < j, a(xi) ≠ a(xj)},   (1)
where i < j is required to ensure that the same pair does not appear in the set twice. The discernibility of an attribute set B ⊆ A is the union of the discernibilities of its attributes, namely

DP(B) = ∪_{a∈B} DP(a).   (2)
Similarly, the discernibility of a cut (a, c) is defined as the set of object pairs it discerns,

DP((a, c)) = {(xi, xj) ∈ U × U | d(xi) ≠ d(xj), i < j, a(xi) < c ≤ a(xj) or a(xj) < c ≤ a(xi)}.   (3)
The discernibility of a cut set P is the union of the discernibilities of its cuts, namely

DP(P) = ∪_{(a,c)∈P} DP((a, c)).   (4)
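Equations (1) and (3) can be sketched directly as set comprehensions over object-index pairs. The code below is an illustrative sketch of ours, again using a row-dictionary form of Table 1; the helper names are not from the paper.

```python
# Sketch of Eqs. (1) and (3): discernibility as sets of index pairs (i < j).
from itertools import combinations

def dp_attribute(table, a):
    """DP(a), Eq. (1): pairs with different decisions and different a-values."""
    return {(i, j) for (i, xi), (j, xj) in combinations(enumerate(table), 2)
            if xi["d"] != xj["d"] and xi[a] != xj[a]}

def dp_cut(table, a, c):
    """DP((a, c)), Eq. (3): pairs with different decisions separated by cut c."""
    return {(i, j) for (i, xi), (j, xj) in combinations(enumerate(table), 2)
            if xi["d"] != xj["d"]
            and (xi[a] < c <= xj[a] or xj[a] < c <= xi[a])}

table = [
    {"a1": 1.1, "a2": 0.2, "a3": 0.4, "d": 1},
    {"a1": 1.3, "a2": 0.4, "a3": 0.2, "d": 1},
    {"a1": 1.5, "a2": 0.4, "a3": 0.5, "d": 1},
    {"a1": 1.5, "a2": 0.2, "a3": 0.2, "d": 2},
    {"a1": 1.7, "a2": 0.4, "a3": 0.3, "d": 2},
]

# The cut (a1, 1.6) discerns x1, x2, x3 (indices 0..2) from x5 (index 4):
print(sorted(dp_cut(table, "a1", 1.6)))  # [(0, 4), (1, 4), (2, 4)]
```

The unions in Eqs. (2) and (4) are then plain set unions over these results.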
It is worth noting that maintaining discernibility is the strictest requirement used in rough set theory, because it implies no loss of information [11].
3 Related Works and Their Limitations

3.1 The Problem Conversion
In this subsection we briefly explain the main idea of the rough set approach to discretization [3], and we point out that the conversion is independent of the definition of reduct and of the reduction algorithm employed. Given a decision table S = (U, A, {d}), denote the set of all possible cuts by C(A). Construct a new decision table S(C(A)) = (U, C(A), {d}), called the C(A)-cut attribute table of S, where for every ct = (a, c) ∈ C(A), ct : U → {0, 1} and

ct(x) = { 0, if a(x) < c;
          1, otherwise.   (5)

Table 2 lists the C(A)-cut attribute table S(C(A)) of the decision table S listed in Table 1. We have proved the following two theorems [6]:
Table 2. S(C(A))

U    (a1,1.2)  (a1,1.4)  (a1,1.6)  (a2,0.3)  (a3,0.25)  (a3,0.35)  (a3,0.45)  d
x1      0         0         0         0         1          1          0       1
x2      1         0         0         1         0          0          0       1
x3      1         1         0         1         1          1          1       1
x4      1         1         0         0         0          0          0       2
x5      1         1         1         1         1          0          0       2
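The construction of S(C(A)) via Eq. (5) can be sketched as follows. This is an illustrative sketch of ours (row dictionaries, hypothetical helper name), not the paper's implementation.

```python
# Sketch of Eq. (5): build the C(A)-cut attribute table, one binary row per
# object, where ct(x) = 0 if a(x) < c and 1 otherwise.

def cut_attribute_table(table, cuts):
    """Return S(C(A)) rows: binary cut attributes plus the decision d."""
    rows = []
    for x in table:
        row = {(a, c): (0 if x[a] < c else 1) for a, c in cuts}
        row["d"] = x["d"]
        rows.append(row)
    return rows

table = [
    {"a1": 1.1, "a2": 0.2, "a3": 0.4, "d": 1},
    {"a1": 1.3, "a2": 0.4, "a3": 0.2, "d": 1},
    {"a1": 1.5, "a2": 0.4, "a3": 0.5, "d": 1},
    {"a1": 1.5, "a2": 0.2, "a3": 0.2, "d": 2},
    {"a1": 1.7, "a2": 0.4, "a3": 0.3, "d": 2},
]
cuts = [("a1", 1.2), ("a1", 1.4), ("a1", 1.6), ("a2", 0.3),
        ("a3", 0.25), ("a3", 0.35), ("a3", 0.45)]

first = cut_attribute_table(table, cuts)[0]
print([first[c] for c in cuts])  # x1's row in Table 2: [0, 0, 0, 0, 1, 1, 0]
```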
Theorem 1. DP(C(A)) = DP(A).   (6)

Theorem 2. For any cut set P, DP(A^P) = DP(P).   (7)
Therefore the discretization problem (constructing A^P from A) is converted into the reduction problem (selecting P from C(A)). Nguyen [3][5] integrated the conversion process with the reduction process and employed a Boolean reasoning approach. By the theorems above, the conversion maintains the discernibility of decision tables, and hence it is independent of the definition of reduct and of the reduction algorithm employed. For example, if the definition of reduct requires that the positive region be maintained, i.e., that the decision tables (U, C(A), {d}) and (U, P, {d}) have the same positive region, then S and S^P have the same positive region; if the definition requires that the generalized decision be maintained, i.e., ∂C(A) = ∂P, then ∂A = ∂A^P.

3.2 The Scalability Problem of Existing Approaches
Although the above approach seems appealing, because the discretization problem is converted into a problem for which many efficient algorithms exist, using it directly is not scalable in practice when the decision table has many continuous attributes and/or many possible attribute values. The decision table S(C(A)) = (U, C(A), {d}) has |U| rows and |C(A)| = O(|A||U|) columns, which may be very large. For example, the data table WDBC of the UCI repository [10] (stored in the file uci/breast-cancer-wisconsin/wdbc.data.txt) has 569 objects and 31 continuous attributes, each attribute having 300-569 distinct values, so S(C(A)) would contain 15,671 columns, which is simply not supported by the ORACLE system running on our computer.

Nguyen [4] also proposed a very efficient algorithm based on the Boolean approach. From our point of view, however, this approach is not flexible, because only the object pairs discerned by given cuts are used as heuristic information. Moreover, it may not be applicable to mixed-mode data [7].

In previous work [6] we developed the tournament discretization algorithm. The algorithm runs in several rounds; in round i, the discretization schemes of decision tables containing 2^i conditional attributes (possibly fewer for the last one) are computed on the basis of the cut sets constructed in round i-1, resulting in ⌈|A|/2^i⌉ cut sets. In this process the number of candidate cuts of the current decision table is kept relatively low. Using this algorithm we computed a discretization scheme for WDBC, but it took our computer 10,570 seconds for the plain version and 1,886 seconds for the parallel version; both are rather long. Moreover, the algorithm fails when the number of possible cuts of any attribute exceeds 1000, the upper bound on the number of columns supported by the ORACLE system.
4 A Divide-and-Conquer Discretization Algorithm
We can apply the divide-and-conquer approach to the other dimension of the decision table, namely the number of objects. In this section we first present the algorithm structure and then analyze the parameter setting.

4.1 The Algorithm Structure
First we list our discretization algorithm in Fig. 1.

DivideAndConquerDiscretization(S = (U, A, {d}))
{input: a decision table S.}
{output: a discretization scheme P.}
Step 1. divide S into K subtables;
Step 2. compute the discretization scheme of each subtable;
Step 3. compute the discretization scheme P of S based on the discretization schemes of the subtables;

Fig. 1. A divide-and-conquer discretization algorithm
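The three steps of Fig. 1 can be sketched as follows. Here `base_discretize` and `select_from` stand in for any concrete Step 2 and Step 3 algorithms (e.g. the tournament algorithm and an entropy-based reduct search); both names, and the plain slicing in Step 1, are placeholder assumptions of this sketch, not the paper's API.

```python
def divide_and_conquer_discretize(table, K, base_discretize, select_from):
    """Fig. 1 as a sketch: divide S into K subtables, discretize each,
    then select the final scheme of S from the pooled subtable cuts."""
    size = len(table) // K
    # Step 1: K disjoint subtables; the last one absorbs the remainder.
    subtables = [table[i * size:(i + 1) * size] for i in range(K - 1)]
    subtables.append(table[(K - 1) * size:])
    # Step 2: discretize each subtable independently (parallelizable).
    pooled = set()
    for t in subtables:
        pooled |= base_discretize(t)
    # Step 3: reduce the pooled candidate cuts on the full table.
    return select_from(table, pooled)

# Toy run with stand-in functions, just to show the data flow:
toy = list(range(10))
print(divide_and_conquer_discretize(
    toy, 3, lambda t: {min(t), max(t)}, lambda tbl, cuts: sorted(cuts)))
# [0, 2, 3, 5, 6, 9]
```

In the paper, Step 1 actually interleaves rows with a prime stride rather than slicing adjacent rows (see below in Section 4.1); plain slicing is used here only to keep the sketch short.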
Generally, in Step 1 we require the family of subtables to be a partition of S. Namely, let the set of subtables be {S1, S2, ..., SK} where Si = (Ui, A, {d}) for all i ∈ {1, 2, ..., K}, ∪_{i=1}^{K} Ui = U, and Ui ∩ Uj = ∅ for all i, j ∈ {1, 2, ..., K} with i ≠ j. Moreover, we require all subtables except the last one to have the same size, namely |U1| = |U2| = ... = |U_{K-1}| = ⌊|U|/K⌋. We impose these requirements because:

1. Our algorithm is intended for data tables with a large number of rows; subtables sharing rows may not be preferred.
2. Loss of rows, i.e., rows not included in any subtable, may incur too much loss of information. For very large data tables, however, we may relax this requirement.
3. Subtables of (almost) the same size are preferred from both the statistical and the implementation points of view.
Moreover, it is not encouraged to construct subtables from adjacent rows, e.g., U1 = {u1, u2, ..., u_{⌊|U|/K⌋}}, because many data tables, such as IRIS, are well organized according to decision attribute values. We propose the following scheme to meet these requirements. First, select a prime number p such that |U| mod p ≠ 0. Then generate the sequence N = {n1, n2, ..., n_{|U|}} where n_j = ((j - 1) * p) mod |U| + 1 for all j ∈ {1, 2, ..., |U|}. Because |U| mod p ≠ 0, it is easy to prove that N = {1, 2, ..., |U|}. Finally, we let

U_i = {u_{n_{(i-1)⌊|U|/K⌋+1}}, u_{n_{(i-1)⌊|U|/K⌋+2}}, ..., u_{n_{i⌊|U|/K⌋}}}

for all i ∈ {1, 2, ..., K - 1}, and

U_K = {u_{n_{(K-1)⌊|U|/K⌋+1}}, u_{n_{(K-1)⌊|U|/K⌋+2}}, ..., u_{n_{|U|}}}.
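The index sequence n_j and the blocks U_i can be sketched as follows. The function name and the explicit gcd check are our additions; for a prime p, gcd(p, |U|) = 1 is equivalent to the condition that p does not divide |U|.

```python
# Sketch of the prime-stride partition of object indices 1..n into K blocks.
from math import gcd

def partition_indices(n, K, p):
    """Interleave indices 1..n into K blocks using stride p (prime, p not dividing n)."""
    assert gcd(p, n) == 1, "n_j is a permutation of 1..n only when p does not divide n"
    order = [((j - 1) * p) % n + 1 for j in range(1, n + 1)]  # n_1, ..., n_n
    size = n // K
    blocks = [order[(i - 1) * size:i * size] for i in range(1, K)]
    blocks.append(order[(K - 1) * size:])  # U_K absorbs the remainder
    return blocks

print(partition_indices(8, 2, 3))  # [[1, 4, 7, 2], [5, 8, 3, 6]]
```

The printed output reproduces the example with |U| = 8, K = 2 and p = 3 discussed in the text.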
For example, if |U| = 8, K = 2 and p = 3, then U1 = {u1, u4, u7, u2} and U2 = {u5, u8, u3, u6}. It is easy to see that the objects of any subtable are distributed evenly over S, and this scheme of constructing subtables can break any ordering bias of S if an appropriate p is chosen (relatively large primes such as 73 or 97 are preferred). In Step 2 any discretization algorithm could be employed; in this paper only Nguyen's algorithm [3] (employed while |A| ≤ 8) and our tournament discretization algorithm [6] (employed while |A| > 8) are considered, to keep a unified form. In Step 3 we use the same idea as in Subsection 3.1: instead of C(A), we use the cuts selected from all subtables, i.e., ∪_{i=1}^{K} P_i, where P_i is the discretization scheme of S_i.

4.2 Time Complexity Analysis
The time required for Step 1 is negligible compared with that of Step 2, and the time required for Step 3 is also negligible if K is not very large. It is worth noting that Step 2 can easily be distributed over K computers/processors and run in parallel. To make the relationship with the respective decision table explicit, we write P(S) instead of P for the discretization scheme of S, and P(S1) for the discretization scheme of S1. Obviously, the time complexity of computing the discretization scheme of any subtable equals that of S1. The time complexity of the most efficient reduction algorithm is O(M^2 N log N), where N is the number of objects and M is the number of attributes [9]. We have developed an entropy-based algorithm with the same complexity. In the following we assume that this algorithm is employed and give the time complexities of different algorithms and combinations of algorithms.

If we apply Nguyen's algorithm [3] directly to S, then because S(C(A)) has O(|A||U|) cut attributes, the time complexity of computing P(S) is

O(|A|^2 |U|^3 log |U|).   (8)
If we apply the tournament discretization algorithm [6] directly to S, the time complexity of computing P(S) is

O(|A| |U|^3 log |U|),   (9)

which is reduced to

O(log|A| * |U|^3 log |U|)   (10)

if the parallel mechanism is employed and there are |A|/2 computers/processors to use [6].

If we apply Nguyen's algorithm [3] to subtables, the time complexity of computing P(S1) is

O(|A|^2 (|U|/K)^3 log(|U|/K)),   (11)

which is also the time complexity of the parallel version of computing P(S) if there are K computers/processors to use. The time complexity of the plain version of computing P(S) is

O(K |A|^2 (|U|/K)^3 log(|U|/K)).   (12)
If we apply the tournament discretization algorithm [6] to subtables, the time complexity of computing P(S1) is

O(|A| (|U|/K)^3 log(|U|/K)),   (13)

which is reduced to

O(log|A| * (|U|/K)^3 log(|U|/K))   (14)
if the parallel mechanism is employed and there are |A|/2 computers/processors to use. This is also the time complexity of computing P(S) if there are (|A|/2) * K computers/processors to use. If the parallel mechanism is not employed, the time complexity of computing P(S) is

O(K |A| (|U|/K)^3 log(|U|/K)).   (15)

Comparing equation (8) with (14) or (15) shows that our algorithms greatly reduce the time complexity of discretization: ignoring the logarithmic factors, the plain version (15) is about |A| K^2 times faster than the direct approach (8).

4.3 Tradeoff for Deciding a Suitable Parameter
According to the above analysis, a larger K incurs a lower time complexity. But a larger K, or equivalently a smaller |U|/K, has drawbacks. When we divide S into subtables, we essentially lose some candidate cuts. For example, if we divide the decision table S listed in Table 1 into two subtables S1 = ({x1, x2}, {a1, a2, a3}, d) and S2 = ({x3, x4, x5}, {a1, a2, a3}, d), the cuts (a1, 1.4), (a3, 0.35) and (a3, 0.45) are lost. When the subtables are large enough, the loss of cuts may be trivial and ignorable, but when they are relatively small, the loss may be unendurable. In the extreme case K = |U|, every subtable contains exactly one object, and there is no candidate cut at all. Obviously, the appropriate K varies directly with |U| for different applications, so it is more suitable to investigate appropriate settings of |U|/K. In fact, if we keep |U|/K within a certain range across applications, then according to equations (11), (13) and (14), the time complexities of the parallel versions are not influenced by |U|. We analyze this issue through examples in the next section.
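The cut loss in this example can be checked mechanically. The sketch below recomputes, per attribute of Table 1, which candidate cuts of S survive in neither subtable; it is an illustrative check of ours, not part of the paper's algorithm.

```python
# Candidate cuts of S (Table 1) versus the union of candidate cuts of the
# subtables S1 = {x1, x2} and S2 = {x3, x4, x5}, per attribute.

def cuts_of(values):
    """Midpoints of adjacent distinct values."""
    vs = sorted(set(values))
    return {round((lo + hi) / 2, 6) for lo, hi in zip(vs, vs[1:])}

columns = {"a1": [1.1, 1.3, 1.5, 1.5, 1.7],
           "a2": [0.2, 0.4, 0.4, 0.2, 0.4],
           "a3": [0.4, 0.2, 0.5, 0.2, 0.3]}

for a, vals in columns.items():
    lost = cuts_of(vals) - (cuts_of(vals[:2]) | cuts_of(vals[2:]))
    print(a, sorted(lost))
# a1 [1.4]
# a2 []
# a3 [0.35, 0.45]
```

The output matches the text: (a1, 1.4), (a3, 0.35) and (a3, 0.45) are exactly the cuts lost by this split.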
5 Experimental Results
We are developing a tool called the Rough set Developer's Kit (RDK) on the Java platform and the ORACLE system, both to test our algorithms and as a basis for application development. For convenience we ran our algorithm on a notebook PC with an Intel Centrino 1.4G CPU and 256M memory. The reduction algorithm employed throughout this paper is entropy based, and the discretization algorithm for subtables is the tournament discretization algorithm.

Table 3 lists some results on the WDBC dataset. Specifically, K = 1 indicates that we do not divide S into subtables. POS(S^P) denotes the number of objects in the positive region of the discretized decision table S^P. Total time is the time used by the plain version of our algorithm, while Parallel time is the time used by the parallel version. Processors is the number of computers/processors required for running the parallel version. Step 3 time is the time required for executing Step 3 of our algorithm. All times are in seconds.

Analysis of the experiments:

1. While K ≤ 5, the positive region of S^P is the same as that of S, i.e., all 569 objects are in the positive region. This no longer holds for K > 5. The difference is essential from the rough set point of view.
2. If the discretization quality is estimated by the number of selected cuts |P|, suboptimal results are obtained, especially when K ≤ 6.
3. The total time and the parallel time decrease as K increases, but the trend does not continue significantly for K > 4, partly because more subtables incur more overhead.
4. The run time of Step 3 is no longer negligible when K ≥ 7.
5. In this example, K = 4 is the best setting.
6. For more general situations, we recommend |U|/K between 100 and 150 (corresponding to K = 4 or 5 in this example), because subtables of such size tend to retain most cuts while remaining easy to discretize.
Table 3. Some results of the WDBC data set

K    POS(S^P)  |P|  Total time  Parallel time  Processors  Step 3 time
1      569      11    10,570        1,886           16           0
2      569      13     2,858          278           32          11
3      569      13     2,354          207           48          51
4      569      12     1,395          107           64          25
5      569      11     1,189           80           80          24
6      550      13     1,225          108           96          34
7      550      18       920           46          112          16
8      567      13       865           52          128          26
9      566      15       942           58          144          34
10     554      12       994           46          160          21
11     565      12     1,093           76          176          49
12     565      13       912           36          192          21

6 Conclusions and Further Works
In this paper we proposed a divide-and-conquer discretization algorithm that divides the given decision table into subtables and combines the discretization schemes of the subtables. Integrated with the tournament discretization algorithm, this algorithm can discretize decision tables of any size. Moreover, the time complexity of the parallel versions of our algorithm is determined by |U1| rather than |U| when there are K = |U|/|U1| processors/computers to use. Through an example we have also suggested setting |U1| between 100 and 150. Further research includes applying our algorithms, together with these parameter settings, in applications to test their validity.
References

1. L. Kurgan, K.J. Cios: CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering 16(2) (2004) 145-153.
2. J. Komorowski, Z. Pawlak, L. Polkowski, A. Skowron: Rough Sets: A Tutorial. In: S. Pal, A. Skowron (Eds.): Rough Fuzzy Hybridization (1999) 3-98.
3. H.S. Nguyen, A. Skowron: Discretization of Real Value Attributes. Proc. of the 2nd Joint Annual Conference on Information Science, Wrightsville Beach, North Carolina (1995) 34-37.
4. H.S. Nguyen: Discretization of Real Value Attributes, Boolean Reasoning Approach. PhD thesis, Warsaw University, Warsaw, Poland (1997).
5. H.S. Nguyen: Discretization Problem for Rough Sets Methods. RSCTC'98, LNAI 1424 (1998) 545-552.
6. F. Min, H.B. Cai, Q.H. Liu, F. Li: The Tournament Discretization Algorithm. Submitted to ISPA 2005.
7. F. Min, S.M. Lin, X.B. Wang, H.B. Cai: Attribute Extraction from Mixed-Mode Data. Submitted to ICMLC 2005.
8. R.W. Swiniarski, A. Skowron: Rough Set Methods in Feature Selection and Recognition. Pattern Recognition Letters 24 (2003) 833-849.
9. S.H. Liu, Q.J. Sheng, B. Wu, Z.Z. Shi, F. Hu: Research on Efficient Algorithms for Rough Set Methods. Chinese Journal of Computers 26(5) (2003) 524-529 (in Chinese).
10. C.L. Blake, C.J. Merz: UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, UC Irvine, Dept. of Information and Computer Science, 1998.
11. M. Kryszkiewicz: Comparative Studies of Alternative Types of Knowledge Reduction in Inconsistent Systems. International Journal of Intelligent Systems 16(1) (2001) 105-120.