Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 984375, 6 pages, http://dx.doi.org/10.1155/2014/984375

Research Article

Classification Based on Pruning and Double Covered Rule Sets for the Internet of Things Applications

Shasha Li, Zhongmei Zhou, and Weiping Wang
Department of Computer Science and Engineering, Minnan Normal University, Zhangzhou 363000, China

Correspondence should be addressed to Shasha Li; [email protected]

Received 14 October 2013; Accepted 25 November 2013; Published 5 January 2014

Academic Editors: W. Sun, G. Zhang, and J. Zhou

Copyright Β© 2014 Shasha Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Internet of things (IOT) has been a hot topic in recent years. IOT users accumulate large amounts of data, and mining useful knowledge from these data is a great challenge. Classification is an effective strategy for predicting the needs of users in the IOT. However, many traditional rule-based classifiers cannot guarantee that every instance is covered by at least two classification rules, so these algorithms cannot achieve high accuracy on some datasets. In this paper, we propose a new rule-based classifier, CDCR-P (Classification based on the Pruning and Double Covered Rule sets). CDCR-P induces two different rule sets 𝐴 and 𝐡 such that every instance in the training set is covered by at least one rule in rule set 𝐴 and by at least one rule in rule set 𝐡. To improve the quality of rule set 𝐡, we prune the length of its rules. Our experimental results indicate that CDCR-P is not only feasible but also achieves high accuracy.

1. Introduction

The Internet of things (IOT) is one of the hot topics of recent years. It integrates many kinds of modern technology, which together produce large-scale data in the IOT. Handling these data requires techniques and methods from data mining and machine learning [1–6]. As one of the most important tasks of data mining, classification has been widely applied in the IOT. The main idea of classification is to build classification rules; according to these rules, we can predict the class label of unknown objects. Traditional rule-based classifiers usually use a greedy approach, such as FOIL [7], CPAR [8], and CMER [9]. These methods repeatedly search for the current best rule or best-π‘˜ rules and remove the examples covered by those rules. They cannot guarantee that every instance is covered by at least two classification rules. As a result, some traditional classifiers have few classification rules, and their accuracy may not be high. Decision tree classifiers produce classification rules by constructing classification trees, such as ID3 [10], C4.5 [11], and TASC [12]. Building a decision tree does not delete any examples, but each example matches exactly one rule in the resulting rule set. That

is why decision trees often generate small rule sets and cannot achieve high accuracy on some data. Aiming at these weaknesses, we propose a novel Double Covered Rule sets classifier called CDCR-P (Classification based on the Pruning and Double Covered Rule sets). CDCR-P generates two different rule sets 𝐴 and 𝐡 and then prunes rule set 𝐡. Each instance is covered by at least one rule from rule set 𝐴 and, at the same time, by at least one rule from rule set 𝐡. CDCR-P has four aspects. First, CDCR-P generates rule set 𝐴: we select the several best values that can just cover the training set to construct a candidate set, and CDCR-P employs this candidate set to produce rule set 𝐴. Second, to induce rule set 𝐡, we remove the values of the candidate set from the training data and select several other best values; rule set 𝐴 is therefore fully different from rule set 𝐡. Third, each instance matches at least two rules, one from rule set 𝐴 and another from rule set 𝐡. Fourth, we prune the length of the rules in rule set 𝐡 so as to improve its quality. Our method has the following advantages. (1) CDCR-P produces two rule sets; thus, CDCR-P can generate a large number of classification rules.
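The double covered property described above can be illustrated with a short sketch. This is a minimal Python illustration, not the paper's implementation: the rule and instance representations, and all names, are assumptions made for the example.

```python
# Hypothetical representation: an instance is a dict of attribute -> value;
# a rule tests a set of attribute-value pairs ("body") and predicts a class.

def matches(rule, instance):
    """A rule matches an instance if every attribute-value pair it tests agrees."""
    return all(instance.get(attr) == val for attr, val in rule["body"].items())

def double_covered(instances, rule_set_a, rule_set_b):
    """True if each instance is covered by at least one rule from each set."""
    return all(
        any(matches(r, x) for r in rule_set_a) and
        any(matches(r, x) for r in rule_set_b)
        for x in instances
    )

train = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"}]
set_a = [{"body": {"outlook": "sunny"}, "class": "yes"},
         {"body": {"outlook": "rain"}, "class": "no"}]
set_b = [{"body": {"windy": "no"}, "class": "yes"},
         {"body": {"windy": "yes"}, "class": "no"}]
print(double_covered(train, set_a, set_b))  # True
```

Here set 𝐡 uses attribute values disjoint from set 𝐴 (windy versus outlook), mirroring the requirement that the two rule sets be fully different while each still covers every instance.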


Input: Training data 𝑇 = {𝑑1, 𝑑2, . . . , 𝑑𝑛}
Output: 𝑇1, candidate value CV1
Method:
(1) 𝑇1 = πœ™, CV1 = πœ™, Length = 0;
(2) compute the information gain of each sample 𝑠;
(3) sort the samples according to information gain in descending order; if two samples have the same information gain, sort them according to support;
(4) Length = 𝑇.Count Γ— 1/3;
(5) while 𝑇 β‰  πœ™
(6) 𝑆 = 𝑆 βˆ’ {π‘Ž}, where π‘Ž is the first element of 𝑆;
(7) for each tuple 𝑑 in 𝑇
(8) while 𝑑.Contains(π‘Ž) && 𝑇1.Count [. . .]

(8) if π‘₯.length > max rule length continue;
(9) compute the information gain of π‘₯;
(10) if π‘₯.information gain == 0
(11) 𝑅 = 𝑅 βˆͺ {π‘₯};
(12) else
(13) dataset = dataset βˆͺ each 𝑑.Contains(π‘₯);
(14) find cv in dataset;
(15) connect π‘₯ with cv;
(16) 𝑄.push();
(17) dataset = πœ™, cv = πœ™;
(18) end if
(19) end while
(20) return 𝑅

Algorithm 3: Inducing rule set 𝐴.
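The value-selection step above ranks candidate attribute values by information gain in descending order, breaking ties by support. The following Python sketch shows one way such a ranking could be computed; the attribute-value representation and all function names are assumptions for illustration, not the paper's code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def info_gain(data, labels, attr, value):
    """Gain from splitting on whether a tuple contains the pair (attr, value)."""
    covered = [y for x, y in zip(data, labels) if x.get(attr) == value]
    rest = [y for x, y in zip(data, labels) if x.get(attr) != value]
    n = len(labels)
    remainder = (len(covered) / n) * entropy(covered) + (len(rest) / n) * entropy(rest)
    return entropy(labels) - remainder

def rank_values(data, labels):
    """All attribute-value pairs, best gain first; ties broken by higher support."""
    values = {(a, v) for x in data for a, v in x.items()}
    support = lambda a, v: sum(1 for x in data if x.get(a) == v)
    return sorted(values, key=lambda av: (-info_gain(data, labels, *av), -support(*av)))
```

For example, on three tuples with values "a", "a", "b" of one attribute, both values have equal gain, so ("f", "a") ranks first because it has the higher support.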

In Step 2, according to V1, V2, . . . , Vπ‘˜, 𝑇1 is split into datasets 𝑑1, 𝑑2, . . . , 𝑑𝑛. We find cv𝑖 (a set of cover values) from 𝑑𝑖 on the basis of information gain; the measure for cv𝑖 is the same as in CVCR, shown as Algorithm 2. CDCR connects V𝑖 with cv𝑖 to produce patterns. If the information gain of a pattern 𝑋 is equal to 0, the rule 𝑋 β†’ 𝐢 belongs to rule set 𝐴, as shown in Algorithm 3. In Step 3, CDCR recalculates the cover values (CV) in 𝑇1, excluding V1, V2, . . . , Vπ‘˜. CV splits 𝑇1 into datasets, and the covered values in each dataset are connected to produce new rules; these rules belong to rule set 𝐡, as shown in Algorithm 4. Finally, we remove 𝑇1 from 𝑇 and iterate the process until 𝑇2, 𝑇3 are trained. Rule set 𝐴 is the same as in CVCR; both rule sets 𝐴 and 𝐡 belong to CDCR.
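The pattern-growing loop shared by Algorithms 3 and 4 (pop a pattern, emit it as a rule when its information gain reaches 0, otherwise connect it with further cover values and push the extensions) can be sketched as follows. This is a hedged reconstruction: purity of the covered tuples stands in for the zero-information-gain test, and all names and the data layout are assumptions.

```python
from collections import Counter, deque

def grow_rules(data, labels, seeds, max_rule_length=3):
    """Breadth-first pattern growth: a pattern (frozenset of attribute-value
    pairs) becomes a rule once every tuple it covers shares one class label;
    otherwise it is extended with pairs seen in the covered tuples."""
    rules, seen = [], set()
    queue = deque(frozenset([s]) for s in seeds)
    while queue:
        pattern = queue.popleft()
        if len(pattern) > max_rule_length:
            continue  # mirrors "if x.length > max rule length continue"
        covered = [(x, y) for x, y in zip(data, labels)
                   if all(x.get(a) == v for a, v in pattern)]
        if not covered:
            continue
        classes = Counter(y for _, y in covered)
        if len(classes) == 1:
            rules.append((pattern, classes.most_common(1)[0][0]))  # pure: emit rule
        else:
            for x, _ in covered:  # connect the pattern with further values
                for item in x.items():
                    extended = pattern | {item}
                    if item not in pattern and extended not in seen:
                        seen.add(extended)
                        queue.append(extended)
    return rules
```

Seeding the queue with the candidate values of set 𝐴 (or, for rule set 𝐡, with cover values excluded from CV𝑖) yields the corresponding rule set under this reading of the algorithms.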

3.2. Pruning Rule Set 𝐡. In order to improve the quality of rule set 𝐡, we introduce a new method, CDCR-P (Classification based on the Pruning and Double Covered Rule sets).

Definition 2 (confidence). The confidence of sample 𝑋 is defined as follows:

conf(𝑋) = count(𝑋𝑐) / count(𝑋) Γ— 100%,  (2)

where count(𝑋𝑐) is the number of tuples that contain sample 𝑋 and belong to class 𝑐.

The confidence of every rule that CDCR generates is equal to 100%. We modify the length of the rules in rule set 𝐡. The rules are


Input: Training data 𝑇𝑖, CV𝑖
Output: Rule set 𝐡
Method:
(1) Rule set 𝑅 = πœ™, dataset = πœ™, cover value cv = πœ™, candidate set cs = πœ™, InitQueue 𝑄;
(2) add the cover values which can cover 𝑇𝑖 to cs, with cs βˆ‰ CV𝑖;
(3) while cs β‰  πœ™
(4) 𝑄.push(π‘Ž), where π‘Ž is the first element of cs;
(5) cs = cs βˆ’ {π‘Ž};
(6) end while
(7) while !𝑄.empty()
(8) π‘₯ = 𝑄.front(); 𝑄.pop();
(9) if π‘₯.length > max rule length continue;
(10) compute the information gain of π‘₯;
(11) if π‘₯.information gain == 0
(12) 𝑅 = 𝑅 βˆͺ {π‘₯};
(13) else
(14) dataset = dataset βˆͺ each 𝑑.Contains(π‘₯);
(15) find cv in dataset, with cv βˆ‰ CV𝑖;
(16) connect π‘₯ with cv;
(17) 𝑄.push();
(18) dataset = πœ™, cv = πœ™;
(19) end if
(20) end while
(21) return 𝑅

Algorithm 4: Inducing rule set 𝐡.

generated when the confidence is 100% in the small dataset 𝑇𝑖 instead of in the whole training set 𝑇. Each rule is marked with its confidence in 𝑇. Thus, rule set 𝐡 in CDCR-P is shorter than rule set 𝐡 in CDCR.

3.3. Classifying Unknown Examples. In this part, we describe how CDCR and CDCR-P classify unknown instances.

Definition 3 (support). The support of sample 𝑋 is denoted by

sup(𝑋) = count(𝑋) / |𝑇| Γ— 100%,  (3)

where count(𝑋) is the number of tuples which contain sample 𝑋, and |𝑇| is the number of tuples in the training data.

When testing unknown examples, CDCR selects the matched rule with the highest support; if several rules have the same support, we select the class with the maximum number of matched rules. CDCR-P first considers the rule with the highest confidence; if two rules have the same confidence, CDCR-P sorts them according to support.

Definition 4 (missing match rate). If a test instance cannot find any matching rule, this unclassified instance is considered a mismatch. The missing match rate is defined as

count(unclassified instance) / |𝑇| Γ— 100%,  (4)

Table 1: Characteristics of UCI datasets.

Dataset     No. of instances   No. of attributes   No. of classes
Balance     625                5                   3
Breast      699                10                  2
Car         1728               7                   4
Lymph       148                18                  4
Monks       432                7                   2
Mushroom    400                23                  2
Soybean     307                36                  19
SPECT       267                23                  2
Tic-tac     958                10                  2
Zoo         101                16                  7
Cleve       303                13                  2
Heart       270                13                  2
Iris        150                4                   3
Wine        178                14                  3

where count(unclassified instance) is the number of tuples that cannot be matched by any rule.
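The classification strategy of Section 3.3 (CDCR-P prefers the matching rule with the highest confidence, breaking ties by support; an instance matched by no rule counts toward the missing match rate of equation (4)) can be sketched in Python. The rule tuple layout and function names here are assumptions for illustration.

```python
def classify(instance, rules):
    """CDCR-P-style prediction: among the rules that match the instance,
    prefer the highest confidence, then the highest support. A rule is an
    assumed (body, class_label, confidence, support) tuple."""
    matched = [r for r in rules
               if all(instance.get(a) == v for a, v in r[0].items())]
    if not matched:
        return None  # unmatched instances count toward the missing match rate
    body, label, conf, sup = max(matched, key=lambda r: (r[2], r[3]))
    return label

def missing_match_rate(instances, rules):
    """Fraction of instances that no rule matches, as in equation (4)."""
    return sum(classify(x, rules) is None for x in instances) / len(instances)
```

For instance, if two rules match an example with confidences 1.0 and 0.9, the class of the first rule is predicted regardless of their supports.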

4. Experiments

We show experimental results on 14 UCI datasets; the characteristics of each dataset are given in Table 1. All experiments were performed on a 2.2 GHz PC with 2.84 GB of main memory running Microsoft Windows XP. Each experiment uses tenfold cross-validation.
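The tenfold cross-validation protocol used above can be sketched as follows; `train_and_eval` is a hypothetical callback standing in for training a classifier on nine folds and measuring its accuracy on the held-out fold.

```python
import random

def tenfold_cv(data, labels, train_and_eval, seed=0):
    """Tenfold cross-validation sketch: shuffle indices once, carve out ten
    disjoint folds, and average the accuracy that the caller-supplied
    train_and_eval(train_pairs, test_pairs) returns for each split."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::10] for k in range(10)]
    accuracies = []
    for k in range(10):
        held_out = set(folds[k])
        train = [(data[i], labels[i]) for i in idx if i not in held_out]
        test = [(data[i], labels[i]) for i in folds[k]]
        accuracies.append(train_and_eval(train, test))
    return sum(accuracies) / len(accuracies)
```

Every instance appears in exactly one test fold, so each accuracy in Tables 2 and 3 summarizes predictions made on data unseen during training.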


Table 2: The accuracy of ID3, FOIL, CVCR, CDCR, and CDCR-P.

Dataset     ID3      FOIL     CVCR     CDCR     CDCR-P
Balance     0.3716   0.4929   0.7810   0.7985   0.8289
Breast      0.9042   0.9342   0.9571   0.9556   0.9585
Car         0.7298   0.7714   0.8837   0.9248   0.8767
Lymph       0.7148   0.7424   0.8181   0.81     0.81
Monks       0.9448   0.8146   0.8959   0.9514   0.9537
Mushroom    0.985    0.995    0.99     0.99     0.99
Soybean     0.4102   0.4172   0.8180   0.8276   0.8601
SPECT       0.7181   0.752    0.7303   0.7453   0.8053
Tic-tac     0.8215   0.9875   0.8684   0.9541   0.9530
Zoo         0.97     0.9409   0.9709   0.9609   0.8918
Cleve       0.7426   0.7423   0.8152   0.8216   0.8482
Heart       0.8148   0.8148   0.7556   0.7852   0.7963
Iris        0.7733   0.9533   0.8133   0.8133   0.8533
Wine        0.96     0.9379   0.983    0.9712   0.9882
Average     0.7758   0.8069   0.8629   0.8793   0.8867

Figure 1: The accuracy of ID3, FOIL, and CVCR.

Figure 2: The accuracy of CVCR, CDCR, and CDCR-P.

In Table 2, we give the accuracy of ID3, FOIL, CVCR, CDCR, and CDCR-P. Figure 1 shows the accuracy of ID3, FOIL, and CVCR. CVCR employs the idea of cover values, which are the globally optimal attribute values in the training data. From Figure 1 and Table 2, we can see that CVCR achieves higher accuracy than ID3 and FOIL. Figure 2 shows the accuracy of CVCR, CDCR, and CDCR-P. CDCR not only uses cover values but also produces the two rule sets 𝐴 and 𝐡, so each instance is matched by at least one rule from rule set 𝐴 and one from rule set 𝐡. From Figure 2 and Table 2, we can see that CDCR achieves higher accuracy than CVCR. On top of the advantages of CVCR and CDCR, CDCR-P prunes the length of the rules in rule set 𝐡. The experimental results show that CDCR-P has the highest accuracy. Table 3 displays the missing match rate of ID3, FOIL, CVCR, CDCR, and CDCR-P. CVCR produces more rules than ID3 and FOIL; from Table 3 we can see that CVCR decreases the missing match rate markedly. CDCR produces two rule sets and therefore more rules than

CVCR. From Table 3 we can see that the missing match rate of CDCR is lower than that of CVCR. CDCR-P modifies the length of the rules in rule set 𝐡, so the quality of its rules is higher than in CDCR; the experiments indicate that the mismatch rate of CDCR-P is the lowest. From all of the above experimental results, we conclude the following: (1) it is necessary to construct two rule sets; (2) it is necessary to prune rule set 𝐡; (3) CDCR-P achieves high accuracy and an excellent missing match rate.

Table 3: The missing match rate of ID3, FOIL, CVCR, CDCR, and CDCR-P.

Dataset     ID3      FOIL     CVCR     CDCR     CDCR-P
Balance     0.422    0.3518   0.0910   0.0799   0.0016
Breast      0.0529   0.0443   0.0029   0        0
Car         0.2165   0.1945   0.0231   0.0156   0
Lymph       0.0624   0.1014   0.0267   0        0
Monks       0        0.1023   0.0486   0.0486   0
Mushroom    0.005    0.005    0        0        0
Soybean     0.0559   0.2787   0.0097   0        0
SPECT       0.0566   0.0634   0.1387   0.1313   0
Tic-tac     0.0366   0.0094   0.0052   0        0
Zoo         0.01     0.0591   0        0        0
Cleve       0.0231   0.0795   0.0298   0.0265   0
Heart       0        0.0148   0.0704   0.0444   0.0111
Iris        0.1733   0.0067   0.1533   0.1533   0.06
Wine        0        0.0454   0.0056   0        0
Average     0.0796   0.0969   0.0432   0.0357   0.0052

5. Conclusions

Classification has been widely applied in the IOT, and accuracy is an important factor in the classification task. Traditional rule-based classifiers cannot guarantee that every test case is matched by two rules; they usually generate fewer classification rules, so their accuracy may be low on some data. In this paper, a novel approach, CDCR-P, is proposed. CDCR-P generates two rule sets: rule set 𝐴 and rule set 𝐡. Every instance can be matched by at least one rule not only in rule set 𝐴 but also in rule set 𝐡. This method greatly increases the number of extracted rules and thus extracts more information from the training data. Our experimental results show that CDCR-P produces more rules and achieves high accuracy. In future research, we will perform an in-depth study on combining distributed data mining with the IOT in order to improve the efficiency of CDCR-P.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is funded by the China NSF program (no. 61170129), the Fujian Province NSF program (no. 2013J01259), and the Minnan Normal University Postgraduate Education Project (no. 13001314).

References

[1] L. Hu, Z. Zhang, F. Wang, and K. Zhao, "Optimization of the deployment of temperature nodes based on linear programming in the internet of things," Tsinghua Science and Technology, pp. 250–258, 2013.
[2] M. A. Feki, F. Kawsar, M. Boussard, and L. Trappeniers, "The internet of things: the next technological revolution," IEEE Computer, vol. 46, no. 2, pp. 24–25, 2013.
[3] W. Zhu, J. Yu, and T. Wang, "A security and privacy model for mobile RFID systems in the internet of things," in Proceedings of the IEEE 14th International Conference on Communication Technology (ICCT '12), pp. 726–732, November 2012.
[4] C. K. Tham and T. Luo, "Sensing-driven energy purchasing in smart grid cyber-physical system," IEEE Transactions on Systems, Man, and Cybernetics A, vol. 43, no. 4, pp. 773–784, 2013.
[5] X. Xing, J. Wang, and M. Li, "Services and key technologies of the internet of things," ZTE Communications, no. 2, pp. 26–29, 2010.
[6] Z. Qureshi, J. Bansal, and S. Bansal, "A survey on association rule mining in cloud computing," International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 4, pp. 318–321, 2013.
[7] J. R. Quinlan and R. M. Cameron-Jones, "FOIL: a midterm report," in Proceedings of the European Conference on Machine Learning, pp. 3–20, Vienna, Austria, 1993.
[8] X. Yin and J. Han, "CPAR: classification based on predictive association rules," in Proceedings of the SIAM International Conference on Data Mining, May 2003.
[9] X. Wang, Z. Zhou, and G. Pan, "CMER: classification based on multiple excellent rules," Journal of Theoretical and Applied Information Technology, pp. 661–665, 2013.
[10] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[11] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Series in Machine Learning, Morgan Kaufmann, 1993.
[12] Y.-L. Chen, W.-H. Hsu, and Y.-H. Lee, "TASC: two-attribute-set clustering through decision tree construction," European Journal of Operational Research, vol. 174, no. 2, pp. 930–944, 2006.