Hindawi Publishing Corporation
The Scientific World Journal
Volume 2014, Article ID 984375, 6 pages
http://dx.doi.org/10.1155/2014/984375
Research Article
Classification Based on Pruning and Double Covered Rule Sets for the Internet of Things Applications

Shasha Li, Zhongmei Zhou, and Weiping Wang
Department of Computer Science and Engineering, Minnan Normal University, Zhangzhou 363000, China

Correspondence should be addressed to Shasha Li; [email protected]

Received 14 October 2013; Accepted 25 November 2013; Published 5 January 2014

Academic Editors: W. Sun, G. Zhang, and J. Zhou

Copyright © 2014 Shasha Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Internet of things (IOT) has been a hot issue in recent years. IOT users accumulate large amounts of data, which makes mining useful knowledge from IOT a great challenge. Classification is an effective strategy for predicting the needs of users in IOT. However, many traditional rule-based classifiers cannot guarantee that every instance is covered by at least two classification rules. Thus, these algorithms cannot achieve high accuracy on some datasets. In this paper, we propose a new rule-based classifier, CDCR-P (Classification based on the Pruning and Double Covered Rule sets). CDCR-P induces two different rule sets A and B. Every instance in the training set is covered by at least one rule in rule set A and by at least one rule in rule set B. In order to improve the quality of rule set B, we prune the length of the rules in rule set B. Our experimental results indicate that CDCR-P is feasible and achieves high accuracy.
1. Introduction
The Internet of things has been one of the hot topics in recent years. It integrates many kinds of modern technology, and these technologies produce large-scale data in IOT. Handling such large data requires techniques and methods from data mining and machine learning [1–6]. As one of the most important tasks of data mining, classification has been widely applied in IOT. The main idea of classification is to build classification rules. According to these rules, we can predict the class label of unknown objects.

Traditional rule-based classifiers, such as FOIL [7], CPAR [8], and CMER [9], usually use a greedy approach. These methods repeatedly search for the current best rule or best-k rules and remove the examples covered by those rules. They cannot guarantee that every instance is covered by at least two classification rules. As a result, some traditional classifiers have fewer classification rules, and their accuracy may not be high. Decision tree classifiers, such as ID3 [10], C4.5 [11], and TASC [12], produce classification rules by constructing classification trees. The process of building a decision tree does not need to delete any examples. Each example finds only one matching rule in the classification rule set. That is why decision trees often generate small rule sets and cannot achieve high accuracy on some data.

Aiming at these weaknesses, we propose a novel Double Covered Rule sets classifier called CDCR-P (Classification based on the Pruning and Double Covered Rule sets). CDCR-P generates two different rule sets A and B and then prunes rule set B. Each instance is covered by at least one rule from rule set A. At the same time, each instance is covered by at least one rule from rule set B.

CDCR-P has four aspects. First, CDCR-P generates rule set A. We select the several best values which can just cover the training set to construct a candidate set. CDCR-P employs the candidate set to produce rule set A. Second, in order to induce rule set B, we remove the values of the candidate set from the training data and select several other best values to induce rule set B. Rule set A is therefore fully different from rule set B. Third, each instance can find at least two matching rules. One of the rules is from rule set A, and the other is from rule set B. Fourth, we prune the length of the rules in rule set B, so as to improve the quality of rule set B.

Our method has the following advantages. (1) CDCR-P produces two rule sets. Thus, CDCR-P can generate a large number of classification rules.
Input: Training data T = {t1, t2, ..., tn}
Output: T1, candidate value CV1
Method:
(1) T1 = ∅, CV1 = ∅, Length = 0;
(2) compute the information gain of each sample s;
(3) sort S according to the information gain in descending order. If two samples have the same information gain, sort the two samples according to the support;
(4) Length = T.Count × 1/3;
(5) while S ≠ ∅
(6)   S = S − {s}, where s is the first element of S;
(7)   for each tuple t in T
(8)     while t.Contains(s) && T1.Count [...]

[...]
(8)   if x.length > max rule length continue;
(9)   compute the information gain of x;
(10)  if x.information gain == 0
(11)    R = R ∪ {x};
(12)  else
(13)    dataset = dataset ∪ each t.Contains(x);
(14)    find cv in dataset;
(15)    connect x with cv;
(16)    Q.push();
(17)    dataset = ∅, cv = ∅;
(18)  end if
(19) end while
(20) return R

Algorithm 3: Inducing rule set A.
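The queue-driven loop of the listing can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: a tuple is modelled as a pair of an item set and a class label, the test "information gain == 0" is read as "all covered tuples share one class", and a pattern is extended with every item that co-occurs in its covered tuples rather than with a selected set of cover values. The names `covered`, `induce_rules`, and `max_len` are hypothetical.

```python
from collections import deque


def covered(rows, pattern):
    """Return the rows whose item set contains every item of the pattern."""
    return [(items, label) for items, label in rows if pattern <= items]


def induce_rules(rows, seeds, max_len=3):
    """Grow patterns breadth-first until their covered rows share one class."""
    rules = []
    queue = deque(frozenset([seed]) for seed in seeds)
    seen = set(queue)
    while queue:
        pattern = queue.popleft()
        if len(pattern) > max_len:      # same guard as "x.length > max rule length"
            continue
        subset = covered(rows, pattern)
        if not subset:
            continue
        labels = {label for _, label in subset}
        if len(labels) == 1:            # assumption: stands in for "information gain == 0"
            rules.append((pattern, labels.pop()))
            continue
        # Assumption: extend with every co-occurring item, not a chosen cover-value set.
        extensions = set().union(*(items for items, _ in subset)) - pattern
        for item in sorted(extensions):
            grown = pattern | {item}
            if grown not in seen:
                seen.add(grown)
                queue.append(grown)
    return rules


rows = [
    (frozenset({("outlook", "sunny"), ("windy", "no")}), "play"),
    (frozenset({("outlook", "sunny"), ("windy", "yes")}), "stay"),
    (frozenset({("outlook", "rain"), ("windy", "no")}), "stay"),
    (frozenset({("outlook", "rain"), ("windy", "yes")}), "stay"),
]
rules = induce_rules(rows, seeds=[("outlook", "sunny"), ("outlook", "rain")])
```

In this toy data the seed `("outlook", "rain")` is already pure and becomes a one-item rule, while `("outlook", "sunny")` has to be extended by a `windy` value before its covered tuples agree on a class.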
Step 2: according to v1, v2, ..., vn, T1 is split into datasets t1, t2, ..., tn. We find cvi (a set of cover values) from ti on the basis of information gain. The measure of cvi is the same as in CVCR, as shown in Algorithm 2. CDCR connects vi with cvi to produce patterns. If the information gain of a pattern p is equal to 0, the rule p → C belongs to rule set A, as shown in Algorithm 3.

Step 3: CDCR recalculates the CV (cover values) in T1 excluding v1, v2, ..., vn. CV splits T1 into several datasets, and CDCR connects the cover values in each dataset to produce new rules. These rules belong to rule set B, as shown in Algorithm 4.

Finally, we remove T1 from T and iterate the process until T2 and T3 are trained. Rule set A is the same as the rule set of CVCR. Both rule sets A and B belong to CDCR.
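The idea of choosing one set of cover values and then a second, disjoint set can be sketched as follows. This is only an illustration under stated assumptions: the ranking of values (by information gain, then support) is taken as a precomputed input, tuples are modelled as plain sets of value names, and `select_cover_values` is a hypothetical helper rather than a function from the paper.

```python
def select_cover_values(rows, ranked_values, excluded=frozenset()):
    """Pick values in ranked order until every row contains at least one pick.

    Returns the chosen values and the indices of rows left uncovered.
    """
    chosen = []
    uncovered = list(range(len(rows)))
    for value in ranked_values:
        if value in excluded:           # values already used for the first set
            continue
        hit = [i for i in uncovered if value in rows[i]]
        if hit:
            chosen.append(value)
            uncovered = [i for i in uncovered if i not in hit]
        if not uncovered:
            break
    return chosen, uncovered


rows = [
    {"a1", "b1", "c1"},
    {"a1", "b2", "c1"},
    {"a2", "b2", "c2"},
]
# Assumption: this order stands in for a ranking by information gain and support.
ranked = ["a1", "b2", "c1", "c2", "a2", "b1"]

cv_a, left_a = select_cover_values(rows, ranked)
cv_b, left_b = select_cover_values(rows, ranked, excluded=set(cv_a))
```

Because the second call skips everything in `cv_a`, the two sets share no value, which mirrors the requirement that rule set B is built from values other than those used for rule set A. A non-empty `left_b` would signal that some tuple cannot be covered a second time with the remaining values.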
Input: Training data Ti, CVi
Output: Rule set B
Method:
(1) Rule set R = ∅, dataset = ∅, cover value cv = ∅, candidate set cs = ∅, init queue Q;
(2) add the cover values which can cover Ti to cs, and cs ∉ CVi;
(3) while cs ≠ ∅
(4)   Q.push(s), where s is the first element of cs;
(5)   cs = cs − {s};
(6) end while
(7) while !Q.empty()
(8)   x = Q.front(); Q.pop();
(9)   if x.length > max rule length continue;
(10)  compute the information gain of x;
(11)  if x.information gain == 0
(12)    R = R ∪ {x};
(13)  else
(14)    dataset = dataset ∪ each t.Contains(x);
(15)    find cv in dataset, and cv ∉ CVi;
(16)    connect x with cv;
(17)    Q.push();
(18)    dataset = ∅, cv = ∅;
(19)  end if
(20) end while
(21) return R

Algorithm 4: Inducing rule set B.

3.2. Pruning Rule Set B. In order to improve the quality of rule set B, we introduce a new method, CDCR-P (Classification based on the Pruning and Double Covered Rule sets).

Definition 2 (confidence). The confidence of sample s is defined as follows:

$$\mathrm{conf}(s) = \frac{\mathrm{count}(s_c)}{\mathrm{count}(s)} \times 100\%, \tag{2}$$

where $\mathrm{count}(s_c)$ means the number of tuples which contain sample s in class c. The confidence of every rule that CDCR generates is equal to 100%. CDCR-P modifies the length of the rules in rule set B: the rules are generated when the confidence is 100% in the small dataset Ti instead of in the whole training set T. Each rule is marked with its confidence in T. Thus, the rules of rule set B in CDCR-P are shorter than those of rule set B in CDCR.

3.3. Classifying Unknown Examples. In this part, we describe how to use CDCR and CDCR-P to classify unknown instances.

Definition 3 (support). The support of sample s is denoted by

$$\mathrm{sup}(s) = \frac{\mathrm{count}(s)}{|T|} \times 100\%, \tag{3}$$

where count(s) means the number of tuples which contain sample s and |T| is the number of tuples in the training data.

When testing unknown examples, CDCR selects the matched rule with the highest support. If some rules have the same support, we select the class with the maximum number of matched rules. CDCR-P first considers the rule with the highest confidence. If two rules have the same confidence, CDCR-P sorts the two rules according to the support.

Definition 4 (missing match rate). If a test instance cannot find any matching rule, this unclassified instance is considered a mismatch. The missing match rate is defined as

$$\frac{\mathrm{count}(\text{unclassified instance})}{|T|} \times 100\%, \tag{4}$$

where count(unclassified instance) means the number of tuples which cannot be matched by any rule.

Table 1: Characteristics of UCI datasets.

Dataset    No. of instances   No. of attributes   No. of classes
Balance    625                5                   3
Breast     699                10                  2
Car        1728               7                   4
Lymph      148                18                  4
Monks      432                7                   2
Mushroom   400                23                  2
Soybean    307                36                  19
SPECT      267                23                  2
Tic-tac    958                10                  2
Zoo        101                16                  7
Cleve      303                13                  2
Heart      270                13                  2
Iris       150                4                   3
Wine       178                14                  3
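Definitions 2 to 4 and the CDCR-P selection order (confidence first, support as the tie-breaker) can be written down as a short Python sketch. It is an illustration only: tuples are modelled as pairs of an item set and a class label, a rule is a hypothetical 4-tuple `(pattern, label, confidence, support)`, and the missing match rate is divided by the number of tested tuples, which is an assumption about the denominator in (4).

```python
def count(rows, pattern, label=None):
    """Number of rows containing the pattern, optionally within one class."""
    return sum(1 for items, y in rows
               if pattern <= items and (label is None or y == label))


def confidence(rows, pattern, label):
    """conf = count(pattern in class) / count(pattern) x 100%."""
    total = count(rows, pattern)
    return 100.0 * count(rows, pattern, label) / total if total else 0.0


def support(rows, pattern):
    """sup = count(pattern) / number of training rows x 100%."""
    return 100.0 * count(rows, pattern) / len(rows)


def classify(rules, items):
    """Return the label of the best matching rule, or None if nothing matches."""
    matched = [rule for rule in rules if rule[0] <= items]
    if not matched:
        return None
    # Highest confidence first; equal confidence is resolved by support.
    return max(matched, key=lambda rule: (rule[2], rule[3]))[1]


def missing_match_rate(rules, test_rows):
    """Share of tested rows (assumed denominator) that no rule matches, in percent."""
    missed = sum(1 for items, _ in test_rows if classify(rules, items) is None)
    return 100.0 * missed / len(test_rows)


train = [
    (frozenset({"a1", "b1"}), "yes"),
    (frozenset({"a1", "b2"}), "yes"),
    (frozenset({"a2", "b2"}), "no"),
    (frozenset({"a2", "b1"}), "yes"),
]
rules = [
    (p, y, confidence(train, p, y), support(train, p))
    for p, y in [(frozenset({"a1"}), "yes"), (frozenset({"b2"}), "no")]
]
```

Here the instance `{"a1", "b2"}` matches both rules; the rule on `a1` wins because its confidence on the training data is higher, and the instance `{"a2", "b1"}` matches no rule and counts towards the missing match rate.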
4. Experiments
We report experimental results on 14 UCI datasets. The characteristics of each dataset are shown in Table 1. All the experiments are performed on a 2.2 GHz PC with 2.84 G main memory, running Microsoft Windows XP. Tenfold cross-validation is used for each dataset.
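A tenfold cross-validation loop of the kind mentioned above might look like the following sketch. The paper does not say how the folds were formed, so the shuffling, the fixed seed, and the absence of stratification are assumptions, and `k_fold_splits`, `cross_validated_accuracy`, `fit`, and `predict` are hypothetical names. The dataset must contain at least k tuples.

```python
import random


def k_fold_splits(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # assumption: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validated_accuracy(rows, fit, predict, k=10):
    """Average accuracy over k folds; fit and predict are supplied by the caller."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        hits = sum(1 for i in test_idx
                   if predict(model, rows[i][0]) == rows[i][1])
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

For a dataset the size of Lymph (148 tuples), this split produces folds of 14 or 15 test tuples, and every tuple is used for testing exactly once.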
Table 2: The accuracy of ID3, FOIL, CVCR, CDCR, and CDCR-P.

Dataset    ID3      FOIL     CVCR     CDCR     CDCR-P
Balance    0.3716   0.4929   0.7810   0.7985   0.8289
Breast     0.9042   0.9342   0.9571   0.9556   0.9585
Car        0.7298   0.7714   0.8837   0.9248   0.8767
Lymph      0.7148   0.7424   0.8181   0.81     0.81
Monks      0.9448   0.8146   0.8959   0.9514   0.9537
Mushroom   0.985    0.995    0.99     0.99     0.99
Soybean    0.4102   0.4172   0.8180   0.8276   0.8601
SPECT      0.7181   0.752    0.7303   0.7453   0.8053
Tic-tac    0.8215   0.9875   0.8684   0.9541   0.9530
Zoo        0.97     0.9409   0.9709   0.9609   0.8918
Cleve      0.7426   0.7423   0.8152   0.8216   0.8482
Heart      0.8148   0.8148   0.7556   0.7852   0.7963
Iris       0.7733   0.9533   0.8133   0.8133   0.8533
Wine       0.96     0.9379   0.983    0.9712   0.9882
Average    0.7758   0.8069   0.8629   0.8793   0.8867

Figure 1: The accuracy of ID3, FOIL, and CVCR (accuracy plotted per dataset).

Figure 2: The accuracy of CVCR, CDCR, and CDCR-P (accuracy plotted per dataset).
In Table 2, we give the accuracy of ID3, FOIL, CVCR, CDCR, and CDCR-P. Figure 1 gives the accuracy of ID3, FOIL, and CVCR. CVCR employs the idea of cover values; these cover values are the globally optimal attribute values in the training data. From Figure 1 and Table 2 we can see that CVCR achieves a higher average accuracy (0.8629) than ID3 (0.7758) and FOIL (0.8069). Figure 2 gives the accuracy of CVCR, CDCR, and CDCR-P. CDCR not only uses the method of cover values, but also produces two rule sets A and B. Each instance can be matched by at least one rule from rule set A and at least one rule from rule set B. From Figure 2 and Table 2 we can see that CDCR achieves a higher average accuracy (0.8793) than CVCR. Building on the advantages of CVCR and CDCR, CDCR-P prunes the length of the rules in rule set B. The experimental results show that CDCR-P has the highest average accuracy (0.8867).

Table 3 displays the missing match rate of ID3, FOIL, CVCR, CDCR, and CDCR-P. CVCR can produce more rules than ID3 and FOIL. From Table 3 we can see that CVCR clearly decreases the average missing match rate (0.0432, compared with 0.0796 for ID3 and 0.0969 for FOIL). CDCR produces two rule sets. Therefore, CDCR produces more rules than CVCR. From Table 3 we can see that the average missing match rate of CDCR (0.0357) is lower than that of CVCR. CDCR-P modifies the length of the rules in rule set B, so the quality of the rules in CDCR-P is higher than in CDCR. The experiments indicate that the average missing match rate of CDCR-P (0.0052) is the lowest.

From all the above experimental results, we can conclude the following. (1) It is necessary to construct two rule sets. (2) It is necessary to prune rule set B. (3) CDCR-P can achieve high accuracy and has an excellent result in missing match rate.
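The averages quoted from Table 2 can be recomputed from the per-dataset values with a few lines of Python. The numbers below are copied from the CVCR, CDCR, and CDCR-P columns of Table 2 in dataset order (Balance through Wine); the variable names are illustrative only.

```python
from statistics import mean

# Per-dataset accuracy from Table 2, in the order Balance ... Wine (14 datasets).
accuracy = {
    "CVCR": [0.7810, 0.9571, 0.8837, 0.8181, 0.8959, 0.99, 0.8180,
             0.7303, 0.8684, 0.9709, 0.8152, 0.7556, 0.8133, 0.983],
    "CDCR": [0.7985, 0.9556, 0.9248, 0.81, 0.9514, 0.99, 0.8276,
             0.7453, 0.9541, 0.9609, 0.8216, 0.7852, 0.8133, 0.9712],
    "CDCR-P": [0.8289, 0.9585, 0.8767, 0.81, 0.9537, 0.99, 0.8601,
               0.8053, 0.9530, 0.8918, 0.8482, 0.7963, 0.8533, 0.9882],
}

averages = {name: mean(values) for name, values in accuracy.items()}

# Dataset-by-dataset comparison of CDCR-P against CDCR.
pairs = list(zip(accuracy["CDCR-P"], accuracy["CDCR"]))
higher = sum(p > c for p, c in pairs)
equal = sum(p == c for p, c in pairs)
lower = sum(p < c for p, c in pairs)
```

On these figures the recomputed means agree with the Average row of Table 2 to within rounding. Per dataset, CDCR-P is strictly higher than CDCR on 9 of the 14 datasets, ties on 2 (Lymph and Mushroom), and is lower on 3 (Car, Tic-tac, and Zoo), so the advantage of CDCR-P is an average one rather than a uniform one.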
Table 3: The missing match rate of ID3, FOIL, CVCR, CDCR, and CDCR-P.

Dataset    ID3      FOIL     CVCR     CDCR     CDCR-P
Balance    0.422    0.3518   0.0910   0.0799   0.0016
Breast     0.0529   0.0443   0.0029   0        0
Car        0.2165   0.1945   0.0231   0.0156   0
Lymph      0.0624   0.1014   0.0267   0        0
Monks      0        0.1023   0.0486   0.0486   0
Mushroom   0.005    0.005    0        0        0
Soybean    0.0559   0.2787   0.0097   0        0
SPECT      0.0566   0.0634   0.1387   0.1313   0
Tic-tac    0.0366   0.0094   0.0052   0        0
Zoo        0.01     0.0591   0        0        0
Cleve      0.0231   0.0795   0.0298   0.0265   0
Heart      0        0.0148   0.0704   0.0444   0.0111
Iris       0.1733   0.0067   0.1533   0.1533   0.06
Wine       0        0.0454   0.0056   0        0
Average    0.0796   0.0969   0.0432   0.0357   0.0052

5. Conclusions
Classification has been widely applied in IOT. Accuracy is an important factor in a classification task. Traditional rule-based classifiers cannot guarantee that every test case is matched by two rules, and they usually generate fewer classification rules. Thus, the accuracy of these algorithms may be low on some data. In this paper, a novel approach, CDCR-P, is proposed. CDCR-P generates two rule sets: rule set A and rule set B. Every instance is matched by at least one rule in rule set A and by at least one rule in rule set B. This method greatly increases the number of extracted rules. Thus, it gets more information from the training data. Our experimental results show that CDCR-P can produce more rules and achieve high accuracy. In future research, we will perform an in-depth study on combining distributed data mining with IOT in order to improve the efficiency of CDCR-P.

Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments
This work is funded by China NFS program (no. 61170129), Fujian Province NSF Program (no. 2013J01259), and Minnan Normal University Postgraduate Education Project (no. 13001314).

References
[1] L. Hu, Z. Zhang, F. Wang, and K. Zhao, "Optimization of the deployment of temperature nodes based on linear programing in the internet of things," Tsinghua Science and Technology, pp. 250–258, 2013.
[2] M. A. Feki, F. Kawsar, M. Boussard, and L. Trappeniers, "The internet of things: the next technological revolution," IEEE Computer, vol. 46, no. 2, pp. 24–25, 2013.
[3] W. Zhu, J. Yu, and T. Wang, "A Security and privacy model for mobile RFID systems in the internet of things," in Proceedings of the IEEE 14th International Conference on Communication Technology (ICCT '12), pp. 726–732, November 2012.
[4] C. K. Tham and T. Luo, "Sensing-driven energy purchasing in smart grid cyber-physical system," IEEE Transactions in Systems, Man and Cybernetics A, vol. 43, no. 4, pp. 773–784, 2013.
[5] X. Xing, J. Wang, and M. Li, "Services and key technologies of the internet of things," ZTE Communications, no. 2, pp. 26–29, 2010.
[6] Z. Qureshi, J. Bansal, and S. Bansal, "A survey on association rule mining in cloud computing," International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 4, pp. 318–321, 2013.
[7] J. R. Quinlan and R. M. Cameron-Jones, "FOIL: a midterm report," in Proceedings of the European Conference Machine Learning, pp. 3–20, Vienna, Austria.
[8] X. Yin and J. Han, "CPAR: classification based on predictive association rules," in Data Mining, The SIAM (Society for Industrial and Applied Mathematics) International Conference, May 2003.
[9] X. Wang, Z. Zhou, and G. Pan, "CMER: classification based on multiple excellent rules," Journal of Theoretical and Applied Information Technology, pp. 661–665, 2013.
[10] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[11] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Series in Machine Learning, Kaufmann Academic, 1993.
[12] Y.-L. Chen, W.-H. Hsu, and Y.-H. Lee, "TASC: two-attribute-set clustering through decision tree construction," European Journal of Operational Research, vol. 174, no. 2, pp. 930–944, 2006.