Int. Journal of Math. Analysis, Vol. 6, 2012, no. 51, 2511 - 2518
Classification of Census Using Information Theoretic Measure Based ID3 Algorithm

Kumar Ashok
Department of Applied Sciences, Chitkara University, Solan, HP, India

Taneja H C
Department of Applied Mathematics, Delhi Technological University, Delhi, India

Chitkara Ashok K
Department of Applied Sciences, Chitkara University, Solan, HP, India

Kumar Vikas
Department of Applied Mathematics, Delhi Technological University, Delhi, India

Abstract

In this paper the concepts of information theory are applied to data mining. Classification is an essential step in data mining algorithms, and by using an information theoretic measure in the ID3 algorithm, one of the key decision tree algorithms, we discuss the different steps in the development of the decision tree so that the best classification criterion can be obtained, which is helpful in making good decisions. From the data under consideration, the property selected on the basis of this calculation becomes the root of the tree, and the process is repeated to develop the complete decision tree. Further, the method is applied to the Census-2011 data of India to obtain results useful in framing or implementing a policy and in selecting the right policy for the right people.

Mathematics Subject Classification: 94A17, 68P20

Keywords: Information Measure, Data Mining, ID3 Algorithm, Decision Tree
1 Introduction
Shannon [2] introduced the concept of a measure of information, or entropy, for a general finite complete probability distribution P = (p_1, p_2, p_3, \ldots, p_n), given by

H(X) = -\sum_{i=1}^{n} p_i \log p_i, \qquad 0 \le p_i \le 1, \qquad \sum_{i=1}^{n} p_i = 1. \qquad (1)
The notion of entropy is of fundamental importance in different areas such as physics, probability and statistics, communication theory and economics. Shannon entropy plays an important role in the context of information theory. To date, one of its most beneficial applications has been in data compression and transmission. Since the pioneering work of Shannon [2], the concept of entropy has been generalized in a number of different ways by different researchers. Entropy and its various generalizations are widely used in mathematical statistics, communication theory and the physical and computer sciences for characterizing the amount of information in a probability distribution, refer to [11]. The Renyi entropy [1] is an additive generalization of Shannon entropy defined as
H_{\alpha}(X) = \frac{1}{1-\alpha} \log \sum_{i=1}^{n} p_i^{\alpha}; \qquad \alpha \neq 1, \ \alpha \ge 0. \qquad (2)
It has properties similar to those of Shannon entropy, but it contains an additional parameter α which can be used to make it more or less sensitive to the shape of the probability distribution. For large positive values of α the measure is more sensitive to events that occur often, while for smaller values of α it is more sensitive to events that happen seldom. An active area of current research in the application of entropy is data mining, refer to [6]. In data mining algorithms, classification is an important task. The decision tree algorithm, one of the key algorithms, is commonly used to build predictive models for classification. Using different measures of information, the classification algorithms of data mining can be modified to obtain more accurate results. On the basis of the information gained, one property is selected as the root of the tree and the process is then repeated for its subtrees. In the decision tree algorithm, information gain plays an important role in identifying the appropriate attribute for every node of the tree. This information gain is usually obtained using the concept of Shannon entropy in the algorithms employed, refer to [8]. The most widely used decision tree algorithms are ID3 and C4.5. These algorithms produce a decision tree by calculating the entropies of attributes; the values of each attribute are then used to derive generalized rules from a given sample set. The objective is to analyse data with specific constraints to learn a model, and then to classify examples of unknown class [5]. In this paper we use the Renyi entropy with α = 2 in the ID3 algorithm.
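As an illustration (not part of the original development), the short Python sketch below computes the Shannon entropy (1) and the Renyi entropy (2) of a finite probability distribution; the function names and the example distribution are our own.

```python
import math

def shannon_entropy(p, base=2):
    """Shannon entropy (1): H(X) = -sum p_i log p_i (terms with p_i = 0 are skipped)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def renyi_entropy(p, alpha, base=2):
    """Renyi entropy (2): H_alpha(X) = (1/(1-alpha)) log sum p_i^alpha, for alpha >= 0, alpha != 1."""
    if alpha == 1:
        return shannon_entropy(p, base)   # Renyi entropy tends to Shannon entropy as alpha -> 1
    return math.log(sum(pi ** alpha for pi in p if pi > 0), base) / (1.0 - alpha)

p = [0.8, 0.2]
print(shannon_entropy(p))          # about 0.722
print(renyi_entropy(p, alpha=2))   # about 0.556; the alpha = 2 measure used in this paper
```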
2 ID3 Algorithm
The ID3 algorithm can be described as follows.

Algorithm: Procedure Build_Decision_Tree generates a decision tree from the given sample data.
Input: D, the specific data, described by discrete values; attribute_list, the candidate property set.

The pseudocode of the decision tree algorithm is:

Create Node N;
Procedure Build_Decision_Tree()
    If D belongs to class S then
        Return N as a leaf node labeled by S;
    If attribute_list is null then
        Return N as a leaf node labeled by class U in D;
    For each attribute in attribute_list, compute the information gain G;
    Select the attribute with maximum G in attribute_list as the test_attribute of N;
    Set node N as test_attribute;
    Set s_i as the set of examples in D with test_attribute = a_i;
    If s_i is null then
        Add a leaf node, labeled by the normal class in D;
    Else
        Recursively call Build_Decision_Tree(s_i, test_attribute);

For more details about this algorithm refer to [9].
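The following Python sketch mirrors the recursive procedure above under an assumed data representation (each record is a pair of a feature dictionary and a class label); the gain function is passed in, for example the Renyi-based gain of Section 3, and the names used here are illustrative rather than part of the original algorithm statement.

```python
from collections import Counter

def build_decision_tree(records, attributes, gain):
    """records: list of (features_dict, label) pairs; attributes: candidate property list;
    gain: callable (records, attribute) -> information gain G."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                 # D belongs to a single class: return a leaf
        return labels[0]
    if not attributes:                        # attribute list is empty: label by the majority class
        return Counter(labels).most_common(1)[0][0]
    # choose the test attribute with maximal information gain
    test_attribute = max(attributes, key=lambda a: gain(records, a))
    node = {"attribute": test_attribute, "branches": {}}
    remaining = [a for a in attributes if a != test_attribute]
    for value in {feats[test_attribute] for feats, _ in records}:
        subset = [(f, lab) for f, lab in records if f[test_attribute] == value]
        node["branches"][value] = build_decision_tree(subset, remaining, gain)
    return node
```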
3 Algorithm of Information Gain
Information gain is used in the decision tree algorithm to identify the appropriate attribute for every node. Using the concept of information entropy, the ID3 algorithm selects, as the test property, the property with maximum information gain in the given data set, which has some given classes with completely observed values. The selected attribute yields the minimum value of entropy, that is, the minimum randomness, in the resulting classification of the example data, refer to [5]. This reduces the number of steps needed for the classification and helps to find a simple tree. Let us consider specific data 'D' having 'n' different values, with 'R_i', i = 1, 2, 3, ..., r, ranges for each property 'P_i', i = 1, 2, 3, ..., p, and suppose the data is to be divided into 'C_i', i = 1, 2, 3, ..., c, classes. Now, the quantity of information required for the object data is calculated using the Renyi entropy, as follows:

I(C_1, C_2, C_3, \cdots, C_c) = \frac{1}{1-\alpha} \log_2 \sum_{i=1}^{c} p_i^{\alpha}. \qquad (3)
Suppose the property 'P_i' is selected from the set. Using this property, 'D' can be divided into sets according to its ranges 'R_i', which together contain the same values as 'D', and the entropy of the particular property is calculated by the following equation, based upon the total number of values in 'D'. If 'r_j' is the number of values in the range 'R_j' of property 'P_i', then the entropy of 'P_i' is given as:

E(P_i) = \sum_{j=1}^{r} \frac{r_j}{n} \cdot \frac{1}{1-\alpha} \log_2 \sum_{k=1}^{c} p_k^{\alpha}, \qquad (4)

where the inner sum runs over the class proportions within the range R_j.
Here, ’r’ represents the number of ranges and ’ri ’ number of different values, in particular property and thus the net gained information from the property ’Pi ’ is (5) G(Pi ) = I(C1 , C2 , C, · · · , Cc ) − E(Pi ) Information acquisitions are calculated individually and accordingly entropy values are calculated. The property with the highest knowledge acquisition is selected as the root of the tree. Other properties are rearranged accordingly. The same procedure is repeated for other sub sets. Rules are formed on the basis of final decision tree. The classification mainly deals with extraction of information of the system and its systematic development. To achieve this objective the best solution is the process in which entropy is least.
4 Classification of Census 2011 of India using ID3 algorithm
For α = 2, using the Renyi entropy measure (2) on the following data from the 2011 Census of India, we obtain the results for the decision tree. Here 's' represents 'State' and 'u' represents 'Union Territory'. Table 1 contains the data of Census 2011 of India taken from the source http://www.censusindia.gov.in. The complete data is divided into two parts, one as States ('S') and the other as Union Territories ('U'), on the basis of four properties.
Table 1: Census-2011 INDIA

Sr. no.  Density      Population  Sex Ratio   Literacy  S\U
1        < 120        < 14        < 900       > 85      U
2        271 ∼ 550    > 605       > 974       < 72      S
3        < 120        < 14        900 ∼ 946   < 72      S
4        271 ∼ 550    66 ∼ 330    947 ∼ 963   72 ∼ 78   S
5        551 ∼ 2170   > 605       900 ∼ 946   < 72      S
6        > 2170       < 14        < 900       > 85      U
7        120 ∼ 270    66 ∼ 330    > 974       < 72      S
8        551 ∼ 2170   < 14        < 900       72 ∼ 78   U
9        551 ∼ 2170   < 14        < 900       > 85      U
10       271 ∼ 550    14 ∼ 65     964 ∼ 974   > 85      S
11       271 ∼ 550    331 ∼ 605   900 ∼ 946   79 ∼ 81   S
12       551 ∼ 2170   66 ∼ 330    < 900       72 ∼ 78   S
13       120 ∼ 270    66 ∼ 330    964 ∼ 974   82 ∼ 85   S
14       120 ∼ 270    66 ∼ 330    < 900       < 72      S
15       271 ∼ 550    66 ∼ 330    947 ∼ 963   < 72      S
16       271 ∼ 550    > 605       964 ∼ 974   72 ∼ 78   S
17       551 ∼ 2170   331 ∼ 605   > 974       > 85      S
18       551 ∼ 2170   < 14        900 ∼ 946   > 85      U
19       120 ∼ 270    > 605       900 ∼ 946   < 72      U
20       271 ∼ 550    > 605       900 ∼ 946   82 ∼ 85   U
21       120 ∼ 270    14 ∼ 65     > 974       79 ∼ 81   U
22       120 ∼ 270    14 ∼ 65     > 974       72 ∼ 78   U
23       < 120        < 14        > 974       > 85      U
24       < 120        14 ∼ 65     900 ∼ 946   79 ∼ 81   S
25       > 2170       66 ∼ 330    < 900       > 85      U
26       120 ∼ 270    331 ∼ 605   > 974       72 ∼ 78   S
27       > 2170       < 14        > 974       > 85      U
28       271 ∼ 550    66 ∼ 330    < 900       72 ∼ 78   S
29       120 ∼ 270    > 605       900 ∼ 946   < 72      S
30       < 120        < 14        < 900       82 ∼ 85   S
31       551 ∼ 2170   > 605       > 974       79 ∼ 81   S
32       271 ∼ 550    14 ∼ 65     947 ∼ 963   > 85      S
33       551 ∼ 2170   > 605       900 ∼ 946   < 72      S
34       120 ∼ 270    66 ∼ 330    947 ∼ 963   79 ∼ 81   S
35       551 ∼ 2170   > 605       947 ∼ 963   72 ∼ 78   S
The four properties Density, Population, Sex ratio and Literacy have the sets of ranges 'R_i' as '< 120, 120 ∼ 270, 271 ∼ 550, 551 ∼ 2170 and > 2170' for Density; '< 14, 14 ∼ 65, 66 ∼ 330, 331 ∼ 605 and > 605' for Population; '< 900, 900 ∼ 946, 947 ∼ 963, 964 ∼ 974 and > 974' for Sex ratio; and '< 72, 72 ∼ 78, 79 ∼ 81, 82 ∼ 85 and > 85' for Literacy. Here 'Density' is the population density in persons per square kilometre, 'Population' is in lakhs, 'Sex ratio' is the number of females per 1000 males, and 'Literacy' is the literacy rate in percent among persons aged 7 years and older. In Table 1 the complete data is divided into two classes, i.e. c = 2, with 28 values of 's' and 7 values of 'u'. Therefore the required information is

I(C_1, C_2) = \frac{1}{1-2} \log_2 \left[ \left(\frac{28}{35}\right)^2 + \left(\frac{7}{35}\right)^2 \right]. \qquad (6)

Further, the information associated with the four properties at α = 2 is given as follows:
E(P_1) = E(\text{Density}) = -\frac{5}{35} \log_2 \left[ \left(\frac{1}{5}\right)^2 + \left(\frac{4}{5}\right)^2 \right] - \frac{3}{35} \log_2 \left[ \left(\frac{3}{3}\right)^2 \right] - \frac{9}{35} \log_2 \left[ \left(\frac{3}{9}\right)^2 + \left(\frac{6}{9}\right)^2 \right] - \frac{9}{35} \log_2 \left[ \left(\frac{9}{9}\right)^2 \right] - \frac{9}{35} \log_2 \left[ \left(\frac{9}{9}\right)^2 \right] \qquad (7)

E(P_2) = E(\text{Population}) = -\frac{9}{35} \log_2 \left[ \left(\frac{6}{9}\right)^2 + \left(\frac{3}{9}\right)^2 \right] - \frac{9}{35} \log_2 \left[ \left(\frac{9}{9}\right)^2 \right] - \frac{5}{35} \log_2 \left[ \left(\frac{5}{5}\right)^2 \right] - \frac{3}{35} \log_2 \left[ \left(\frac{3}{3}\right)^2 \right] - \frac{9}{35} \log_2 \left[ \left(\frac{1}{9}\right)^2 + \left(\frac{8}{9}\right)^2 \right] \qquad (8)

E(P_3) = E(\text{Sex ratio}) = -\frac{9}{35} \log_2 \left[ \left(\frac{5}{9}\right)^2 + \left(\frac{4}{9}\right)^2 \right] - \frac{9}{35} \log_2 \left[ \left(\frac{1}{9}\right)^2 + \left(\frac{8}{9}\right)^2 \right] - \frac{9}{35} \log_2 \left[ \left(\frac{1}{9}\right)^2 + \left(\frac{8}{9}\right)^2 \right] - \frac{5}{35} \log_2 \left[ \left(\frac{5}{5}\right)^2 \right] - \frac{3}{35} \log_2 \left[ \left(\frac{3}{3}\right)^2 \right] \qquad (9)

E(P_4) = E(\text{Literacy}) = -\frac{9}{35} \log_2 \left[ \left(\frac{9}{9}\right)^2 \right] - \frac{10}{35} \log_2 \left[ \left(\frac{6}{10}\right)^2 + \left(\frac{4}{10}\right)^2 \right] - \frac{8}{35} \log_2 \left[ \left(\frac{1}{8}\right)^2 + \left(\frac{7}{8}\right)^2 \right] - \frac{5}{35} \log_2 \left[ \left(\frac{5}{5}\right)^2 \right] - \frac{3}{35} \log_2 \left[ \left(\frac{3}{3}\right)^2 \right] \qquad (10)

Net gained information:

Gain(Density) = I(C_1, C_2) - E(P_1) = 0.258852
Gain(Population) = I(C_1, C_2) - E(P_2) = 0.256699
Gain(Sex ratio) = I(C_1, C_2) - E(P_3) = 0.140526
Gain(Literacy) = I(C_1, C_2) - E(P_4) = 0.205441

Since the information gain of 'Density' is the largest, it is the root of the tree. By repeating the above process for the different sub-trees we obtain the tree shown in Figure 1 as the result.

Figure 1: Resultant Decision Tree
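As a numerical check (an addition of ours, not in the original), the per-range class splits appearing in equations (7)-(10) reproduce the gains quoted above; the sketch below recomputes them, with the counts transcribed from those equations and therefore only as reliable as that reading.

```python
import math

def renyi2(counts):
    # Renyi entropy with alpha = 2 of a class split given by absolute counts
    n = sum(counts)
    return -math.log2(sum((c / n) ** 2 for c in counts))

I = renyi2([28, 7])   # required information I(C1, C2) for 28 's' and 7 'u'

# (class1, class2) counts in each range, in the order they appear in equations (7)-(10)
splits = {
    "Density":    [(1, 4), (0, 3), (3, 6), (0, 9), (0, 9)],
    "Population": [(6, 3), (0, 9), (0, 5), (0, 3), (1, 8)],
    "Sex ratio":  [(5, 4), (1, 8), (1, 8), (0, 5), (0, 3)],
    "Literacy":   [(0, 9), (6, 4), (1, 7), (0, 5), (0, 3)],
}

for prop, ranges in splits.items():
    E = sum(sum(r) / 35 * renyi2(r) for r in ranges)
    print(prop, round(I - E, 4))   # approximately 0.2589, 0.2567, 0.1405, 0.2054
```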
5 Rules
On the basis of the decision tree, conditions can be set or rules can be formed using 'if-then' statements so that the correct decision can be made during the implementation of different policies or development projects. These can be described as follows. 'If Density is < 120 and Sex Ratio is > 974 then S': for any value of 'Population' and 'Literacy', in the case of small 'Density' with high 'Sex Ratio' the policy must be implemented in a State. Similarly, 'If Density is < 120, Sex ratio is < 900, and Literacy is > 85 then U': in the case of small 'Density' and low 'Sex Ratio' with a high 'Literacy' rate the project must be implemented in a Union Territory, and so on.
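For illustration (our own encoding, not taken from the paper), the two rules above can be written directly as an 'if-then' function over the range labels used in Table 1:

```python
def classify(density, sex_ratio, literacy):
    """Two rules read off the decision tree of Figure 1 (illustrative; remaining branches omitted)."""
    if density == "< 120" and sex_ratio == "> 974":
        return "S"   # small density, high sex ratio -> State
    if density == "< 120" and sex_ratio == "< 900" and literacy == "> 85":
        return "U"   # small density, low sex ratio, high literacy -> Union Territory
    return None      # other branches of the tree are not encoded in this sketch

print(classify("< 120", "> 974", "72 ~ 78"))   # prints S
```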
6 Conclusion
More output with minimum input is expected from every policy or development project. In such situations, the decision tree provides a mode of classification and helps in finding the right decision. By classifying the data purposefully, we find commercially valuable and potentially useful information. In this paper the Renyi entropy with α = 2 has been applied in the ID3 algorithm to develop a decision tree, which can help in following the concept of the right policy for the right people.
References

[1] A. Renyi, On measures of information and entropy, Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, (1961), 547-561.

[2] C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal, 27 (1948), 379-423 and 623-656.

[3] I. J. Taneja, Generalized Information Measures and Their Applications, online book, www.mtm.ufsc.br/~taneja/book/book.html, 2001.

[4] J. Aczel and Z. Daroczy, On Measures of Information and Their Characterizations, Academic Press, New York, 1975.

[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

[6] M. M. George, Modern Data Warehousing, Mining, and Visualization: Core Concepts, Prentice Hall, Pearson, 2004.

[7] T. Maszczyk and W. Duch, Comparison of Shannon, Renyi and Tsallis entropy used in decision trees, Proceedings of Artificial Intelligence and Soft Computing - ICAISC 2008, Vol. 5097, 87-100, Torun, Poland, 2008.

[8] Ö. Akgöbek, A new algorithm for knowledge discovery from data sets using cross-entropy measurement, Scientific Research and Essays, 6(20) (2011), 4301-4311.

[9] Q. Wang, W. Yaohua, X. Jiwei and P. Guang-Feng, The applied research based on decision tree of data mining in third-party logistics, Proceedings of the IEEE International Conference on Automation and Logistics, August 18-21, Jinan, China, 2007.

[10] R. B. Ash, Information Theory, Dover Publications, New York, 1990.

[11] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley and Sons, New York, 1991.

Received: May, 2012