Classifying Mobile-Phone Users With An Information Theory Approach∗

Yun Zheng†
[email protected]
Wynne Hsu
[email protected]
Mong Li Lee
[email protected]
Limsoon Wong
[email protected]
Department of Computer Science, School of Computing, National University of Singapore, Singapore 117543
Abstract

In this paper, we use a learning method based on information theory to classify mobile-phone users as 2G or 3G customers. Our method aims to find informative and discriminatory feature subsets, and we then build classification models with these subsets. We also identify some general properties of promising 3G users, i.e., the false positive predictions. These properties should be useful for differentiating 3G mobile-phone users.

Keywords: The Discrete Function Learning Algorithm, Classification, Information Theory
1. Introduction to the Problem

An Asian telco operator which has successfully launched a third-generation (3G) mobile telecommunications network would like to make use of existing customer usage and demographic data to identify which customers are likely to switch to using its 3G network. An original sample dataset of 20,000 2G network customers and 4,000 3G network customers has been provided, with more than 200 data fields. The target categorical variable is "Customer-Type" (2G/3G). A 3G customer is defined as a customer who has a 3G Subscriber Identity Module (SIM) card and is currently using a 3G network compatible mobile phone.

∗. This research was carried out for the PAKDD 2006 Data Mining Competition: http://www.ntu.edu.sg/sce/pakdd2006/competition/overview.htm.
†. To whom correspondence should be addressed.
Three-quarters of the dataset (15K 2G, 3K 3G) will have the target field available and is meant to be used for training/testing. The remaining portion (5K 2G, 1K 3G) will be made available with the target field missing and is meant to be used for prediction. The data mining task is a classification problem whose objective is to accurately predict as many current 3G customers as possible (i.e., true positives) from the holdout sample provided. Entrants are also required to email a write-up that includes the following:

a) Approach and understanding of the problem;
b) Full technical details of the algorithm(s) used;
c) Details of the classification model that was produced;
d) Discussion of what insights can be gained from the model in terms of identifying current 2G customers with the potential to switch to 3G (e.g., using false positives).

The remainder of this paper is organized as follows. Sections 2 to 6 formally introduce our method. In Section 7, we briefly review an entropy-based discretization method. In Section 8, we show the results of our method. Finally, we discuss the results in Section 9. The proofs of the theorems proposed in the paper are given in Appendix A.
2. The Proposed Method

In this research, we solve the problem with the Discrete Function Learning (DFL) algorithm (Zheng and Kwoh, 2005). In our method, the concept under consideration is regarded as a random variable, and its diversity, or entropy, can be accounted for by other description features (variables). The aim is to find the most informative subset of features. The prediction accuracy and the complexity of the classification model are two complementary aspects, since complex models often suffer the risk of over-fitting the training data set.

In this section, we first introduce fundamental knowledge of information theory. Second, we propose the Information Learning Approach (ILA). Third, we define the problem to be solved. Fourth, we introduce the DFL algorithm for solving the problem. Fifth, we analyze the complexity of the DFL algorithm. Finally, we discuss the choice of one of the parameters of the DFL algorithm.

We first introduce some notation. We use capital letters to represent discrete random variables, such as X and Y; lower case letters to represent an instance of the random variables, such as x and y; bold capital letters, like X, to represent a vector; and lower case bold letters, like x, to represent an instance of X. The cardinality of X is represented with |X|. The number of different instances of X is represented with ||X||. In the remainder of this paper, we denote the attributes other than the class attribute as a set of discrete random variables V = {X1, . . . , Xn}, and the class attribute as the variable Y. The entropy of X is represented with H(X), and the mutual information between X and Y is represented with I(X; Y). The entropy and mutual information estimated from data sets are empirical values, denoted Ĥ(·) and Î(·; ·). For legibility, we will simply use H(·) and I(·; ·) to represent their empirical values if their meanings are clear in the context.

2.1 Fundamental Knowledge of Information Theory

The entropy of a discrete random variable or vector X is defined in terms of the probability of observing a particular value x of X as (Shannon and Weaver, 1963):
H(X) = − ∑_x P(X = x) log P(X = x).    (1)
The entropy is used to describe the diversity of X: the more diverse a variable or vector is, the larger its entropy. Generally, vectors are more diverse than individual variables and hence have larger entropy. Hereafter, for simplicity, we represent P(X = x) with p(x), P(Y = y) with p(y), and so on. The mutual information between a vector X and Y is defined as (Shannon and Weaver, 1963):
I(X; Y) = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y) = ∑_x ∑_y p(x, y) log [p(x, y) / (p(x)p(y))].    (2)
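For concreteness, Equations 1 and 2 can be estimated from discrete samples with simple plug-in frequency estimates, as in the following minimal Python sketch (the function names and toy data are illustrative, not part of the DFL implementation):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical entropy H(X) of a sequence of discrete values (Equation 1)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) = H(X) + H(Y) - H(X, Y) (Equation 2)."""
    joint = list(zip(xs, ys))
    return entropy(xs) + entropy(ys) - entropy(joint)

if __name__ == "__main__":
    # Toy example: Y is a deterministic function of X, so I(X; Y) = H(Y).
    xs = [0, 0, 1, 1, 2, 2]
    ys = [0, 0, 1, 1, 1, 1]            # y = 0 if x == 0 else 1
    print(entropy(ys))                 # H(Y)
    print(mutual_information(xs, ys))  # equals H(Y) here
```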
Mutual information is always non-negative and can be used to measure the relation between two variables, between a variable and a vector (Equation 2), or between two vectors. Basically, the stronger the relation between two variables, the larger their mutual information. Zero mutual information means that the two variables are independent or have no relation.

2.2 The Information Learning Approach

First, we restate a theorem about the relationship between the mutual information I(X; Y) and the number of attributes in X.

Theorem 2.1 (McEliece, 1977, p. 26) I({X, Z}; Y) ≥ I(X; Y), with equality if and only if p(y|x) = p(y|x, z) for all (x, y, z) with p(x, y, z) > 0.

From Theorem 2.1, it can be seen that {X, Z} contains at least as much information about Y as X does. To put it another way, the more variables, the more information is provided about another variable. Theorem 2.1 reflects the learning process to some extent. Let us discuss the problem with an example. A computer in a laboratory is lost one day. Then, a policeman is sent to investigate the issue. He first needs to collect information about the computer. George may tell the policeman that the computer is an IBM PC, Alice may tell the policeman that the computer is black, and so forth. As more descriptions of the computer are obtained, the concept, i.e., the computer, becomes more and more distinct and specific, until finally it can be fully determined from the descriptions. Such a process is actually a procedure for obtaining information about the concept.

In the ILA, the descriptions provided by the people become description variables in the data sets, the lost computer becomes the concept under consideration, and the knowledge contained in the people's descriptions becomes the mutual information between the description variables in the data sets and the concept. Information, or knowledge, is used to eliminate uncertainty, i.e., the entropy of the concept. The more information, the more specific and deterministic the concept is, i.e., the less the uncertainty of the concept becomes.

Then, to measure which subset of features is optimal, we restate the following theorem, which is the theoretical foundation of our algorithm. It has been proved that if H(Y|X) = 0,
then Y is a function of X (Cover and Thomas, 1991). Since I(X; Y) = H(Y) − H(Y|X), it is immediate to obtain Theorem 2.2.

Theorem 2.2 (Cover and Thomas, 1991, p. 43, prob. 6) If the mutual information between X and Y is equal to the entropy of Y, i.e., I(X; Y) = H(Y), then Y is a function of X.

The entropy H(Y) represents the diversity of the variable Y. The mutual information I(X; Y) represents the relation between the vector X and Y. From this point of view, Theorem 2.2 actually says that the relation between X and Y is so strong that there is no remaining diversity in Y once X is known. In other words, the value of X fully and completely determines the value of Y. If the concept is represented with the variable Y, then the learning process becomes a process of finding a subset of the description features X which satisfies the criterion of Theorem 2.2. The features in X are called Essential Attributes, or EAs for short.

For the above example, there must exist some essential attributes which are of primary importance and can fully determine the lost computer, like the brand and the serial number. Once two persons have told the policeman these two properties of the lost computer, it is unnecessary to talk to other people: since the policeman can correctly identify the lost computer with its brand and serial number, other people who do not talk to the policeman would not provide any additional necessary information about the lost computer. For another example, humankind has many properties or characteristics, like upright walking and large brains, which differ from other species. But the most fundamental difference between the human being, Homo sapiens, and other species is the DNA content within our cells.

When data sets are noisy, the equality between I(X; Y) and H(Y) no longer holds. In these cases, we have to relax the requirement of Theorem 2.2 to obtain the best estimate. Therefore, we introduce a method called the ε value method to deal with noisy training data sets in practice. In the ε value method, we attribute the missing part of H(Y), which is not captured by X, to the noise in the data sets, and require it to be smaller than or equal to ε × H(Y).

Finally, we consider the prediction task. People make predictions based on their knowledge, or the information obtained in their living (i.e., learning) process and stored in their brains. Consider the example of the lost computer. After the properties of the lost computer are obtained, the policeman will use these properties to predict whether a computer he finds is the lost one. There would be many ways to use these properties, but the most accurate and convenient way is to check the essential properties which distinctly specify the lost computer, like the brand and the serial number. If other properties, like the brand and the color, are used to perform the prediction, it is possible to find incorrect targets, since there are likely many computers with the same brand and color as the lost one. However, the brand and the serial number are unique for every computer. For the human example, it is the DNA of humankind that completely and accurately differentiates us, Homo sapiens, from other species.

2.3 Problem Definition

A classification problem tries to learn or approximate, from a given training data set, a function which takes the values of the attributes (except the class attribute) of a new sample as input and outputs a categorical
value indicating its class. The goal of the training process is to obtain a function whose output value is the class value of the new sample as accurately as possible.

From Theorem 2.2, the problem is converted to finding a subset of attributes U ⊆ V whose mutual information with Y is equal to the entropy of Y, where U is the set of EAs which we are trying to find from the data sets. For n discrete variables, there are 2^n subsets in total. Clearly, it is intractable to examine all possible subsets exhaustively. It is often the case that there are some irrelevant and redundant features in the domain V. Therefore, it is reasonable to reduce the search space by considering only those subsets with a limited number of features.

Let V = {X1, X2, . . . , Xn} be a domain with n discrete variables, and let Y be a function of X, X ⊆ V. We denote the cardinality of X with k, i.e., |X| = k and k ≤ n. Let the training samples be generated in such a way that v is drawn from a certain distribution over V, and y is generated from x. We consider the learning problem defined as follows.

Definition 2.1 (The Learning Problem) Given a training data set T = {vi → yi}, i = 1, . . . , N, find a function Y = f(U) such that f produces the same output values as those in T with the highest empirical probability.

Note that U may be different from the input X of the original function, which indicates the failure or partial failure of the learning process. Furthermore, the training data set may include some noise, i.e., some yi in the pair {vi → yi} is not the value produced with Y = f(X), and vice versa.

2.4 The Discrete Function Learning Algorithm

To solve the problem in Definition 2.1 with the ILA, we introduce the Discrete Function Learning algorithm listed in Table 1. The DFL algorithm has two parameters, the expected cardinality K and the ε value. The ε value will be elaborated in a later section. K is the expected maximum number of attributes in the classifier. The DFL algorithm uses K to avoid exhaustively searching all subsets of attributes by checking only those subsets with at most K attributes. When trying to find the EAs in the search space, the DFL algorithm examines whether I(X; Y) = H(Y). If so, the DFL algorithm stops its search and obtains the classifier by deleting the non-essential attributes and duplicate rows in the training data set.

We briefly introduce the DFL algorithm with an example, as shown in Figure 1. In this example, the set of attributes is V = {A, B, C, D} and the class attribute is determined with Y = (A·C)+(A·D), where "·" and "+" are the logic AND and OR operations respectively. The expected cardinality K is set to 4 for this example. The training data set T of this example is shown in Table 3. In the DFL algorithm, Definitions 2.2 to 2.4 are used to divide and define the search space. For instance, ∆1(A) in this example is {{A, B}, {A, C}, {A, D}}, L1 in this example is {{A}, {B}, {C}, {D}}, and the search space S4 in this example is all the subsets in Figure 1 except the empty set. From Definition 2.3, it is known that there are C(n, i) subsets of V in Li, where C(n, i) denotes the binomial coefficient, and there are ∑_{i=1}^{K} C(n, i) ≈ n^K subsets of V in SK.

Definition 2.2 (∆ Supersets) Let X be a subset of V = {X1, . . . , Xn}. Then the ∆i(X) of X are the supersets of X such that X ⊂ ∆i(X) and |∆i(X)| = |X| + i.
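As a small illustration of Definition 2.2, the ∆i supersets of a subset can be enumerated as follows (a sketch with an illustrative function name, not taken from the DFL implementation):

```python
from itertools import combinations

def delta_supersets(x, v, i):
    """Return the Delta_i supersets of subset x within domain v:
    all supersets of x of size |x| + i (Definition 2.2)."""
    rest = [a for a in v if a not in x]
    return [set(x) | set(extra) for extra in combinations(rest, i)]

# Example from the paper: V = {A, B, C, D}
v = ["A", "B", "C", "D"]
print(delta_supersets({"A"}, v, 1))  # the three supersets {A,B}, {A,C}, {A,D}
```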
Table 1: The Discrete Function Learning algorithm.
Algorithm: DFL(V, K, T)
Input: a list V with n variables, indegree K, T = {vi → yi}, i = 1, · · · , N. T is global.
Output: f
Begin:
1   L ← all single element subsets of V;
2   ∆Tree.FirstNode ← L;
3   calculate H(Y);                  //from T
4   D ← 1;                           //initial depth
5∗  f = Sub(Y, ∆Tree, H(Y), D, K);
6   return f;
End
∗ Sub() is a subroutine listed in Table 2.
Table 2: The subroutine of the DFL algorithm.
Algorithm: Sub(Y, ∆Tree, H, D, K)
Input: variable Y, ∆Tree, entropy H(Y), current depth D, maximum indegree K
Output: function table for Y, Y = f(X)
Begin:
1   L ← ∆Tree.DthNode;
2   for every element X ∈ L {
3       calculate I(X; Y);           //from T
4       if(I(X; Y) == H) {           //from Theorem 2.2
5∗          extract Y = f(X) from T;
6           return Y = f(X); } }
7   sort L according to I;
8   for every element X ∈ L {
9       if(D < K) {
10          D ← D + 1;
11          ∆Tree.DthNode ← ∆1(X);
12          return Sub(Y, ∆Tree, H, D, K); } }
13  return "Fail(Y)";                //fail to find function for Y
End
∗ By deleting unrelated variables and duplicate rows in T.
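The following Python sketch conveys the spirit of Tables 1 and 2, but is not the authors' implementation: it omits the ∆Tree bookkeeping and, instead of the greedy ∆ expansion, exhaustively scans each layer of at most K attributes in order of decreasing mutual information, stopping when the (ε-relaxed) criterion of Theorem 2.2 is met. All names are illustrative:

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x_cols, y):
    x_rows = list(zip(*x_cols))              # instances of X
    xy_rows = list(zip(*x_cols, y))          # instances of (X, Y)
    return entropy(x_rows) + entropy(y) - entropy(xy_rows)

def dfl_search(cols, y, K, eps=0.0):
    """Return the first subset U of at most K attribute indices with
    H(Y) - I(U; Y) <= eps * H(Y), scanning layer L1, L2, ... in turn.
    `cols` is a list of attribute columns; `y` is the class column."""
    h_y = entropy(y)
    tol = 1e-9                               # tolerance for floating-point error
    for size in range(1, K + 1):             # layer L_size of the search space
        layer = list(combinations(range(len(cols)), size))
        # examine the most informative candidates on this layer first
        layer.sort(key=lambda u: -mutual_information([cols[i] for i in u], y))
        for u in layer:
            if h_y - mutual_information([cols[i] for i in u], y) <= eps * h_y + tol:
                return u                     # relaxed criterion of Theorem 2.2
    return None                              # corresponds to "Fail(Y)"

# Example of Table 3: Y = (A and C) or (A and D) over all 16 input rows
rows = [(a, b, c, d) for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)]
cols = [list(col) for col in zip(*rows)]     # columns A, B, C, D
y = [(a and c) or (a and d) for a, b, c, d in rows]
print(dfl_search(cols, y, K=4))              # (0, 2, 3), i.e., {A, C, D}
```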
Definition 2.3 (Searching Layer L of V) Let X ⊆ V. Then the ith layer Li of all subsets of V is the collection of subsets with i features, i.e., Li = ∪ X, ∀ |X| = i.

Definition 2.4 (Searching Space) The searching space of functions with a bounded indegree K is the collection of subsets with ≤ K features, i.e., SK = ∪_{i=1}^{K} Li.

As shown in Figure 1, the DFL algorithm searches the first layer, then sorts all subsets on the first layer according to their mutual information with Y. It finds that {A} shares the largest mutual information with Y among the subsets on the first layer.
Table 3: The training data set T of the example to learn Y = (A · C) + (A · D).

ABCD  Y    ABCD  Y    ABCD  Y    ABCD  Y
0000  0    0100  0    1000  0    1100  0
0001  0    0101  0    1001  1    1101  1
0010  0    0110  0    1010  1    1110  1
0011  0    0111  0    1011  1    1111  1
[Figure 1 here: the lattice of all subsets of V = {A, B, C, D}, from the empty set {} through the single sets {A}, {B}, {C}, {D}, the pairs {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}, and the triples {A,B,C}, {A,B,D}, {A,C,D}∗, {B,C,D}, up to {A,B,C,D}.]
Figure 1: The search procedure of the DFL algorithm when learning Y = (A · C) + (A · D). {A, C, D}∗ is the target combination. The combinations with a black dot under them are the subsets which share the largest mutual information with Y on their layers. First, the DFL algorithm searches the first layer and finds that {A}, with a black dot under it, shares the largest mutual information with Y among the subsets on the first layer. Then, it continues to search ∆1(A) on the second layer. These calculations continue in the same way until the target combination {A, C, D} is found on the third layer.
Then, the DFL algorithm searches through ∆1(A), . . ., ∆K−1(A); however, it always decides the search order of ∆i+1(A) based on the calculation results of ∆i(A). Finally, the DFL algorithm finds that the subset {A, C, D} satisfies the requirement of Theorem 2.2, and constructs the classifier with these three attributes. First, B is deleted from the training data set since it is a non-essential attribute. Then, the duplicate rows of {A, C, D} → Y are removed from the training data set to obtain the final classifier f shown in Table 4. In the meantime, the counts of the different instances of {A, C, D} → Y are also stored in the classifier; they are used in the prediction process. From Table 4, it can be seen that the learned classifier f is exactly the truth table of Y = (A · C) + (A · D) along with the counts of the rules. This is why we name our algorithm the Discrete Function Learning algorithm. The DFL algorithm continues to search ∆1(C), . . ., ∆K−1(C), ∆1(D), . . ., ∆K−1(D) and so on, if it cannot find the target subset in ∆1(A), . . ., ∆K−1(A) (details available in supplementary Figure S3).
Table 4: The learned classifier f of the example to learn Y = (A · C) + (A · D).

ACD  Y  Count    ACD  Y  Count
000  0  2        100  0  2
001  0  2        101  1  2
010  0  2        110  1  2
011  0  2        111  1  2
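To illustrate how a classifier such as Table 4 is obtained from the training data (a sketch with illustrative names, not the authors' code), the following snippet projects the training rows onto the essential attributes {A, C, D}, removes duplicate rows, and records the counts later used in prediction:

```python
from collections import Counter

def extract_classifier(rows, ys, essential):
    """Project rows onto the essential attribute indices, drop duplicates,
    and keep the count of each (instance -> class) rule."""
    counts = Counter((tuple(r[i] for i in essential), y) for r, y in zip(rows, ys))
    return {key: (y, c) for (key, y), c in counts.items()}

rows = [(a, b, c, d) for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)]
ys = [(a and c) or (a and d) for a, b, c, d in rows]
f = extract_classifier(rows, ys, essential=(0, 2, 3))   # indices of A, C, D
for acd, (y, count) in sorted(f.items()):
    print(acd, y, count)   # reproduces the 8 rules of Table 4, each with count 2
```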
2.5 Complexity Analysis

Now, we analyze the worst-case complexity of the DFL algorithm. As will be discussed in Section 3.1, the complexity of computing the mutual information I(X; Y) is O(N). For the example in Figure 1, {A, B} will be visited twice, from {A} and from {B}, in the worst case. {A, B, C} will be visited from {A, B}, {A, C} and {B, C}. Thus, {A, B, C} will be checked for 3 × 2 = 3! times in the worst case. In general, a subset with K features will be checked K! times in the worst case. Hence, it takes O((C(n, 1) + C(n, 2)·2! + . . . + C(n, K)·K!) × N) = O(N · n^K) to examine all subsets in SK. Another computation-intensive step is the sort step in line 7 of Table 2. In L1, there is only one sort operation, which takes O(n log n) time. In L2, there are up to n sort operations, which take O(n^2 log n) time. Similarly, in LK, the sort operation is executed up to n^{K−1} times, which takes O(n^K log n) time. Therefore, the total complexity of the DFL algorithm is O((N + log n) · n^K) in the worst case.

As described in Section 2.3, we use k to denote the actual cardinality of the EAs. After the EAs with k attributes are found in SK, the DFL algorithm stops its search. In our example, K is 4, while k is automatically determined as 3, since there are only 3 EAs for the example.

Next, we analyze the expected complexity of the DFL algorithm. Thanks to the sort step in line 7 of the subroutine, the DFL algorithm makes the best choice on the current layer of subsets. Since there are (n−1) ∆1 supersets for a given single-element subset, (n−2) ∆1 supersets for a given two-element subset, and so on, the DFL algorithm only considers ∑_{i=0}^{k−1} (n − i) ≈ k · n subsets in the optimal case. Thus, the expected time complexity of the DFL algorithm is approximately O(k · n · (N + log n)), where log n is for the sort step in line 7 of Table 2.

Next, we consider the space complexity of the DFL algorithm. To store the information needed in the search process, the DFL algorithm uses two data structures. The first one is a linked list, which stores the value list of every variable. Therefore, the space complexity of the first data structure is O(Nn). The second one is the ∆Tree, which is a linked list of length K, and each node in the first dimension is itself a linked list. The ∆Tree for the example in Figure 1 is shown in supplementary Figure S2. The first node of this data structure is used to store the single-element subsets. If the DFL algorithm is processing {Xi} and its ∆ supersets, the second node to the Kth node are used to store the ∆1 to ∆K−1 supersets of {Xi}.¹ If there are n variables, there are ∑_{i=0}^{K−1} (n − i) ≈ Kn subsets in the ∆Tree. To store the ∆Tree, the space complexity is O(Kn), since only the indexes of the variables are stored for each subset. Therefore, the total space complexity of the DFL algorithm is O((K + N) · n).

2.6 Selection of The Expected Cardinality K

We discuss the selection of K in this section. Generally, if a data set has a large number of features, like several thousand, then K can be assigned a small constant, like 50, since models with a large number of features will be very difficult to understand. If the number of features is small, then K can be set directly to the number of features n.

1. Except for the ∆1 supersets, only a part of the other ∆i (i = 2, . . . , K − 1) supersets is stored in the ∆Tree.
Another use of K is to control model complexity. If a small number of features is more important than accuracy, then a predefined K can be set; the learned model will then have at most K features. The expected cardinality K can also be used to incorporate prior knowledge about the number of relevant features: if we have such prior knowledge, then K can be set to the predetermined value.
3. Implementation Issues

In this section, we discuss two important implementation issues of the DFL algorithm.

3.1 The Computation of I(X; Y)

We use Equation 2 to compute I(X; Y). H(Y) does not change during the search process of the DFL algorithm. To compute H(X) and H(X, Y), we need the joint distributions of X and of (X, Y), which can be estimated from the input table T. The DFL algorithm constructs a matrix containing the values of X. Then, it scans the matrix and finds the frequencies of the different instances of X, which are stored in a frequency table implemented as a linked list. The size of the frequency table grows exponentially with the number of variables in X, but will not exceed N. Next, the DFL algorithm obtains the estimate of H(X) with Equation 1. For each instance of X in T, we need to update its frequency in the frequency table, which takes O(min(||X||, N)) steps. The total complexity to compute H(X) is therefore O(N · min(||X||, N)). The computation of H(X, Y) is similar to that of H(X). Hence, if X only contains a few variables, it needs approximately O(N) steps to compute I(X; Y), since ||X|| is small. When |X| is large, the computation of I(X; Y) takes O(N^2) steps in the worst case.

However, the complexity of computing I(X; Y) can be improved by storing the frequencies of the different instances of X and (X, Y) in a hash table. For each instance of X in T, it only takes O(1) time to update its frequency in the hash table (Cormen et al., 2001, chap. 11). Hence, the total complexity to compute H(X) becomes O(N). The computation of H(X, Y) is similar to that of H(X). Therefore, only approximately O(N) steps are needed to compute I(X; Y). An important issue is the proper setting of the initial capacity of the hash table, since too large a value wastes memory, but too small a value may incur dynamically increasing the capacity and reorganizing the hash table, which is time-consuming. In summary, if |X| and N are both large and there is enough memory available, it is more advisable to use a hash table for calculating I(X; Y). When |X| or N is small and memory is limited, it is better to use a linked list or an array to compute I(X; Y).
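A minimal sketch of the hash-table approach described above, using a Python dictionary as the hash table so that each of the N samples updates its frequency in O(1) expected time (names and data are illustrative):

```python
from math import log2

def entropy_from_counts(counts, n):
    """H estimated from a frequency table (Equation 1)."""
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mutual_information_hashed(t, x_indices, y_index):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), with frequencies kept in dicts (hash tables)."""
    n = len(t)
    freq_x, freq_y, freq_xy = {}, {}, {}
    for row in t:                                   # one O(1) update per sample
        x = tuple(row[i] for i in x_indices)
        y = row[y_index]
        freq_x[x] = freq_x.get(x, 0) + 1
        freq_y[y] = freq_y.get(y, 0) + 1
        freq_xy[(x, y)] = freq_xy.get((x, y), 0) + 1
    return (entropy_from_counts(freq_x, n) + entropy_from_counts(freq_y, n)
            - entropy_from_counts(freq_xy, n))

# Rows of Table 3: (A, B, C, D, Y)
t = [(a, b, c, d, (a and c) or (a and d))
     for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)]
print(mutual_information_hashed(t, x_indices=(0, 2, 3), y_index=4))  # equals H(Y)
```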
3.2 The Redundancy Matrix

The subroutine in Table 2 is recursive, which introduces some redundant computation when the DFL algorithm exhaustively searches the search space SK. For instance, {A, B} is a common ∆1 superset of {A} and {B}. Hence, the DFL algorithm will check {A, B} twice in the worst case.

However, this redundant computation can be alleviated by storing, in a Boolean matrix, the information of whether a subset has been checked. Let us consider the subsets with two variables. We introduce an n by n matrix called the redundancy matrix, boolean R(n × n): after a subset {Xi, Xj} and its supersets have been checked, R[i][j] is assigned true. Later, when the DFL algorithm reaches {Xj, Xi}, it first checks whether R[i][j] or R[j][i] is true. If yes, it examines the next subset. By doing so, the original worst-case time complexity becomes O((n + (1/2)[C(n, 2)·2! + . . . + C(n, K)·K!])N + n^K log n) = O((N + log n) · n^K). Although this alleviated worst-case time complexity is of the same order as the original one, it saves about half of the run time. The space complexity of R is O(n^2), but the type is boolean, which costs very little memory. In addition, if run time is more critical and memory is sufficient, higher-dimensional matrices can be introduced to further reduce the run time of the DFL algorithm. To clearly show the implementation of the redundancy matrix R, an extended version of the main steps of the DFL algorithm is provided at the supplementary website of this paper.
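A sketch of this bookkeeping for two-variable subsets (hypothetical code, not the supplementary implementation):

```python
def make_redundancy_matrix(n):
    """n x n Boolean matrix R; R[i][j] is True once {Xi, Xj} and its supersets have been checked."""
    return [[False] * n for _ in range(n)]

def should_check_pair(r, i, j):
    """Return True if the pair {Xi, Xj} has not been examined yet, and mark it as examined."""
    if r[i][j] or r[j][i]:
        return False          # already checked from the other direction; skip it
    r[i][j] = True
    return True

r = make_redundancy_matrix(4)
print(should_check_pair(r, 0, 1))   # True: first visit of {X0, X1}
print(should_check_pair(r, 1, 0))   # False: same pair reached from {X1}, skipped
```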
4. The ε Value Method

In this section, we first introduce the ε value method, then discuss its relation to the over-fitting problem, and finally discuss the selection of the ε value.

4.1 The ε Value Method

In Theorem 2.2, the exact functional relation demands strict equality between the entropy of Y, H(Y), and the mutual information of X and Y, I(X; Y). However, this equality is often ruined by noisy data, like microarray gene expression data. In these cases, we have to relax the requirement to obtain the best estimate. As shown in Figure 2, by defining a significance factor ε, if the difference between I(X; Y) and H(Y) is less than ε × H(Y), then the DFL algorithm stops the search process and builds the classifier for Y with X at the significance level ε. Because H(Y) may be quite different for various classification problems, it is not appropriate to use an absolute value, like ε, to decide whether to stop the search process. Therefore, we use the relative value ε × H(Y), where ε ∈ [0, 1), as the criterion for deciding whether to stop the search process. The main idea of the ε value criterion is to find a subset of attributes which captures not all of the diversity of the class attribute, H(Y), but the major part of it, i.e., (1 − ε) × H(Y), and then to build classifiers with these attributes. The features in the vectors that have strong relations with Y are expected to be selected as EAs in the ε value method. In the ε value method, line 4 of Table 2 is modified to "if (H − I(X; Y) ≤ ε × H)".
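A sketch of this modified test, mirroring the change to line 4 of Table 2 described above (illustrative code, not the authors' implementation):

```python
def stop_search(h_y, i_xy, eps):
    """Relaxed stopping criterion: accept X once it captures at least (1 - eps) * H(Y)."""
    return h_y - i_xy <= eps * h_y

print(stop_search(h_y=1.0, i_xy=0.96, eps=0.05))  # True: within the allowed 5% of H(Y)
print(stop_search(h_y=1.0, i_xy=0.90, eps=0.05))  # False: too much uncertainty remains
```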