Finding Essential Attributes from Binary Data∗†

Endre Boros‡    Takashi Horiyama§    Toshihide Ibaraki¶    Kazuhisa Makino‖    Mutsunori Yagiura¶

March 27, 2002
Abstract We consider data sets that consist of n-dimensional binary vectors representing positive and negative examples for some (possibly unknown) phenomenon. A subset S of the attributes (or variables) of such a data set is called a support set if the positive and negative examples can be distinguished by using only the attributes in S. In this paper we study the problem of finding small support sets, a frequently arising task in various fields, including knowledge discovery, data mining, learning theory, logical analysis of data, etc. We study the distribution of support sets in randomly generated data, and discuss why finding small support sets is important. We propose several measures of separation (real valued set functions over the subsets of attributes), formulate optimization models for finding the smallest subsets maximizing these measures, and devise efficient heuristic algorithms to solve these (typically NP-hard) optimization problems. We prove that several of the proposed heuristics have a guaranteed constant approximation ratio, and we report on computational experience comparing these heuristics with some others from the literature both on randomly generated and on real world data sets.
1 Introduction
We consider the problem of analyzing a data set consisting of positive and negative examples for an unknown phenomenon. We assume that each example is represented by an n-dimensional 0-1 vector containing the values of n binary attributes. This is a typical problem setting arising and studied in a variety of fields, such as knowledge discovery, data mining, machine learning, logical analysis of data, etc. (see e.g., [1, 4, 5, 12, 23, 24, 41, 47, 50]).

∗ This work was partially supported by the Grants in Aid by the Ministry of Education, Science, Sports and Culture of Japan (Grants 09044160 and 10205211). The visit of the first author to Kyoto University in 1999 was also supported by this grant (Grant 09044160). The research of the first and third authors was supported in part by the Office of Naval Research (Grant N00014-92-J-1375). The first author also thanks the National Science Foundation (Grant IIS-0118365) for partial support.
† An extended abstract of a preliminary version of this paper appeared in the Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000) (Kwong Sak Leung, Lai-Wan Chan, Helen Meng, eds.), Lecture Notes in Computer Science 1983, pp. 133-138 (Springer Verlag, Berlin, Heidelberg, 2000).
‡ RUTCOR, Rutgers University, 640 Bartholomew Road, Piscataway, NJ 08854-8003, USA. Email: [email protected]
§ Graduate School of Information Science, Nara Institute of Science and Technology, Nara 630-0101, Japan. Email: [email protected]
¶ Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan. Email: {ibaraki,yagiura}@i.kyoto-u.ac.jp
‖ Department of Systems and Human Science, Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka 560-8531, Japan. Email: [email protected]
Let B = {0, 1} and, following [23], let us call a pair (T, F ), T, F ⊆ Bn , a partially defined Boolean function (or pdBf, in short), where T and F denote, respectively, the sets of positive and negative examples. We shall also assume T ∩F = ∅, since otherwise a positive and negative examples cannot be perfectly distinguished. Let us call a Boolean function f : Bn 7→ B an extension of (T, F ) if f (a) = 1 for all a ∈ T and f (b) = 0 for all b ∈ F . Such an extension can be interpreted as a logical separator between the sets T and F , or equivalently, as an explanation of the phenomenon represented by the data set (T, F ) (see e.g., [14, 15, 23]). An extension f of (T, F ) exists if and only if T ∩ F = ∅ holds (we refer the reader to [14] for models addressing cases when T ∩ F 6= ∅). For instance, a data vector x may represent symptoms of a patient for a disease: x1 denotes whether temperature is high (x1 = 1) or not (x1 = 0), and x2 denotes whether blood pressure is high (x2 = 1) or not (x2 = 0), etc. Establishing an extension f , which is consistent with the given data set, then amounts to finding a diagnostic explanation for the disease. It is quite common that the data set contains attributes which are irrelevant to the phenomenon under consideration, and also ones which are dependent on some of the essential attributes. Such attributes can be removed without losing information relevant to the phenomenon in consideration. It is an interesting problem to recognize these, and to find a (smallest) subset of the essential attributes, which still can explain the given data set (T, F ). Some models in learning theory, like the so called “attribute-efficient learning” consider learning problems in which the target concept depend only on a small number of essential attributes, and the objective is to find learning algorithms, the complexity of which depends only at a very low degree on the (large) number of irrelevant attributes (see e.g., [9, 37]). Finding a subset of attributes “as small as possible”, which can explain the input data, is in the spirit of Occam’s razor [11]. Restricting the analysis to such a small set not only decreases the computational cost of finding a logical separator, but frequently yields more compact logical explanations, which generalize better. A further “side-effect” is the reduction in the cost of future data collection, an equally important objective in many applications. The problem of finding the relevant (or essential) attributes in the presence of large number of irrelevant and/or dependent attributes is known also as feature extraction or feature selection. It has been an active research area for many years within statistics and pattern recognition, though most of the papers there dealt with linear regression [10] and used assumptions not completely valid for most learning algorithms [32]. Many related studies in machine learning considers this problem in conjunction with some specific learning algorithms [19, 32], following the wrapper model, in which the search for a “good” subset is wrapped around a learning algorithm, which is used to evaluate the “fitness” of the chosen subsets. In this study we follow the so called filter model, or in other words, we consider feature selection as a stand alone task, independent from the applied learning methods. In this model feature subsets are evaluated via some direct measures based on distance, entropy, etc. 
Many of the proposed algorithms in the literature are heuristics, based on some information theoretic measures (CART by [18], ID3 or C4.5 by [48], mutual information greedy by [3], Rel-FSS by [7], etc.) or on some mixture of correlation and entropy (see e.g., [28, 36]). Other methods employ an exhaustive search strategy either in a forward manner, adding relevant features one-by-one (see e.g. FOCUS [2, 3]) or in a backward manner, deleting irrelevant features one-by-one (see e.g. branch-and-bound type methods or their pruned variants [38, 44]). Another group of methods address feature selection implicitly, by computing an individual relevance weight for each of the attributes [17, 35], which can then be used in a variety of ways, for instance by a rank based feature selection. A further specialized mathematical programming based approach is proposed in [16] to find a smallest subset of the attributes which allows for a linear (threshold) separation between the two classes. Let us remark that in the majority of the above cited papers the notion of “relevance” is replaced (implicitly or explicitly) by the notion of “separation”. In other words, feature subsets
are evaluated on the basis of how well they distinguish positive and negative examples. Most of the proposed algorithms in fact solve set-covering type formulations corresponding to separation based models (cf. [13, 12, 23]) in order to find a small (possibly the smallest) subset of the attributes (a so called support set) which is enough to distinguish the known positive and negative examples. A naturally arising immediate question is whether such a small (smallest) subset of the attributes can indeed be accepted as “essential” or “relevant” for the phenomenon generating the input data set (T, F )? The fact is that every pdBf (T, F ) has a smallest (and possibly several smallest) support set, even if (T, F ) is randomly generated, and of course, for such random data sets we cannot talk about relevance or essentiality of its attributes! The widely accepted assumption in this respect in machine learning and knowledge discovery is that a “property” of a data set (T, F ) is “interesting”, and might be attributed to the phenomenon generating the data, rather than to random chance, if the appearance of this property in similarly sized random data sets has a very small probability. To understand better the practical usefulness of small attribute subsets, we analyzed the distribution of support sets in randomly generated data (see Appendix A). By providing upper and lower bounds on the expected number of support sets of size K for random data sets (T, F ), in terms of the input parameters n, |T | and |F |, we were able to confirm that indeed, the existence of a small enough support set is very unlikely the result of pure random chance. More precisely, we show that if |T ||F | À K2K ln n, then the expected number of support sets of size K is much smaller than 1. In fact this expectation gives an upper bound on the probability that at least one support set of size K exists, which goes to zero quickly, if |T ||F | grows. However, the problem of finding a smallest support set in a given data sets (T, F ) is computationally difficult, and the support sets we are able to find are often not unique (cf. [32]). Thus we need more refined measures of separation to guide the selection process. In this paper we propose a family of separation measures, based on Hamming distances between positive and negative examples, and formulate set-covering type models for finding a best feature subset. The proposed family of measures leads us to mathematical programming formulations, which are NP-hard optimization problems (just like most models in the literature). We present several efficient heuristic procedures for those hard optimization problems, including a family of polynomial time “greedy type” algorithms, with a guaranteed worst-case approximation ratio of 1 − 1/e. It can also be seen that some of the best heuristic procedures from the literature belong to this family (e.g. standard or weighted greedy from [2, 3]). In order to study the effect of the presence of many irrelevant and/or dependent variables on these algorithms, we performed computational tests both on randomly generated data, and on real world data sets from the UCI repository [43]. Experimental results show that the proposed heuristic algorithms are quite effective in finding small sets of essential attributes in the given data sets, as well as in removing irrelevant and/or dependent attributes.
2 Support sets
Let us first remark that most selection procedures to find a small set of essential features are based on the objective of trying to separate the positive examples from the negative ones as much as possible, and not on the relevance of those attributes. The most notable reason for this perhaps is the fact that relevance is very hard to define and evaluate. For a completely defined Boolean function (i.e. when T ∪ F = Bn ) there is a very well defined notion of relevance: a variable xj is relevant for f : Bn 7→ B if it appears in every Boolean formula representing f (or equivalently, if it appears in some of the prime implicants of f ). There are several quantified versions of relevance, known as power indices, or measures of influence (see e.g., 3
Chow parameters [20, 52], Banzaf indices [6], Shapley values [49], Winder preorder [51], etc.). These measures were introduced mostly in a game theoretic context to measure the influence (e.g. voting power) of the individuals or committees in various voting schemes. The above definitions and measures, however, cannot be simply generalized for partially defined Boolean functions. For instance, it is demonstrated in [32] with four different definitions of relevance that each can lead to unrealistic results on some examples, when the input space is restricted. On the other hand, measures based on separation (e.g. on the distances between the projections of the positive and negative examples) are intuitive, easy to compute and much more exact in the sense that they use only information present in the given data, and do not need additional assumptions on how the input was generated. Thus, as one of the simplest and most natural such definitions, we can arrive at the following notion, introduced in [23]. Let V = {1, 2, . . . , n} denote the set of indices of the attributes. For a subset S ⊆ V and a binary vector a ∈ BV , let a[S] ∈ BS denote the projection of a on S. Analogously, for a subset X ⊆ BV of vectors, we denote by X[S] = {a[S] | a ∈ X} the set of their projections on S. Following [23], let us call a subset S ⊆ V a support set of the pdBf (T, F ) if T [S] ∩ F [S] = ∅. Equivalently, S is said to be a support set of (T, F ) if all pairs of vectors a ∈ T and b ∈ F can be distinguished by using only the components in S, or in other words, if a[S] 6= b[S] for all a ∈ T and b ∈ F . Let us first recall (see [23]) that the problem of finding the smallest support set for a given pdBf (T, F ) is an NP-hard optimization problem. To see this let us consider the sets a∆b := {j ∈ V | aj 6= bj } for binary vectors a, b ∈ Bn . We can then note that a subset S ⊆ V is a support set if and only if S ∩ (a∆b) 6= ∅ for all a ∈ T and b ∈ F ; such a set S is also called a hitting set of the hypergraph {a∆b | a ∈ T, b ∈ F }. To see that an arbitrary hitting set problem can arise in this way, let us consider an arbitrary hypergraph H, and let us define T as the set of characteristic vectors of the hyperedges of H, and let F = {(0, 0, . . . , 0)}. Then, with these definitions, we have H = {a∆b | a ∈ T, b ∈ F }, and a subset S ⊆ V is a hitting set of H if and only if it is a support set of (T, F ). Thus, finding the smallest support set is not easier than finding the smallest hitting set, which is known to be NP-hard [27]. It is more customary to formulate the above problem as a set-covering problem: X (SC) minimize yj j∈V
    (SC)    minimize    Σ_{j∈V} y_j
            subject to  Σ_{j∈a∆b} y_j ≥ 1    for all a ∈ T and b ∈ F,                (1)
                        y_j ∈ B              for all j ∈ V.
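The correspondence between support sets and hitting sets is easy to make concrete. The following small Python sketch (ours, not code from the paper) builds the hypergraph {a∆b | a ∈ T, b ∈ F} underlying (1) and checks the support set property, assuming T and F are given as lists of 0-1 tuples of equal length:

def difference_sets(T, F):
    # The hypergraph {a delta b : a in T, b in F} underlying formulation (1);
    # every support set must intersect ("hit") each of these attribute sets.
    return [{j for j in range(len(a)) if a[j] != b[j]} for a in T for b in F]

def is_support_set(T, F, S):
    # S (a set of attribute indices) is a support set iff it hits every difference set,
    # i.e. iff the projections T[S] and F[S] are disjoint.
    return all(S & d for d in difference_sets(T, F))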
It follows from the above discussion that a binary vector y ∈ B^n is a feasible solution to (1) if and only if it is the characteristic vector of a support set S, and then |S| = Σ_{j∈V} y_j. Let us denote by τ(T, F) the optimum value of the above minimization problem, that is the size of a smallest support set for (T, F), and let τ∗(T, F) denote the optimum value of the continuous relaxation of (1), in which the integrality conditions y_j ∈ B are replaced by the inequalities y_j ≥ 0:

    (SC∗)   minimize    Σ_{j∈V} y_j
            subject to  Σ_{j∈a∆b} y_j ≥ 1    for all a ∈ T and b ∈ F,                (2)
                        y_j ≥ 0              for all j ∈ V.
A well-known polynomial time approximation algorithm for this problem is the standard greedy method, in which we start with y = (0, 0, . . . , 0), and iteratively, as long as there are unsatisfied inequalities, switch the variable yj to yj = 1 with which we can satisfy the largest number of unsatisfied inequalities in (1). Let us denote by τ G (T, F ) the size of the support set obtained by this greedy procedure. Then, the following inequalities are known to hold [21, 39], τ ∗ (T, F ) ≤ τ (T, F ) ≤ τ G (T, F ) ≤ (1 + ln D)τ ∗ (T, F ), (3) P where D = maxj∈V a∈T,b∈F,aj 6=bj 1 denotes the maximum number of non-zeros in a column of (1). In other words, the approximation ratio of the greedy algorithm is (1 + ln D). It is also known that an essentially better approximation in polynomial time is not possible, unless P=NP [25]. Let us remark that, as it was pointed out in [3], the real interesting cases for us are when τ (T, F ) ¿ n = |V |, in which case the true complexity of (1) is not (yet) known. For instance, it follows from the results of [33] that if |a∆b| ≥ ²n for all a ∈ T and b ∈ F for some ² > 0, then there exists a polynomial time algorithm A = A(c, ²) for every c > 0 producing a support set of size τ A (T, F ) ≤ c(ln D)τ (T, F ). It is still unknown however, if (1) can be solved in polynomial time or if it is approximable within a constant factor in this case. Let us add that if τ (T, F ) ≤ k holds, then (1) can be solved in O(nk ) time by simply enumerating all subsets of size at most k. Thus, (1) is unlikely to be NP-hard for τ (T, F ) = o(n/ log n), unless all NP-hard problems can be solved in O(no(n/ log n) ) time. There are some other issues associated with the problem of finding a small (possibly smallest) support set. First of all, it is unclear how many of such supports sets exists in a typical example, and if there are many, how to justify choosing one and not the other? Secondly, we assume that the inessential attributes are either random noise, or dependent on the essential ones. It is also unclear how well these inessential attributes are eliminated when we choose a small (or smallest) support set? We shall study these question in the following sections and provide a probabilistic analysis of the distribution of small support sets in the Appendix A.
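As an illustration, here is a minimal Python sketch of the standard greedy method just described (it is our own rendering, not the authors' implementation), assuming T ∩ F = ∅ and that examples are 0-1 tuples:

def greedy_support_set(T, F):
    # Standard greedy for the set-covering formulation (1): repeatedly pick the
    # attribute that separates the largest number of still-unseparated pairs (a, b).
    n = len(T[0])
    uncovered = [(a, b) for a in T for b in F]
    S = set()
    while uncovered:
        best = max(range(n), key=lambda j: sum(1 for a, b in uncovered if a[j] != b[j]))
        S.add(best)
        uncovered = [(a, b) for (a, b) in uncovered if a[best] == b[best]]
    return S

The set returned has size τ^G(T, F), which by (3) is within a factor (1 + ln D) of the optimum.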
3 Approximation algorithms for maximum separation
Even though the probabilistic analysis suggests (see Appendix A) that using the smallest support set as the set of essential attributes is likely to work well, we still face several difficulties in implementing this approach. As it can be seen from Table 7, the bounds on the size of data set needed to “avoid randomly existent support sets of size K” increase very fast with K. This implies that to achieve good results, one has to find a support set the size of which is very close to the smallest possible one. Given the cited negative result of [25], this seems to be difficult to achieve because the approximation ratio (1 + ln D) of (3) can be as large as 5-6 even for relatively small data sets with a few hundred records. Even the improved approximations of [33] do not seem to help, since the required density conditions may not be realistic for practical examples. To overcome these difficulties one needs a finer evaluation of support sets, hoping to make meaningful distinctions between different ones of the same size. We propose here a family of such finer measures of separation, based on the Hamming distances between the pairs of positive and negative examples in the given data set. We shall represent in the sequel the characteristic vector of a subset S ⊆ V as y = χS ∈ BV defined by yj = 1 if j ∈ S and 0 otherwise. Then we can write the Hamming distance between the projections a[S] and b[S] of binary vectors a, b ∈ BV as X dy (a, b) := dH (a[S], b[S]) = yj , j∈a∆b
that is, as a linear function of y. For a given S ⊆ V, let us consider separation measures of the form

    Ψ(y) = Σ_{a∈T, b∈F} ψ_{a,b} ( Σ_{j∈a∆b} y_j ),                                    (4)

where the functions ψ_{a,b} : R_+ → R have the following properties for all a ∈ T and b ∈ F:

    ψ_{a,b}(0) = 0,                                                                   (5)
    ψ_{a,b}(z) is non-decreasing over R_+,                                            (6)
    ψ_{a,b}(z) is a concave function of z ∈ R_+.                                      (7)
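A small Python illustration may help fix ideas; it is a sketch of ours under the assumption that T and F are lists of 0-1 tuples, and it evaluates Ψ(S) for a concave, non-decreasing ψ with ψ(0) = 0 (here ψ(z) = min(z, 1), which yields the counting measure θ introduced in the next section):

def projected_distance(a, b, S):
    # d_y(a, b): number of positions in S where a and b differ
    return sum(1 for j in S if a[j] != b[j])

def separation_measure(T, F, S, psi=lambda z: min(z, 1)):
    # Psi(S) = sum over all pairs (a, b) of psi(d_y(a, b)), cf. (4)
    return sum(psi(projected_distance(a, b, S)) for a in T for b in F)

# Example: two positive and one negative example over V = {0, 1, 2}
T = [(1, 0, 1), (1, 1, 0)]
F = [(0, 0, 1)]
print(separation_measure(T, F, S=[0]))   # both pairs are separated by attribute 0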
To further simplify notation, we sometimes write Ψ(S) instead of Ψ(χ^S). Properties (5)-(7) tell that Ψ(S) increases with the Hamming distance between a[S] and b[S], a ∈ T and b ∈ F, and that the larger the distance between a[S] and b[S] the smaller the increase. We shall adopt this Ψ(S) as the index of separation power of S ⊆ V. Examples of such Ψ(S) indices will be discussed in the next section. Given Ψ(S), we shall consider the following optimization problem:

    (PK)    maximize    Ψ(y)
            subject to  Σ_{j∈V} y_j ≤ K,
                        y_j ∈ B    for all j ∈ V,
where K is a parameter. We shall solve problem (PK) for K = 1, 2, . . ., until the subset S ⊆ V for which χ^S is an optimal solution of (PK) becomes a support set. The next lemma tells that, by choosing the functions ψ_{a,b}, a ∈ T, b ∈ F appropriately, we can ensure that the optimal solution to (PK) will be a support set, whenever K is large enough.

Lemma 1 Let us assume that K ≥ τ(T, F) and that the functions ψ_{a,b}, a ∈ T, b ∈ F, satisfy the conditions (5)-(6). Let us assume further that for every u ∈ T and v ∈ F the inequality

    Σ_{a∈T, b∈F} [ψ_{a,b}(n) − ψ_{a,b}(1)] < ψ_{u,v}(n)                               (8)

holds. Then, any subset S ⊆ V, for which χ^S is an optimal solution to (PK), is a support set of (T, F).

Proof: Consider two subsets S, S' ⊆ V, such that |S| ≤ K and S is a support set of (T, F), while S' is not a support set, that is u[S'] = v[S'] for some u ∈ T and v ∈ F. Then, on the one hand, we have by (6) that

    Ψ(S) ≥ Σ_{a∈T, b∈F} ψ_{a,b}(1),

since Σ_{j∈a∆b} χ^S_j ≥ 1 for all a ∈ T, b ∈ F. On the other hand, we have by (5) and (6) that

    Ψ(S') ≤ Σ_{a∈T, b∈F, (a,b)≠(u,v)} ψ_{a,b}(n),

since Σ_{j∈u∆v} χ^{S'}_j = 0 and Σ_{j∈a∆b} χ^{S'}_j ≤ n for all a ∈ T, b ∈ F. Thus, Ψ(S) > Ψ(S') follows by (8). Since K ≥ τ(T, F), there exists a support set with cardinality not more than K, and the lemma follows immediately.
Let us note that conditions (5), (6) and (8) together ensure that any support set of (T, F) has a higher evaluation than any non-support set. In this case, it follows from Lemma 1 that, by solving (PK) for K = 1, 2, . . ., we can find a minimum size support set to (T, F). Thus, the NP-hardness of (SC) implies that (PK) is also NP-hard. For this reason, we shall consider here a simple greedy heuristic for (PK).

Ψ-GREEDY
Input: A pdBf (T, F), and a constant K.
Step 0: Set k = 0 and S^k = ∅.
Step 1: If k = K then STOP, otherwise select i ∈ V \ S^k for which Ψ(S^k ∪ {i}) is the largest.
Step 2: Set S^{k+1} = S^k ∪ {i}, let k = k + 1, and return to Step 1.

It is immediate to see that this greedy procedure generates a sequence of subsets S^1 ⊂ S^2 ⊂ · · · ⊂ S^K such that |S^i| = i. Moreover, increasing K to some K' > K does not change the first K subsets generated. Hence one can just let it run until S^k becomes a support set of (T, F), which is the more customary use of this algorithm. We shall see later that by choosing Ψ appropriately, this simple heuristic can coincide with several well known procedures from the literature. We can also show that the above procedure approximates the optimum value of (PK) with a constant guaranteed performance ratio.

Theorem 1 If Ψ is given by (4) and ψ_{a,b}, a ∈ T, b ∈ F satisfy conditions (5)-(7), then we have

    Ψ(S^K) / Ψ(S^OPT) ≥ 1 − 1/e,                                                      (9)
where y^OPT = χ^{S^OPT} is an optimal solution to (PK), S^K is the subset produced by Ψ-GREEDY, and e = 2.71828 . . . is the base of the natural logarithm.

We derive the above theorem from the results of [22, 45] in which (9) is shown for a non-decreasing submodular function Ψ such that Ψ(0, 0, . . . , 0) = 0 (see also [30, 34] for generalizations and more details). Let us recall (see e.g. [45]) that Ψ is called submodular if Ψ(S) + Ψ(S') ≥ Ψ(S ∪ S') + Ψ(S ∩ S') holds for all subsets S, S' ⊆ V. Since conditions (5) and (6) imply that the function Ψ in the above theorem is non-decreasing and that Ψ(0, 0, . . . , 0) = 0, Theorem 1 will follow from the above cited results of [22, 45] if we can show that Ψ is also submodular.

Lemma 2 If Ψ is given by (4) and condition (7) holds for ψ_{a,b}, a ∈ T, b ∈ F, then Ψ is a submodular function.

Proof: Let us consider two arbitrary subsets S and S' of V, and let us define

    w_{a,b}   = Σ_{j∈a∆b} χ^S_j,
    w'_{a,b}  = Σ_{j∈a∆b} χ^{S'}_j,
    w^∪_{a,b} = Σ_{j∈a∆b} χ^{S∪S'}_j,
    w^∩_{a,b} = Σ_{j∈a∆b} χ^{S∩S'}_j.

Then, due to linearity, we can easily see that w_{a,b} + w'_{a,b} = w^∪_{a,b} + w^∩_{a,b}, and hence

    w_{a,b} − w^∩_{a,b} = w^∪_{a,b} − w'_{a,b} ≥ 0                                    (10)

holds for all a ∈ T, b ∈ F. Since the functions ψ_{a,b} are concave by (7), and since w^∩_{a,b} ≤ w'_{a,b} by their definitions, (10) implies that

    ψ_{a,b}(w_{a,b}) − ψ_{a,b}(w^∩_{a,b}) ≥ ψ_{a,b}(w^∪_{a,b}) − ψ_{a,b}(w'_{a,b}),

from which

    ψ_{a,b}(w_{a,b}) + ψ_{a,b}(w'_{a,b}) ≥ ψ_{a,b}(w^∪_{a,b}) + ψ_{a,b}(w^∩_{a,b})
follows for all a ∈ T and b ∈ F . Summing up these inequalities, we obtain Ψ(S) + Ψ(S 0 ) ≥ Ψ(S ∪ S 0 ) + Ψ(S ∩ S 0 ), proving that Ψ is indeed submodular. We shall also consider the continuous relaxation of problem (PK ), because it provides further options for efficient approximations. Let U = [0, 1] denote the unit interval, and consider the problem obtained from (PK ), in which the variables are not restricted to be binary: (P∗K )
    maximize    Ψ(y)
    subject to  Σ_{j∈V} y_j ≤ K,
                y_j ∈ U    for all j ∈ V.
It is immediate to see that this problem is a concave maximization problem over a convex domain whenever Ψ is given by (4) and ψa,b , a ∈ T , b ∈ F satisfy conditions (7). Hence, whenever the functions ψa,b and their (quasi) derivatives are all easily computable (which will be the case for all examples we consider in this paper), (P∗K ) can efficiently be solved, and there is a wealth of robust and efficient algorithms available for it (see e.g. [46]). Furthermore, if Ψ is linear in y, then even (PK ) is trivial to solve. We can utilize the computational tractability of (P∗K ) in a number of ways. First of all, the components 0 ≤ yj∗ ≤ 1 of the optimal solution y ∗ ∈ UV of (P∗K ) can be regarded directly as a measure of importance or relevance of the corresponding attribute in (T, F ), with the natural interpretation that the closer yj∗ is to 1, the more relevant the jth attribute is (cf. [17, 35]). A second idea is to use the highest ranking attributes to form a support set. In a more refined way, we may accept the top ranking one, and then recompute the fractional relevance weights for the rest of the attributes; this process is repeated until the obtained set S satisfies |S| = K for a given constant K. We shall call such procedures as rounding algorithms, since they can also be interpreted as rounding the fractional components of y ∗ to integral ones, in some specific way. Ψ-ROUNDING Input: A pdBf (T, F ) and a constant K. Step 0: Set S = ∅. Step 1: Solve (P∗K ) with the restrictions yj = 1 for all j ∈ S already added, and let y ∗ ∈ UV denote the obtained optimal solution. Step 2: Choose j ∈ V \ S for which yj∗ is the largest, and set S = S ∪ {j}. Step 3: If |S| = K, then STOP, otherwise return to Step 1.
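The following Python sketch outlines Ψ-ROUNDING in code form. It is our illustration, not the authors' implementation: the function solve_relaxation is a placeholder for whatever convex or linear programming routine is used to solve (P∗K) with some variables fixed to 1, and it is assumed to return the optimal vector y∗ as a dict {j: value}.

def psi_rounding(T, F, K, solve_relaxation):
    # Repeatedly solve the relaxation and round the largest remaining component to 1.
    S = set()
    while len(S) < K:
        y_star = solve_relaxation(T, F, K, fixed=S)
        j = max((j for j in y_star if j not in S), key=lambda j: y_star[j])
        S.add(j)
    return S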
This idea can be further improved by accepting not the relevance of attributes with a high y∗_j value, but rather the irrelevancy of attributes with a small y∗_j value. Let us call this type of algorithm stingy. For such an idea to work better, we need to restrict further the feasible region of the problem, by enforcing stronger separation between T and F. Let us note that adding linear inequalities and equalities to (P∗K) does not diminish the nature and computational tractability of the problem.

    (P+K)   maximize    Ψ(y)
            subject to  Σ_{j∈V} y_j ≤ K,
                        Σ_{j∈a∆b} y_j ≥ 1    for all a ∈ T and b ∈ F,
                        y_j = 0              for all j ∈ Z,
                        y_j ∈ U              for all j ∈ V.
Here the set Z stands for the set of variables already decided to be fixed at 0.

Ψ-STINGY
Input: A pdBf (T, F).
Step 0: Set K := 1 and Z := ∅.
Step 1: Solve the optimization problem (P+K), and let y∗ ∈ U^V denote the obtained optimal solution. If this problem has no feasible solution, then set K := K + 1, and return to Step 1.
Step 2: If K + |Z| = |V|, then output y∗ and STOP. Otherwise, sort the components y∗_j, j ∉ Z, such that y∗_{i_1} ≤ y∗_{i_2} ≤ · · · ≤ y∗_{i_{|V\Z|}}, and set k to be the largest integer satisfying

    Σ_{j=1}^{k} y∗_{i_j} < 1.
Step 3: Set Z := Z ∪ {i_1, i_2, . . . , i_k}, and return to Step 1.

Let us remark that this stingy algorithm always returns a support set for (T, F), and runs in polynomial time, whenever (P+K) can be solved in polynomial time. Furthermore, this algorithm returns a minimal support set, unlike all other heuristics mentioned earlier.

Theorem 2 Algorithm Ψ-STINGY stops in O(n) iterations, and returns a binary vector y∗ such that the set S∗ = {j | y∗_j = 1} is a minimal support set of the pdBf (T, F), assuming that T ∩ F = ∅, Ψ is given by (4), and that ψ_{a,b}, a ∈ T, b ∈ F satisfy condition (6).

Proof: Let us observe first that if (P+K) is feasible, and the algorithm is not stopping in Step 2, then K < |V \ Z|, and thus k ≥ 1 must hold in Step 2, since otherwise Σ_{j∈V} y∗_j = Σ_{j∈V\Z} y∗_j = |V \ Z| > K, contradicting the first constraint of (P+K). Thus, in every iteration, either K or |Z| is increased strictly, and hence the algorithm must terminate in O(n) iterations, since K + |Z| ≤ |V|. Let us note next that when K = |V \ Z| is reached and the algorithm stops, then y∗_j = 1 for all j ∉ Z holds automatically, due to the monotonicity conditions of (6).
It remains to show that the set V \ Z is a minimal support set of (T, F) at termination. For this, let us show first that V \ Z is always a support set of (T, F). This is certainly true as long as Z = ∅, since T ∩ F = ∅ is assumed. Let us assume indirectly that there is an iteration when V \ Z is a support set, but V \ (Z ∪ {i_1, . . . , i_k}) is not anymore. This implies that there must exist a pair u ∈ T and v ∈ F such that u∆v ⊆ Z ∪ {i_1, . . . , i_k}. For this pair however we have

    Σ_{j∈u∆v} y∗_j ≤ Σ_{j∈Z} y∗_j + Σ_{j=1}^{k} y∗_{i_j} = Σ_{j=1}^{k} y∗_{i_j} < 1

according to the selection of k in Step 2, contradicting the fact that y∗ is feasible in (P+K). This contradiction proves that the set V \ Z remains a support set in the course of the algorithm.

To show finally that V \ Z is a minimal support set at termination, let us consider the last iteration in which the algorithm touches Step 2. At this step we have K < |V \ Z|, since

    Σ_{j∉Z} y∗_j ≥ Σ_{j∈(a∆b)\Z} y∗_j > 0

must hold, for any a ∈ T and b ∈ F, due to the selection of k in this last step. After this step Z does not change, and (P+K) remains infeasible, while K increases one-by-one, in Step 1, to K = |V \ Z|. If there were a proper subset S ⊂ V \ Z, |S| < |V \ Z|, which is a support set of (T, F), then (P+K) would become feasible as soon as K ≥ |S|, that is certainly for some value K ≤ |V \ Z| − 1. This final contradiction shows that a proper subset of V \ Z cannot be a support set at termination, concluding our proof.
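A compact Python sketch of Ψ-STINGY follows. It is ours, not the authors' code, and it assumes a placeholder routine solve_restricted that solves (P+K) with the variables in Z fixed to 0 and returns a dict {j: y∗_j}, or None if the problem is infeasible; by Theorem 2 the drop step removes at least one index whenever the algorithm does not stop.

def psi_stingy(T, F, n, solve_restricted):
    K, Z = 1, set()
    while True:
        y_star = solve_restricted(T, F, K, Z)
        if y_star is None:               # (P+K) infeasible: allow one more attribute
            K += 1
            continue
        if K + len(Z) == n:              # Step 2 stopping rule
            return {j for j in range(n) if j not in Z}
        # Step 2/3: discard the smallest components whose total stays below 1.
        free = sorted((j for j in range(n) if j not in Z), key=lambda j: y_star[j])
        total, drop = 0.0, []
        for j in free:
            if total + y_star[j] >= 1:
                break
            total += y_star[j]
            drop.append(j)
        Z.update(drop)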
4 Various measures of separation
In this section we consider several examples for Ψ satisfying the conditions (5)-(8). All of these examples are based on the Hamming distances of pairs of vectors a ∈ T and b ∈ F. Given a subset S ⊆ V with its characteristic vector y = χ^S, let us recall that the Hamming distance of the projections of two vectors a, b ∈ B^n on B^S can be written as

    d_y(a, b) = Σ_{j∈a∆b} y_j.                                                        (11)
First, we consider three natural measures: the average Hamming distance between the projections of T and F, the number of separated pairs of positive and negative examples, and a weighted version of the latter:

    σ(y)   = (1/(|T||F|)) Σ_{a∈T, b∈F} d_y(a, b),                                     (12)
    θ(y)   = Σ_{a∈T, b∈F} min{d_y(a, b), 1},                                          (13)
    θ_w(y) = Σ_{a∈T, b∈F} min{d_y(a, b), 1} / (|a∆b| − 1 + ε),                        (14)
where ε > 0 is a fixed constant, small enough. The weights in the third function θ_w(y) are to indicate the fact that the smaller a∆b is, the harder it is to separate a from b. Such a weighting
was proposed in [2]. Here we modified the denominator by ² in order to prevent a division by zero (T ∩ F = ∅ is assumed, as always). Proposition 1 Each of the functions σ(y), θ(y) and θw (y) is a separation measure of the form (4), and each satisfies the conditions (5)–(7). Furthermore, θ(y) and θw (y) satisfy (8). Proof: Immediate from the above definitions. Consequently, the subsets that maximize θ or θw are automatically support sets by Lemma 1, and the greedy algorithms for all three measures have approximation ratio 1 − 1/e by Theorem 1. Let us add that θ-GREEDY is exactly the standard greedy algorithm we already cited in Section 2 for the set-covering problem, while θw -GREEDY is essentially the weighted greedy procedure as introduced in [2, 3]. One possible problem with simple greedy procedures, like θ-GREEDY, is that ties are quite frequent. Since no meaningful tie-breaking is built into these procedures, they are not very robust. For instance, θ-GREEDY shows considerable variations in the produced support sets, if it is applied to the same input data with differently permuted attribute set. The following family of separation measures is an attempt to refine the above measures by providing an easy to compute, meaningful tie-breaking. Given y = χS ∈ BV , define hk (y) as the number of pairs of positive and negative examples, which are separated k-fold by S: hk (y) = |{(a, b) : a ∈ T, b ∈ F, dy (a, b) = k}|, k = 0, 1, . . . , n. Clearly
    Σ_{k=0}^{n} h_k(y) = |T||F|                                                       (15)

holds for every y ∈ B^V; furthermore, h_k(y) = 0 if k > |S| = Σ_{j=1}^{n} y_j.
Let us introduce the vector h(y) = (h_0(y), h_1(y), . . . , h_n(y)) as a reasonable measure of the separation in the following sense: For y, y' ∈ B^V, we say that y separates T and F lexicographically better than y', if the vector h(y) is lexicographically smaller than h(y'); in other words, if there exists an index i (0 ≤ i ≤ n) such that h_k(y) = h_k(y') for all k < i, and h_i(y) < h_i(y'). Let us further note that the latter conditions are equivalent to

    Σ_{k=0}^{n} h_k(y) α^k < Σ_{k=0}^{n} h_k(y') α^k

for all sufficiently small α > 0. This motivates the separation measure

    φ_α(y) = Σ_{a∈T, b∈F} (1 − α^{d_y(a,b)}) = |T||F| − Σ_{k=0}^{n} h_k(y) α^k,       (16)

with y separating lexicographically better than y' exactly when φ_α(y) > φ_α(y'), where α ≥ 0 is small enough. Note that φ_α(y) is a generalization of the lexicographic order, since α may be set to any nonnegative value. In particular, if α = 0, then φ_0(y) = |T||F| − h_0(y) counts the number of pairs of positive and negative examples which have Hamming distance at least 1, and hence φ_0(y) = θ(y) for all y ∈ B^V. This shows that φ_α is indeed a generalization and a refinement of θ.
Proposition 2 The function φ_α(y) is a separation measure of the form (4), and it satisfies conditions (5)-(7). Moreover, it satisfies (8) as well, if α ≥ 0 is small enough.

Proof: Follows by elementary calculations from the definitions.

Thus, any vector that maximizes φ_α corresponds to a support set by Lemma 1, and the procedure φ_α-GREEDY has a guaranteed approximation ratio of 1 − 1/e by Theorem 1. Furthermore, the measure φ_α can be modified (analogously to the derivation of θ_w from θ in [2]) to the following weighted version:

    φ^w_α(y) = Σ_{a∈T, b∈F} (1 − α^{d_y(a,b)}) / (|a∆b| − 1 + ε).                     (17)
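For concreteness, the two quantities defined above can be computed directly from the data; the short Python sketch below is our illustration (assuming 0-1 tuples, and S given as a collection of attribute indices):

def h_vector(T, F, S, n):
    # h_k(y): number of pairs (a, b) in T x F separated exactly k-fold by S
    h = [0] * (n + 1)
    for a in T:
        for b in F:
            h[sum(1 for j in S if a[j] != b[j])] += 1
    return h

def phi_alpha(T, F, S, alpha=0.001):
    # phi_alpha(y) = sum over pairs of (1 - alpha^d_y(a,b)), cf. (16)
    return sum(1 - alpha ** sum(1 for j in S if a[j] != b[j]) for a in T for b in F)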
As we shall see from the computational studies, φ_α is more robust than θ, and this weighted version φ^w_α does not produce any noticeable difference in performance. Let us remark finally that the minimum Hamming distance

    ρ(y) = min_{a∈T, b∈F} d_y(a, b)                                                   (18)
is also a reasonable measure of separation. However, ρ is not in the form of (4), resulting in substantial differences. For instance, we can show that no constant-factor approximation algorithm exists for (PK ) with Ψ = ρ, unless P=NP. Theorem 3 Unless P=NP, there exists no polynomial-time r(N )-approximation algorithm for problem (PK ) with Ψ = ρ, where N denotes the problem size and 0 < r(N ) ≤ 1 for all N . Proof: Let r : Z+ 7→ (0, 1] be an arbitrary function over the set of non-negative integers, and assume indirectly that there exists a polynomial-time r(N )-approximation algorithm A. We show that, from A, we could construct a polynomial time algorithm B for the set-covering problem, which is known to be NP-hard [27], hence proving the theorem. Let us note that the set-covering problem is the same as finding the smallest K for which the objective function Ψ = ρ of (PK ) is at least 1, since, as we noted earlier, any set-covering problem can be cast as a minimum support set problem. Therefore, to solve an instance of the set covering problem, we could cast it as a minimum support set problem, and apply algorithm A to problem (PK ) with Ψ = ρ for K = 1, 2, . . . , |V |. Let N denote the input size of this problem, let y K,A ∈ BV denote the corresponding solutions, produced by A, and let y K,∗ denote the corresponding optimal solution. Since A is assumed to be an r(N )-approximation algorithm, we must have ρ(y K,A ) ≥ r(N )ρ(y K,∗ ), since the size of the problem does not essentially depend on K. Thus, for each K, for which the optimum value of (PK ) is positive, algorithm A outputs a solution y K,A with ρ(y K,A ) ≥ 1, since ρ is integer valued, and r(N ) > 0. Letting K ∗ be the smallest K such that ρ(y K,A ) ≥ 1, we can ∗ easily see that y K ,A is an optimum solution of the set covering problem. In concluding this section, we note that problems (P∗K ) with Ψ(y) = θ(y), θw (y), φα (y), or φw α (y) are all concave maximization problems, while (P∗K ) with Ψ(y) = σ(y), or ρ(y) can be formulated as linear programming problems.
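To illustrate the last remark, the relaxation (P∗K) with the linear measure σ can be handed to any LP solver. The sketch below uses scipy.optimize.linprog and is our own illustration (the experiments in the paper were run with AMPL/MINOS and CPLEX, not with this code):

import numpy as np
from scipy.optimize import linprog

def sigma_relaxation(T, F, K):
    # (P*_K) for sigma: maximize sum_j c_j y_j with c_j = number of pairs (a, b) split by
    # attribute j (the 1/|T||F| factor does not change the optimizer), subject to
    # sum_j y_j <= K and 0 <= y_j <= 1.  linprog minimizes, hence the sign flip.
    T, F = np.asarray(T), np.asarray(F)
    n = T.shape[1]
    c = (T[:, None, :] != F[None, :, :]).sum(axis=(0, 1))
    res = linprog(-c, A_ub=np.ones((1, n)), b_ub=[K], bounds=[(0, 1)] * n)
    return res.x    # fractional relevance weights y*_j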
5 Computational results
We conducted a number of computational experiments to compare the above methods, and to evaluate their performance in various ways. In our experiments we used mostly randomly generated data sets, but we also conducted computational studies with several of the publicly available machine learning data sets from the Irvine repository [43]. In a first batch of our experiments, we generated random pdBfs (T, F ) with known support sets, and compared the performance of heuristics from Section 3 with the various measures introduced in Section 4. In this way, we also gained an implicit comparison to several other approaches, since in [2, 3] θw -GREEDY was compared to several other methods, and was reported as one of the best. In generating random problem instances, we assume that T and F are the results of random sampling from some real phenomenon that depends on an (unknown) subset of the observed attributes. Thus, in particular, we assume in our experiment the existence of a Boolean function f : BH 7→ B for some subset H ⊆ V , such that f (a[H]) = 1 for a ∈ T and f (b[H]) = 0 for b ∈ F . We shall say that f is the hidden logic behind the pdBf (T, F ), and H is the set of hidden attributes. We shall also assume that the rest of the variables are either dependent attributes D ⊆ V or random noise (i.e. irrelevant) attributes R ⊆ V . Thus, V =H ∪D∪R is assumed to be a partition, where for every k ∈ D there exists a Boolean function gk : BH 7→ B such that xk = gk (x[H]) holds for all x ∈ T ∪ F , while for indices k ∈ R such a mapping does not exist (if T ∪ F is large enough). Clearly, H is a support set of any pdBf (T, F ) generated in the above model, though it may not be the only one, and it may not even be the (unique) smallest one, depending on the size of the sample set T ∪ F ⊆ BV . In the following, we compare heuristic algorithms according to as how close are the obtained support sets to the hidden set H. We shall study the effect of different measures of separation, and the problem of how large the data set |T ∪ F | should be in order to have H as one of the “best” support sets.
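The random data used below can be mimicked with a few lines of Python. The sketch is ours (not the authors' generator) and covers only the simplest setting of Section 5.1, namely a random hidden logic over a few attributes padded with purely random (irrelevant) ones and no dependent attributes:

import random

def random_instance(n_hidden=6, n_noise=14, m=100, seed=0):
    # Random hidden Boolean function f over the first n_hidden attributes; each sampled
    # vector is classified by f on its hidden part and padded with random noise attributes.
    rng = random.Random(seed)
    truth_table = {}
    T, F = [], []
    for _ in range(m):
        hidden = tuple(rng.randint(0, 1) for _ in range(n_hidden))
        value = truth_table.setdefault(hidden, rng.randint(0, 1))
        x = hidden + tuple(rng.randint(0, 1) for _ in range(n_noise))
        (T if value == 1 else F).append(x)
    return T, F

Since the class of a vector is determined by its hidden part, T and F are automatically disjoint, and the hidden attribute set H = {0, . . . , n_hidden − 1} is a support set by construction.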
5.1 Comparing measures of separation
In our first set of experiments, we studied the quality of different measures θ, θw , φα , φw α by solving the continuous relaxation (P∗K ), for random instances, with |H| = 6, |R| = 14 and D = ∅. We used α = e−5 in (16) and (17), and ² = 10−3 in (14) and (17). ∗ For each instance and measure Ψ ∈ {θ, θw , φα , φw α }, we solved problem (PK ), and used the ∗ V components of the optimal solution y ∈ U as weights of relevance of the attributes. For each instance-measure pair we ran 2 experiments, one with K = 3 as a lower estimate on the true support set size, and the other with K = 9 as an upper estimate on the same. (In practice such bounds, perhaps even tighter ones, can easily be obtained: Via linear programming one can find a lower bound on the smallest value of K for which a support set of size K exists, while any of the described heuristics provide an upper bound on that.) To compare the quality of four measures, we applied a standard method known as simple averaged precision (see e.g. [29]), which is used in information retrieval. That is, first rank all the attributes in the decreasing order of their weights: yi∗1 ≥ yi∗2 ≥ · · · ≥ yi∗n , and then evaluate how close are the attributes in H to the top of this list. More precisely, if j1 < j2 < · · · < jh are the indices (in the above order) of the attributes of H, H = {ij1 , ij2 , . . . , ijh }, 13
then, we compute the ratios

    r_k = k / j_k,   k = 1, 2, . . . , h,

and their average r̃. Clearly, the closer these values are to 1, the better the relevance weights reflect the hidden logic. We have generated problems with sample sizes |T| + |F| = 25, 50, 75, 100. To avoid the influence of possible ties, for each instance we first generated a random subset H ⊆ V = {1, 2, . . . , 20}, and then a random Boolean function f : B^H → B with distribution Prob(f(x) = 1) = 0.5 independently for all x ∈ B^H. Then we have drawn sample vectors x ∈ B^V with uniform distribution, and divided them into T and F according to the values f(x[H]). We have tabulated the results in Table 1, in which each value represents the average over 50 instances of the same size and type. It can be seen from Table 7 for n (= |V|) = 20 and |H| = 6 that instances with |T| + |F| = 25 examples are very unlikely to have a unique support set of size 6, while instances with |T| + |F| = 100 examples are much more likely to have H as their only support set of this size. Table 1 also includes, as a base line comparison, the expected values of r_k and r̃ for a randomly selected permutation π of the set V.
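The averaged precision is easy to compute from any vector of relevance weights; the following small Python sketch (ours, for illustration) does exactly this:

def averaged_precision(weights, H):
    # Rank attributes by decreasing weight and average r_k = k / j_k over the hidden set H.
    order = sorted(range(len(weights)), key=lambda j: -weights[j])
    positions = sorted(order.index(j) + 1 for j in H)      # j_1 < j_2 < ... < j_h
    return sum(k / jk for k, jk in enumerate(positions, start=1)) / len(H)

# Example: the hidden attributes 0 and 3 are ranked first and third
print(averaged_precision([0.9, 0.6, 0.1, 0.5], H=[0, 3]))  # (1/1 + 2/3) / 2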
|T ∪ F |
25
50
75
100
r1 r2 r3 r4 r5 r6 r˜ r1 r2 r3 r4 r5 r6 r˜ r1 r2 r3 r4 r5 r6 r˜ r1 r2 r3 r4 r5 r6 r˜
π 0.53 0.42 0.38 0.36 0.35 0.34 0.40 0.53 0.42 0.38 0.36 0.35 0.34 0.40 0.53 0.42 0.38 0.36 0.35 0.34 0.40 0.53 0.42 0.38 0.36 0.35 0.34 0.40
θ 0.61 0.54 0.46 0.42 0.38 0.34 0.46 0.72 0.61 0.51 0.45 0.43 0.37 0.52 0.75 0.67 0.62 0.55 0.50 0.41 0.58 0.93 0.82 0.78 0.65 0.58 0.51 0.71
K θw 0.63 0.57 0.50 0.43 0.37 0.35 0.47 0.74 0.62 0.54 0.49 0.43 0.39 0.53 0.86 0.78 0.69 0.61 0.50 0.44 0.65 0.90 0.86 0.82 0.68 0.59 0.49 0.72
=3 φα 0.74 0.69 0.56 0.43 0.40 0.36 0.53 0.83 0.70 0.62 0.51 0.46 0.40 0.59 0.96 0.82 0.75 0.70 0.60 0.50 0.72 0.98 0.92 0.89 0.79 0.71 0.55 0.81
φw α 0.74 0.68 0.54 0.43 0.39 0.36 0.52 0.84 0.69 0.61 0.52 0.46 0.41 0.59 0.95 0.84 0.74 0.67 0.59 0.49 0.71 0.98 0.88 0.84 0.78 0.67 0.55 0.78
θ 0.60 0.54 0.50 0.42 0.38 0.34 0.46 0.62 0.52 0.49 0.47 0.42 0.38 0.48 0.64 0.59 0.60 0.58 0.56 0.52 0.58 0.75 0.71 0.70 0.67 0.66 0.65 0.69
K=9 θw φα 0.58 0.73 0.52 0.62 0.49 0.50 0.41 0.41 0.38 0.38 0.34 0.35 0.45 0.50 0.61 0.82 0.53 0.68 0.48 0.57 0.45 0.49 0.42 0.45 0.40 0.41 0.48 0.57 0.74 0.87 0.67 0.77 0.67 0.69 0.67 0.61 0.65 0.52 0.62 0.44 0.67 0.65 0.85 0.96 0.81 0.83 0.78 0.77 0.76 0.70 0.76 0.61 0.74 0.51 0.78 0.73
φw α 0.65 0.56 0.47 0.43 0.39 0.36 0.48 0.84 0.68 0.56 0.49 0.45 0.41 0.57 0.86 0.75 0.68 0.61 0.52 0.44 0.64 0.95 0.79 0.75 0.69 0.61 0.50 0.72
Table 1: Ranked based retrieval results, averaged over 50 runs with randomly generated problems. We can see that all four measures are significantly better than the base line random permutation 14
π. Among the four measures, φα and φw α perform almost the same, and in general, they outperform θ and θw . It is interesting to note that θ and θw are very similar, except when |T | + |F | is large and K = 9, in which case θw becomes better. In general, the best results are obtained with φα (or φw α ) and with K = 3 (i.e. estimating the minimum support set size from below). For the case of |T | + |F | = 75, θw is also competitive. However, if we plan to use the top ranking attribute (for instance in some iterative scheme), φα seems still to be more reliable. For these experiments we used the AMPL package (Student version), with the MINOS 5.5 solver (see [26]). The total solution time of the 1600 problems used for generating Table 1 was about 45 minutes in total on a Pentium 500MHz Win32 computer.
5.2
Comparing heuristics on randomly generated data
In a second batch of our computational experiments, we compare the various heuristics of Section 3 applied with the measures of Section 4, for randomly generated sets of instances. In our generation, the construction of the hidden logic f and the dependency functions gk (as defined in the beginning of this section) are done by the following two methods, where all random drawings are done from uniform distribution. • Random functions: Given H, the truth table of f is randomly constructed by assigning the output 0 or 1 to each input vector from BH with equal probabilities. Each gk is dependent on a randomly chosen variable set H 0 ⊂ H of size d|H|/2e, and its output values are determined in a manner similar to f . (Due to randomness, gk might end up depending on an even smaller subset H 00 ⊂ H 0 .) • Functions represented by DNF: Terms are randomly generated from the variable set H, by choosing a positive literal with probability 1/4, a negative literal with probability 1/4 and no literal with probability 1/2, for each variable in H. The disjunction of an appropriate number of such terms is then used to define the hidden logic f (the generation of terms is terminated when their disjunction covers at least half of BH ). The dependence functions gk are also generated analogously, after choosing H 0 as above. Based on these two models, each instance (T, F ) is constructed by repeating |T | + |F | times the following: First generate a random binary vector x ∈ BH , determine whether it belongs to T or F by the value f (x), append the components xk = gk (x) for k ∈ D, and finally append randomly generated components xj for j ∈ R. We have generated six types of instances for random functions and DNF functions, respectively. The parameters of the six types are listed below in Table 2. Types of instances A1 A2 B1 B2 C1 C2
|H| 10 10 5 5 5 5
|D| 5 5 10 10 0 0
|R| 5 5 10 10 40 40
|T | + |F | 200 400 200 400 200 400
Table 2: Sizes of six types of problem instances. In one experiment we tested Ψ-ROUNDING on these problems for all mentioned variants of Ψ, that is for Ψ ∈ {σ, ρ, θ, θw , φα , φw α }. The obtained results are displayed in Table 3 for each of 15
the categories of Table 2. The entries show the average behavior (over 40 instances, each) of |S|, rH = |S ∩ H|/|S|, rD = |S ∩ D|/|S|, and rR = |S ∩ R|/|S|, where S denotes the obtained support set. It is desired to have |S| as small as |H|, and to have large rH and small rD and rR . In this experiment we used AMPL with the MINOS 5.5. solver (Student version) on a 500MHz Win32 PC. The 1440 instances corresponding to the entries in Table 3 took several hours. The smaller sized problems (e.g. types A1 and B1) took, on average, 20-60 seconds, each, while the largest instances (type C2) took 5-10 minutes, each. In general ρ-ROUNDING ran the slowest, and σ-ROUNDING was the fastest. The running times with the other 4 measures were about the same. Measures σ ρ θ θw φα φw α
|S| 14.450 11.200 11.100 10.925 10.600 10.550
σ ρ θ θw φα φw α
|S| 11.184 6.474 6.237 6.000 5.622 5.514
σ ρ θ θw φα φw α
|S| 13.050 10.750 5.875 5.500 5.150 5.050
A1 rH rD 0.586 0.288 0.577 0.303 0.599 0.285 0.628 0.252 0.648 0.253 0.635 0.271 B1 rH rD 0.405 0.301 0.483 0.109 0.572 0.041 0.533 0.016 0.565 0.023 0.558 0.008 C1 rH rD 0.314 0.000 0.054 0.000 0.860 0.000 0.917 0.000 0.976 0.000 0.992 0.000
rR 0.126 0.120 0.117 0.120 0.099 0.094
|S| 15.625 11.300 12.225 11.600 11.500 10.850
rR 0.294 0.408 0.387 0.450 0.411 0.435
|S| 10.825 7.000 5.950 5.650 5.275 5.308
rR 0.686 0.946 0.140 0.083 0.024 0.008
|S| 12.900 11.000 5.353 5.353 5.000 5.000
A2 rH rD 0.607 0.278 0.841 0.117 0.762 0.163 0.787 0.152 0.830 0.141 0.880 0.093 B2 rH rD 0.460 0.236 0.476 0.141 0.611 0.021 0.593 0.011 0.664 0.000 0.655 0.000 C2 rH rD 0.382 0.000 0.322 0.000 0.961 0.000 0.954 0.000 1.000 0.000 1.000 0.000
rR 0.115 0.041 0.075 0.061 0.029 0.027 rR 0.304 0.383 0.368 0.396 0.336 0.345 rR 0.618 0.678 0.039 0.046 0.000 0.000
Table 3: Performance summary of Ψ-ROUNDING with random functions, where α = 0.001 rH = |S ∩ H|/|S|, rD = |S ∩ D|/|S|, rR = |S ∩ R|/|S|
Looking at the results in Table 3 we can see that the hidden logic is quite well recovered in many cases. In general φα -ROUNDING and φw α -ROUNDING produced the best results, with the smallest support sets obtained, and at the same time, with the highest proportion of recovered hidden attributes. In particular, when dependent attributes are not present (type C), the hidden logic is almost perfectly recovered even in the smaller instances. It is also noticeable that the presence of dependent variables confuses substantially this method. This is very transparent in the results with type B instances: Regardless of the measure chosen, a very large percentage of random noise attributes gets selected into the support sets. Comparing this to the results with type C instances, it is apparent that the dependent variables influence badly the procedure, even though they themselves do not get chosen. Perhaps they gain some weight in the optimal solution of (P∗K ) at the expense of the hidden attributes (similarly as a chanceless third candidate can 16
alter the results of an election). These results suggest that if accompanied with some statistical pre-processing, even better results can be obtained (dependency is a structural information, and typically it is harder to filter out for standard statistical procedures, than randomness.) In the second and third batches of experiments, we compare GREEDY and STINGY procedures. We choose to compare the following procedures σ-GREEDY, ρ-STINGY, θ-GREEDY, θ-STINGY, φα -GREEDY and φα -STINGY (with α = 0.001 used in the last three). (Recall that θ is the same as φα with α = 0.) The results are tabulated in Table 4 for random functions and in Table 5 for DNF functions. From these tables we see that θ-GREEDY, φα -GREEDY and φα -STINGY perform reasonably well, while methods based on ρ or on σ show a weaker performance (with the possible exception of ρ-STINGY for type B problems). In most results of Tables 4 and 5, rH is much larger than rD and rR , implying that these heuristic algorithms are able to distinguish quite well essential attributes from others, though their power varies according to the algorithms. There are different tendencies for types A, B and C, as to which of the dependent variables D and the random variables R are more confused with H. We can see that dependent variables confuse much more these procedures than random ones; in particular GREEDY algorithms are prone to this type of error, and in particular for problems of type B, which have a large proportion of dependent variables (to some extent this can be expected; if a variable depends exactly on one of the variables from H, then in fact it cannot be distinguished from that one, at all). In general, φα -STINGY performs slightly better than the others. In particular, it retrieves much smaller proportion of dependent variables, even for type B problems, and generates somewhat smaller support sets. For instance, for type C2 problems with randomly generated hidden logic, H was determined exactly by φα -STINGY, for all 40 instances. The same happened with the DNF instances with φα -STINGY, φα -GREEDY, as well as with θ-GREEDY. Comparing Tables 3 and 4 we can see that ROUNDING is slightly better than GREEDY or STINGY. It is especially interesting to see that STINGY procedures are, in general, very effective in eliminating random noise attributes, while ROUNDING proved to be effective against dependent attributes. It is a challenging open question if and how can these two procedures be mixed efficiently to eliminate both types of unwanted attributes. We show the effects of choosing different values for parameter α in Table 6, in which the average size |S| of the obtained support sets are reported for various values of α, with both φα -GREEDY and φα -STINGY applied to problem instances with random hidden functions. Recall that φα GREEDY (resp., φα -STINGY) with α = 0 is equivalent to θ-GREEDY (resp., θ-STINGY). These results show that the choice of α has relatively little influence on the performance; a small positive value like α = 0.001 may be recommended. The influence of α becomes even smaller for the instances of DNF functions, though we omit the details. The computation time to execute the above heuristic algorithms varies greatly depending on whether the linear or concave programs (P∗K ) need be solved or not. As discussed before, ΨSTINGY require to solve these programs repeatedly. 
We used in our experiment the CPLEX code for linear programming and the SQP (successive quadratic programming) code developed in [31] for convex programming (see e.g. [8] for more on SQP). Using a machine of Sun Ultra 60 (360MHz, 2GB memory), φα -STINGY with α = 0.001 (using convex programming) took more than 1000 seconds on the average to solve one A2 instance of random functions, and ρ-STINGY (using linear programming) took about 60 seconds for the same case. The algorithm θ-STINGY is computationally much heavier, even if it is based on linear programming, because the numbers of variables and constraints in (P+ K ) are very large. Therefore, we did not complete the experiment of θ-STINGY for the instances of A2, B2 and C2. On the other hand, algorithms Ψ-GREEDY are very fast, in general. In the above case of type A2, for example, σ-GREEDY, θ-GREEDY and 17
Types A1 A2 B1 B2 C1 C2
|S| 14.83 15.89 11.30 10.81 12.92 13.23
σ-GREEDY rH rD 0.586 0.131 0.606 0.119 0.404 0.297 0.462 0.303 0.317 0.000 0.366 0.000
rR 0.283 0.275 0.299 0.236 0.684 0.634
|S| 11.05 11.40 5.22 5.38 7.80 6.88
ρ-STINGY rH rD 0.538 0.155 0.791 0.083 0.738 0.263 0.716 0.284 0.623 0.000 0.812 0.000
rR 0.307 0.126 0.000 0.000 0.377 0.188
|S| 11.18 12.14 6.14 6.01 5.84 5.24
θ-GREEDY rH rD 0.602 0.131 0.756 0.076 0.567 0.400 0.591 0.392 0.866 0.000 0.976 0.000
rR 0.267 0.168 0.032 0.018 0.134 0.024
|S| 10.75 — 6.00 — 6.15 —
θ-STINGY rH rD 0.699 0.082 — — 0.550 0.372 — — 0.848 0.000 — —
rR 0.219 — 0.078 — 0.152 —
|S| 11.18 12.12 6.10 6.00 5.78 5.22
φα -GREEDY rH rD 0.615 0.109 0.771 0.062 0.595 0.377 0.596 0.389 0.885 0.000 0.976 0.000
rR 0.275 0.167 0.029 0.016 0.115 0.024
|S| 10.75 10.78 5.80 5.70 5.67 5.00
φα -STINGY rH rD 0.740 0.038 0.913 0.010 0.704 0.223 0.730 0.216 0.923 0.000 1.000 0.000
rR 0.222 0.078 0.073 0.053 0.077 0.000
Types A1 A2 B1 B2 C1 C2 Types A1 A2 B1 B2 C1 C2
Table 4: Performance summary for random functions, where α = 0.001, rH = |S ∩ H|/|S|, rD = |S ∩ D|/|S|, and rR = |S ∩ R|/|S|.
φα -GREEDY took only 0.99, 1.4 and 6.1 seconds, respectively. In view of these, our conclusion is to recommend θ-GREEDY or φα -GREEDY as a practical tool to find reasonably small support sets. It is also interesting to look at our results from the point of view used in the probabilistic arguments of Appendix A (even though these instances do not satisfy the assumptions of Appendix A). Note that |T | ≈ |F | holds in our problem instances. For type A instances with |H| = 10 and n = 20, Table 7 indicates that |T |+|F | = 200 is below the threshold to attain E(X(K, mT , mF , n)) ≤ 1, but |T |+|F | = 400 is well above it. Our results are in agreement with this in the sense that for most methods rH is closer to 1.0 for type A2 than for type A1 instances. For type B and C instances with |H| = 5, Table 7 does not tell the exact thresholds, but indicates that both |T | + |F | = 200, 400 are sufficient to guarantee E(X(K, mT , mF , n)) ≤ 1 for K = 5. Even though this argument applies stronger for type B than for type C instances, most methods perform better on type C than on type B, indicating strongly that dependent attributes are much harder to eliminate than random noise is.
18
Types A1 A2 B1 B2 C1 C2
|S| 14.45 16.04 10.10 9.35 11.75 12.51
σ-GREEDY rH rD 0.557 0.240 0.558 0.241 0.308 0.563 0.342 0.595 0.367 0.000 0.402 0.000
rR 0.203 0.201 0.129 0.064 0.633 0.598
|S| 10.60 11.57 5.40 5.28 7.33 6.60
ρ-STINGY rH rD 0.570 0.128 0.740 0.065 0.546 0.454 0.605 0.395 0.679 0.000 0.821 0.000
rR 0.302 0.195 0.000 0.000 0.321 0.180
|S| 10.08 10.20 5.72 5.47 5.14 5.00
θ-GREEDY rH rD 0.806 0.061 0.964 0.013 0.512 0.463 0.547 0.453 0.968 0.000 1.000 0.000
rR 0.134 0.023 0.025 0.000 0.032 0.000
|S| 10.22 — 5.65 — 5.83 —
θ-STINGY rH rD 0.773 0.050 — — 0.483 0.479 — — 0.889 0.000 — —
rR 0.177 — 0.039 — 0.111 —
|S| 10.03 10.20 5.75 5.47 5.13 5.00
φα -GREEDY rH rD 0.836 0.040 0.970 0.008 0.498 0.474 0.560 0.440 0.975 0.000 1.000 0.000
rR 0.124 0.022 0.028 0.000 0.025 0.000
|S| 9.82 10.35 5.72 5.47 5.20 5.00
φα -STINGY rH rD 0.897 0.010 0.939 0.004 0.503 0.454 0.558 0.430 0.964 0.000 1.000 0.000
rR 0.094 0.057 0.043 0.013 0.036 0.000
Types A1 A2 B1 B2 C1 C2 Types A1 A2 B1 B2 C1 C2
Table 5: Performance summary for DNF functions, where α = 0.001, rH = |S ∩ H|/|S|, rD = |S ∩ D|/|S|, and rR = |S ∩ R|/|S|.
5.3 Computational results with real world data
To see how our algorithms perform on real world data, we took four data sets from the Irvine repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). As some attributes of these data sets were numerical or nominal, we transformed them into binary attributes, as described below for each data set, before computing support sets. Furthermore, all data vectors containing missing bits were omitted from consideration in our experiments.

5.3.1 Breast Cancer
Table 6: Average sizes of |S| with different α (random functions).

                        φα-GREEDY                          φα-STINGY
Instances   α = 0.1    0.01    0.001     0.0    α = 0.1    0.01    0.001     0.0
   A1        11.43    11.18    11.18   11.18     10.95    10.82    10.75   10.75
   A2        12.10    12.12    12.12   12.14     11.40    11.05    10.78     —
   B1         6.67     6.13     6.10    6.14      6.53     6.05     5.80    6.00
   B2         6.53     6.00     6.00    6.01      6.03     5.63     5.70     —
   C1         5.60     5.78     5.78    5.84      6.53     5.80     5.67    6.15
   C2         5.28     5.22     5.22    5.24      5.58     5.60     5.00     —

There are 9 original attributes, each taking an integer value from {1, 2, . . . , 10} (see [40]). To transform this into a binary data set, a cut-point is placed between every two consecutive integer values of each attribute; for example, "x2 ≥ 2.5" denotes the binarized attribute taking value 1 if the value of x2 is not less than 2.5, and 0 otherwise (cf. [13]). The resulting binary data set has 9 × 9 = 81 binary attributes. After removing data vectors containing missing bits, this data set has |T| = 239 and |F| = 444 data vectors.

Some runs of θ-GREEDY obtained a support set of 11 binary attributes, but in most cases support sets of 12 attributes were obtained (as it breaks ties randomly, different runs on the same data set may produce different solutions). All runs of φα-GREEDY and φα-STINGY (where α = 0.001) obtained support sets of 12 binary attributes. (As φα-STINGY takes too much computation time on the original binary data set, the data set was reduced by choosing 27 promising binary attributes before applying φα-STINGY.) We list two typical support sets of 11 attributes and 12 attributes for the purpose of future comparisons.

(a) x2 ≥ 2.5, x6 ≥ 3.5, x7 ≥ 4.5, x1 ≥ 6.5, x3 ≥ 4.5, x8 ≥ 2.5, x4 ≥ 2.5, x6 ≥ 8.5, x1 ≥ 4.5, x3 ≥ 3.5, x5 ≥ 5.5.

(b) x1 ≥ 4.5, x1 ≥ 6.5, x2 ≥ 2.5, x3 ≥ 3.5, x3 ≥ 4.5, x4 ≥ 1.5, x4 ≥ 2.5, x5 ≥ 3.5, x6 ≥ 3.5, x6 ≥ 7.5, x7 ≥ 4.5, x8 ≥ 2.5.
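The cut-point binarization described at the beginning of this subsection is easy to reproduce. The following Python sketch is our own illustration (it is not part of the paper); the thresholds v + 0.5 for v = 1, . . . , 9 follow the description above, while the function and variable names are hypothetical.

    def binarize_cutpoints(vectors, low=1, high=10):
        """Replace each integer attribute with values in {low, ..., high} by the
        binary cut-point attributes "x_i >= v + 0.5" for v = low, ..., high - 1."""
        n = len(vectors[0])
        names = [f"x{i + 1} >= {v + 0.5}" for i in range(n) for v in range(low, high)]
        rows = [tuple(int(x[i] >= v + 0.5) for i in range(n) for v in range(low, high))
                for x in vectors]
        return names, rows

    # One 9-attribute vector in the style of the Breast Cancer data yields
    # 9 * 9 = 81 binary attributes, as stated in the text.
    names, rows = binarize_cutpoints([(5, 1, 10, 2, 3, 7, 4, 1, 1)])
    print(len(names), rows[0][:5])   # 81 and the first few indicator bits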
5.3.2 Mushrooms
The original data set contains 22 nominally valued attributes. Each nominal value of each attribute was transformed into one binary attribute that takes value 1 if and only if the original attribute takes the chosen nominal value. In this way, 125 binary attributes were generated. The data set has |T| = 2156 and |F| = 3488 data vectors. Applications of θ-GREEDY and φα-GREEDY could find the following support set of 6 binary attributes (φα-STINGY was not tested as the data set was too large):

    bruises           = bruises (or its negation),
    odor              = none,
    gill-size         = broad,
    stalk-root        = bulbous,
    spore-print-color = white,
    habitat           = woods.
This may be an interesting result, because the rules cited in the Irvine repository contain 8 binary attributes. It should be further checked, however, whether this smaller support set can produce better rules for Mushrooms.
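The nominal-to-binary transformation used for this data set is the usual one-hot encoding. The following minimal Python sketch is ours (the function and attribute names are hypothetical, not taken from the paper); it builds one binary attribute per (attribute, nominal value) pair.

    def binarize_nominal(records):
        """One binary attribute per (attribute, nominal value) pair:
        it takes value 1 iff the original attribute takes that value."""
        n = len(records[0])
        values = [sorted({r[i] for r in records}) for i in range(n)]
        names = [f"attr{i}={v}" for i in range(n) for v in values[i]]
        rows = [tuple(int(r[i] == v) for i in range(n) for v in values[i]) for r in records]
        return names, rows

    names, rows = binarize_nominal([("bruises", "none"), ("no", "almond")])
    print(names)   # ['attr0=bruises', 'attr0=no', 'attr1=almond', 'attr1=none']
    print(rows)    # [(1, 0, 0, 1), (0, 1, 1, 0)]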
5.3.3 Hepatitis
The original data set has 13 binary attributes and 6 numerical attributes. We introduced 4 to 6 cut points at appropriate locations for each numerical attribute. The resulting binary data set has 46 binary attributes, and |T| = 67, |F| = 13. All three methods were applied to the resulting data set; θ-GREEDY and φα-GREEDY obtained the following support set of 7 attributes:

    MALAISE       = (yes/no),
    SPIDERS       = (yes/no) (or AGE > 50 or SPLEEN PALPABLE = (yes/no) or ASCITES = (yes/no)),
    VARICES       = (yes/no),
    ALK PHOSPHATE > 120,
    ALK PHOSPHATE > 200,
    PROTIME       > 50,
    HISTOLOGY     = (yes/no).

On the other hand, φα-STINGY obtained a slightly different support set, though the number of attributes is the same:

    AGE           > 40,
    SPIDERS       = (yes/no),
    BILIRUBIN     > 1.20,
    ALK PHOSPHATE > 120,
    PROTIME       > 40,
    PROTIME       > 50,
    HISTOLOGY     = (yes/no).

5.3.4 Voting Records
The original data set is already binary; it has 16 attributes, |T| = 124 positive examples, and |F| = 108 negative examples. All three methods obtained the same support set of 8 attributes: 1, 2, 3, 4, 11, 13, 15, 16.

5.3.5 Running times
The computation time required for these experiments varies according to the sizes of the data sets. θ-GREEDY and φα-GREEDY took about 30 seconds for Breast Cancer, more than 1000 seconds for Mushrooms, and less than 1 second for Hepatitis and Voting Records. φα-STINGY is much slower; it took 11 seconds for Hepatitis and 360 seconds for Voting Records.
6 Conclusion and discussion
In this paper, we have studied the problem of finding small support sets S for a given data set (T, F), capturing the essential attributes of the phenomenon under consideration. Although such a data set may have many "random" support sets, our probabilistic arguments show that essential ones can be distinguished from random ones if their sizes are smaller than a certain threshold (depending on |T| and |F|). After introducing various measures of separation between T and F, we showed that the problem of finding a best support set can be formulated as an integer programming problem, which can be relaxed to certain linear or nonlinear programming problems. Based on these, we proposed several heuristic algorithms and compared their performance by computational experiments on artificial and real-world data sets. Some of the heuristic algorithms turned out to be quite effective in finding small support sets consisting mostly of essential attributes. This direction of research is very important, since finding essential attributes in large data sets is a core issue in such areas as data mining, knowledge discovery and logical analysis of data. There remain, however, many possible improvements for future research:

1. To distinguish dependent variables from essential variables, a special routine may be included in the proposed heuristic algorithms. Suppose that a set of attributes S′ has been selected so far, which will then be enlarged to a support set S. To check whether an attribute j ∈ V \ S′ is dependent on S′, we partition the set T[S′] ∪ F[S′] into T_j[S′] and F_j[S′] such that x[S′] ∈ T[S′] ∪ F[S′] belongs to T_j[S′] (resp., F_j[S′]) if x_j = 1 (resp., x_j = 0) holds. By applying separation measures to the pdBf (T_j[S′], F_j[S′]), we can evaluate how much x_j depends on S′. In other words, if T_j[S′] and F_j[S′] are completely separated, then there is a function that determines the value of x_j from x[S′] ∈ T[S′] ∪ F[S′], that is, x_j is dependent on S′. This information may be utilized in a special routine so that highly dependent attributes are excluded from the candidate list for the next selection step. (A small sketch of this check is given after the list below.)

2. The input data set (T, F) may contain errors, in the sense that some elements of vectors in T ∪ F are different from the true values, and that classification into T and F may also be erroneous. As a result of errors, even the property T ∩ F = ∅ may not hold. In this situation, measures such as σ, θ and in particular φα still convey useful information about separation, and can be used in the algorithms for solving (P_K). Turning this argument around, we could use the same measures to generate subsets S ⊆ V which do not (necessarily) distinguish all pairs of positive and negative examples, but only a large fraction of them. Classification rules based on such "almost"-support sets may turn out to be more robust, and generalize better.

3. Data vectors in real applications may contain missing bits ∗, indicating that the value is not known for some reason (cf. [15]). To deal with such cases, the definitions of the separation measures have to be extended. Some attempts and experimental studies are under way in our group.

4. Many real-world data sets contain numerical vectors, rather than binary ones. To deal with numerical data, further study is necessary in order to generalize meaningfully the separation measures and the algorithms for computing support sets. We are directing some of our efforts also in this direction (cf. [13]).

5. The sizes of pdBfs (T, F) encountered in real life applications can be very large. Although the experiments in Section 5 were done for rather small data sets, this was mainly due to the large computation time needed for solving the linear and nonlinear programming problems (P_K^*). Those heuristic algorithms not relying on (P_K^*) (e.g., θ-GREEDY, φα-GREEDY, etc.) can be successfully applied to much larger data sets. However, further efforts would be necessary to make them practically useful. For example, evaluating a separation measure for a pdBf (T, F) takes O(n|T||F|) time if it is implemented naively. For some separation measures, this can be improved to O(n(|T| + |F|) log(|T| + |F|)). Even for other separation measures, an approximate evaluation of the value may be done much faster than O(n|T||F|). There are several other issues related to the implementation of practical codes.
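To illustrate the dependency check outlined in item 1 above, here is a minimal Python sketch (our own code and naming, not part of the paper): it projects the data onto S′, splits the projected vectors according to the value of attribute j, and reports j as dependent on S′ when the two parts are disjoint.

    def is_dependent(T, F, S_prime, j):
        """Return True if x_j is determined by the attributes in S_prime over T ∪ F,
        i.e., if T_j[S'] and F_j[S'] (split by the value of x_j) are disjoint."""
        data = list(T) + list(F)
        Tj = {tuple(x[i] for i in S_prime) for x in data if x[j] == 1}
        Fj = {tuple(x[i] for i in S_prime) for x in data if x[j] == 0}
        return Tj.isdisjoint(Fj)

    # Toy example: x_2 always equals x_0, so it is dependent on S' = [0].
    T = [(1, 0, 1), (0, 1, 0)]
    F = [(1, 1, 1), (0, 0, 0)]
    print(is_dependent(T, F, S_prime=[0], j=2))   # True

A graded version of the same idea, as suggested in item 1, would apply one of the separation measures to the pdBf (T_j[S′], F_j[S′]) instead of the all-or-nothing disjointness test.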
References

[1] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. International Conference on Management of Data (SIGMOD 93), (1993) 207-216.
[2] H. Almuallim and T. Dietterich. Efficient algorithms for identifying relevant features. In: Proceedings of the Ninth Canadian Conference on Artificial Intelligence, pp. 38-45. Vancouver, BC: Morgan Kaufmann.
[3] H. Almuallim and T. Dietterich. Learning Boolean concepts in the presence of many irrelevant features. Artificial Intelligence 69 (1994) 279-305.
[4] D. Angluin. Queries and concept learning. Machine Learning, 2 (1988) 319-342.
[5] M. Anthony and N. Biggs. Computational Learning Theory, Cambridge University Press, 1992.
[6] J.F. Banzhaf III. Weighted voting doesn't work: A mathematical analysis. Rutgers Law Review 19 (1965) 317-343.
[7] D.A. Bell and H. Wang. A formalism for relevance and its application in feature subset selection. Machine Learning, 41 (2000) 175-195.
[8] D.P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods, Academic Press, 1982.
[9] A. Blum, L. Hellerstein and N. Littlestone. Learning in the presence of finitely or infinitely many irrelevant attributes. Journal of Computer and System Sciences 50 (1995) 32-40.
[10] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence 67 (1997) 245-285.
[11] A. Blumer, A. Ehrenfeucht, D. Haussler and M.K. Warmuth. Occam's razor. Information Processing Letters, 24 (1987) 377-380.
[12] E. Boros, P.L. Hammer, T. Ibaraki, A. Kogan, E. Mayoraz and I. Muchnik. An implementation of logical analysis of data. IEEE Trans. on Knowledge and Data Engineering 12 (2000) 292-306.
[13] E. Boros, P.L. Hammer, T. Ibaraki and A. Kogan. Logical analysis of numerical data. Mathematical Programming 79 (1997) 163-190.
[14] E. Boros, T. Ibaraki and K. Makino. Error-free and best-fit extensions of a partially defined Boolean function. Information and Computation, 140 (1998) 254-283.
[15] E. Boros, T. Ibaraki and K. Makino. Logical analysis of binary data with missing bits. Artificial Intelligence, 107 (1999) 219-264.
[16] P.S. Bradley, O.L. Mangasarian and W.N. Street. Feature selection via mathematical programming. INFORMS Journal on Computing 10 (1998) 209-217.
[17] W. Brauer and M. Scherf. Feature selection by means of a feature weighting approach. Technical Report FKI-221-97, Institut für Informatik, Technische Universität München, 1997.
[18] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone. Classification and Regression Trees. (Wadsworth International Group, 1984.)
[19] R. Caruana and D. Freitag. Greedy attribute selection. In: Machine Learning: Proceedings of the Eleventh International Conference, (Rutgers University, New Brunswick, NJ, 1994), pp. 28-36.
[20] C.K. Chow. Boolean functions realizable with single threshold devices. Proc. IRE, 49 (1961) 370-371.
[21] V. Chvátal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research 4(3) (1979) 233-235.
[22] M. Conforti and G. Cornuejols. Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7 (1984) 251-274.
[23] Y. Crama, P.L. Hammer and T. Ibaraki. Cause-effect relationships and partially defined Boolean functions. Annals of Operations Research, 16 (1988) 299-326.
[24] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, 1996.
[25] U. Feige. A threshold of ln n for approximating set cover. In: Proc. of the 28th ACM Symposium on Theory of Computing, 1996, pp. 314-318.
[26] R. Fourer, D.M. Gay and B.W. Kernighan. A Modeling Language for Mathematical Programming. Management Science 36 (1990) 519-554.
[27] M.R. Garey and D.S. Johnson. Computers and Intractability, Freeman, New York, 1979.
[28] M.A. Hall and L.A. Smith. Practical feature subset selection for machine learning. In: Proceedings of the 21st Australasian Computer Science Conference (Springer Verlag, 1998) pp. 181-191.
[29] D. Harman (editor). Overview of the Third Text Retrieval Conference (TREC-3), Gaithersburg, MD 20899-0001. National Institute of Standards and Technology, Special Publication 500-225, 1995.
[30] D.S. Hochbaum and A. Pathria. Analysis of the greedy approach in covering problems. Naval Research Quarterly, 45 (1998) 615-627.
[31] T. Ibaraki and M. Fukushima. FORTRAN 77 Optimization Programming (in Japanese), Iwanami, 1991.
[32] G. John, R. Kohavi and K. Pfleger. Irrelevant features and the subset selection problem. In: Machine Learning: Proceedings of the Eleventh International Conference, pp. 121-129. (Morgan Kaufmann Publishers, 1994.)
[33] M. Karpinski and A. Zelikovsky. Approximating Dense Cases of Covering Problems. DIMACS Technical Report, DTR 96-59, DIMACS, Rutgers University, 1996.
[34] S. Khuller, A. Moss and J. Naor. The budgeted maximum coverage problem. Information Processing Letters, 70 (1999) 39-45.
[35] K. Kira and L. Rendell. The feature selection problem: Traditional methods and a new algorithm. In: Proceedings of the Tenth National Conference on Artificial Intelligence, Menlo Park, (AAAI Press/The MIT Press, 1992), pp. 129-134.
[36] D. Koller and M. Sahami. Toward optimal feature selection. In: ICML-96: Proceedings of the Thirteenth International Conference on Machine Learning, pp. 284-292. (Morgan Kaufmann, 1997.)
[37] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2 (1988) 285-318.
[38] H. Liu, H. Motoda and M. Dash. A monotonic measure for optimal feature selection. ECML-98: The 10th European Conference on Machine Learning.
[39] L. Lovász. On the ratio of optimal integral and fractional covers. Discrete Mathematics, 13(4) (1975) 383-390.
[40] O.L. Mangasarian, R. Setiono and W.H. Wolberg. Pattern recognition via linear programming: Theory and applications to medical diagnosis. In: Large-Scale Numerical Optimization, edited by T.F. Coleman and Y. Li, SIAM Publications, Philadelphia, (1990) 22-30.
[41] H. Mannila, H. Toivonen and A.I. Verkamo. Efficient algorithms for discovering association rules. AAAI Workshop on Knowledge Discovery in Databases, edited by U.M. Fayyad and R. Uthurusamy, (1994) 181-192.
[42] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New York, 1995.
[43] P.M. Murphy and D.W. Aha. UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], University of California, Department of Information and Computer Science, 1994.
[44] P.M. Narendra and K. Fukunaga. A branch-and-bound algorithm for feature subset selection. IEEE Trans. on Computers, C-26 (1977) 917-922.
[45] G.L. Nemhauser and L. Wolsey. Maximizing submodular set functions: formulations and analysis of algorithms. In: Studies of Graphs and Discrete Programming, North-Holland, Amsterdam, 1981, pp. 279-301.
[46] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM Studies in Applied Mathematics, 1994.
[47] J.R. Quinlan. Induction of decision trees. Machine Learning, 1 (1986) 81-106.
[48] J.R. Quinlan. C4.5: Programs for Machine Learning. (Morgan Kaufmann, 1992.)
[49] L.S. Shapley and M. Shubik. A method for evaluating the distribution of power in a committee system. Amer. Polit. Sci. Rev. 48 (1954) 787-792.
[50] L.G. Valiant. A theory of the learnable. Communications of the ACM, 27 (1984) 1134-1142.
[51] R.O. Winder. Threshold Logic. Ph.D. Dissertation, Department of Mathematics, Princeton University, Princeton, NJ, 1962.
[52] R.O. Winder. Chow parameters in threshold logic. J. ACM 18 (1971) 265-289.
Appendix A  Distribution of small support sets

In this section, we evaluate how many support sets of size K (1 ≤ K ≤ n) exist on average if T and F are randomly generated. The results show that, for a given K, many support sets usually exist unless the numbers of data vectors |T| and |F| are larger than a certain threshold (depending on K). Interpreting this in the reverse direction, we can conclude that, given a data set (T, F), the support sets of size K found for (T, F) may be ones that exist merely by chance, and not necessarily the ones inherent in the phenomenon from which the data set is generated, unless K is smaller than a certain threshold. Therefore, if our efforts to find small support sets succeed and the size of the obtained support set is smaller than this threshold, then, with high probability, the attributes in such a support set are the essential ones, inherent in the phenomenon under consideration.

For our analysis, let us suppose that each vector of T and F is randomly chosen from B^n, uniformly and independently. Let us denote the numbers of trials for T and F by m_T and m_F, respectively. Note that |T| < m_T or |F| < m_F may occur, since we allow duplicate vectors. For a subset S ⊆ V, we use the notation m_T^S = |T[S]| and m_F^S = |F[S]|.
Let X_S be the indicator random variable defined by

    X_S = \begin{cases} 1, & \text{if } T[S] \cap F[S] = \emptyset \text{ (i.e., if } S \text{ is a support set)} \\ 0, & \text{otherwise,} \end{cases}

and let

    X(K, m_T, m_F, n) = \sum_{S \subseteq V, |S| = K} X_S,

where K is a given constant. Then the expectation E(X(K, m_T, m_F, n)) of X(K, m_T, m_F, n) gives the expected number of support sets of size K. Since the expected value E(X_S) is the same for all S ⊆ V of size K,

    E(X(K, m_T, m_F, n)) = \sum_{S \subseteq V, |S| = K} E(X_S) = \binom{n}{K} E(X_S)                (19)

follows by the linearity of expectation. For convenience, we write E(X(K, m_T, m_F, n)) ≈ 1 if and only if E(X(K, m_T, m_F, n)) ≥ 1 and E(X(K, m_T + 1, m_F + 1, n)) < 1 hold. Such m_T and m_F always exist for any K and n, since, by definition, E(X(K, m_T, m_F, n)) is monotonically decreasing in m_T and m_F. In the following subsections, we will analyze the relationships between K, m_T, m_F and n when E(X(K, m_T, m_F, n)) ≈ 1 holds.
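Before bounding this expectation analytically, it may help to note that E(X(K, m_T, m_F, n)) is easy to estimate by simulation for small n and K. The following Python sketch is our own code (not from the paper): it draws m_T and m_F vectors uniformly with repetition, counts the support sets of size K by enumeration, and averages over several trials.

    import random
    from itertools import combinations

    def count_support_sets(T, F, n, K):
        """Count subsets S of size K with T[S] ∩ F[S] = ∅, i.e., the value of X(K, m_T, m_F, n)."""
        count = 0
        for S in combinations(range(n), K):
            TS = {tuple(x[i] for i in S) for x in T}
            FS = {tuple(x[i] for i in S) for x in F}
            if TS.isdisjoint(FS):
                count += 1
        return count

    def estimate_EX(K, mT, mF, n, trials=100, seed=0):
        """Monte Carlo estimate of E(X(K, m_T, m_F, n)) under the uniform random model."""
        rng = random.Random(seed)
        total = 0
        for _ in range(trials):
            T = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(mT)]
            F = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(mF)]
            total += count_support_sets(T, F, n, K)
        return total / trials

    # For n = 10 and K = 3, Table 7 below suggests that the threshold for
    # E(X) ≈ 1 lies between 4 and 9 sample vectors on each side.
    print(estimate_EX(K=3, mT=4, mF=4, n=10))
    print(estimate_EX(K=3, mT=9, mF=9, n=10))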
A.1 Upper and lower bounds on E(X(K, m_T, m_F, n))
Given a subset S ⊆ V of size K and a parameter m (0 ≤ m ≤ min{m_T, 2^K}), the probability that a random vector from F[S] does not appear in T[S], under the condition that m_T^S = m holds, is (2^K − m)/2^K. Therefore, we have

    \mathrm{Prob}(X_S = 1 \mid m_T^S = m) = \left(1 - \frac{m}{2^K}\right)^{m_F}.                    (20)

Using the above, we compute first a lower bound on E(X(K, m_T, m_F, n)). Since the equations

    E(X_S) = \mathrm{Prob}(X_S = 1) = \sum_{m=0}^{\min\{m_T, 2^K\}} \mathrm{Prob}(m_T^S = m)\, \mathrm{Prob}(X_S = 1 \mid m_T^S = m)          (21)

obviously hold, the following theorem is an immediate consequence.

Theorem 4  For any subset S ⊆ V of size K, the inequality E(X_S) ≥ (1 − m_T/2^K)_+^{m_F} holds, and thus we have

    E(X(K, m_T, m_F, n)) \ge \binom{n}{K} \left(1 - \frac{m_T}{2^K}\right)_+^{m_F},
where (z)_+ = max{z, 0}.

Next we derive an upper bound on the expectation E(X(K, m_T, m_F, n)). For this purpose, we first estimate the number of different vectors m_T^S in T[S] for a fixed subset S ⊆ V of size K. The key statement in the derivation of our upper bound (Theorem 5) is the following lemma.

Lemma 3  Prob(m_T^S ≤ m) ≤ (m/2^K)^{m_T − m} holds for any m = 0, 1, 2, . . . , min{m_T, 2^K}.
Proof: Consider a Markov chain consisting of states 0, 1, 2, . . . , 2^K, whose transition probability p_{ij} from state i to j, for i, j = 0, 1, 2, . . . , 2^K, is defined by

    p_{ij} = \begin{cases} i/2^K, & \text{if } j = i \\ 1 - i/2^K, & \text{if } j = i + 1 \\ 0, & \text{otherwise.} \end{cases}

Then m_T^S can be viewed as the index of the state after m_T moves are made on this Markov chain starting from state 0. Let us call a move from state i to i (resp., from i to i + 1) a stay (resp., a forward step). The event m_T^S ≤ m occurs if and only if the walk on the Markov chain makes at most m forward steps, which is equivalent to the event that the walk stays in some states in {1, 2, . . . , m} at least m_T − m times. Then the lemma is immediate, as p_{ii} ≤ m/2^K holds for i ≤ m.

Now we are ready to prove an upper bound on the expectations.

Theorem 5  For any subset S ⊆ V of size K and any m = 1, 2, . . . , min{m_T, 2^K}, the inequality

    E(X_S) \le \left(1 - \frac{m}{2^K}\right)^{m_F} + \left(\frac{m-1}{2^K}\right)^{m_T - m + 1}          (22)

holds. Hence, we have

    E(X(K, m_T, m_F, n)) \le \binom{n}{K} \left\{ \left(1 - \frac{m}{2^K}\right)^{m_F} + \left(\frac{m-1}{2^K}\right)^{m_T - m + 1} \right\}.

Proof: By (20) and (21) we have

    E(X_S) \le \mathrm{Prob}(m_T^S \ge m) \cdot \left(1 - \frac{m}{2^K}\right)^{m_F} + \mathrm{Prob}(m_T^S \le m - 1) \cdot 1

for any m = 1, 2, . . . , min{m_T, 2^K}. Hence the theorem is immediate from Lemma 3.

To obtain the best upper bound from (22), we must minimize the right hand side as a function of m, which is not difficult to do numerically if min{m_T, 2^K} is small. Otherwise the asymptotic bounds of Section A.2 are recommended, since they are easier to compute and are usually tight in such cases.

We show in Table 7 some numerical results illustrating the accuracy of the lower and upper bounds of Theorems 4 and 5, respectively. We assume m_T = m_F, and the figures in the rows LB and UB have the following meaning: E(X(K, m_T, m_F, n)) ≥ 1 holds if m_T = m_F ≤ LB, and E(X(K, m_T, m_F, n)) ≤ 1 holds whenever m_T = m_F ≥ UB. That is, they are lower and upper bounds on the values of m_T = m_F satisfying E(X(K, m_T, m_F, n)) ≈ 1. These numerical results can be interpreted as follows: if m_T = m_F ≤ LB, then the number of vectors in T and F is not sufficient to avoid randomly existent support sets, but if m_T = m_F ≥ UB, random support sets rarely appear. We can observe that the two bounds are close together, especially for larger K, which is highly encouraging. For instance, if the hidden phenomenon depends on 10 essential attributes in a data set consisting also of 90 inessential variables, and our data set contains more than 179 positive and negative examples, then it is likely that by generating the smallest support set we do indeed identify all 10 truly essential attributes.
Table 7: Lower and upper bounds on m_T = m_F to satisfy E(X(K, m_T, m_F, n)) ≈ 1.

           n = 10       n = 15       n = 20       n = 40       n = 100      n = 1000
  K       LB    UB     LB    UB     LB    UB     LB    UB     LB    UB     LB    UB
  1        1     4      1     4      1     5      1     6      1     7      1    10
  2        2     6      3     7      3     8      3    10      3    13      3    19
  3        4     9      5    10      5    12      6    14      6    18      7    28
  4        7    12      8    14      9    16     10    21     11    26     13    38
  5       11    16     13    19     15    22     17    27     19    34     23    51
  6       17    20     21    26     23    29     27    37     31    45     38    66
  7       23    26     31    35     35    41     42    51     48    62     61    88
  8       30    32     45    49     51    57     63    71     74    87     95   120
  9       33    35     63    67     75    80     94   102    113   124    146   168

           n = 15       n = 20       n = 40       n = 100      n = 1000
  K       LB    UB     LB    UB     LB    UB     LB    UB     LB    UB
 10       88    91    108   112    139   146    168   179    221   241
 11      119   122    153   157    204   210    250   259    331   349
 12      156   159    216   219    297   303    367   376    491   508
 13      194   196    300   303    430   435    537   545    724   740
 14      209   211    413   416    619   624    782   790   1063  1078
 15        —     —    559   562    888   893   1135  1143   1555  1569
 16        —     —    743   745   1270  1274   1643  1650   2265  2279
 17        —     —    958   960   1811  1815   2372  2379   3293  3305
 18        —     —   1171  1173   2576  2580   3418  3424   4775  4787
 19        —     —   1252  1254   3657  3661   4917  4923   6912  6924
 20        —     —      —     —   5179  5183   7062  7068   9989 10001
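The LB and UB entries of Table 7 can be reproduced directly from Theorems 4 and 5. The following Python sketch is our own code (written under the random model stated above, and not part of the paper): it evaluates the lower bound of Theorem 4 and the upper bound of Theorem 5 (minimizing the latter over m), then searches for the largest m_T = m_F at which the lower bound still guarantees E(X) ≥ 1 and the smallest m_T = m_F at which the upper bound already guarantees E(X) ≤ 1.

    from math import comb, inf

    def lower_bound_EX(K, mT, mF, n):
        """Lower bound of Theorem 4: C(n, K) * (1 - mT/2^K)_+^mF."""
        base = max(1.0 - mT / 2**K, 0.0)
        return comb(n, K) * base**mF

    def upper_bound_EX(K, mT, mF, n):
        """Upper bound of Theorem 5, minimized over m = 1, ..., min(mT, 2^K)."""
        best = inf
        for m in range(1, min(mT, 2**K) + 1):
            val = comb(n, K) * ((1 - m / 2**K)**mF + ((m - 1) / 2**K)**(mT - m + 1))
            best = min(best, val)
        return best

    def thresholds(K, n, m_max=2000):
        """Largest m with LB >= 1 and smallest m with UB <= 1, taking mT = mF = m."""
        LB = max((m for m in range(1, m_max) if lower_bound_EX(K, m, m, n) >= 1), default=0)
        UB = next(m for m in range(1, m_max) if upper_bound_EX(K, m, m, n) <= 1)
        return LB, UB

    print(thresholds(K=5, n=20))   # should be close to the (15, 22) entry of Table 7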
A.2 Asymptotic behavior of E(X(K, m_T, m_F, n))
Next we examine the asymptotic behavior of E(X_S) when K is sufficiently large. For this, we first derive two technical lemmas from Theorem 4 and Theorem 5, respectively.

Lemma 4  Let c > 1, and assume that m_T ≤ 2^K/c. Then

    E(X_S) \ge \left(1 - \frac{1}{c}\right)^{c m_T m_F / 2^K}

holds for any subset S ⊆ V of size K.

Proof: By the assumption,

    E(X_S) \ge \left(1 - \frac{m_T}{2^K}\right)^{m_F}
           = \left\{ \left(1 - \frac{m_T}{2^K}\right)^{2^K/m_T} \right\}^{m_T m_F / 2^K}
           \ge \left(1 - \frac{1}{c}\right)^{c m_T m_F / 2^K}

holds, since (1 − 1/x)^x is monotonically increasing for x ≥ 1. Note that 0 < (1 − 1/c)^c < e^{−1} and (1 − 1/c)^c → e^{−1} in the limit as c → ∞.

Lemma 5  If m_T = o(2^{2K/3}) and m_F = o(2^{2K/3}) hold, then for any subset S of size K we have E(X_S) ≤ (1 + o(1)) e^{−m_T m_F / 2^K}, where o(1) → 0 as K → ∞.

Proof: By substituting m = m_T − \sqrt{m_T} in (22), we have

    E(X_S) \le \left(1 - \frac{m_T - \sqrt{m_T}}{2^K}\right)^{m_F} + \left(\frac{m_T - \sqrt{m_T} - 1}{2^K}\right)^{\sqrt{m_T} + 1}
           = \left\{ \left(1 - \frac{m_T - \sqrt{m_T}}{2^K}\right)^{2^K/(m_T - \sqrt{m_T})} \right\}^{(m_T - \sqrt{m_T}) m_F / 2^K} + o\!\left(e^{-m_T m_F / 2^K}\right)
           \le e^{-m_T m_F / 2^K + o(1)} + o\!\left(e^{-m_T m_F / 2^K + o(1)}\right)
           = (1 + o(1))\, e^{-m_T m_F / 2^K}.

The second line is based on the facts that (m_T − \sqrt{m_T} − 1)/2^K → 0 and e^{−\sqrt{m_T}\, m_F / 2^K} → 1 as K → ∞.
Proof: On the one hand, ¡ n ¢by Lemma 4 andc (19), we have the inequality E(X(K, mT , mF , n)) > 1 hold if mT mF < 2K ln K /(− ln(1 − 1/c) ). By the assumption, we can choose c → +∞, which implies − ln(1 − 1/c)c = 1 + o(1). On the other ¡ n ¢ hand, E(X(K, mT , mF , n)) < 1 is implied by Lemma 5 and (19), whenever mT mF > 2K (ln K + ln(1 + o(1))). Note that there exist mT and mF satisfying all conditions mT = o(22K/3 ), mF = o(22K/3 ) and ¡n¢ K/3 mT mF = Ω(2K ln K ), if n = o(e(2 )/K ) holds. It might be interesting to note that the values of s µ ¶ n K mT = mF = 2 ln (23) K were observed to lie either between LB and U B or slightly larger than U B of Table 7, and hence (23) may be used as a good estimate of mT = mF for which the condition E(X(K, mT , mF , n)) ≈ 1 holds. As a consequence of Theorem 6, we have the following corollary, in which we note that (Ke)1/K → 1 as K → ∞. Corollary 1 Suppose mT = o(22K/3 ) and mF = o(22K/3 ), and suppose n ≥ ((Ke)1/K + 1)K − 1 and ln ln n = o(K). Then E(X(K, mT , mF , n)) ≈ 1 implies K +lg K = lg(mT mF )(1+o(1)), where o(1) → 0 as K → ∞.
Proof: The equation m_T m_F = 2^K \ln\binom{n}{K}\,(1 + o(1)) in Theorem 6 is equivalent to

    \lg(m_T m_F) = K + \lg \ln\binom{n}{K} + o(1).

As

    \binom{n}{K} \le \left(\frac{en}{K}\right)^K

holds for any n and K with n ≥ K ≥ 1 (see e.g. p. 434 of [42]), we have

    \lg \ln\binom{n}{K} \le \lg K + \lg(1 + \ln n - \ln K) \le \lg K + \lg(1 + \ln n)

for K ≥ 1. Moreover, as

    \binom{n}{K} \ge \frac{1}{Ke}\left(\frac{e(n - K + 1)}{K}\right)^K                                (24)

holds for any n and K with n ≥ K ≥ 1 (the proof of (24) will be given later), we have

    \lg \ln\binom{n}{K} \ge \lg(K + K \ln(n - K + 1) - (K + 1)\ln K - 1) \ge \lg K

for n ≥ ((Ke)^{1/K} + 1)K − 1. Therefore we have

    K + \lg K + o(1) \le \lg(m_T m_F) \le K + \lg K + \lg(1 + \ln n) + o(1).

As (\lg(1 + \ln n))/K = O(\ln\ln n)/K → 0 (K → ∞), K + \lg K = \lg(m_T m_F)(1 + o(1)) holds.

Next we prove (24). As

    \ln K! = \ln 1 + \ln 2 + \cdots + \ln K \le \int_1^K \ln x \, dx + \ln K = (K + 1)\ln K - K + 1

holds for K ≥ 1, we have

    K! \le K^{K+1} e^{-K+1}.

Hence

    \binom{n}{K} = \frac{n(n - 1)\cdots(n - K + 1)}{K!} \ge \frac{(n - K + 1)^K}{K^{K+1} e^{-K+1}} = \frac{1}{Ke}\left(\frac{e(n - K + 1)}{K}\right)^K

holds for any n and K with n ≥ K ≥ 1.
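As a quick numerical check of the estimate (23), the following short Python sketch (ours, not part of the paper) evaluates \sqrt{2^K \ln\binom{n}{K}} for a few (K, n) pairs; the resulting values can be compared with the LB and UB entries of Table 7.

    from math import comb, log, sqrt

    def estimate_23(K, n):
        """Estimate (23): m_T = m_F = sqrt(2^K * ln C(n, K))."""
        return sqrt(2**K * log(comb(n, K)))

    for K, n in [(5, 20), (10, 20), (10, 100)]:
        print(f"K={K}, n={n}: about {estimate_23(K, n):.0f}")

For (K, n) = (10, 20), for example, the estimate is about 111, which falls between the LB = 108 and UB = 112 entries of Table 7, in line with the observation accompanying (23).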