From Cases to Rules: Induction and Regression

Enriqueta Aragones∗, Itzhak Gilboa†, Andrew Postlewaite‡, and David Schmeidler§

March 2002
Abstract

Inductive reasoning aims at finding general rules that hold true in a database. Specifically, finding a functional relationship between one variable and several others is the problem of regression. We show that several problems associated with these processes are computationally hard. Finding regularities in a database is a hard problem, though verifying that a certain regularity holds (once it has been suggested) is easy. A similar result applies to linear regression: finding a small set of variables that attains a certain value of R² is computationally hard. Computational complexity may explain why a person is not always aware of rules that, if asked, she would find valid. This may explain why one can often change other people’s minds (opinions, beliefs) without providing them with new information.
1 Motivation
A: “Russia is a dangerous country.”
B: “Nonsense.”

∗ Institut d’Anàlisi Econòmica, C.S.I.C. [email protected]
† Tel-Aviv University. [email protected]
‡ University of Pennsylvania. [email protected]
§ Tel-Aviv University and the Ohio State University. [email protected]
A: “Don’t you think that Russia might initiate a war against a Western country?”
B: “Not a chance.”
A: “Well, I believe it very well might.”
B: “Can you come up with examples of wars that erupted between two democratic countries?”
A: “I guess so. Let me see... How about England and the US in 1812?”
B: “OK, save colonial wars.”
A: “Anyway, you were wrong.”
B: “I couldn’t be wrong, as I said nothing. I merely posed a question.”
A: “Yes, but your implicit claim was obvious.”
B: “My implicit claim is that you can barely come up with such examples, apart from colonial wars.”
A: “Well, this might be true.”
B: “It’s called ‘democratic peace’ and it’s a well-known phenomenon; I didn’t make it up.”
A: “So what are you saying?”
B: “That Russia is not very likely to initiate a war against a Western democracy.”
A: “Well, you might have a point, but only if Russia remains a democracy.”

The argument can go on forever. The point we would like to make is that B seems to have managed to change A’s views, despite the fact that B has not provided A with any new information. Rather, B has pointed out a certain regularity (the “democratic peace” phenomenon) in the cases that are known to both A and B. This regularity is, apparently, new to A. Yet, A had had all the information needed to observe it before meeting B. It simply did not occur to A to test the accuracy of the generalization suggested to her by B. A’s knowledge base contained many cases of wars (and of peaceful resolutions of conflicts as well), but they were not organized by the type of regime of the relevant parties. Indeed, it is likely that, if B were to ask A, “Can you name wars that occurred in the 20th century?” or “Can you name wars in which Germany was involved?”, A would have had no difficulty in coming up with examples. This is due to the fact that most people can access their memory of international conflicts by the “indexes” of “period” or “country”. But most people do not generate the index “type of regime” unless they are encouraged to do so. Thus, drawing someone’s attention to a certain regularity, or similarity among cases, may cause them to view the world differently without providing them with any new factual information.
2 Introduction
Human reasoning is a multi-faceted phenomenon. People use logical rules and perform deductive reasoning. They also form probabilistic beliefs and update them according to Bayes’ rule. And they also reason from specific cases. Reasoning from cases often takes the form of drawing analogies between two specific cases. But sometimes it also yields general rules that are gleaned from cases. Gilboa and Schmeidler (2001) argue that analogical reasoning avoids the theoretical problems posed by induction. Yet, they admit that rules still serve two purposes: first, they offer a concise way to summarize many cases; second, they point out similarities among cases. Especially when one wishes to convey information to another, rules prove to be a very efficient way to communicate. In the motivating example we started with, a rule is used by a speaker to point out a certain regularity to a listener. It is therefore an example in which a rule, even if it is not perfectly correct, can change the listener’s similarity function, and the way she reasons from cases.

How does one model rules, and how do they evolve out of cases? In this paper we offer a very simple model, in which cases are simply real-valued vectors. A rule in such a model is a specification of what cannot happen. Formally, it is an assignment of values to a subset of columns, interpreted as claiming that this combination of values cannot occur. For instance, the rule “All ravens are black” corresponds to the claim “There does not exist a non-black raven”. Another type of regularity that one may find in a database is a functional relationship between several (“predicting”) variables and another one (the “predicted” variable). We refer to this type of regularity as “regression”. For rules as well as for regression, one generally prefers a simple regularity, which we take to mean one that uses few variables.

In this paper we show that finding simple regularities is a computationally hard problem. Thus, as in the democratic peace example above, people are often surprised by regularities that apply to data they have access to. One reason is that they cannot find all the regularities that hold true in a given database; they may therefore discover that a certain regularity, which they would accept as valid, simply had not occurred to them.
3 Induction

3.1 Rules
Rules have traditionally been modeled by propositional logic, which may capture rather sophisticated formal structures of rules (see Carnap (1950)). By contrast, we offer here a very simplified model of rules, aiming to facilitate the discussion of the process of induction and to highlight its similarity to regression analysis.

Let C = {1, ..., n} and A = {1, ..., m} be finite and non-empty sets of cases and of attributes, respectively. We assume that data are given by a matrix X = (xij)i≤n,j≤m of real numbers in [0, 1], such that, for all i ≤ n, j ≤ m, xij measures the degree to which case i has attribute j. (Equivalently, we may view X as a function X : C × A → [0, 1].) Attributes may also be
thought of as variables. We will also refer to a row vector xi = (xij)j≤m as case or observation i, and to a column vector Xj = (xij)i≤n as attribute or variable j.

We assume that the matrix X is known in its entirety. In many applications it will be necessary to relax this assumption. In particular, a prediction or decision problem presents itself with an unknown outcome. Thus, the attributes of the outcome are missing at the time that prediction or decision is called for. For the purposes of this paper, however, assuming a complete matrix X greatly simplifies notation.

A rule is a pair (B, β) such that B ⊂ A and β : B → {0, 1}.¹ The interpretation of the rule (B, β) is that there is no case whose values on the attributes in B coincide with β. For instance, assume that the set of cases consists of living creatures. Saying that “all humans are mortal” would correspond to setting B = {human, mortal} and β(human) = 1, β(mortal) = 0, stating that one cannot find a case of a human who is not mortal. To consider the example we started out with, assume that the database contains cases of political conflict. If we set B = {country_1_democratic, country_2_democratic, war} and β(country_1_democratic) = β(country_2_democratic) = β(war) = 1, the rule (B, β) corresponds to the claim that there are no cases of political conflict in which the two countries were democratic and war ensued.

To what extent does a rule (B, β) hold in a database X? Let us begin by asking to what extent the rule applies to a particular case i. We suggest that this be measured by

f((B, β), i) = max_{j∈B} |xij − β(j)|.

Thus, if case i is a clear-cut counter-example to the rule (B, β), i will be a case in which all attributes in B have the values specified for them by β. That is, xij = β(j) for all j ∈ B, and then f((B, β), i) = 0. By contrast, if at least one of the attributes in B is not shared by i at all, that is, |xij − β(j)| = 1 for at least one j ∈ B, then f((B, β), i) = 1, reflecting the fact that i fails to constitute a counter-example to (B, β), and thus (B, β) holds in case i. Generally, the closer the values (xij)j∈B are to (β(j))j∈B, the closer i gets to being a counter-example to (B, β). For instance, one might wonder whether the Falkland Islands war is a counter-example to the democratic peace rule. To this end, one has to determine to what extent Argentina was a democracy at that time. The more Argentina is deemed democratic, the stronger is the contradiction suggested by this example to the general rule.

Given the degree to which a rule (B, β) applies in each case i, it is natural to define the applicability of a rule (B, β) given the entire database as its average applicability over specific cases:

f(B, β) = (1/n) Σ_{i≤n} f((B, β), i).

¹ One may also define β : B → [0, 1], and allow rules to have intermediate values of the attributes. The following definitions and results hold with this more general definition. For some purposes it may be useful to let β assume interval values, namely, to exclude ranges of the attribute value.
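As an illustration, the two definitions above can be computed directly. The following sketch (in Python, with hypothetical names; it is not part of the paper’s formal apparatus) evaluates f((B, β), i) and f(B, β) on a toy database:

```python
def rule_applicability(X, B, beta):
    """f(B, beta): average over cases i of f((B, beta), i) = max_{j in B} |x_ij - beta[j]|.

    X is a list of rows (cases), each a list of attribute values in [0, 1];
    B is a list of column indices; beta maps each j in B to a value in {0, 1}.
    A value of 1 means no case comes close to being a counter-example."""
    per_case = [max(abs(row[j] - beta[j]) for j in B) for row in X]
    return sum(per_case) / len(per_case)

# Toy database with attributes (human, mortal), one case per row.
X = [[1.0, 1.0],   # a human who is mortal
     [0.0, 0.0],   # a case that is neither human nor mortal
     [1.0, 1.0]]
# "All humans are mortal": no case with human = 1 and mortal = 0.
print(rule_applicability(X, [0, 1], {0: 1.0, 1: 0.0}))  # 1.0
```

A row such as [1.0, 0.0] (a human who is not mortal) would bring the average below 1, as it is a counter-example to the rule.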
This definition appears to be reasonable when the database contains cases that were not selectively chosen. Observe that one may increase the value of f(B, β) by adding cases to C in which the rule (B, β) has no bite and is therefore vacuously true. For instance, the veracity of the democratic peace phenomenon will be magnified if we add many cases in which no conflict occurred. We implicitly assume that only relevant cases are included in C. More generally, one may augment the model with a relevance function, and weight cases in f(B, β) by this function.

As argued by Wittgenstein (1922), “The process of induction is the process of assuming the simplest law that can be made to harmonize with our experience.” (Proposition 6.363).² In our model, the complexity of a rule (B, β) is naturally modeled by the number of attributes it refers to, namely, |B|.

² Simplicity was mentioned by William of Occam several centuries earlier. But Occam’s razor is an argument with a normative flavor, whereas here we refer to a descriptive claim about the nature of human reasoning.
Thus, performing induction may be viewed as looking for a small set B that, coupled with an appropriate β, will have a high value of f(B, β). For instance, if there had never been any wars in human history, the simplest rule would be “there are no wars”, corresponding to the set B = {war} and β(war) = 1. If, indeed, the column war in the matrix X consisted of zeroes alone, it is more likely that one would come up with the generalization that wars do not occur than with the democratic peace phenomenon. Alas, this is not the case. Indeed, the very fact that the attribute war exists in the model probably indicates that wars have occurred. But if democratic countries had never been involved in wars, a simpler rule, corresponding to B = {country_1_democratic, war}, would be arrived at. Unfortunately, even this is plainly false. Only the more complicated rule, restricted to cases in which both countries are democracies, has a claim to verisimilitude.

This discussion suggests the following problem:

Problem INDUCTION: Given a matrix X, a natural number k ≥ 1 and a real number r ∈ [0, 1], is there a rule (B, β) such that |B| ≤ k and f(B, β) ≥ r?

We can now state the following result.

Theorem 1 INDUCTION is an NP-Complete problem.³

This result suggests that there is probably no practical way to find the simplest regularity that holds in a given database. The reason that people are often surprised by the democratic peace phenomenon is not that they are too lazy to think, or that they are irrational. Rather, it is that the problem is genuinely hard. There is nothing irrational about not being able to solve NP-Hard problems. Faced with the induction problem, which is NP-Hard,⁴ people may use various heuristics for finding some rules that are true, but they cannot be sure, in general, that the rules they have found are the simplest ones.

³ For a definition of NP-Completeness, see Garey and Johnson (1979).
⁴ All NP-Complete problems are NP-Hard. NP-Hard problems include problems that are not NP-Complete, such as problems that are not in NP.

We do not claim that combinatorial complexity is necessarily the most important cognitive limitation because of which people fail to perform induction. Indeed, even polynomial problems can be hard to solve when the database consists of many cases and many attributes. Moreover, it is often the case that looking for a general rule does not even cross someone’s mind. Yet, the difficulty of performing induction shares an important property with NP-Complete problems: while it is hard to come up with a solution to such a problem, it is easy to verify whether a suggested solution is valid. Similarly, it is hard to come up with an appropriate generalization, but it is relatively easy to assess the applicability of such a generalization once it is offered.
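The asymmetry discussed here, that finding a rule requires search over exponentially many candidates while checking a suggested rule is a single linear-time evaluation, can be made concrete. The sketch below (hypothetical code, assuming the [0, 1]-matrix model above with β restricted to {0, 1}) decides INDUCTION by brute force:

```python
import itertools

def f_rule(X, B, beta):
    # f(B, beta): average over cases of max_{j in B} |x_ij - beta[j]|.
    per_case = [max(abs(row[j] - b) for j, b in zip(B, beta)) for row in X]
    return sum(per_case) / len(per_case)

def induction_brute_force(X, k, r):
    """Decide INDUCTION by exhaustive search: return a rule (B, beta) with
    |B| <= k and f(B, beta) >= r, or None.  The search visits on the order
    of C(m, k) * 2^k candidates, exponential in k, while *verifying* one
    suggested rule is a single, linear-time call to f_rule."""
    m = len(X[0])
    for size in range(1, k + 1):
        for B in itertools.combinations(range(m), size):
            for beta in itertools.product([0.0, 1.0], repeat=size):
                if f_rule(X, B, beta) >= r:
                    return B, beta  # witness found
    return None

X = [[0.0, 1.0],
     [1.0, 0.0]]
print(induction_brute_force(X, 1, 1.0))  # None: no one-attribute rule holds fully
print(induction_brute_force(X, 2, 1.0))  # ((0, 1), (0.0, 0.0)): "no case has both attributes at 0"
```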
3.2 Simple Rules
The induction problem stated above is computationally hard for two combinatorial reasons: first, there are many subsets B that may be relevant to the rule. Second, for each given B there are many assignments of values β. Our intuitive discussion, however, focused on the first issue: we claimed that it is hard to find minimal regularities because there are many subsets of variables one has to consider. It is natural to wonder whether the complexity of problem INDUCTION is due to the multitude of assignments β, and has little to do with our intuitive reasoning about the multitude of subsets.⁵ We therefore devote this sub-section to simple rules, defined as rules (B, β) where β ≡ 1. A simple rule can thus be identified with a subset B ⊂ A. We denote f((B, β), i) by f(B, i), and f(B, β) by f(B). Explicitly,

f(B, i) = max_{j∈B} (1 − xij)

and

f(B) = (1/n) Σ_{i≤n} f(B, i).

⁵ This concern can only be aggravated by reading our proof: we actually use all attributes in the proof of complexity, relying solely on the difficulty of finding the assignment β.
We now formulate the following problem:

Problem SIMPLE INDUCTION: Given a matrix X, a natural number k ≥ 1 and a real number r ∈ [0, 1], is there a simple rule B ⊂ A such that |B| ≤ k and f(B) ≥ r?

We can now state the following result.

Proposition 2 SIMPLE INDUCTION is an NP-Complete problem.

It follows that the difficulty of the problem INDUCTION is not an artifact of the function β; it also has to do with the selection of the variables one considers.
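The link between simple rules and set covering, which drives the proof of Proposition 2 in the appendix, can be sketched as follows (hypothetical helper names; the encoding xij = 0 when element i lies in Sj follows the reduction described there):

```python
import itertools

def f_simple(X, B):
    # f(B) for a simple rule (beta == 1): average over cases of max_{j in B} (1 - x_ij).
    per_case = [max(1.0 - row[j] for j in B) for row in X]
    return sum(per_case) / len(per_case)

def cover_to_matrix(p, subsets):
    """Encode a COVER instance as a database: x_ij = 0 if element i lies in
    S_j, and 1 otherwise.  Then f(B) = 1 exactly when the subsets indexed
    by B cover the ground set {0, ..., p-1}."""
    return [[0.0 if i in S else 1.0 for S in subsets] for i in range(p)]

# COVER instance over S = {0, 1, 2, 3} with three candidate subsets.
subsets = [{0, 1}, {1, 2}, {2, 3}]
X = cover_to_matrix(4, subsets)
print(f_simple(X, [0, 2]))  # 1.0: {0,1} and {2,3} together cover S
print(f_simple(X, [0, 1]))  # 0.75: element 3 is left uncovered
```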
4 Regression

4.1 Qualitative Regression
Rules are defined above as excluding certain possibilities, such as “Country 1 is a democracy, country 2 is a democracy, and countries 1 and 2 engaged in a war”. But there are regularities that go beyond the exclusion of some combination of values of variables. For instance, consider the regularity, “The increase in the price index equals the increase in the quantity of money”. This regularity is best described as an equation, relating one variable to another. A slightly more refined observation might be “The increase in the price index equals the increase in the quantity of money divided by the increase in GNP”. In this case, one variable is described as a function of other variables.

Generally, one is often interested in approximating the value of a given variable by a function of other variables. Indeed, statistical regression analysis addresses precisely this problem, under the specification of a particular functional form. For instance, one may use linear regression, or log-linear regression (as in the quantity of money example above). Linear regression will be dealt with in the following sub-section. Here we study people’s ability to find regularities in problems for which no given structure is known. In particular, there are many problems where the data are qualitative and one does not expect a particular functional form to hold. Yet, some combinations of variables will be more informative than others. For instance, in the democratic peace example, the way conflicts between two countries are resolved is the variable of interest. Variables that might be useful in predicting it include the strengths of the armies of the two countries involved, their regimes, and so forth. None of these variables is quantitative by nature, and there is no reason to expect, a priori, that a particular functional form would express the relationship between the predicting variables and the predicted one. The problem that arises, therefore, is to find both a set of predicting variables and a function thereof that approximates a given variable. We refer to this problem as qualitative regression.

We will refer to the last column of the matrix as the predicted variable. Specifically, consider a database consisting of cases C = {1, ..., n} and attributes A = {1, ..., m}. The data are X = (xij)i≤n,j≤m and Y = (yi)i≤n, where yi ∈ [0, 1]. We wish to predict the value of Y as a function of (Xj)j≤m. Let a predictor for Y be a pair (B, g) where B ⊂ A and g : [0, 1]^B → [0, 1]. Given a predictor (B, g) for Y, it is natural to define its degree of accuracy in case i ∈ C by

f((B, g), i) = (g((xij)j∈B) − yi)²

and

f(B, g) = (1/n) Σ_{i≤n} f((B, g), i).

Thus, f(B, g) corresponds to the sum of squared errors (SSE) in linear regression analysis, divided by the number of observations.
The quest for simplicity drives people to look for simple predictors, which we here take to mean predictors that employ a small set of attributes B. Indeed, finding a good fit with a small number of predicting variables is also the goal of regression analysis, and, for the same R², one would typically place higher belief in the predictions generated by a smaller set of variables.⁶ We are therefore led to defining the following problem:

Problem QUALITATIVE REGRESSION: Given a matrix X and a vector Y, a natural number k ≥ 1 and a real number r ∈ [0, 1], is there a predictor (B, g) for Y such that |B| ≤ k and f(B, g) ≤ r?

We can now state the following result.

Theorem 3 QUALITATIVE REGRESSION is an NP-Complete problem.

It follows that looking at a database and finding functional relationships in it is, in general, a hard problem. In the absence of additional structure, one cannot be expected to find the simplest functional relationships that hold in the given database. Correspondingly, one may be surprised to be told that such a functional relationship holds, even if, in principle, one had all the information needed to find it.
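For a fixed B, the best attainable f(B, g) is easy to compute: the minimizing g assigns to each distinct vector of predictor values the average of Y over the matching cases (this is the observation used in the proof of Theorem 3 in the appendix). A sketch with hypothetical names:

```python
def best_predictor_sse(X, Y, B):
    """Minimal f(B, g) over all g, for a fixed attribute set B.

    The optimal g maps each distinct vector of predictor values to the
    average of Y over the cases exhibiting that vector.  The computationally
    hard part of QUALITATIVE REGRESSION is choosing B, not g."""
    groups = {}
    for row, y in zip(X, Y):
        groups.setdefault(tuple(row[j] for j in B), []).append(y)
    sse = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        sse += sum((y - mean) ** 2 for y in ys)
    return sse / len(Y)  # f(B, g*) = SSE / n

X = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 1.0],
     [0.0, 1.0]]
Y = [1.0, 0.0, 0.0, 1.0]
print(best_predictor_sse(X, Y, [0]))     # 0.25
print(best_predictor_sse(X, Y, [0, 1]))  # 0.125: a finer B never fits worse
```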
4.2 Linear Regression
We now turn to the more structured problem, in which one is interested in a linear relationship between one variable and several others. This special case is of interest in light of the prevalence of this technique.

Assume, then, that we are trying to predict a variable Y given the predictors X1, ..., Xm. For a subset K of {X1, ..., Xm}, let R²_K be the value of the coefficient of determination R² when we regress (yi)i≤n on (xij)i≤n,j∈K.

Problem MINIMAL REGRESSION: Given a matrix X and a vector Y, a natural number k ≥ 1, and a real number r ∈ [0, 1], is there a subset K of {X1, ..., Xm} with |K| ≤ k and R²_K ≥ r?

⁶ The next subsection deals with linear regression specifically.
Theorem 4 MINIMAL REGRESSION is an NP-Complete problem.

This result shows that it is a hard problem to find the smallest set of variables that attains a pre-specified level of R². Alternatively, it can be viewed as pointing out that finding the highest R² for a pre-specified number of variables k is a hard problem. In line with Theorem 3, the present result might also explain why people may be surprised to learn of simple regularities that exist in a database they have access to. We have already established that it is hard to find a small set of variables that can predict another variable. We now see that this problem is hard also when only linear functions are taken into account.⁷ A person who has access to the data should, in principle, be able to assess the veracity of all linear theories pertaining to these data. Yet, due to computational complexity, this capability remains theoretical. In practice one may often find that one has overlooked a simple linear regularity that, once pointed out, seems evident.

While the focus of this paper is on everyday human reasoning, Theorem 4 can also be interpreted as a result about scientific research. It is often the case that a scientist is trying to regress a variable Y on some predictors (Xj)j. Naturally, increasing the number of predictors can only increase R², and randomly chosen data points will yield R² = 1 with probability 1 if the number of predictors is large enough (relative to the number of observations). More generally, a theory with too many degrees of freedom may explain anything we like. But this does not mean that we believe it to be “correct” or that we wish to use it for prediction, decision making, and so forth. Rather, we generally prefer simple theories that can account for a large database. In the context of regression analysis, it is natural to measure the complexity of a theory by the number of predicting variables it employs. Thus, scientists often look for a small number of variables that will attain a high R². Our result shows that many practicing scientists can be viewed as coping with a problem that is NP-Hard. This result might explain why empirical scientific research requires expertise: for many problems of interest, there are too many potential variables for an exhaustive search over all theories. Moreover, there is no known polynomial algorithm that can find a “good” theory, namely one that has relatively few variables but attains a high R². Thus, the scientist’s work cannot be automated.

⁷ Observe that this theorem is not stronger than the previous result. Restricting attention to a sub-class of functions need not make the problem easier. While a sub-class of functions is easier to exhaust, solution algorithms are not restricted to enumeration of all possible functions.
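A brute-force decision procedure for MINIMAL REGRESSION makes the exponential blow-up visible. The sketch below (hypothetical names, using ordinary least squares with an intercept) searches over all subsets of at most k columns:

```python
import itertools
import numpy as np

def r_squared(X, y, K):
    """R^2 of an OLS regression (with intercept) of y on the columns in K."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in K])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def minimal_regression(X, y, k, r):
    """Decide MINIMAL REGRESSION by brute force: return a subset K of at
    most k columns with R^2_K >= r, or None.  The number of subsets examined
    grows exponentially in k, which is the blow-up Theorem 4 addresses."""
    for size in range(1, k + 1):
        for K in itertools.combinations(range(X.shape[1]), size):
            if r_squared(X, y, K) >= r - 1e-9:
                return K
    return None

# y = X_0 + X_1 exactly, so K = (0, 1) attains R^2 = 1, but no single column does.
X = np.array([[0., 0., 1.],
              [1., 0., 0.],
              [0., 1., 0.],
              [1., 1., 1.]])
y = X[:, 0] + X[:, 1]
print(minimal_regression(X, y, 1, 0.99))  # None
print(minimal_regression(X, y, 2, 0.99))  # (0, 1)
```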
5 Modeling Issues and Future Directions
Most of the formal literature in economic theory and in related fields adheres to the Bayesian model of information processing. In such a model, a decision maker starts out with a prior probability, and she updates it in the face of new information by Bayes’ rule. Hence, this model can easily capture changes of opinion that result from new information. But it does not deal very gracefully with changes of opinion that are not driven by new information. In fact, in a Bayesian model with perfect rationality, people cannot change their opinions unless new information has arrived. It follows that the examples we started out with cannot be explained by such models. Relaxing the perfect rationality assumption, one may attempt to provide a pseudo-Bayesian account of the phenomena discussed here. We believe, however, that the model offered in this paper may provide further insight into the mechanism of induction.

Another approach to modeling induction might provide a more explicit account of the components of cases. One may wish to model the entities involved in each case and the relations, or predicates, that hold among them, and use these in the formal definition of induction. Such a predicate model would provide more structure, would be closer to the way people think of complex problems, and would allow for a more intuitive modeling of analogies than one can obtain from our present model. Moreover, while the mathematical notation required to describe a predicate model is more cumbersome than that used for the attribute model above, the description of actual problems within the predicate model may be more concise. In particular, this implies that problems that are computationally easy in the attribute model may still be computationally hard with respect to the predicate model.⁸
Appendix: Proofs

Proof of Theorem 1: That INDUCTION is in NP is simple to verify: given a suggested rule (B, β), one may calculate f(B, β) in linear time in the size of the sub-matrix C × B (which is bounded by the size of the input, |C × A|).⁹ That INDUCTION is NP-Complete may be proven by a reduction of the satisfiability problem:¹⁰

Problem SATISFIABILITY: Given a Boolean function f in CNF in the variables y1, ..., yp, is there an assignment of values ({0, 1}) to the variables for which f = 1?

Let there be given the function f = Π_{i≤q} fi, where each factor fi is the summation of variables yj and their negations ȳj. (The variables are Boolean, summation means disjunction, multiplication means conjunction, and a bar denotes logical negation.) Let n = q and m = p. For each factor fi, i ≤ q = n, let there be a case i. For each variable yj, j ≤ p = m, let there be an attribute j. Define xij as follows: if yj appears in fi, let xij = 0; if ȳj appears in fi, let xij = 1; otherwise, let xij = 0.5.

We claim that there exists a rule (B, β) with |B| = k = m such that f(B, β) ≥ r = 1 iff f is satisfiable by some assignment of values to the variables y1, ..., yp. Observe that every rule (B, β) with |B| = m defines an assignment of values (0 or 1) to the variables (yj)j≤p, and vice versa. We claim that every rule (B, β) with |B| = m obtains the value f(B, β) = 1 iff the corresponding assignment satisfies f. To see this, let (B, β) be a rule with |B| = m. Note that f(B, β) = 1 iff for every case i we have f((B, β), i) = 1, which holds iff for every case i there exists an attribute j such that |xij − β(j)| = 1, that is, xij = 1 − β(j). By construction of the matrix X, xij = 1 − β(j) iff (i) yj appears in fi and β(j) = yj = 1, or (ii) ȳj appears in fi and β(j) = yj = 0. (Observe that, if neither yj nor ȳj appears in fi, |xij − β(j)| = 0.5.) In other words, xij = 1 − β(j) iff the variable yj (or its negation) satisfies the factor fi. It follows that f(B, β) = 1 iff the assignment defined by β satisfies f. Observing that the construction above can be performed in polynomial time, the proof is complete.¤

Proof of Proposition 2: It is easy to see that SIMPLE INDUCTION is in NP. To show that SIMPLE INDUCTION is NP-Complete, we use a reduction of the following problem, which is known to be NP-Complete (see Garey and Johnson (1979)):

Problem COVER: Given a natural number p, a set of subsets of S ≡ {1, ..., p}, S = {S1, ..., Sq}, and a natural number t ≤ q, are there t subsets in S whose union contains S? (That is, are there indices 1 ≤ j1 ≤ ... ≤ jt ≤ q such that ∪_{l≤t} Sjl = S?)

Given an instance of COVER, p, S = {S1, ..., Sq}, and t, define n = p, m = q, and k = t. Thus each member of S corresponds to a case i ∈ C, and each subset Sj ∈ S to an attribute j ∈ A. Let xij = 1 if i ∉ Sj and xij = 0 if i ∈ Sj. We argue that there is a simple rule B ⊂ A such that |B| ≤ k and f(B) ≥ 1 iff there are k subsets {Sjl}l≤k whose union covers S. Indeed, such a rule exists iff there is a set B of k attributes {jl}l≤k such that, for every i, f(B, i) = 1. This holds iff for every i there exists an attribute jl ∈ B such that xijl = 0. And this holds iff for each member of S there is at least one of the k sets {Sjl}l≤k to which it belongs. Finally, observe that the construction of the matrix X is polynomial in the size of the data.¤

Proof of Theorem 3: We first show that QUALITATIVE REGRESSION is in NP. To this end, it suffices to show that, for any given set of attributes B ⊂ A with |B| = k, one may find in polynomial time (with respect to n × (m + 1)) whether there exists a function g : [0, 1]^B → [0, 1] such that f(B, g) ≤ r. Let there be given such a set B. Restrict attention to the columns corresponding to B and to Y, and consider them as a new matrix X′ of size n × (|B| + 1). Sort the rows of X′ lexicographically by the columns corresponding to B. Observe that, if there are no two identical vectors (xij)j∈B (for different i’s in C), there exists a function g : [0, 1]^B → [0, 1] such that f(B, g) = 0. Generally, f(B, g) will be minimized when, for every vector (xij)j∈B ∈ [0, 1]^B that appears in the matrix corresponding to C × B, g((xij)j∈B) is set to the average of (yl) over all l ∈ C with (xlj)j∈B = (xij)j∈B. That is, for every selection of values of the predicting variables, the sum of squared errors is minimized if we choose the average value (of the predicted variable) for this selection, and this selection is done separately for every vector of values of the predicting variables. It only remains to check whether this optimal choice of g yields a value for f(B, g) that exceeds r or not.

⁸ In “Rhetoric and Analogies”, we present both the attribute and the predicate models for the study of analogies, prove their equivalence, and show that finding a good analogy in the predicate model is a hard problem.
⁹ Here and in the sequel we assume that reading an entry in the matrix X, as well as any algebraic computation, requires a single time unit. Our results hold also if one assumes that the xij are all rational and takes into account the time it takes to read and manipulate these numbers.
¹⁰ SATISFIABILITY is the first problem that was proven to be NP-Complete. This was done directly, whereas proofs of NP-Completeness of other problems are typically done by reduction of SATISFIABILITY to these problems (often via other problems). See Garey and Johnson (1979) for definitions and more details.
We now turn to show that QUALITATIVE REGRESSION is NP-Complete. The reduction is, again, from COVER. Let there be given a natural number p, a set of subsets of S ≡ {1, ..., p}, S = {S1, ..., Sq}, and a natural number t. Define n = p + 1, m = q, and k = t. Define (xij)i≤n,j≤m and (yi)i≤n as follows. For i ≤ p and j ≤ m = q, set xij = 1 if i ∈ Sj and xij = 0 otherwise. For i ≤ p, set yi = 1. For all j ≤ m, let xnj = 0. Finally, set yn = 0.

We claim that there is a predictor (B, g) for Y with |B| = k and f(B, g) = 0 iff there are k = t subsets in S that cover S. Indeed, there exists a predictor (B, g) for Y with |B| = k and f(B, g) = 0 iff there are k columns out of the m columns in the matrix such that none of the first p rows, restricted to these columns, consists of zeroes alone. (Observe that the first p observations of Y are 1. Thus the only obstacle to defining g so as to obtain a perfect match f(B, g) = 0 arises if one of these rows, restricted to the k chosen columns, is equal to the last row, which consists of zeroes for the predicting variables and zero also for Y.) And this holds iff there are k sets in S = {S1, ..., Sq} that cover S. Finally, observe that the construction is polynomial.¤

Proof of Theorem 4: It is easy to see that MINIMAL REGRESSION is in NP: given a suggested set K ⊂ {1, ..., m}, one may calculate R²_K in polynomial time in |K|n (which is bounded by the size of the input, (m + 1)n). To show that MINIMAL REGRESSION is NP-Complete, we use a reduction of the following problem, which is known to be NP-Complete (see Garey and Johnson (1979)):

Problem EXACT COVER: Given a set S and a set of subsets of S, S, are there pairwise disjoint subsets in S whose union equals S? (That is, does a subset of S constitute a partition of S?)

Given a set S and a set of subsets of S, S, we will generate n observations of (m + 1) variables, (xij)i≤n,j≤m and (yi)i≤n, a natural number k and a number r ∈ [0, 1], such that S has an exact cover in S iff there is a subset K of {1, ..., m} with |K| ≤ k and R²_K ≥ r. Let there be given, then, S and S. Assume without loss of generality that S = {1, ..., s} and that S = {S1, ..., Sl} (where s, l ≥ 1 are natural numbers).
We construct n = s + l + 1 observations of m = 2l predicting variables. It will be convenient to denote the predicting variables by X_1, ..., X_l and Z_1, ..., Z_l, and the predicted variable by Y. Their corresponding values will be denoted (x_ij)_{i≤n,j≤l}, (z_ij)_{i≤n,j≤l}, and (y_i)_{i≤n}. We will use X_j, Z_j, and Y also to denote the column vectors (x_ij)_{i≤n}, (z_ij)_{i≤n}, and (y_i)_{i≤n}, respectively.¹¹ We now specify these vectors:

For i ≤ s and j ≤ l, x_ij = 1 if i ∈ S_j and x_ij = 0 if i ∉ S_j;
For i ≤ s and j ≤ l, z_ij = 0;
For s < i ≤ s + l and j ≤ l, x_ij = z_ij = 1 if i = s + j and x_ij = z_ij = 0 if i ≠ s + j;
For j ≤ l, x_nj = z_nj = 0;
For i ≤ s + l, y_i = 1, and y_n = 0.

We claim that there is a subset K of {X_1, ..., X_l} ∪ {Z_1, ..., Z_l} with |K| ≤ k ≡ l for which R²_K ≥ r ≡ 1 iff S has an exact cover from 𝒮.

First assume that such a cover exists. That is, assume that there is a set J ⊂ {1, ..., l} such that {S_j}_{j∈J} constitutes a partition of S. This means that Σ_{j∈J} 1_{S_j} = 1_S, where 1_A is the indicator function of a set A. Let α be the intercept, (β_j)_{j≤l} the coefficients of (X_j)_{j≤l}, and (γ_j)_{j≤l} those of (Z_j)_{j≤l} in the regression. Set α = 0. For j ∈ J, set β_j = 1 and γ_j = 0, and for j ∉ J set β_j = 0 and γ_j = 1. We claim that α1 + Σ_{j≤l} β_j X_j + Σ_{j≤l} γ_j Z_j = Y, where 1 is a vector of 1's. For i ≤ s the equality

α + Σ_{j≤l} β_j x_ij + Σ_{j≤l} γ_j z_ij = Σ_{j≤l} β_j x_ij = y_i = 1

follows from Σ_{j∈J} 1_{S_j} = 1_S. For s < i ≤ s + l, say i = s + j, the equality

α + Σ_{j≤l} β_j x_ij + Σ_{j≤l} γ_j z_ij = β_j + γ_j = y_i = 1

follows from our construction (precisely one of {β_j, γ_j} is assigned 1 and the other 0). Obviously, α + Σ_{j≤l} β_j x_nj + Σ_{j≤l} γ_j z_nj = 0 = y_n.

¹¹ In terms of our formal model, the variables may well be defined by these vectors to begin with. However, in the context of statistical sampling, the variables are defined in a probabilistic model, and identifying them with the corresponding vectors of observations constitutes an abuse of notation.
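The forward direction just established can be checked mechanically on a small instance. The Python sketch below (helper names are ours; indices are 0-based in the code, unlike the 1-based indices of the proof) builds the vectors X, Z, Y of the reduction and verifies that the coefficients α = 0, β_j = 1 for j ∈ J, γ_j = 1 for j ∉ J reproduce Y exactly, i.e., yield R²_K = 1, precisely when {S_j}_{j∈J} partitions S.

```python
def build_regression_instance(s, subsets):
    """Build X, Z, Y of the reduction: n = s + l + 1 observations,
    m = 2l predicting variables.  s: |S| for S = {1, ..., s};
    subsets: S_1, ..., S_l (0-indexed here)."""
    l = len(subsets)
    n = s + l + 1
    x = [[0] * l for _ in range(n)]
    z = [[0] * l for _ in range(n)]
    for i in range(s):                    # observations i <= s
        for j in range(l):
            x[i][j] = 1 if (i + 1) in subsets[j] else 0
    for j in range(l):                    # observations s < i <= s + l
        x[s + j][j] = z[s + j][j] = 1
    y = [1] * (s + l) + [0]               # y_i = 1 for i <= s + l, y_n = 0
    return x, z, y

def fits_perfectly(x, z, y, J):
    """With alpha = 0, beta_j = 1 for j in J, gamma_j = 1 for j not in J,
    check whether the regression reproduces Y exactly (R^2_K = 1)."""
    l = len(x[0])
    beta = [1 if j in J else 0 for j in range(l)]
    gamma = [0 if j in J else 1 for j in range(l)]
    pred = [sum(b * xi for b, xi in zip(beta, rx)) +
            sum(g * zi for g, zi in zip(gamma, rz))
            for rx, rz in zip(x, z)]
    return pred == y
```

For S = {1, 2, 3, 4} and 𝒮 = {{1, 2}, {3, 4}, {2, 3}}, the partition J = {0, 1} (that is, {1, 2} and {3, 4}) fits perfectly, while the non-partition J = {0, 2} does not: element 2 is covered twice, so the corresponding predicted value is 2 rather than 1.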
The number of variables used in this regression is l. Specifically, choose K = {X_j | j ∈ J} ∪ {Z_j | j ∉ J}, with |K| = l, and observe that R²_K = 1.

We now turn to the converse direction. Assume, then, that there is a subset K of {X_1, ..., X_l} ∪ {Z_1, ..., Z_l} with |K| ≤ l for which R²_K = 1. Equality for observation n implies that this regression has an intercept of zero (α = 0 in the notation above). Let J ⊂ {1, ..., l} be the set of indices of the X variables in K, i.e., {X_j}_{j∈J} = K ∩ {X_1, ..., X_l}. We will show that {S_j}_{j∈J} constitutes a partition of S. Let L ⊂ {1, ..., l} be the set of indices of the Z variables in K, i.e., {Z_j}_{j∈L} = K ∩ {Z_1, ..., Z_l}. Consider the coefficients of the variables in K used in the regression obtaining R²_K = 1. Denote them by (β_j)_{j∈J} and (γ_j)_{j∈L}. Define β_j = 0 if j ∉ J and γ_j = 0 if j ∉ L. Thus, we have Σ_{j≤l} β_j X_j + Σ_{j≤l} γ_j Z_j = Y.

We argue that β_j = 1 for every j ∈ J and γ_j = 1 for every j ∈ L. To see this, observe first that for every j ≤ l, observation s + j implies that β_j + γ_j = 1. This means that for every j ≤ l, β_j ≠ 0 or γ_j ≠ 0 (and hence that either j ∈ J or j ∈ L). If for some j both β_j ≠ 0 and γ_j ≠ 0, we would have |K| > l, a contradiction. Hence for every j ≤ l either β_j ≠ 0 or γ_j ≠ 0, but not both. (In other words, J = Lᶜ.) This also implies that the non-zero coefficient out of {β_j, γ_j} has to be 1. Thus the cardinality of K is precisely l, and the coefficients {β_j, γ_j} define a subset of {S_1, ..., S_l}: if β_j = 1 and γ_j = 0, i.e., j ∈ J, then S_j is included in the subset, and if β_j = 0 and γ_j = 1, i.e., j ∉ J, then S_j is not included in the subset. That this subset {S_j}_{j∈J} constitutes a partition of S follows from the first s observations as above.

To conclude the proof, it remains to observe that the construction of the variables (X_j)_{j≤l}, (Z_j)_{j≤l}, and Y can be done in polynomial time in the size of the input. □
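The NP-membership observation at the start of the proof — that R²_K is easy to compute once a candidate K is suggested — can be made concrete. The sketch below (ours, not the paper's) computes R²_K for the ordinary least squares regression of Y on the columns in K, with an intercept, by solving the normal equations (X'X)b = X'Y via Gaussian elimination; everything runs in time polynomial in the input. It assumes the chosen columns, together with the intercept, are linearly independent (so X'X is nonsingular).

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes A is nonsingular (independent regressors)."""
    p = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                for c in range(col, p + 1):
                    M[r][c] -= f * M[col][c]
    return [M[r][p] / M[r][r] for r in range(p)]

def r_squared(cols, y):
    """R^2_K of the OLS regression of y on the given columns, with an
    intercept, via the normal equations (X'X) b = X'y.  Polynomial time:
    this is the certificate check behind MINIMAL REGRESSION being in NP."""
    n = len(y)
    X = [[1.0] + [float(col[i]) for col in cols] for i in range(n)]
    p = len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         for r in range(p)]
    rhs = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    beta = solve(A, rhs)
    yhat = [sum(b * xi for b, xi in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ssr / sst
```

For instance, regressing y = (1, 3, 5, 7) on x = (0, 1, 2, 3) recovers the exact relation y = 1 + 2x and hence R² = 1, while a noisy y gives R² strictly between 0 and 1. The hardness, as the proof shows, lies entirely in choosing which small K to test, not in testing it.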
References

[1] Carnap, R. (1950), Logical Foundations of Probability. Chicago: University of Chicago Press.

[2] Garey, M. R. and D. S. Johnson (1979), Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: W. H. Freeman and Co.

[3] Gilboa, I. and D. Schmeidler (2001), A Theory of Case-Based Decisions, manuscript.

[4] Wittgenstein, L. (1922), Tractatus Logico-Philosophicus. London: Routledge and Kegan Paul; fifth impression, 1951.