Beyond Prediction: Directions for Probabilistic and Relational Learning

David D. Jensen
Department of Computer Science, University of Massachusetts Amherst, 140 Governors Drive, Amherst, Massachusetts 01003, USA
[email protected]

Abstract. Research over the past several decades in learning logical and probabilistic models has greatly increased the range of phenomena that machine learning can address. Recent work has extended these boundaries even further by unifying these two powerful learning frameworks. However, new frontiers await. Current techniques are capable of learning only a subset of the knowledge needed by practitioners in important domains, and further unification of probabilistic and logical learning offers a unique ability to produce the full range of knowledge needed in a wide range of applications.

Keywords: Statistical relational learning, causal inference, quasi-experimental design.

1 Introduction

The past decade has produced substantial progress in unifying methods for learning logical and probabilistic models. We now have a large array of representations, algorithms, performance characterizations, and applications that bring together these two keys to effective learning and reasoning. Much of this work has sought to preserve key attributes of existing learning algorithms (e.g., efficiency) while extending the expressiveness of the representations that can be learned. However, combining probabilistic and logical techniques may open up entirely new capabilities that should not be ignored.

Specifically, these recent advances in machine learning for relational data sets have revealed a surprising new opportunity to learn causal models of complex systems. The opportunity is a potentially deep and unexploited technical interaction between two previously unconnected areas: (1) work in statistical relational learning; and (2) work on quasi-experimental design in the social sciences. The new data representations conceived and exploited recently by researchers in statistical relational learning (SRL) may provide all the information needed to automatically apply powerful statistical techniques from the social sciences known as quasi-experimental design (QED). QEDs allow a researcher to exploit unique characteristics of sub-populations of data to make strong inferences about cause-and-effect dependencies that would otherwise be undetectable. Such causal inferences indicate whether manipulating one variable will affect the value of another variable, and they can be made from non-experimental data.

To date, QEDs have been painstakingly applied by social scientists in an entirely manual way. However, data representations from SRL that record relations (organizational, temporal, spatial, and others) could facilitate automatic application of QEDs. Constructing methods that automatically identify sub-populations of data that meet the requirements of specific QEDs would enable strong and automatic causal inferences from non-experimental data. This fusion of work in SRL and QED would lead to: (1) large increases in the percentage of causal dependencies that can be accurately inferred from non-experimental data; (2) substantial reductions in the amount of data needed to discover causal dependencies that can already be inferred; and (3) reductions in the computational complexity of causal learning algorithms.

If exploited, this capability could substantially improve the ability of researchers to construct causal models of large and complicated systems (e.g., social systems, organizations, and computer systems). Such models would be a significant improvement over existing models learned by statistical and machine learning techniques, the vast majority of which are non-causal (and thus do not allow analysts to correctly infer the effects of potential actions) or only weakly causal (because many of the potential causal dependencies cannot be correctly inferred).

2 Why Causal Models Are Useful

Nearly all algorithms for machine learning analyze data to identify statistical associations among variables. That is, they identify variables of some entity (e.g., a patient's occupation, recent physical contacts, and symptoms) that are statistically associated with other variables (e.g., a disease). Such associations are useful for making predictions about the values of unobserved variables based on the values of variables that can be observed. For example, a doctor could predict whether a patient has a particular disease (an unobserved variable) based on a set of observed symptoms.

Such associational models can be useful in many situations. For example, associational models constructed by machine learning algorithms now sit at the heart of most state-of-the-art systems for machine translation, speech understanding, computer vision, information extraction, and information retrieval. In all of these cases, associations among variables alone are sufficient to meet the goals of the deployed system.

However, machine learning algorithms are often deployed in the hope that they will support decisions about which actions, or interventions, to make in a given situation. In the case of medical diagnosis, most medical professionals do not simply want to diagnose disease; they also want to prevent, treat, or mitigate the effects of the disease. They want to know what effect a particular intervention (e.g., implementation of a public health measure or widespread administration of a drug) will have on the health of a population. In such situations, practitioners want models that help them design effective interventions, and this requires modeling causality, not merely statistical association.


Remarkably, most existing probabilistic models are practically useless for designing effective interventions because they identify only statistical associations, not causal dependencies. As is emphasized in nearly all introductory statistics courses, correlation is not causation — statistical association between two variables does not necessarily imply that one causes the other. For example, suppose we gathered a sample of patients and measured a variety of variables about each patient, including their history of smoking and their incidence of lung cancer (see footnote 1). If we analyzed the data, we would very likely find statistical associations between several of these variables.

Fig. 1. Three causal models that produce statistical association: (a) A causes B; (b) B causes A; and (c) a common cause C causes both A and B

However, the association between any two variables A and B could result from any of the three causal situations shown in Figure 1. If A and B are associated, then A could cause B (smoking causes lung cancer), B could cause A (a predisposition to nicotine addiction causes smoking), or a third variable C could cause both A and B (genetics could cause both a predisposition to nicotine addiction and a predisposition to lung cancer) (see footnote 2). If a purely associational model is constructed from data and then used to support the design of interventions under the assumption that A causes B, either of the latter two cases could cause the resulting interventions to be ineffective (or even counterproductive) (see footnote 3).

Footnote 1: This example is inspired by a similar example from Pearl [1].

Footnote 2: While the third explanation has been almost entirely discredited through careful experimental and epidemiological analysis, it was at one time a viable causal explanation for the observed association between smoking and lung cancer.

Footnote 3: Limited situations involving causal inference among a few variables, such as the case of smoking, genetics, and cancer above, are relatively simple to understand (though not necessarily to analyze). In practice, however, causal models can involve complex associations among dozens or hundreds of variables. Many such cases have long surpassed the ability of human analysts to evaluate without the aid of computational tools, and recent work in relational learning (e.g., [2]) is greatly increasing this complexity by enabling the construction of probabilistic models that analyze associations among variables on vast numbers of inter-related entities. The sections below cover this in substantially more detail.
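To make the danger concrete, here is a minimal simulation sketch (assuming only numpy; the coefficients are invented for illustration) of structure (c) from Figure 1: a hidden common cause C drives both A and B, so A and B are strongly correlated even though intervening on A leaves B unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Structure (c) from Figure 1: a common cause C drives both A and B.
c = rng.normal(size=n)
a = 2.0 * c + rng.normal(size=n)    # A depends only on C
b = -1.5 * c + rng.normal(size=n)   # B depends only on C, never on A

# A and B are strongly associated in observational data...
print("observational corr(A, B):", np.corrcoef(a, b)[0, 1])

# ...but under an intervention do(A) that sets A independently of C
# (breaking the C -> A edge), the association with B vanishes,
# because A has no causal effect on B.
a_do = rng.normal(size=n)
b_new = -1.5 * c + rng.normal(size=n)
print("interventional corr(do(A), B):", np.corrcoef(a_do, b_new)[0, 1])
```

An associational model fit to the first regime would predict B well from A, yet an intervention designed around that model would accomplish nothing.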


In contrast, if accurate causal models could be constructed, they would be useful to a wide range of users within science, government, and business. Some of the most obvious uses for causal models are organizational analysis (e.g., modeling the internal dynamics of organizations and the effects of organizational policies), computer security (e.g., modeling normal and abnormal user behavior in response to system changes and external events), and system design and evaluation (e.g., modeling system behavior to diagnose and improve performance). In addition, many societal goals and functions could be informed by the construction and use of specific causal models, including education and training (e.g., helping to design new educational programs and modify existing ones to maximize effectiveness), logistics (e.g., understanding effects on the supply chain to improve on-time delivery of goods and services), and healthcare (e.g., understanding the causal factors underlying infectious diseases and physical injuries).

3 Example: Fraud Detection at the NASD

One concrete example of the potential opportunities of causal models can be drawn from recent work of the Knowledge Discovery Laboratory (KDL) at the University of Massachusetts Amherst. KDL has been working for several years with the National Association of Securities Dealers (NASD) to analyze large amounts of data on the structure of the US securities industry and the behavior of individuals within that industry ([3], [4], [5]). While the effort to date has produced exclusively associational models, the setting and existing work help make clear the potential applications for causal models.

3.1 NASD

NASD is the primary private-sector regulator of the securities industry in the United States (see footnote 4). It is currently responsible for overseeing the activities of more than 5,000 brokerage firms, 170,000 branch offices, and 659,000 registered individuals. Since 1939, NASD has worked under the oversight of the U.S. Securities and Exchange Commission (SEC) to regulate all securities firms (called broker-dealers) that conduct business with the public. Currently, NASD employs a staff of over 2,500 employees situated in offices across the country and has an annual operating budget of more than $500 million.

One of NASD's primary aims is to prevent and discover securities fraud and other forms of misconduct by member firms and their employees, called registered representatives, or reps. With over 659,000 reps currently employed, it is imperative for NASD to direct its limited regulatory resources toward the parties most likely to engage in risky behavior in the future. It is generally believed by experts at NASD and elsewhere that fraud happens among small, interconnected groups of individuals. Given the large number of potential interactions between reps, timely identification of persons and groups of interest is a significant challenge for NASD regulators. The ongoing work is a joint effort between researchers in KDL and staff at NASD to identify effective, automated methods for detecting these high-risk entities to aid NASD in its regulatory efforts.

Footnote 4: NASD has recently changed its name to FINRA (the Financial Industry Regulatory Authority), but we retain the older name here for consistency with earlier papers.


3.2 Available Data

The primary data employed in this effort is drawn from NASD's Central Registration Depository® (CRD), a collection of records representing all federally registered firms and individuals, including those registered by the SEC, NASD, the states, and other authorized regulators such as the New York Stock Exchange. This depository includes key pieces of regulatory data, such as ownership and business location for firms and employment histories for individuals. Although the information in CRD is entirely self-reported, errors, inaccuracies, or missing reports can trigger regulatory action by NASD. Since 1981, when the CRD was first created, records on around 3.4 million individuals, 360,000 branches, and 25,000 firms have been added to the database.

Some of the most critical pieces of information for NASD's regulatory mission are records of disciplinary actions, typically called disclosures, filed on particular reps. A disclosure can represent any non-compliant action, including regulatory, criminal, or other civil judicial action. Disclosures can also record customer complaints, termination agreements between firms and individual reps, and financial hazards an individual rep experiences, such as bankruptcies, bond denials, and liens. The disclosure information found in the CRD is one of the primary sources of data on past behavior that NASD uses to assess future risk and to focus its regulatory examinations to greatest effect.

Fig. 2. Entity-relationship diagram for the NASD data

KDL's analysis utilizes data from the CRD about firms, branches, reps, and disclosures. The complete entity-relationship diagram is shown in Figure 2, along with counts for each entity appearing in our view of the database.
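As a purely illustrative rendering of this schema, the sketch below expresses the Figure 2 entities as Python dataclasses. All class and field names are hypothetical stand-ins, not actual CRD column names; the employment relation carries dates to reflect the temporal character of the data.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Disclosure:
    kind: str                   # e.g., "regulatory", "criminal", "customer complaint"
    filed_on: date

@dataclass
class Rep:
    rep_id: str
    disclosures: List[Disclosure] = field(default_factory=list)

@dataclass
class Employment:               # a rep works at a branch over an interval
    rep: Rep
    start: date
    end: Optional[date] = None  # None means still employed

@dataclass
class Branch:
    branch_id: str
    employments: List[Employment] = field(default_factory=list)

@dataclass
class Firm:
    firm_id: str
    branches: List[Branch] = field(default_factory=list)
```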


Some of the key aspects of the data set are that it is reasonably large-scale (containing millions of entities), relational and heterogeneous (containing entities and relations of multiple types), and dynamic (representing changes over time in the existence and attributes of entities and relations). In addition, many of the most important statistical associations present in the data are probabilistic.

3.3 The Potential for Causal Models

To date, our analysis of the CRD data has produced several highly useful models that are purely associational. For example, we have constructed statistical models that identify strong associations between the attributes of reps, the branches and firms at which they have worked, and the other reps who work at those organizations ([3], [4]). An extensive double-blind evaluation of these models by NASD analysts showed that they perform well overall and that they perform equivalently to NASD's current expert-derived rules for prioritizing the attention of field examiners [3]. These statistical models could be used in place of, or as an augmentation to, NASD's current methods for identifying high-risk reps and branches.

While highly useful, prioritizing examiner attention affects a relatively small part of NASD's overall regulatory mission. In contrast, causal models could help provide answers to an entirely different class of questions, and they could inform much higher-level decisions. For example, causal models of the securities industry could help inform NASD's approach to these key decisions:

• Prioritizing new programs — Given limited resources to devote to a particular region, should NASD place more emphasis on enforcement (e.g., increasing the rate of examinations) or education (e.g., seminars on complying with NASD and SEC rules)?

• Making closure decisions — Given that the reps and management of a particular branch have been proven to engage in misconduct, will closing that branch reduce future misconduct (because reps have been deprived of a permissive environment) or increase it (because reps will disperse to other branches and firms, making monitoring more difficult and encouraging questionable behavior at their new organizations)?

• Responding to economic conditions — Given a change in a state's economic picture, how are firms in that state likely to respond? How will the changing economic picture interact with potential NASD regulatory and education programs in that region to affect the number, composition, and internal regulatory programs of firms in the state?

An accurate answer to each of these questions virtually requires a correct causal model.

4 Current Practice

Given that causal models are so useful, it is not surprising that methods have been developed to learn them from data. At least three classes of statistical techniques have been developed, and they differ both in the types of data to which they can be applied and in the degree to which they can be applied automatically.


One class of techniques, often grouped under the name "experimental design," is useful when analysts have a high degree of control over the situation that produces data. Another class of techniques, which we will call "joint modeling," combines automated methods for estimating joint probability distributions with a small number of assumptions in order to infer causality from non-experimental, or observational, data. A final class of techniques, here grouped under the term "quasi-experimental design," seeks to identify situations where observational data meet the assumptions of the techniques from experimental design, allowing causal inferences from observational data.

These three classes of methods face a common set of challenges. First, the techniques must identify whether a statistical association exists between a pair of variables. The basic principles for reliably inferring statistical association — statistical hypothesis testing — have been known for decades, and this step poses relatively little challenge to most manual and algorithmic methods. Second, the techniques must identify the direction of potential causation. This step is most commonly resolved through the use of time (causation is assumed to always run forward in time), but joint modeling can also sometimes resolve this issue in other ways (see below). Finally, the techniques must eliminate the effects of all other potential common causes of the variables, to be certain that the observed association is not just a by-product of other causal influences. As we will see below, each class of methods approaches this final challenge in a unique way.

4.1 Experimental Design

Probably the most commonly used method in the world today for discovering valid causal knowledge is the controlled randomized experiment. The rapid expansion of knowledge in the biological, physical, and social sciences over the past 50 years is due in no small part to the available "meta-knowledge" about how to design experiments and analyze their results. Indeed, the discovery, codification, and widespread adoption of methods for experimental design represent one of the major intellectual achievements of the past century.

As the name implies ("controlled randomized experiment"), this set of innovations relies on two key concepts: control and randomization (see footnote 5). Control refers generally to the ability of an investigator to intentionally set alternative values of some variable, and to compare the effects of those alternative settings (see footnote 6).

Footnote 5: Here, we omit mention of several other key concepts from experimental design, including replication, orthogonality, and factorial experiments, which are less relevant to the present discussion.

Footnote 6: This use of the term "control" is at slight variance with how the term is usually applied in the literature on experimental design. "Control" in this context is often used to refer to a set of subjects that do not receive a treatment (a "control group") as opposed to another group that does receive a treatment. Here, however, we use the term in its broader historical meaning to refer to the ability to set the values of any of a large set of variables so as to keep them constant or to systematically vary their values. This sort of control is so foundational to experimental design that many treatments of the topic omit clear mention of it.


Control lies at the heart of what is typically meant by an "experiment," and the concept is quite old, stretching back at least as far as John Stuart Mill (1843), perhaps as far as Hume a century earlier, and perhaps to Bacon a century before that [6]. By controlling variables in an experiment, an investigator can either factor out their effects by holding their values constant or study their effects by systematically varying their values. To do this, however, an investigator must know of the existence of a particular variable and be able to affect its value.

"Randomization" refers to the approach of randomly assigning subjects (e.g., patients in a medical experiment) to treatment groups so that variation in their characteristics that cannot be controlled by the investigator (genetic traits, the effects of unknown medical conditions, etc.) cannot systematically affect the variables being studied. If randomization is employed, the effects of these uncontrolled characteristics will average out across a sufficiently large group. The principles of randomization, and their application to experimental design, were outlined by R.A. Fisher in the 1920s [7], and randomization has been a staple of experimental design ever since. What is remarkable about randomization is that it can remove the effect of variables that are unknown to the investigator, as long as their values are tied to subjects who can be randomly assigned to treatment groups. For example, an investigator does not have to know which specific genetic factors might influence a patient's response to a particular drug, as long as patients (and thus their genetic factors) are randomly assigned to treatment groups.

Investigators who study phenomena in an experimental setting typically control the variables that they can, either systematically altering their values or holding them constant, and randomize most or all of the remaining variables. With these two methods, they can study the effects of the variables whose values they can directly manipulate and successfully factor out nearly all other potential causes.
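The following simulation sketch (numpy only; the effect sizes are invented for illustration) shows why randomization works: a hidden subject trait biases the naive observational comparison, but a coin-flip assignment averages the trait out across groups even though it is never measured.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# An unobserved subject trait (say, an unknown genetic factor) that both
# raises the chance of receiving the treatment "in the wild" and improves
# the outcome on its own. The true causal effect of treatment is +1.0.
trait = rng.normal(size=n)

# Observational regime: uptake of the treatment depends on the hidden trait.
treated_obs = (trait + rng.normal(size=n)) > 0
y_obs = 1.0 * treated_obs + 2.0 * trait + rng.normal(size=n)
naive = y_obs[treated_obs].mean() - y_obs[~treated_obs].mean()

# Randomized regime: assignment is a coin flip, so the unknown trait
# averages out across groups even though it is never measured.
treated_rct = rng.random(n) < 0.5
y_rct = 1.0 * treated_rct + 2.0 * trait + rng.normal(size=n)
estimate = y_rct[treated_rct].mean() - y_rct[~treated_rct].mean()

print(f"naive observational estimate: {naive:.2f}")    # badly inflated
print(f"randomized estimate:          {estimate:.2f}") # close to +1.0
```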

4.2 Joint Modeling

Methods from experimental design assume that the investigator has the ability to control the values of variables and to randomize the assignment of subjects to treatment groups. However, in many cases, investigators can directly manipulate neither variable values nor subject assignment. Either such manipulation is impossible in practice (e.g., it is essentially impossible to set the value of inflation when studying the effect of economic conditions) or it would be unethical (e.g., though possible in practice, it would be unethical to force subjects to smoke or to work for specific securities firms). Instead, investigators want to draw causal inferences from observational data that were gathered without direct manipulation by investigators (see footnote 7). Such observational data sets are increasing in both size and number.

Footnote 7: In some cases, a small degree of experimental control or randomization is available to investigators, but they want to maximize the information-gathering utility of such interventions. In these cases, too, we would like to make use of all possible techniques to exploit the observational portions of the data.


These data sets are gathered almost incidentally from a variety of automated systems for accounting, auditing, regulation, inventory control, publishing, information retrieval, and communication, and they are often massive in both size and scope. In some cases, the data sets represent an entire population rather than only a sample from it. For example, NASD's CRD database represents essentially every rep operating in the US over several decades. Investigators have sought to exploit this potential wealth of observational data to infer causal models.

Lacking the ability to assure control and randomization, researchers have devised another way to eliminate potential common causes of associated variables: joint modeling. This has been a small but active topic of research for the past several decades in statistics (e.g., [8], [9]), philosophy (e.g., [10]), and computer science (e.g., [1]). The key concept of joint modeling is that, rather than controlling a particular variable, one can estimate the effect of the uncontrolled variable and adjust for that effect mathematically. In this way, modeling provides the equivalent of control. If the investigator knows about a potential common cause and can model its effect, then he or she can use that model to factor out the effect of that potential cause.

In all but the simplest cases, this approach requires modeling the joint probability distribution of a set of potential causes, and much of the work in this area has been done using structure learning algorithms for Bayesian networks, which estimate exactly such joint probability distributions. Using a fairly conventional structure learning algorithm, and making relatively few additional assumptions, the resulting belief network can be considered to encode causal dependencies (see footnote 8). Furthermore, such modeling can be done automatically, once the variables of interest are selected, and the algorithm has the potential to identify all the causal dependencies that hold among that set of variables.
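A minimal sketch of "modeling as control" (numpy only; coefficients invented): with a measured common cause Z, regressing the outcome on both the candidate cause and Z recovers the causal effect that the unadjusted regression overstates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Measured confounder Z causes both treatment X and outcome Y.
# True causal effect of X on Y: +1.0.
z = rng.normal(size=n)
x = 1.5 * z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

# Unadjusted estimate: regress Y on X alone (biased by Z).
slope_unadj = np.polyfit(x, y, 1)[0]

# Adjusted estimate: include Z in the regression, the statistical
# analogue of holding Z constant in an experiment.
coeffs, *_ = np.linalg.lstsq(np.column_stack([x, z, np.ones(n)]), y, rcond=None)
slope_adj = coeffs[0]

print(f"unadjusted: {slope_unadj:.2f}")  # about 1.92, confounded by Z
print(f"adjusted:   {slope_adj:.2f}")    # about 1.00, effect of X with Z held fixed
```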

Fig. 3. Three graphical models with partial shared structure

In many cases, structure learning algorithms can identify a set of model structures that have equivalent estimated likelihood and that collectively account for the vast majority of probability density. If all of these network structures specify that a dependence between two variables exists and runs in a single direction, then it is reasonable to identify that dependence as causal: statistical association exists, it runs in a specific direction, and all other potential causes have been accounted for by modeling. For example, Figure 3 shows three network structures that differ in some edges but share a common pair of dependencies showing that C depends on A and B.

Footnote 8: The additional assumptions include faithfulness, which specifies that the causal effects of different variables do not perfectly cancel out and thereby make statistical dependence impossible to detect, and completeness, which specifies that all potential common causes of any pair of variables included in the model are also included in the model [10].


4.3 Quasi-experimental Design

A final class of statistical methods is routinely used to support causal inferences. These methods, grouped under the rubric "quasi-experimental design" (QED), attempt to exploit inherent characteristics of observational data sets that partially emulate the control and randomization possible in an experimental setting [11, 12] (see footnote 9). QEDs clearly do not always have the internal validity of traditional experimental designs, but they can be applied to the much wider array of data sets that modern data collection practices have made available, and the size and scope of those data sets can partially or completely compensate for the deficiencies that arise from the lack of experimental control. Indeed, there are a wide variety of situations where causality can be explored in no other way.

In the absence of explicit control and randomization, some QEDs employ case matching to identify pairs of data instances that are as similar as possible in all respects except for the variable under investigation (the non-equivalent group design). Other QEDs examine how the value of a given variable on the same data instances changes over time, typically before and after some specific event (the regression-discontinuity design). Other types of quasi-experimental designs that have been devised include the proxy pretest design, double pretest design, non-equivalent dependent variables design, pattern matching design, and the regression point displacement design.

A particularly salient example of quasi-experimental design is the classical twin study, a design that has been employed for decades to study the causal factors for particular diseases and conditions. Twin studies compare the incidence of disease in sets of monozygotic (identical) and dizygotic (fraternal) twins. Monozygotic twins share identical genetics, a common fetal environment, and (typically) a common postnatal environment. The same is true for dizygotic twins, except that they are only genetically similar rather than genetically identical. This remarkable degree of shared background, as well as the specific difference in the shared background between the two types of twins, provides a nearly ideal setting for studying the effect of genetics on disease. For example, to identify the degree to which a given condition is due to genetic factors, investigators can determine the correlation in the condition among pairs of each type of twin, and then compare the correlation between the two types.

Footnote 9: Here we use a relatively expansive definition of QEDs, including some methods that are not typically covered in textbooks on QED, but which fit well within the general framework of this class of techniques. In particular, we include here methods from hierarchical and multi-level modeling.


A large difference between the two correlations indicates that a large portion of the variation in the condition is due to genetics, whereas no difference indicates that the condition is due to other factors. Figure 4 summarizes a few results from twin studies of various conditions.

Fig. 4. Example results from twin studies, drawn from a recent review [13]. Darkest bars indicate effects due to genetics, dark bars to shared environment, and light bars to unique environment.
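To see how the comparison works quantitatively, here is a small simulation sketch (numpy only; the variance components h2 = 0.6 and c2 = 0.2 are invented) that generates twin-pair traits and recovers heritability with Falconer's classic formula, 2 * (r_MZ - r_DZ).

```python
import numpy as np

rng = np.random.default_rng(4)
n_pairs = 5_000

def twin_pairs(genetic_corr, h2=0.6, c2=0.2):
    """Simulate trait values for twin pairs with additive genetic (h2),
    shared-environment (c2), and unique-environment variance components."""
    shared_g = rng.normal(size=n_pairs)
    g1 = np.sqrt(genetic_corr) * shared_g + np.sqrt(1 - genetic_corr) * rng.normal(size=n_pairs)
    g2 = np.sqrt(genetic_corr) * shared_g + np.sqrt(1 - genetic_corr) * rng.normal(size=n_pairs)
    env = rng.normal(size=n_pairs)      # environment shared by the pair
    e2 = 1.0 - h2 - c2                  # unique environment
    t1 = np.sqrt(h2) * g1 + np.sqrt(c2) * env + np.sqrt(e2) * rng.normal(size=n_pairs)
    t2 = np.sqrt(h2) * g2 + np.sqrt(c2) * env + np.sqrt(e2) * rng.normal(size=n_pairs)
    return t1, t2

r_mz = np.corrcoef(*twin_pairs(genetic_corr=1.0))[0, 1]   # identical genomes
r_dz = np.corrcoef(*twin_pairs(genetic_corr=0.5))[0, 1]   # half shared, on average

print(f"r_MZ = {r_mz:.2f}, r_DZ = {r_dz:.2f}")
# Falconer's formula recovers the heritability used in the simulation (0.6),
# without ever identifying which "genes" are responsible.
print(f"estimated heritability 2*(r_MZ - r_DZ) = {2 * (r_mz - r_dz):.2f}")
```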

Two factors are remarkable about the efficacy of twin studies. First, they allow the quantitative impact of genetics to be determined even though investigators may have no idea which specific genes are involved. That is, investigators can determine the degree to which some variable on a particular entity (the genotype) affects the observed condition without being able to define or measure that specific variable. Second, they can perform this analysis by studying only a tiny fraction of an entire population. Indeed, without access to a very large population, it would be virtually impossible to gain access to a sufficient number of pairs of monozygotic and dizygotic twins. We will return to both of these factors below.

It is also important to note that the validity of twin studies relies on at least three pieces of information known to investigators but not (typically) represented explicitly in the data used for QED studies. First, twins occur relatively randomly in the population. If identical twins were much more likely to be born to parents with particular genetic traits or who lived in particular environments, then those factors would confound efforts to use twins to study the effects of genetics on physical conditions. Second, genetic makeup is established temporally prior to the onset of diseases and other conditions. Thus, we know the direction of causality without having to determine it from data. Finally, and perhaps most importantly, we know that genotype (genetic sequence) and phenotype (physical condition) can be treated as related but separate entities. This means that individuals can have identical genotypes but not identical phenotypes. Thus, the relational representation shown in Figure 5 can be used to represent the data, where monozygotic twins share a common genotype and dizygotic twins do not. This relational representation underlies the investigators' inference that, if the two types of twins do not differ significantly in the correlation of their conditions, then all possible variables on the genotype can be removed as potential causal factors. Again, we will return to all three of these factors in the discussion below.


Fig. 5. Graphical models representing monozygotic (a) and dizygotic (b) twins. Circles represent variables, boxes represent entities, and solid and dashed arrows represent known and potential dependencies, respectively.

5 Limitations of Current Approaches

The availability of three different classes of methods might imply that the existing technical infrastructure for causal inference is sufficient. However, an interacting set of limitations severely restricts the applicability and effectiveness of these methods.

• Insufficient power — The only existing methods that are automatic (those from joint modeling) are not capable of discovering all important causal dependencies. The ability of joint modeling methods to discover causal dependencies led to a great deal of initial enthusiasm for these methods in the machine learning community. However, as experience has accumulated, it has become clear that these methods often cannot distinguish among a large equivalence class of models that encode the same joint probability distribution (a sketch of this problem appears at the end of this section). In such cases, joint modeling alone cannot distinguish among model structures with different edges, cannot resolve the direction of key edges, or both.


• Computational intractability — Attempting to learn causal models in non-trivial domains often leads to intractable computational problems, due both to the need to search large spaces of potential models (vastly increasing the number of models that must be evaluated) and to the need to evaluate those models against very large data sets (increasing the effort required to evaluate each model).

• Manual application — In contrast to joint modeling methods, quasi-experimental designs must be applied manually. They are not an integral part of automated algorithms for machine learning and knowledge discovery. QEDs are currently used as a way for investigators to manually structure data analysis tasks and to select among different statistical tests, rather than as a component of a machine learning algorithm. This limitation is understandable, given that the applicability of specific QEDs requires knowledge that has not, until quite recently, been explicitly represented in the data provided to machine learning algorithms (relations among entities, temporal information, etc.). However, that situation has changed dramatically over the past decade.

• Non-relational models — None of the approaches supports construction of joint models of relational domains (domains consisting of interacting entities). The techniques either assume that the domain can be accurately described as a set of independent entities of a single type (in the case of joint modeling), or they do not provide the means to construct a joint probability model (in the case of experimental and quasi-experimental design) (see footnote 10).

These limitations prevent application of causal modeling to the majority of domains of realistic interest, including the NASD tasks outlined above.

Footnote 10: It is, in theory, possible for quasi-experimental designs to inform the construction of relational models (indeed, that is one of the primary ideas underlying this proposal). However, this would require something that has not been done to date: combining QEDs with models capable of representing joint probability distributions over relational data.
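The equivalence-class problem named in the first limitation above can be made concrete with a minimal sketch (numpy only; distributions invented for illustration): data generated from the structure A -> B are fit exactly as well by the reversed structure B -> A, so likelihood alone cannot orient the edge.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Sample binary data from the "true" structure A -> B.
a = rng.random(n) < 0.3
b = np.where(a, rng.random(n) < 0.8, rng.random(n) < 0.2)

def loglik(parent, child):
    """Maximized log-likelihood of the two-variable structure parent -> child."""
    p_parent = np.where(parent, parent.mean(), 1.0 - parent.mean())
    p_child = np.empty(len(child))
    for v in (False, True):                       # P(child | parent = v)
        rate = child[parent == v].mean()
        p_child[parent == v] = np.where(child[parent == v], rate, 1.0 - rate)
    return np.log(p_parent).sum() + np.log(p_child).sum()

# Markov equivalence: both orientations fit the data equally well, so
# likelihood alone cannot tell us which way the edge points.
print("A -> B:", loglik(a, b))
print("B -> A:", loglik(b, a))
```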

6 Research Opportunities

This raises an intriguing target for future machine learning research: developing methods that automatically apply quasi-experimental designs to large relational data sets, and seamlessly integrating these methods for causal inference with methods from joint modeling. We conjecture that the combination of these two classes of methods will produce a large increase in the number of causal dependencies that can be learned, a dramatic decrease in the amount of data needed to learn those dependencies, and a significant decrease in the computational complexity of causal inference. Such research could address three key objectives:

• Select and enhance a data representation and modeling framework — Researchers in statistical relational learning (SRL) and the larger machine learning community have recently produced a remarkable array of new languages for representing both models and data. Among the representational enhancements are three key types of information needed by QEDs: relations, time, and assignment mechanisms (see footnote 11).


While at least one existing language can represent each of these types of information, there is not a clear choice among the set of existing languages, and further investigation may reveal additional capabilities necessary to exploit the full range of QEDs. Future work should select, enhance, and (if necessary) develop new data representations and learning frameworks to meet these needs.

Footnote 11: "Assignment mechanism" refers to an explicit description of how variables receive their values or how two entities become related. For example, one could assume that the occurrence of twins is randomly distributed throughout a population, or that securities industry reps decide to join firms based on specific characteristics of the firm as well as their own abilities.

• Develop automated learning methods that apply QEDs — A large number of specific techniques have been developed in QED and multi-level modeling. Future work should bring those methods into a common framework, develop automated methods for evaluating the applicability and likely validity of specific QEDs, develop querying methods that can select appropriate data samples, and implement algorithms for applying the techniques themselves. These methods will differ substantially from existing machine learning methods, which nearly always apply the same algorithm to the entire data set. In the case of QED, different data will likely be sampled to evaluate each dependency, and evaluating two different dependencies is likely to require two different designs. A sketch of the simplest such design appears at the end of this section.

• Integrate QED and joint modeling methods — Finally, future work should develop methods to control the application of both QED and traditional joint modeling techniques, integrating them into a single powerful system. Such a system will require some moderately complex reasoning because of the interactions between QED methods and joint modeling. For example, joint modeling might be intractable initially because the number of potential dependencies is so large. However, QED techniques could evaluate specific classes of dependencies and identify whole entities whose variables have no effect on specific variables. They could also identify dependencies that exist with high probability, and thereby vastly decrease the search space for joint modeling. Similarly, the dependencies found by joint modeling may enable or disable specific QEDs.

Once developed, we believe that these methods will lessen or remove the deficiencies noted above. Specifically, the fusion of machine learning and QED that we propose will directly automate application of QEDs and learn relational models. In addition, these methods will indirectly:

• Increase statistical power — In the simplest case, QEDs will strictly add to the potential power of a joint modeling approach, because they may be able to infer causal dependencies that joint modeling cannot, and they are unlikely to reduce the effectiveness of joint modeling. However, we also expect that they will vastly prune the search space for joint modeling techniques by identifying key causal dependencies and independencies before joint modeling is attempted. This, in turn, will significantly improve the overall power of the combined system.

• Reduce computational complexity — Application of QEDs is likely to be a low-complexity operation, both because they can make isolated inferences about specific dependencies without considering all possible dependencies (as joint modeling does) and because specific designs exploit very small data sets to make inferences. These inferences, in turn, can prune the search space for joint modeling.
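As promised above, here is a sketch (numpy only; data and effect sizes invented) of mechanically applying the simplest case-matching design, the non-equivalent group design of Section 4.3: select a matched sub-population and compare outcomes within pairs. An automated system would perform this kind of selection and comparison separately for each candidate dependency.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# Toy observational data: one measured covariate, a binary "treatment",
# and an outcome whose true treatment effect is +1.0.
covariate = rng.normal(size=n)
treated = (covariate + rng.normal(size=n)) > 0.5
outcome = 1.0 * treated + 2.0 * covariate + rng.normal(size=n)

# Case matching: pair each treated unit with the untreated unit nearest
# on the covariate, then compare outcomes within the matched pairs.
t_idx = np.flatnonzero(treated)
c_idx = np.flatnonzero(~treated)
c_sorted = c_idx[np.argsort(covariate[c_idx])]
cov_c = covariate[c_sorted]

pos = np.searchsorted(cov_c, covariate[t_idx])
left = np.clip(pos - 1, 0, len(c_sorted) - 1)
right = np.clip(pos, 0, len(c_sorted) - 1)
use_left = np.abs(cov_c[left] - covariate[t_idx]) <= np.abs(cov_c[right] - covariate[t_idx])
matches = np.where(use_left, c_sorted[left], c_sorted[right])

naive = outcome[treated].mean() - outcome[~treated].mean()
matched = (outcome[t_idx] - outcome[matches]).mean()
print(f"naive difference:   {naive:.2f}")    # inflated by the covariate
print(f"matched difference: {matched:.2f}")  # close to the true effect, +1.0
```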

7 Timeliness

This concept of combining work in machine learning and quasi-experimental design appears to be novel. As one indicator, consider the data in Figure 6, showing the results of several queries to Google Scholar, an online database of the full text of technical articles in a wide variety of fields. Approximately 20,000 of the articles mention machine learning or related technologies, and roughly a similar number mention quasi-experimental design. However, only roughly 100 of those articles (0.5%) include one or more terms from both queries. Of those, the vast majority merely mention one or both technologies rather than making simultaneous use of both.

Fig. 6. The relative overlap of articles in machine learning and quasi-experimental design, based on hits in Google Scholar for the queries ("machine learning" OR "inductive logic programming" OR "bayesian network" OR "graphical model") and ("quasi-experimental" OR "quasi experimental")

Given the apparent value of combining machine learning and QED, why haven't these methods been combined before? First, and most obviously, the methods needed to automatically apply QEDs do not yet exist, and developing them will be a substantial effort. Second, the majority of recent work in joint modeling has focused on parameter estimation and efficient inference in undirected graphical models, rather than on structure learning in directed models. The former topics are relevant to a variety of relatively low-level processing tasks such as machine translation, speech understanding, computer vision, information extraction, and information retrieval, while the latter is the one relevant to causal inference. Third, automatic application of QEDs virtually requires data representations that encode relations, time, and assignment mechanisms, and such representations have not been available in machine learning until quite recently. Finally, the application of QEDs relies on judicious and highly selective sampling, which runs directly counter to the prevailing practice in machine learning and knowledge discovery of trying to use data sets in their entirety.


While novel, this work builds on at least three key technical developments from the past decade of work in machine learning and knowledge discovery. Specifically:

• Large observational data sets — Large observational data sets have become the norm in the field, as well as in related fields such as social network analysis. These data sets provide the scope and size necessary to identify the specific subsets of data that correspond to particular QEDs. For example, in the case of the twin-study design mentioned above, a vast initial data set (or at least a population from which data can be drawn) is needed both to identify a sufficient number of pairs of twins and to ensure that a sufficient number of those pairs have at least one member with a particular disease or condition. Without such vast sets of data, many QEDs could not be effectively applied.

• Relational data and knowledge representations — Automatic application of QEDs would be impossible without the sort of rich and expressive data and knowledge representations developed in the past decade (see footnote 12). Remarkably, essentially all of the information needed to apply QEDs is encoded in at least one of these representations, although no single representation appears to have every type of information needed.

• Graph query languages — Several languages developed in the past decade can extract complex structures from large relational data sets. Unlike traditional query languages (e.g., SQL), these languages return subgraphs (sets of interconnected records, each of which exists in the original graph) rather than single records created for the purpose of summarizing the matching records. This capability is precisely what is needed for checking the applicability of QEDs and carrying out the requisite data sampling (see the sketch below).

Footnote 12: What is remarkable is not that these richer data representations exist and might be useful, but that methods for joint modeling have been able to infer anything meaningful about causality without them. Every other field of study that attempts to infer causality from observational data implicitly uses such representations, even though the methods are entirely manual.

Given these developments, we believe the time is right to attempt a fusion of machine learning and QED.
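To illustrate the subgraph-returning style of query, here is a sketch using the networkx library; the node kinds, attribute names, and the query itself are invented for illustration and bear no relation to actual CRD fields.

```python
import networkx as nx

# Hypothetical miniature of a CRD-like graph.
g = nx.Graph()
g.add_nodes_from(["rep1", "rep2", "rep3"], kind="rep")
g.add_nodes_from(["branchA", "branchB"], kind="branch")
g.add_edges_from([("rep1", "branchA"), ("rep2", "branchA"), ("rep3", "branchB")],
                 kind="works_at")
g.nodes["rep1"]["disclosures"] = 2
g.nodes["rep2"]["disclosures"] = 0
g.nodes["rep3"]["disclosures"] = 0

# Return subgraphs, not summary rows: every (rep, shared branch, co-worker)
# triple in which the first rep has at least one disclosure.
def coworkers_of_disclosed(g):
    for rep, data in g.nodes(data=True):
        if data.get("kind") == "rep" and data.get("disclosures", 0) > 0:
            for branch in g.neighbors(rep):
                for other in g.neighbors(branch):
                    if other != rep and g.nodes[other]["kind"] == "rep":
                        yield g.subgraph([rep, branch, other])

for sub in coworkers_of_disclosed(g):
    print(sorted(sub.nodes))   # e.g., ['branchA', 'rep1', 'rep2']
```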

8 Risks and Benefits

The risks of this research, while fairly high, are more than compensated for by the large potential benefits of improved techniques for causal inference. Potential risks include:

• Limited applicability of QEDs — It could be that QEDs apply only in a relatively narrow range of circumstances. This appears very unlikely, given the vast number of studies in the social sciences that use QEDs.

• Lack of synergy between QEDs and joint modeling — It is possible that there is relatively little overlap between the cases where QEDs apply and the cases where joint modeling cannot resolve dependencies.

• High complexity — It is possible that effective reasoning about when to apply specific QEDs and joint modeling techniques will have higher complexity than doing joint inference alone. This seems very unlikely and, even if true, could probably be addressed by using a set of heuristics rather than more precise reasoning methods.

• Low power due to representational richness — Increasing the expressiveness of model representations may increase the size of the search space so much that it overwhelms any gains from applying QEDs. However, all this would mean is that we have developed imperfect methods for learning models that previously could not be learned at all.

In either case, the effort will significantly advance the state of the art in causal inference. The potential benefits include overcoming the limitations noted in the previous sections, as well as some potential ancillary benefits:

• General theory of QED — Quasi-experimental design is currently a loose collection of relatively unrelated designs. It is possible that, in the course of developing automatic methods for applying QEDs, a unified and more general framework for QED will emerge. This would represent not only a computational advance, but a statistical and philosophical one as well.

• Theory of effective representations — This work may help identify a formal basis for evaluating the quality of data representations. Specifically, good representations are those that maximize the opportunities for applying QEDs (and thus for inferring causality). Evaluating representations is a long-standing problem in machine learning, and it has been particularly salient in recent work in statistical relational learning.

• Unifying framework for interdisciplinary research — If successful, this work would provide a common language for computer scientists, statisticians, philosophers, and social scientists, magnifying the number of researchers working toward common goals and accelerating progress in all of these fields.

Acknowledgments. Discussions with Andrew Fast, Lisa Friedland, Henry Goldberg, Michael Hay, John Komoroske, Marc Maier, Jennifer Neville, Matthew Rattigan, and Agustin Schapira contributed to the ideas in this paper. This material is based on research sponsored by the Air Force Research Laboratory and the Distributive Technology Office (DTO), under agreement number FA8750-07-2-0158. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Distributive Technology Office (DTO), or the U.S. Government.

References

1. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge (2000)
2. Getoor, L., Taskar, B.: Introduction to Statistical Relational Learning. MIT Press, Cambridge (2007)
3. Neville, J., Simsek, Ö., Jensen, D., Komoroske, J., Palmer, K., Goldberg, H.: Using Relational Knowledge Discovery to Prevent Securities Fraud. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2005)


4. Fast, A., Friedland, L., Maier, M., Taylor, B., Jensen, D., Goldberg, H., Komoroske, J.: Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007)
5. Friedland, L., Jensen, D.: Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007)
6. Boring, E.: The Nature and History of Experimental Control. The American Journal of Psychology 67(4), 573–589 (1954)
7. Fisher, R.: Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh (1925)
8. Holland, P., Rubin, D.: Causal Inference in Retrospective Studies. Evaluation Review 12, 203–231 (1988)
9. Rubin, D.: Formal Models of Statistical Inference for Causal Effects. Journal of Statistical Planning and Inference 25, 279–292 (1990)
10. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn. MIT Press, Cambridge (2000)
11. Campbell, D., Stanley, J., Gage, N.: Experimental and Quasi-Experimental Designs for Research. Rand McNally, Chicago (1963)
12. Shadish, W., Cook, T., Campbell, D.: Experimental and Quasi-Experimental Designs. Houghton Mifflin, Boston (2002)
13. Boomsma, D., Busjahn, A., Peltonen, L.: Classical Twin Studies and Beyond. Nature Reviews Genetics 3, 872–882 (2002)