Flattening network data for causal discovery: What could go wrong? Marc Maier

Katerina Marazopoulou

David Arbour

David Jensen

Knowledge Discovery Laboratory School of Computer Science University of Massachusetts Amherst

{maier, kmarazo, darbour, jensen}@cs.umass.edu

1. INTRODUCTION


Methods for learning causal dependencies from observational data have been the focus of decades of work in social science, statistics, machine learning, and philosophy [18, 21, 22]. Much of the theoretical and practical work on causal discovery has focused on propositional representations. Propositional models effectively represent individual directed causal dependencies (e.g., path analysis, Bayesian networks) or conditional distributions of some outcome variable (e.g., linear regression, decision trees). However, propositional representations are limited to modeling independent and identically distributed (IID) data of a single entity type.

Many real-world systems involve heterogeneous, interacting entities with probabilistic dependencies that cross the boundaries of those entities (i.e., non-IID data with multiple entity types and relationships). For example, citation data involve researchers collaborating on scholarly papers that cite prior work and are published in various venues. These systems produce network, or relational, data, and they are of paramount interest to researchers and practitioners across a wide range of disciplines. To model such data, researchers in statistics and computer science have devised expressive classes of directed graphical models, such as probabilistic relational models (PRMs) [3], and network models, such as exponential random graph models (ERGMs) [6].

Despite the assumptions embedded in propositional models, a common practice is to flatten, or propositionalize, relational data and use existing algorithms [11] (see Figure 1, focusing on algorithms that learn causal graphical models). While there are statistical concerns, this process is generally innocuous if the task is to model statistical associations for predictive inference. In contrast, when the goal is to learn causal structure, estimate causal effects, or support inference over interventions, the effects of flattening inherently relational data can be particularly deleterious.
In this paper, we identify four classes of potential issues that can occur with a propositionalization strategy as opposed to embracing a more expressive representation that would not succumb to these problems. We also present empirical results comparing the effectiveness of two theoretically sound and complete algorithms that learn causal structure: PC—a widely used constraint-based, propositional algorithm for causal discovery [22], and RCD—a recently developed constraint-based algorithm that reasons over a relational representation [15].

Figure 1: Schematic of propositionalizing (flattening) and using a propositional learner to discover a causal model (top) vs. using a relational learner on the original data (bottom).

2. BACKGROUND ON FLATTENING

Most techniques in classical statistics and machine learning operate over a single data table, often referred to as propositional data. However, it is becoming increasingly common for real-world systems to involve multiple types of interacting entities. These systems produce relational data that can be represented as a network. There have been two main approaches to modeling such data: (1) develop new algorithms to handle the increased representational complexity, or (2) transform the data into a single table and rely on existing algorithms. The latter approach involves a process called propositionalization, which flattens relational data into a propositional representation [11].

Propositionalization is the process of projecting a set of tables (typically one for each entity and relationship type) down to a single, propositional table. This procedure is defined with respect to a single perspective—one of the original entity or relationship types—from which the other tables are summarized. The additional constructed features are meant to capture relational aspects of the data, placing the complexity on transformation rather than on learning. Propositionalization creates all variables ahead of time, decoupling feature construction from learning, as opposed to dynamically generating variables as in most relational learning algorithms. There are many different approaches and systems for propositionalization, but they can generally be divided into three classes: (1) those constructing boolean clauses over related concepts, as typically used in the inductive logic programming (ILP) community (e.g., LINUS [13], RSD [14]); (2) those using database aggregations over sets of related values, as used in the knowledge discovery in databases (KDD) community (e.g., ACORA [19]); and (3) those characterized by both (e.g., Relaggs [12]).
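As a concrete sketch of the first two feature classes (the tables and feature names here are invented for illustration, not drawn from any of the cited systems), an ILP-style boolean clause and a KDD-style aggregation over the same author set might be constructed as follows:

```python
import pandas as pd

# Hypothetical relational table linking papers to their authors.
authors = pd.DataFrame({
    "paper_id": [1, 1, 2, 3, 3, 3],
    "h_index":  [25, 4, 12, 7, 30, 2],
})

# Class (1): an ILP-style boolean clause over related records,
# e.g. "the paper has at least one author with h-index > 20".
has_senior_author = authors.groupby("paper_id")["h_index"].apply(
    lambda h: bool((h > 20).any())
)

# Class (2): a KDD-style database aggregation over the same set,
# e.g. the mean author h-index per paper.
mean_h_index = authors.groupby("paper_id")["h_index"].mean()

# Both become columns of the single flattened table.
flat = pd.DataFrame({
    "has_senior_author": has_senior_author,
    "mean_h_index": mean_h_index,
})
print(flat)
```

A Relaggs-style (class 3) system would emit both kinds of columns at once; the point is that all such features are fixed before any learning begins.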


Historically, there has been a debate surrounding the efficacy of propositionalization. The disadvantages focus on statistical concerns, including the prevalence of autocorrelation [7, 10] and degree disparity bias [8], both of which can increase Type I error (alternative tests that avoid such bias are discussed by Rattigan & Jensen [20]). The advantages center around not needing to develop new algorithms given the vast number of deployed propositional learning algorithms. Flattening data invariably leads to some loss of the relational information, but it is generally assumed that predictive accuracy—a common measure for tasks involving propositionalization—achieves equivalent levels of performance. However, the accuracy varies depending on the propositionalization approach. The goal of this paper is to outline potential problems with propositionalization for causal discovery rather than prediction. Furthermore, while statistical issues can also lead to serious causal concerns (e.g., statistical conclusion validity [21]), this paper examines the extent to which mistakes can occur when reasoning about causality in the absence of statistical errors.

3. PROBLEMS DUE TO FLATTENING

In this section, we describe four potential problems faced by propositionalization approaches in the context of learning causal dependencies. The first three points were also identified by Maier et al. [17].

¹A relational schema describes the entity, relationship, and attribute classes in a domain. It can be displayed graphically as an entity-relationship (ER) diagram: rectangles correspond to entity classes, diamonds to relationships, ovals to attributes, and cardinality constraints are drawn as crow's feet.

As an example, consider the domain of scholarly publishing. Figure 2 presents a relational schema¹ that describes a subset of the domain. Papers are authored by one or more researchers, published in a single venue, and give and receive citations to other scholarly articles. Each entity type, as well as each relationship (although none does in this example), can include intrinsic attributes, such as paper length, citation count, and subject; researcher h-index [5]; and venue impact factor [2]. For citation analysis, there has been considerable interest in analyzing bibliographic networks to predict citation counts [23, 24]. One approach could be to propositionalize the relational data from the perspective of a paper and run linear regression to predict the number of citations. Propositionalized features could include the average or maximum h-index of the researchers who author the paper, the minimum number of citations that cited papers have received, etc. This process would construct a single table where the rows correspond to papers and the columns are attributes (intrinsic and relational) of those papers.

Figure 2: An example relational schema for the domain of scholarly publishing: Papers are authored by researchers, published in venues, and cite other scholarly articles.
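A minimal version of this flatten-then-regress pipeline (with invented numbers, and ordinary least squares via numpy rather than a statistics package) might look like:

```python
import numpy as np
import pandas as pd

# Hypothetical entity tables for the scholarly-publishing schema.
papers = pd.DataFrame({
    "paper_id": [1, 2, 3, 4],
    "impact_factor": [2.0, 5.5, 1.0, 3.0],   # of the publishing venue
    "citations": [10, 40, 3, 18],            # outcome to predict
})
authors = pd.DataFrame({
    "paper_id": [1, 1, 2, 3, 4, 4],
    "h_index":  [5, 9, 32, 2, 11, 14],
})

# Flatten from the paper perspective: summarize the related
# researchers into per-paper aggregate columns.
agg = authors.groupby("paper_id")["h_index"].agg(["mean", "max"])
flat = papers.merge(agg, left_on="paper_id", right_index=True)

# Ordinary least squares on the single flattened table.
X = np.column_stack([np.ones(len(flat)), flat["mean"], flat["impact_factor"]])
y = flat["citations"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, mean_h_index, impact_factor:", coef)
```

Everything downstream of `flat` is a standard propositional analysis; the problems below concern what is lost or distorted in constructing `flat` itself.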

(1) Unnecessary latent variables

Manual and misinformed approaches to propositionalization are particularly liable to exclude relevant variables from the flattened data set. Causal sufficiency—the assumption that all common causes of variables are observed and included in the data—is an important assumption for causal discovery methods. Missing variables can violate causal sufficiency by failing to represent these common causes. This may lead to false positive dependencies or arbitrary bias of causal effects (often referred to as omitted-variable bias). Even though X and Y may have no direct dependence, there may be no observed set of variables to render them conditionally independent. If X does have a direct effect on Y, but the conditional model for Y excludes a relevant variable Z, then the modeled effect of X may be biased. For example, if researcher h-index is excluded from the model of paper citations, then the model may over- or under-estimate the effect of venue impact factor. Additionally, by not representing certain variables, propositional methods may introduce false negative dependencies: if a causal dependency exists between two variables, but one or both are excluded in the propositionalization, then that dependency will necessarily be missed. The abstract ground graph—a recent lifted representation for network data introduced by Maier et al. [16]—fully specifies, for a given perspective, the variables necessary to avoid such problems for a large class of relational models. Nevertheless, the problems outlined below remain potential threats to learning valid causal dependencies, even for a causally sufficient data set.

(2) Induced selection bias

Berkson's paradox is a well-known phenomenon in which two marginally independent variables X and Y can be rendered conditionally dependent given a third variable Z, if Z is a common effect of X and Y [1].
This frequently occurs in relational data sets, where testing the association between two variables on different entities implicitly conditions on the existence of the relationships (i.e., links) between them [17]. This particular form of Berkson's paradox is commonly referred to as selection bias. If the existence of the relationship is a common effect (e.g., in the presence of homophily),

then the variables appear to be statistically dependent even in the absence of a direct causal dependency. For example, the subjects of a citing and a cited paper appear dependent because the existence of the citation is itself due jointly to their having common subjects. Propositionalization implicitly conditions on relationship existence by creating variables based on connectivity in the underlying network. Thus, standard propositional methods would learn false positive dependencies where direct causation does not occur. There are causal discovery algorithms for propositional data that reason about selection effects (e.g., FCI [22]), but they do not explicitly represent dependencies involving relationship existence. Modeling the probability of relationship existence has been explored in both statistical [4] and causal contexts [17], but its representation and connection to conditional independence remains an open problem.

(3) Violations of the causal Markov condition

Another important assumption for causal discovery is the causal Markov condition, which states that every variable should be conditionally independent of its non-effects given its causes. Data or methods that fail to meet this assumption can produce false positive dependencies. Propositionalization is responsible for feature construction, requiring that all variables be generated prior to learning. As currently practiced, this process can lead to individual attribute values participating in multiple aggregate variables. This can result from either choosing more than one aggregation function for the same underlying relational variable (e.g., average and maximum researcher h-index) or representing two different relational variables that have a nonempty intersection (e.g., the set of cited papers and the set of previous papers by the same authors will frequently overlap). This duplication produces statistical dependence among the values of these aggregated variables. Consider the case where X causes Y, but the propositionalization created multiple variables based on Y. Then, even after conditioning on X, a statistical dependence among the Y variables will remain due to correlated residual variation, despite the absence of a direct causal dependency. This violates the causal Markov condition. In contrast, relational learning algorithms typically construct features dynamically and would not consider a direct dependency between aggregates based on the same variable.

(4) Unspecified acyclicity constraints

Most causal discovery algorithms using graphical model representations (e.g., Bayesian networks) assume atemporal and acyclic data (i.e., no feedback loops). However, after propositionalizing, the input data may contain variables based on the same attributes. For example, the data may have attributes for a paper's subject and the most common (modal) subject of the papers it cites. Without additional constraints on the dependency space, propositional learning algorithms that compare all pairs of variables will allow a dependency between X and the X values of related entities, even though the assumptions on the model space preclude such classes of dependencies. This issue can arise if practitioners use out-of-the-box software packages (e.g., pcalg in R [9], the TETRAD project [22]) without additional expertise in the underlying algorithms. As in problem (3), there may be variables with different aggregates for the same underlying attribute (e.g., the mean, minimum, and maximum number of citations of cited papers). In contrast, relational learners have built-in constraints on the dependency space and would not consider dependencies that conflict with their assumptions.
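The selection-bias mechanism of problem (2) can be illustrated with a small simulation (all variables hypothetical): two paper subjects are generated independently, a citation link forms only when the subjects are similar, and conditioning on link existence induces a strong spurious association.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Subject attributes of candidate citing/cited paper pairs,
# generated independently: marginally there is no dependence.
subject_a = rng.normal(size=n)
subject_b = rng.normal(size=n)

# A citation link exists only when the subjects are similar
# (a crude stand-in for homophily in link formation).
link = np.abs(subject_a - subject_b) < 0.5

# Marginal correlation is near zero...
r_all = np.corrcoef(subject_a, subject_b)[0, 1]
# ...but among linked pairs (the only pairs a flattened table
# ever contains), a strong spurious correlation appears.
r_linked = np.corrcoef(subject_a[link], subject_b[link])[0, 1]
print(f"corr(all pairs)    = {r_all:+.3f}")
print(f"corr(linked pairs) = {r_linked:+.3f}")
```

A propositional learner that only ever sees the linked rows has no way to distinguish this induced dependence from a genuine causal one.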

4. EXPERIMENT

The relational causal discovery (RCD) algorithm was recently introduced as a method for learning causal structure from relational data [15]. RCD is an extension of the PC algorithm—a widely adopted baseline causal discovery algorithm for propositional data, with various open-source implementations [22]. Both algorithms are provably sound and complete, which implies that they learn only correct causal dependencies (soundness) and that no other method can infer more causal dependencies from observational data (completeness). These theoretical results hold given several common assumptions, including causal sufficiency (all common causes are included in the data), faithfulness (the probability distribution over the data encodes only the conditional independencies entailed by the causal structure), model acyclicity, and perfect conditional independence tests. RCD also assumes a prior relational skeleton (i.e., that attributes are not causes of entity or relationship existence). This presents an ideal scenario under which we can empirically test some of the effects of propositionalization.

We compare the effectiveness of learning causal structure with RCD against PC executed on propositionalized relational data from each perspective (i.e., entity or relationship class). We limited the propositionalization to variables whose schema-level paths visit an entity or relationship class at most once. We then take the best and worst perspectives for each trial by computing the average F-score of the undirected (skeleton) and partially directed (oriented) learned models. To meet the assumptions of RCD, we focus on synthetically generated relational models that are both acyclic and causally sufficient. We use a relational d-separation oracle [16] to provide perfect tests of conditional independence, thereby also satisfying the faithfulness condition.
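A rough sketch of how such random schemas might be drawn is shown below. The parameter choices follow the experimental settings described in this section; the function is an illustration, not the actual generator used with RCD, and it omits the dependency-generation and hop-limit machinery.

```python
import numpy as np

def random_schema(n_entities, rng):
    """Draw a toy relational schema: n_entities entity classes,
    n_entities - 1 relationship classes with cardinalities chosen
    uniformly at random, and Pois(lambda=1) + 1 attributes per item."""
    entities = [f"E{i}" for i in range(n_entities)]
    relationships = []
    for i in range(n_entities - 1):
        # Cardinality on each side of the relationship, uniform at random.
        card = tuple(rng.choice(["one", "many"], size=2))
        relationships.append((f"R{i}", entities[i], entities[i + 1], card))
    items = entities + [r[0] for r in relationships]
    attrs = {item: 1 + int(rng.poisson(1)) for item in items}
    return entities, relationships, attrs

rng = np.random.default_rng(42)
for n_entities in range(1, 5):          # entities: 1-4, as in the experiment
    print(n_entities, random_schema(n_entities, rng))
```

With a single entity class the schema degenerates to one table, which is why the experiment subsumes ordinary propositional Bayesian networks as a special case.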
We generated 1,000 random causal models over randomly generated schemas for each combination of the following settings: entities (1–4); relationships (one fewer than the number of entities) with cardinalities selected uniformly at random; attributes per item drawn from Pois(λ = 1) + 1; and relational dependencies (1–15), limited to at most 4 hops connecting schema items and at most 3 causes per variable. This procedure yielded a total of 60,000 synthetic models. Note that this includes propositional Bayesian networks when there is a single entity class. For each trial, we record the precision (the proportion of learned edges that appear in the true model) and recall (the proportion of true edges that appear in the learned model) for both the undirected skeleton and the partially oriented model.

Figure 3: Skeleton and oriented precision and recall for the RCD algorithm, as well as the best and worst perspectives for PC on propositionalized data. Results are averaged over 1,000 models for each setting.

Figure 3 displays the average across 1,000 trials for each algorithm and measure. Both algorithms learn identical models in the single-entity case because RCD reduces to PC when analyzing propositional data. For truly relational data, RCD is necessary for accurate learning: the best and worst propositionalized PC cases learn flawed skeletons (and, consequently, flawed oriented models), with high false positive and high false negative rates. The main culprit for the poor performance of PC is problem (1) from Section 3: the propositionalization creates unnecessary latent variables, leading to both false positives and false negatives. For learning causal structure, this is particularly damaging, as PC both inserts and misses many causal dependencies. This experiment isolates the effect of problem (1) by learning without respect to actual data (problem (3)), choosing models in which attributes do not cause relationship existence (problem (2)), and propositionalizing such that additional constraints are unnecessary (problem (4)).
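The skeleton precision and recall used above reduce to simple set operations over undirected edges. A sketch (with invented edge sets; this is not the paper's evaluation code):

```python
def skeleton_scores(learned, true):
    """Precision, recall, and F-score over undirected edges,
    represented as frozensets so that {X, Y} == {Y, X}."""
    learned = {frozenset(e) for e in learned}
    true = {frozenset(e) for e in true}
    tp = len(learned & true)                       # correctly learned edges
    precision = tp / len(learned) if learned else 0.0
    recall = tp / len(true) if true else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Hypothetical true and learned skeletons: one false positive, one miss.
true_edges = [("X", "Y"), ("Y", "Z"), ("Z", "W")]
learned_edges = [("Y", "X"), ("Y", "Z"), ("W", "V")]
print(skeleton_scores(learned_edges, true_edges))
```

Oriented scores are analogous but treat edges as ordered pairs, so a correctly detected adjacency with the wrong direction counts against both precision and recall.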

5. CONCLUSIONS

We described four problems that can occur in causal discovery from network, or relational, data when those data are flattened via propositionalization techniques. Propositionalization can lead to fundamental violations of assumptions embedded in causal discovery methods, such as causal sufficiency and the causal Markov condition. We presented empirical evidence of the extent to which induced latent variables can decrease the effectiveness of a standard propositional causal discovery algorithm. While arguments for the advantages of propositionalization have been made in the literature, they focus solely on learning models with predictive capabilities. We argue that for causal discovery, propositionalizing inherently relational data is ill-advised. Native representations and learning algorithms for network data retain valuable information for causal discovery, which should be leveraged to uncover more causal structure while avoiding serious errors.

Acknowledgments This effort is supported by the Air Force Research Lab under agreement number FA8750-09-2-0187, the National Science Foundation under grant number 0964094, and Science Applications International Corporation (SAIC) and DARPA under contract number P010089628. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL, NSF, SAIC, DARPA, or the U.S. Government. Katerina Marazopoulou received scholarship support from the Greek State Scholarships Foundation.

6. REFERENCES

[1] J. Berkson. Limitations of the application of fourfold table analysis to hospital data. Biometrics Bulletin, 2(3):47–53, June 1946.
[2] E. Garfield. Citation analysis as a tool in journal evaluation. Science, 178:471–479, 1972.
[3] L. Getoor, N. Friedman, D. Koller, A. Pfeffer, and B. Taskar. Probabilistic relational models. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning, chapter 5, pages 129–174. MIT Press, Cambridge, MA, 2007.
[4] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679–707, 2002.
[5] J. E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46):16569–16572, 2005.
[6] P. W. Holland and S. Leinhardt. An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76(373):33–50, 1981.
[7] D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 259–266, 2002.
[8] D. Jensen, J. Neville, and M. Hay. Avoiding bias when aggregating relational data with degree disparity. In Proceedings of the Twentieth International Conference on Machine Learning, pages 274–281, 2003.
[9] M. Kalisch, M. Mächler, D. Colombo, M. H. Maathuis, and P. Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26, 2012.
[10] D. Kenny and C. Judd. Consequences of violating the independence assumption in analysis of variance. Psychological Bulletin, 99(3):422–431, 1986.
[11] S. Kramer, N. Lavrač, and P. Flach. Propositionalization approaches to relational data mining. In S. Džeroski and N. Lavrač, editors, Relational Data Mining, pages 262–286. Springer-Verlag, New York, NY, 2001.
[12] M.-A. Krogel. On Propositionalization for Knowledge Discovery in Relational Databases. PhD thesis, Otto-von-Guericke-Universität Magdeburg, 2005.
[13] N. Lavrač. Principles of Knowledge Acquisition in Expert Systems. PhD thesis, Faculty of Technical Sciences, University of Maribor, 1990.
[14] N. Lavrač, F. Železný, and P. Flach. RSD: Relational subgroup discovery through first-order feature construction. In Proceedings of the Twelfth International Conference on Inductive Logic Programming, pages 149–165, 2002.
[15] M. Maier, K. Marazopoulou, D. Arbour, and D. Jensen. A sound and complete algorithm for learning causal models from relational data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 2013.
[16] M. Maier, K. Marazopoulou, and D. Jensen. Reasoning about independence in probabilistic models of relational data. arXiv preprint arXiv:1302.4381, 2013.
[17] M. Maier, B. Taylor, H. Oktay, and D. Jensen. Learning causal models of relational domains. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[18] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, 2000.
[19] C. Perlich and F. Provost. Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62(1-2):65–105, 2006.
[20] M. J. Rattigan and D. D. Jensen. Leveraging d-separation for relational data sets. In Proceedings of the Tenth IEEE International Conference on Data Mining, pages 989–994, 2010.
[21] W. R. Shadish, T. D. Cook, and D. T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston, MA, 2002.
[22] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.
[23] R. Yan, J. Tang, X. Liu, D. Shan, and X. Li. Citation count prediction: Learning to estimate future citations for literature. In Proceedings of the Twentieth ACM International Conference on Information and Knowledge Management, pages 1247–1252, 2011.
[24] X. Yu, Q. Gu, M. Zhou, and J. Han. Citation prediction in heterogeneous bibliographic networks. In Proceedings of the SIAM International Conference on Data Mining, pages 1119–1130, 2012.