Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric Hierarchical Structural Equation Model Approach

Camilo Alberto Cárdenas Hurtado
Economist, UNAL

Universidad Nacional de Colombia
Facultad de Ciencias
Departamento de Estadística
Bogotá, D.C.
November 2016

Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric Hierarchical Structural Equation Model Approach

Camilo Alberto Cárdenas Hurtado
Economist, UNAL

A dissertation submitted for the degree of Master of Science, Statistics

Advisor
B. Piedad Urdinola, Ph.D.
Ph.D. in Demography, UC Berkeley

Universidad Nacional de Colombia
Facultad de Ciencias
Departamento de Estadística
Bogotá, D.C.
November 2016


Title in English
Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric Hierarchical Structural Equation Model Approach.

Título en español
Inferencia Causal en Presencia de Unidades Causalmente Conectadas: Una Aproximación a través de un Modelo Semi-Paramétrico y Jerárquico de Ecuaciones Estructurales.

Abstract: Causal inference has become a dominant research area in both theoretical and empirical statistics. One of the main drawbacks of conventional frameworks is the assumption of no causal interactions among individuals (i.e., independent units). Violation of this assumption often yields biased estimations of the causal effects of an intervention in quantitative social, biomedical, and epidemiological research. This document proposes a novel approach for modeling causal connections among units within the Structural Causal Model framework: a Semi-Parametric Hierarchical Structural Equation Model (SPHSEM). Estimation uses Bayesian techniques, and the empirical performance of the proposed model is evaluated through both simulation and applied studies. Results show that the Bayesian SPHSEM recovers nonlinear (causal) relationships between latent variables belonging to different levels and yields unbiased estimates of the (causal) model parameters.

Resumen: La inferencia causal se ha convertido en un área activa de investigación en la estadística teórica y aplicada. Una falencia de las aproximaciones convencionales es el supuesto de ausencia de interacciones causales entre individuos (unidades independientes de estudio). La violación de este supuesto resulta en estimaciones sesgadas de los efectos causales en investigaciones sociales, biomédicas y epidemiológicas. En este documento se propone una nueva manera de modelar dichas conexiones causales bajo el Modelo Estructural de Causalidad: un modelo Semi-Paramétrico y Jerárquico de Ecuaciones Estructurales (SPHSEM). La estimación se hace mediante técnicas Bayesianas, y su capacidad empírica se evalúa a través tanto de un ejercicio de simulación como de una aplicación empírica. Los resultados confirman que el SPHSEM Bayesiano recupera las relaciones causales no lineales que existen entre variables latentes pertenecientes a distintos niveles de agrupamiento, y que las estimaciones de los parámetros causales son insesgadas.

Keywords: Causal inference, independence assumption violation, causally connected units, directed acyclic graphs (DAG), structural equation models, hierarchical linear models, semiparametric models, Bayesian estimation.

Palabras clave: Inferencia causal, violación del supuesto de independencia, dependencia entre observaciones, grafos acíclicos direccionados (DAG), modelos de ecuaciones estructurales (SEM), modelos jerárquicos (HLM), modelos semiparamétricos, estimación Bayesiana.

Acceptance Note
Thesis Work Approved
“TBD mention”

Jury Edilberto Cepeda, Ph.D.

Jury Iván Díaz, Ph.D.

Advisor B. Piedad Urdinola, Ph.D.

Bogotá, D.C., May 31st, 2017

Dedicated to everyone who truly believed in me throughout this journey.

Acknowledgements

First, I would like to thank my advisor, Prof. B. Piedad Urdinola. Her never-ending patience and support fueled my passion for academics and encouraged me to keep working every day on my thesis, despite it being a challenging, yet rewarding, journey. To her, my deepest gratitude. Second, I am grateful to the members of my jury, Prof. Iván Díaz and Prof. Edilberto Cepeda, as well as to my professors in the Statistics Department at Universidad Nacional de Colombia, for their guidance, knowledge, time, and comments on previous versions of this document. Third, I would like to thank Prof. Nian-Sheng Tang for kindly sharing his superb, unpublished manuscript. His work was key to understanding and setting the basis for the structural equation model presented here. Also, I am indebted to my family, friends, and partners, who were always patient and understanding with my temporary absences. Their company, love, and support were fundamental throughout these years. To them, thank you. Finally, to Paola. Her support was critical during the last few months of this process. I cannot say anything but “gracias infinitas, siempre”.

Contents

Introduction

1. Causality: An introduction
   1.1 In the search of a causal language
   1.2 Going deeper into Pearl's Structural Causal Model (SCM)
   1.3 The relationship between SCM and RCM: Why the former and not the latter?

2. Causal Inference Through Parametric, Linear Models
   2.1 Structural Equation Models (SEMs)
   2.2 Bayesian Estimation of Structural Equation Models

3. Causally Connected Units
   3.1 Hierarchical Structural Equation Models (HSEM)
   3.2 Bayesian Estimation of HSEM

4. A Semi-Parametric Hierarchical Structural Equation Model (SPHSEM)
   4.1 The observed random variables
   4.2 The measurement equations
   4.3 The structural equations
   4.4 A Note on Bayesian P-splines
   4.5 Identification Constraints

5. Bayesian Estimation of the SPHSEM
   5.1 Prior Distributions
   5.2 Posterior Inference
   5.3 Implementation

6. Simulations & Application
   6.1 A Simulation Study
       6.1.1 MCMC Simulations and Results
       6.1.2 Bayesian Model Comparison and Goodness-of-fit Statistics
       6.1.3 An Intervention to the Simulated Causal System
   6.2 Empirical Application of the SPHSEM: Hostility, Leadership, and Task Significance Perceptions
       6.2.1 Results
       6.2.2 Analysis of Intervention: What if soldiers reported higher Leadership perceptions?

Conclusions

Further Research

Appendices
   Some Causal Quantities of Interest
   Derivation of posterior distributions
   Results of Simulation Study
   MCMC Results for the Simulation Example
   Description of the Empirical Exercise
   R Codes

Introduction

Scientific discovery, both in the Natural and the Social Sciences, is the task of learning about the world through observation. Most Social Science studies handle social phenomena from a descriptive point of view, a task that is by no means easy, but challenges arise when shifting from descriptive questions (what) to causal questions (what if and why). As Social Scientists face what has been called the fundamental problem of causal inference in social research (Holland, 1986), there has been a surge of new developments in statistical theory and methodologies for causal analysis over the past few decades. These statistical methodologies are framed in what is known in social statistics as Causal Inference.

Several authors have contributed to the causal inference literature from either a theoretical or an empirical approach (Rubin, 1974, 1978, 2006; Angrist et al., 1996; Robins, 1986; Robins et al., 2004; or Pearl, 2009b, as the most cited or best-known references), but most of them assume no causal connections or interactions among individuals (i.e., independent units and the Stable Unit Treatment Value Assumption, SUTVA), something that is very uncommon when by observations we mean people. Formal results show that the presence of causally connected individuals yields biased estimations of the causal effects of an intervention (Rosenbaum, 2007; Sekhon, 2008). VanderWeele and An (2013) present a survey of the most recent advances in modeling causal relationships in the presence of causally connected units but, as the authors themselves recognize, “a formal theory for inferring causation on social networks is arguably still in its infancy” (p. 353), and most papers on causal inference with causally connected units lack the structural framework proposed by Pearl (1988b, 1995, 2009b).

This thesis aims to fill this gap by presenting a Semi-Parametric Hierarchical Structural Equation Model (SPHSEM) that accounts for the presence of non-independent, causally connected units clustered in groups that are organized in a multilevel fashion. We build upon the work on the hierarchical structural equation model by Rabe-Hesketh et al. (2004); Rabe-Hesketh and Skrondal (2012) and Lee and Tang (2006); Lee (2007), among others; and, following Song et al. (2013) and others, expand it by proposing a semi-parametric formulation akin to the theoretical SCM presented in Pearl (2009b). This novel methodology allows quantitative Social Scientists, and researchers in Applied Biology or Epidemiology, to assess causal effects of interventions from observational data sampled from clustered subpopulations. Following Lee (2007); Song and Lee (2012a) and Song et al. (2013), we present a Bayesian estimation algorithm for


the SPHSEM’s parameters.

This document is organized as follows: after this introduction, Chapter 1 presents an introduction to Causality and its multiple theoretical and applied frameworks. Chapter 2 presents how causal inference is achieved through statistical models, in particular Structural Equation Models. Chapter 3 presents how causally connected units are modeled in a multilevel SEM. In Chapter 4 we present the Semi-Parametric Hierarchical Structural Equation Model (SPHSEM), the main contribution of this thesis. Chapter 5 explains the Bayesian estimation procedure, the algorithms, and the algebraic derivations of the model. Chapter 6 presents both a simulation study and an applied study that provide an empirical idea of the performance of the proposed SPHSEM. Lastly, we conclude and present some future research opportunities around the SPHSEM.

CHAPTER 1

Causality: An introduction

The main objective of scientific endeavors is to establish, understand, and even find new causal relationships in the world we know. Nonetheless, causality is commonly framed within controversial discussions. Aside from the mere philosophical motivation of truly understanding the causes and effects related to a given phenomenon, causal knowledge permits the researcher to build up a system or model that allows for predicting new outcomes for a variable of interest given an intervention.

Assessing the impact of an intervention on a causal system is straightforward under experimental conditions. However, researchers cannot always count on data from experimental designs and, in many cases, randomized experiments cannot be properly conducted, especially in the social sciences. In the latter cases it might be just too expensive or even unethical to perform experiments where people are involved, and therefore social scientists usually resort to observational data in order to get their research going. This makes causal knowledge hard to attain, since the associations present in observational data do not necessarily imply direct causal relations. Moreover, from observational data alone we are only able to compute associational quantities, such as correlations, and there might be confounding variables present in the underlying structure of the data that cannot be directly measured and that have to be taken into account when posing causal claims. The reader might already be aware that resorting to observed data (i.e., realizations of random variables) necessarily comes with uncertainty. In light of that, Statistics plays a key role in deriving causal claims from non-experimental observations, that is, a probabilistic approach to causal analysis.

1.1. In the search of a causal language

A probabilistic approach to causality

We can, indeed, understand the world we are living in as a set of deterministic causal systems that are unknown and not revealed to the researcher, but that can be somehow inferred from the information obtained from or provided by the environment. However, given the fact that we do not know these causal systems, uncertainty will play a central


role in causal analysis. In simple cases, knowing the causes allows us to predict the consequences without any further effort. For example:

Proposition A: It rains (cause),
Proposition B: The pavement gets wet (effect).

Evidence, logic, or simple common knowledge states that the relationship between propositions A and B goes as A → B. In this particular example uncertainty is not an issue of major concern, and therefore observing A allows for thinking that B will certainly happen as well. Notwithstanding, in more elaborate examples, knowing A will render B more likely to be observed, not absolutely certain (Suppes, 1970). For example:

Proposition A: A patient is given certain medicine X (cause),
Proposition B: The patient recovers from a disease Y (effect).

In this case we have that the relationship between propositions is still A → B, but also that P(B ∣ A) ≥ P(B). As stated above, the occurrence of the cause A increases the probability of occurrence of the consequence B, i.e. P(B ∣ A) > P(B ∣ ¬A) (see Suppes, 1970; Cartwright, 1983, 1989).

In the latter example, the relationship A → B implies an underlying theoretical causal model that is familiar to the researcher, and observational data only allow for verifying the validity of the proposed causal model. However, nothing in the conditional probability P(B ∣ A) alone allows the researcher to elaborate causal claims derived from an external intervention, say A′. Many scientists have proposed different epistemological and empirical frameworks based on conditional probabilities that allow for the elucidation of counterfactuals and/or causal effects of a given treatment variable (A or A′) on an outcome variable of interest (B). However, the problem that most social scientists, biologists, medical researchers, and public policy makers face is what is known as the fundamental problem of causal inference (Holland, 1986): in most cases researchers cannot supply an alternative treatment (A′ ≠ A) to the same subject under the same conditions and/or at the same time in which the original experiment was conducted and, therefore, cannot formulate conclusions about the effect of such treatment on an outcome variable. That is, observational data provide information about P(B ∣ A), but not about P(B ∣ A′).
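The probability-raising claim in the medicine example can be checked with a toy simulation in R; all probabilities below are invented for illustration only:

    set.seed(1)
    n <- 1e5
    A <- rbinom(n, 1, 0.5)                       # cause: patient is given medicine X
    B <- rbinom(n, 1, ifelse(A == 1, 0.7, 0.4))  # effect: recovery is more likely under A
    mean(B[A == 1])                    # estimate of P(B | A),  approx. 0.7
    mean(B[A == 0])                    # estimate of P(B | ~A), approx. 0.4
    mean(B[A == 1]) > mean(B[A == 0])  # TRUE: A raises the probability of B

Observing A makes B more likely, yet nothing in these conditional probabilities alone tells us what would happen under an external intervention on A; that is the gap the causal frameworks discussed below are designed to fill.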

Given that we cannot derive causal claims from observational data alone, causal inference is the scientific process by which these relationships are inferred and (fully) characterized from observational data, but only after assuming a causal model driving the relationships between random variables. Put another way, as described in Pearl (2016), assume an unknown, real-world, invariant, and true data generating process, M, that generates a set of observed random variables (data), D, and an associated multivariate probability distribution, P, as shown in Figure 1.1.

[Figure 1.1. Traditional statistical inference paradigm, adapted from Pearl (2016): a data generating process M produces data D and a joint distribution P, from which a target quantity Q(P) is obtained by statistical inference.]

The target of scientific inquiry in traditional statistical analysis is a probabilistic quantity, Q(P), which summarizes some attribute of D that is of interest to the researcher. Q(P) can be estimated from P and D alone. However, causal analysis is different from statistical analysis in the sense that the former is interested in the effect of an external intervention (treatment) on the causal system M, that is, in what happens when experimental conditions change. This intervention acts as a specific modification to the data-generating

model M, giving rise to an unobserved (counterfactual) set of data D′ and a distribution P′. This ‘change’ is known as the causal effect of the intervention, i.e. the changes in the data generating process that generate the hypothetical (unobserved) D′ and P′. Then, a causal target parameter Q(P′) is computed, a quantity that summarizes the causal effect of the given intervention (or treatment).

The problem is that in observational studies the researcher only has access to D (and therefore P), while D′ (and P′) remain unknown. D or P alone cannot give an answer to the causal quantity of interest. That is why the researcher resorts to a set of (un)testable causal assumptions that allow for estimating Q(P′) from D and P, as in Figure 1.2. With these assumptions at hand, the idea is to mathematically express Q(P′) in terms of both D and P, leaving D′ and P′ out. These assumptions come from the expertise and previous experience of the researcher.

[Figure 1.2. Causal inference paradigm, adapted from Pearl (2016): an intervention modifies the data generating process M, so the observed (D, P), combined with causal assumptions, are used to infer the counterfactual distribution P′ and the causal target quantity Q(P′).]

Moreover, given P, D consists of a sample of randomly distributed exogenous (treatment) variables, T ∈ D, from which the researcher is to deduce the causal structure that determines the values of the endogenous (outcome) variables, Y ∈ D, from a set of possible structures (including the true one, M), by imposing testable assumptions on the functional relationships (linear or nonlinear) between input and output variables, and


distributional forms for exogenous variables. Also, confounding and/or baseline variables may be present, X ∈ D, so that spurious relationships between an outcome Y and a treatment T generated by X should be ‘weeded out’ from the true causal effects of T on Y. To do that, the concept of conditional independence is key to probabilistic causal inference. Based on Dawid (1979) and Pearl and Paz (1987) (to be formally defined in the subsequent sections), given the random variables Y, T, and X, the first two are conditionally independent given the third, denoted (Y ⊥ T ∣ X), if once we know the value X = x, knowing the value obtained by T does not provide any further (causal) information about Y. In words, causal quantities can only be estimated when, in a causal model M, a hypothetical randomized trial is mimicked by conditioning on the right set of variables X.

Once the causal system M and the links of both P and D to the causal model have been specified, the next step in causal inference is to define the target parameter or quantity Q(P′). As described in Petersen and van der Laan (2014), the researcher should translate the scientific inquiry into a formal causal quantity, usually defined as a (set of) parameter(s) of the joint distribution of the counterfactual scenario, P′. Several causal quantities can be of interest to the researcher, such as the Average Causal Effect (ACE), the Average Treatment Effect on the Treated (ATT), the Conditional Average Treatment Effect (CATE), the Population and Sample Average Treatment Effects (PATE and SATE) (see Rubin, 1974; Holland, 1986, 1988; Imbens, 2004, and references therein), the Average Causal Mediation Effect (ACME) (see Baron and Kenny, 1986; Imai et al., 2010a,b,c, and references therein), and Direct and Indirect Effects (DE and IE) (see Pearl, 2001b, 2005, 2009b, chapter 4, and references therein). Further details on some of these quantities are presented in the Appendix. The estimators involve expressions of P′ in terms of P, and can therefore be computed through a handful of statistical methods. It is up to the researcher to decide which one is the most appropriate tool for his/her scientific and empirical goals.

Following Barringer et al. (2013), in the next subsections we present a brief but concise historical compilation of theoretical and empirical developments in the causal inference literature that allow for the specification of M and the subsequent estimation of causal quantities Q(P′) (in either parametric or non-parametric ways) based on conditional independences between manifest random variables. Each of these approaches comes with identification rules and particular causal assumptions. Some models require stronger assumptions than others, but come with easily implementable statistical estimation methods and interpretability. Yet again, the choice is almost a matter of philosophical debate.
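As a minimal illustration of expressing Q(P′) in terms of P and D, the following R sketch estimates the ACE of a binary treatment by standardizing over a single binary confounder X; all mechanisms and coefficients are invented for illustration:

    set.seed(2)
    n  <- 1e5
    x  <- rbinom(n, 1, 0.4)                        # confounder
    tr <- rbinom(n, 1, plogis(-1 + 2 * x))         # treatment assignment depends on x
    y  <- rbinom(n, 1, plogis(-1 + tr + 1.5 * x))  # outcome depends on treatment and x

    # Naive associational contrast, confounded by x:
    mean(y[tr == 1]) - mean(y[tr == 0])

    # Adjusted contrast: ACE = sum_x [P(y | tr = 1, x) - P(y | tr = 0, x)] P(x),
    # which mimics a randomized trial by conditioning on the right set x
    sum(sapply(0:1, function(xv)
      (mean(y[tr == 1 & x == xv]) - mean(y[tr == 0 & x == xv])) * mean(x == xv)))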

Rubin’s Causal Model, Randomization, and the Potential Outcome Framework

This approach views experiments as the ruling paradigm in causal discovery. It is grounded on the concept of random treatment assignment, a critical condition in experimental designs. Holland (1986) presents a rather epistemological discussion about causality, from which we highlight his interpretation of a causal effect when establishing a bridge between Rubin’s Causal Model (RCM) and Mill’s (1843) ideas about the subject. Holland argues that, once we define the treatment, T, and the outcome variable of scientific interest that depends on (is a function of) the treatment,


Y(T), causes are the “circumstances” in which instances Y(T) and Y(T′) differ, once the researcher controls for variables X that do not differ between experiments (pretreatment variables, not affected by the treatment). In this case, the differences in Y are caused by the different treatment regimes T and T′, once X remains constant, as if experimental conditions were met.

Now, assume a sample of size N. In order to assess the causal effect of a treatment, Ti = ti, on a variable of interest, Yi(Ti = ti), for a particular individual i ∈ {1, ..., N} (subscript notation), the former should have been assigned following random mechanisms that do not obscure the outcomes of an experimental design. That is, results should be comparable to those hypothetically obtained if the control treatment had been applied to the same individual, Yi(Ti = t′i) (under the same experimental conditions). The idea behind random treatment assignment is that, once we control for pretreatment variables xi ≅ xi′, differences between observed values yi and yi′ are only due to differences between ti and ti′, which are assigned by chance. However, controlled randomized experiments are not always possible and researchers are obligated to resort to non-randomized observational data. The solution is to statistically mimic experimental conditions and randomized treatment assignment.

Deep within the roots of this framework lies the well-known motto “No causation without manipulation” (Holland, 1986). The idea behind this assertion is that ideal experimental conditions are required to establish causal effects from treatments that are willfully but randomly assigned to an individual. If these ‘laboratory conditions’ cannot be fully met, even by some statistical procedure (Holland, 2003), then causal claims lack validity. The role of randomization was first presented by Rubin (1974, 1978), who was largely influenced by the pioneering work of Neyman (1923) on the design of experiments in the agricultural sciences.

The main idea of the RCM is, intuitively, to estimate the overall causal effect of a treatment Ti = ti (over a control Ti = ci, absence of treatment) as the difference between what would have happened if the unit i had been exposed to the treatment Ti = ti, Yi(ti), and what would have happened if i had been exposed to Ti = ci, Yi(ci). This definition uses a formal language full of counterfactuals, the mathematical basis of the potential outcome framework (Rubin, 2005). More specifically, given a set of exogenous (not ‘affected’ by the treatment) variables Xi, define Yi(Xi, ti) as the value of an outcome variable Y measured for a particular unit i, with pretreatment variables Xi, who was randomly assigned to the treatment group (i.e. received Ti = ti), and Yi(Xi, ci) as the value of Y measured for the same unit i given that it was assigned to the control group, Ti = ci. Then, the causal effect of Ti = ti over Ti = ci for unit i is defined as

τi = Yi(Xi, ti) − Yi(Xi, ci).

The problem is that, since we are using non-experimental data, only one of the latter values is actually measured for i. These values, Yi(ti) and Yi(ci), are known as the potential outcomes for unit i, but only one of them is actually observed from the observational study design. In many cases, the other is estimated from the data. Therefore, causal inference within the RCM framework is reduced to a missing data problem.


The main assumption in the RCM is related to the way missing information is estimated from observed data. The missing potential outcome for a unit i with observed treatment ti is computed using information from other units, i′ ≠ i, with similar pretreatment variables, xi ≅ xi′, but with different treatments, Ti′ = ci′. Put another way, in order to compare potential outcomes, the researcher is urged to find comparable units, that is, to increase homogeneity in the counterfactual scenario (Sekhon, 2008). Suppose that the researcher observes Yi(ti), but knows, from the random nature of treatment assignment, from previous information, or by some other reasoning, that the way i would respond to a hypothetical intervention Ti = ci is about the same way another individual i′ responds to Ti′ = ci′, a unit for which Yi′(ci′) is measured. By some ‘matching’ procedure, the researcher is able to ‘inform’ the potential outcome for i, Yi(ci), with the extra information obtained from i′, and is therefore able to compute the causal quantity of interest. In essence, matching is about finding units in the sample that did not receive the treatment but that are statistically equivalent (in terms of pretreatment covariates) to those that actually received it. Unit homogeneity is an important requirement for claiming causal effects in observational studies; otherwise, estimates of causal quantities will be biased and will yield misleading conclusions.

In the RCM, three other assumptions are needed in order to obtain unbiased estimates of causal quantities using observational data from a ‘homogenized’ sample: i) the perfect compliance assumption, that is, individuals who are randomly assigned to receive a treatment do, in fact, take the treatment; ii) the unconfoundedness/ignorability assumption, which states that treatment assignment is (conditionally) independent of the outcomes; and iii) the stable unit treatment value assumption (SUTVA), which assumes no interference between units and that the treatments for all units are comparable (Rubin, 1978), i.e. potential outcomes for one unit do not depend on the treatment assigned to any other unit in the sample. The latter is of special interest in this thesis, since its violation is very common in social science contexts, where interference between units is frequent.

Unit homogeneity is thus achieved by using matching methods (Rubin, 1977; Rosenbaum and Rubin, 1983; Rubin, 2006) or other related statistical methods, such as regression discontinuity designs. It has also been argued that the bias and variance of causal parameters of interest decrease as the homogeneity of the sample increases (Rosenbaum, 2005). This document does not present further explanations of matching methods, but we refer the reader to Rosenbaum and Rubin (1983, 1984) and Rubin (1973, 2006) for the relevant literature on propensity score matching (PSM); to Cochran and Rubin (1973) and Rubin (1979, 1980) on multivariate matching based on the Mahalanobis distance; and to Sekhon (2009) and others on more advanced matching algorithms, such as Genetic Matching.
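As a hedged base-R sketch of the matching idea (simulated data and invented coefficients; applied work would rely on the dedicated implementations cited above), the following estimates the ATT by nearest-neighbour matching on an estimated propensity score:

    set.seed(3)
    n  <- 2000
    x1 <- rnorm(n); x2 <- rnorm(n)                     # pretreatment covariates
    tr <- rbinom(n, 1, plogis(0.5 * x1 - 0.5 * x2))    # non-random treatment assignment
    y  <- 1 + 2 * tr + x1 + x2 + rnorm(n)              # true treatment effect is 2

    ps <- fitted(glm(tr ~ x1 + x2, family = binomial)) # estimated propensity scores

    # For each treated unit, find the control with the closest propensity score
    treated  <- which(tr == 1)
    controls <- which(tr == 0)
    match_id <- sapply(treated, function(i)
      controls[which.min(abs(ps[controls] - ps[i]))])

    mean(y[treated] - y[match_id])                     # ATT estimate, close to 2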

Angrist’s Instrumental Variables and Heckman’s Control Functions

Instrumental variables (IV) are commonly used by economists and econometricians in causal inference because of their capability to mimic random treatment assignment. This approach is quite similar to the RCM framework, but the computation of causal quantities resorts to different estimation methods. The pioneering works of Angrist (1990) and Angrist


et al. (1996) were the first to use IVs for estimating causal quantities and to stress the relationship between IV and the RCM. The main assumption in this framework is that IVs are random variables Z that affect the assignment of a given treatment T, but do not have direct effects on the outcome variable Y. This assumption (known as the exclusion restriction) is also seen as a weakness of the IV approach to causal inference, since it is quite difficult to assess the validity and exogeneity of the instruments from the unexplained part of the model. Nonetheless, the IV approach is quite useful when the SUTVA assumption is violated, since controlling for an instrument Z (a procedure that mimics a random assignment of the treatment) that affects the outcome variable Y only through the treatment T assures that treatment assignment is independent from those of other observations, i.e. Y ⊥ T ∣ Z. Also, IVs allow for estimating causal quantities such as the local average treatment effect (LATE; Imbens and Angrist, 1994), which are more complex in nature and provide some insights into the causal system when RCM assumptions are violated.

On the other hand, Heckman’s control functions (Heckman, 2005) serve as an approximation to causal inference when the exogeneity between treatment assignment and outcome (strong ignorability) assumption is violated. This happens when, for example, individuals in the sample are more prone to ‘force’ themselves into the treatment (self-selection bias) and, therefore, treatment assignment becomes a non-random process, different from what is assumed in the RCM. Heckman’s control functions are akin to the IV approach in the sense that the researcher needs to model the endogeneity in the assignment mechanism by identifying an unobservable factor whose impact on the outcome variable is restricted to be indirect, only through the matching equation (Barringer et al., 2013). The reader may refer to Heckman and Vytlacil (2007a,b) for deeper insight into causal inference using observational data.
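A minimal sketch of the IV logic via two-stage least squares on simulated data (all mechanisms and coefficients invented; real applications would use dedicated routines that also produce correct standard errors):

    set.seed(4)
    n  <- 5000
    z  <- rbinom(n, 1, 0.5)              # instrument: affects tr, but not y directly
    u  <- rnorm(n)                       # unobserved confounder of tr and y
    tr <- 0.8 * z + 0.5 * u + rnorm(n)   # treatment
    y  <- 2 * tr + u + rnorm(n)          # true causal effect of tr on y is 2

    coef(lm(y ~ tr))["tr"]               # OLS estimate, biased upwards by u

    tr_hat <- fitted(lm(tr ~ z))         # first stage: project tr on the instrument
    coef(lm(y ~ tr_hat))["tr_hat"]       # second stage: approx. 2, the causal effect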

Robins’ “G-methods”

The approach presented by Robins (1986, 1987, 1997) and colleagues is a bridge between the potential outcome framework and the structural causal model proposed by Pearl and others (to be presented below). Robins was concerned with the causal effects of a time-varying treatment on an outcome variable of interest within a framework with (possibly time-varying) concomitant variables and confounders. This is why Robins’ approach to causal inference is very popular among epidemiologists and public health and medical researchers (see, for example, Robins et al., 2004). The key assumptions are similar to those in the RCM. However, given the different nature of the problem and of the causal parameter of interest, matching methods in the PO framework will yield inconsistent and biased estimators of the causal quantity if the study contemplates time-varying confounders that are themselves affected by past treatment exposures.

We extend the notation of the previous subsections. Let i be an individual belonging to a sample of N units. Also, let t denote a follow-up visit, with t ∈ {1, ..., T + 1}, occurring within the lapses τ1, ..., τT+1. Robins (1986, 1987) expanded the static framework in Rubin (1974, 1978) to a longitudinal setting with direct and indirect effects of time-varying treatments (Ai,t) on confounders (Li,t), concomitant variables (Xi,t), and an outcome of interest (Yi), for an individual i at time t. That said, the RCM is a particular case of Robins’


approach in a baseline scenario with a unique time point t = t0.

Consider a longitudinal setting in which the N individuals enter the study at an initial time τ0 and receive a baseline treatment Ai,0. The intervention is assumed to be assigned at random. A collection of covariates Li,0 and Xi,0 (confounders and concomitant variables, respectively) is measured. There are t = 1, ..., T + 1 follow-up visits occurring at times τ1, ..., τT+1, where measures are taken on the treatment variable (Ai,t) and on the covariates (Li,t and Xi,t) for each subject i. An outcome variable Yi is measured at the end of the study, i.e. at visit T + 1. The treatment history for subject i up to visit t is denoted Āi,t = (Ai,0, ..., Ai,t). Similarly, we denote the confounders’ and concomitants’ histories as L̄i,t and X̄i,t, respectively. For the sake of simplicity, in this subsection we assume that concomitant variables are implicitly included in L̄i,t and are not explicitly expressed. Denote Āi ≡ Āi,T the treatment history up to the end of the study (and likewise for the confounder and concomitant variables). Assuming either binary or continuous treatments, Āi takes values on the Euclidean space A = A0 × ⋯ × AT, where each At is the set of all possible values of the treatment (common to all subjects) at visit t. Therefore, for every realization āi ∈ A, we define a counterfactual (potential outcome) Yi^āi as the value that the outcome variable of interest for subject i would have attained had i been exposed to treatment history āi. It is clear that Robins’ potential outcomes, Yi^āi, are similar to Rubin’s in the RCM, Yi(ai), but with a dynamic treatment history instead of a static one.

As in every causal inference framework, we need untestable assumptions to claim causal results and to identify the effect of a time-varying treatment on the outcome variable of interest. First, we assume no interference between units (akin to SUTVA) and consistency, that is, Yi = Yi^Āi. We also need the conditional exchangeability assumption (Y^ā ⊥ At ∣ L̄t, Āt−1, for all ā ∈ A and all t ∈ {0, ..., T}; in words, “treatment assigned at random given the past”) and the positivity assumption (if P_L̄(l̄) ≠ 0 then P_A(ā ∣ l̄) > 0), as outlined in Robins and Hernán (2009). Given the latter, a formal definition of the causal effect of a sequential treatment exposure on the outcome variable of interest can be formulated.

Definition 1.1.1. (Causal Effect; Daniel et al., 2013): Given Y, the support of Y, the causal effect of an intervention history Ā on Y is the mapping q : A × Y → R+, where q(ā, y) gives the value of the probability function of Y^ā evaluated at y, P(Y^ā = y ∣ Ā = ā), provided ā ∈ A and y ∈ Y.

Definition 1.1.1 is akin to the one in Pearl’s causal model (Definition 1.2.9), to be presented in the following subsections. In essence, the causal effect of a time-varying intervention is the conditional probability distribution defined over the outcome variable Y when Ā = ā. This conditional probability distribution can be graphically represented by an event tree that Robins (1986, 1987) called Causally Interpreted Structured Tree Graphs (CISTGs). A natural extension, the random CISTGs (RCISTGs), is an event tree in which the treatment received at visit t is randomly allocated given the past values of the treatment and other covariates (following the conditional exchangeability assumption).
Given this graphical representation, the close relationship with the structural causal model (SCM) of Pearl and others (to be presented in the next subsection) has been extensively argued by both authors (see, for example, Robins, 1995; Greenland et al., 1999; Pearl, 2001a and


Pearl, 2009b, sections 3.6.4 and 11.3.7). Robins (1986) proved that, within the RCISTGs framework, if the exchangeability, consistency, and positivity assumptions hold, the causal effect P(Y^ā = y ∣ Ā = ā) of a time-varying intervention ā on Y can be further expressed as

    P(Y^ā = y ∣ Ā = ā) = ∑_{l̄_T} P(y ∣ l̄_T, ā_T) ∏_{t=1}^{T} P(l_t ∣ l̄_{t−1}, ā_{t−1})    (1.1)

Equation (1.1) is known as the “G-computation formula”. In essence, this equation expresses the counterfactual distribution P′ in terms of P and D, which are the non-experimental data observed by the researcher. With the causal effect defined and identified, a more ‘parsimonious’ and concise causal quantity of interest, Q(P′), can be estimated. Very often this quantity is the average causal effect of a particular treatment history ā on Y, i.e. E(Y^ā). Note, however, that a conventional potential outcome approach would fail to yield causally interpretable results. Assume a binary treatment at every time t. The main drawback is that we would have to deal with a large number of counterfactual treatment regimes (i.e. 2^{T+1} values of ā) and, hence, with the same number of potential outcomes {E(Y^ā) : ā ∈ A}. Robins and coauthors came up with statistical methods to estimate this target causal parameter for a time-varying intervention: the G-computation algorithm (Robins, 1986, 1987); a class of semi-parametric estimators of Structural Nested Models (SNM) (Robins et al., 1992; Robins, 1992, 1993); and double-robust, including inverse probability weighted (IPW), estimation of Marginal Structural Models (MSM) (Robins, 1998; Robins et al., 2000). We do not extend much on the latter, but an accessible introduction to these methods is given in Robins and Hernán (2009), Daniel et al. (2013), and Vansteelandt and Joffe (2014).
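To make (1.1) concrete, the following R sketch applies g-computation to a simulated study with two treatment visits and one binary time-varying confounder; the data generating process and all coefficients are invented for illustration:

    set.seed(5)
    n  <- 1e5
    a0 <- rbinom(n, 1, 0.5)                      # baseline treatment (randomized)
    l1 <- rbinom(n, 1, plogis(-0.5 + a0))        # confounder, affected by past treatment
    a1 <- rbinom(n, 1, plogis(-1 + 2 * l1))      # later treatment depends on l1
    y  <- 1 + a0 + 2 * a1 + 1.5 * l1 + rnorm(n)  # outcome at the end of the study

    # E[Y^{a0=1, a1=1}] = sum over l1 of E[Y | l1, a0=1, a1=1] * P(l1 | a0=1),
    # the continuous-outcome analogue of equation (1.1)
    fit_y <- lm(y ~ a0 + a1 + l1)
    p_l1  <- mean(l1[a0 == 1])                   # P(L1 = 1 | A0 = 1)
    sum(sapply(0:1, function(l)
      predict(fit_y, newdata = data.frame(a0 = 1, a1 = 1, l1 = l)) *
        ifelse(l == 1, p_l1, 1 - p_l1)))

Note that a naive regression of y on a0 and a1 that simply adjusts for l1 would not recover the total effect of the baseline treatment, precisely because l1 is itself affected by a0; this is the problem the g-formula is designed to solve.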

Pearl’s Structural Causal Model (SCM)

This approach was a major breakthrough in the understanding of causal systems, as it views the system as a set of ‘mechanisms’ (unknown to the researcher) that can be individually studied, modeled, and changed. Pearl’s Structural Causal Model (SCM) went even further than the RCM, and aimed at the identification of the causal model M from observed data D and its associated distribution P. Causal effects are defined as changes in M due to an external intervention on the functional mechanism of a variable T, expressed in terms of a logical operator, do(T = t). By applying such a manipulation to the system, a modified sub-model M′ is generated, from which hypothetical (non-observable) D′ and P′ are expected to be obtained. The main objective, yet again, is to translate those counterfactuals in terms of observed, non-experimental data.

Pearl’s SCM needs more untestable assumptions than other approaches in order to provide causal interpretability to the estimation results. In this case, assumptions are mostly related to the structure imposed on M. That is, based on expertise or previous results, the researcher decides (or infers from data) how random variables in D are


causally interrelated. The latter means defining conditional (in)dependencies between random variables in D. Other assumptions are similar to those already explained for the RCM and Robins’ approach.

The SCM builds on the theoretical bases set by Wright (1920, 1921, 1934) in what is known as the path analysis method. In his work, Wright translated (conditional) correlations into path coefficients (standardized regression coefficients) with a causal interpretation. The key to understanding path coefficients as carriers of causal information is to realize that there are usually prior beliefs, based on the researcher’s previous experience or on experimental grounds, that allow for claiming that some factors (variables) are direct causes of variations in others, or that pairs of variables are related if both are effects of a common cause (Wright, 1921, page 559). Path coefficients are ultimately interpreted as the fraction of the standard deviation of the outcome (dependent) variable for which the factor (exogenous variable) is directly responsible, keeping all other factors constant. However, Wright himself argued that path coefficients (derived from partial correlations) have to be interpreted carefully when it comes to deriving causal claims, because of simultaneous causation or an improper definition of the diagram that represents the relationships between factors (Wright, 1934). In addition to path coefficients, another contribution by Wright was the path diagram, a graphical representation of the causal relationships between factors that shows direct and indirect effects.

Despite path analysis being formulated in the early 1920s and formalized in the 1930s, it was not until the 1960s that applied social scientists began using it to study the (reciprocal) causal relationships between socioeconomic variables (Duncan et al., 1968). Path analysis was complemented with features from simultaneous equation models (Haavelmo, 1943) and from factor analysis (Spearman, 1904). Later, in the 1970s, Jöreskog (1973), jointly with contributions from Keesling (1972) and Wiley (1973), developed an analytical framework known as Structural Equation Models (SEM), or LISREL (LInear Structural RELationships; Jöreskog and Sörbom, 1996). SEMs are a statistical tool used in the social sciences to assess causal problems and to estimate direct and indirect effects of treatments on a multivariate set of outcome variables (see, e.g., Goldberger, 1972, 1973; Goldberger and Duncan, 1973; Duncan, 1975). The causality-oriented focus was not fully acknowledged until the publication of Bollen’s (1989) Structural Equations with Latent Variables. Bollen emphasized the importance of previous/expert knowledge in model (path) building. However, SEMs themselves are not a formal causal framework, but a statistical method that lacked a causal theory.

It was the work of Pearl (1995, 2000, 2009b) and Spirtes et al. (1991, 2000), who built upon the advances in probabilistic graphical models and causal analysis in the context of artificial reasoning (Pearl and Paz, 1987; Pearl, 1986, 1988b; Verma and Pearl, 1988), that set the foundations of what we currently know as the Structural Causal Model (SCM). Pearl and coauthors allowed for the formulation of causal effects of interventions (as defined above), following graphical criteria backed by solid formal probabilistic foundations.
The definition of causal effects in the SCM is akin to that presented in (1.1), as shall be explained in the following subsection. Causal quantities can be estimated after defining the causal effect formula. Identification of the model is achieved following a


graphical representation of the causal system (usually a Directed Acyclic Graph, DAG). The natural statistical method to estimate such target causal parameters is the parametric SEM, as it defines a set of functional causal relationships that represent the modular way nature works. Interventions (treatments) are modeled through the do-operator. An introductory, yet detailed, explanation of the SCM can be found in Pearl (2009a, 2010a,b) and Spirtes (2010). SEM’s parametric approach has been found useful and accurate when estimating causal quantities of interest (see Chapter 2). More recently, fully nonparametric methods have been developed, such as the Targeted Maximum Likelihood Estimation (TMLE) approach (van der Laan and Rubin, 2006; van der Laan, 2010a,b; van der Laan and Rose, 2011). As a truly nonparametric method, TMLE is a perfect match for the SCM, as Pearl himself acknowledges. However, a detailed explanation of TMLE is out of the scope of this document.
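To fix the meaning of the do-operator before moving on, the following sketch (a hypothetical linear SCM with an unobserved confounder u; every coefficient is invented) contrasts conditioning with intervening: P(Y ∣ T = 1) is obtained by filtering observed data, while P(Y ∣ do(T = 1)) is obtained by replacing the mechanism that generates T in the simulated system:

    set.seed(7)
    n <- 1e5
    u <- rnorm(n)                          # unobserved common cause of t and y
    t_obs <- as.numeric(u + rnorm(n) > 0)  # observational mechanism for t
    y_obs <- t_obs + 2 * u + rnorm(n)      # outcome mechanism

    mean(y_obs[t_obs == 1])  # E[Y | T = 1]: mixes the effect of t with that of u

    # do(T = 1): sever the u -> t mechanism and set t = 1 for every unit
    y_do <- 1 + 2 * u + rnorm(n)
    mean(y_do)               # E[Y | do(T = 1)] approx. 1, the true interventional mean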

1.2. Going deeper into Pearl’s Structural Causal Model (SCM)

It is important to distinguish between standard statistical analysis and causal analysis. On the one hand, the former aims to assess and estimate parameters of a probability density from samples drawn from that distribution, such as regression parameters and/or (conditional) correlations, and to use them to claim relational associations between variables of interest. These inferences are valid under the assumption of stable experimental conditions, that is, that no conditions are changed by means of the introduction of treatments or external interventions to the causal system. On the other hand, causal analysis is concerned not only with the associational relationships under static conditions, but also with what happens when external interventions (treatments) are introduced into the system (Pearl, 2009a,b).

More explicitly, a joint probability distribution alone cannot tell us how that distribution changes in the presence of an external change. Instead, this information must be provided by causal assumptions which identify relationships that are not affected by dynamic external conditions. These assumptions cannot, in principle, be tested from observational data; they are theoretically or judgmentally based claims related to the way the researcher understands the world. By testable, Pearl refers to the ideal situation in which direct manipulation by the researcher can take place, e.g. in controlled experimental settings, something that is not likely to happen in the Social Sciences. Therefore, it is clear that any scientific claim arguing causal relationships between variables needs causal assumptions that cannot be inferred from, or even defined in terms of, standard probabilistic language alone.

In the SCM, untestable assumptions are typically directed relationships between pairs of variables X, Y belonging to a set of measurements, D, from individuals (units) i ∈ {1, ..., N} in a non-experimental setting. For consistency of notation with the SCM literature, define V ≡ D from here onwards. X is assumed to be a direct cause of Y if there exists a causal path, defined as a collection of edges, from X to Y. These assumptions are


clearly not verifiable from the joint probability distribution defined over V. However, these directed links are, in fact, related to conditional probabilities in such a way that Y is independent of X (Y ⊥ X) if X and Y are not assumed to share a causal connection (X ↛ Y), as we shall see promptly. The process by which the researcher aims to define a logically well-defined set of directed links consistent with the joint distribution implicit in the observed data is known as causal discovery. The set of variables, together with the set of causal (directed) links consistent with the joint probability distribution governing the DGP, has a visual representation, called a graph, that allows for straightforward interpretations of the causal system itself and the causal relationships therein assumed. We will not extend much on the explanation of causal discovery, since it is a wide topic that deserves rigorous attention by itself but is off the scope of this document.

With the mathematical relationships between graphs and probabilistic dependencies being formally treated in the 1980s by computer scientists, mathematicians, and philosophers, important developments were achieved on how causal relationships can be inferred from observational data (after making certain assumptions about the underlying DGP). Moreover, computational improvements fostered the development of complex algorithms designed to find patterns of conditional independencies from the true DGP that were also coherent with partial sections of the assumed causal model (see Pearl, 2009a, Chapter 2, for an introduction to causal discovery). Algorithms such as the Inductive Causation and Inductive Causation with latent variables algorithms (IC and IC*; Verma and Pearl, 1990; Verma, 1993), the PC algorithm (Spirtes and Glymour, 1991), and the FCI algorithm (Fast Causal Inference; Spirtes et al., 2000; Spirtes, 2001), among others explained in greater detail in Spirtes et al. (2000, 2010), are designed to suggest candidate causal models that i) follow the assumptions encoded in the causal paths defined by the researcher, ii) are capable of generating the true DGP, and iii) follow the model minimality, Markov, and stability conditions/assumptions defined in Pearl (2009a, Chapter 2), requirements necessary to efficiently distinguish causal structures from non-experimental data. In this document we assume that the researcher has already ‘discovered’ a set of plausible equivalent causal structures that are consistent with the joint probability distribution governing the observed data.
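Although causal discovery is off the scope of this document, the following hedged sketch shows how such an algorithm is typically invoked in practice, here the PC-stable variant as implemented in the R package bnlearn (assuming the package is installed; data are simulated from a hypothetical chain x → z → y, so the recovered skeleton should be x - z - y):

    library(bnlearn)
    set.seed(6)
    n <- 5000
    x <- rnorm(n)
    z <- 0.8 * x + rnorm(n)
    y <- 0.8 * z + rnorm(n)
    d <- data.frame(x, z, y)

    pc.stable(d)   # tests conditional independencies in d and prints the
                   # estimated (partially directed) equivalence-class structure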

From Conditional Independence to Probabilistic Graphical Models (PGM) to Directed Acyclic Graphs (DAG)

As we have already emphasized, causes only render their consequences more likely, not absolutely certain (Pearl, 2009a). This allows us to understand causal relationships between variables in terms of conditional probabilities and, thus, conditional independence. Some definitions and properties follow.

Definition 1.2.1. (Conditional Independence; based on Dawid, 1979): Let V be a set of discrete or continuous random variables. Let P(⋅) be a multivariate probability density defined over V, and let X, Y, and Z be any three disjoint subsets of V. X and Y


are said to be conditionally independent given Z = z if P(x ∣ y, z) = P(x ∣ z) whenever P(y, z) > 0, and we write X ⊥ Y ∣ Z.

In words, “learning the value of Y does not provide additional information about X, once we know Z” (Pearl, 2009a, page 11). To denote conditional independence of X and Y given Z we write (X ⊥ Y ∣ Z), or more specifically (X ⊥ Y ∣ Z)_P to explicitly denote conditional independence under a certain probability measure P. Some useful properties satisfied by the conditional independence relation are (Dawid, 1979):

• Symmetry: If (X ⊥ Y ∣ Z), then (Y ⊥ X ∣ Z);
• Decomposition: If (X ⊥ YW ∣ Z), then (X ⊥ Y ∣ Z);
• Weak union: If (X ⊥ YW ∣ Z), then (X ⊥ Y ∣ ZW);
• Contraction: If (X ⊥ Y ∣ Z) and (X ⊥ W ∣ YZ), then (X ⊥ YW ∣ Z); and
• Intersection: If (X ⊥ W ∣ ZY) and (X ⊥ Y ∣ ZW), then (X ⊥ YW ∣ Z).

These properties were independently presented in Pearl and Paz (1987) and Geiger et al. (1990); Geiger and Pearl (1990, 1993) under the name of graphoid axioms. This set of axioms is important and serves as a basis for the definition of informational relevance, especially in computation and AI frameworks (Pearl, 1988b). More specifically, in the context of graphs, these properties assure that (X ⊥ Y ∣ Z), read as ‘X is conditionally independent of Y given Z’, is translated from a probabilistic language into that of graphs as ‘all paths from X to Y are intercepted by Z’.

Graphs or, more specifically, Probabilistic Graphical Models (PGM), are representations of the multivariate probability density defined over the measurable space generated by a set of random variables. PGMs summarize the conditional (in)dependence relationships between random variables given a previously defined (causal) graph structure. An introductory approach to PGMs can be found in Koller and Friedman (2009) and Pearl (1988b). Some advantages of PGMs over conventional multivariate probability functions are i) their ease of interpretation, ii) their capability of representing high-dimensional multivariate probability functions (in the sense that for a set of random variables of cardinality M, we would have to deal with 2^M conditional probabilities), and iii) their adaptability to graph theory. The link between PGMs and multivariate density functions is possible due to the Hammersley-Clifford theorem (Hammersley and Clifford, 1971), which ultimately relates Markov properties and the factorization of conditional probabilities over graphs. More formally,

Definition 1.2.2. (Graph; Koller and Friedman, 2009): A graph G is a mathematical object consisting of a set V of random variables, each one of them called a node, and a set E of edges (links) that connect pairs of nodes. The notation used throughout this document will be G = ⟨V, E⟩ when referring to a graph G with nodes V and edges E. The (absence of) edges between two variables Vi, Vj ∈ V represents conditional (in)dependencies.

For example, given the set of variables V = {V1, V2, V3, V4, V5} and the set of edges E = {(1, 2), (1, 3), (3, 5), (4, 5)}, the graph G = ⟨V, E⟩ is represented as in Figure 1.3. The most basic type of graphs, upon which we build, are known as Markov Networks (MN) or Markov Random Fields (MRF), and were introduced by Besag (1974).

[Figure 1.3. G = ⟨V, E⟩, an undirected Markov Network over the nodes V1, ..., V5 with edges E = {(1, 2), (1, 3), (3, 5), (4, 5)}.]

MRFs are undirected graphs; that is, they do not carry causal assumptions, and links merely represent symmetric probabilistic dependencies between nodes (random variables). This type of graph satisfies the Markov property: the conditional probability for a given node is ‘memoryless’ (adds no new information) with respect to the variables to which it is not connected. More formally, given the undirected graph G = ⟨V, E⟩, the set of variables X = {Xν} ∈ V forms an MRF with respect to the graph G if the following (non-equivalent) Markov properties are satisfied:

• Pairwise Markov property: Xu ⊥ Xv ∣ X_{V∖{Xu, Xv}} if (u, v) ∉ E; i.e. any two non-adjacent nodes are conditionally independent given all the other variables in V;
• Local Markov property: Xv ⊥ X_{V∖NG[v]} ∣ X_{NG(v)}, where NG(v) is the neighbourhood set¹ of the node Xv in the graph G and NG[v] is the closed neighbourhood set² of the node Xv; i.e. a variable is conditionally independent of all other variables given its neighbours; and
• Global Markov property: XA ⊥ XB ∣ XC; any two sets of nodes {A, B} ⊂ V are conditionally independent given a separating set C ⊂ V, i.e. a set of nodes such that every path from a node in A to a node in B goes through C.

In order to make the Markov properties clearer, we present an example using the graph G of Figure 1.3. We have that V4 ⊥ V3 ∣ {V1, V2, V5} (pairwise Markov property), V1 ⊥ {V5, V4} ∣ {V2, V3} (local Markov property), and, given X = {V1, V2}, Y = V4, and Z = {V3, V5}, X ⊥ Y ∣ Z (global Markov property). However, the type of graphs that we are interested in are known as Bayesian Networks (BN) or Directed Graphs (DG), as presented in the context of causal inference and causal discovery by Pearl (1988b) and, from a mathematical point of view, by Thulasiraman and Swamy (1992) and Lauritzen (1996).

¹ The neighbourhood set of a node Xv in a graph G is the induced subgraph of G consisting of all nodes adjacent to Xv, i.e. the induced subgraph G[S](v) = ⟨S, E*⟩, such that S := {Xu ∈ V : u ≠ v, (u, v) ∈ E} with S ⊂ V, and E* := {(a, b) ∈ E : {Xa, Xb} ∈ S} ∪ {(a, v) ∈ E : Xa ∈ S}.
² The closed neighbourhood set of a node Xv is defined as NG[v] := Xv ∪ NG(v).

More specifically, we are interested in those graphs that display acyclic behaviour (Directed Acyclic Graphs, DAGs). The term ‘directed’ comes from the assumption of directional (causal) associations between random variables. That is, the flow of information goes from one node to another, but not the other way around (except in graphs with bidirected links, which shall not be considered in this document). A directed edge is known as a path and, given a graph G = ⟨V, E⟩, it can be understood as a pair of nodes (u, v) ∈ E such that the former (known as the parent) directly influences, or induces a dependency relationship on, the latter (known as the child); it is represented as Xu → Xv. We recall that DAGs also satisfy the Markov conditions described above, but the presence of directed paths requires additional definitions. We begin by defining the set of nodes that renders the Markov properties satisfied in a given DAG G.

Definition 1.2.3. (Markovian Parents; Pearl, 2009b): Let V be a set of random variables and P a joint probability measure defined over the measurable space generated by V. A set of variables PA_j ⊂ V is known as the set of Markovian parents of V_j if PA_j is a minimal set of predecessors of V_j that, once conditioned on, renders V_j independent of all its other predecessors. In other words, PA_j is any subset of V that satisfies P(V_j ∣ PA_j) = P(V_j ∣ V_1, ..., V_{j−1}), and no other proper subset of PA_j satisfies the latter, for every V_j ∈ V.

It can be shown that the set of Markovian parents PA_j is unique whenever P belongs to the space of all positive measures M, P ∈ M (Pearl, 1988a). Moreover, making use of the chain rule of probabilities, we define the Markov property for a given DAG G.

Definition 1.2.4. (Markov Property for DAGs; Lauritzen, 1996): Let G be a DAG. We say that a probability density P ∈ M obeys the Markov properties of G if

    P(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} P(x_i ∣ x_1, ..., x_{i−1}) = ∏_{i=1}^{n} P(x_i ∣ pa_i),

where the ordering of the variables is consistent with G (the first equality is the chain rule; the second uses the Markovian parents of Definition 1.2.3). The set of all positive densities obeying the Markov properties for G is denoted by M(G).

Whenever we define a probability function P ∈ M(G) that admits a factorization like the one in definition 1.2.4 relative to the causal structure of a graph G, we say that G represents P, that G and P are compatible, or that P is Markov relative to G. This characteristic is known as Markov compatibility. Ensuring compatibility between probability measures and DAGs is important in statistical modelling mainly because Markov compatibility is a necessary and sufficient condition for a DAG G to explain observed empirical (non-experimental) data represented by P, that is, for a stochastic causal model capable of generating P (Pearl, 1988a, 2009b).

The set of conditional independencies defined by the generated probability measure P and implied by a DAG G that is Markov compatible with P can also be read off the graph using a graphical criterion. This criterion is called d-separation, and it allows for a clearer and more straightforward interpretation of the relationships between random variables. It was introduced in Pearl (1986, 1988b) and treated extensively throughout the refinement of the SCM. This concept shall play a major role in causal inference in this document. It is formally defined as follows:

Definition 1.2.5. (d-Separation; Pearl, 2009b): A path p (a sequence of consecutive edges) is said to be d-separated (blocked) by a set of nodes Z if, and only if


1. p contains a chain i → m → j, or a fork i ← m → j, such that the middle node m ∈ Z; or

2. p contains an inverted fork (collider) i → m ← j such that m ∉ Z and no descendant of m is in Z.

A set Z is said to d-separate X from Y if and only if Z blocks every path from a node in X to a node in Y.

The intuition behind d-separation is easily grasped by giving a causal meaning to each directed edge in the DAG. Condition 1 in definition 1.2.5 states that the two extreme random variables become conditionally independent (i.e. blocked) once we know the value of the middle node (i.e. condition on it). That is, conditioning on m blocks the flow of causal information from i to j (and vice versa) along path p. Condition 2, representing two causes with a common effect, becomes clearer by realizing that knowing the value of m renders i and j conditionally dependent, because confirming one cause reduces the probability of observing the other. The connection between d-separation and conditional independence is established through a theorem in Verma and Pearl (1988) and Geiger et al. (1990):

Theorem 1.2.1. (Probabilistic Implications of d-Separation; Pearl, 2009b): If sets X and Y are d-separated by Z in a DAG G, then X is independent of Y conditional on Z in every distribution compatible with G. Conversely, if X and Y are not d-separated by Z in a DAG G, then X and Y are dependent conditional on Z in at least one distribution compatible with G.

This can be expressed more succinctly as follows. For any three disjoint subsets of nodes (X, Y, Z) in a DAG G, and for all probability functions P ∈ M, we have:

1. If (X ⫫ Y ∣ Z)G, then (X ⫫ Y ∣ Z)P holds whenever G and P are compatible;

2. If (X ⫫ Y ∣ Z)P holds in all distributions compatible with G, then it follows that (X ⫫ Y ∣ Z)G;

where the graphical notion of d-separation, (X ⫫ Y ∣ Z)G, is distinguished from the probabilistic notion of conditional independence, (X ⫫ Y ∣ Z)P.

Theorem 1.2.1 sets the logical foundations for building a 'mathematical bridge' between the language of statistics and probability and that of graphical models, particularly DAGs, which have a natural capacity for carrying causal assumptions and information. This 'bridge' shall be the backbone throughout the development of the SCM. DAGs themselves are not the causal system. Nonetheless, there are two main advantages of approaching causal knowledge through DAGs: i) the model implications (causal hypotheses) become more meaningful, accessible and reliable; and ii) DAGs are capable of dealing with external interventions of interest. The ability to manipulate the state of a node X in a DAG resembles an intervention on X in a non-experimental setting, which is the goal of measuring causal effects.
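Because d-separation is a purely graphical test, it can be checked mechanically. The following is a minimal sketch, using networkx and the equivalent ancestral-moralization criterion (Lauritzen, 1996); the function name, the example DAG (that of Figure 1.4 below) and its node labels are our own illustrative choices, not an established API.

import networkx as nx

def d_separated(G, X, Y, Z):
    """Test whether node sets X and Y are d-separated by Z in the DAG G."""
    X, Y, Z = set(X), set(Y), set(Z)
    # 1. Keep only X, Y, Z and their ancestors.
    keep = X | Y | Z
    for v in list(keep):
        keep |= nx.ancestors(G, v)
    H = G.subgraph(keep)
    # 2. Moralize: 'marry' every pair of parents of a common child,
    #    then drop edge directions.
    M = H.to_undirected()
    for child in H:
        parents = list(H.predecessors(child))
        for i, u in enumerate(parents):
            for w in parents[i + 1:]:
                M.add_edge(u, w)
    # 3. Delete the conditioning set; any surviving X-Y path means dependence.
    M.remove_nodes_from(Z)
    return not any(nx.has_path(M, x, y) for x in X for y in Y)

# The DAG of Figure 1.4: latent U, with U -> X, U -> Y, X -> Z, Z -> Y.
G = nx.DiGraph([("U", "X"), ("U", "Y"), ("X", "Z"), ("Z", "Y")])
print(d_separated(G, {"X"}, {"Y"}, {"Z", "U"}))  # True: both paths are blocked
print(d_separated(G, {"X"}, {"Y"}, {"Z"}))       # False: X <- U -> Y stays open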


DAGs and Interventions: Towards Causal Inference

As emphasized at the beginning of this subsection, causal analysis is concerned with the effect of external interventions on the causal system. Assuming that a given DAG represents stable and autonomous causal relationships between variables, an intervention should be understood as the physical phenomenon in which a single causal relationship, or a set of them, is changed without altering the structure of the causal system or the remaining relationships (Pearl, 2009b). In other words, DAGs allow for manipulating the value of selected nodes in V in order to resemble an intervention in a non-experimental setting. This logical operation, known as the do operator, denoted do(X = x) and first presented in Goldszmidt and Pearl (1992), or set(X = x), as in Pearl (1995), acts by eliminating the directed links from any predecessor to the intervened variable (cause) of interest X, such that the value attained (x) does not correspond to a stochastic result but to a deterministic process fixed by the researcher. Note the difference between P(y ∣ x) and P(y ∣ do(X = x)). The former is a simple, passive observation. The latter is an active action, that is, an intervention on the natural (functional) process by which the random variable X is defined. As stated by Pearl (2009b), the ability of causal networks to predict the effects of such interventions requires a set of assumptions that rest on causal (not associational) knowledge, and that ensure the system responds in accordance with what we described as the principle of autonomy. We now introduce the concept of causal Bayesian networks:

Definition 1.2.6. (Causal Bayesian Network; Pearl, 2009b): Let P(v) be a probability distribution over a set of variables V, and let Px(v) denote the distribution resulting from the intervention do(X = x). Let P∗(v) be the set of all interventional distributions Px(v), with X ⊆ V. Note that P(v) ∈ P∗(v), i.e. the case of no intervention (X = ∅). A DAG G is said to be a causal Bayesian network compatible with P∗(v) if, and only if, the following three conditions hold:

1. Px(v) is Markov relative to G;

2. Px(vi) = 1 for all Vi ∈ X, whenever vi is consistent with X = x;

3. Px(vi ∣ pai) = P(vi ∣ pai) for all Vi ∉ X, whenever pai is consistent with X = x, i.e., each P(vi ∣ pai) remains invariant to interventions not involving Vi.

When causal analyses depart from a causal Bayesian network, once the intervention is performed, the factorization in definition 1.2.4 can be expressed as the truncated factorization

Px(v) = ∏_{i: Vi ∉ X} P(vi ∣ pai)     (1.2)

for every v consistent with do(X = x). In other words, we are limiting our interventional space of distributions P∗ to a certain set of multivariate probability densities which fulfill specific hypotheses and are Markov relative to a causal Bayesian network G. Once we define P∗, the following properties must hold for every P ∈ P∗:

• Property 1: For all i, P(vi ∣ pai) = Ppai(vi). That is, the set PAi is exogenous relative to Vi (i.e. the conditional probability P(vi ∣ pai) coincides with setting do(PAi = pai) by intervention).


• Property 2: For all i and every subset S ⊆ V such that S ∩ {Vi, PAi} = ∅, we have Ppai,s(vi) = Ppai(vi). That is, once we control for the direct causes of Vi (i.e. setting do(PAi = pai)), no other interventions (i.e. do(S = s)) will affect the outcome of Vi.
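As a numeric illustration of the truncated factorization (1.2), the sketch below computes P(y ∣ do(x)) exactly for the binary DAG Z → X, Z → Y, X → Y, and contrasts it with the passive observation P(y ∣ x); all CPT values and function names are invented for the example.

# CPTs for the binary DAG Z -> X, Z -> Y, X -> Y (illustrative values only).
P_z = {0: 0.6, 1: 0.4}                    # P(z)
P_x0_z = {0: 0.8, 1: 0.3}                 # P(X = 0 | z)
P_y0_xz = {(0, 0): 0.9, (0, 1): 0.6,      # P(Y = 0 | x, z)
           (1, 0): 0.5, (1, 1): 0.2}

def p_x_z(x, z):
    return P_x0_z[z] if x == 0 else 1 - P_x0_z[z]

def p_y_xz(y, x, z):
    return P_y0_xz[(x, z)] if y == 0 else 1 - P_y0_xz[(x, z)]

def p_y_do_x(y, x):
    # Truncated factorization (1.2): drop the factor P(x | z), keep the rest.
    return sum(P_z[z] * p_y_xz(y, x, z) for z in (0, 1))

def p_y_given_x(y, x):
    # Passive observation, for contrast: P(y | x) = sum_z P(y | x, z) P(z | x).
    num = sum(P_z[z] * p_x_z(x, z) * p_y_xz(y, x, z) for z in (0, 1))
    return num / sum(P_z[z] * p_x_z(x, z) for z in (0, 1))

print(p_y_do_x(1, 1), p_y_given_x(1, 1))  # 0.62 vs 0.71: doing is not seeing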

Interventions and causal Bayesian networks are mathematical objects that allow for the derivation of several causal quantities in terms of probability distributions, many of them being hypothetical counterfactual analyses from observed data (Balke and Pearl, 1995). Notwithstanding, Pearl also understood causality in a modular way. He followed Laplace's (1814) conception of nature's laws as deterministic, with randomness arising from our ignorance of the underlying causal system. Therefore, DAGs and the causal influences implied therein are interpreted as a set of deterministic, functional relationships perturbed by random disturbances, as first stated in Pearl and Verma (1991). These functions, called structural equations, represent each child-parent link in a DAG G as Vi = fi(PAi, Ui), where the Ui are iid stochastic error terms following a distribution P(U). Formally:

Definition 1.2.7. (Causal Model; Pearl, 2009b): A causal model is a triple M = ⟨U, V, F⟩ where

• U is a set of background (exogenous) variables, determined by factors outside the model;

• V is a set {V1, ..., Vn} of endogenous variables, determined by variables inside the model, i.e. by variables in U ∪ V;

• F is a set of functions {f1, ..., fn} such that each fi is a mapping from U ∪ (V/Vi) to Vi, and such that F forms a mapping from U to V; i.e. if each fi yields a value for Vi given U ∪ V, then the entire system has a unique solution V(u) through F.

The set of equations can be represented as vi = fi(pai, ui), i = 1, ..., n, where pai is the realization of the unique minimal set of Markovian parents (PAi) sufficient for representing fi. Likewise, Ui ⊆ U represents the minimal set of exogenous variables sufficient for representing fi.

The submodel created from the intervention do(X = x) in model M represents the changes in the system after removing the functional relationship X = f(PAX, UX). The latter is done by removing all incoming links to node X in the DAG G associated with the causal model M. Formally:

Definition 1.2.8. (Submodel; Pearl, 2009b): Let M be a causal model, X a set of variables in V, and x a realization of X. A submodel Mx of M is the causal model Mx = ⟨U, V, Fx⟩, where Fx := {fi : Vi ∉ X} ∪ {X = x}. That is, Fx is formed by deleting from F all the functions fi corresponding to nodes in X and replacing them by the constant function fX = x.

Submodels are useful for representing interventions to the causal system of the form do(X = x). Solving for the distribution of another node (or set of nodes) Xj, P(xj ∣ do(X = x)), yields the causal effect of X on Xj. The formal definition reads as follows:


Definition 1.2.9. (Causal Effect; based on Pearl, 2009b): Given two disjoint sets of variables, X and Y, the causal effect of X on Y, denoted as P(y ∣ do(x)), Px(y), or P(y ∣ x̂), is a function g : X → P, where P is the space of probability functions defined over Y. g(x̂, y) gives the probability of observing Y = y induced by deleting all the equations corresponding to variables in X from the set of functional causal relationships in M, and substituting X = do(x) in the remaining equations. That is, g(x̂, y) = P(y ∣ x̂).

Causal quantities Q(P′), such as the ATE, can be expressed in terms of causal effects. Note that E(y ∣ do(x)) − E(y ∣ do(x′)), as in Rosenbaum and Rubin (1983), can be computed from the probability distributions P(y ∣ do(x)) and P(y ∣ do(x′)), both belonging to the interventional space of distributions P∗. We also recall that the expression P(y ∣ do(x)) is equivalent to that presented in Rubin (1974) describing the potential-outcome notation, P(Yx = y). Now that we have defined the concept of causal effects, we present the logical framework that allows for their actual identification and computation. The total effect of an intervention do(X = x) on a set of outcome variables of interest can be computed due to the following theorem:

Theorem 1.2.2. (Adjustment for Direct Causes; Pearl, 2009b): Let Xi and PAi denote a random variable and its direct causes. Also, let Y be any set of variables disjoint from Xi and PAi, that is, Y ∩ {Xi ∪ PAi} = ∅. The effect of the intervention do(Xi = x′i) on Y is

P(y ∣ do(x′i)) = ∫_{pai} P(y ∣ x′i, pai) P(pai) d(pai)     (1.3)

where P(y ∣ x′i, pai) and P(pai) represent preintervention probabilities.

Equation (1.3) calls for conditioning P(y ∣ do(Xi = x′i)) on the parents of Xi, and then averaging the result, weighted by the probability P(PAi = pai). This operation is known as 'adjusting for PAi' (Pearl, 2009b). Adjusting for the set PAi eliminates spurious correlations (paths) between the cause do(Xi = xi) and the effect (Y = y). In other words, adjusting for the parents of the intervened variable Xi acts as if the researcher partitioned the sample into groups that are homogeneous with respect to PAi, assessed the effect of the intervention in each group, and then averaged the results (equation 1.3). However, other sets of variables can also accomplish the task of eliminating spurious causal paths between interventions and outcome variables. One of the many advantages of the SCM is the use of a simple graphical criterion for identifying the right set Z of 'adjusting covariates', as opposed to some other cumbersome concepts in the related literature, such as the 'ignorability' assumption in the potential-outcome framework (Rosenbaum and Rubin, 1983). Assume that we are provided with a causal DAG G, along with non-experimental data V. Suppose we want to estimate the effect of the intervention do(X = x) on Y, where both X and Y are subsets of V. More formally, we want to estimate P(y ∣ do(X = x)) using the observed information coded in P(v), given the causal assumptions encoded in G. A simple graphical test, known as the "back-door criterion" (Pearl, 1993), is used to test whether a set of nodes Z is sufficient for identifying P(y ∣ x̂):

Definition 1.2.10. (Back-Door; Pearl, 2009b): A set of variables Z is said to satisfy the back-door criterion relative to an ordered pair of variables (Xi, Xj) in a DAG G if:


1. no variable in Z is a descendant of Xi; and

2. Z blocks (d-separates) every path between Xi and Xj that contains an arrow into Xi.

Similarly, if X and Y are two disjoint subsets of nodes in G, then Z is said to satisfy the back-door criterion relative to (X, Y) if it satisfies the criterion relative to any pair (Xi, Xj) such that Xi ∈ X and Xj ∈ Y.

The idea of the "back-door" is essential in the measurement of causal quantities. To be clearer, assume we wish to compute a certain causal effect of do(X = x), but we are only able to observe the dependency P(y ∣ x) that results from the paths in a DAG G. This dependency is contaminated by spurious correlations (the back-door paths), while some of it is genuinely causal (the directed paths from X to Y). In order to remove this bias, we need to modify the measured dependency so that it equals the desired quantity (Pearl, 2009b). To do so, we condition on a set of variables Z satisfying the criterion previously defined. Once we 'adjust' for the set Z, we can identify the causal quantity by the following theorem, for which a proof can be found in Pearl (1993, 2009b):

Theorem 1.2.3. (Back-Door Adjustment; Pearl, 2009b): If a set of variables Z satisfies the back-door criterion relative to (X, Y), then the causal effect of X on Y is identifiable by the formula

P(y ∣ do(X = x)) = ∫_z P(y ∣ x, z) P(z) dz     (1.4)
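In practice, the back-door adjustment (1.4) can be estimated by simple stratification. The following is a simulation sketch on the binary confounded DAG Z → X, Z → Y, X → Y, with made-up parameters chosen so that the true average effect is 0.3; the helper name backdoor is ours.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Observational data: Z confounds the effect of X on Y (parameters invented).
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * z)             # treatment depends on z
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)   # outcome depends on x and z

def backdoor(y, x, z, x0):
    """Estimate P(Y = 1 | do(X = x0)) via (1.4): sum_z P(y | x0, z) P(z)."""
    return sum(y[(x == x0) & (z == v)].mean() * (z == v).mean() for v in (0, 1))

naive = y[x == 1].mean() - y[x == 0].mean()          # confounded contrast
ate = backdoor(y, x, z, 1) - backdoor(y, x, z, 0)    # adjusted, approx. 0.3
print(f"naive: {naive:.3f}, back-door adjusted: {ate:.3f}")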

The "back-door" criterion does not adjust for the accompanying random variables that are themselves affected by the intervention do(X = x). These variables can nonetheless also be used to recover causal effects. Another criterion, introduced by Pearl (1995) under the name of "front-door" criterion, is important inasmuch as it aids in the identification of causal effects. Assume a DAG with a causal path X → Z → Y, as the one presented in Figure 1.4, where X and Y are simultaneously influenced by a fourth, latent variable U. Z does not satisfy any of the back-door conditions in definition 1.2.10, but measures of Z can nonetheless help to consistently estimate P(y ∣ do(X = x)). As shown in Pearl (2009b), this can be achieved by translating P(y ∣ do(X = x)) into formulas that are computable from P(x, y, z). The joint distribution of the hypothetical example described above can be decomposed as P(x, y, z, u) = P(u) P(x ∣ u) P(z ∣ x) P(y ∣ z, u). From the intervention do(X = x), we remove the factor P(x ∣ u) from this expression, and after summing over z and u (since we are interested in a causal quantity defined for y), we have

P(y ∣ do(x)) = ∑_z P(z ∣ x) ∑_u P(y ∣ z, u) P(u)     (1.5)

Figure 1.4. A diagram representing the front-door criterion: X → Z → Y, with a latent variable U influencing both X and Y.

The following equalities hold due to the conditional independence assumptions encoded in Figure 1.4: P(u ∣ z, x) = P(u ∣ x) and P(y ∣ x, z, u) = P(y ∣ z, u). The latter implies the following

equalities:

∑_u P(y ∣ z, u) P(u) = ∑_x ∑_u P(y ∣ z, u) P(u ∣ x) P(x)
                     = ∑_x ∑_u P(y ∣ x, z, u) P(u ∣ z, x) P(x)
                     = ∑_x P(y ∣ x, z) P(x)     (1.6)

and therefore, replacing these expressions in equation (1.5) yields

P(y ∣ do(x)) = ∑_z P(z ∣ x) ∑_{x′} P(y ∣ x′, z) P(x′)     (1.7)

The expression in equation (1.7) can be estimated from non-experimental data and yields an unbiased estimate of the causal effect of X on Y, P(y ∣ do(X = x)), in the sense of definition 1.2.9. The derivation of (1.7) can be interpreted as a two-step application of the back-door criterion. As explained in Pearl (2009b), the first step is finding the causal effect of X on Z, a concomitant variable that is directly affected by the treatment but is not of main interest. Since there is no unblocked back-door path from X to Z in Figure 1.4, we have that P(z ∣ do(x)) = P(z ∣ x). The second step is the computation of the causal effect of Z on Y. Since there is a back-door path Z ← X ← U → Y that renders Z and Y dependent, we adjust for X, a node that d-separates (blocks) this path, thereby allowing the computation P(y ∣ do(z)) = ∑_{x′} P(y ∣ x′, z) P(x′). Combining the two steps, we have:

P(y ∣ do(x)) = ∑_z P(y ∣ do(z)) P(z ∣ do(x)) = ∑_z P(z ∣ x) ∑_{x′} P(y ∣ x′, z) P(x′)     (1.8)

This result allows for defining the front-door criterion and its corresponding theorem for causal effect identification:

Definition 1.2.11. (Front-Door; Pearl, 2009b): A set of variables Z is said to satisfy the front-door criterion relative to an ordered pair of variables (X, Y) if:

1. Z intercepts all directed paths from X to Y;

2. there is no unblocked back-door path from X to Z; and


3. all back-door paths from Z to Y are blocked by X.

Theorem 1.2.4. (Front-Door Adjustment; Pearl, 2009b): If Z satisfies the front-door criterion relative to (X, Y) and if P(x, z) > 0, then the causal effect of X on Y is identifiable and is given by the formula

P(y ∣ do(x)) = ∑_z P(z ∣ x) ∑_{x′} P(y ∣ x′, z) P(x′)     (1.9)
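A simulation sketch of the front-door adjustment (1.9) on the DAG of Figure 1.4 follows. All parameter values and the helper name front_door are invented; the true effect, transmitted only through Z, is 0.4 × 0.7 = 0.28, while the naive contrast is contaminated by the latent U.

import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Figure 1.4 with a hidden confounder U (parameters invented).
u = rng.binomial(1, 0.5, n)                      # latent common cause
x = rng.binomial(1, 0.2 + 0.6 * u)               # U -> X
z = rng.binomial(1, 0.1 + 0.7 * x)               # X -> Z (mediator)
y = rng.binomial(1, 0.1 + 0.4 * z + 0.4 * u)     # Z -> Y and U -> Y; no X -> Y

def front_door(y, x, z, x0):
    """Estimate P(Y = 1 | do(X = x0)) via (1.9), using only (x, z, y)."""
    total = 0.0
    for zv in (0, 1):
        p_z_x0 = (z[x == x0] == zv).mean()                       # P(z | x0)
        inner = sum(y[(x == xv) & (z == zv)].mean() * (x == xv).mean()
                    for xv in (0, 1))                            # sum_x' P(y|x',z)P(x')
        total += p_z_x0 * inner
    return total

naive = y[x == 1].mean() - y[x == 0].mean()
ate = front_door(y, x, z, 1) - front_door(y, x, z, 0)            # approx. 0.28
print(f"naive: {naive:.3f}, front-door adjusted: {ate:.3f}")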

Despite having derived some formulas by which the causal effect of an intervention of the form do(X = x) can be estimated, we still need some 'inference rules' that allow for the translation of hypothetical interventions P(y ∣ x̂) into equivalent expressions involving only standard probabilities of observed quantities (realizations). This set of inference rules is called the do-calculus, and it is presented next. First, we introduce new notation. Following Pearl (1995, 2009b), let X, Y, and Z be arbitrary disjoint sets of nodes in a causal DAG G. We denote by G_X̄ the subgraph obtained by removing all directed edges pointing to nodes in X. Likewise, let G_X̲ be the subgraph resulting from deleting all directed edges emerging from any node in X. Also, P(y ∣ x̂, z) ≜ P(y, z ∣ x̂)/P(z ∣ x̂) stands for the probability of Y = y given that X is held (intervened) at level x and that Z = z is observed by chance. Now, three basic inference rules are presented. Proofs can be found in Pearl (1995).

Theorem 1.2.5. (Rules of do-calculus; Pearl, 2009b): Let G be the DAG associated with a causal model, and let P(⋅) be the probability distribution induced (implied) by that model. For any disjoint sets of variables X, Y, Z and W, we have the following rules:

• Rule 1 (Insertion/deletion of observations): P(y ∣ x̂, z, w) = P(y ∣ x̂, w) if (Y ⫫ Z ∣ X, W) in G_X̄;

• Rule 2 (Action/observation exchange): P(y ∣ x̂, ẑ, w) = P(y ∣ x̂, z, w) if (Y ⫫ Z ∣ X, W) in G_X̄Z̲;

• Rule 3 (Insertion/deletion of actions): P(y ∣ x̂, ẑ, w) = P(y ∣ x̂, w) if (Y ⫫ Z ∣ X, W) in G_X̄,Z̄(W);

where Z(W) is the set of Z-nodes that are not ancestors of any W-node in G_X̄. A short explanation of each of these rules goes as follows. Rule 1 stems from the fact that d-separation is a valid criterion for testing conditional independence between two sets of variables: the intervention do(X = x), or in other words, deleting the equations for X from the causal system (resulting in G_X̄), does not introduce new dependencies among the rest of the random variables. Rule 2 provides a condition for an external intervention do(Z = z) to have the same effect on Y as the passive observation Z = z; the condition is that (X, W) blocks all back-door paths from Z to Y. This holds since we are already fixing do(X = x) and also not letting Z affect any of its descendants, by 'manipulating' the graph (Spirtes et al., 2000) until we have the DAG G_X̄Z̲. Rule 3 allows for introducing a new intervention do(Z = z) without changing the probability of Y = y. This rule is valid once we manipulate the DAG G by eliminating all the equations corresponding to Z, hence G_X̄,Z̄(W). A detailed explanation of the rules


of do-calculus can be found in Pearl (1995). The main result here is that, after simple manipulations, causal effects can be computed from observed probabilities.

Corollary 1.2.6. The causal effect of a set of interventions do(X̂) on a set of outcome variables of interest Y, Q = P(y1, ..., yk ∣ x̂1, ..., x̂m), is identifiable in a causal model characterized by a DAG G if there exists a finite sequence of transformations, each one conforming to one of the inference rules in Theorem 1.2.5, that reduces Q into a standard probability expression involving observed quantities.

We have briefly presented some theoretical results on the nonparametric identification of the causal effect of an intervention or treatment X on an outcome random variable of interest Y. What we wanted to highlight is that this quantity, P(y ∣ do(x)), can be recovered from standard probability distributions over observed, non-experimental data, in contrast to what we have presented for the Potential Outcome framework (Rubin, 1974; Rosenbaum and Rubin, 1983; Holland, 1986; Rubin, 2005, 2006) and the Instrumental Variables approach (Angrist, 1990; Angrist et al., 1996), where causal claims are derived from hypothetical probability distributions defined over non-observable (counterfactual) random variables. We will bring up this issue in the following subsection, where SCM and RCM are compared.

It is important to recall that throughout this subsection we did not make any assumption about the nature of the sets of variables X, Y, and Z involved in the process of causal inference. Although Pearl (2009b) developed his structural causal theory based on observed random variables, this approach can be extended to the case in which causal relationships exist between latent (non-observable) variables. The main challenge when dealing with latent variables comes in the causal discovery step, that is, recovering the true underlying DAG (or a set of plausible DAGs) from the observed data. In response to this issue, different algorithms were developed in order to recover causal structures with latent variables from observed data (see, for example, Glymour and Spirtes, 1988; Scheines et al., 1991; Spirtes et al., 1995; Spirtes and Richardson, 1997; and Spirtes et al., 1997). These documents gave a causal interpretation to the structural part of a structural equation model (SEM), as we shall see in the following sections.

Also, SCM and RCM differ with respect to the main causal quantities of interest. While RCM is usually concerned with the computation of causal parameters like the ATE or the ATT, in SCM the researcher mostly estimates total, direct, and indirect effects of a given intervention T on a variable of interest Y, with both Y, T ∈ V. This is possible due to the explicit, modular structure of M in the SCM (a feature that is not declared in the RCM), in which direct (and indirect) causal paths are explicitly modeled. Therefore, depending on the graph structure, an intervention T has multiple ways in which it affects an outcome Y. Pearl (2005) presents the formal definitions of these causal quantities (see the Appendix section). Linear SEMs are handy statistical tools that yield parametric, but consistent and unbiased, estimates of coefficients from which the total, direct, and indirect effects of an intervention can be estimated. As presented in Chapter 2, the estimation of causal quantities within the SCM can be achieved through SEM.
See Bollen (1987) and Sobel (1987, 1990) for a causal interpretation of the estimated coefficients of a linear SEM and how they are used to compute total, direct, and indirect (mediation) causal effects. Also, Muthén (2011) and Muthén and Asparouhov (2015) explain how these effects can be computed in the context of nonlinear SEMs with latent variables.


1.3. The relationship between SCM and RCM: Why the former and not the latter?

There has been a continuous discussion about which approach is better suited to assess causal claims from observed (non-experimental) data, yet no definitive conclusion has been reached. In this subsection we present a brief introduction to the current debate involving both SCM and RCM scholars in the causal inference field. First, we present the most radical view on why scholars in the RCM literature believe that SCM cannot cope with causal inference. Then we present a more moderate approach that acknowledges some relevant features of the SCM, and finally we discuss how the SCM subsumes the RCM.

An agnostic point of view from the RCM towards SCM

Rubin's (and colleagues') arguments against the SCM rely on the claim that the SCM approach lacks the capacity to simulate proper experimental conditions, from which causal conclusions can be obtained. This is in fact summarized in the well-known motto "no causation without manipulation" in Holland (1986). For those versed in the RCM framework, it is the scientist who decides which intervention to apply and, subsequently, measures the causal effect of that intervention on some outcome variable of interest. In contrast, they argue, in the SCM there is no clear distinction between intervention variables that result from a decision made by the researcher and mere outcome variables. We expand on this discussion.

First, in the RCM the key assumption is that treatment is assigned at random by some unknown mechanism (as in a proper experimental condition). The ignorability assumption then states that the potential outcomes are conditionally independent of the treatment given a set of observed (pre-treatment) variables. This is akin to the notion of randomization in design-of-experiments studies. Therefore, if this conditional independence holds, then the set of observed variables must be sufficient to eliminate confounding of the causal effect estimates coming from sources other than the treatment and the outcome variable of interest. In favor of Pearl's SCM, we argue that if the researcher chooses the right set of variables to block those spurious correlations (the back-door criterion), then the causal effect of a 'treatment' on an outcome variable of interest is identifiable. In other words, if d-separation is fulfilled, then we meet conditions that mimic proper experimental conditions (i.e. the strong ignorability assumption in the RCM).

On the other hand, a very enlightening discussion has arisen from a series of papers and blog entries by Gelman and by Pearl himself. Gelman (2011) summarizes the alleged flaws of the SCM in a couple of points: i) he criticizes the idea of an algorithm telling the researcher which is the 'correct' graph that resembles the DGP and suits the observational data; and ii) he claims that there is no such thing as 'true zeros' in the social, behavioral and environmental sciences (when referring to conditional independence). For the sake of an objective discussion, we give Gelman's arguments credit. Indeed, as a starting point, it is clear to the reader that in this document we do not address the discovery step in the causal inference process. Instead, we depart from an established, assumed DAG, previously defined by the researcher. Given the recent advances in artificial intelligence


and machine learning, the discussions related to the developments of the discovery step have gone far beyond the scope of this paper. With respect to his second argument, we argue that DAGs help to encode experts' knowledge on how the world works, even if only as a naïve representation of it. Nonetheless, Gelman has a point. DAGs represent assumptions of conditional independence between random variables, something that is plausible in settings where there is a clear understanding of the mechanisms that operate in such an experiment. In the social and cognitive sciences, however, these mechanisms are far less transparent.

A moderate, second opinion by Dawid

Another (more moderate) opinion is that of Dawid (2008, 2010) and colleagues, who argue that, albeit very useful when it comes to carrying causal information, the conditional probabilities implied by a certain causal DAG might also be replicated by other DAGs encoding different causal assumptions. This said, causality is a complex subject, and it cannot be expressed only in terms of conditional independence properties inferred from purely observational data; a formal causal language has to be defined. Dawid (2008) stresses the importance of distinguishing between probabilistic DAGs, causal DAGs, and those expressing both conditional independence relationships and causal assumptions, the Pearlian DAGs (akin to those in Definition 1.2.6). Also, Dawid is skeptical about how assumed interventions (like those in the SCM) propagate across passive, undisturbed systems (observational data) by just inferring a set of 'structural equations' from the observed data. This criticism is somewhat similar to that of Gelman on causal discovery. However, Dawid argues that in order to validate these causal discovery results (which he believes are logically consistent and well obtained), strong causal assumptions have to be made. Pearl himself acknowledges this point, and provides a strong theoretical framework for the discovery step in his SCM.

On why SCM and not RCM

We now list a series of advantages that several authors, and we ourselves, believe the SCM has over the RCM. First, DAGs are an easy, clear, intuitive and accessible way to communicate causal relationships between random variables through conditional independence properties. In fact, DAGs are a graphical representation of how the researcher thinks the world works. Pearlian DAGs provide causal information without using any 'algebra', resorting only to rigorous and intuitive graphical rules. In fact, as stated in Pearl (2009b, Section 3.6.2), DAGs are mathematical objects that allow for the incorporation of substantive knowledge from researchers, such as claims of the form "Y does not directly affect X", which are causal judgements rather than constraints on probability distributions over counterfactual variables (as in RCM). This is why the SCM is so popular in Sociology, Computer Science, Epidemiology and other Biomedical Sciences. As a matter of fact, Pearl (2009b, Section 11.3.4, page 347) says (emphasis added):

The power of graphs lies in offering investigators a transparent language to reason about, to discuss the plausibility of (causal) assumptions and, when consensus is not reached, to isolate differences of opinion and identify what additional observations would be needed to resolve differences. This facility


is lacking in the potential-outcome approach where, for most investigators, "strong ignorability" remains a mystical black box.

Nonetheless, RCM advocates say that DAGs are not clear in terms of expressing counterfactuals, a key concept in Rubin's approach. Counterfactuals are understood as hypothetical manipulations of a random variable, like those obtained under ideal experimental conditions. DAGs encode causal information about several counterfactuals (not only one), while PO models are explicit about which intervention is of interest. Single World Intervention Graphs (SWIGs), a novel approach by Richardson and Robins (2013a,b), aim to unify causal DAGs and PO models using node-splitting transformations. Their cutting-edge work allows for understanding counterfactuals through graphical, yet powerful, tools. Richardson and Robins (2013a,b) show that SWIGs satisfy the d-separation (factorization, global Markov condition) and modularity properties, key concepts rooted in the SCM framework and necessary in the RCM.

Second, Pearl formally demonstrated the equivalence between the structural and potential-outcome frameworks. Recall the PO framework in Section 2.1. According to RCM's notation, Yi(xi, t) can be read as "the value Y would attain for individual i with background characteristics Xi = xi, had treatment been T = t". Pearl (2009d) shows that the structural interpretation of the latter is

Yi(xi, t) ≡ Y_{Mt}(xi)     (1.10)

where Y_{Mt}(xi) is the unique solution of Y in the submodel Mt of M (as in definition 1.2.8) under the realization X = xi. While the term unit in RCM refers to the individual, in SCM it refers to the set of attributes that characterize that individual, represented as the vector xi in structural modeling. According to Pearl, equation (1.10) defines the formal connection between the counterfactual approach and the intervention do(T = t). In fact, as Pearl (2009d) states, if T is treated as a random variable, then Yi(xi, t) ≡ Y_{Mt}(xi) is also a random variable. An important criticism of the RCM is that a PO researcher will take the counterfactual as a primitive (axiom) and think of P(x1, ..., xn) as a marginal distribution of an extended probability function P∗ defined over both observed variables and counterfactuals. Instead, SCM researchers understand counterfactuals as the result of interventions on the causal system that change the structural model (and the distribution) but keep all other relationships intact. In other words, the RCM views Y under do(T = t), Yi(xi, t), as a different variable from Y_{Mt}(xi), which Pearl has already shown to be equivalent. For a wider explanation of the subject, refer to Pearl (2009b, Chapters 3, 7 and Section 11.3.5).

Third, the SCM (through DAGs) gives better guidance on which variables are relevant (necessary) for the estimation of causal effects than the RCM does. Pearl (2009b,c) criticizes Rubin's (and colleagues') advice of conditioning on all pretreatment variables, while Rubin (2007, 2008) views the SCM as "non-scientific ad hockery", because of its apparent difficulty in coping with untested assumptions related to the experts' knowledge.
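To make the equivalence in equation (1.10) concrete, the toy sketch below solves a structural model M and its submodel Mt (definition 1.2.8) by replacing the structural function for T with a constant; every function and coefficient here is invented for illustration.

# Background variables u characterize the unit; solving the submodel M_t
# yields the potential outcome Y_i(x_i, t) = Y_{M_t}(x_i) of equation (1.10).
def solve_y(u, do_t=None):
    x = u["u_x"]                                                # X = f_X(U_X)
    t = do_t if do_t is not None else int(x + u["u_t"] > 1.0)   # T = f_T(X, U_T)
    return 2.0 * t + 0.5 * x + u["u_y"]                         # Y = f_Y(T, X, U_Y)

unit = {"u_x": 0.8, "u_t": 0.4, "u_y": -0.1}   # one unit's background U = u
y1 = solve_y(unit, do_t=1)                     # solution of Y in M_{t=1}
y0 = solve_y(unit, do_t=0)                     # solution of Y in M_{t=0}
print(y1 - y0)                                 # unit-level causal effect: 2.0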


The discussion started with Shrier's comment (2008) on Rubin (2007), and continued with Rubin (2008) and several other replies from both RCM and SCM scholars (see Pearl, 2009c; Sjölander, 2009; Rubin, 2009 for further details on this topic). The main idea is that, depending on the underlying structure of the DAG, spurious correlations might arise from unnecessary variables in the PO exercise, yielding biased results about the causal effect of the selected intervention on the outcomes of interest. This issue is common in the presence of confounding/mediating variables or M-structures in a causal DAG. The simplest M-structure, shown in Figure 1.5 (see Greenland et al., 1999; Hernán et al., 2004, and Pearl, 2009d), contains the path X ← U1 → Z ← U2 → Y, where X, Y and Z are measured, while U1 and U2 are latent variables. Conditioning on Z induces a spurious dependence between X and Y (since Z is the middle node of an inverted fork, i.e. a collider), thus a standard PO analysis would yield biased estimates of the causal effect of do(X = x) on Y.

Figure 1.5. A causal DAG demonstrating M-bias: X ← U1 → Z ← U2 → Y.

This spurious dependence is a result of Berkson's paradox, which states that two independent causes of a common effect (say U1 and U2) become dependent once we condition on the observed effect (Z). RCM scholars do not acknowledge the logical consequences of indiscriminately conditioning on variables that describe subjects before treatment; therefore, in many such cases, propensity score (PS) analyses yield biased results. Some authors in the RCM literature (e.g. Greenland, 2003; Liu et al., 2012 in epidemiological studies, and Brookhart et al., 2006; Setoguchi et al., 2008; Clarke et al., 2011 in RCM) have recently addressed this issue. Their findings are similar to those originally claimed by Pearl and colleagues.
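The resulting bias is easy to reproduce by simulation. The sketch below generates Gaussian data from the M-structure of Figure 1.5 with invented coefficients; the true causal effect of X on Y is exactly zero, yet adjusting for the collider Z manufactures a spurious one.

import numpy as np

rng = np.random.default_rng(2)
n = 100_000

u1, u2 = rng.normal(size=(2, n))          # latent causes
z = u1 + u2 + rng.normal(size=n)          # collider: U1 -> Z <- U2
x = u1 + rng.normal(size=n)               # U1 -> X
y = u2 + rng.normal(size=n)               # U2 -> Y; no arrow X -> Y

def slope_on_x(*regressors):
    A = np.column_stack([np.ones(n), *regressors])
    return np.linalg.lstsq(A, y, rcond=None)[0][1]   # coefficient on x

print(slope_on_x(x))      # approx. 0: the unadjusted estimate is unbiased here
print(slope_on_x(x, z))   # nonzero: conditioning on Z opens X <- U1 -> Z <- U2 -> Y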

CHAPTER 2

Causal Inference Through Parametric, Linear Models

In Chapter 1 we presented the theoretical foundations of causal inference in the SCM. However, we did not explain how causal quantities are estimated through models. Statistical models play a key role in causal inference since they allow for building a bridge between observed, non-experimental data, (D, P), and causal assumptions, M, to answer causal questions derived from Q(P′) and (D′, P′). Hernán and Robins (2016) dedicate a whole volume of their book to explaining causal inference through models. Even Andrew Gelman, a prominent statistician, says in one of his blog's entries (http://andrewgelman.com/2009/07/07/more on pearls/, emphases and parentheses added):

Statistical modeling can contribute to causal inference. In an observational study with lots of background variables to control for, there is a lot of freedom in putting together a statistical model: different possible interactions, link functions, and all the rest [...]. Better modeling, and model checking, can lead to better causal inference. This is true in Rubin's framework as well as in Pearl's: the structure may be there (referring to M), but the model still needs to be built and tested.

In summary, computing causal effects through statistical models, whether in a parametric or nonparametric way, allows for estimating causal quantities from non-experimental data. In the parametric case, models impose prior restrictions on the distribution of the data, and inferences are valid as long as the model is correctly specified. In the nonparametric case, no restrictions are imposed, at the expense of interpretation and parsimony. Either way, statistical models give causal structure and interpretation to noisy data. It is through models that researchers answer causal questions. Within the context of causal inference, statistical models are not just carriers of probabilistic information, but have causal content and thus shall be interpreted in a causal way. Pearl (2000, 2009b) made clear how causal knowledge can be understood in a modular fashion and how to give a causal interpretation to a system of equations. The quasi-deterministic approach (see definitions 1.2.7 and 1.2.8) that has been accepted in the SCM


literature allows for expressing causal ties between random variables and their Markovian parents as vi = fi(pai, ui), i = 1, ..., n, which is a nonparametric generalization of the (linear) Structural Equation Model (SEM):

vi = ∑_{k≠i} βi,k vk + ui,     i = 1, ..., n.     (2.1)

Both strong and weak causal assumptions are first implied by the causal DAG and then translated into the SEM's specification (Bollen and Pearl, 2013). On one hand, strong causal assumptions are represented by the absence of directed links or bi-directed arcs between nodes or error terms in the DAG. When translated into the SEM's language, these assumptions are expressed as restrictions over the parameters in the form of zero partial correlations and/or conditional independence (absence of causal relationships) between random variables. On the other hand, weak causal assumptions exclude some values for a parameter but allow another range of values. Directed links and bi-directed arcs in DAGs express the weak causal assumption of a nonzero effect.

A causal model M is called Markovian (recursive in the SEM literature) if its graph G contains no directed cycles and if the error terms (Ui, one for each endogenous variable) are mutually independent (i.e. no bi-directed arcs). A model is semi-Markovian if its graph is acyclic but some errors are dependent. It is common to assume that the Ui are multivariate normal. If so, the Vi in equation (2.1) will also be multivariate normal and (if centered) their distribution is completely characterized by the correlation coefficients ρik, i ≠ k. In fact, the partial structural (regression) coefficients in equation (2.1) can be expressed in terms of both partial correlations and standard deviations, as βY,X ≡ βYX·Z = ρYX·Z (σY·Z / σX·Z).

Now, consider a directed path X → Y in a graph G. In order to assess the causal effect of an intervention of the form do(X = x), one needs to estimate the structural parameters in a given system of causes and effects. Within the context of path analysis, the structural (regression) coefficient can be decomposed as the sum of path coefficients (Wright, 1921, 1934), as βXY = α + IXY, where α is the direct (causal) effect of X on Y, and IXY is the sum of the other effects through paths connecting X and Y, excluding the direct link X → Y. Now, assume we remove the edge X → Y in G, resulting in a subgraph with no direct effect of X on Y, denoted as Gα. We now introduce a theorem that allows for causal effect identification using the d-separation criterion presented in Chapter 1:

Theorem 2.0.1. (d-Separation in General Linear Models; based on Pearl, 2009b): For any linear model structured according to a graph G, which may include cycles and bi-directed arcs, the partial correlation ρXY·Z vanishes if the nodes in set Z d-separate node X from node Y in graph G. In this context, each bi-directed arc Xi ←→ Xj is interpreted as a latent common parent Xi ← L → Xj.

Now, assume a set of variables Z in Gα. If, and only if, the nodes in Z d-separate X and Y (following theorem 2.0.1), we have that

βXY·Z = α + IXY·Z = α,     (2.2)

that is, IXY·Z = 0. The latter is true because the sum of the other path coefficients indirectly connecting X and Y, IXY·Z, vanishes once we 'control for' Z. This yields βXY·Z = α as the


estimate of the direct causal effect of X on Y. This result is summarized in the following theorem:

Theorem 2.0.2. (Single-Door Criterion for Direct Effects; Pearl, 2009b): Let G be any path diagram in which α is the path coefficient associated with the link X → Y, and let Gα denote the diagram that results when X → Y is deleted from G. The coefficient α is identifiable if there exists a set of nodes Z such that i) Z contains no descendant of Y, and ii) Z d-separates X from Y in Gα. If Z satisfies these two conditions, then α is equal to the regression coefficient βYX·Z. Conversely, if Z does not satisfy these conditions, then βYX·Z is not a consistent estimand of α (except in rare cases of measure zero).

Theorem 2.0.2 is useful when the structure of M is relatively simple. However, it can be extended to the case in which more complex paths are considered and where the identification of the structural (path) parameters is achieved through the identification of total rather than direct effects:

Theorem 2.0.3. (Back-Door Criterion in SEM; Pearl, 2009b): For any two variables X and Y in a causal diagram G, the total effect of X on Y is identifiable if there exists a set of variables Z such that

1. no member of Z is a descendant of X; and

2. Z d-separates X from Y in the subgraph G_X̲ obtained after deleting from G all arrows emanating from X.

Moreover, if the two conditions are satisfied, then the total effect of X on Y is given by βXY·Z.

Theorem 2.0.3 states that after controlling for Z, X and Y are not connected through spurious, indirect (back-door) paths. These results also hold for nonlinear SEMs (Pearl, 2009b). It is now clear how the (structural) parameter estimates of a SEM are useful for estimating causal quantities in the SCM, such as the direct and total effects of an intervention of the form do(X = x). Theorems 2.0.1 to 2.0.3 allow for interpreting structural parameters in a causal way.
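A quick simulation sketch of the single-door criterion follows, on the linear Markovian model Z → X, Z → Y, X → Y with an invented direct effect α = 1.5: the unadjusted regression slope is distorted by the open back-door path, while adjusting for Z recovers α.

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
alpha = 1.5                                    # direct effect X -> Y (invented)

z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)               # Z -> X
y = alpha * x + 0.6 * z + rng.normal(size=n)   # X -> Y and Z -> Y

def ols(response, *cols):
    A = np.column_stack([np.ones(n), *cols])
    return np.linalg.lstsq(A, response, rcond=None)[0]

print(ols(y, x)[1])      # approx. 1.79: biased by the open path X <- Z -> Y
print(ols(y, x, z)[1])   # approx. 1.50: Z d-separates X from Y in G_alpha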

Path Coefficients, Direct and Indirect Effects

The structural parameters will be estimated from non-experimental data through SEM, as we show in the following subsections. For now, assume the researcher knows the values of the coefficients in the linear structural equations. With the (estimated) structural coefficients at hand, the computation of direct and indirect effects is straightforward in a setting of linear, simultaneous, structural equations. Following Wright (1920, 1921, 1934), the total causal effect (correlation) of an intervention do(X = x) on an outcome variable Y is equal to the sum, over all direct and indirect paths connecting X and Y, of the products of the path coefficients along each path. By theorems 2.0.2 and 2.0.3, these structural (path) coefficients can be approximated by estimated parameters. For example, consider Figure 2.1 in Pearl (2009b, p. 151). This DAG represents a system of linear structural equations with observed variables V = {X, Y, Z1, Z2}, and


latent variables (dashed arcs). In this case, we are interested in the effect of an intervention do(X = x) on Y. Assume ∆X = 1 with respect to the baseline scenario. Wright's path coefficient method states that the effect of the latter intervention can be computed as the sum of the products of the structural parameters along each path connecting X and Y, i.e. α + βγ. From theorems 2.0.2 and 2.0.3, it is clear that we can compute βYX·Z2 = α + βγ (as Z2 d-separates every back-door (spurious) path between X and Y).

Figure 2.1. Graphical identification of total effects, in Pearl (2009b): a direct path X → Y (α) and an indirect path X → Z1 → Y (β, γ), with Z2 blocking the back-door paths between X and Y.

The computation of direct and indirect (mediation) effects has recently been studied within the context of nonlinear SEMs. Muthén (2011) and Muthén and Asparouhov (2015) present how (natural) direct and indirect effects can be estimated with binary, discrete, and continuous outcomes or mediators (e.g. Z1 in the latter example), even in the presence of latent variables.

But how are regression and SEM estimates different? This question has long been addressed, and it is important to clarify the differences. First, Pearl (2009b, p. 159) gives a superb explanation of what structural coefficients are and how they should be interpreted. SEMs must be interpreted as carriers of causal information and not as a byproduct of the statistical analysis of non-experimental data. As stated by Pearl, an equation y = βx + ε is called structural if it is interpreted as follows: in an ideal experiment, if we perform the action do(X = x) and set Z = z for any other set of variables Z not containing X or Y, the value of Y would be βx + ε, where ε is not a function of (does not causally depend on) the values x and y. Also, if we were to perform the action do(Y = y) and set Z = z, we could not say that the value of X would be (y − ε)/β. This is because the relationship between X and Y is not bidirectional (unless otherwise stated): the flow of causal information goes as X → Y, but not Y → X. Second, and in a similar vein, Bollen and Pearl (2013) argue that the equality sign should be replaced with an assignment symbol (:=), which represents an asymmetric transfer of causal information. It is precisely the claim that regressions and SEMs are equivalent that obscured the causal meaning of the latter in the causal inference literature. We now present some insights on Structural Equation Models as a statistical tool for estimating causal quantities, without forgetting their key role in causal inference.


2.1. Structural Equation Models (SEMs)

We have already discussed how causal quantities, Q(P′), derived from an intervention (treatment) do(X = x) on an outcome variable of interest Y, can be computed through the estimated values of the structural parameters (path coefficients) in linear SEMs. In this subsection we present an introduction to the statistical properties and estimation procedures of Structural Equation Models. SEMs are data-driven models with causal implications that describe the (causal) relationships between random variables. SEMs are systems of simultaneous regression equations, with less restrictive assumptions, that allow for measurement errors in both endogenous and exogenous variables, and that can be viewed as a generalization of much simpler statistical methods (linear models, ANOVA, factor analysis, etc.). One important feature of SEMs is their ability to incorporate the researcher's expertise into the model specification, estimation and testing steps. The empirical results obtained from a SEM analysis can be given causal meaning only within the context of a substantively informed setup (Bollen, 1989).

SEMs consist of three main parts: i) a system of structural equations, ii) the path diagram, and iii) the covariance structure (matrix). The first part, the system of structural equations, is a set of equations that represent how the researcher believes the random variables are causally related. These equations contain random variables and structural parameters. The random variables can be latent (concepts, unmeasured factors), observed (measures, indicators, proxies), or disturbance terms, while the structural parameters are invariant constants that "quantify" the causal links between variables and allow for computing the causal quantities of interest (Bollen, 1989).

The Linear System of Structural Equations

Consider a set of random variables V, both manifest and latent, causally interrelated following the structure and assumptions implied by a causal model M. Variables in V are assumed to be linearly related to their direct causes (parents), as in equation (2.1). We introduce new notation that should not interfere with the SCM causal framework presented in Chapter 1. Let η and ξ, with η, ξ ⊂ V, be sets of endogenous (determined within M) and exogenous (determined outside M) latent variables, respectively. Also, let Y and X, with Y, X ⊂ V, be sets of endogenous and exogenous manifest variables. Assume a sample of N individuals. For each individual i = 1, ..., N, we measure Yi = yi and Xi = xi, and assume the existence of the latent constructs ηi and ξi. For ease of notation, we leave out the subscript i and assume that the causal assumptions implied by M (and summarized by linear equations of the form of 2.1) hold for all individuals in the sample.

Following the LISREL model (Jöreskog and Sörbom, 1996), two major subsystems make up the system of structural equations: a linear latent variable (structural) model and a linear measurement model. On one hand, the latent variable model summarizes the


causal relationships between factors:

η = Bη + Γξ + ζ,     (2.3)

where η is a q1 × 1 vector of endogenous latent variables, B is a q1 × q1 matrix of structural coefficients for the latent endogenous variables, ξ is a q2 × 1 vector of exogenous latent variables with covariance matrix E(ξξ′) = Φ, Γ is a q1 × q2 matrix of structural coefficients for the latent exogenous variables, and ζ is a q1 × 1 vector of disturbance terms that is assumed to have E(ζ) = 0 and covariance matrix E(ζζ′) = Ψ. Moreover, the error term satisfies E(ζξ′) = 0. An important assumption with respect to the structural parameters in the latent variable model is that diag(B) = 0, i.e. a variable is not an immediate and instantaneous cause of itself, and that (I − B) is non-singular, so that (I − B)⁻¹ exists. For statistical identification purposes, restrictions are imposed over the parameters. These restrictions also represent claims about (the absence of) causal effects between random variables.

On the other hand, the measurement equations represent the (linear) causal links between latent and manifest variables. It is assumed that the observed variables are highly correlated with the latent constructs they measure; that is, they provide quantitative evidence of their latent counterparts:

Y = ΛY η + ε     (2.4)

X = ΛX ξ + δ     (2.5)

In equations (2.4) and (2.5), Y is a p1 × 1 vector of endogenous observed variables, ΛY is a p1 × q1 matrix of structural coefficients (factor loadings) linking the endogenous latent variables (causes) and observed variables (consequences), ε is a p1 × 1 vector of disturbances, X is a p2 × 1 vector of exogenous manifest variables, ΛX is a p2 × q2 matrix of coefficients linking the exogenous latent variables and the exogenous observed variables, and δ is a p2 × 1 vector of error terms. It is important to recall that the coefficients in ΛY and ΛX represent the expected change in an observed variable resulting from a one-unit increase in its corresponding latent variable. To fully interpret these coefficients, one must assign a scale to the latent variable; therefore, the analyst typically either fixes one loading per latent variable to one (e.g. λij = 1 for some Xi and ξj) or standardizes the latent variance to one (e.g. var(ξi) = 1). Also, the error terms satisfy E(ε) = E(δ) = E(εδ′) = E(εη′) = E(εξ′) = E(δη′) = E(δξ′) = 0. Moreover, the covariance matrices of the disturbance terms are assumed to be E(εε′) = Θε and E(δδ′) = Θδ, respectively.
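As a sketch of the data-generating side of equations (2.3)-(2.5), the code below simulates one toy LISREL system with a single endogenous and a single exogenous factor and two indicators each; every matrix and variance is invented for illustration.

import numpy as np

rng = np.random.default_rng(4)
n = 10_000

B = np.array([[0.0]])              # no eta -> eta links; diag(B) = 0 holds
Gamma = np.array([[0.7]])          # xi -> eta
Lam_y = np.array([[1.0], [0.8]])   # eta -> (Y1, Y2); first loading fixed at 1
Lam_x = np.array([[1.0], [0.6]])   # xi  -> (X1, X2)

xi = rng.normal(size=(n, 1))                                   # Phi = 1
zeta = 0.5 * rng.normal(size=(n, 1))                           # Psi = 0.25
eta = (xi @ Gamma.T + zeta) @ np.linalg.inv(np.eye(1) - B).T   # eq. (2.3) solved
Y = eta @ Lam_y.T + 0.3 * rng.normal(size=(n, 2))              # eq. (2.4)
X = xi @ Lam_x.T + 0.3 * rng.normal(size=(n, 2))               # eq. (2.5)
print(np.cov(np.hstack([Y, X]).T).round(2))                    # empirical S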

The Path Diagram

Path diagrams are the graphical representation of the causal assumptions coded in the system of structural equations. They are equivalent to DAGs in the SCM framework and, accordingly, have the same properties. As mentioned before, path analysis began with Wright (1920, 1921, 1934) as a way to compute the causal effects of hypothetical interventions from (non-experimental) observed data. Within the SEM context, latent variables are represented by circular nodes, manifest variables by square nodes, and direct causal (linear) relationships by single-headed arrows; when an unmodeled causal association between two random variables (usually error disturbances) is assumed, a bi-directed arrow links those two


random variables. The absence of a bi-directed arc between two error terms εi and εj reflects the causal assumption (and the corresponding parameter restriction) cov(εi, εj) = 0. An example of a path diagram (in Bollen, 1989) is Figure 2.2.

Figure 2.2. SEM example in Bollen (1989): two correlated exogenous factors ξ1 and ξ2, each measured by two X indicators, and two reciprocally related endogenous factors η1 and η2, each measured by two Y indicators.

The figure is a graphical representation of the structural and measurement equations in (2.3), (2.4) and (2.5), with η = (η1, η2)′, ξ = (ξ1, ξ2)′, Y = (Y1, ..., Y4)′, X = (X1, ..., X4)′, δ = (δ1, ..., δ4)′, ε = (ε1, ..., ε4)′, ζ = (ζ1, ζ2)′, and the following causal assumptions encoded in the matrices:

B = [ 0  β12 ; β21  0 ],     Γ = [ γ11  0 ; 0  γ22 ],

Λ′X = [ 0  0  λ3  λ4 ; λ1  λ2  0  0 ],     Λ′Y = [ 0  0  λ7  λ8 ; λ5  λ6  0  0 ],

with cov(ζ1, ζ2) ≠ 0 in Ψ, and cov(ξ1, ξ2) ≠ 0 in Φ.

The Covariance Matrix

Jöreskog (1967, 1973, 1978) and Bollen (1989), among others, show how the theoretical covariance matrix of the manifest variables, Σ = E(ZZ′) with Z = (Y, X)′, can be written in terms of the set of parameters θ in equations (2.3) to (2.5). The covariance matrix Σ(θ) can thus be expressed as

Σ(θ) = [ ΣYY(θ)  ΣYX(θ) ; ΣXY(θ)  ΣXX(θ) ]     (2.6)

After replacing equations (2.3) to (2.5) in (2.6), we have that

ΣYY(θ) = ΛY (I − B)⁻¹ (ΓΦΓ′ + Ψ) [(I − B)⁻¹]′ Λ′Y + Θε

ΣXY(θ) = ΛX ΦΓ′ [(I − B)⁻¹]′ Λ′Y


ΣY X (θ) = Σ′XY

ΣXX (θ) = ΛX ΦΛ′X + Θδ It is clear from the former equations that in basic SEM, the covariance matrix of manifest variables can be expressed as a linear combination of parameters characterizing the system of equations. The aim is to estimate those structural parameters θ from observed, non-experimental data. More specifically, θ is estimated from the empirical covariance matrix S =

1 N

N

∑ zi z′i , where i = 1, ..., N denotes an individual i ∈ N , and N the

i=1

total number of observations in the sample. In essence, we aim to estimate θ such that the discrepancies between S and Σ(θ) are minimized. Formally, θ is the vector of parameters that satisfies θ ∈ arg min F (θ) ∶= ∣∣S − Σ(θ)∣∣. The method of Maximum Likelihood (ML) is the default in most SEM analysis and software packages. It can be demonstrated (J¨oreskog, 1973; Mulaik, 2009; Kline, 2011) that maximizing an assumed likelihood function is equivalent to minimizing the loss function FM L (θ) = log ∣Σ(θ)∣ + tr {SΣ−1 (θ)} − log ∣S∣ − (p1 + p2 ). As a full-information method, ML estimation aims to find θ such that the differences between log ∣S∣ and log ∣Σ(θ)∣ and between tr {SΣ−1 (θ)} and tr(I) = p1 + p2 are simultaneously minimized. Although robust, ML estimation requires making strong assumptions about multivariate normality on latent variables and disturbance terms. Also, a large sample size is needed in order to obtain unbiased, consistent and asymptotically efficient estimations. Another important features of ML estimators are scale invariance and scale freedom, which means that ML estimators are no bound to the measurement scale of the manifest variables or their correspondent covariance matrix. See J¨oreskog (1973); Chou and Bentler (1995) and Bollen (1989, Chapters 4, 8, 9) for further details on ML estimation in SEM. Other less restrictive but equally robust covariance-based (CB) estimation methods have been proposed in the SEM literature: Ordinary Least Squares (OLS) and Two-Stage Least Squares (2SLS), Generalized Least Squares (GLS), as in Fox (1984) and Johnston and DiNardo (1997); Unweighted/Weighted Least Squares (ULS/WLS), as in Browne (1982, 1984); among others, with their corresponding penalty functions: 1 FOLS (θ) = tr {[S − Σ(θ)]′ [S − Σ(θ)]} 2 1 2 FGLS (θ) = tr {([S − Σ(θ)] W−1 ) } 2 1 FU LS (θ) = tr {[S − Σ(θ)]2 } 2 FW LS (θ) = [s − σ(θ)]′ W−1 [s − σ(θ)] where W is a positive-definite weight matrix (usually W = S in GLS), and s and σ(θ) are a vector of sample variances/covariances and the model implied vector of population elements in Σ(θ), respectively. We do not further present detailed explanations on

CHAPTER 2. CAUSAL INFERENCE THROUGH PARAMETRIC, LINEAR MODELS

36

estimation procedures and fit indexes analysis for linear SEM. However, we refer the reader to comprehensive treatises and handbooks on classic (causal) SEM, such as Goldberger and Duncan (1973); Duncan (1975); Bollen (1989); Hoyle (1995); Spirtes et al. (2000); Hancock and Mueller (2006); Morgan and Winship (2007); Marcoulides and Schumacker (2009); Mulaik (2009); Kline (2011), among others. Irrespective of which estimation method is used, these are still based on the empirical and theoretical covariance matrices. Some CB methods depend on the multivariate normality assumption of observed and latent variables, while others depend on very large sample sizes. Moreover, as pointed out in Arminger and Muth´en (1998) and Lee (2007), basic CB-SEM methods cannot easily cope with nonlinear terms or interactions between variables, and rely on strong assumptions when dealing with dichotomous and ordered categorical data, missing data or small sample size. SEM methods based on raw individuals or random observations, rather than the empirical covariance matrix (see, e.g. Lee and Zhu, 2000, 2002; Skrondal and Rabe-Hesketh, 2004; Mehta and Neale, 2005), have overcome the latter shortcomings and have some advantages: i) estimation methods based on first moments avoid loss functions based on second moments, which are more difficult to handle, ii) direct or indirect estimation of latent variables is more straightforward, and iii) the estimating equations have a direct interpretation, as they are similar to standard regression analyses in most cases.
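To make the mapping θ ↦ Σ(θ) concrete, the following minimal Python sketch (using NumPy; the function names and any parameter values are ours, for illustration only) assembles the model-implied covariance matrix from the structural matrices above and evaluates the ML discrepancy F_ML against an empirical covariance matrix S. Minimizing f_ml over θ with a numerical optimizer would mimic CB-SEM estimation.

```python
import numpy as np

def implied_covariance(B, Gamma, Phi, Psi, Ly, Lx, Theta_eps, Theta_delta):
    """Model-implied covariance Sigma(theta) for the linear SEM above."""
    q1 = B.shape[0]
    A = np.linalg.inv(np.eye(q1) - B)                  # (I - B)^{-1}
    cov_eta = A @ (Gamma @ Phi @ Gamma.T + Psi) @ A.T  # covariance of eta
    Syy = Ly @ cov_eta @ Ly.T + Theta_eps              # Sigma_YY(theta)
    Sxy = Lx @ Phi @ Gamma.T @ A.T @ Ly.T              # Sigma_XY(theta)
    Sxx = Lx @ Phi @ Lx.T + Theta_delta                # Sigma_XX(theta)
    return np.block([[Syy, Sxy.T], [Sxy, Sxx]])

def f_ml(S, Sigma):
    """ML discrepancy F_ML(theta) between S and Sigma(theta)."""
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_S - p
```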

2.2. Bayesian Estimation of Structural Equation Models

Among the individual-based (IB) SEM methods, the Bayesian approach has additional advantages in terms of flexibility, both in estimation and in the assumptions required. For example, in the Social Sciences it is very uncommon to satisfy multivariate normality or to have the large sample size necessary to derive the asymptotic properties of ML estimators, and there is evidence of lack of robustness of ML estimates with small sample sizes (see, for example, Chou et al., 1991; Hoogland and Boomsma, 1998). In contrast, Bayesian analyses of SEM, which depend less on asymptotic theory, produce reliable results even with small sample sizes (Lee and Song, 2004; Lee, 2007). Bayesian SEM also allows for incorporating expert knowledge about the structural parameters and latent variables through restrictions on the prior distribution functions. With the computational advances of recent decades, Bayesian analysis of SEM (and its extensions) has received growing attention from the academic community.

Let M be a causal SEM comprising a path diagram and a set of structural equations with parameters θ. Let Y = {Y1, ..., YN} be the observed data of N individuals in the sample. In the Bayesian approach, θ is considered a random variable with (conditional) distribution function p(θ ∣ M) ≅ p(θ). Also let p(Y, θ ∣ M) be the joint probability density function of Y and θ under M. Based on the well-known Bayes identity (and taking logs), we have

\[
\log p(\theta \mid Y, M) \propto \log p(Y \mid \theta, M) + \log p(\theta \mid M). \tag{2.7}
\]


In equation (2.7), p(θ ∣ Y, M) is known as the posterior distribution, p(Y ∣ θ, M) is the likelihood function, and p(θ ∣ M) is the prior distribution (defined by the researcher). Note that these distributions are model-contingent, that is, they are determined by the structure of the causal model M. Also note that the likelihood function depends on the sample size, while the prior does not: as N increases, the former dominates the latter and the posterior estimates become equivalent to those of the ML approach. Bayesian and ML estimation are therefore asymptotically equivalent.

Within this framework, assume a set of equations as in (2.3), (2.4) and (2.5). Again, let Y = {Y1, ..., YN} and Ω = {ω1, ..., ωN}, with ωi = (ηi′, ξi′)′ for i = 1, ..., N, be the sets of manifest and latent variables in the proposed model. Let θ be the set of unknown parameters in B, Γ, Ψ, Λ_Y, Λ_X, Θ_ε and Θ_δ. In the posterior analysis, Y is augmented with the matrix of latent variables Ω, and a sample from the posterior distribution p(Ω, θ ∣ Y) is then drawn. A sufficiently large number of observations is generated from this distribution using the Gibbs sampler, as follows. At the (j + 1)-th iteration, with current values Ω^(j) and θ^(j):

\[
\text{(i) generate } \Omega^{(j+1)} \text{ from } p(\Omega \mid \theta^{(j)}, Y); \tag{2.8}
\]
\[
\text{(ii) generate } \theta^{(j+1)} \text{ from } p(\theta \mid \Omega^{(j+1)}, Y). \tag{2.9}
\]

Let θ_Y be the parameters in the measurement equations (2.4) and (2.5), and θ_ω those in the structural equation (2.3). It is assumed that the prior distributions of θ_Y and θ_ω are independent, i.e. p(θ) = p(θ_Y) p(θ_ω). Moreover, p(Y ∣ Ω, θ) = p(Y ∣ Ω, θ_Y) and p(Ω ∣ θ) = p(Ω ∣ θ_ω). From equation (2.8) it is clear that the latent variables are generated directly from their posterior distribution. For further details on Bayesian estimation in SEM, refer to Lee (2007) and Song and Lee (2012a,b); we will make use of the notation and procedures explained therein in later chapters. Note, however, that in conventional Bayesian SEM analysis, {Y1, ..., YN} are assumed to be independent. This assumption corresponds to a sample of independent individuals, i.e. no causal interference between observational units, and it does not hold in most Social Science studies.
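Schematically (this is not the estimator developed later in this thesis, and `sample_latent`/`sample_params` are placeholders for model-specific full conditional draws), the data-augmented Gibbs sampler in (2.8)-(2.9) can be organized as follows:

```python
def gibbs_sem(Y, sample_latent, sample_params, theta0, n_iter=5000, burn=1000):
    """Schematic data-augmented Gibbs sampler for Bayesian SEM.

    sample_latent(theta, Y)  -> draw Omega from p(Omega | theta, Y)   (2.8)
    sample_params(Omega, Y)  -> draw theta from p(theta | Omega, Y)   (2.9)
    """
    theta = theta0
    draws = []
    for j in range(n_iter):
        Omega = sample_latent(theta, Y)   # step (i): latent variables
        theta = sample_params(Omega, Y)   # step (ii): parameters
        if j >= burn:
            draws.append(theta)
    return draws  # posterior sample of theta after burn-in
```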

CHAPTER 3

Causally Connected Units

As explained in previous chapters, most of the causal inference literature is based on the assumption of independent units that are not causally connected. This assumption, known as the Stable Unit Treatment Value Assumption (SUTVA), asserts that there is no causal interference among individuals. Using the potential outcome notation introduced in Chapter 1, SUTVA means that the treatment supplied to unit i′, Ti′ = ti′, and/or the outcome variable for unit i′, Yi′(Xi′, ti′), have no effect on unit i's outcome, Yi(Xi, ti), with i ≠ i′ and (i, i′) ∈ N. In other words, Yi(Xi, Ti) ⊥ Ti′ and Yi(Xi, Ti) ⊥ Yi′(Xi′, Ti′). Common causal target parameters of interest (Q(P′)) are evaluated using individual-level information, and therefore their estimation and inference are based on the asymptotic properties of classical statistical estimators (van der Laan, 2014). However, these causal estimators fail to provide causally interpretable estimates of structural parameters when this assumption is violated.

On one hand, SUTVA violation has been extensively explored in the RCM framework, challenging existing causal theory and statistical methodologies. The main issue is that potential outcomes in the RCM are based on counterfactuals, i.e. unobserved random variables that enter the estimation of causal quantities. More formally, assume a sample of i = 1, ..., N units, a binary treatment variable Ti ∈ {0, 1}, and a continuous outcome variable of interest Yi measured for each unit i ∈ N. Let an individual i receive treatment Ti = 1. As explained in previous chapters, the potential outcome of unit i under Ti = 0, Yi(Xi, Ti = 0), is an unobserved variable. The estimation of causal quantities requires a set of observed data and counterfactuals for every individual i in the sample. Under SUTVA, and if T is a binary random variable, each individual i has one, and only one, potential outcome (keeping everything else constant). When SUTVA is violated, however, a larger set of counterfactuals emerges for each individual i ∈ N, and this larger set of potential outcomes renders the causal quantity of interest Q(P′) unidentifiable. That is, for a unit i that received treatment Ti = 1, the counterfactual outcome Yi(Xi, Ti = 0, ⋯) depends not only on Ti = 0 but also on the treatments assigned to/received by the remaining units i′ ≠ i, for all i′ ∈ {N ∖ i}. For ease of notation, let the subscript −i = {N ∖ i} denote a vector of treatments or outcomes for all other units in N but i, e.g. T−i = {t1, ..., ti−1, ti+1, ..., tN}. The potential outcome of unit i is then Yi(Xi, Ti = 0, T−i = t−i). In this particular example (assuming binary treatments), SUTVA violation in a sample of N individuals leads to 2^N potential outcomes for each unit. Besides, Sobel (2006) proved that in the presence of SUTVA violation, estimates of causal parameters (such as the ATE) are not informative about the sole effect of an intervention on the outcome variable; instead, they are the sum of two distinct effects, one for the intended intervention and one for the spillover effects. When these are not analyzed separately, results can yield misleading conclusions about the effectiveness of a treatment. It is clear that SUTVA violation calls for new statistical methodologies and new causal parameters of scientific interest.

The violation of the SUTVA assumption has been extensively studied within the RCM framework. A series of papers by Halloran and Struchiner (1995); Sobel (2006); Rosenbaum (2007); Hudgens and Halloran (2008); VanderWeele and Tchetgen Tchetgen (2011); Tchetgen Tchetgen and VanderWeele (2012); VanderWeele et al. (2012); Liu and Hudgens (2014); Lundin and Karlsson (2014), among others, approached this issue by extending Rubin's potential outcome notation and by defining new causal target parameters. The problem is presented as follows. Assume a sample of N individuals divided into G groups (clusters), each with Ng units, g = 1, ..., G (i.e. N = ∑_{g=1}^{G} Ng). Let Tig denote the treatment of individual i ∈ g, and let Tg = (T1,g, ..., TNg,g) be the observed treatment program assigned to cluster g. The same notation applies to the outcome of interest, Yig and Yg. Also, let T(Ng) be the set of all possible allocations of length Ng; that is, if treatment is a binary random variable, Tg is one of the 2^{Ng} possible orderings of the allocation program in T(Ng). For each cluster g, we assume a set of counterfactuals Yg(⋅) = {Yg(tg) : tg ∈ T(Ng)}. Therefore, the observed set of outcomes Yg is, in fact, a function of the actual treatment program, i.e. Yg = Yg(Tg). Note that Yg = {Yig(Tg) : i ∈ g}, which means that the individual potential outcome for a unit i might depend not only on its own treatment but on the entire allocation program, i.e. Yig(Tg), as opposed to the classical RCM paradigm, where Yig(Tig). In order to compute the missing counterfactuals and estimate the new target causal parameters of interest, SUTVA violation requires additional assumptions. First, for simplicity, the authors assume perfect compliance. Second, they resort to the assumption of partial interference (Sobel, 2006), which ensures that the outcome variable of unit i, Yi(⋅), depends only on the treatments assigned to units belonging to the same cluster as i, i.e. units i′ ∈ g with i′ ≠ i. Third, a two-stage randomization program/treatment assignment is assumed: clusters are randomly assigned to an allocation scheme and then, within each cluster, units are randomly assigned to a specific treatment value. We do not go further into SUTVA violation in the RCM framework, but we encourage the reader to check the aforementioned references for the formal definitions of the parametric causal parameters and the estimation methods therein presented. In summary, the study of SUTVA violation within the RCM called for new definitions of, and estimators for, causal parameters.
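The combinatorial growth of the counterfactual set is easy to see numerically; the following short Python sketch (illustrative only) enumerates the binary allocation set T(Ng):

```python
from itertools import product

def allocation_set(n_g):
    """All binary treatment programs for a cluster of size n_g,
    i.e. the set T(N_g), which has 2**n_g elements."""
    return list(product([0, 1], repeat=n_g))

for n_g in (2, 3, 10):
    print(n_g, "units ->", len(allocation_set(n_g)), "allocation programs")
# Under SUTVA each unit has 2 potential outcomes; under full interference,
# unit i's outcome may depend on the whole program, giving 2**n_g
# potential outcomes per unit.
```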


Akin to the latter approach, Hong and Raudenbush (2003, 2005, 2006, 2013) and Verbitsky-Savitz and Raudenbush (2012) presented an extension of propensity score matching (PSM) techniques (Rosenbaum and Rubin, 1983) to the case in which data are hierarchically organized and the SUTVA assumption is violated. Built on the assumption of strong ignorability (exogeneity between the treatment assignment mechanism and the potential outcomes, given observed covariates and outcome), these methods resort to linear models as means to estimate causal quantities. The idea is that in multilevel settings, units are causally connected not only through the clustering effect, but also because of possible interactions between units within each group. We use the same notation presented above: let Yig, Tig, Yg, Tg, T−i,g be the same random variables (vectors) as described earlier. The multilevel PSM framework mimics the two-stage stratification and random treatment assignment assumptions already discussed. However, in this approach the potential outcome Yig(Tig, T−i,g) is assumed to be a linear function of its arguments. Also, given the multiplicity of possible values that Tg and T−i,g can attain, the authors propose a summary function r(⋅) that maps the peers' allocation program to a scalar summarizing its effect on the outcome of unit i, i.e. Yig(Tig, T−i,g) = Yig(Tig, r(T−i,g)) (see the sketch below). This approach is an improvement in the sense that it simplifies the analysis and reduces the number of arguments of the potential outcome function.

On the other hand, less has been proposed within the SCM framework. An initial approach was presented by van der Laan (2012); VanderWeele and An (2013); Ogburn and VanderWeele (2014); van der Laan (2014). The authors developed nonparametric TMLE estimators of causal quantities in settings where the SUTVA assumption is violated and an allocation program varies over time. Their approach, however, is based on the study of networks of causally connected units, which means that it assumes/requires a previously known network structure. Notwithstanding, we acknowledge the way the authors used causal diagrams (DAGs) and the rules of structural causal inference in order to identify and estimate causal parameters of interest in a non-experimental setting with interrelated units. We also remark that the TMLE method represents a fully non-parametric and flexible approach to the estimation of causal target parameters of interest (Pearl, 2009b).

In this thesis we use a hierarchical-SEM (HSEM) based approach to model SUTVA violation within the SCM, partly because it is easier to understand the causal mechanisms in M through the use of causal diagrams. Also, the linear SEM framework presented in Chapter 2 allows for a straightforward computation of direct and indirect effects. The estimated structural parameters give a notion of the magnitude and strength of the causal linkages between the random variables in the study, something that is missing in the non-parametric TMLE approach. Although this topic is not covered in this thesis, we believe that the computation of direct and indirect effects of cluster-based interventions and/or causally connected units can be extended from what was presented in Chapter 2 (i.e. using estimates of structural parameters) through the proposed HSEM framework. Finally, we do not assume a previously known network structure, mainly because in many Social Science studies this information is not available to the researcher.
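As a concrete illustration of the summary-function device r(⋅) discussed above, the following sketch uses the proportion of treated peers, which is one plausible (hypothetical) choice of summary; Hong and Raudenbush discuss summaries of the peers' treatment setting of this kind.

```python
import numpy as np

def r_share_treated(T_minus_i):
    """One plausible summary function r(.) of the peers' treatment vector
    T_{-i,g}: the proportion of treated peers (illustrative choice)."""
    return float(np.mean(T_minus_i))

T_g = np.array([1, 0, 1, 1, 0])       # treatment program in cluster g
i = 2                                  # focal unit
T_minus_i = np.delete(T_g, i)          # peers' treatments T_{-i,g}
print(r_share_treated(T_minus_i))      # -> 0.5
```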


Recall that within the SCM, given a causal model M, direct and indirect causal effects are computed using the linear SEM estimates of the structural parameters. However, in the context of SUTVA violation (i.e. clustered, non-independent units), conventional linear SEM estimators would yield inefficient estimates of the parameters, and therefore the computation of direct and indirect effects would be distorted and possibly biased. If the clustered design is not considered in the model specification, then inferences on the parameter estimates will be wrong, since the standard errors of the coefficients will be underestimated (i.e. confidence intervals that are too short, making the researcher prone to Type I errors). Also, multilevel modeling allows for estimating cluster-specific structural parameters (or random effects). These group-specific coefficients prove useful when the researcher has a substantive interest in group-specific treatment (causal) effects. That said, we believe that the core ideas of the hierarchical modeling approach in Hong's and Raudenbush's work provide a clear point of departure for modeling causal relationships between random variables in a non-experimental setting with causally connected units.

Throughout the rest of this thesis, we extend the notation presented for the SEM framework in Chapter 2. Let N be a sample of individuals clustered within G groups, each with Ng individuals (as above). We assume that every single unit i belongs to one, and only one, group g, i.e. no multiple membership is allowed. We add the subscript g to the notation in Chapter 2 to denote cluster membership, i.e. Yig, Xig, ξig, ηig denote the vectors of endogenous and exogenous measured and latent variables for unit i ∈ g, respectively. Given the multilevel design, we let units i and i′, with i, i′ ∈ g, be causally connected if the outcome variables of both units share common causes varying at the group level. That is, i and i′ are causally connected if there exists a direct or indirect causal path between them, usually through the cluster level. We extend the causal diagram framework introduced in Chapters 1 and 2 to a setting that graphically represents the hierarchical structure of the causal model M, using observed, non-experimental data. The extension of the path diagram uses the plate notation, first introduced by Rabe-Hesketh et al. (2004) in the context of multilevel linear models with latent variables. As an example, assume a two-level dataset with latent variables. In Figure 3.1, the outer block represents a cluster g ∈ G, and each inner block represents an individual i, i′ ∈ g. Units i and i′ are causally connected through the causal path ηig ← ηg → ηi′g (as in the SCM framework presented earlier). In this case, ηg d-separates ηig and ηi′g: once we 'know' the value of the cluster-level random variable ηg, the variables varying at the individual level become independent, i.e. ηig ⊥ ηi′g ∣ ηg.

It is clear that HSEMs are a way of specifying a statistical model structure when the SUTVA assumption is violated and the sample consists of N clustered, causally connected units. If a linear specification is imposed, each path in Figure 3.1 represents a structural parameter whose estimates can be used to compute direct and indirect causal effects of a cluster-level intervention of the form do(Xg = xg), or of individual-level interventions of the form do(Xig = xig). As explained in Chapter 2, HSEMs are also capable of estimating counterfactuals from non-experimental, hierarchical data. We introduce HSEMs and their estimation methods in the following subsections. We emphasize, however, that the computation of direct or indirect causal effects in a multilevel setting is not presented in this thesis.
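The d-separation claim can be checked numerically. In the following sketch (with arbitrary coefficients, for illustration only), ηig and ηi′g are marginally correlated through their common cause ηg, while their partial correlation given ηg is approximately zero:

```python
import numpy as np

rng = np.random.default_rng(1)
G = 2000                                    # number of clusters
eta_g = rng.normal(size=G)                  # cluster-level latent variable
eta_i = 0.8 * eta_g + rng.normal(size=G)    # unit i in cluster g
eta_j = 0.8 * eta_g + rng.normal(size=G)    # unit i' in the same cluster

print(np.corrcoef(eta_i, eta_j)[0, 1])      # sizeable marginal correlation
# Residualize on eta_g: partial correlation given the common cause
r_i = eta_i - 0.8 * eta_g
r_j = eta_j - 0.8 * eta_g
print(np.corrcoef(r_i, r_j)[0, 1])          # approximately zero
```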

[Figure 3.1: two-level path diagram in plate notation. The outer plate indexes clusters g, the inner plates index individuals i and i′ within g, and the cluster-level latent variable ηg points to the individual-level latent variables ηig and ηi′g, each measured by indicators X1, ..., Xk.]

Figure 3.1. HSEM example introducing the plate notation.

3.1. Hierarchical Structural Equation Models (HSEM)

Hierarchical SEM (HSEM) is not a new concept. First attempts focused on estimating two separate models (for a two-level model) for the between- and within-cluster covariance matrices, as in Longford and Muthén (1992) and Muthén (1989, 1994). Almost simultaneously, a different approach was undertaken by Goldstein and McDonald (1988); McDonald and Goldstein (1989) and Lee (1990), who developed a joint ML estimation procedure for both the within-cluster and between-cluster covariance matrices in a two-level model. Despite these efforts, the latter papers assumed continuous data and (almost always) balanced designs. Later, Ansari et al. (2000) developed a Bayesian approach with MCMC methods, and Lee and Shi (2001) an ML approach using an MC-EM algorithm, both limited to a two-level factor model (with no structural part). Shortly after, Rabe-Hesketh et al. (2004, 2007, 2012) overcame these limitations and formulated the Generalized Linear Latent and Mixed Modeling framework (GLLAMM). Rabe-Hesketh and Skrondal (2012) show the correspondence between mixed modeling and multilevel modeling (Raudenbush and Bryk, 2002). The main difference from the previous approaches to HSEM is that the estimation procedure in their model is IB rather than CB. GLLAMM-SEM is a maximum likelihood estimation procedure in which the likelihood function is optimized using first-order approximations and the Newton-Raphson algorithm for the measurement equations, and in which the latent variables are integrated out using adaptive quadrature in the structural equations. For further detail see Skrondal and Rabe-Hesketh (2004). We recall that the link between the CB-SEM approach and GLLAMM-SEM is presented in Rabe-Hesketh et al. (2007).

Conditional on the latent variables (belonging to higher or traversal levels), the response model (or measurement equation) in Rabe-Hesketh et al. (2004) follows closely the linear prediction equation in the Generalized Linear Model (GLM) framework of McCullagh and Nelder (1989). The linear predictor is accompanied by a link function chosen according to a distribution from the exponential family. In GLLAMM-SEM, the measurement equations depend on the assumed distribution of the observed responses, conditional on the latent variables and covariates. In the structural part, the latent variables are regressed on other latent variables and (possibly) observed exogenous covariates. Exogenous disturbances allow for unobserved variability among the latent and observed variables. As in most SEMs, restrictions on the parameters are imposed to achieve identifiability.

More formally, let N individuals or observational units (level 1) be organized in independent clusters within L levels of hierarchy. At each level l = 2, ..., L we observe Gl clusters, each with nGl individuals. The subscript Gl indicates membership of an individual i in cluster Gl at level l. Given the full information up to level L, Rabe-Hesketh et al. assume independent clusters within level l, an assumption known as partial interference, or no interference between clusters (Sobel, 2006). We keep this key assumption throughout the rest of this document. We also assume that each observation belongs to one, and only one, cluster at any given level of the hierarchy; cases of multiple membership are not considered here.

As for the measurement equation, let ν_j(l) be the [i × p] × 1 vector of p responses for every unit i (level 1) belonging to unit-group j at level l. Also, let X_j(l) be a [i × p] × K matrix of exogenous variables at the individual level and β a K × 1 vector of parameters. Moreover, let Λ_j(l)^(l) and η_j(l)^(l) be the [i × p] × q^(l) matrix of structural parameters (factor loadings) and the q^(l) × 1 vector of latent variables varying at level (l) (following the superscript) for the cluster (or l-level unit) j(l) (following the subscript), respectively. For a unit (cluster) z(L) at the top level L, the measurement equation is

\[
\nu_{z(L)} = X_{z(L)} \beta + \sum_{l=2}^{L} \Lambda_{z(L)}^{(l)} \eta_{z(L)}^{(l)} \tag{3.1}
\]

Note that, as Skrondal and Rabe-Hesketh (2004) point out, the superscript (l) in Λ_{z(L)}^(l) denotes that the matrix of structural coefficients is specific to the level-l latent variables. Equation (3.1) can be expressed more succinctly by replacing the sum with the appended matrix Λ_{z(L)} = [Λ_{z(L)}^(2), ..., Λ_{z(L)}^(L)] and the stacked vector η_{z(L)} = [η_{z(L)}^(2)′, ..., η_{z(L)}^(L)′]′:

\[
\nu_{z(L)} = X_{z(L)} \beta + \Lambda_{z(L)} \eta_{z(L)} \tag{3.2}
\]

A very important assumption in Skrondal and Rabe-Hesketh (2004) and Rabe-Hesketh et al. (2004) is that the latent variables varying at level l, η^(l), follow a multivariate Gaussian distribution N[0, Σ^(l)], with Σ^(l) a positive semi-definite covariance matrix for each l = 2, ..., L (i.e. latent variables at level l are not necessarily independent). Also, latent variables at different levels are assumed to be independent. Now, conditional on the vector of all realizations of the latent variables at the L levels, η^(L), the vector of response variables for all individuals, a, is linked to the linear predictor, ν, through a continuous and twice-differentiable function g, known as the link function:

\[
g\left(E\left[a \mid X, \eta^{(L)}\right]\right) = \nu \tag{3.3}
\]

Note that the specification in equation (3.3) allows for modeling both continuous and discrete distributions from the exponential family for the response variables.

As for the latent variable equations, for each unit-cluster j at level l, Skrondal and Rabe-Hesketh (2004) assume a linear system of the reduced form

\[
\eta_{j(l)} = B \eta_{j(l)} + \Gamma w_{j(l)} + \zeta_{j(l)} \tag{3.4}
\]

In equation (3.4), η_{j(l)} is the same as in (3.2), but for the j-th l-level unit-cluster; B is a q × q upper-block-diagonal matrix of regression parameters, with q = ∑_{l=2}^{L} q_l; w_{j(l)} = [w_{jk..z}^(2)′, w_{k..z}^(3)′, ..., w_z^(L)′]′ is a vector of r = ∑_{l=2}^{L} r_l covariates; Γ is a q × r matrix of regression parameters; and ζ_{j(l)} is the vector of q disturbance terms. Recall that each element of ζ_{j(l)} varies at the same level l as the corresponding latent variable in η_{j(l)}. As shown in Skrondal and Rabe-Hesketh (2004), the extensive representation of equation (3.4) is

\[
\begin{pmatrix} \eta_{jk..z}^{(2)} \\ \eta_{k..z}^{(3)} \\ \vdots \\ \eta_{z}^{(L)} \end{pmatrix}
=
\begin{pmatrix}
B^{(22)} & B^{(23)} & \cdots & B^{(2L)} \\
0 & B^{(33)} & \cdots & B^{(3L)} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & B^{(LL)}
\end{pmatrix}
\begin{pmatrix} \eta_{jk..z}^{(2)} \\ \eta_{k..z}^{(3)} \\ \vdots \\ \eta_{z}^{(L)} \end{pmatrix}
+
\begin{pmatrix}
\Gamma^{(22)} & \Gamma^{(23)} & \cdots & \Gamma^{(2L)} \\
0 & \Gamma^{(33)} & \cdots & \Gamma^{(3L)} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Gamma^{(LL)}
\end{pmatrix}
\begin{pmatrix} w_{jk..z}^{(2)} \\ w_{k..z}^{(3)} \\ \vdots \\ w_{z}^{(L)} \end{pmatrix}
+
\begin{pmatrix} \zeta_{jk..z}^{(2)} \\ \zeta_{k..z}^{(3)} \\ \vdots \\ \zeta_{z}^{(L)} \end{pmatrix}
\tag{3.5}
\]

Also, Skrondal and Rabe-Hesketh (2004) assume a recursive SEM (i.e. no feedback effects, as explained in Bollen, 1989); therefore each B^(ll′) and Γ^(ll′) is a matrix with diag(B^(ll)) = diag(Γ^(ll)) = 0. This disposition of B also implies that lower-level latent variables have no effect on higher-level latent variables.

The estimation procedure for HSEM in the GLLAMM framework is based on maximum marginal likelihood estimation using adaptive quadrature rules. Let θ be the vector of all parameters to be estimated, including β, Λ_{z(l)}, the non-duplicated elements of Σ^(l) for l = 2, ..., L, B and Γ. Also let y and X be the response vector and the matrix of explanatory variables for all units in the sample. The conditional likelihood of y given the latent and observed variables is obtained after substituting the structural (latent variable) model into the response (measurement) model. Let the conditional density of the response vector for an l-level unit be f^(l)(y^(l) ∣ X^(l), ζ^(l+); θ), where ζ^(l+) = (ζ^(l)′, ..., ζ^(L)′)′ is the vector of latent variable errors for level l and above. Moreover, let the multivariate normal density of the latent variables at level l be h^(l)(ζ^(l); θ). Then the multivariate density for an l-level unit, conditional on the latent variables at level l + 1 and above, is

\[
f^{(l)}\left(y^{(l)} \mid X^{(l)}, \zeta^{(l+1+)}; \theta\right) = \int h^{(l)}\left(\zeta^{(l)}; \theta\right) \prod f^{(l-1)}\left(y^{(l-1)} \mid X^{(l-1)}, \zeta^{(l+)}; \theta\right) d\zeta^{(l)}, \tag{3.6}
\]

Note the recursive nature of equation (3.6): the likelihood function for upper levels of the hierarchy is computed using information from lower levels. Furthermore, the total marginal likelihood is the product of the contributions from all highest-level (L) units:

\[
L(\theta; y, X) = \prod f^{(L)}\left(y^{(L)} \mid X^{(L)}; \theta\right). \tag{3.7}
\]

The Newton-Raphson algorithm is used to maximize the marginal log-likelihood derived from (3.7). For a given set of parameters θ, the multivariate integral over the latent variables at level l, ζ^(l), is evaluated numerically using an adaptive quadrature rule. The integral is evaluated over q_l independent, standard normally distributed latent variables v^(l), with ζ^(l) = C_l v^(l), where C_l is the Cholesky decomposition of Σ^(l). By letting v^(l+) = (v^(l)′, ..., v^(L)′)′, the integral can be approximated by Cartesian product quadrature as

\[
\begin{aligned}
\int h^{(l)}\left(\zeta^{(l)}; \theta\right) &\prod f^{(l-1)}\left(y^{(l-1)} \mid X^{(l-1)}, \zeta^{(l+)}; \theta\right) d\zeta^{(l)} \\
&= \int \phi\left(v_{q_l}^{(l)}\right) \cdots \int \phi\left(v_{1}^{(l)}\right) \prod f^{(l-1)}\left(y^{(l-1)} \mid X^{(l-1)}, v^{(l)}, v^{(l+1+)}; \theta\right) dv_1 \cdots dv_{q_l} \\
&\approx \sum_{r_{q_l}} \pi_{r_{q_l}} \cdots \sum_{r_1} \pi_{r_1} \prod f^{(l-1)}\left(y^{(l-1)} \mid X^{(l-1)}, \alpha_{r_1}, ..., \alpha_{r_{q_l}}, v^{(l+1+)}; \theta\right),
\end{aligned}
\]

where φ(⋅) is the standard Gaussian density, r_{q_l} is the number of quadrature points for latent variable q_l, and π_r and α_r are quadrature weights and locations, respectively. The reader may refer to Rabe-Hesketh et al. (2002, 2004, 2005) and Skrondal and Rabe-Hesketh (2004) for further details on HSEM estimation within the GLLAMM framework.
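To make the quadrature idea concrete, the following sketch applies ordinary (non-adaptive) Gauss-Hermite quadrature to a hypothetical one-dimensional random-intercept model; the actual GLLAMM procedure is adaptive and far more general, so this is only an illustration of the integration step:

```python
import numpy as np
from scipy.stats import norm

def cluster_marginal_loglik(y, sigma_zeta, n_points=15):
    """Gauss-Hermite approximation of one cluster's marginal likelihood
    for the toy model y_i = zeta_g + e_i, e_i ~ N(0, 1),
    zeta_g ~ N(0, sigma_zeta**2), with zeta_g integrated out."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    zeta = np.sqrt(2.0) * sigma_zeta * nodes        # change of variables
    cond = np.array([norm.pdf(y, loc=z, scale=1.0).prod() for z in zeta])
    return np.log((weights * cond).sum() / np.sqrt(np.pi))

y_g = np.array([0.3, -0.1, 0.8])                    # toy responses, one cluster
print(cluster_marginal_loglik(y_g, sigma_zeta=0.7))
```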

3.2. Bayesian Estimation of HSEM

HSEM estimation within the GLLAMM framework has some drawbacks. First, the latent variables are integrated out of the analysis and there is no direct computation of their values (although values may be assigned afterwards; see Skrondal and Rabe-Hesketh, 2004, Chapter 7). Also, the basic algorithm is computationally intensive and is not suitable for modeling nonlinear relationships between latent variables. A Bayesian alternative to cope with these issues in HSEM was first presented in Song and Lee (2004), who built upon the work of Ansari and Jedidi (2000); Ansari et al. (2000); Dunson (2000), who presented Bayesian procedures for the estimation of factor (latent variable) models. However, these multilevel models did not cope with cross-level effects, that is, causal information flowing from latent variables at the group level to latent variables at the individual level. Lee and Tang (2006) and Song and Lee (2012a, Chapter 9) present an HSEM with cross-level effects that allows group characteristics to have a causal impact on the behavior of individuals. The authors present a two-level HSEM framework that is easily extended to an arbitrary number of levels L. We use a notation different from the one presented in the original papers in order to make GLLAMM-HSEM and Bayesian HSEM comparable.

Let aig be a p × 1 vector of random variables from unit i belonging to group g. Each group (cluster) g = 1, ..., G comprises i = 1, ..., Ng individuals. Lee and Tang (2006) and Song and Lee (2012a, Chapter 9) assume a measurement equation that relates the observed variables to the latent variables at the individual and group levels:

\[
a_{ig} = A x_{ig} + \Lambda_1 \omega_{1,ig} + \Lambda_2 \omega_{2,g} + \epsilon_{ig} \tag{3.8}
\]

where x_ig is a k × 1 vector of exogenous variables for unit i and A is a p × k matrix of fixed parameters. Moreover, let Λ1 be a p × q1 matrix of factor loadings and ω_{1,ig} a q1 × 1 random vector of latent variables varying at the first level, Λ2 a p × q2 matrix of factor loadings for latent variables at level 2, ω_{2,g} a q2 × 1 vector of second-level latent variables, and ε_ig a p × 1 random vector of measurement errors with distribution N(0, Ψ_ε). In this model, ε_ig is assumed to be independent of both ω_{1,ig} and ω_{2,g}, and Ψ_ε is assumed to be diagonal. For the sake of simplicity, the authors assume a factor model for the second-level measurement equations. Notwithstanding, the extension of (3.8) to observations at any level l ∈ L is straightforward. Let (l) denote the level at which the observed variables, latent variables and factor loadings vary. Then, for an arbitrary unit i belonging to group j, up to group z at an arbitrary level l, equation (3.8) can be rewritten as

\[
a_{ig,...,z} = \sum_{l=1}^{L} A^{(l)} x_{z}^{(l)} + \sum_{l=1}^{L} \Lambda^{(l)} \omega_{(l),z} + \epsilon_{ig,...,z} \tag{3.9}
\]

Now, for a two-level HSEM, let ω_{1,ig} = (η_{1,ig}′, ξ_{1,ig}′)′ be a partition of ω_{1,ig}, where η_{1,ig} is a q11 × 1 vector of endogenous latent variables and ξ_{1,ig} is a q12 × 1 vector of exogenous latent variables, both at the first level of the hierarchy. Song and Lee (2012a, Chapter 9), which presents a more general version of Lee and Tang (2006), considers the following (nonlinear) structural equation:

\[
\eta_{1,ig} = \Gamma F\left(\xi_{1,ig}, \omega_{2,g}\right) + \delta_{1,ig} \tag{3.10}
\]

where F(ξ_{1,ig}, ω_{2,g}) = [f_1(ξ_{1,ig}, ω_{2,g}), ..., f_m(ξ_{1,ig}, ω_{2,g})]′ is an m × 1 vector of known, differentiable, nonzero functions f_1, ..., f_m, with m ≥ max{q12, q2}, and Γ is a q11 × m matrix of unknown coefficients. In addition, ξ_{1,ig} and δ_{1,ig} are assumed to be distributed N(0, Φ1) and N(0, Ψ_δ), respectively, with Ψ_δ a diagonal matrix; moreover, δ_{1,ig} is assumed to be independent of ξ_{1,ig} and ω_{2,g}. Yet again, the extension of the structural equation system to an arbitrary number of levels L is straightforward.

Due to the complexity of the correlation structure between the latent and manifest terms of this model, Lee and Tang (2006) and Song and Lee (2012a) draw upon data augmentation techniques (Tanner and Wong, 1987) in their Bayesian algorithm. The idea behind this procedure is to complement the observed data with latent constructs from previous MCMC steps. The traditional Gibbs sampler (Geman and Geman, 1984) and Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970) algorithms are used, similar to those in Section 2.2. The reader may refer to the original papers for deeper explanations of the Bayesian estimation of HSEM with cross-level effects. Despite the advantages of this framework for modeling causal relationships between random variables (both manifest and latent) in the presence of causally connected units, one drawback is the assumption of a known set of functional forms F(⋅) in the structural part of the HSEM. We aim to overcome this issue by extending the Bayesian HSEM framework with a semi-parametric structure for the structural part of the model.

This extended version of the HSEM implicitly assumes unknown functional forms for the causal relationships between latent variables, both at the same and at higher levels.
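Before introducing the semi-parametric extension, a small illustration may help fix ideas about equation (3.10). The sketch below is illustrative only: the coefficients and the choice F(ξ, ω2) = (ξ, ξ², ω2)′ are hypothetical, not taken from Lee and Tang. It evaluates one draw of a first-level endogenous latent variable under a known nonlinear F:

```python
import numpy as np

rng = np.random.default_rng(3)

def eta_draw(xi, omega2, Gamma, psi_delta):
    """One draw of eta_{1,ig} = Gamma F(xi, omega2) + delta (eq. 3.10),
    with the hypothetical basis F(xi, omega2) = (xi, xi**2, omega2)'."""
    F = np.array([xi, xi**2, omega2])               # m = 3 known functions
    delta = rng.normal(scale=np.sqrt(psi_delta))    # delta ~ N(0, psi_delta)
    return Gamma @ F + delta

Gamma = np.array([[0.5, -0.3, 0.8]])                # q11 = 1, m = 3
print(eta_draw(xi=0.4, omega2=-1.2, Gamma=Gamma, psi_delta=0.25))
```

The SPHSEM developed in the next chapter removes the need to specify F in advance by replacing each component with a penalized spline.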

CHAPTER 4

A Semi-Parametric Hierarchical Structural Equation Model (SPHSEM)

4.1. The observed random variables

The proposed model builds upon the developments presented in Song et al. (2013), Lee and Tang (2006), Song and Lee (2012a, Chapter 9) and Lee (2007), among others. First, consider a hierarchically structured dataset with an arbitrary number L of levels. Let aig be a p × 1 vector of random variables for an individual i = 1, ..., Ng (level 1) belonging to a group g = 1, ..., G (level 2), which in turn can be part of another group in a higher hierarchy (level 3 onwards). Subscripts for other levels are omitted for ease of notation. Note that Ng might differ across g, which means that this framework allows for unbalanced designs. For now, we assume a two-level setting, but the extension to a larger number of levels is straightforward. Random variables in aig can be ordered categorical (zig), continuous (yig), count (vig), or unordered categorical (uig). Therefore, without loss of generality, assume

\[
a_{ig} = (a_{ig,1}, ..., a_{ig,p})' = (z_{ig,1}, ..., z_{ig,r_1}, y_{ig,r_1+1}, ..., y_{ig,r_2}, v_{ig,r_2+1}, ..., v_{ig,r_3}, u_{ig,r_3+1}, ..., u_{ig,r_4})'.
\]

In order to provide a clear framework for the generalized SPHSEM, let a*ig be an underlying vector defined as

\[
a_{ig}^* = (a_{ig,1}^*, ..., a_{ig,p}^*)' = (z_{ig,1}^*, ..., z_{ig,r_1}^*, y_{ig,r_1+1}^*, ..., y_{ig,r_2}^*, v_{ig,r_2+1}^*, ..., v_{ig,r_3}^*, w_{ig,r_3+1}', ..., w_{ig,r_4}')',
\]

such that a*ig is linked to aig as follows:

\[
\begin{cases}
z_{ig,k} = g_k(z_{ig,k}^*), & k = 1, ..., r_1, \\
y_{ig,k} = g_k(y_{ig,k}^*), & k = r_1 + 1, ..., r_2, \\
\kappa_{ig,k} = g_k(v_{ig,k}^*), & k = r_2 + 1, ..., r_3, \\
u_{ig,k} = g_k(w_{ig,k}), & k = r_3 + 1, ..., r_4,
\end{cases}
\]

where κ_{ig,k} = E(v_{ig,k}) for all k = r2 + 1, ..., r3, and the gk(⋅)'s are the threshold, identity, exponential and multinomial probit link functions, respectively. Following Muthén (1984), Lee and Song (2003a) and Lee (2007), for ordered categorical variables, i.e. z_{ig,k} for k = 1, ..., r1, which take integer values in the set {0, 1, ..., Zk − 1}, gk(⋅) is the threshold link function defined as

\[
z_{ig,k} = g_k(z_{ig,k}^*) = \sum_{q=0}^{Z_k - 1} q \times I_{[\alpha_{k,q}, \alpha_{k,q+1})}(z_{ig,k}^*), \tag{4.1}
\]

where I(⋅) is an indicator function that takes the value 1 whenever z*_{ig,k} ∈ [α_{k,q}, α_{k,q+1}), and 0 otherwise. For each ordered categorical variable we define a set of thresholds {−∞ = α_{k,0} < α_{k,1} < ⋯ < α_{k,Z_k−1} < α_{k,Z_k} = +∞} that defines the Zk categories.

For continuous variables, i.e. y_{ig,k} ∈ R for k = r1 + 1, ..., r2, gk(⋅) is the identity link function, defined as

\[
y_{ig,k} = g_k(y_{ig,k}^*) = y_{ig,k}^*. \tag{4.2}
\]

For count variables, i.e. v_{ig,k} for k = r2 + 1, ..., r3, it is commonly assumed that v_{ig,k} ∼ Poisson(κ_{ig,k}). As in the generalized linear model framework presented by McCullagh and Nelder (1989), gk(⋅) is the log link, defined as

\[
\log(\kappa_{ig,k}) = g_k^{-1}\left(E(v_{ig,k})\right) = v_{ig,k}^*, \quad \text{or} \quad \kappa_{ig,k} = E(v_{ig,k}) = g_k(v_{ig,k}^*) = \exp(v_{ig,k}^*). \tag{4.3}
\]

Finally, for the unordered categorical variables, i.e. u_{ig,k} for k = r3 + 1, ..., r4, we assume they take values in the set {0, 1, ..., Uk − 1}. For the sake of simplicity, and as in Song et al. (2013), we assume Uk = U for all k; as the original authors state, this assumption can easily be relaxed. Following Imai and van Dyk (2005) and Song et al. (2007), u_{ig,k} is modeled in terms of an unobserved continuous multivariate normal random vector w_{ig,k} = (w_{ig,k,1}, ..., w_{ig,k,U_k−1})′, such that

\[
u_{ig,k} = g_k(w_{ig,k}) = \begin{cases} 0, & \text{if } \max(w_{ig,k}) \le 0 \\ u', & \text{if } \max(w_{ig,k}) = w_{ig,k,u'} > 0 \end{cases} \tag{4.4}
\]

To clarify the idea, consider an unordered categorical variable u_{ig,k} with three categories, so that w_{ig,k} = (w_{ig,k,1}, w_{ig,k,2})′. If both w_{ig,k,1} ≤ 0 and w_{ig,k,2} ≤ 0, then u_{ig,k} takes the value 0. Accordingly, if w_{ig,k,1} > w_{ig,k,2} and w_{ig,k,1} > 0, then u_{ig,k} = 1. Finally, if w_{ig,k,2} > w_{ig,k,1} and w_{ig,k,2} > 0, then u_{ig,k} = 2. A generalization to U − 1 (or Uk − 1) categories is straightforward.
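A compact sketch of the four link mappings in (4.1)-(4.4) may be useful (the thresholds and category counts below are arbitrary, for illustration only):

```python
import numpy as np

def g_ordered(z_star, alphas):
    """Threshold link (4.1): alphas = (-inf, a_1, ..., a_{Zk-1}, +inf)."""
    return int(np.searchsorted(alphas, z_star, side="right") - 1)

def g_continuous(y_star):
    """Identity link (4.2)."""
    return y_star

def g_count_mean(v_star):
    """Log link (4.3): returns kappa = E(v) = exp(v*)."""
    return np.exp(v_star)

def g_unordered(w):
    """Multinomial probit link (4.4): w has U_k - 1 components."""
    return 0 if w.max() <= 0 else int(w.argmax() + 1)

alphas = np.array([-np.inf, -0.5, 0.5, np.inf])   # Z_k = 3 categories
print(g_ordered(0.1, alphas))                     # -> 1
print(g_unordered(np.array([-0.2, 0.7])))         # -> 2, as in the example
```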

4.2. The measurement equations

Having defined the nature of the manifest random variables for every individual i in group g, we set up measurement equations consistent with the different types of observed data. We extend the semiparametric SEM in Song et al. (2013) by combining features of the multilevel SEM (with cross-level effects) considered in Lee and Tang (2006), Lee (2007), Song and Lee (2004, 2012a), and to some extent in Rabe-Hesketh et al. (2004), Skrondal and Rabe-Hesketh (2004) and Rabe-Hesketh et al. (2007). First, we define an aggregated measurement equation for the underlying manifest random vector of the ordered, continuous and count variables, a*ig, related to the vector measured on an individual i belonging to group g, aig, through the link equations gk(⋅):

\[
a_{ig}^* = \mu + \sum_{l=1}^{L} A^{(l)} x_{ig}^{(l)} + \sum_{l=1}^{L} \Lambda^{(l)} \omega_{ig}^{(l)} + \epsilon_{ig}, \tag{4.5}
\]

in which, for every set k = {1, ..., r_i} defined in equations (4.1) to (4.4), μ = (μ1, ..., μ_{p_k})′ is a p_k × 1 vector of intercepts; A^(l) and Λ_k^(l) are unknown p_k × r_x^(l) and p_k × q^(l) parameter matrices for the fixed and latent variables belonging to level l, respectively; x_ig^(l) is an r_x^(l) × 1 vector of exogenous variables; ω_ig^(l) is a q^(l) × 1 vector of latent random variables; and ε_ig is a p_k × 1 random vector of measurement errors such that ε_ig ∼ N[0, Ψ_{ε,k}], where Ψ_{ε,k} is a diagonal sub-matrix of dimension k in Ψ_ε, and ε_ig is independent of ω_ig^(l) at all levels. By defining Λ_{(l)} = [A^(1), ..., A^(L), Λ^(1), ..., Λ^(L)] and ω_{ig,(l)} = [x_ig^(1)′, ..., x_ig^(L)′, ω_ig^(1)′, ..., ω_ig^(L)′]′, equation (4.5) can be expressed more succinctly as

\[
a_{ig}^* = \mu + \Lambda_{(l)} \omega_{ig,(l)} + \epsilon_{ig}. \tag{4.6}
\]

Given ω_{ig,(l)} and the parameters in the aggregated matrices in equation (4.6), the measurement equations for a*_{ig,k}, k = 1, ..., p_k, are given by:

\[
\begin{aligned}
z_{ig,k}^* &= \mu_k + \Lambda_{k,(l)}' \omega_{ig,(l)} + \epsilon_{ig,k} \quad && \text{for } k = 1, ..., r_1, && (4.7) \\
y_{ig,k}^* &= \mu_k + \Lambda_{k,(l)}' \omega_{ig,(l)} + \epsilon_{ig,k} \quad && \text{for } k = r_1 + 1, ..., r_2, && (4.8) \\
v_{ig,k}^* &= \mu_k + \Lambda_{k,(l)}' \omega_{ig,(l)} \quad && \text{for } k = r_2 + 1, ..., r_3, && (4.9)
\end{aligned}
\]

where μ_k are the corresponding intercepts in μ, Λ_{k,(l)} is the (∑_{l=1}^{L} r_x^(l) + ∑_{l=1}^{L} q^(l)) × 1 vector corresponding to the k-th row of the matrix Λ_{(l)}, and ε_{ig,k} ∼ N[0, ψ_{ε,k}], with ψ_{ε,k} ∈ diag(Ψ_ε). For the unordered categorical variables, the underlying vector w_{ig,k}, defined for each of the k = {r3 + 1, ..., r4} observed u_{ig,k}'s, has the following measurement equation:

\[
w_{ig,k} = \mu_k + 1_{U_k - 1} \Lambda_{k,(l)}' \omega_{ig,(l)} + \epsilon_{ig,k}^{w} \quad \text{for } k = r_3 + 1, ..., r_4, \tag{4.10}
\]

where 1_{U_k−1} is a (U_k − 1) × 1 vector of ones, μ_k is a vector of U_k − 1 intercepts, and ε^w_{ig,k} ∼ N[0, Ψ_{ε,w_k}] are error terms independent of ω_ig, with Ψ_{ε,w_k} a sub-matrix of Ψ_ε. With respect to equation (4.10), the probability mass function of u_{ig,k} is

\[
p(u_{ig,k} = u' \mid \mu_k, \Lambda_{k,(l)}, \omega_{ig,(l)}, \Psi_{\epsilon,w}) = \int_{R_{u'}} \Phi_{U_k - 1}\left(w_{ig,k};\; \mu_k + 1_{U_k - 1} \Lambda_{k,(l)}' \omega_{ig,(l)},\; \Psi_{\epsilon,w_k}\right) dw_{ig,k} \tag{4.11}
\]

where Φ_{U_k−1}(⋅; μ, Σ) denotes the density function of a (U_k − 1)-variate normal random variable with mean μ and covariance matrix Σ, and

\[
R_{u'} = \begin{cases} \{w_{ig,k} : \max(w_{ig,k}) < 0\}, & u' = 0 \\ \{w_{ig,k} : \max(w_{ig,k}) = w_{ig,k,u'} > 0\}, & u' = 1, ..., U_k - 1. \end{cases}
\]

The latter means that the multivariate density function for w_{ig,k} is proportional to the truncated multivariate normal distribution with density function Φ_{U_k−1}(w_{ig,k}; μ_k + 1_{U_k−1} Λ′_{k,(l)} ω_{ig,(l)}, Ψ_{ε,w_k}) I_{R_{u'}}(w_{ig,k}), where I_{R_{u'}}(⋅) is an indicator function that takes the value 1 if w_{ig,k} ∈ R_{u'} and 0 otherwise.

4.3. The structural equations

The structural equations proposed in this model follow closely those presented in Song et al. (2013) and resemble those in Skrondal and Rabe-Hesketh (2004) and Rabe-Hesketh et al. (2004). Consider a partition ω_ig^(l) = (η_ig^(l)′, ξ_ig^(l)′)′, where η_ig^(l) = (η_{ig,1}^(l), ..., η_{ig,q1^(l)}^(l))′ is a q1^(l) × 1 vector of endogenous (outcome) latent variables, and ξ_ig^(l) = (ξ_{ig,1}^(l), ..., ξ_{ig,q2^(l)}^(l))′ is a q2^(l) × 1 vector of exogenous (explanatory) latent variables that are assumed to follow a multivariate normal distribution, ξ_ig^(l) ∼ N[0, Φ^(l)]. As in the previous section, we also allow a set of exogenous covariates to enter the structural equation, c_ig^(l) = (c_{ig,1}^(l), ..., c_{ig,m^(l)}^(l))′.

We propose the following general structural equation. For an arbitrary element η_{ig,k}^(l) in η_ig^(l),

\[
\eta_{ig,k}^{(l)} = \sum_{j_1=1}^{m^{(l)}} \pi_{k,j_1}\left(c_{ig,j_1}^{(l)}\right) + \sum_{j_2=1}^{q_2^{(l)}} f_{k,j_2}\left(\xi_{ig,j_2}^{(l)}\right) + \sum_{l^* > l}^{L} \left[ \sum_{j_1^*=1}^{m^{(l^*)}} \pi_{k,j_1^*}^{*}\left(c_{ig,j_1^*}^{(l^*)}\right) + \sum_{j_2^*=1}^{q^{(l^*)}} f_{k,j_2^*}^{*}\left(\omega_{ig,j_2^*}^{(l^*)}\right) \right] + \delta_{ig,k}^{(l)}, \tag{4.12}
\]

where π_{k,j}(⋅), f_{k,j}(⋅), π*_{k,j}(⋅) and f*_{k,j}(⋅) are unspecified smooth functions with continuous second-order derivatives, for k = 1, ..., q1^(l) and j in the index sets j1, j1*, j2 and j2* defined in equation (4.12). δ_{ig,k}^(l) is a random residual term which is assumed to follow a normal distribution N[0, ψ_{δ,k}^(l)], with ψ_{δ,k}^(l) ∈ diag(Ψ_δ^(l)), and to be independent of ξ_ig^(l) at all levels. Even though the structural equation in (4.12) is quite general, it is important to bear in mind that endogenous latent variables at level l depend only on exogenous latent variables belonging to the same level or on latent variables belonging to superior levels l* > l. As stated by Skrondal and Rabe-Hesketh (2004, Chapter 4): "it would not make sense to regress a higher level latent variable on a lower level latent variable or observed variable since this would force the higher level variable to vary at a lower level". Also, although not explicitly read from equation (4.12), this structural equation setting is not intended for non-recursive structural equation models.

4.4. A Note on Bayesian P-splines

Following Song and Lu (2010) and Song et al. (2013), we consider Bayesian penalized splines as an initial approach for estimating the unknown functions π_{k,j}(⋅), f_{k,j}(⋅), π*_{k,j}(⋅) and f*_{k,j}(⋅) in equation (4.12). These functions can be modeled by a sum of basis splines (B-splines, De Boor, 1978) defined over a set of knots in their respective domains. For simplicity, let the structural equation be η_{ig,k}^(l) = f(ξ_{ig,1}^(l)) + δ_{ig,k}^(l), a special case of (4.12), as in Song et al. (2013). Using B-splines, f(ξ_{ig,1}^(l)) is modeled as

\[
f\left(\xi_{ig,1}^{(l)}\right) = \sum_{\dot{k}=1}^{\dot{K}} \gamma_{\dot{k}}\, B_{\dot{k}}\left(\xi_{ig,1}^{(l)}\right), \tag{4.13}
\]

where K̇ is the number of knots (splines), K̇ ≤ Ṅ, with Ṅ the total number of observations (individuals or groups) belonging to level l; the γ_k̇'s are unknown parameters, and the functions B_k̇(⋅) are uniformly continuous polynomial functions of appropriate order n (i.e. n ≤ K̇ − 1) defined over the domain of ξ_{ig,1}^(l). Song et al. argue that a natural choice for B_k̇(⋅) is the cubic B-spline, with a number of knots typically ranging from 10 to 60 in order to ensure enough flexibility (Song and Lu, 2012).

The main drawback of traditional cubic B-splines is that the basis functions B_k̇(⋅) are defined on a fixed finite interval, and the realizations of ξ_{ig,1}^(l) generated across MCMC iterations might not always fall inside that fixed interval.

Song et al. (2013) propose to solve these difficulties by transforming the explanatory latent variables through the probit function, i.e. the authors model f(ξ_{ig,1}^(l)) as

\[
f\left(\xi_{ig,1}^{(l)}\right) = \sum_{\dot{k}=1}^{\dot{K}} \gamma_{\dot{k}}\, B_{\dot{k}}\left(\Phi^*\left(\xi_{ig,1}^{(l)}\right)\right), \tag{4.14}
\]

where Φ*(⋅) is the cumulative distribution function of N[0, 1]. Equation (4.14) transforms the original scale of ξ_{ig,1}^(l), (−∞, ∞), to the scale of the cumulative probability p(ξ ≤ ξ_{ig,1}^(l)), which is the closed interval [0, 1]. Since Φ*(⋅) is a monotonically increasing function, the composite function B_k̇(Φ*(⋅)) on the right-hand side of (4.14) allows for the same interpretation of the relationship between ξ_{ig,1}^(l) and f(ξ_{ig,1}^(l)). This modeling technique has some advantages over the approaches presented in Lang and Brezger (2004) and Song and Lu (2010), especially when it comes to defining the positions of the knots, and in computational simplicity and efficiency.

Despite their flexibility and robustness, B-splines are subject to over-fitting if a large number of knots is used. Eilers and Marx (1996) proposed a frequentist approach to the so-called P-splines, an extension of the B-spline model in which a penalty is imposed on the coefficients of adjacent B-splines to avoid over-fitting and regularize the problem. The coefficients minimize the penalized criterion

\[
\sum_{i=1}^{n_g} \left( \eta_{ig,k}^{(l)} - \sum_{\dot{k}=1}^{\dot{K}} \gamma_{\dot{k}}\, B_{\dot{k}}\left(\Phi^*\left(\xi_{ig,1}^{(l)}\right)\right) \right)^2 + \beta \sum_{\dot{k}=t+1}^{\dot{K}} \left(\Delta^t \gamma_{\dot{k}}\right)^2 \tag{4.15}
\]

where β is a smoothing parameter controlling the amount of penalty, and Δ^t γ_k̇ denotes the difference operator of order t. Usually first- or second-order differences are enough (Brezger and Lang, 2006). It is important to note that equation (4.15) essentially mirrors penalized likelihood estimation, as in Eilers and Marx (1996, 1998). In matrix notation, equation (4.15) can be written as

\[
\sum_{i=1}^{n_g} \left( \eta_{ig,k}^{(l)} - \sum_{\dot{k}=1}^{\dot{K}} \gamma_{\dot{k}}\, B_{\dot{k}}\left(\Phi^*\left(\xi_{ig,1}^{(l)}\right)\right) \right)^2 + \beta\, \gamma' M_\gamma \gamma \tag{4.16}
\]

where γ = (γ_1, ..., γ_K̇)′ and M_γ is the penalty matrix associated with the second sum on the RHS of equation (4.15). For an explicit description of the penalty matrix M_γ, see Fahrmeir and Raach (2007). Within a Bayesian framework, the unknown parameters γ_k̇, k̇ = 1, ..., K̇, are regarded as random variables and have to be assigned appropriate prior distributions. These are defined by replacing the difference penalty in equation (4.15) by its stochastic analogue, i.e.

\[
\Delta^t \gamma_{\dot{k}} = e_{\dot{k}}, \tag{4.17}
\]

where the e_k̇'s are independently distributed as N[0, τ_k̇]. For example, for t = 1, equation (4.17) becomes γ_k̇ = γ_{k̇−1} + e_k̇, and for t = 2, γ_k̇ = 2γ_{k̇−1} − γ_{k̇−2} + e_k̇. This is analogous to defining a prior distribution by specifying the conditional distribution of a particular parameter γ_k̇ given its left and right neighbors (Lang and Brezger, 2004). Therefore, the conditional means of the γ_k̇'s can be understood as locally linear or quadratic fits at a particular knot position. In this modeling approach, the amount of smoothness is controlled by the additional variance parameter τ_k̇, which corresponds to the inverse of the smoothing parameter β in the classical approach of equation (4.15). Accordingly, equation (4.12) can be reformulated as follows:

\[
\begin{aligned}
\eta_{ig,k}^{(l)} = {}& \sum_{j_1=1}^{m^{(l)}} \sum_{\dot{k}=1}^{\dot{K}_{c,j_1}} \gamma_{j_1,\dot{k}}\, B^{c}_{j_1,\dot{k}}\left(c_{ig,j_1}^{(l)}\right) + \sum_{j_2=1}^{q_2^{(l)}} \sum_{\dot{k}=1}^{\dot{K}_{\xi,j_2}} \gamma^{\xi}_{j_2,\dot{k}}\, B^{\xi}_{j_2,\dot{k}}\left(\Phi^*\left(\xi_{ig,j_2}^{(l)}\right)\right) \\
& + \sum_{l^*>l}^{L} \left[ \sum_{j_1^*=1}^{m^{(l^*)}} \sum_{\dot{k}=1}^{\dot{K}^{l^*}_{c,j_1^*}} \gamma^{l^*}_{j_1^*,\dot{k}}\, B^{c}_{j_1^*,\dot{k}}\left(c_{ig,j_1^*}^{(l^*)}\right) + \sum_{j_2^*=1}^{q^{(l^*)}} \sum_{\dot{k}=1}^{\dot{K}^{l^*}_{\omega,j_2^*}} \gamma^{l^*}_{j_2^*,\dot{k}}\, B^{\omega}_{j_2^*,\dot{k}}\left(\Phi^*\left(\omega_{ig,j_2^*}^{(l^*)}\right)\right) \right] + \delta_{ig,k}^{(l)}
\end{aligned} \tag{4.18}
\]

where K̇_{a,b} denotes the number of knots defined for the b-th random variable of type a. For simplicity, and without loss of generality, throughout the rest of this document we assume that K̇_{a,b} = K̇ for all a, b.

⎡ ∗ K˙ l∗ ∗ ⎤ ˙ ∗ K ⎥ (l ) c ,j q (l ) ω,j2∗ L ⎢ 1 ∗ ∗ ⎢m ⎥ (l) ∗ ∗ l l (l ) (l ) + ∑ ⎢⎢ ∑ ∑ γj ∗ ,k˙ Bcj ∗ ,k˙ (cig,j ∗ ) + ∑ ∑ γj ∗ ,k˙ Bωj ∗ ,k˙ (Φ∗ (ωig,j ∗ ))⎥⎥ + δig,k (4.18) 1 2 1 2 1 2 ⎥ ∗ ˙ ˙ l∗ >l ⎢ j2∗ =1 k=1 ⎢ j1 =1 k=1 ⎥ ⎣ ⎦ where K˙ {a,b} denotes the number of nodes defined for the bth random variable of type a. For simplicity, and without loss of generality, throughout the rest of this paper we ˙ ∀ a, b. assume that K˙ {a,b} = K,

4.5. Identification Constraints

As is common in the SEM literature, the model proposed in equations (4.5) and (4.18) is not identified without imposing identifiability constraints on the model parameters. Song et al. (2013) discuss appropriate solutions to these identification issues. Common restriction practices in both generalized HSEM and NPSEM arise from:

1. Existence of ordered (z*_{ig,k}) and unordered (w_{ig,k}) categorical variables: Given that the scale is not defined for z*_{ig,k} or w_{ig,k}, the remaining parameters in equations (4.7) and (4.1), and in (4.10) and (4.4), cannot be simultaneously estimated. For the ordered categorical case, Song et al. (2013) propose to fix α_{k,1} = Φ*⁻¹(f*_{k,1}) and α_{k,Z_k−1} = Φ*⁻¹(f*_{k,Z_k−1}), where Φ*(⋅) is the standard normal distribution function, f*_{k,1} is the frequency of the first category of z_{ig,k}, and f*_{k,Z_k−1} is the cumulative frequency of the categories z_{ig,k} < Z_{k−1} (see Shi and Lee, 2000). For the unordered categorical case, we fix the covariance matrix Ψ_{ε,w_k} = I_{U_k}, where I_{U_k} is an identity matrix of appropriate dimension (see Dunson, 2000).

2. Measurement equation parameters: This type of identifiability issue has been well addressed before (see, for example, Bollen, 1989 and Lee, 2007). For example, in the measurement equations (4.7) to (4.10) (assuming no exogenous variables), if an arbitrary nonsingular matrix C is introduced such that y*_{ig,k} = μ_k + Λ′_{k,(l)} ω_{ig,(l)} + ε_{ig,k} = μ_k + Λ′_{k,(l)} C C⁻¹ ω_{ig,(l)} + ε_{ig,k} = μ_k + Λ*′_{k,(l)} ω*_{ig,(l)} + ε_{ig,k}, with Λ*′_{k,(l)} = Λ′_{k,(l)} C and ω*_{ig,(l)} = C⁻¹ ω_{ig,(l)}, then a definite solution cannot be estimated. Therefore, as is common in the SEM literature, we overcome this issue by fixing appropriate elements of Λ_{k,(l)} such that the only nonsingular matrix C that satisfies the imposed conditions is the identity matrix. This is usually achieved by fixing one factor loading to 1 (in order to give a scale to the corresponding latent variable), and/or by imposing a non-overlapping structure on Λ, i.e. only one latent variable enters the measurement equation for item k.

3. Unknown functions in the structural equation: The unknown functions in the structural equations are identified only up to a constant: adding an arbitrary constant c to, say, π_{k,1}(⋅) and subtracting it from π_{k,2}(⋅) yields the same result, thus producing an unidentified model. Following Song and Lu (2010), restrictions are imposed on every unknown function at each MCMC iteration such that, for every p in j1, j2, j1* and j2*, and for all k = 1, ..., q1^(l), the constraint ∑_{g=1}^{G} ∑_{i=1}^{n_g} f_{k,p}(κ_{ig,p}) = 0 holds for appropriate f's and κ's. The latter can be equivalently formulated in matrix notation as 1′_N F_{k,p} = 0, where F_{k,p} = (f_{k,p}(κ_{11,p}), ..., f_{k,p}(κ_{n_G G,p}))′ and 1_N is an (∑_{g=1}^{G} n_g = N) × 1 vector with all elements equal to 1. After accounting for the nonparametric specification, the constraint is also equivalent to 1′_N B_p γ_p = 0, where γ_p is a K̇ × 1 vector of spline parameters and B_p is an N × K̇ matrix whose N rows are defined as [B_{p,1}(κ_{ig,p}), ..., B_{p,K̇}(κ_{ig,p})], for i = 1, ..., n_g and g = 1, ..., G, and whose elements are the B-spline bases of natural cubic splines. The extension to exogenous variables is straightforward. After restricting the mean of each function in (4.12) to be zero, the additive structural model is fully identified. We recall that these are sufficient, but not necessary, conditions for identification.
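As a minimal sketch of how the zero-mean constraint can be enforced in practice (one common device; the restriction is applied at every MCMC iteration), the fitted function values can simply be recentered, since 1′_N(f − f̄ 1_N) = 0 exactly:

```python
import numpy as np

def center_spline(B_p, gamma_p):
    """Enforce the identification constraint 1'_N B_p gamma_p = 0 by
    recentering the fitted function values at an MCMC iteration.
    B_p: N x K spline design matrix; gamma_p: K-vector of coefficients."""
    f = B_p @ gamma_p          # fitted values of the unknown function
    return f - f.mean()        # subtract the sample mean of f(.)
```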

CHAPTER 5

Bayesian Estimation of the SPHSEM

We shall use a Bayesian framework to analyze the proposed model. The advantages of Bayesian techniques include, but are not limited to: i) the use of prior information about the model parameters in addition to that provided by the data itself; ii) the power to deal with complex model structures (e.g. intractable integrals) via simulation; and iii) the ability to provide reliable results even with small sample sizes. In subsection 5.1 we first describe the prior distributions chosen for the parameters of the model presented in subsections 4.1 to 4.4. Then, we present the posterior distributions in subsection 5.2.

Let a_g = (a_{1g}, ..., a_{n_g g}) be the observed data nested in the g-th group, and a = (a_1, ..., a_G) the overall observed data for all G groups. Also let a*_g = (a*_{1g}, ..., a*_{n_g g}) and a* = (a*_1, ..., a*_G) be the underlying data nested in the g-th group and the overall underlying data, respectively. Rows in a* are linked to those in a through a = ḡ(a*), where ḡ(⋅) is a piecewise function consisting of the link functions described in equations (4.1) to (4.4). Let θ_m be the set of parameters associated with the measurement and structural equations in (4.7) to (4.10) and (4.12), respectively, i.e. θ_m = {μ, Λ_{(l)}, Ψ_ε, Ψ_δ, Φ}, where Ψ_δ = {Ψ_δ^(1), ..., Ψ_δ^(L)} and Φ = {Φ^(1), ..., Φ^(L)}. Let θ_s be the set of parameters associated with the nonparametric equations, i.e. θ_s = {γ, τ}, where γ = {γ_{j1}, γ_{j2}, γ_{j1*}, γ_{j2*}}, with γ_{j1} = {γ_1, ..., γ_{m^(l)}}, γ_{j2} = {γ_1, ..., γ_{q2^(l)}}, and γ_{j1*} = {γ_1, ..., γ_{m^(l*)}}, γ_{j2*} = {γ_1, ..., γ_{q^(l*)}} for every l* > l; and τ = {τ_{j1}, τ_{j2}, τ_{j1*}, τ_{j2*}} for vectors τ_{j1} = {τ_1, ..., τ_{m^(l)}}, τ_{j2} = {τ_1, ..., τ_{q2^(l)}}, and τ_{j1*} = {τ_1, ..., τ_{m^(l*)}}, τ_{j2*} = {τ_1, ..., τ_{q^(l*)}} for every l* > l. Finally, let α be the set of thresholds that define the values of the ordered categorical variables, i.e. α = {α_1, ..., α_{r_1}}, and let θ = {θ_m, θ_s, α} be the set of all parameters in the SPHSEM.

τj1∗ = {τ1 , ..., τm(l∗ ) }, τj2∗ = {τ1 , ..., τq(l∗ ) } for every l∗ > l; and α is the set of thresholds that define the values of the unordered categorical variables, i.e. α = {α1 , ..., αr1 }. Finally, let θ = {θm , θs , α} be the set of all parameters in the SPHSEM. 2

The Bayesian estimates of θ can be obtained by taking the mean of a sufficiently large number of random samples from the posterior density of θ given the observed data a, proportional to the product p(θ ∣ a) ∝ p(a ∣ θ) p(θ) (5.1) where p(a ∣ θ) is the likelihood function of a (given the parameters θ) and p(θ) is the prior density of the model’s parameters. Bear in mind that both the posterior distribution 56

57

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

and the likelihood function in equation (5.1) are very complicated (and sometimes intractable) due to the presence of latent variables, discrete data, and a complex model structure. These features complicate the Bayesian analysis of the observed-data posterior in the left hand of equation (5.1). Therefore, data augmentation techniques as in Tanner and Wong (1987) are used to overcome the difficulties related to the posterior analysis. We follow the approach of Lee (2007); Song and Lee (2012a), as is it common in Bayesian SEM literature. Let z∗g = {z∗1g , ..., z∗ng g } be the set of underlying data related to the unordered categorical variables zig for all ng individuals belonging to the g th group, and Z∗ = {z∗1 , ..., z∗G } be the underlying data linked to the complete sample’s observed unordered categorical variables. Also let wig = {wig,r3 +1 , ..., wig,r4 }, wg = {w1g , ..., wng g } and W = {w1 , ..., wG } be the underlying vectors for the observed ordered categorical for individual i, group (l) (l) (l) g and complete dataset N , respectively. Finally, let Ωg = {ω1g , ..., ωng g } and (l)

(l)

Ω(l) = {Ω1 , ..., ΩG } be the sets of the l-level both endogenous and exogenous latent variables for group g and the aggregated latent variables for all possible observations N , respectively. There will be as many Ω(l) sets as available levels, L, such that Ω(L) = {Ω(1) , ..., Ω(L) }.

In the data augmentation procedure, the observed data a will be augmented with {Z , W, Ω(L) } to produce the complete dataset {a, Z∗ , W, Ω(L) } that will be used to evaluate the augmented posterior distribution p(θ ∣ Z∗ , W, Ω(L) , a) through MCMC methods, namely the Gibbs sampler (Geman and Geman, 1984) and the MetropolisHastings algorithm (Metropolis et al., 1953; Hastings, 1970). ∗

Said that, departing from the joint probability distribution of the observed, underlying, and latent random variables, and the model’s parameters, p(θ, a, Z∗ , W, Ω(L) ), and by following Bayes’ rule; the posterior distribution in equation (5.1) can be further expressed as p(θ ∣ a, Z∗ , W, Ω(L) ) ∝ p(a, Z∗ , W, Ω(L) ∣ θ) p(θ). (5.2) By assuming that the prior distributions for θ, p(θ) are independent between parameters (Shi and Lee, 1998), and by following Bayes’ rule, equation (5.2) can be further declared as p(θ ∣ a, Z∗ , W, Ω(L) ) ∝ p(a, Z∗ , W, Ω(L) ∣ θ) p(θ)

= p(a, Z∗ , W, Ω(L) ∣ θ) p(θm ) p(θs ) p(α)

= p(a, Z∗ , W ∣ θ, Ω(L) ) p(Ω(L) ∣ θ)p(θm ) p(θs ) p(α)

= p(Z∗ , W ∣ θ, Ω(L) , a) p(a ∣ θ, Ω(L) ) p(Ω(L) ∣ θ) p(θm ) p(θs ) p(α)

= p(Z∗ ∣ θ, Ω(L) , a) p(W ∣ θ, Ω(L) , a) p(a ∣ θ, Ω(L) ) p(Ω(L) ∣ θ) p(θm ) p(θs ) p(α)

(5.3)

Equation (5.3) can be further presented as the product of several independent distributions for random variables varying at different levels, groups and individuals. Recall that in a multilevel setting, individuals i, i′ belonging to group g ∈ G become independent once we ‘control for’ the appropriate random variables varying at group g and higher levels.

58

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

Also, we assume that prior distributions within θm , θs and α are also independent. Therefore, after rearranging some terms, the right hand side of equation (5.3) can be expressed as it follows (we omit the variables we condition on, but are the same as in eq. 5.3): L

G ng

r1

∗ ∝ ∏ ∏ ∏ { [ ∏ p(zig,k , αk ∣ µk , Λk,(l) , ωig,(l) , ψ,k , zig )] × p(wig ∣ µ, Λ(l) , Ψ , uig )× l=1 g=1 i=1

k=1

(l)

p(ωig ∣ γ, Λ(l) , Ψδ , Φ(l) ) × p(aig ∣ z∗ig , wig , θm , Ω(L) ) × p(Ψδ ) × p(Ψ ) × p(Φ)× ⎡ r2 ⎤ ⎫ r4 ⎪ ⎢ ⎥ ⎪ p(µ) × ⎢⎢ ∏ p(Λk,(l) ∣ ψ,k ) ∏ p(Λk,(l) )⎥⎥ × p(γ ∣ τ ) × p(τ )⎬ (5.4) ⎪ ⎢k=1 ⎥ ⎪ k=r +1 2 ⎭ ⎣ ⎦ As in Lee and Zhu (2000) and Shi and Lee (1998), we can sample from the distribution above by implementing an MCMC algorithm. Samples from the joint distribution p(θ, Z∗ , W, Ω(L) ∣ a) can be independently obtained from posterior distributions resulting from concatenating terms in equation (5.4) into simpler, conventional, and more general distributions. First, Bayesian methods, particularly the Gibbs algorithm (Geman and (0) Geman, 1984), require setting arbitrary initial values (θ(0) , Z∗(0) , W(0) , Ω(L) ). Then we (m)

simulate (θ(m) , Z∗(m) , W(m) , Ω(L) ) from the distribution above, for m = 1, ..., T , following the proposed algorithm: Algorithm 5.1: (m)

1. Generate (Z∗(m+1) , α(m+1) ) from p(Z∗ , α ∣ θ(m) , Ω(L) , Z) (m)

2. Generate W(m+1) from p(W ∣ θ(m) , Ω(L) , u) (m+1)

from p(Ω(L) ∣ θ(m) , W(m+1) , Z∗(m+1) , a)

3. Generate Ω(L)

(m+1)

4. Generate θ(m+1) from p(θ ∣ W(m+1) , Z∗(m+1) , Ω(L) 4. is further decomposed into:

, a). Due to its complexity, step (m+1)

4.1. Generate µ(m+1) from p(µ ∣ W(m+1) , Z∗(m+1) , Ω(L) (m+1)

from p(Λ(l) ∣ W(m+1) , Z∗(m+1) , Ω(L)

(m+1)

from p(Ψ ∣ W(m+1) , Z∗(m+1) , Ω(L)

(m+1)

from p(Ψδ ∣ W(m+1) , Z∗(m+1) , Ω(L)

4.2. Generate Λ(l) 4.3. Generate Ψ

4.4. Generate Ψδ

(m)

(m+1)

(m+1)

(m+1)

4.6. Generate γ (m+1) from p(γ ∣ Ω(L)

(m+1)

, a, Ψδ

) (m)

, a, µ(m+1) , Ψ

)

(m+1)

, a, µ(m+1) , Λ(l)

(m+1)

4.5. Generate Φ(m+1) from p(Φ ∣ Ω(m+1) )

(m)

, a, Λ(l) , Ψ

)

, a, γ (m) , τ (m) )

, τ (m) )

4.7. Generate τ (m+1) from p(τ ∣ γ (m+1) ) Algorithm 5.1 is cycled T times. Under mild-regularity conditions, Geman and Geman (T ) (1984) showed that for sufficiently large T , (θ(T ) , Z∗(T ) , W(T ) , Ω(L) ) can be regarded as a realization of the posterior distribution in equation (5.4). Recall that Z∗ and W play a key role in the Gibbs algorithm described above because, when their values are given,

59

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

the model becomes simpler and the estimation effort is reduced due to the presence of underlying, continuous observations (a∗ ). In subsection 5.1 we describe the prior distributions defined for the parameters in θ, and in 5.2 we describe in detail the posterior distributions involved in algorithm 5.1. Further details on the derivation of the posterior distributions can be found in the Appendix section.

5.1.

Prior Distributions

In fully Bayesian settings, a relevant subject is related to the appropriate specification of prior distributions for the unknown parameters and underlying random variables. We first consider prior distributions for the basis parameters in the P-splines equations, γ. Fahrmeir and Raach (2007) show that the stochastic analogues of the penalty to the coefficients of the B-splines (eq. 4.17) follow a Gaussian distribution, i.e. for each element in γj1 , γj2 and γj1∗ , γj2∗ with l∗ > l, we have that ⎧ ⎫ ˙ ˙ K ⎪ ⎪ 1 ′ ⎪ 1 K 2⎪ ⎨ − p(γj1 ∣ τj1 ) = ∏ p(γj1 ,k˙ ∣ γj1 ,k−1 , τ ) ∝ exp ⎬ = exp {− (γ − γ ) γ Mγj1 γj1 } ∑ ˙ ˙ ˙ j1 j , k j , k−1 1 1 ⎪ ⎪ 2τj1 k=2 2τj1 j1 ⎪ ⎪ ˙k=1 ˙ ⎩ ⎭ (5.5) for every j1 = 1, ..., m(l) and for an arbitrary differentiation order t. The same goes for ˙ every element in j2 and j1∗ , j2∗ , for l∗ > l. hen the differentiation order is t = 1, the [K˙ × K] penalty matrix, Mγj1 , is defined as

Mγj1

⎤ ⎡ 1 −1 ⎥ ⎢ ⎥ ⎢−1 2 −1 ⎥ ⎢ ⎥ ⎢ ⎥ ⋱ ⋱ ⋱ =⎢ ⎥ ⎢ ⎥ ⎢ −1 2 −1 ⎥ ⎢ ⎢ −1 1 ⎥⎦[K× ⎣ ˙ K] ˙

Moreover, given the identification constraint 1′N Bj1 γj1 = 0, the prior distribution in equation (5.5) becomes a truncated Gaussian distribution (for arbitrary t): ˙ ∗ /2) (K j

p(γj1

1 ∣ τj1 ) = ( ) 2πτj1

1

exp {−

1 ′ γ Mγj1 γj1 } I(1′N Bj1 γj1 = 0) 2τj1 j1

where K˙ j∗1 = rank(Mγj1 ) and I(⋅) is an indicator function. (l)

equation (5.6) is the same for every j1 = 1, ..., m , j2 = ∗ j2∗ = 1, ..., q (l ) , for l∗ > l.

(l) 1, ..., q2 ,

(5.6)

The specification in j1∗ = 1, ..., m(l

∗)

and

Second, for all the smoothing parameters in τ , we assume highly dispersed but proper (conjugate) priors. Following Song and Lu (2010), for every p in j1 , j2 , j1∗ and j2∗ , for l∗ > l,

60

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

we assign gamma priors for the precision parameter (τp−1 ) given by p(τp−1 ) = Gamma[αγ0 , βγ0 ], D

(5.7)

where αγ0 and βγ0 are shape and rate hyperparameters with fixed preassigned values. In order to achieve highly dispersed priors, we set αγ0 = 1 and βγ0 = 0.005, common values in the literature (see, for example, Song et al., 2013). Third, for all the structural parameters in θm , we consider the following conjugate priors D

for k = 1, ..., r3

(5.8)

D

for k = r3 + 1, ..., r4

(5.9)

p(µk ) = N[µk0 , σk20 ] p(µk ) = N[µk0 , Hµk0 ] D

p(Λk,(l) ∣ ψ,k ) = N[Λk0 , ψ,k HΛk0 ]

and

D

p(ψ,k ) = InvGamma[αΛk0 , βΛk0 ] D

p(Λk,(l) ) = N[Λk0 , HΛk0 ] (l)

D

p(ψδ,k ) = InvGamma[αδk0 , βδk0 ]

(5.10)

for k = 1, ..., r2

(5.11)

for k = r2 + 1, ..., r4

(5.12)

(l)

for k = 1, ..., q1 and l = 1, ..., L

(5.13)

for l = 1, ..., L

(5.14)

p(Φ(l) ) = InvWishart[R0 , ρ0 ] D

where µk0 , σk20 , µk0 , Λk0 , αΛk0 , βΛk0 , Λk0 , αδk0 , βδk0 , ρ0 are appropriate hyperparameters, and Hµk0 , HΛk0 , R0 are positive semidefinite matrices whose values are assumed to be given by the prior information. Lastly, in order to reflect the uncertainty related to these parameters, we assign a non-informative prior distribution for the thresholds that define each ordered categorical variable, αk . Given that αk,1 and αk,Zk −1 are fixed for every k = 1, ..., r1 , we consider the prior distribution for αk,2 < ⋅ ⋅ ⋅ < αk,Zk −2 as follows: p(αk ) = p(αk,2 , ..., αk,Zk −2 ) ∝ c

(5.15)

with c being a fixed, arbitrary constant. Even though this prior distribution is improper, the joint conditional distribution (of the thresholds and underlying continuous variables) is proper, thus they can be sampled in the MCMC procedure. We now explore the features of the posterior distributions, obtained after replacing equations (5.8) to (5.15) into (5.4) and after requiring some algebra. Details can be revised in the Appendix section.

5.2.

Posterior Inference

We start with step 1 in algorithm 5.1. The joint posterior distribution in (m) this step can be further decomposed as the product p(Z∗ , α ∣ θ(m) , Ω(L) , Z) = (m)

(m)

p(Z∗ ∣ α, θ(m) , Ω(L) , Z) p(α ∣ θ(m) , Ω(L) , Z). The first posterior distribution, ∗ p(Z ∣ α, θ, Ω(L) , Z), can be in turn expressed as the product of several independent

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

61

∗ , ..., zn∗g G,k )′ be the vector of all N unrandom variables, as it follows. Let Z∗k = (z11,k derlying observations of zig,k , for every k = 1, ..., r1 ; therefore, p(Z∗ ∣ α, θ, Ω(L) , Z) = r1

∏ p(Z∗k ∣ αk , θ, Ω(L) , Z). Given the appropriate current-level and higher-level explanatory

k=1

variables (both observed and latent), and the model parameters involved in the conditional distribution, we end up with independent groups and individuals, such that r1

r1

G ng

∗ ∗ ∏ p(Zk ∣ αk , θ, Ω(L) , Z) = ∏ ∏ ∏ p(zig,k ∣ µk , Λk,(l) , ωig,(l) , ψ,k , αk,zig,k , αk,zig,k +1 , zig,k ) k=1 g=1 i=1

k=1

(5.16) ∗ ∗ ), where holds true, with p(zig,k ∣ ⋅) = N[µk + Λ′k,(l) ωig,(l) , ψ,k ]I[αk,z ,αk,z +1 ) (zig,k ig,k ig,k ∗ IA (⋅) is an indicator that function takes value of 1 if zig,k ∈ A, with A = [αk,zig,k , αk,zig,k +1 ). D

Now, for the second posterior distribution, p(α ∣ θ, Ω(L) , Z), we also assume that it can be expressed as the product of a series of independent distributions, one for each r1

of the r1 ordered categorical variables, i.e. p(α ∣ ⋅) = ∏ p(αk ∣ θ, Ω(L) , Zk ), with Zk = k=1



(z11,k , ..., zng G,k ) , the vector of all N observations of zig,k , and where G ng

−1

p(αk ∣ θ, Ω(L) , Zk ) ∝ ∏ ∏ [Φ∗ (ψ,k2 [αk,zig,k +1 − µk − Λ′k,(l) ωig,(l) ]) g=1 i=1

−1

−Φ∗ (ψ,k2 [αk,zig,k − µk − Λ′k,(l) ωig,(l) ])] (5.17) for k = 1, ..., r1 . Recall that Φ∗ (⋅) is defined as the cumulative distribution function of a standard Gaussian distribution. A short explanation in the derivation of this distribution is in the Appendix section. Combining equations (5.16) and (5.17), we have that G ng

−1

∗ p(αk , Z∗k ∣ ⋅) ∝ ∏ ∏ φ (ψ,k2 [zig,k − µk − Λ′k,(l) ωig,(l) ]) I[αk,z

ig,k

g=1 i=1

∗ ,αk,zig,k +1 ) (zig,k )

(5.18)

with φ(⋅) being the standard normal density. As in Lee and Zhu (2000), we sample joint realizations for (α, Z∗ ), as it is more efficient. However, note that the posterior distribution in equation (5.18) is not standard, and therefore we should sample from it using the MH algorithm, as described later in this chapter. In a similar fashion, the posterior distribution for the underlying vectors of the unordered categorical variables in step 2, W, can be expressed as p(W ∣ θ, Ω(L) , a) = r4

′ , ..., wn′ g G,k )′ , for k = r3 + 1, ..., r4 . It is clear ∏ p(Wk ∣ θ, Ω(L) , a), with Wk = (w11,k

k=r3 +1

that following the measurement equation in (4.10), and given the appropriate current and higher level explanatory variables and the model’s parameters θ, the latter product of probability distribution functions can be further expressed as r4

r4

G ng

∏ p(Wk ∣ θ, Ω(L) , a) = ∏ ∏ ∏ p(wig,k ∣ µk , Λk,(l) , ωig,(l) , uig,k )

k=r3 +1

k=r3 +1 g=1 i=1

62

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

with p(wig,k ∣ uig,k = u′ , ⋅) = N[µk + 1Uk −1 Λ′k,(l) ωig,(l) , IUk −1 ]IRu′ (wig,k ), where, IUk −1 is an identity matrix of order Uk − 1, and again, IA (⋅) is an indicator function that takes the value of 1 whenever wig,k ∈ Ru′ , with the Uk − 1 dimensional vector space Ru′ defined as ⎧ ⎪ if uig,k = u′ = 0 ⎪{wig,k ∶ max(wig,k ) < 0} Ru′ = ⎨ ′ ⎪ ′ ⎪ ⎩{wig,k ∶ max(wig,k ) = wig,k,u > 0} if uig,k = u = 1, ..., Uk − 1. D

We shall bear in mind that the posterior distribution in equation (5.19) is a truncated multivariate normal distribution. We follow the approach of Song et al. (2007) and Song et al. (2013) (who in turn follow the algorithm in Robert, 1995) to obtain samples from this distribution. The authors simulate partitioned variables using the Gibbs sampler. Let wig,k,−u′ the vector wig,k with wig,k,u′ = max(wig,k ) deleted, i.e. a new vector of size (Uk −2)×1. The distribution of wig,k,u′ given uig,k = u′ , wig,k,−u′ , ωig,(l) and the appropriate parameters in θ, is a univariate truncated normal distribution defined as: p(wig,k,u′

⎧ ′ ⎪ ⎪N[µk,u′ + Λk,(l) ωig,(l) , 1]I(wig,k,u′ ≥ max{wig,k,−u′ , 0}) ∣ ⋅) = ⎨ ′ ⎪ ′ ′ ′ ⎪ ⎩N[µk,u + Λk,(l) ωig,(l) , 1]I(wig,k,u < max{wig,k,−u , 0}) D

if uig,k = u′

if uig,k ≠ u′ (5.19)

where µk,u′ is the component of µk associated with wig,k,u′ . It is clear that sampling Uk − 1 random variables from a univariate truncated normal distribution consumes more computational resources, but also keeps the algorithm simple. Now, sampling from the posterior distribution in step 3, algorithm 5.1, requires more detail. Given that current and higher level explanatory variables in the measurement equations, latent variables are conditionally independent between individuals and groups, even between levels. Accordingly, recall that observed items are also conditionally inde(l) (l) pendent, i.e. ωig á ωi′ g ∣ Ω(L) and aig á ai′ g ∣ Ω(L) , for i ≠ i′ , for any arbitrary group g and arbitrary level l. Therefore, we have that: L



L

G ng

(l)



p(Ω(L) ∣ ⋅) = ∏ p(Ω(l) ∣ Ω(l ) , ⋅) = ∏ ∏ ∏ p(ωig ∣ Ω(l ) , θ(m) , W(m+1) , Z∗(m+1) , a), l=1

l=1 g=1 i=1



where Ω(l ) is the set of all the realization of latent variables for every level l∗ > ∗ (l) l. Moreover, given the rules of conditional probability, p(ωig ∣ Ω(l ) , ⋅) can be further expressed as: (l)



(l)



(l)

(l)



(l)

p(ωig ∣ Ω(l ) , ⋅) ∝ p(aig , z∗ig , wig ∣ ωig , Ω(l ) , ⋅)p(ηig ∣ ξig , Ω(l ) , ⋅)p(ξig ∣ ⋅),

(5.20)

which, for each level l, is in turn decomposed into ⎤ ⎡ ⎧ r2 ⎪ 2 2 ⎥ ⎪ 1 ⎢⎢ r1 ∗ ′ ′ ∝ exp ⎨− ⎢ ∑ (zig,k − µk − Λk,(l) ωig,(l) ) /ψ,k + ∑ (yig,k − µk − Λk,(l) ωig,(l) ) /ψ,k ⎥⎥ ⎪ 2 ⎥ ⎪ k=r1 +1 ⎩ ⎢⎣k=1 ⎦ r3

+ ∑ [vig,k (µk + Λ′k,(l) ωig,(l) ) − exp (µk + Λ′k,(l) ωig,(l) )] −

k=r2 +1 r4

′ 1 ′ ′ ∑ (wig,k − µk − 1Uk −1 Λk,(l) ωig,(l) ) (wig,k − µk − 1Uk −1 Λk,(l) ωig,(l) ) 2 k=r3 +1

63

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM (l)

(l)

˙ ξ,j ˙ c,j (l) K q2 K ⎛q1 1 2 1 ⎛ (l) m (l) (l) c + ∑ ⎜ ∑ − (l) ⎜ηig,k − ∑ ∑ γj1 ,k˙ Bj ,k˙ (cig,j1 ) − ∑ ∑ γj2 ,k˙ Bξ ˙ (Φ∗ (ξig,j2 )) j 2 ,k 1 j1 =1 k=1 j2 =1 k=1 ˙ ˙ l=1 ⎝k=1 2ψδ,k ⎝ L

⎡ ∗ K˙ l∗ ∗ ⎤ 2 ˙ (l∗ ) Kω,j ∗ ⎢m(l ) c ,j1 ⎥⎞ q 2 ⎢ ⎥ l∗ l∗ (l∗ ) (l∗ ) − ∑ ⎢⎢ ∑ ∑ γj ∗ ,k˙ Bcj ∗ ,k˙ (cig,j ∗ ) + ∑ ∑ γj ∗ ,k˙ Bωj ∗ ,k˙ (Φ∗ (ωig,j ∗ ))⎥⎥⎟ ⎟ 2 1 2 1 2 1 ⎥⎠ ∗ ˙ ˙ l∗ >l ⎢ j2∗ =1 k=1 ⎢ j1 =1 k=1 ⎥ ⎣ ⎦ −1 1 (l) ′ (l) − (ξig ) (Φ(l) ) (ξig ))} 2 L

(5.21)

Given that the posterior distribution of Ω(L) in equation (5.21) is highly complex and virtually intractable, we propose a Metropolis-Hastings within Gibbs sampler algorithm that allows us to sample random independent draws from it. Details on this piece of algorithm are presented in section 5.3. Lastly, we present a breakdown that allows for sampling from the posterior of θ in step 4 of algorithm 5.1. After assuming that different k’s are independent random variables, p(µ ∣ ⋅) is derived using the observed data likelihood and the conjugate priors in (5.8) and (5.9), as it is shown in the Appendix section. Following Song et al. (2013) and after some algebra, the posterior distributions for the intercept parameters are p(µk ∣ ⋅) = N[µ∗k , σk∗ ], D

for k = 1, ..., r1

(5.22)

∗ p(µk ∣ ⋅) = N[µ∗∗ for k = r1 + 1, ..., r2 (5.23) k , σk ], ⎫ ⎧ n ⎪ ⎪ ⎪ ⎪G g p(µk ∣ ⋅) ∝ exp ⎨ ∑ ∑ [vig,k (µk + Λ′k,(l) ωig,(l) ) − exp(µk + Λ′k,(l) ωig,(l) )]⎬ , ⎪ ⎪ ⎪ ⎪ ⎭ ⎩g=1 i=1 for k = r2 + 1, ..., r3 (5.24) D

D

ˆ k , Σµk ], p(µk ∣ ⋅) = N[µ

for k = r3 + 1, ..., r4

(5.25)

where −1

−1 σk∗ = [N ψ,k + σk−10 ] ⎡ ⎤ G ng ⎥ ∗ ∗⎢ −1 ∗ ′ −1 ⎢ µk = σk ⎢ψ,k ∑ ∑(zig,k − Λk,(l) ωig,(l) ) + σk0 µk0 ⎥⎥ ⎢ ⎥ g=1 i=1 ⎣ ⎦ ⎡ ⎤ n g G ⎥ ∗⎢ ⎢ψ −1 ∑ ∑(yig,k − Λ′ ωig,(l) ) + σ −1 µk ⎥ µ∗∗ = σ k k ⎢ ,k k0 0⎥ k,(l) ⎢ ⎥ g=1 i=1 ⎣ ⎦

and

−1 for the ordered categorical and continuous variables, and Σµk = (N IUk −1 + H−1 µk ) 0

G ng

ˆ k = Σµk [ ∑ ∑ (w ˆ ig,k ) + H−1 ˆ ig,k = wig,k − 1Uk −1 Λ′k,(l) ωig,(l) for the and µ µk0 µµk0 ], with w g=1 i=1

unordered categorical variables. Most of the posteriors distributions in (5.22) to (5.25) are normal distributions and therefore, one could draw samples directly using the standard Gibbs sampler. However, the posterior distribution in (5.24) is not standard, thus we need to sample using a MH

64

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

algorithm. To derive the posterior distributions in steps 4.2 and 4.3 in algorithm 5.1 from the observed data likelihood and using the conjugate priors in equations (5.10) to (5.12). Again, details can be found in the Appendix. The posterior distributions p(Λk,(l) ∣ ⋅) and p(ψ,k ∣ ⋅) are: p(Λk,(l) ∣ ⋅) = N[Λ∗k,(l) , ψ,k H∗k ], D

−1 and p(ψ,k ∣ ⋅) = Gamma [

N + αΛk0 , βk∗ ] , 2

(5.26)

−1 and p(ψ,k ∣ ⋅) = Gamma [

N + αΛk0 , βk∗∗ ] , 2

(5.27)

D

for k = 1, ..., r1 , ∗ p(Λk,(l) ∣ ⋅) = N[Λ∗∗ k,(l) , ψ,k Hk ], D

D

for k = r1 + 1, ..., r2 , ⎧ n ⎪ ⎪G g p(Λk,(l) ∣ ⋅) ∝ exp ⎨ ∑ ∑ [vig,k (µk + Λ′k,(l) ωig,(l) ) − exp(µk + Λ′k,(l) ωig,(l) )] ⎪ ⎪ ⎩g=1 i=1 1 ′ − [Λk,(l) − Λk0 ] H−1 Λk0 [Λk,(l) − Λk0 ]} , for k = r2 + 1, ..., r3 , 2 ∗∗ p(Λk,(l) ∣ ⋅) = N[Λ∗∗∗ k,(l) , ψ,k Hk ], for k = r3 + 1, ..., r4 ; D

where H∗k L

=

′ −1 (H−1 Λk + ΩΩ ) , 0

Λ∗k,(l)

=

(5.28)

(5.29)

̃∗ H∗k [H−1 Λk Λk0 + ΩZk ],

being Ω a

0

(l)

[ ∑ (rx + q (l) )] × N matrix defined as Ω = [ω11,(l) , ..., ωng G,(l) ], with the vectors l=1

′ ̃ ∗ = (z ∗ − µk , ..., z ∗ ωig,(l) defined as in equation (4.6); and the N × 1 vector Z k 11,k ng G,k − µk ) ; ∗′ ∗−1 ∗ ′ −1 ̃∗ ̃ ∗′ Z and βk∗ = βΛk0 + 21 [Z k k + Λk0 HΛk Λk0 −Λk,(l) Hk Λk,(l) ]. 0

′ ∗ −1 ̃ ̃ In addition, we define Λ∗∗ k,(l) = Hk [HΛk Λk0 + ΩYk ], Yk = (y11,k − µk , ..., yng G,k − µk ) ,

and βk∗∗

0

=

βΛk0 +

1 2 −1

∗∗′ ∗−1 ∗∗ ′ −1 ̃′ Y ̃ [Y k k + Λk0 HΛk Λk0 − Λk,(l) Hk Λk,(l) ]. 0

′ −1 ̃ (H−1 and Λ∗∗∗ = H∗∗ Λk0 + (Uk − 1)ΩΩ ) k [HΛk0 Λk0 + ΩWk ], k,(l) (1′Uk −1 (w11,k − µk ), ..., 1′Uk −1 (wng G,k − µk ))′ (see Appendix).

Finally, H∗∗ k with

̃k W

= =

Furthermore, sampling from the posteriors p(Ψδ ∣ ⋅) and p(Φ ∣ ⋅) in steps 4.4 and 4.5 of (l) algorithm 5.1 goes at it follows. First, assume independent ψδ,k ’s for every k = 1, ..., q1 in every l = 1, ..., L. Therefore, the posterior distribution in step 4.4 is derived by multiplying (l) (l) numerous individual likelihoods for ηig,k ’s, times the prior distribution for ψδ,k (same k and l), as defined in equation (5.13). Thus, we have that for every k = 1, ..., q1 in every l = 1, ..., L, (l)

(l)

L

G ng

(l)

(l)

p(ψδ,k ∣ ηig,k , ⋅) ∝ ∏ ∏ ∏ p(ηig,k ∣ ⋅) p(ψδ,k ) l=1 g=1 i=1

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

65

After some algebra (see Appendix for further details on the computation of this posterior), we end up with a posterior distribution (l),−1

p(ψδ,k (l)∗

where αδk

=

N˙ 2

(l)

(l)∗

D

(l)∗

∣ ηig,k , ⋅) = Gamma [αδk , βδk ] ,

(5.30)

+ αδk0 , with N˙ being the sum of the number of individuals (groups) (l)

belonging to level l at which ηig,k belongs; and (l)∗ β δk

(l) ˙ ⎡ ˙ c,j (l) K q2 K ξ,j2 1 1 ⎢⎢ L G ng ⎛ (l) m (l) (l) c = βδk0 + ⎢∑ ∑ ∑ ⎜ηig,k − ∑ ∑ γj1 ,k˙ Bj ,k˙ (cig,j1 ) − ∑ ∑ γj2 ,k˙ Bξ ˙ (Φ∗ (ξig,j2 )) j 2 ,k 1 2 ⎢⎢l=1 g=1 i=1 ⎝ j1 =1 k=1 j2 =1 k=1 ˙ ˙ ⎣ ⎡ ∗ K˙ l∗ ∗ ⎤ 2⎤ ˙ (l∗ ) Kω,j ∗ ⎥⎞ ⎥ (l ) c ,j q L ⎢ m 1 2 ∗ ∗ ⎢ ⎥ ⎥ (l∗ ) (l∗ ) cl ωl ∗ ⎢ ⎥ − ∑ ⎢ ∑ ∑ γj ∗ ,k˙ Bj ∗ ,k˙ (cig,j ∗ ) + ∑ ∑ γj ∗ ,k˙ Bj ∗ ,k˙ (Φ (ωig,j ∗ ))⎥⎥⎟ ⎟ ⎥ 1 2 1 2 1 2 ⎢ ⎥ ⎥ ∗ ∗ ∗ ˙ ˙ l >l ⎢ j1 =1 k=1 j2 =1 k=1 ⎥⎠ ⎥ ⎣ ⎦ ⎦

Now, for the posterior in step 4.5 of algorithm 5.1, p(Φ ∣ ⋅), we assume independent covariance matrices for the exogenous latent variables across different levels, that is, L

p(Φ ∣ ⋅) = ∏ p(Φ(l) ∣ ⋅). The posterior is computed through the exogenous latent variables l=1

at level l likelihood times the prior defined for θ (l) . After some algebra explicitly available in the Appendix section, we have that the posterior distribution for θ (l) is of the form: ′

p(Φ(l) ∣ ⋅) = InvWishart (N˙ + ρ0 , ξ (l) ξ (l) + R0 ) , D

(5.31)

for every l = 1, ..., L. N˙ is defined in a similar way as in equation (5.30). To derive the posterior distribution for other parameters associated with the nonparametric structural equations; first, let both τp and γp be independent among p’s, that is, P

p(τ ) = ∏ ∏ p(τp )

and

J p=1 P

p(γ) = ∏ ∏ p(γp ∣ τp ), J p=1

for every p in j1 to j2∗ and J = {j1 , j2 , j1∗ , j2∗ }. Second, following Song et al. (2013), each one of the posteriors for τp ’s is the result of the conjugation between their priors and the corresponding γp ’s prior. Both priors were presented in equations (5.7) and (5.6), respectively. After some computations (in the Appendix), the posterior for τp results in τp−1 ∼ Gamma [αγ0 +

γp′ Mγp γp K˙ , βγ0 + ] 2 2

(5.32)

66

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

where K˙ is the number of knots, and p stands for the random exogenous variables in the structural equations in j1 to j2∗ . The posterior distribution for γp results as the combination of its prior (equation 5.6) and a modified version of the k th endogenous latent variable likelihood. After some algebra, the posterior distribution for γp can be expressed as p(γp ∣ ⋅) = N[γp∗ , Σ∗γp ] I(1′N˙ Bp γp = 0) D

(l)∗

(l) −1

(5.33) −1

(l)

with γp∗ = [Σ∗γp B′p ηk ] (ψδ,k ) and Σ∗γp = (B′p Bp /ψδ,k + Mγp /τp ) , and where I(⋅) is an indicator function that takes the value of 1 of the restriction 1′N˙ Bp γp = 0 is satisfied and 0 otherwise. In this specification we define Bp as the N˙ × K˙ matrix ⎡[B (⋅) , ⋯, ⎢ p,1 ⎢ ⎢ ⎢ ⎢ ⋮ ⋱ Bp = ⎢⎢ ⎢ ⎢ ⎢ ⎢[Bp,1 (⋅) , ⋯, ⎢ ⎣

⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⋮ ⎥ ⎥ ⎥ ⎥ ⎥ Bp,K˙ (⋅)] ⎥ [ng ,G] ⎦[N˙ ×K] ˙ Bp,K˙ (⋅)]

[1,1]

where N˙ is the total number of observations (individuals) in level l for which the endoge(l) nous latent variable ηig,k belongs, and K˙ is the total number of knots defined for each basis (l)∗ as the N˙ × 1 vector spline. This configuration is valid for every p in j1 to j ∗ . Also, let η (l)∗

defined as ηk

=

(l)∗ (l)∗ ′ (η11,k , ..., η ˙ ) , N ,k

2

(l)∗

k

˙ J K

(l)

with ηig,k = ηig,k − ∑ ∑ ∑ γp′ ,k˙ Bp′ ,k˙ (⋅), for every indi˙ p′ ≠p p′ k=1

vidual/group ig in 1, ..., N˙ in level l. To sample from the truncated posterior distribution, (New) we can sample an observation γp from equation (5.33) and then transform it as γp = γp(New) − Σ∗γp B′p 1N˙ (1′N˙ Bp Σ∗γp B′p 1N˙ )

5.3.

−1

1′N˙ Bp γp(New)

Implementation

Most posterior distributions in equations (5.16) to (5.33) are familiar standard distributions from the exponential family, such as Normal, (Inverted) Gamma, and (Inverted) Wishart distributions. Therefore, it is straightforward to sample from them using the Gibbs sampler (Geman and Geman, 1984). However, when sampling from the joint ∗ (l) posterior distribution p(Z∗ , α ∣ θ, Ω(L) , Z) in equation (5.18) and from p(ωig ∣ Ω(l ) , ⋅) in equation (5.21), it becomes clear that these distributions are not standard. We then appeal to the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) specification presented in Lee and Zhu (2000) and Song et al. (2013). Following Cowles (1996), for each of the k = 1, ..., r1 ordered categorical variables we generate a vector of thresholds αk = (αk,2 , ..., αk,Zk −2 ) from the following truncated normal distribution: (m) αk,q ∼ N [αk,q , σα2 k ] I[α (αk,q ), (5.34) (m) ,α ) k,q−1

k,q+1

67

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM (m)

for every q = 2, ..., Zk − 2. In equation (5.34), αk,q is the value of αk,q at the mth iteration of the Gibbs sampler, and σα2 k is a preassigned constant that yields an appropriate acceptance rate. In addition, we can sample Z∗k from (5.16). It follows from the MH algorithm that the acceptance probability for (αk , Z∗k ) is min{1, Rk }, where ∗(m)

Rk =

p(Z∗k , αk ∣ θ, Ω(L) , Z) q(Zk ∗(m)

p(Zk

(m)

, αk

(m)

, αk

∣ Z∗k , αk , θ, Ω(L) , Z) ∗(m)

∣ θ, Ω(L) , Z) q(Z∗k , αk ∣ Zk

(m)

, αk

,

, θ, Ω(L) , Z)

with p(⋅) being the posterior distribution in (5.18) and q(⋅) the proposal distribution, defined as the product of the truncated Gaussian distributions specified in equations (5.16) and (5.34). Therefore, following Lee and Zhu (2000), it can be shown that

(m)

(m)

(m)

Zk −2

Φ∗ [(αk,q+1 − αk,q ) /σαk ] − Φ∗ [(αk,q−1 − αk,q ) /σαk ]

q=2

Φ∗ [(αk,q+1 − αk,q ) /σαk ] − Φ∗ [(αk,q−1 − αk,q ) /σαk ]

Rk = ∏

(m)

−1

−1

G ng

Φ∗ (ψ,k2 [αk,zig,k +1 − µk + Λ′k,(l) ωig,(l) ]) − Φ∗ (ψ,k2 [αk,zig,k − µk + Λ′k,(l) ωig,(l) ])

g=1 i=1

Φ∗ (ψ,k2 [αk,zig,k +1 − µk + Λ′k,(l) ωig,(l) ]) − Φ∗ (ψ,k2 [αk,zig,k − µk + Λ′k,(l) ωig,(l) ])

×∏∏

−1

−1

(m)

(m)

(5.35) Since Rk depends only on the old and new values of αk and not on those of Z∗k , we do not need to draw new samples from the posterior distribution of Z∗k in the iterations for which the new value of αk is not accepted (Cowles, 1996). (l)



Something similar happens when sampling from p(ωig ∣ Ω(l ) , ⋅) in equation (5.21). In this case, we follow the approach of Song and Lu (2010), which depart from what initially proposed Arminger and Muth´en (1998) and Zhu and Lee (1999) in their respective papers. (l),(m) 2 (l) For the posterior distribution in (5.21), we choose N [ωig , σω Σω ] as the proposal (l),(m)

distribution, where ωig

(l)

is the random sample of ωig at the mth iteration of the Gibbs (l)

(l) sampler, and the covariance matrix defined as (Σω )−1 = Λ′ω Ψ−1 if l = 1, or  Λω + Σ (l) −1 ′ −1 (l) (Σω ) = N˙ g Λω Ψ Λω + Σ if l ≥ 1, with

Σ(l) =

(l) (l) ⎛ ⎞ (Ψδ )−1 −(Ψδ )−1 γω ∆(l) , (l) −1 (l) −1 ′(l) ′ (l) −1 ′(l) ′ (l) ⎝−∆ γω (Ψδ ) (Φ ) + ∆ γω (Ψδ ) γω ∆ ⎠

(5.36)

4 where Λω is a [r3 + ∑rk=r (Uk − 1)] × q (l) matrix that results from stacking every Λk,(l) 3 +1 j∗ (l) vector, as defined in equations (4.7) to (4.10); γω is a q × (K˙ × J ∗ ) matrix, J ∗ = ∑ 2 ji ,

(l)

1

i=1

that also results from stacking q1 vectors, one for each latent endogenous variable, composed of K˙ ×J ∗ γ’s (assuming we fix an equal number of K˙ knots for each basis expansion); (l) and where ∆(l) is a (K˙ × J ∗ ) × q2 matrix, defined as

68

CHAPTER 5. BAYESIAN ESTIMATION OF THE SPHSEM

⎡ ∂Bj =1 (⋅) 1 ⎢ ⎢ (l) ⎢ RR ∂ξig,1 ⎢ ⎢ ∂B(⋅) RRRR ⋮ = ⎢⎢ ∆= R (l)′ RRR ⎢ ∂B ∗ (l∗ ) (⋅) ∂ξig RR j2 =q Rξig =0 ⎢⎢ ⎢ (l) ⎢ ∂ξig,1 ⎣

⋯ ⋱ ⋯

∂Bj1 =1 (⋅) ⎤⎥ ⎥ (l) ⎥ ∂ξig,q2 ⎥ ⎥ ⎥ ⋮ ⎥ ∂Bj ∗ =q(l∗ ) (⋅) ⎥⎥ 2 ⎥ ⎥ (l) ⎥ ∂ξig,q2 ⎦

, ∣ ξig =0

with B(⋅) being a (K˙ ×J ∗ )×1 vector defined as B(⋅) = (B′j1 =1 (⋅), ..., B′j ∗ =q(l∗ ) (⋅))′ , and Bj (⋅) 2 being a K˙ × 1 vector, for every j in j1 to j2∗ in the structural equation for each endogenous latent variable at level l. For this proposal distribution, σω2 is also fixed to a value such that average acceptance rate is above 0.25 (Gelman et al., 1995). Thus, the acceptance probability is ⎧ ⎫ (l) ⎪ p(ωig ∣ ⋅) ⎪ ⎪ ⎪ ⎪ ⎪ min ⎨1, ⎬ (l),(m) ⎪ ⎪ ⎪ p(ωig ∣ ⋅) ⎪ ⎪ ⎪ ⎩ ⎭

CHAPTER

6

Simulations & Application

6.1.

A Simulation Study

A simulation study is presented to provide an empirical idea of the performance of the proposed Bayesian estimation of the SPHSEM. We simulated a dataset for i = 1, 2, ..., 800 observations at the first level, l = 1, distributed heterogeneously among G = 20 groups at a second level, l = 2, ranging from ng = 28 to ng = 51. For practical purposes, we present a simulation exercise for a set of six continuous manifest variables, y = {y1 , ...y6 }, related in a linear fashion, as in equation (4.6), with a set of endogenous and exogenous latent variables varying at both l = 1 and l = 2. The proposed algorithm works for the case in which count, and ordered and unordered categorical variables are simulated. The measurement equations were simulated as it follows: yig = µ + Λ(2) ωig,(2) + ig (6.1) or in a more explicit notation: ⎡ 0 ⎡yig,1 ⎤ ⎡µ1 ⎤ 1 0 1 ⎤⎥ ⎢ ⎢ ⎥ ⎢ ⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 0 λ2,2 ⎥ ⎢ 0 λ2,1 ⎢yig,2 ⎥ ⎢µ2 ⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢ ⎢yig,3 ⎥ ⎢µ3 ⎥ 0 λ3,2 ⎥⎥ ⎥ = ⎢ ⎥ + ⎢ 0 λ3,1 ⎢ ⎢ 1 ⎢y ⎥ ⎢µ ⎥ 0 1 0 ⎥⎥ ⎢ ⎢ ig,4 ⎥ ⎢ 4 ⎥ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢λ5,1 ⎢yig,5 ⎥ ⎢µ5 ⎥ 0 λ5,2 0 ⎥⎥ ⎥ ⎢ ⎥ ⎢ ⎢ ⎢yig,6 ⎥ ⎢µ6 ⎥ ⎢λ6,1 0 λ6,2 0 ⎥⎦ ⎣ ⎦ ⎣ ⎦ ⎣

⎡ig,1 ⎤ ⎢ ⎥ ⎡η (1) ⎤ ⎥ ⎢ ⎢ ig,1 ⎥ ⎢ig,2 ⎥ ⎢ (1) ⎥ ⎥ ⎢ ⎢ξ ⎥ ⎢ig,3 ⎥ ⎢ ig,1 ⎥ ⎥ ⎢ ⎢ (2) ⎥ + ⎢ ⎥ ⎢η ⎥  ig,4 ⎥ ⎢ ⎢ g,1 ⎥ ⎥ ⎢ ⎢ (2) ⎥ ⎢ig,5 ⎥ ⎢ξ ⎥ ⎥ ⎢ ⎣ g,1 ⎦ ⎢ig,6 ⎥ ⎦ ⎣

(6.2)

The true values for µ and Λ(2) where randomly generated and fixed to the values µ1 = ... = µ6 = 0.5, and λ2,1 = ... = λ6,2 = 0.8. The 1’s and 0’s in Λ(2) are fixed parameters that identify the model. Error terms ig are distributed as ig ∼ N [0, Ψ ], with the covariance matrix fixed at Ψ = 1.5 ∗ I6 (where I6 is the identity matrix order 6). (1)

The exogenous latent variables are drawn from ξig ∼ N[0, Φ(1) ], with Φ(1) = 1, and

(2)

ξg

∼ N[0, Φ(2) ], with Φ(2) = 1. For each i in 1, ..., 800 and for each g in 1, ..., 20, the 69

70

CHAPTER 6. SIMULATIONS & APPLICATION (1)

(2)

endogenous latent variables ηig,1 and ηg,1 were simulated according to the following simple, nonlinear, smooth, structural functions (as in equation 4.12):

(1)

(1)

(1)

(2)

(2)

(2)

(1)

ηig,1 = f11 (ξig,1 ) + f12 (ηg,1 ) + δig,1

(6.3)

ηg,1 = f21 (ξg,1 ) + δg,1

(6.4)

(1)

(2)

(2)

(2)

(2)

(2)

with f11 (ξig,1 ) = cos(1.5 ∗ ξig,1 ), f12 (ηg,1 ) = sin(ηg,1 ), and f21 (ξg,1 ) = 2 ∗ sin(1.5 ∗ ξg,1 ). (1)

(1)

(2)

Also, δig,1 ∼ N[0, Ψδ ] and δg

(2)

(1)

∼ N[0, Ψδ ], with values fixed at Ψδ

(2)

= Ψδ

= 1.

In this simulation study, the hyperparameters in the prior distributions presented in subsection 5.1 (equations 5.8 to 5.14) were assigned the following values: µ10 = .... = µ60 = 0, σ120 = .... = σ620 = 100, λ2,1 = ... = λ6,2 = 0, a matrix of appropriate order HΛk0 = 100 ∗ I6 , and αΛk0 = αδk0 = αγ0 = 1 with βΛk0 = βδk0 = βγ0 = 0.005 for uninformative priors on the dispersion priors hyperparameters. A total of 22 < N˙ 1 = 800 equidistant nodes were used to construct the cubic P-Splines for the latent variables belonging to the first level. Accordingly, for the second level, 12 < N˙ 2 = 20 knots where used. A first order random walk penalty matrix of appropriate order (Mγ ) was used for the Bayesian P-Splines in estimating the unknown smooth functions.

6.1.1.

MCMC Simulations and Results

After several thousands of iterations of Algorithm 5.1, we present the results of the simulation study in this subsection. Due to the presence of latent variables, the usage of data augmentation techniques, the complexity of the SPHSEM structure, but specially to the presence of cross-level effects, the resulting chains converge at a very slow rate. It took a burn-in phase of 75,000 iterations and used an additional 5,000 to compute the Bayesian estimates of the parameters in θ and their 5% and 95% density bounds, respectively. To avoid confusion, the estimates of γ and Ω are not presented in table format. However, the Bayesian estimates for Ω are displayed in Figures 6.1 and 6.2, while those of γ are shown in Figure 12 in the Appendix section. It is observed that the transitions from γk to γk+1 are smooth. The simulations were ran in an Intel Core—i7 CPU @2.00Ghz, 8.00GB RAM machine. It took about 12 hours to run a 10,000 iteration cycle.

®

The Gibbs sampler with the proposed MH algorithm produce the Bayesian estimates of the unknown parameters. In the MH algorithm (step 1 in Algorithm 5.1) we set σω2 = 1.5 to give an acceptance rate of around 61% and 58% for the latent variables at levels 2 and 1, respectively. Main results are reported in Table A1 in the Appendix section. The main result of this thesis is that the proposed Bayesian estimation of the SPHSEM recovers the nonlinear (causal) relationships between latent variables modeled in the set of structural equations with cross-level effects. As an example, Figures 6.1 and (1) (1) (2) (2) 6.2 show the recovered values for the latent variables ξig,1 , ηig,1 , ξg,1 , and ηg,1 , from the SPHSEM simulation study (black dots), together with the respective estimated values for (1) (2) the endogenous latent variables (blue dots, ηˆig,1 and ηˆg,1 ), and the true functions, f11 (⋅)

71

CHAPTER 6. SIMULATIONS & APPLICATION

and f21 (⋅), from which the original values for latent variables were simulated (red curves, equations 6.3 and 6.4).

0 −1 −4

−3

−2

Recovered values for

η1

1

2

3

Recovered Latent Variables

−2

−1

0

1

2

ξ1

(1)

(1)

(1)

Figure 6.1. Recovered values (ξig,1 , ηig,1 , black dots), fitted values (ˆ ηig,1 , blue dots), and true functions (f11 (⋅), red curve) for the nonlinear causal relationship between latent variables varying at Level 1. Source: Author’s own calculations.

6.1.2.

Bayesian Model Comparison and Goodness-of-fit Statistics

A key question when using complex and computationally intensive models (like the SPHSEM proposed herein) is if such model produces a better fit following some statistical criteria than somewhat simpler, more elementary alternatives (like the HSEM in Lee and Tang, 2006). One common Bayesian model comparison statistic is the Bayes Factor (Kass and Raftery, 1995). Assume that the observed data a arose from one of two competing (nonlinear or semiparametric) models, M0 and M1 . Let p(a ∣ Mi ) be the probability density of a given Mi , for i = 0, 1. The Bayes Factor (BF) statistic for evaluating M1 against M0 is defined by p(a ∣ M1 ) BF10 = (6.5) p(a ∣ M0 ) It is clear that the BF is a summary of the statistical evidence provided by the observed data in favor of a model M1 , as opposed to the competing model M0 . The marginal densities involved in the computation of BF10 are obtained by integrating over the parameter space as p(a ∣ Mi ) = ∫ p(a ∣ Mi , θMi ) p(θMi ∣ Mi ) dθMi . However, the latter densities are difficult to obtain analytically. Lee and Song (2003b) demonstrated that the path-sampling procedure presented in Gelman and Meng (1998) can be useful for computing the BF. Difficulties arise when establishing a link between the SPHSEM defined and conventional HSEM models within the path-sampling procedure. Moreover, even if the empirical computations of the marginal densities is a byproduct of the Bayesian estimation method,

72

CHAPTER 6. SIMULATIONS & APPLICATION

0

η2

−3

−2

−1

Recovered values for

1

2

3

Recovered Latent Variables

−2.0

−1.5

−1.0

−0.5

0.0

0.5

1.0

ξ2

(2)

(2)

(2)

Figure 6.2. Recovered values (ξg,1 , ηg,1 , black dots), fitted values (ˆ ηg,1 , blue dots), and true functions (f21 (⋅), red curve) for the nonlinear causal relationship between latent variables varying at Level 2. Source: Author’s own calculations.

it is highly computationally intensive. As pointed by Song et al. (2013), the computational cost for semiparametric SEMs is even higher. Therefore, we follow the Bayesian model comparison section in Song et al. (2013) and present and index based on the Deviance Information Criterion (DIC, Spiegelhalter et al., 2002). The DIC balances model complexity and goodness-of-fit to the data. The model with the smallest DIC is usually preferred in the model-comparison procedure. An extensive analysis of the DIC for models with missing data and latent variables (such as SEMs) is presented in Celeux et al. (2006). They demonstrate how the complete DIC, a statistic based on the complete-data log-likelihood (log p(D, Ω(L) ∣ θ), with D = (a, Z∗ , W), see Chapter 5) is the best version of the DIC for Bayesian model comparison with latent variables. The complete DIC (cDIC) is defined as: cDIC = −4Eθ,Ω(L) {log p(D, Ω(L) ∣ θ) ∣ D} + 2EΩ(L) {log p(D, Ω(L) ∣ Eθ [θ ∣ D, Ω(L) ]) ∣ D} . (6.6) For the proposed generalized SPHSEM, log p(D, Ω(L) ∣ θ), is computed as L G ng

(l)

log p(D, Ω(L) ∣ θ) = ∑ ∑ ∑ log p(aig , z∗ig , wig , ωig ∣ θ),

(6.7)

l=1 g=1 i=1

(l)

with p(aig , z∗ig , wig , ωig ∣ θ) as in equations (5.20) and (5.21) (see Chapter 5). The first expectation in equation (6.6) can be approximated as: Eθ,Ω(L) {log p(D, Ω(L) ∣ θ) ∣ D} = ∫ log p(D, Ω(L) ∣ θ) p(Ω(L) , θ ∣ D) dΩ(L) dθ

CHAPTER 6. SIMULATIONS & APPLICATION

73

1 T (t) (t) ∑ log p(D, Ω(L) ∣ θ ) T t=1

(6.8)

≈ (t)

where {(Ω(L) , θ(t) ) ∶ t = 1, ..., T } are the MCMC samples drawn from the posterior

distributions outlined in Chapter 5. Also, let {θ(m,t) ∶ m = 1, ..., M } be a chain of M ≤ T (t)

samples for the parameters’ posterior, p(θ ∣ D, Ω(L) ). We have that 1 M (m,t) (t) Eθ [θ ∣ D, Ω(L) ] ≈ θ¯(t) = . ∑θ M m=1 Therefore, the second expectation in equation (6.6) can be approximated by EΩ(L) {log p(D, Ω(L) ∣ Eθ [θ ∣ D, Ω(L) ]) ∣ D} ≈

1 (t) log p(D, Ω(L) ∣ θ¯(t) ). T

(6.9)

It is important to bear in mind that the sample averages in equations (6.8) and (6.9) are computed using the MCMC samples obtained using the Gibbs sampler and the MH algorithm presented in the previous chapter, and therefore, the computational cost for computing complete-data log-likelihood DIC values is not as high as if we computed other Bayesian model comparison statistics. In order to assess the performance of the SPHSEM, we compare the obtained results from our model versus an alternative, simpler, linear setting of an HSEM, similar to that in Lee and Tang (2006) and Lee (2007, section 9.7) (see the proposal distribution for the MH step therein). Afterwards, the cDIC for each model is computed. We fit the HSEM to the data simulated from the linear measurement equations and nonlinear structural equations presented at the beginning of this chapter. We assume the same linear structure for the measurement equations, as in (6.2), but instead of the nonlinear structural equations in (6.3) and (6.4), we assume a linear function for both levels, with cross-level effects, as: (1)

(1)

(2)

(2)

(2)

(1)

ηig,1 = γ1 ξig,1 + γ2 ηg,1 + δig,1

(6.10)

ηg,1 = γ3 ξg,1 + δg,1

(6.11)

(2)

For the HSEM, we also set σω2 = 1.5, and obtain an average acceptance range of 46% for latent variables at level 1 (individual level) and 48% for latent variables at level 2 (group level). These rates are lower than to those reported for the SPHSEM at both level 2 and 1. However, these numbers are misleading since the nonlinear (causal) relationships between latent variables are not recovered by the HSEM estimates. Table A2 in the Appendix section reports the estimation results for the parameters comparing both the SPHSEM and the HSEM. Results reported therein provide evidence of bias most HSEM parameters’ estimates. It is of particular interest how the parameters related to the cross-level effects, and the (1) (2) second level endogenous latent variables, i.e. µ4 , ψ,4 , λ6,1 , and both ψδ,1 and ψδ,1 are highly biased in the HSEM case. provides biased estimates for parameters in θ. As a piece of information, the sum of the absolute values for the bias of each estimated parameter is

CHAPTER 6. SIMULATIONS & APPLICATION

74

reported. For the SPHSEM that sum relatively low (3.471), while for the HSEM case, it (1) (2) is significantly higher (27.026), determined mostly by the bias of ψδ,1 and ψδ,1 . We also compute the cDIC statistic for a third model (in addition to the SPHSEM and HSEM), which we call the “oracle” model. In this exercise we fix the parameter in θ and the latent variables in Ω to their original, true values. The results for the cDIC statistics are presented in table 6.1. Table 6.1. Complete-data log-likelihood DIC

Model

cDIC

SPHSEM HSEM Oracle

15, 207.16 16, 160.01 12, 967.09

From these results it is clear the the SPHSEM outperforms the classical HSEM in terms of goodness-of fit for the simulated dataset. The cDIC for the SPHSEM (15,207.16) is higher than that of the oracle model (12,967.09), but is lower than that of the HSEM (16,160.01).

6.1.3.

An Intervention to the Simulated Causal System

We present an intervention to the structural system of equations through the do-operator presented in Pearl (2009b). As exogenous, explanatory variables are understood to be the Markovian Parents (causes) of endogenous, random variables (effects), we intervene the systems by setting the exogenous, latent variables at an arbitrary chosen level, and then compare the outcome versus the non-intervened scenario presented above. (2)

First, a group-level intervention is performed to the exogenous, latent variable ξg,1 . In each case, the do-operator acts by modifying the entering values in f21 (⋅) in the structural equation (6.4), and replacing them by a fixed value equal to 0.5 for those group with observed pre-intervention values below zero (0), i.e, assume an intervention of the (2) (2) form do(ξg,1 = 0.5) ∀ g ∈ {g ∶ ξg,1 < 0}. This group-level intervention has casual effects over the individual latent variables as well, through the cross-level effect in equation f12 (⋅). Using the potential outcome notation for the endogenous latent variables, the treat(2) ment effect (i.e. do(ξg,1 = 0.5) for selected g’s) can be expressed as (2)

(2)

(2)

(2)

(2)

(2)

τg = ηg,1 (do(ξg,1 = 0.5), δg,1 ) − ηg,1 (ξg,1 , δg,1 ),

(6.12)

τig =

(6.13)

(2) (1) (2) ηg,1 (ξig,1 , do(ξg,1

=

(1) (2) (1) (2) (1) 0.5), δig,1 ) − ηig,1 (ξig,1 , ξg,1 , δig,1 ),

for the group and individual levels, respectively. The treatment effect is computed by comparing the non-treated individuals/groups versus the potential outcomes of the treated ones. The counterfactual potential outcomes correspond to the estimated,

75

CHAPTER 6. SIMULATIONS & APPLICATION

semi-parametric structural functions, fˆ11 (⋅), fˆ12 (⋅) and fˆ21 (⋅), evaluated at the proposed (2) intervention value, i.e fˆ21 (do(ξg,1 = 0.5)), and both fˆ11 (⋅) and fˆ12 (⋅) evaluated at the (2)

(2)

resulting hypothetical value ηg,1 (do(ξg,1 = 0.5)). (2)

Figures 6.3a and 6.3b show how the intervention do(ξg,1 = 0.5) has causal direct and indirect effects on the endogenous latent variables. Black dots represent the non-intervened groups/individuals. Blue dots represent the observed values for endogenous latent variable for the intervened ones, previous to the intervention, i.e., in Figure 6.3a, blue dots are those (2) (2) groups for which ξg,1 < 0 holds true, and therefore those in which the treatment do(ξg,1 = (2)

0.5) was applied. Dotted lines (black and blue) are the average values of ηg,1 for non treated and to-be-treated groups, which take values of 1.441 and -0.875, respectively. The red dot is the value that would attain each of the treated groups, evaluated at the estimated function fˆ21 (⋅), through the SPHSEM. The dotted red line is the post-intervention average value of the endogenous latent variable for the treated groups, and takes a value of 1.352. (2) It is clear that the difference between the average values of ηg,1 for the non treated (black dotted line) and the treated (red dotted line) is lower after the intervention. Recovered Latent Variables

η1 0 −4

−2

Recovered values for

1 0 −1 −2 −3

Recovered values for

2

2

η2

3

4

Recovered Latent Variables

−2.0

−1.5

−1.0

−0.5

0.0

0.5

1.0

−3

ξ2

−2

−1

0

1

2

3

ξ1

(2)

(2)

(a) Direct effect do(ξg,1 = 0.5) on ηg,1 .

(2)

(1)

(b) Indirect effect do(ξg,1 = 0.5) on ηig,1 . (2)

(2)

Figure 6.3. Direct and Indirect causal effect of do(ξg,1 = 0.5) for g ∈ {g ∶ ξg,1 < 0}. Source: Author’s own calculations.

Similar is displayed in Figure6.3b. Black dots represent the non-treated individuals. (2) The black dotted line is the average value of ηg,1 for the non-treated individuals. The blue dots and the blue dotted line correspond to the recovered values and their average for the treated individuals, previous to receiving the treatment, respectively. Red dots, and the red dotted line, are the counterfactual values, and their average, of the endogenous (2) latent variable ηg,1 for the treated individuals. Again, the difference between the averages for the non-treated (black dotted line) and the treated (blue dotted linen) is lower af(2) ter the group-level intervention, do(ξg,1 = 0.5) for selected g’s, is supplied (red dotted line). Before the intervention, the difference for the latent endogenous variable, between the non-treated (black dotted lines) and the to-be-treated (blue dotted lines) groups and individuals were 2.316 and 0.362, respectively. A differences in means t-test suggests that these differences are statistically different from zero (p-values equal to 0.000 for

76

CHAPTER 6. SIMULATIONS & APPLICATION (2)

the group and individual samples). After the intervention, do(ξg,1 = 0.5, the differences (black dotted line versus red dotted line) were reduced to 0.090 and 0.113 for the group and the individual levels, respectively. These differences are statistically equal to zero, according to a differences in means t-test (p-values are 0.357 and 0.0992 for the group and individual levels). The statistical tests suggest that this particular intervention eliminates the differences between the selected groups and individuals, in terms of the mean values (2) (1) of ηg,1 and ηig,1 . Now, assume an individual-level intervention. In this case, the exogenous latent (1) variable ξig,1 is manipulated through the do-operator for selected individuals, and set to a fixed value equal to 0.5. In this case, we pick those individuals with values below (1) (1) (1) -1 for ξig,1 , i.e. assume an intervention of the form do(ξig,1 = 0.5) ∀ i ∈ {i ∶ ξig,1 ∈ (−∞, −1]}. Again, the potential outcomes allow for estimating the treatment effect at the individual level (as in equation (6.13)). Given the fact that lower level interventions do not have causal effects over higher level interventions, the treatment effect at the group level is not estimated.

0 −2 −4

Recovered values for

2

η1

4

Recovered Latent Variables

−3

−2

−1

0

1

2

3

ξ1 (1)

(1)

Figure 6.4. Direct effect of do(ξig,1 = 0.5) for i ∈ {i ∶ ξig,1 ∈ (−∞, −1]}. Source: Author’s own calculations.

In Figure 6.4, black dots represent the non-intervened individuals, and the black dotted line the average value for the latent variable corresponding to the non-treated individuals. Blue dots represent the pretreatment values for the to-be-intervened individuals, and the blue dotted line the average value for the latent variable of the to-be-treated individuals. The red dot corresponds to the hypothetical, counterfactual value that treated individuals would attain if they were indeed subject to the proposed intervention. The (1) red dotted line corresponds to the value ηig,1 would attain for individuals under the treat-

77

CHAPTER 6. SIMULATIONS & APPLICATION

ment regime. It is computed using the estimated semiparametric function of the SPHSEM. Before the intervention, the difference between the non-treated (black dotted line) and the to-be-treated (blue dotted line) individuals was 0.938. After the intervention, the (1) difference is not −0.284, which means that the counterfactual average value of ηig,1 for the treated individuals would be even higher than that for the non-treated. However, given the sample variance, both differences are statistically equal to zero according to a t-test performed for the non-intervened and intervened individuals. Results are summarized in Table 6.2. Table 6.2. Intervention Results

(l)

(l)

ηT - ηN T Level 1 Level 2

Intervention No intervention (group treatment) (2) do(ξg,1 = 0.5) No intervention (unit treatment) (1) do(ξig,1 = 1.5)

0.362 0.113 0.938 -0.284

2.316 0.090 -

Sensitivity Analysis : Due to the nonlinear nature of the simulated, causal relationships between latent variables, a change in the intensity of the intervention will not be corresponded with a change on the output of the same magnitude. An important feature of the SPHSEM is that acknowledges these characteristics of the data generating process. Using the same simulated example, in Table 6.3 we present a sensitivity analysis of a group (2) level intervention, do(ξg,1 = α), to different values (α) of the exogenous, second-level latent (2)

variable ξg,1 . Results suggest that interventions of greater magnitudes are followed by also increasing causal effects on the treatment effect at the group level. Table 6.3. Sensitivity Analysis to Group Interventions (l)

Value of Intervention No intervention α = 0.0 α = 0.5 α = 1.0 α = 1.5

6.2.

(l)

ηT - ηN T Level 2 2.316 1.317 0.090 -0.873 -1.568

Empirical Application of the SPHSEM: Hostility, Leadership, and Task Significance Perceptions

We present the advantages of the Bayesian SPHSEM over more conventional, simpler models. In this example, we use the dataset described in Bliese et al. (2002), available

78

CHAPTER 6. SIMULATIONS & APPLICATION

within the distribution of the R-package multilevel (Bliese, 2016). The original dataset consists of 21 measures of items related to individual’s perceptions of Leadership Climate (LEAD), Task Significance (TSIG), and Hostility (HOSTIL) for 2,042 U.S Army soldiers grouped within 49 companies (groups). For simplicity, in this application we restrict our attention to those companies with 25 or more soldiers. As a result, we end up with a sample of 1,723 soldiers (individuals) belonging (heterogeneously) to 29 companies (groups). This dataset was used to assess theoretical models of stress related to working conditions and job situations. The authors wanted to prove whether or not higher levels of both leadership climate in a group (company) and level task significance perception of at the individual (soldier) and group levels caused higher responses of personal job-related wellbeing items, measured by the degree to which individuals are keen to exhibit hostile behaviors or not. Core to this analysis is the theory of individual level and nomothetic perspectives of job stress discussed in Bliese and Halverson (1996). The former emphasizes the role of individual perceptions, based on unique personality traits, beliefs, goals, abilities, etc, on the formation of wellbeing self-reported perceptions. The latter approach is based on the role of environmental variables on individual job stress. In order to assess the causal relationships between individual and group (latent) variables of leadership climate and task significance on individual, self-reported hostile behaviors, we set a SPHSEM, a two-level SEM, with structural equations following a semi-parametric structure, as the one explained in previous chapters. The dataset set consists of 21 variables for each unit, as described in the Appendix section. The implicit plate diagram we propose for this exercise is also presented in Figure 13, in the Appendix section. We model individual level Hostile Behavior (HOST ILig ) as an endogenous latent variable, caused by individual level Leadership Climate (LEADig ), Task Significance (T SIGig ) latent exogenous variables and by group level HOST ILg (cross-level effects). The latter is in turn modeled as an endogenous group level latent variable, caused by group level measurements of Leadership (LEADg ) and Task Significance (T SIGg ). Said that, we assume the structural equations for the proposed SPHSEM as: (1)

(1)

(1)

(1)

HOST ILig = f11 (LEADig ) + f12 (T SIGig ) + f13 (HOST ILg(2) ) + δig,1

(6.14)

(2) (2) HOST IL(2) g = f21 (LEADg ) + f22 (T SIGg ) + δg,1

(6.15)

(2)

We expect both LEAD and TSIG latent variables (at any level) to have a negative (causal) relationship with individual level Hostile Behavior scales, i.e., f11 (⋅), f12 (⋅), f13 (⋅) and f21 (⋅), f22 (⋅) should display negative trends. As explained in the Appendix section, questions are based on a five-point Likert scale. Following the advice in Preston and Colman (2000), who show that SEM and factor analyses based on less than seven-point items can be highly unreliable, we transform the items in our example to continuous, standardized random variables. Again, and given the lack of information about the parameters involved in the application study, we use the same uninformative priors used in the simulation example subsection above. Also, based on the meaning of the latent constructs and the identifia-

CHAPTER 6. SIMULATIONS & APPLICATION

79

bility constraints, we consider a non-overlapping structure of the loading matrix for the corresponding, similar to that displayed in equation (6.1). In order to avoid over-fitting, a total of 6 equidistant knots were used. The second-order random-walk penalty was used for the Bayesian P-splines in estimating the unknown, smooth functions. Due to the convergence issues addressed previously and the computational burden, we collected 5,000 observations after an initial burn-out phase of 45,000 cycles of the algorithm for the Bayesian inference.

6.2.1.

Results

Results for the empirical example are satisfactory. The average acceptance rates for the latent variables at the individual and group levels are 38.7% and 26.7%, respectively. Given the complexity of the model, these acceptance rates are acceptable. The Bayesian estimates of the structural parameters of the SPHSEM example using Bliese et al. (2002) dataset are presented in Table A3 in the Appendix section. For simplicity, we do not present results for Ω, τ , and for γ parameters in the B-Splines representation of the structural equations. As expected, the relationship between the latent constructs Hostile Behavior (HOSTIL) and both Leadership Climate (LEAD) and Task Significance Perception (TSIG) is negative. However, the SPHSEM results suggest that these relationships are non linear at both individual and group levels. The latter can be inferred from the recovered values of the latent variables, as displayed in Figures 6.5a to 6.5d. Results for the structural equations at the individual level suggest that higher Leadership Climate in the company causes individual Hostile Behavior to decline. However, as evidenced by the fitted values of the function f11 (⋅) in Figure 6.5a, the slope of this causal relationship is steeper for lower scores of LEAD relative to higher ones. The latter might provide evidence for overall lower job-related stress levels (measured by hostile behavior) for individuals who perceive higher, positive leadership attitudes in the environment they work. Also, for those individuals (high-performing, low-stressed), marginal improvements on leadership might not further reduce job-related stress levels, but might have different impact on other variables (not measured in this study). On the other hand, it is less clear how a higher individual Task Significance perception causes Hostile Behavior to decline. A positive, but possibly non-significant, relationship is recovered for individuals with lower TSIG scores. The latter suggest that very low TSIG values might actually cause HOSTIL to increase, which makes sense. However, mid-to-high scores display a negative relationship, as expected. For individuals with higher task significance perceptions there is no evidence for TSIG causing decreases in HOSTIL, which are already low, as shown in Figure 6.5b. Results for the structural equations at the group level are more robust, and do, in fact, reflect a negative causal relationship between both LEAD and TSIG, and HOSTIL. The fitted values (red lines) for functions f21 (⋅) and f22 (⋅) suggest that higher group-level Leadership Climate scores reduce environmental job-related stress levels, measured by


[Figure 6.5 about here. Four scatter plots of recovered latent-variable values with fitted spline curves: (a) HOSTIL_ig vs. LEAD_ig, f11(⋅); (b) HOSTIL_ig vs. TSIG_ig, f12(⋅); (c) HOSTIL_g vs. LEAD_g, group level, f21(⋅); (d) HOSTIL_g vs. TSIG_g, group level, f22(⋅). Horizontal axes: recovered values for LEAD or TSIG; vertical axes: recovered values for HOSTIL, at the individual (panels a, b) and group (panels c, d) levels.]

Figure 6.5. Causal relationship between latent variables at both the individual and group levels. Source: Author's own calculations.

The complete-data log-likelihood Deviance Information Criterion (cDIC) statistic for the SPHSEM is 18,118.22. This is lower than that of the simpler, linear HSEM (20,056.88), suggesting that the SPHSEM is the preferred model and reaffirming the value of its use. These results are illustrative in that they demonstrate the power and versatility of the Semiparametric Hierarchical SEM (SPHSEM) when it comes to estimating and recovering nonlinear patterns in the functional relationships between latent variables at both the individual and group levels. They also suggest that the SPHSEM can be used both as an exploratory tool for investigating functional forms and as a confirmatory tool for selecting models through statistically grounded criteria.
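As a hedged illustration (not the thesis code), a DIC-type statistic such as the cDIC above can be computed from MCMC output along the following lines, where `dev_samples` denotes the complete-data deviance evaluated at each posterior draw and `dev_at_mean` the deviance at the posterior means of the parameters; both are assumed to be available from the sampler:

```r
# Minimal DIC sketch from MCMC output (assumed inputs)
dic <- function(dev_samples, dev_at_mean) {
  d_bar <- mean(dev_samples)     # posterior mean deviance
  p_d   <- d_bar - dev_at_mean   # effective number of parameters
  d_bar + p_d                    # DIC = Dbar + pD
}
# Illustrative call with simulated deviances:
# dic(rnorm(5000, 18000, 50), 17990)
```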

6.2.2. Analysis of Intervention: What if soldiers reported higher Leadership perceptions?

In this particular example, we want to answer the following: what would happen to individuals' self-reported Hostile Behavior if soldiers perceived higher levels of Leadership Climate


at the company level? Would it decrease if, say, a coaching program were introduced to improve the Leadership attitudes of Non-Commissioned Officers (NCOs) and Officers? These are policy questions that can be answered through the do-operator and the SPHSEM. For example, assume the following intervention: NCOs and Officers in command of companies with low Leadership rankings are assigned to a coaching program, such that their Leadership abilities and attitudes improve until they reach at least the group-level Leadership score observed at the 75th percentile (0.568). That is, assume an intervention do(LEAD_g^(2) = 0.568) for all g ∈ {g : LEAD_g^(2) < 0}. This intervention is expected to directly reduce Hostile Behavior at the group level, HOSTIL_g^(2), and to indirectly reduce it at the soldier's level, HOSTIL_ig^(1). In this case, the value set by the do-operator is plugged into the estimated functions (red lines in Figures 6.5c and 6.5a); given the estimated functional forms of these curves, an evaluation point greater than zero predicts a negative causal impact of Leadership perception on displayed Hostile Behavior.

Figures 6.6a and 6.6b show the direct and indirect effects of the intervention do(LEAD_{g,1}^(2) = 0.568) on the group- and individual-level Hostile Behavior, HOSTIL_{g,1}^(2) and HOSTIL_{ig,1}^(1), respectively. In Figure 6.6a, the black dots represent the non-intervened companies (the black dotted line is their average HOSTIL score, which takes a value of -0.057). The blue dots represent the to-be-intervened groups, i.e., low-ranked companies for which the intervention do(LEAD_{g,1}^(2) = 0.568) is to be supplied (the blue dotted line is their average, which takes a value of 0.044). The red dot is the counterfactual value of both Leadership Climate and Hostile Behavior that these companies would display had they reported Leadership Climate scores at least as high as the 75th percentile (0.568). This counterfactual is computed by evaluating the intervention value do(LEAD_{g,1}^(2) = 0.568) in the estimated semiparametric structural equation of the SPHSEM (red dotted line, with value -0.048). It is clear that the hypothetical treatment would cause lower levels of perceived hostility at the group level.
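A hedged sketch of how such a counterfactual evaluation can be carried out in R, assuming `lead_grid` and `f21_fit` hold the abscissae and fitted values of the estimated group-level function (neither of which is shown here, so both are stand-ins): interpolate the estimated curve and evaluate it at the intervention value.

```r
# Hypothetical inputs: grid of LEAD values and fitted f21 values on that grid
lead_grid <- seq(-1, 1, length.out = 50)
f21_fit   <- -0.05 * tanh(2 * lead_grid)   # stand-in for the estimated curve

# Interpolate the estimated structural function and apply do(LEAD = 0.568)
f21_hat   <- splinefun(lead_grid, f21_fit)
hostil_cf <- f21_hat(0.568)                # counterfactual group-level HOSTIL
hostil_cf
```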

[Figure 6.6 about here. Two panels: (a) direct effect of do(LEAD_{g,1}^(2) = 0.568) on HOSTIL_{g,1}^(2), plotting recovered group-level HOSTIL against recovered group-level LEAD; (b) indirect effect of do(LEAD_{g,1}^(2) = 0.568) on HOSTIL_{ig,1}^(1), plotting recovered individual-level HOSTIL against recovered individual-level LEAD.]

Figure 6.6. Direct and indirect causal effects of do(LEAD_{g,1}^(2) = 0.568) for companies with low Leadership rankings (i.e., LEAD_{g,1}^(2) < 0). Source: Author's own calculations.


In Figure 6.6b, the black dots represent the non-intervened individuals (the average of the endogenous latent variable of interest is displayed as a black dotted line and takes a value of -0.039). The blue dots are the individuals belonging to the to-be-intervened companies (their average is displayed as a blue dotted line, with a value of 0.050). The red line represents the counterfactual values of the unit-level endogenous latent variable that the treated individuals would display had they been treated, i.e., had they belonged to a company whose NCOs and Officers were subject to the coaching program. The average hypothetical Hostility score for the treated individuals (red dotted line, value equal to -0.0006) is in fact similar to that of the non-treated. Before the intervention, the differences between the non-treated and the to-be-treated groups and individuals were −0.102 and −0.089, respectively. After the group-level intervention, these differences were reduced to −0.008 and −0.038 at the group and soldier levels, respectively. As expected, raising a group's Leadership scores (through a hypothetical coaching program) causes both individual- and group-level perceptions of Hostile Behavior to decrease. However, these differences were not statistically significantly different from zero, according to a difference-in-means t-test.

Conclusions

Causal inference is a crucial step in scientific discovery, since it allows for estimating the causal effects of interventions (variables) on outcomes of interest from non-experimental, observational data. Most of the causal inference literature explicitly assumes no causal connections or interactions among individuals, an assumption that is frequently violated in the Social, Behavioral, Biomedical and Epidemiological Sciences. It has been formally shown that the presence of causally connected individuals produces biased estimates of the causal effects of the desired intervention (Rosenbaum, 2007; Sekhon, 2008). Despite this issue, as acknowledged by scholars in the field of causal inference, a formal theory and statistical tools for empirical research are yet to be developed (VanderWeele and An, 2013). This thesis contributes to filling this gap by presenting a Semi-Parametric Hierarchical Structural Equation Model (SPHSEM) that accounts for the presence of non-independent, causally connected units clustered in groups organized in a multilevel structure. The SPHSEM builds upon the work on multilevel Structural Equation Models (SEM) of Rabe-Hesketh et al. (2004, 2012) and Lee and Tang (2006); Lee (2007), among others, in that it accommodates a set of semi-parametric structural equations when modeling the latent variables' means in a HSEM, as in Song et al. (2013). The SPHSEM thus allows for modeling nonlinear, cross-level, causal relationships between latent variables within the theoretical Structural Causal Model (SCM) framework for causal discovery presented and developed by Pearl (2009b) and others. We use Bayesian techniques, namely a hybrid algorithm that combines the Gibbs sampler (Geman and Geman, 1984) and the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), to estimate the unknown parameters in the SPHSEM. After presenting the formal derivations of the model's distributions, a simulation study shows that the Bayesian SPHSEM is capable of recovering nonlinear causal relationships between latent variables with cross-level effects. These results were contrasted with those of a linear hierarchical SEM (HSEM). We conclude that the SPHSEM provides better goodness-of-fit indexes than the HSEM when there are nonlinear causal relationships between latent variables, cross-level effects, and non-independent units clustered within groups.


Further Research

The following is a list of possible further research topics related to the SPHSEM:

- A formal definition of identification conditions in the P-splines specification: As suggested by Song et al. (2013), a formal study of identification constraints in the proposed SPHSEM is needed, perhaps following the methodologies developed by Jara et al. (2008) and San-Martín et al. (2011).

- Model comparison and Bayesian selection: This item is not covered in this thesis. For a more rigorous assessment of the empirical performance of the SPHSEM, it should be tested against the null hypothesis that the data follow a non-hierarchical or a linear structure for the latent variable equations. A theoretical test has not yet been developed; model comparison can be performed using Bayesian criteria.

- Multivariate spline functions: It could be useful to test whether the goodness-of-fit indexes improve by fitting multivariate spline functions in place of separate, univariate cubic-spline bases for each latent variable entering the structural part of the SPHSEM, at any level of aggregation.

- Mediation and computation of (in)direct effects: In order to make causal claims using the SPHSEM, the researcher has to report a measure of the direct or indirect causal effect of a given intervention. Very little has been written on this aspect. Bollen (1987) and Sobel (1987) present theoretical grounds, but it was not until Muthén and Asparouhov (2015) that a way to model mediation and to compute direct and indirect effects within the linear SEM framework was proposed. However, to the best of the author's knowledge, no paper has provided a clear explanation of how to compute direct and indirect effects within a multilevel and/or nonparametric SEM framework.


APPENDIX

Some Causal Quantities of Interest

In this Appendix we present a description of some of the most popular causal target quantities of interest, Q(P′). Throughout this subsection we use the potential outcome notation of Rubin (1974, 1978).

Average Treatment Effect (ATE)

Recall the potential outcome notation introduced in Chapter 1, where Yi(Xi, ti) is the value an outcome variable of interest Y would attain for individual i under treatment Ti = ti, with pre-treatment (or baseline) covariates Xi. For ease of notation, we omit the baseline covariates from the potential outcome notation for now. The causal effect of treatment Ti = ti over control Ti = ci for individual i is defined as τi = Yi(ti) − Yi(ci). Since population-based causal claims are derived from a sample of treated and untreated individuals, the average causal effect (ACE) or average treatment effect (ATE) is defined as

$$ATE = E(\tau) = E\left[Y(t) - Y(c)\right] = E\left[Y(t)\right] - E\left[Y(c)\right] \approx \frac{1}{N}\sum_{i=1}^{N} Y_i(t_i) - \frac{1}{N}\sum_{i=1}^{N} Y_i(c_i),$$

and it is estimated using the observed difference in means (after some matching procedure):

$$\widehat{ATE} = \frac{1}{N_t}\sum_{i=1}^{N} Y_i(t_i)\, I(i \in N_t) - \frac{1}{N_c}\sum_{i=1}^{N} Y_i(c_i)\, I(i \in N_c)$$

where Nt is the portion of the sample assigned to the treatment regime, and Nc is the portion assigned to the control regime, i.e., Nc = N − Nt.
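A minimal R illustration of the difference-in-means estimator above, using simulated outcomes and a hypothetical treatment indicator `t`:

```r
set.seed(42)
n <- 1000
t <- rbinom(n, 1, 0.5)        # treatment assignment indicator
y <- 0.5 * t + rnorm(n)       # outcome with a true ATE of 0.5

ate_hat <- mean(y[t == 1]) - mean(y[t == 0])
ate_hat
```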

Conditional Average Treatment Effect (CATE)

The conditional average treatment effect is the causal quantity to be estimated when the researcher is interested in the causal effect of a treatment T over specific strata of the sample, according to ancillary, observed covariates. The CATE statistic is defined as

$$CATE = E\left[Y_i(X_i, t_i) - Y_i(X_i, c_i) \mid X_i = X_a\right]$$

for Xa ∈ X, a particular realization or set of possible values of the covariates X.

Complier Average Treatment Effect (CoATE)

In some quasi-experimental settings, the treatment is not actually taken by every individual in the sample. Let Z be an indicator of whether an observation was assigned to the treatment, and D another indicator of whether that observation actually received the treatment. In settings with non-compliers, the sample is then divided into always-takers (Di = 1 regardless of Zi), never-takers (Di = 0 regardless of Zi), and compliers (Di = 1 when Zi = 1, and Di = 0 when Zi = 0). The complier average treatment effect is then defined as

$$CoATE = \frac{E(Y_i \mid Z_i = 1) - E(Y_i \mid Z_i = 0)}{E(D_i \mid Z_i = 1) - E(D_i \mid Z_i = 0)}$$
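The CoATE is the usual Wald (instrumental-variable) ratio; a hedged R sketch with simulated assignment `z`, uptake `d`, and outcome `y`:

```r
set.seed(7)
n <- 2000
z <- rbinom(n, 1, 0.5)              # assignment to treatment
d <- rbinom(n, 1, 0.2 + 0.6 * z)    # actual uptake (imperfect compliance)
y <- 1.0 * d + rnorm(n)             # outcome with a unit complier effect

coate_hat <- (mean(y[z == 1]) - mean(y[z == 0])) /
             (mean(d[z == 1]) - mean(d[z == 0]))
coate_hat
```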

Average Treatment Effects on the Treated (ATT) and the Control (ATC)

We often need to know the effects of the treatment not just on the whole sample but specifically for those to whom the treatment is actually administered. We define the average effects of treatment among the treated (ATT) and the control (ATC) as simple counterfactual comparisons:

$$ATT = E(Y_i(t) - Y_i(c) \mid D_i = 1) = E(Y_i(t) \mid D_i = 1) - E(Y_i(c) \mid D_i = 1)$$
$$ATC = E(Y_i(t) - Y_i(c) \mid D_i = 0) = E(Y_i(t) \mid D_i = 0) - E(Y_i(c) \mid D_i = 0)$$

Bear in mind that when treatment is randomly assigned and there is full compliance, ATE = ATT = ATC, since E(Yi(c) ∣ Di = 1) = E(Yi(c) ∣ Di = 0) and E(Yi(t) ∣ Di = 0) = E(Yi(t) ∣ Di = 1).
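A short hedged R illustration of ATT and ATC as sample analogues, here with a simulated oracle dataset in which both potential outcomes are known (never the case in practice, where matching or weighting is required):

```r
set.seed(3)
n  <- 5000
d  <- rbinom(n, 1, 0.4)   # treatment received
y1 <- 1 + rnorm(n)        # potential outcome under treatment (oracle)
y0 <- rnorm(n)            # potential outcome under control (oracle)

att <- mean(y1[d == 1]) - mean(y0[d == 1])   # effect among the treated
atc <- mean(y1[d == 0]) - mean(y0[d == 0])   # effect among the controls
c(ATT = att, ATC = atc)
```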

Causal Effects within the SCM

Given a causal model M and its associated graph G, it is important to distinguish between the total and direct effects of an intervention of the type do(X = x) on a variable of interest Y. Assume a set of intermediate variables Z in the paths that connect X and Y. Following Pearl (2005), and using the potential outcome notation of Rubin (1974, 1978), the total effect of an intervention do(X = x) over an alternative regime do(X = x*) is defined as P(Yx = y) − P(Yx* = y). However, X might have some effect on Z, and to identify the direct effect of X on Y we should isolate every other indirect effect by controlling for Z. That is, the direct effect of an intervention do(X = x) over an alternative regime do(X = x*), controlling for Z, is defined as P(Yxz = y) − P(Yx*z = y), where Z = z stands for a specified level of the intermediate variables in Z. First, we define the difference between controlled and natural effects.

Definition 1 (Controlled unit-level direct effect; Pearl, 2005): A variable X has a controlled direct effect on Y in a model M and situation U = u if there exists a setting Z = z of the other variables in the model, and values x and x* of X, such that Yxz(u) ≠ Yx*z(u). In words, Y under do(X = x) differs from its value under do(X = x*) when we keep Z = z fixed.

Definition 2 (Natural unit-level direct effect; Pearl, 2005): An event X = x has a natural direct effect on Y in situation U = u if Yx*(u) ≠ Y_{x, Z_{x*}(u)}(u) holds. In words, the value of Y under do(X = x*) differs from its value under do(X = x) even when we keep Z at the same value it would attain under X = x* and U = u, i.e., Z_{x*}(u).

We now present definitions of some causal quantities of interest from the SCM literature.

Average Controlled Direct Effects (ACDE)

Given a causal model M with causal graph G, the controlled direct effect of X = x on Y for a unit with covariates U = u and setting Z = z is given by

$$CDE_z(x, x^*; Y, u) = Y_{xz}(u) - Y_{x^*z}(u)$$

where Z stands for all parents of Y in G, excluding X. Therefore, at the sample/population level, the average controlled direct effect (ACDE) is defined as

$$ACDE_z(x, x^*; Y, u) = E_U\left(Y_{xz} - Y_{x^*z}\right),$$

where the expectation is taken over U.

Average Natural Direct Effects (ANDE)

Again, given a causal model M, the natural direct effect of do(X = x*) on Y for unit U = u is given by

$$NDE_z(x, x^*; Y, u) = Y_{x, Z_{x^*}(u)}(u) - Y_{x^*}(u)$$

and the average natural direct effect (ANDE) is defined as

$$ANDE_z(x, x^*; Y, u) = E_U\left(Y_{x, Z_{x^*}(u)} - Y_{x^*}\right).$$

APPENDIX

Derivation of posterior distributions

The derivations presented here are for the case of continuous variables, i.e., y_{ig,k} = g(y*_{ig,k}) = y*_{ig,k}; with subtle changes, the algebra applies to every type of underlying variable k ∈ p. Also, for simplicity, in some cases we present the derivations for L = 2; the extension to an arbitrary number of levels is straightforward.

Posterior joint distribution for threshold parameters α and the underlying latent variable for ordered categorical variables Z* (equations 5.17 and 5.18)

From equation (5.4), the posterior distribution for (Z*, α) can be expressed as the product

$$p(\mathbf{Z}^*, \alpha \mid \theta, \Omega^{(L)}, \mathbf{Z}) \propto p(\mathbf{Z}^* \mid \alpha, \theta, \Omega^{(L)}, \mathbf{Z}) \times p(\alpha \mid \theta, \Omega^{(L)}, \mathbf{Z}).$$

The first factor is reported in equation (5.16), which is obtained from (4.7). The second factor, as reported in equation (5.17), is not straightforwardly derived. First, bear in mind that we assume independent posteriors for every ordered categorical variable. Using the prior distribution in equation (5.15), we recall that for every k = 1, ..., r1,

$$p(\alpha_k \mid \theta, \Omega^{(L)}, \mathbf{Z}_k, \mathbf{Z}_k^*) \propto p(\alpha_k) \times p(\mathbf{Z}_k^* \mid \alpha_k, \theta, \Omega^{(L)}, \mathbf{Z}_k) = c \times \prod_{g=1}^{G}\prod_{i=1}^{n_g} p(z_{ig,k}^* \mid \alpha_k, \theta, \Omega^{(L)}, z_{ig,k}) \qquad (A.1)$$

Note that $p(z_{ig,k}^* \mid \alpha_k, \theta, \Omega^{(L)}, z_{ig,k}) \neq 0$ only when $z_{ig,k}^* \in [\alpha_{k,z_{ig,k}}, \alpha_{k,z_{ig,k}+1})$. Therefore, $p(\alpha_k \mid \theta, \Omega^{(L)}, \mathbf{Z}_k, \mathbf{Z}_k^*) \neq 0$ only if $\mathbf{Z}_k^* \in \mathcal{Z}_k$, where $\mathcal{Z}_k$ is the N-dimensional Euclidean region generated when $z_{ig,k}^* \in [\alpha_{k,z_{ig,k}}, \alpha_{k,z_{ig,k}+1})$, for every i ∈ ng and g ∈ G.

Now, by the Central Limit Theorem, $\tilde{z}_{ig,k}^* \stackrel{D}{=} N[0,1]\, I_{[\alpha_{k,z_{ig,k}}/\psi_{\epsilon,k},\, \alpha_{k,z_{ig,k}+1}/\psi_{\epsilon,k})}(\tilde{z}_{ig,k}^*)$, with $\tilde{z}_{ig,k}^* \equiv (z_{ig,k}^* - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)})/\psi_{\epsilon,k}$. Also, let Φ*(ι) be the univariate cumulative distribution function of a standard Gaussian random variable, evaluated at an arbitrary set ι in the real line. As Φ*(⋅) is a monotonically increasing transformation, the order statistics and the interpretation remain unaltered.

Following this reasoning, it is clear that $p(\alpha_k \mid \theta, \Omega^{(L)}, \mathbf{Z}_k, \mathbf{Z}_k^*) \neq 0$ only if $\tilde{z}_{ig,k}^* \in [\alpha_{k,z_{ig,k}}/\psi_{\epsilon,k},\, \alpha_{k,z_{ig,k}+1}/\psi_{\epsilon,k})$ for every i ∈ ng and g ∈ G. Therefore, c in equation (A.1) can be approximated as the product of independent segments of the set [0, 1], as in equation (5.17). The segment

$$\Phi^*\!\left(\psi_{\epsilon,k}^{-\frac{1}{2}}\left[\alpha_{k,z_{ig,k}+1} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}\right]\right) - \Phi^*\!\left(\psi_{\epsilon,k}^{-\frac{1}{2}}\left[\alpha_{k,z_{ig,k}} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}\right]\right)$$

is a measure of how likely it is for $\hat{\tilde{z}}_{ig,k} = (\mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)})/\psi_{\epsilon,k}$ to belong to the interval $[\alpha_{k,z_{ig,k}}/\psi_{\epsilon,k},\, \alpha_{k,z_{ig,k}+1}/\psi_{\epsilon,k})$, and can therefore be used to approximate the value of c*. The latter argument is similar to that presented in Shi and Lee (1998).

Posterior distributions for intercepts µ (equations 5.22 to 5.25)

From equation (5.4), let

$$\prod_{g=1}^{G}\prod_{i=1}^{n_g} p(y_{ig,k}^* \mid \cdot)\, p(\mu_k) = \prod_{g=1}^{G}\prod_{i=1}^{n_g} \frac{1}{\sqrt{2\pi\psi_{\epsilon,k}}}\exp\left\{-\frac{1}{2\psi_{\epsilon,k}}\left(y_{ig,k}^* - \mu_k - \Lambda_{k,(l)}\omega_{ig,(l)}\right)^2\right\} \times \frac{1}{\sqrt{2\pi\sigma_{k0}}}\exp\left\{-\frac{1}{2\sigma_{k0}}(\mu_k - \mu_{k0})^2\right\} \qquad (A.2)$$

The latter expression is proportional to

$$\exp\left\{-\frac{1}{2\psi_{\epsilon,k}}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(y_{ig,k}^* - \mu_k - \Lambda_{k,(l)}\omega_{ig,(l)}\right)^2 - \frac{1}{2\sigma_{k0}}(\mu_k - \mu_{k0})^2\right\} = \exp\left\{-\frac{1}{2}\left(\psi_{\epsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(\left[y_{ig,k}^* - \Lambda_{k,(l)}\omega_{ig,(l)}\right] - \mu_k\right)^2 + \sigma_{k0}^{-1}(\mu_k - \mu_{k0})^2\right)\right\} \qquad (A.3)$$

Letting $\bar{y}_{ig,k}^* = y_{ig,k}^* - \Lambda_{k,(l)}\omega_{ig,(l)}$, and completing the square in µk,

$$= \exp\left\{-\frac{N\psi_{\epsilon,k}^{-1} + \sigma_{k0}^{-1}}{2}\left(\mu_k^2 - 2\mu_k\left(N\psi_{\epsilon,k}^{-1} + \sigma_{k0}^{-1}\right)^{-1}\left(\psi_{\epsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\bar{y}_{ig,k}^* + \sigma_{k0}^{-1}\mu_{k0}\right)\right) + \text{const.}\right\} \qquad (A.4)$$

By allowing $\sigma_k^* = \left(N\psi_{\epsilon,k}^{-1} + \sigma_{k0}^{-1}\right)^{-1}$ and $\mu_k^* = \sigma_k^*\left(\psi_{\epsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\bar{y}_{ig,k}^* + \sigma_{k0}^{-1}\mu_{k0}\right)$, and after some algebra, the latter expression can be summarized as

$$\propto \exp\left\{-\frac{1}{2\sigma_k^*}\left(\mu_k - \mu_k^*\right)^2\right\} \qquad (A.5)$$

which is the kernel of the distribution $p(\mu_k \mid \cdot) \stackrel{d}{=} N[\mu_k^*, \sigma_k^*]$.

Posterior distributions for coefficients Λ_{k,(l)} and variances ψ_{ε,k} (equations 5.26 to 5.29)

From equation (5.4), and with $x = \sum_{l=1}^{L} r_x^{(l)} + \sum_{l=1}^{L} q^{(l)}$, let

$$\prod_{g=1}^{G}\prod_{i=1}^{n_g} p(y_{ig,k}^* \mid \cdot)\, p(\Lambda_{k,(l)} \mid \psi_{\epsilon,k})\, p(\psi_{\epsilon,k}) = \prod_{g=1}^{G}\prod_{i=1}^{n_g}\frac{1}{\sqrt{2\pi\psi_{\epsilon,k}}}\exp\left\{-\frac{1}{2\psi_{\epsilon,k}}\left(y_{ig,k}^* - \mu_k - \Lambda_{k,(l)}\omega_{ig,(l)}\right)^2\right\} \times$$
$$(2\pi)^{-\frac{x}{2}}\left|\psi_{\epsilon,k}\mathbf{H}_{\Lambda k0}\right|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\psi_{\epsilon,k}}\left(\Lambda_{k,(l)} - \Lambda_{k0}\right)'\mathbf{H}_{\Lambda k0}^{-1}\left(\Lambda_{k,(l)} - \Lambda_{k0}\right)\right\} \times \frac{\beta_{\Lambda k0}^{\alpha_{\Lambda k0}}}{\Gamma(\alpha_{\Lambda k0})}\,\psi_{\epsilon,k}^{-(\alpha_{\Lambda k0}-1)}\exp\left\{-\frac{\beta_{\Lambda k0}}{\psi_{\epsilon,k}}\right\} \qquad (A.6)$$

The latter expression can be organized as follows:

$$\propto \psi_{\epsilon,k}^{-\frac{N}{2}-\alpha_{\Lambda k0}+1}\left|\psi_{\epsilon,k}\mathbf{H}_{\Lambda k0}\right|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\psi_{\epsilon,k}}\left(\left[\tilde{\mathbf{Y}}_k^* - \Omega'\Lambda_{k,(l)}\right]'\left[\tilde{\mathbf{Y}}_k^* - \Omega'\Lambda_{k,(l)}\right] + \left[\Lambda_{k,(l)} - \Lambda_{k0}\right]'\mathbf{H}_{\Lambda k0}^{-1}\left[\Lambda_{k,(l)} - \Lambda_{k0}\right]\right) - \frac{\beta_{\Lambda k0}}{\psi_{\epsilon,k}}\right\} \qquad (A.7)$$

by allowing the N × 1 vector $\tilde{\mathbf{Y}}_k^{*} = (y_{11,k}^* - \mu_k, \ldots, y_{n_G G,k}^* - \mu_k)'$ and $\Omega = [\omega_{11,(l)}, \ldots, \omega_{n_G G,(l)}]$, with the vectors $\omega_{ig,(l)}$ defined as in equation (4.6). After some matrix algebra, and by defining $A = \tilde{\mathbf{Y}}_k^{*\prime}\tilde{\mathbf{Y}}_k^* + \Lambda_{k0}'\mathbf{H}_{\Lambda k0}^{-1}\Lambda_{k0}$, the expression inside the curly brackets can be further expressed as

$$\Lambda_{k,(l)}'\left(\mathbf{H}_{\Lambda k0}^{-1} + \Omega\Omega'\right)\Lambda_{k,(l)} + A - \Lambda_{k,(l)}'\left[\mathbf{H}_{\Lambda k0}^{-1}\Lambda_{k0} + \Omega\tilde{\mathbf{Y}}_k^*\right] - \left[\tilde{\mathbf{Y}}_k^{*\prime}\Omega' + \Lambda_{k0}'\mathbf{H}_{\Lambda k0}^{-1}\right]\Lambda_{k,(l)}$$
$$= \Lambda_{k,(l)}'\mathbf{H}_k^{*-1}\Lambda_{k,(l)} + A - \Lambda_{k,(l)}'\mathbf{H}_k^{*-1}\Lambda_{k,(l)}^* - \Lambda_{k,(l)}^{*\prime}\mathbf{H}_k^{*-1}\Lambda_{k,(l)} + \Lambda_{k,(l)}^{*\prime}\mathbf{H}_k^{*-1}\Lambda_{k,(l)}^* - \Lambda_{k,(l)}^{*\prime}\mathbf{H}_k^{*-1}\Lambda_{k,(l)}^*$$
$$= \left[\Lambda_{k,(l)} - \Lambda_{k,(l)}^*\right]'\mathbf{H}_k^{*-1}\left[\Lambda_{k,(l)} - \Lambda_{k,(l)}^*\right] + A - \Lambda_{k,(l)}^{*\prime}\mathbf{H}_k^{*-1}\Lambda_{k,(l)}^*$$

after defining $\mathbf{H}_k^* = \left(\mathbf{H}_{\Lambda k0}^{-1} + \Omega\Omega'\right)^{-1}$ and $\Lambda_{k,(l)}^* = \mathbf{H}_k^*\left[\mathbf{H}_{\Lambda k0}^{-1}\Lambda_{k0} + \Omega\tilde{\mathbf{Y}}_k^*\right]$.

By replacing the last equality inside the curly brackets above, we end up with the kernel of the product of a Normal and a Gamma distribution, such that $\Lambda_{k,(l)} \stackrel{d}{=} N\left[\Lambda_{k,(l)}^*, \psi_{\epsilon,k}\mathbf{H}_k^*\right]$ and $\psi_{\epsilon,k}^{-1} \stackrel{d}{=} \text{Gamma}\left[\frac{N}{2} + \alpha_{\Lambda k0}, \beta_k^*\right]$, with $\beta_k^* = \beta_{\Lambda k0} + \frac{1}{2}\left[A - \Lambda_{k,(l)}^{*\prime}\mathbf{H}_k^{*-1}\Lambda_{k,(l)}^*\right]$.

Posterior distributions for variances ψ(l)δ,k (equation 5.30)

For any level l, and for any endogenous latent variable k = 1, ..., q1(l), we define A_{k,(l)} as the sum of all B-spline terms entering the structural equation for η(l)ig,k:

$$A_{k,(l)} = \sum_{j_1=1}^{m^{(l)}}\sum_{\dot{k}=1}^{K_{c,j_1}}\gamma_{j_1,\dot{k}}\,B_{j_1,\dot{k}}^{c}\!\left(c_{ig,j_1}^{(l)}\right) + \sum_{j_2=1}^{q_2^{(l)}}\sum_{\dot{k}=1}^{K_{\xi,j_2}}\gamma_{j_2,\dot{k}}\,B_{j_2,\dot{k}}^{\xi}\!\left(\Phi^*\!\left(\xi_{ig,j_2}^{(l)}\right)\right) + \sum_{l^*>l}\left[\sum_{j_1^*=1}^{m^{(l^*)}}\sum_{\dot{k}=1}^{K_{c,j_1^*}^{*}}\gamma_{j_1^*,\dot{k}}\,B_{j_1^*,\dot{k}}^{c}\!\left(c_{ig,j_1^*}^{(l^*)}\right) + \sum_{j_2^*=1}^{q_{\omega}^{(l^*)}}\sum_{\dot{k}=1}^{K_{\omega,j_2^*}^{*}}\gamma_{j_2^*,\dot{k}}\,B_{j_2^*,\dot{k}}^{\omega}\!\left(\Phi^*\!\left(\omega_{ig,j_2^*}^{(l^*)}\right)\right)\right]$$

From equation (5.4), it can be shown that, after accounting for the appropriate $A_{k,(l)}$, for a given k and a given $\dot{N}$ ($\dot{N} = \dot{N}_1 = N$ for l = 1, $\dot{N}_2 = G$ for l = 2, etc.):

$$\prod_{l=1}^{L}\prod_{g=1}^{G}\prod_{i=1}^{n_g} p\!\left(\eta_{ig,k}^{(l)} \mid \cdot\right)p\!\left(\psi_{\delta,k}^{(l)-1}\right) \propto \psi_{\delta,k}^{(l),-\dot{N}/2}\exp\left\{-\frac{1}{2\psi_{\delta,k}^{(l)}}\sum_{l=1}^{L}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(\eta_{ig,k}^{(l)} - A_{k,(l)}\right)^2\right\} \times \psi_{\delta,k}^{(l),-(\alpha_{\delta k0}-1)}\exp\left\{-\frac{\beta_{\delta k0}}{\psi_{\delta,k}^{(l)}}\right\}$$
$$= \psi_{\delta,k}^{(l),-(\dot{N}/2+\alpha_{\delta k0}-1)}\exp\left\{-\frac{\beta_{\delta}^*}{\psi_{\delta,k}^{(l)}}\right\} \qquad (A.8)$$

which is the kernel of a Gamma distribution, such that $\psi_{\delta,k}^{(l)-1} \stackrel{d}{=} \text{Gamma}\left[\frac{\dot{N}}{2} + \alpha_{\delta k0}, \beta_{\delta}^*\right]$, with $\beta_{\delta}^* = \beta_{\delta k0} + \frac{1}{2}\left[\sum_{l=1}^{L}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(\eta_{ig,k}^{(l)} - A_{k,(l)}\right)^2\right]$.

Posterior distributions for covariance matrices Φ(l), for every l = 1, ..., L (equation 5.31)

From equation (5.4), it can be shown that

$$p(\xi \mid \Phi)\,p(\Phi) = \prod_{l=1}^{L}\prod_{g=1}^{G}\prod_{i=1}^{n_g} p\!\left(\xi_{ig}^{(l)} \mid \Phi^{(l)}\right)p\!\left(\Phi^{(l)}\right)$$
$$\propto \prod_{l=1}^{L}\prod_{g=1}^{G}\prod_{i=1}^{n_g}\left|\Phi^{(l)}\right|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}\xi_{ig}^{(l)\prime}\Phi^{(l)-1}\xi_{ig}^{(l)}\right\} \times \left|\Phi^{(l)}\right|^{-\frac{\rho_0+q_2^{(l)}+1}{2}}\exp\left\{-\frac{1}{2}\mathrm{tr}\left(\mathbf{R}_0\Phi^{(l)-1}\right)\right\}$$
$$\propto \prod_{l=1}^{L}\left[\left|\Phi^{(l)}\right|^{-\frac{\dot{N}}{2}}\exp\left\{-\frac{1}{2}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\xi_{ig}^{(l)\prime}\Phi^{(l)-1}\xi_{ig}^{(l)}\right\} \times \left|\Phi^{(l)}\right|^{-\frac{\rho_0+q_2^{(l)}+1}{2}}\exp\left\{-\frac{1}{2}\mathrm{tr}\left(\mathbf{R}_0\Phi^{(l)-1}\right)\right\}\right] \qquad (A.9)$$

We present the derivation for level l = 1, so we set $\dot{N} = \dot{N}_1 = N$ and define the $q_2^{(1)} \times N$ matrix $\xi^{(l)} = (\xi_{11}^{(1)}, \ldots, \xi_{n_G G}^{(1)})$. The extension to levels l > 1 is straightforward (with $\dot{N}_2 = G$, etc.). Bearing in mind that $\sum_{g=1}^{G}\sum_{i=1}^{n_g}\xi_{ig}^{(l)\prime}\Phi^{(l)-1}\xi_{ig}^{(l)} = \mathrm{tr}\left(\xi^{(l)\prime}\Phi^{(l)-1}\xi^{(l)}\right)$, the expression above can be summarized as

$$\propto \left|\Phi^{(l)}\right|^{-\frac{\dot{N}+\rho_0+q_2^{(l)}+1}{2}}\exp\left\{-\frac{1}{2}\left[\mathrm{tr}\left(\xi^{(l)\prime}\Phi^{(l)-1}\xi^{(l)}\right) + \mathrm{tr}\left(\mathbf{R}_0\Phi^{(l)-1}\right)\right]\right\} = \left|\Phi^{(l)}\right|^{-\frac{\dot{N}+\rho_0+q_2^{(l)}+1}{2}}\exp\left\{-\frac{1}{2}\mathrm{tr}\left(\left[\xi^{(l)}\xi^{(l)\prime} + \mathbf{R}_0\right]\Phi^{(l)-1}\right)\right\} \qquad (A.10)$$

which is the kernel of an Inverse Wishart distribution, $\Phi^{(l)} \stackrel{d}{=} IW\left[\dot{N} + \rho_0,\ \xi^{(l)}\xi^{(l)\prime} + \mathbf{R}_0\right]$.

Posterior distributions for variances τp (equation 5.32)

According to Song et al. (2013), the posterior distribution for every τp, p ∈ J1, J2, J1* or J2*, is derived from the conjugacy between the prior distributions p(τp−1) and p(γp ∣ τp−1). Therefore, for any p,

$$p\!\left(\gamma_p \mid \tau_p^{-1}\right)p\!\left(\tau_p^{-1}\right) \propto \left(\tau_p\right)^{-\dot{K}/2}\exp\left\{-\frac{1}{2\tau_p}\gamma_p'\mathbf{M}_{\gamma_p}\gamma_p\right\} \times \left(\tau_p^{-1}\right)^{\alpha_{\gamma 0}-1}\exp\left\{-\frac{\beta_{\gamma 0}}{\tau_p}\right\}$$
$$= \left(\tau_p^{-1}\right)^{\alpha_{\gamma 0}+\dot{K}/2-1}\exp\left\{-\left(\beta_{\gamma 0} + \frac{\gamma_p'\mathbf{M}_{\gamma_p}\gamma_p}{2}\right)\tau_p^{-1}\right\} \qquad (A.11)$$

which is the kernel of a Gamma distribution, such that $\tau_p^{-1} \stackrel{d}{=} \text{Gamma}\left[\alpha_{\gamma 0} + \frac{\dot{K}}{2},\ \beta_{\gamma 0} + \frac{\gamma_p'\mathbf{M}_{\gamma_p}\gamma_p}{2}\right]$.

Posterior distributions for coefficients γp (equation 5.33)

Let $\eta_k^{(l)*}$, $\eta_{ig,k}^{(l)*}$, and $\mathbf{B}_p$ be defined as in subsection 5.2. For any level l, for k = 1, ..., q1(l), and for a generic variable κp entering the structural equation, let

$$\prod_{g=1}^{G}\prod_{i=1}^{n_g} p\!\left(\eta_{ig,k}^{(l)*} \mid \cdot\right)p\!\left(\gamma_p \mid \tau_p^{-1}\right) \propto \left(\psi_{\delta,k}^{(l)}\right)^{-\frac{\dot{N}}{2}}\exp\left\{-\frac{1}{2\psi_{\delta,k}^{(l)}}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(\eta_{ig,k}^{(l)*} - \sum_{\dot{k}=1}^{\dot{K}}\gamma_{p,\dot{k}}\,B_{p,\dot{k}}\!\left(\Phi^*(\kappa_{ig,p})\right)\right)^2\right\} \times \tau_p^{-\frac{\dot{K}}{2}}\exp\left\{-\frac{1}{2\tau_p}\gamma_p'\mathbf{M}_{\gamma_p}\gamma_p\right\}I\!\left(\mathbf{1}_{\dot{N}}'\mathbf{B}_p\gamma_p = 0\right)$$
$$= \left(\psi_{\delta,k}^{(l)}\right)^{-\frac{\dot{N}}{2}}\tau_p^{-\frac{\dot{K}}{2}}\exp\left\{-\frac{1}{2}\left[\frac{\left(\eta_k^{(l)*} - \mathbf{B}_p\gamma_p\right)'\left(\eta_k^{(l)*} - \mathbf{B}_p\gamma_p\right)}{\psi_{\delta,k}^{(l)}} + \frac{\gamma_p'\mathbf{M}_{\gamma_p}\gamma_p}{\tau_p}\right]\right\}I(\cdot) \qquad (A.12)$$

After simple matrix algebra, and by defining $A_{\gamma_p} = \left(\psi_{\delta,k}^{(l)}\right)^{-\dot{N}/2}\tau_p^{-\dot{K}/2}$, the latter expression can be reformulated as

$$\propto A_{\gamma_p}\exp\left\{-\frac{1}{2}\left(\frac{\gamma_p'\mathbf{B}_p'\mathbf{B}_p\gamma_p}{\psi_{\delta,k}^{(l)}} - \frac{\eta_k^{(l)*\prime}\mathbf{B}_p\gamma_p + \gamma_p'\mathbf{B}_p'\eta_k^{(l)*}}{\psi_{\delta,k}^{(l)}} + \frac{\gamma_p'\mathbf{M}_{\gamma_p}\gamma_p}{\tau_p}\right)\right\}I(\cdot) = A_{\gamma_p}\exp\left\{-\frac{1}{2}\left[\gamma_p'\left(\frac{\mathbf{B}_p'\mathbf{B}_p}{\psi_{\delta,k}^{(l)}} + \frac{\mathbf{M}_{\gamma_p}}{\tau_p}\right)\gamma_p - \frac{\eta_k^{(l)*\prime}\mathbf{B}_p\gamma_p + \gamma_p'\mathbf{B}_p'\eta_k^{(l)*}}{\psi_{\delta,k}^{(l)}}\right]\right\}I(\cdot) \qquad (A.13)$$

By allowing $\Sigma_{\gamma_p}^* = \left(\mathbf{B}_p'\mathbf{B}_p/\psi_{\delta,k}^{(l)} + \mathbf{M}_{\gamma_p}/\tau_p\right)^{-1}$ and $\gamma_p^* = \Sigma_{\gamma_p}^*\mathbf{B}_p'\eta_k^{(l)*}\left(\psi_{\delta,k}^{(l)}\right)^{-1}$, and after adding and subtracting $\gamma_p^{*\prime}\Sigma_{\gamma_p}^{*-1}\gamma_p^*$, we have

$$\propto A_{\gamma_p}\exp\left\{-\frac{1}{2}\left[\gamma_p'\Sigma_{\gamma_p}^{*-1}\gamma_p - \gamma_p'\Sigma_{\gamma_p}^{*-1}\gamma_p^* - \gamma_p^{*\prime}\Sigma_{\gamma_p}^{*-1}\gamma_p + \gamma_p^{*\prime}\Sigma_{\gamma_p}^{*-1}\gamma_p^* - \gamma_p^{*\prime}\Sigma_{\gamma_p}^{*-1}\gamma_p^*\right]\right\}I(\cdot)$$
$$\propto A_{\gamma_p}^*\exp\left\{-\frac{1}{2}\left(\gamma_p - \gamma_p^*\right)'\Sigma_{\gamma_p}^{*-1}\left(\gamma_p - \gamma_p^*\right)\right\}I(\cdot) \qquad (A.14)$$

with $A_{\gamma_p}^* = A_{\gamma_p}\exp\left\{-\frac{1}{2}\left(-\gamma_p^{*\prime}\Sigma_{\gamma_p}^{*-1}\gamma_p^*\right)\right\}$. The expression above is the kernel of a truncated Normal distribution, such that $\gamma_p \stackrel{D}{=} N\left[\gamma_p^*, \Sigma_{\gamma_p}^*\right]I\!\left(\mathbf{1}_{\dot{N}}'\mathbf{B}_p\gamma_p = 0\right)$. Samples from this truncated distribution are obtained following the procedure explained at the end of subsection 5.2, as presented in Song et al. (2013).
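To make the sampling scheme concrete, the following R sketch (not the thesis code; all inputs and hyperparameter names are hypothetical) illustrates the full-conditional draws derived above: an intercept µk, the pair (Λ_{k,(l)}, ψ_{ε,k}), a variance ψ(l)δ,k, a covariance matrix Φ(l), a smoothing parameter τp, and the constrained spline coefficients γp. For γp, the truncated-Normal draw is approximated here by projecting an unconstrained draw onto the hyperplane 1'Bγ = 0, which for a linear equality constraint yields an exact draw from the conditional Gaussian.

```r
library(MASS)  # for mvrnorm

# mu_k | . ~ N(mu_star, sigma_star); ybar_star are the residuals y* - Lambda'omega
draw_mu <- function(ybar_star, psi_eps, mu0, sigma0) {
  N <- length(ybar_star)
  s <- 1 / (N / psi_eps + 1 / sigma0)
  m <- s * (sum(ybar_star) / psi_eps + mu0 / sigma0)
  rnorm(1, m, sqrt(s))
}

# (Lambda, psi_eps^{-1}) | . ~ Normal-Gamma; ystar = Y*_k - mu_k, Omega is q x N
draw_lambda_psi <- function(ystar, Omega, Lam0, H0, a0, b0) {
  N        <- length(ystar)
  H_star   <- solve(solve(H0) + Omega %*% t(Omega))
  Lam_star <- H_star %*% (solve(H0) %*% Lam0 + Omega %*% ystar)
  A        <- sum(ystar^2) + t(Lam0) %*% solve(H0) %*% Lam0
  b_star   <- as.numeric(b0 + 0.5 * (A - t(Lam_star) %*% solve(H_star) %*% Lam_star))
  psi_inv  <- rgamma(1, shape = N / 2 + a0, rate = b_star)
  list(Lambda = mvrnorm(1, Lam_star, H_star / psi_inv), psi_eps = 1 / psi_inv)
}

# psi_delta | . via its Gamma full conditional; A holds the fitted structural means
draw_psi_delta <- function(eta, A, a0, b0) {
  1 / rgamma(1, shape = length(eta) / 2 + a0, rate = b0 + 0.5 * sum((eta - A)^2))
}

# Phi | . ~ IW(Ndot + rho0, xi xi' + R0), sampled by inverting a Wishart draw
draw_phi <- function(xi, rho0, R0) {
  S <- xi %*% t(xi) + R0
  solve(rWishart(1, df = ncol(xi) + rho0, Sigma = solve(S))[, , 1])
}

# tau_p | . via its Gamma full conditional; M is the random-walk penalty matrix
draw_tau <- function(gamma_p, M, a0, b0) {
  1 / rgamma(1, shape = a0 + length(gamma_p) / 2,
             rate = b0 + 0.5 * as.numeric(t(gamma_p) %*% M %*% gamma_p))
}

# gamma_p | . ~ N(gamma_star, Sigma_star) subject to 1' B gamma = 0
draw_gamma <- function(gamma_star, Sigma_star, B) {
  a <- colSums(B)                              # a' = 1' B
  g <- mvrnorm(1, gamma_star, Sigma_star)      # unconstrained draw
  drop(g - (Sigma_star %*% a) * sum(a * g) /
         as.numeric(t(a) %*% Sigma_star %*% a))  # project so a' draw = 0
}
```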

APPENDIX

Results of Simulation Study

Bayesian Estimates for the SPHSEM parameters

Table A1. Bayesian SPHSEM estimates and their 90% probability intervals, with simulated (true) and empirical values.

Parameter   True Value   Empirical Value   Bayes Est.   5th pctle.   95th pctle.
µ1          0.5          -                 0.308        0.138        0.503
µ2          0.5          -                 0.363        0.214        0.527
µ3          0.5          -                 0.413        0.263        0.574
µ4          0.5          -                 0.268        -0.021       0.584
µ5          0.5          -                 0.226        -0.005       0.474
µ6          0.5          -                 0.134        -0.101       0.387
λ2,1        0.8          -                 0.817        0.734        0.899
λ2,2        0.8          -                 0.824        0.715        0.944
λ3,1        0.8          -                 0.727        0.613        0.841
λ3,2        0.8          -                 0.819        0.710        0.941
λ5,1        0.8          -                 0.733        0.646        0.822
λ5,2        0.8          -                 0.790        0.738        0.846
λ6,1        0.8          -                 0.656        0.572        0.741
λ6,2        0.8          -                 0.794        0.741        0.850
ψε,1        1.5          -                 1.409        1.255        1.576
ψε,2        1.5          -                 1.455        1.308        1.605
ψε,3        1.5          -                 1.562        1.392        1.747
ψε,4        1.5          -                 1.188        0.973        1.399
ψε,5        1.5          -                 1.402        1.247        1.570
ψε,6        1.5          -                 1.469        1.316        1.632
Φ(1)        1.0          0.958             1.073        0.959        1.197
Φ(2)        1.0          0.712             0.706        0.414        1.143
ψδ,1(1)     1.0          1.825             1.590        1.299        1.951
ψδ,1(2)     1.0          2.707             1.227        0.615        2.219
τ1(1)       -            -                 2.290        1.039        4.173
τ2(1)       -            -                 0.011        0.002        0.035
τ1(2)       -            -                 0.004        0.001        0.012
τ2(2)       -            -                 0.005        0.001        0.013


Comparison between SPHSEM and HSEM estimates

Table A2. Bayesian SPHSEM and HSEM estimates for simulated parameters.

Parameter   True Value   SPHSEM Est.   Bias      HSEM Est.   Bias
µ1          0.5          0.308         -0.192    0.148       -0.352
µ2          0.5          0.363         -0.137    0.229       -0.271
µ3          0.5          0.413         -0.087    0.280       -0.220
µ4          0.5          0.268         -0.232    -0.611      -1.111
µ5          0.5          0.226         -0.274    -0.456      -0.956
µ6          0.5          0.134         -0.366    -0.539      -1.039
λ2,1        0.8          0.817         0.017     1.073       0.273
λ2,2        0.8          0.824         0.024     0.804       0.004
λ3,1        0.8          0.727         -0.073    0.909       0.109
λ3,2        0.8          0.819         0.019     0.851       0.051
λ5,1        0.8          0.733         -0.067    0.464       -0.336
λ5,2        0.8          0.790         -0.010    0.647       -0.153
λ6,1        0.8          0.656         -0.144    0.417       -0.383
λ6,2        0.8          0.794         -0.006    0.622       -0.178
ψε,1        1.5          1.409         -0.091    1.677       0.177
ψε,2        1.5          1.455         -0.045    1.395       -0.105
ψε,3        1.5          1.562         0.062     1.557       0.057
ψε,4        1.5          1.188         -0.312    0.236       -1.268
ψε,5        1.5          1.402         -0.098    1.752       0.252
ψε,6        1.5          1.469         -0.031    1.751       0.251
Φ(1)        1.0          1.073         0.073     0.729       -0.271
Φ(2)        1.0          0.706         -0.294    0.545       -0.455
ψδ,1(1)     1.0          1.590         0.590     5.476       4.476
ψδ,1(2)     1.0          1.227         0.227     15.278      14.278
γ1          -            -             -         -0.001      -
γ2          -            -             -         -0.734      -
γ3          -            -             -         6.844       -
Σ|Bias|                                3.471                 27.026

APPENDIX

MCMC Results for the Simulation Example

[Figure 7 about here. Trace plots of the MCMC chains (5,000 retained draws): (a) chains for the intercepts µ1 to µ6; (b) chains for the loadings λ1 to λ8.]

Figure 7. MCMC chains for µ and λ. Source: Author's own calculations.


[Figure 8 about here. Trace plots of the MCMC chains for ψε,1 to ψε,6.]

Figure 8. MCMC chains for ψε. Source: Author's own calculations.

[Figure 9 about here. Trace plots and posterior histograms: (a) ψδ(1); (b) ψδ(2).]

Figure 9. MCMC chains and histograms for ψδ (levels 1 and 2). Source: Author's own calculations.

[Figure 10 about here. Trace plots and posterior histograms: (a) Φ(1); (b) Φ(2).]

Figure 10. MCMC chains and histograms for Φ (levels 1 and 2). Source: Author's own calculations.


[Figure 11 about here. Trace plots and posterior histograms: (a) τγ(1) (level 1); (b) τγ(2) (level 2).]

Figure 11. MCMC chains and histograms for τ (levels 1 and 2). Source: Author's own calculations.

[Figure 12 about here. Estimated values of the B-spline coefficients γ (vertical axis) plotted against the total number of knots (horizontal axis), titled "Bayesian SPHSEM values for γ".]

Figure 12. MCMC results for γ. Source: Author's own calculations.

APPENDIX

Description of the Empirical Exercise

In this Appendix we present some useful results for the empirical exercise in Chapter 6.

Bliese et al. (2002) dataset

The items in the Bliese et al. (2002) dataset are organized in 21 variables: five (5) for the Hostile Behavior latent variable (HOSTIL), eleven (11) for Leadership Climate perception (LEAD), three (3) for Task Significance perception (TSIG), and two identifiers. The item responses are on a five-point Likert scale (ordered categorical variables), with the ends anchored by "none" and "extreme" for the HOSTIL items, and "strongly disagree" and "strongly agree" for both the LEAD and TSIG items. The survey questions are outlined below (an R sketch for loading the data follows this list). Soldiers were asked to respond to "I think/feel/believe ...":

- COMPID: Army Company (group) identifying variable.
- SUB: Soldier (individual) identifying number.
- HOSTIL01: I am easily annoyed or irritated.
- HOSTIL02: I have temper outbursts that I cannot control.
- HOSTIL03: I have urges to harm someone.
- HOSTIL04: I have urges to break things.
- HOSTIL05: I get into frequent arguments.
- LEAD01: Officers get cooperation from company.
- LEAD02: Non-commissioned Officers (NCOs) get cooperation from company.
- LEAD03: I am impressed by Leadership.
- LEAD04: I would go for help within chain of command.
- LEAD05: Officers would lead well in combat.
- LEAD06: NCOs would lead well in combat.
- LEAD07: Officers are interested in welfare.
- LEAD08: NCOs are interested in welfare.
- LEAD09: Officers are interested in what I think.
- LEAD10: NCOs are interested in what I think.
- LEAD11: Chain of command works well.
- TSIG01: What I am doing is important.
- TSIG02: I am contributing to the mission.
- TSIG03: What I am doing helps to accomplish the mission.
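The following R sketch shows one way the data can be loaded. We assume the Bliese et al. (2002) benchmarking data correspond to the `lq2002` dataset shipped with the multilevel package (Bliese, 2016), with the item names listed above; the dataset name should be verified before use.

```r
# install.packages("multilevel")  # if needed
library(multilevel)

data(lq2002)  # assumed name of the Bliese et al. (2002) dataset

# Standardize the Likert items, as done in the empirical exercise
items <- grep("^(HOSTIL|LEAD|TSIG)", names(lq2002), value = TRUE)
lq2002[items] <- scale(lq2002[items])
str(lq2002[, c("COMPID", "SUB", head(items, 3))])
```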

The proposed model path diagram

[Figure 13 about here. Path diagram with a group-level (g) part, where the latent variables LEAD_g and TSIG_g affect HOSTIL_g through f21 and f22 and exert a cross-level effect on the individual-level HOSTIL_ig through f13, and an individual-level (i) part, where LEAD_ig and TSIG_ig affect HOSTIL_ig through f11 and f12. Each latent variable is measured by its observed items (L, T, H blocks), with the first loading in each block fixed to 1 for identifiability.]

Figure 13. Proposed path diagram (DAG) for the SPHSEM model for Hostile Behavior, Leadership Climate and Task Significance. Source: Proposed by the author.


SPHSEM Estimation Results

Table A3. SPHSEM Bayesian estimates of the structural parameters in the study of Hostility and Leadership (Bliese et al., 2002).

Par.     Est.     SE.    | Par.     Est.    SE.    | Par.        Est.    SE.
µ1       -0.158   0.463  | λ5,5     0.395   0.037  | ψε,2        0.727   0.026
µ2       -0.046   0.135  | λ6,2     0.334   0.013  | ψε,3        0.445   0.016
µ3       -0.066   0.194  | λ6,5     0.320   0.039  | ψε,4        0.592   0.021
µ4       -0.057   0.166  | λ7,2     0.441   0.012  | ψε,5        0.502   0.018
µ5       -0.063   0.185  | λ7,5     0.424   0.037  | ψε,6        0.664   0.024
µ6       -0.051   0.151  | λ8,2     0.366   0.013  | ψε,7        0.429   0.017
µ7       -0.067   0.198  | λ8,5     0.351   0.039  | ψε,8        0.602   0.023
µ8       -0.056   0.164  | λ9,2     0.442   0.011  | ψε,9        0.426   0.016
µ9       -0.067   0.198  | λ9,5     0.425   0.037  | ψε,10       0.600   0.022
µ10      -0.056   0.165  | λ10,2    0.367   0.012  | ψε,11       0.406   0.016
µ11      -0.068   0.201  | λ10,5    0.352   0.038  | ψε,12       1.291   0.069
µ12      -3.363   0.525  | λ11,2    0.450   0.011  | ψε,13       0.313   0.020
µ13      -1.838   0.269  | λ11,5    0.432   0.037  | ψε,14       0.305   0.017
µ14      -1.849   0.278  | λ13,3    0.551   0.015  | ψε,15       0.549   0.021
µ15      0.004    0.021  | λ13,6    0.546   0.025  | ψε,16       0.512   0.022
µ16      0.003    0.020  | λ14,3    0.554   0.015  | ψε,17       0.304   0.015
µ17      0.004    0.018  | λ14,6    0.548   0.023  | ψε,18       0.218   0.016
µ18      0.004    0.017  | λ16,1    1.111   0.043  | ψε,19       0.495   0.021
µ19      0.003    0.020  | λ16,4    0.807   0.327  | ψδ,1(1)     0.063   0.017
λ2,2     0.300    0.013  | λ17,1    1.332   0.049  | ψδ,1(2)     0.002   0.001
λ2,5     0.287    0.040  | λ17,4    0.975   0.306  | Φ1,1(1)     3.214   0.133
λ3,2     0.435    0.011  | λ18,1    1.415   0.053  | Φ2,2(1)     5.201   0.647
λ3,5     0.417    0.036  | λ18,4    1.022   0.299  | Φ1,2(1)     1.316   0.132
λ4,2     0.371    0.012  | λ19,1    1.130   0.044  | Φ1,1(2)     0.622   0.222
λ4,5     0.355    0.038  | λ19,4    0.822   0.324  | Φ2,2(2)     5.067   1.855
λ5,2     0.411    0.012  | ψε,1     1.952   0.087  | Φ1,2(2)     -0.054  0.924

APPENDIX

R Codes

The simulations presented here were computed using the statistical software R (R Development Core Team, 2017). Codes are available upon request from the author. Please write to [email protected].


Bibliography

Angrist, J. (1990). Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records. American Economic Review, 80(3):313–336. Angrist, J., Imbens, G., and Rubin, D. B. (1996). Indentification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association, 91(434):444– 455. Ansari, A. and Jedidi, K. (2000). Bayesian Factor Analysis for Multilevel Binary Observation. Psychometrika, 65(4):475–496. Ansari, A., Jedidi, K., and Jagpal, S. (2000). A Hierarchical Bayesian Methodology for Treating Heterogeneity in Structural Equation Models. Marketing Science, 19(4):328– 347. Arminger, G. and Muth´en, B. (1998). A Bayesian Approach to Nonlinear Latent Variable Models Using the Gibbs Sampler and the Metropolis-Hastings Algorithm. Psychometrika, 63(3):271–300. Balke, A. and Pearl, J. (1995). Bounds on Treatment Effects from Studies with Imperfect Compliance. Journal of the American Statistical Association, 92(439):1171–1176. Baron, R. M. and Kenny, D. A. (1986). The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations. Journal of Personality and Social Psychology, 51(6):1173–1182. Barringer, S., Eliason, S., and Leahey, E. (2013). A History of Causal Analysis in the Social Sciences. In Morgan, S. L., editor, Handbook of Causal Analysis for Social Research, chapter 2, pages 9–26. New York, NY, US: Springer-Verlag. Besag, J. (1974). Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society - Series B, 36(2):192–236. Besag, J., York, J., and Molli´e, A. (1991). Bayesian Image Restoration with Two Applications in Spatial Statistics. Annals of the Institute of Statistical Mathematics, 43(1):1–20. Bliese, P. D. (2016). multilevel: Multilevel Functions. https://CRAN.R-project.org/package=multilevel.

R package version 2.6.


Bliese, P. D. and Halverson, R. R. (1996). Individual and Nomothetic Models of Job Stress: An Examination of Work Hours, Cohesion, and Well Being. Journal of Applied Social Psychology, 26(13):1171–1189. Bliese, P. D., Halverson, R. R., and Schriesheim, C. A. (2002). Benchmarking multilevel methods in leadership: The articles, the model and the data set. The Leadership Quarterly, 13:3–14. Bollen, K. A. (1987). Total, Direct, and Indirect Effects in Structural Equation Models. Sociological Methodology, 17:39–67. Bollen, K. A. (1989). Structural Equations with Latent Variables. New York, NY, US: John Wiley & Sons, Ltd., 1st edition. Bollen, K. A. and Pearl, J. (2013). Eight Myths about Causality and Structural Equation Models. In Morgan, S. L., editor, Handbook of Causal Analysis for Social Research, chapter 16, pages 301–328. New York, NY, US: Springer-Verlag. Bollen, K. A. and Stine, R. (1990). Direct and Indirect Effects: Classical and Bootstrap Estimates of Variability. Sociological Methodology, 20:115–140. Bowers, J., Fredrickson, M. M., and Panagopoulus, C. (2013). Reasoning about Interference Between Units: A General Framework. Political Analysis, 21:97–124. Brezger, A. and Lang, S. (2006). Generalized structured additive regression based on bayesian P-splines. Computational Statistics & Data Analysis, 50:967–991. Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Glynn, R. J., Avorn, J., and St¨ urmer, T. (2006). Variable Selection for Propensity Score Models. American Journal of Epidemiology, 163(12):1149–1156. Browne, M. W. (1982). Covariance structures. In Hawkins, D. M., editor, Topics in Applied Multivariate Analysis, chapter 2, pages 72–133. Cambridge, UK: Cambridge University Press. Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37(1):62– 38. Cartwright, N. (1983). How the Laws of Physics Lie. Oxford, UK: Clarendon Press Oxford University Press. Cartwright, N. (1989). Nature’s Capacities and their Measurement. Oxford, UK: Clarendon Press - Oxford University Press. Celeux, G., Forbes, F., Robert, C. P., and Titterington, D. M. (2006). Deviance Information Criteria for Missing Data Models. Bayesian Analysis, 1(4):651–674. Chou, C.-P. and Bentler, P. M. (1995). Estimates and Tests in Structural Equation Modeling. In Hoyle, R. H., editor, Structural Equation Modeling: Concepts, Issues and Applications, chapter 3, pages 37–55. Thousand Oaks, CA, US: Sage Publications. Chou, C.-P., Bentler, P. M., and Satorra, A. (1991). Scaled test statistics and robust standard errors for non-normal data in a covariance structure analysis: A Monte Carlo study. British Journal of Mathematical and Statistical Psychology, 44:347–357.


Clarke, K. A., Kenkel, B., and Rueda, M. R. (2011). Misspecification and the Propensity Score: The Possibility of Overadjustment. University of Rochester, Department of Political Science Working Paper. https://www.rochester.edu/college/psc/clarke/MissProp.pdf. Cochran, W. and Rubin, D. B. (1973). Controlling Bias in Observational Studies: A Review. Sankhy¯ a: The Indian Journal of Statistics, Series A, 35(4):417–446. Cowles, M. (1996). Accelerating Monte Carlo Markov Chain convergence for cumulativelink generalized linear models. Statistics and Computing, 6:101–111. Crespo-Tenorio, A. and Montgomery, J. (2013). A Bayesian Approach to Inference with Instrumental Variables: Improving Estimation of Treatment Effects with Weak Instruments and Small Samples. Unpublished manuscript. Daniel, R. M., Cousens, S. N., De Stalova, B. L., Kenward, M. G., and Sterne, J. A. C. (2013). Methods for dealing with time-dependent confounding. Statistics in Medicine, 32(9):1584–1648. Dawid, A. P. (1979). Conditional Independence in Statistical Theory. Journal of the Royal Statistics Society, Series B, 41(1):1–31. Dawid, A. P. (2008). Beware of the DAG! Journal of Machine Learning Research: Workshop and Conference Proceedings, 6:59–86. Dawid, A. P. (2010). Seeing and Doing: The Pearlian Synthesis. In Detcher, R., Geffner, H., and Halpern, J., editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, chapter 18, pages 309–325. London, UK: College Publications. De Boor, C. (1978). A Practical Guide to Splines. New York, NY, US: Springer-Verlag, 1st edition. Duncan, O. D. (1975). Introduction to Structural Equation Models. New York, NY, US: Academic Press, Inc., 1st edition. Duncan, O. D., Haller, A. O., and Portes, A. (1968). Peer Influences on Aspirations: A Reintepretation. American Journal of Sociology, 74(2):119–137. Dunson, D. B. (2000). Bayesian Latent Variable Models for Clustered Mixed Outcomes. Journal of the Royal Statistical Society - Series B, 62(2):355–366. Eilers, P. and Marx, B. (1996). Flexible Smoothing with B-splines and Penalties. Statistical Science, 11(2):89–121. Eilers, P. and Marx, B. (1998). Direct generalized additive modeling with penalized likelihood. Computational Statistics & Data Analysis, 28:193–209. Fahrmeir, L. and Raach, A. (2007). A Bayesian Semiparametric Latent Variable Model for Mixed Responses. Psychometrika, 72(3):327–346. Fox, J. (1984). Linear Statistical Models and Related Methods with Applications to Social Research. New York, NY, US: John Wiley & Sons, Ltd., 1st edition. Geiger, D. and Pearl, J. (1990). Logical and Algorithmic Properties of Conditional Independence and their Application to Bayesian Networks. Annals of Mathematics and Artificial Intelligence, 2:165–178.


Geiger, D. and Pearl, J. (1993). Logical and Algorithmic Properties of Conditional Independence and Graphical Models. The Annals of Statistics, 21(4):2001–2021.

Geiger, D., Verma, T., and Pearl, J. (1990). Identifying Independence in Bayesian Networks. Networks, 20:507–534.

Gelman, A. (2009). Resolving disputes between J. Pearl and D. Rubin on causal inference. http://andrewgelman.com/2009/07/05/disputes about/.

Gelman, A. (2011). Review Essay: Causality and Statistical Learning. American Journal of Sociology, 117(3):955–966.

Gelman, A. and Meng, X.-L. (1998). Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling. Statistical Science, 13(2):163–185.

Gelman, A., Roberts, G. O., and Gilks, W. R. (1995). Efficient Metropolis Jumping Rules. In Bernardo, J. M., Berger, J. O., Dawid, P., and Smith, F. P., editors, Bayesian Statistics 5, pages 599–607. Oxford, UK: Oxford University Press.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. I.E.E.E. Transactions: Pattern Analysis and Machine Intelligence, 6:721–741.

Glymour, C. and Spirtes, P. (1988). Latent Variables, Causal Models and Overidentifying Constraints. Journal of Econometrics, 39:175–198.

Goldberger, A. (1972). Structural Equation Models in the Social Sciences. Econometrica, 40:979–1001.

Goldberger, A. (1973). Structural equation models: an overview. In Goldberger, A. and Duncan, O., editors, Structural Equations Models in the Social Sciences, pages 1–18. New York, NY, US: Seminar Press.

Goldberger, A. and Duncan, O. (1973). Structural Equations Models in the Social Sciences. New York, NY, US: Seminar Press.

Goldstein, H. and McDonald, R. (1988). A General Model for the Analysis of Multilevel Data. Psychometrika, 53(4):455–467.

Goldszmidt, M. and Pearl, J. (1992). Rank-based systems: A simple approach to belief revision, belief update, and reasoning about evidence and actions. In Nebel, B., Rich, C., and Swartout, W., editors, Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning, pages 661–672. San Mateo, CA, US: Morgan Kaufmann.

Greenland, S. (2003). Quantifying Biases in Causal Models: Classical Confounding vs Collider-Stratification Bias. Epidemiology, 14(3):300–306.

Greenland, S., Pearl, J., and Robins, J. M. (1999). Causal Diagrams for Epidemiologic Research. Epidemiology, 10(1):37–48.

Haavelmo, T. (1943). The Statistical Implications of a System of Simultaneous Equations. Econometrica, 11(1):1–12.


Halloran, M. E. and Struchiner, C. J. (1995). Causal Inference in Infectious Diseases. Epidemiology, 6:142–151.

Hammersley, J. and Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.

Hancock, G. R. and Mueller, R. O., editors (2006). Structural Equation Modeling: A Second Course. Quantitative Methods in Education and the Behavioral Sciences. Greenwich, CT, US: Information Age Publishing, 1st edition.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Heckman, J. (2005). The Scientific Model of Causality. Sociological Methodology, 35:1–97.

Heckman, J. and Vytlacil, E. (2007a). Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation. In Heckman, J. and Leamer, E., editors, Handbook of Econometrics, Volume 6B, chapter 70, pages 4779–4874. Amsterdam, NL: North-Holland Publishing Co.

Heckman, J. and Vytlacil, E. (2007b). Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast their Effects in New Environments. In Heckman, J. and Leamer, E., editors, Handbook of Econometrics, Volume 6B, chapter 71, pages 4875–5143. Amsterdam, NL: North-Holland Publishing Co.

Hernán, M. A., Brumback, B., and Robins, J. M. (2000). Marginal Structural Models to Estimate the Causal Effect of Zidovudine on the Survival of HIV-Positive Men. Epidemiology, 11(5):561–570.

Hernán, M. A., Hernández-Díaz, S., and Robins, J. M. (2004). A Structural Approach to Selection Bias. Epidemiology, 15(5):615–625.

Hernán, M. A. and Robins, J. M. (2016). Causal Inference. New York, NY, US: Chapman & Hall, CRC, 1st edition.

Holland, P. (1986). Statistics and Causal Inference. Journal of the American Statistical Association, 81(396):945–960.

Holland, P. (2003). Causation and Race. Educational Testing Service Research Reports, RR-03-03.

Holland, P. W. (1988). Causal Inference, Path Analysis, and Recursive Structural Equation Models. Sociological Methodology, 18:449–484.

Hong, G. and Raudenbush, S. (2003). Causal Inference for Multi-level Observational Data with Application to Kindergarten Retention Study. 2003 Proceedings of the American Statistical Association, Social Statistics Section, pages 1849–1856.

Hong, G. and Raudenbush, S. (2005). Effects of Kindergarten Retention Policy on Children's Cognitive Growth in Reading and Mathematics. Educational Evaluation and Policy Analysis, 27(3):205–224.

Hong, G. and Raudenbush, S. (2006). Evaluating Kindergarten Retention Policy. Journal of the American Statistical Association, 101(475):901–910.


Hong, G. and Raudenbush, S. (2013). Heterogeneous Agents, Social Interactions, and Causal Inference. In Morgan, S. L., editor, Handbook of Causal Analysis for Social Research, chapter 16, pages 331–352. New York, NY, US: Springer-Verlag. Hoogland, J. J. and Boomsma, A. (1998). Robustness Studies in Covariance Structure Modeling. Sociological Methods and Research, 26(3):329–367. Hoyle, R. H., editor (1995). Structural Equation Modeling: Concepts, Issues and Applications. Thousand Oaks, CA, US: Sage Publications, 1st edition. Hudgens, M. and Halloran, E. (2008). Toward Causal Inference with Interference. Journal of the American Statistical Association, 103(482):832–842. Imai, K., Keele, L., and Tingley, D. (2010a). A General Approach to Causal Mediation. Psychological Methods, 15(4):309–344. Imai, K., Keele, L., Tingley, D., and Yamamoto, T. (2010b). Causal Mediation Analysis Using R. In Vinod, H. D., editor, Advances in Social Science Research Using R, chapter 8, pages 129–154. New York, NY, US: Springer-Verlag. Imai, K., Keele, L., and Yamamoto, T. (2010c). Identification, Inference and Sensitivity Analysis for Causal Mediation Effects. Statistical Science, 25(1):51–71. Imai, K. and van Dyk, D. (2005). A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 125:311–334. Imbens, G. and Angrist, J. (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica, 62(2):467–475. Imbens, G. W. (2004). Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review. The Review of Economics and Statistics, 86(1):4–29. Jara, A., Quintana, F., and San-Mart´ın, E. (2008). Linear Mixed Models with SkewElliptical Ddistributions: A Bayesian Approach. Computational Statistics and Data Analysis, 52:5033–5045. Johnston, J. and DiNardo, J. (1997). Econometric Methods. New York, NY, US: McGrawHill Higher Education, 4th edition. J¨oreskog, K. G. (1967). Some Contributions to Maximum Likelihood Factor Analysis. Psychometrika, 32(4):443–482. J¨oreskog, K. G. (1973). A general method for estimating a linear structural equation system. In Goldberger, A. and Duncan, O., editors, Structural Equation Models in the Social Sciences, pages 85–113. New York, NY, US: Seminar Press. J¨oreskog, K. G. (1978). Statistical Analysis of Covariance and Correlation Matrices. Psychometrika, 43(4):443–477. J¨oreskog, K. G. and S¨ orbom, D. (1996). LISREL 8: Structural Equation Modeling with the SIMPLIS Command Language. London, UK: Scientific Software International, 1st edition. Kass, R. E. and Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical Association, 50(430):773–795.


Keesling, J. W. (1972). Maximum likelihood approaches to causal flow analysis. PhD thesis, Department of Education, University of Chicago. Kline, R. B. (2011). Practice and Principles of Structural Equation Modeling. Methodology in the Social Sciences. New York, NY, US: The Guildford Press, 3rd edition. Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA, US: The MIT Press, 1st edition. Lang, S. and Brezger, A. (2004). Bayesian P-Splines. Journal of Computational and Graphical Statistics, 13(1):183–212. Laplace, P.-S. (1814). Essai philosophique sur les probabilit´es. Paris, FR: Courcier, 1st edition. Lauritzen, S. L. (1996). Graphical Models. Number 17 in Oxford Statistical Science Series. Oxford, UK: Clarendon Press - Oxford University Press. Lee, S.-Y. (1990). 77(4):763–762.

Multilevel analysis of structural equation models.

Biometrika,

Lee, S.-Y. (2007). Structural Equation Modeling: A Bayesian Approach. New York, NY, US: John Wiley & Sons, Ltd., 1st edition. Lee, S.-Y. and Shi, J.-Q. (2001). Maximum Likelihood Estimation of Two-Level Latent Variable Models with Mixed Continuous and Polytomous Data. Biometrics, 57(3):787– 794. Lee, S.-Y. and Song, X.-Y. (2003a). Bayesian analysis of structural equation models with dichotomous variables. Statistics in Medicine, 22:3073–3088. Lee, S.-Y. and Song, X.-Y. (2003b). Model Comparison of Nonlinear Structural Equation Models with Fixed Covariates. Psychometrika, 68(1):27–47. Lee, S.-Y. and Song, X.-Y. (2004). Evaluation of the Bayesian and Maximum Likelihood Approaches in Analyzing Structural Equation Models with Small Sample Sizes. Multivariate Behavioral Research, 39(4):653–686. Lee, S.-Y. and Tang, N.-S. (2006). Bayesian Analysis of Two-level Structural Equation Models with Cross Level Effects. Unpublished manuscript. Lee, S.-Y. and Zhu, H.-T. (2000). Statistical analysis of nonlinear structural equation models with continuous and polytomous data. British Journal of Mathematical and Statistical Psychology, 50:209–232. Lee, S.-Y. and Zhu, H.-T. (2002). Maximum Likelihood Estimation of Nonlinear Structural Equation Models. Psychometrika, 67(2):189–210. Liu, L. and Hudgens, M. G. (2014). Large Sample Randomization Inference of Causal Effects in the Presence of Interference. Journal of the American Statistical Association, 109(505):288–301. Liu, W., Brookhart, M. A., Schneeweiss, S., Mi, X., and Setoguchi, S. (2012). Implications of M-Bias in Epidemiological Studies: A Simulation Study. American Journal of Epidemiology, 176(10):938–948.


Longford, N. and Muth´en, B. (1992). Factor Analysis for Clustered Observations. Psychometrika, 57(4):581–597. Lundin, M. and Karlsson, M. (2014). Estimation of causal effects in observational studies with interference between units. Statistical Methods & Applications, 23(3):417–433. Manski, C. (2003). Partial Identification of Probability Distributions. Springer Series in Statistics. New York, NY, US: Springer-Verlag, 1st edition. Manski, C. (2013). Identification of treatment response with social interactions. Econometrics Journal, 16(1):1–23. Marcoulides, G. A. and Schumacker, R. E. (2009). New Developments and Techniques in Structural Equation Modeling. Mahwah, NJ, US: Lawrence Erlbaum Associates Publishers, 2nd edition. McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. New York, NY, US: Chapman & Hall, CRC, 2nd edition. McDonald, R. and Goldstein, H. (1989). Balanced versus Unbalanced Designs for Linear Structural Relations in Two-level Data. British Journal of Mathematical and Statistical Psychology, 42:215–232. Mehta, P. and Neale, M. (2005). People Are Variables Too: Multilevel Structural Equation Modeling. Psychological Methods, 10(3):259–284. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092. Mill, J. S. (1843). A System of Logic. London, UK: Parker. Morgan, S. and Winship, C. (2007). Counterfactuals and Causal Inference. New York, NY: Cambridge University Press, 1st edition. Mulaik, S. (2009). Linear Causal Modeling with Structural Equations. Statistics in the Social and Behavioral Sciences Series. New York, NY, US: Chapman & Hall, CRC, 1st edition. Muth´en, B. (2011). Applications of Causally Defined Direct and Indirect Effects in Mediation Analysis using SEM in Mplus. Technical Report. Los Angeles, CA, US: Muth´en & Muth´en. Muth´en, B. O. (1984). A General Structural Equation Model with Dichotomous, Ordered Categorical, and Continuous Latent Variable Indicators. Psychometrika, 49(1):115–132. Muth´en, B. O. (1989). Latent Variable Modeling in Heterogeneous Populations. Psychometrika, 54(4):557–585. Muth´en, B. O. (1994). Multilevel Covariance Structure Analysis. Sociological Methods and Research, 22(3):376–398. Muth´en, B. O. and Asparouhov, T. (2015). Causal Effects in Mediation Modeling: An Introduction with Applications to Latent Variables. Structural Equation Modeling: A Multidisciplinary Journal, 22:12–23.


Neuberg, L. G. (2003). Causality: Models, Reasoning and Inference: A Review. Econometric Theory, 19:675–685.
Neugebauer, R. and van der Laan, M. (2006). Causal effects in longitudinal studies: Definition and maximum likelihood estimation. Computational Statistics & Data Analysis, 51:1664–1675.
Neugebauer, R. and van der Laan, M. (2007). Nonparametric causal effects based on marginal structural models. Journal of Statistical Planning and Inference, 137:419–434.
Neyman, J. (1923). On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9. Statistical Science, 5(4):465–480.
Nowzohour, C. (2015). Estimating Causal Networks from Multivariate Observational Data. PhD thesis, ETH Zürich.
Ogburn, E. and VanderWeele, T. J. (2014). Causal Diagrams for Interference. Statistical Science, 29(4):559–578.
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. Proceedings, Cognitive Science Society, pages 329–334.
Pearl, J. (1986). Markov and Bayes Networks: a Comparison of Two Graphical Representations of Probabilistic Knowledge. UCLA Technical Report R-46-I. http://ftp.cs.ucla.edu/tech-report/198_-reports/860024.pdf.
Pearl, J. (1988a). On the definition of actual cause. UCLA Technical Report R-259. ftp://ftp.cs.ucla.edu/pub/stat_ser/R259.pdf.
Pearl, J. (1988b). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA, US: Morgan Kaufmann, 1st edition.
Pearl, J. (1993). Comment: Graphical models, causality and intervention. Statistical Science, 8(3):266–269.
Pearl, J. (1995). Causal diagrams for empirical research (with discussions). Biometrika, 82(4):669–710.
Pearl, J. (1998). Graphs, Causality and Structural Equation Models. Sociological Methods and Research, 27(2):226–284.
Pearl, J. (2000). Causality: Models, Reasoning and Inference. New York, NY, US: Cambridge University Press, 1st edition.
Pearl, J. (2001a). Causal Inference in the Health Sciences: A Conceptual Introduction. Health Services & Outcomes Research Methodology, 2:189–220.
Pearl, J. (2001b). Direct and Indirect Effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 411–420. San Mateo, CA, US: Morgan Kaufmann.
Pearl, J. (2005). Direct and Indirect Effects. In Proceedings of the American Statistical Association Joint Statistical Meetings, pages 1572–1581. Minneapolis, MN, US: MIRA.
Pearl, J. (2009a). Causal inference in statistics: An overview. Statistics Surveys, 3:96–146.


Pearl, J. (2009b). Causality: Models, Reasoning and Inference. New York, NY, US: Cambridge University Press, 2nd edition.
Pearl, J. (2009c). Letter to the Editor: Remarks on the method of propensity score. Statistics in Medicine, 28(9):1415–1416.
Pearl, J. (2009d). Myth, Confusion and Science in Causal Analysis. UCLA Technical Report R-348. http://web.cs.ucla.edu/~kaoru/r348.pdf.
Pearl, J. (2010a). The Foundations of Causal Inference. Sociological Methodology, 40(1):75–149.
Pearl, J. (2010b). An Introduction to Causal Inference. The International Journal of Biostatistics, 6(2):1–58.
Pearl, J. (2012a). The Causal Foundations of Structural Equation Modeling. In Hoyle, R. H., editor, Handbook of Structural Equation Modeling, chapter 5, pages 68–91. New York, NY, US: The Guilford Press.
Pearl, J. (2012b). Interpretable conditions for identifying direct and indirect effects. UCLA Technical Report R-389. http://ftp.cs.ucla.edu/pub/stat_ser/r389-tr.pdf.
Pearl, J. (2013). Linear Models: A Useful "Microscope" for Causal Analysis. Journal of Causal Inference, 1(1):155–170.
Pearl, J. (2016). Causal Inference in Statistics: A Gentle Introduction. Tutorial at the Joint Statistical Meetings (JSM-16), Chicago, IL, August 1, 2016.
Pearl, J. and Paz, A. (1987). Graphoids: A Graph-based Logic for Reasoning about Relevance Relations. In Duboulay, B., Hogg, D., and Steels, L., editors, Advances in Artificial Intelligence II, pages 357–363. Amsterdam, NL: North-Holland Publishing Co.
Pearl, J. and Verma, T. (1991). A theory of inferred causation. In Allen, J. A., Fikes, R., and Sandewall, E., editors, Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning, pages 441–452. San Mateo, CA, US: Morgan Kaufmann.
Petersen, M. L. and van der Laan, M. J. (2014). Causal Models and Learning from Data: Integrating Causal Modeling and Statistical Estimation. Epidemiology, 25(3):418–426.
Preacher, K., Zhang, Z., and Zyphur, M. (2010). A General Multilevel SEM Framework for Assessing Multilevel Mediation. Psychological Methods, 15(3):209–233.
Preacher, K., Zhang, Z., and Zyphur, M. (2011). Alternative Methods for Assessing Mediation in Multilevel Data: The Advantages of Multilevel SEM. Structural Equation Modeling, 18(2):161–182.
Preston, C. C. and Colman, A. M. (2000). Optimal Number of Response Categories in Rating Scales: Reliability, Validity, Discriminating Power, and Respondent Preferences. Acta Psychologica, 104:1–15.
R Development Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.


Rabe-Hesketh, S. and Skrondal, A. (2012). Multilevel and Longitudinal Modeling Using Stata. College Station, TX, US: Stata Press, 3rd edition.
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature. The Stata Journal, 2(1):1–21.
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2004). Generalized Multilevel Structural Equation Modeling. Psychometrika, 69(2):167–190.
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics, 128:301–323.
Rabe-Hesketh, S., Skrondal, A., and Zheng, X. (2007). Multilevel Structural Equation Modelling. In Lee, S.-Y., editor, Handbook of Latent Variables and Related Models, chapter 10, pages 209–227. Amsterdam, NL: North-Holland Publishing Co.
Rabe-Hesketh, S., Skrondal, A., and Zheng, X. (2012). Multilevel Structural Equation Modeling. In Hoyle, R., editor, Handbook on Structural Equation Models, chapter 30, pages 512–531. New York, NY, US: The Guilford Press.
Raudenbush, S. and Bryk, A. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods. Thousand Oaks, CA, US: Sage Publications, 2nd edition.
Richardson, T. and Robins, J. M. (2013a). Single World Intervention Graphs: A Primer. UAI Workshop on Causal Structural Learning, Bellevue, Washington. http://www.statslab.cam.ac.uk/~rje42/uai13/Richardson.pdf.
Richardson, T. and Robins, J. M. (2013b). Single World Intervention Graphs: A Unification of the Counterfactual and Graphical Approaches to Causality. Center for Statistics and the Social Sciences, University of Washington, Technical Report 128. https://www.csss.washington.edu/Papers/wp128.pdf.
Robert, C. P. (1995). Simulation of truncated normal variables. Statistics and Computing, 5:121–125.
Robins, J. M. (1986). A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period: Application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512.
Robins, J. M. (1987). Addendum to "A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period: Application to control of the healthy worker survivor effect". Computers and Mathematics with Applications, 14(9):917–921.
Robins, J. M. (1992). Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika, 79(2):321–334.
Robins, J. M. (1993). Analytic Methods for Estimating HIV-Treatment and Cofactor Effects. In Ostrow, D. G. and Kessler, R. C., editors, Methodological Issues in AIDS Behavioral Research, chapter 9, pages 213–290. New York, NY, US: Plenum Press.
Robins, J. M. (1994). Correcting for Non-compliance in Randomized Trials Using Structural Nested Mean Models. Communications in Statistics - Theory and Methods, 23(8):2379–2412.


Robins, J. M. (1995). Discussion in "Causal diagrams for empirical research". Biometrika, 82(4):695–698.
Robins, J. M. (1997). Causal Inference from Complex Longitudinal Data. In Berkane, M., editor, Latent Variable Modeling and Applications to Causality, volume 120 of Lecture Notes in Statistics, pages 69–117. New York, NY, US: Springer-Verlag.
Robins, J. M. (1998). Marginal Structural Models. In 1997 Proceedings of the American Statistical Association, pages 1–10. Alexandria, VA, US: American Statistical Association, Section on Bayesian Statistical Science.
Robins, J. M. (1999a). Association, Causation and Marginal Structural Models. Synthese, 121:151–179.
Robins, J. M. (1999b). Marginal Structural Models versus Structural Nested Models as Tools for Causal Inference. In Halloran, M. E. and Berry, D., editors, Statistical Models in Epidemiology, the Environment and Clinical Trials, volume 116 of The IMA Volumes in Mathematics and its Applications, pages 95–133. New York, NY, US: Springer-Verlag.
Robins, J. M. (1999c). Testing and Estimation of Direct Effects by Reparameterizing Directed Acyclic Graphs with Structural Nested Models. In Glymour, C. and Cooper, G., editors, Computation, Causation and Discovery, chapter 12, pages 349–405. Menlo Park, CA; Cambridge, MA, US; London, UK: AAAI Press & The MIT Press.
Robins, J. M., Blevins, D., Ritter, G., and Wulfsohn, M. (1992). G-Estimation of the Effect of Prophylaxis Therapy for Pneumocystis carinii Pneumonia on the Survival of AIDS Patients. Epidemiology, 3(4):319–336.
Robins, J. M. and Hernán, M. A. (2009). Estimation of the causal effects of time-varying exposures. In Fitzmaurice, G., Davidian, M., Verbeke, G., and Molenberghs, G., editors, Longitudinal Data Analysis, Handbooks of Modern Statistical Methods, chapter 23, pages 553–599. New York, NY, US: Chapman & Hall, CRC.
Robins, J. M., Hernán, M. A., and Brumback, B. (2000). Marginal Structural Models and Causal Inference in Epidemiology. Epidemiology, 11(5):550–560.
Robins, J. M., Hernán, M. A., and Siebert, U. (2004). Effects of Multiple Interventions. In Ezzati, M., Murray, C., López, A. D., and Rodgers, A., editors, Comparative Quantification of Health Risks: Global and Regional Burden of Disease Attributable to Selected Major Risk Factors, volume 2, chapter 28, pages 2191–2230. Geneva, CH: World Health Organization Press.
Rosenbaum, P. (2005). Heterogeneity and Causality: Unit Heterogeneity and Design Sensitivity in Observational Studies. The American Statistician, 59(2):147–152.
Rosenbaum, P. (2007). Interference Between Units in Randomized Experiments. Journal of the American Statistical Association, 102(477):191–200.
Rosenbaum, P. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
Rosenbaum, P. and Rubin, D. B. (1984). Reducing Bias in Observational Studies Using Subclassification on the Propensity Score. Journal of the American Statistical Association, 79(387):516–524.


Rubin, D. B. (1973). Matching to Remove Bias in Observational Studies. Biometrics, 29:159–183.
Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology, 66(5):688–701.
Rubin, D. B. (1977). Assignment to Treatment Group on the Basis of a Covariate. Journal of Educational Statistics, 2(1):1–26.
Rubin, D. B. (1978). Bayesian Inference for Causal Effects: The Role of Randomization. The Annals of Statistics, 6(1):34–58.
Rubin, D. B. (1979). Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies. Journal of the American Statistical Association, 74(366):318–328.
Rubin, D. B. (1980). Bias Reduction Using Mahalanobis-Metric Matching. Biometrics, 36(2):293–298.
Rubin, D. B. (2005). Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. Journal of the American Statistical Association, 100(469):322–331.
Rubin, D. B. (2006). Matched Sampling for Causal Effects. New York, NY, US: Cambridge University Press, 1st edition.
Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36.
Rubin, D. B. (2008). Author's Reply. Statistics in Medicine, 27(14):2741–2742.
Rubin, D. B. (2009). Author's Reply: Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups? Statistics in Medicine, 28(9):1420–1423.
San-Martín, E., Jara, A., Rolin, J.-M., and Mouchart, M. (2011). On the Bayesian Nonparametric Generalization of IRT-type Models. Psychometrika, 76(3):385–409.
Scheines, R., Spirtes, P., and Glymour, C. (1991). Building Latent Variable Models. Carnegie Mellon University Working Papers Series - Department of Philosophy, 19. http://repository.cmu.edu/cgi/viewcontent.cgi?article=1249&context=philosophy.
Sekhon, J. (2008). The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods. In Box-Steffensmeier, J., Brady, H., and Collier, D., editors, The Oxford Handbook of Political Methodology, chapter 11, pages 271–299. Oxford, UK: Oxford University Press.
Sekhon, J. (2009). Opiates for the Matches: Matching Methods for Causal Inference. Annual Review of Political Science, 12:487–508.
Setoguchi, S., Schneeweiss, S., Brookhart, M. A., Glynn, R. J., and Cook, E. F. (2008). Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiology and Drug Safety, 17(6):546–555.


Shi, J.-Q. and Lee, S.-Y. (1998). Bayesian sampling-based approach for factor analysis models with continuous and polytomous data. British Journal of Mathematical and Statistical Psychology, 51:233–252.
Shi, J.-Q. and Lee, S.-Y. (2000). Latent variable models with mixed, continuous and polytomous data. Journal of the Royal Statistical Society - Series B, 62(1):77–87.
Shpitser, I. and Pearl, J. (2006). Identification of Conditional Interventional Distributions. In Dechter, R. and Richardson, T. S., editors, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 437–444. Corvallis, OR, US: AUAI Press.
Shrier, I. (2008). Letter to the Editor. Statistics in Medicine, 27(14):2740–2741.
Sjölander, A. (2009). Letter to the Editor: Propensity scores and M-structures. Statistics in Medicine, 28(9):1416–1420.
Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Boca Raton, FL, US: Chapman & Hall, CRC.
Sobel, M. E. (1987). Direct and Indirect Effects in Linear Structural Equation Models. Sociological Methods & Research, 16(1):155–176.
Sobel, M. E. (1990). Effect Analysis and Causation in Linear Structural Equation Models. Psychometrika, 55(3):495–515.
Sobel, M. E. (2006). What Do Randomized Studies of Housing Mobility Demonstrate?: Causal Inference in the Face of Interference. Journal of the American Statistical Association, 101(476):1398–1407.
Song, X.-Y. and Lee, S.-Y. (2004). Bayesian analysis of two-level nonlinear structural equation models with continuous and polytomous data. British Journal of Mathematical and Statistical Psychology, 57:29–52.
Song, X.-Y. and Lee, S.-Y. (2012a). Basic and Advanced Bayesian Structural Equation Modeling with Applications in the Medical and Behavioral Sciences. New York, NY, US: John Wiley & Sons, Ltd., 1st edition.
Song, X.-Y. and Lee, S.-Y. (2012b). A tutorial on the Bayesian approach for analyzing structural equation models. Journal of Mathematical Psychology, 56:135–148.
Song, X.-Y., Lee, S.-Y., Ng, M., So, W.-Y., and Chan, J. (2007). Bayesian analysis of structural equation models with multinomial variables and an application to type 2 diabetic nephropathy. Statistics in Medicine, 26:2348–2369.
Song, X.-Y. and Lu, Z.-H. (2010). Semiparametric Latent Variable Models with Bayesian P-Splines. Journal of Computational and Graphical Statistics, 19(3):590–608.
Song, X.-Y. and Lu, Z.-H. (2012). Semiparametric transformation models with Bayesian P-splines. Statistics and Computing, 22(5):1085–1098.
Song, X.-Y., Lu, Z.-H., Cai, J.-H., and Ip, E. H.-S. (2013). A Bayesian Modeling Approach for Generalized Semiparametric Structural Equation Models. Psychometrika, 78(4):624–647.


Spearman, C. (1904). "General Intelligence", Objectively Determined and Measured. The American Journal of Psychology, 15(2):201–292.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society - Series B, 64(4):583–639.
Spirtes, P. (2001). An Anytime Algorithm for Causal Inference. AI and Statistics 2001 Conference.
Spirtes, P. (2005). Graphical models, causal inference, and econometric models. Journal of Economic Methodology, 12(1):1–33.
Spirtes, P. (2010). Introduction to Causal Inference. Journal of Machine Learning Research, 11:1643–1662.
Spirtes, P. and Glymour, C. (1991). An Algorithm for Fast Recovery of Sparse Causal Graphs. Social Science Computer Review, 9(1):67–72.
Spirtes, P., Glymour, C., and Scheines, R. (1991). From Probability to Causality. Philosophical Studies, 64(1):1–36.
Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction and Search. Cambridge, MA, US: The MIT Press, 2nd edition.
Spirtes, P., Glymour, C., Scheines, R., and Tillman, R. (2010). Automated Search for Causal Relations: Theory and Practice. In Dechter, R., Geffner, H., and Halpern, J., editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, chapter 27, pages 467–506. London, UK: College Publications.
Spirtes, P., Meek, C., and Richardson, T. (1995). Causal Inference in the Presence of Latent Variables and Selection Bias. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 499–506. San Mateo, CA, US: Morgan Kaufmann.
Spirtes, P. and Richardson, T. (1997). A Polynomial Time Algorithm for Determining DAG Equivalence in the Presence of Latent Variables and Selection Bias. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics. Fort Lauderdale, FL, US: Society for Artificial Intelligence and Statistics.
Spirtes, P., Richardson, T., and Meek, C. (1997). Heuristic Greedy Search Algorithms for Latent Variable Models. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics. Fort Lauderdale, FL, US: Society for Artificial Intelligence and Statistics.
Suppes, P. (1970). A Probabilistic Theory of Causality. Amsterdam, NL: North-Holland Publishing Co., 1st edition.
Tanner, M. and Wong, W. H. (1987). The Calculation of Posterior Distributions by Data Augmentation (with discussion). Journal of the American Statistical Association, 82(398):528–540.
Tchetgen Tchetgen, E. J. and VanderWeele, T. J. (2012). On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75.


Thulasiraman, K. and Swamy, M. N. S. (1992). Graphs: Theory and Algorithms. Toronto, CA: John Wiley & Sons, Ltd.
Toulis, P. and Kao, E. (2013). Estimation of Causal Peer Influence Effects. Journal of Machine Learning Research: Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 28(3):1489–1497.
van der Laan, M. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY, US: Springer-Verlag, 1st edition.
van der Laan, M. J. (2010a). Targeted Maximum Likelihood Based Causal Inference: Part I. The International Journal of Biostatistics, 6(2). Article 2.
van der Laan, M. J. (2010b). Targeted Maximum Likelihood Based Causal Inference: Part II. The International Journal of Biostatistics, 6(2). Article 3.
van der Laan, M. J. (2012). Causal Inference for Networks. U.C. Berkeley Division of Biostatistics Working Paper Series, WP 300. http://biostats.bepress.com/ucbbiostat/paper300/.
van der Laan, M. J. (2014). Causal Inference for a Population of Causally Connected Units. Journal of Causal Inference, 2(1):13–74.
van der Laan, M. J. and Rubin, D. (2006). Targeted Maximum Likelihood Learning. The International Journal of Biostatistics, 2(1). Article 11.
VanderWeele, T. J. and An, W. (2013). Social Networks and Causal Inference. In Morgan, S. L., editor, Handbook of Causal Analysis for Social Research, chapter 17, pages 353–374. New York, NY, US: Springer-Verlag.
VanderWeele, T. J. and Tchetgen Tchetgen, E. J. (2011). Effect partitioning under interference in two-stage randomized vaccine trials. Statistics and Probability Letters, 81:861–869.
VanderWeele, T. J., Vandenbroucke, J. P., Tchetgen Tchetgen, E. J., and Robins, J. M. (2012). A Mapping Between Interactions and Interference: Implications for Vaccine Trials. Epidemiology, 23(2):285–292.
Vansteelandt, S. and Joffe, M. (2014). Structural Nested Models and G-estimation: The partially realized promise. Statistical Science, 29(4):707–731.
Verbitsky-Savitz, N. and Raudenbush, S. (2012). Causal Inference Under Interference in Spatial Settings: A Case Study Evaluating Community Policing Program in Chicago. Epidemiologic Methods, 1(1):107–130.
Verma, T. (1993). Graphical Aspects of Causal Models. UCLA Technical Report R-191. http://ftp.cs.ucla.edu/pub/stat_ser/r191.pdf.
Verma, T. and Pearl, J. (1988). Causal Networks: Semantics and Expressiveness. Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, pages 352–359.
Verma, T. S. and Pearl, J. (1990). Equivalence and Synthesis of Causal Models. Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, pages 220–227.


Wagner, H. and Tüchler, R. (2010). Bayesian estimation of random effects models for multivariate responses of mixed data. Computational Statistics & Data Analysis, 54:1206–1218.
Wiley, D. E. (1973). The identification problem for structural equation models with unmeasured variables. In Goldberger, A. and Duncan, O., editors, Structural Equations Models in the Social Sciences, pages 69–84. New York, NY, US: Seminar Press.
Winship, C. and Morgan, S. (1999). The Estimation of Causal Effects from Observational Data. Annual Review of Sociology, 25:659–706.
Witteman, J. C. M., D'Agostino, R. B., Stijnen, T., Kannel, W. B., Cobb, J. C., de Ridder, M. A. J., Hofman, A., and Robins, J. M. (1998). G-Estimation of Causal Effects: Isolated Systolic Hypertension and Cardiovascular Death in the Framingham Heart Study. American Journal of Epidemiology, 148(4):390–401.
Wright, S. (1920). The Relative Importance of Heredity and Environment in Determining the Piebald Pattern of Guinea-Pigs. Proceedings of the National Academy of Sciences, 6:320–332.
Wright, S. (1921). Correlation and Causation. Journal of Agricultural Research, 20(7):557–585.
Wright, S. (1934). The Method of Path Coefficients. The Annals of Mathematical Statistics, 5(3):161–215.
Xie, Y. (2013). Population Heterogeneity and Causal Inference. Proceedings of the National Academy of Sciences, 110(16):6262–6268.
Zhu, H.-T. and Lee, S.-Y. (1999). Statistical analysis of nonlinear factor analysis models. British Journal of Mathematical and Statistical Psychology, 52:225–242.