Marginalized Exponential Random Graph Models

Thomas Suesse¹
Centre for Statistical and Survey Methodology
School of Mathematics and Applied Statistics
University of Wollongong
NSW 2522, Australia
email: [email protected]
tel: (61) 2 4221 4173; fax: (61) 2 4221 4845

Abstract

Exponential random graph models (ERGMs) are a popular tool for modeling social networks representing relational data, such as working relationships or friendships. Data on exogenous variables relating to participants in the network, such as gender or age, are also often collected. ERGMs allow modeling of the effects of such exogenous variables on the joint distribution, specified by the ERGM, but not on the marginal probabilities of observing a relationship. In this paper we consider an approach to modeling a network that uses an ERGM for the joint distribution of the network, but then marginally constrains the fit to agree with a generalized linear model (GLM) defined in terms of this set of exogenous variables. This type of model, which we refer to as a marginalized ERGM, is a natural extension of the standard ERGM that allows a convenient population-averaged interpretation of parameters, for example in terms of log odds ratios when the GLM includes a logistic link, as well as fast computation of marginal probabilities. Several algorithms to obtain maximum likelihood estimates are presented, with a particular focus on reducing the computational burden. These methods are illustrated using data on the working relationship between 36 partners in a New England law firm. Supplemental materials for the article are available online.

Key Words: Social network; Exponential random graph model; Marginalized models; Odds ratio; Markov chain Monte Carlo; Maximum likelihood

¹ Thomas Suesse is a lecturer at the Centre for Statistical and Survey Methodology, School of Mathematics and Applied Statistics, University of Wollongong, NSW 2522, Australia (Email: [email protected]).

1 Introduction

Networks, or mathematical graphs, are an important tool for representing relational data, i.e. data on the existence, strength and direction of relationships between interacting actors. Types of actors include individuals, firms and countries. In its most basic form, a network consists of a set of $n$ nodes and a set of edges, where nodes represent actors and edges the presence of a specific relationship between actors. The network can be represented by an $n \times n$ matrix $Y = (Y_{ij})_{i,j=1}^{n}$, where $Y_{ij}$ is a binary indicator that takes the value one if an edge exists from $i$ to $j$ and zero otherwise. By convention $Y_{ii} = 0$. A pair of nodes is often called a dyad. The most commonly used model for a network is an exponential family model (Casella and Berger 2002) of the form

\[
\Pr(Y = y; \theta) = \exp\left\{ \eta(\theta)' Z(y) - \kappa(\theta) \right\},
\quad \text{with} \quad
\kappa(\theta) = \log\left\{ \sum_{\tilde{y} \in S} \exp\left( \eta(\theta)' Z(\tilde{y}) \right) \right\},
\tag{1}
\]

where the summation is over the sample space $S$ of the network. The vector $\theta \in \mathbb{R}^p$ contains the model parameters, $Z(y) \in \mathbb{R}^q$ is a vector of network statistics and $\kappa(\theta)$ is the normalizing constant. Here $\eta(\theta)$ is a mapping from $\mathbb{R}^p$ to $\mathbb{R}^q$ with $p \le q$. There are two important sub-cases. For the identity map $\eta(\theta) = \theta$, (1) is a canonical exponential family model, and the $\eta_i$ are the canonical parameters. When $\eta(\theta)$ is non-linear and $p < q$, (1) defines a curved exponential family model (Efron 1978). For random graphs, the first sub-case is often referred to as the exponential random graph model (ERGM) and the second as the curved ERGM (CERGM). To be consistent with standard exponential family terminology, we refer to the first sub-case as the canonical ERGM and to the more general model specified by (1) as an ERGM.

These models are currently widely used for social networks (Strauss and Ikeda 1990; Snijders 2002; Hunter and Handcock 2006; the last reference is abbreviated as HH06). The first model of this type for social networks was proposed by Holland and Leinhardt (1981) and is known as the $p_1$ model. There are many choices of $Z(y)$, see e.g. Morris et al. (2008), and most induce dependence among dyads. Maximum likelihood (ML) estimation is complicated and can usually only be achieved by a stochastic approximation of the log-likelihood using Markov chain Monte Carlo (MCMC) algorithms. However, ML estimation for canonical ERGMs based on MCMC methods often fails because of model degeneracy. CERGMs were introduced (Snijders et al. 2006; HH06) to reduce the problem of degeneracy.

Node attributes are also frequently collected, and a small number, say $l$, can be regarded as covariates. Let the $n \times l$ matrix $X$ contain these covariates. Usually only exogenous covariates are considered, i.e. variables that are not influenced by the network; see HH06. Suppose the scientific interest is in modeling the marginal probability $\Pr(Y_{ij} = 1)$ conditional on $X$, denoted by $\Pr(Y_{ij} = 1 \mid X)$. For example, what is the effect on $\Pr(Y_{ij} = 1)$ of actors $i$ and $j$ having the same gender, encoded as $f(X) = 1$ for equal gender and zero otherwise? ERGMs are not useful for marginal modeling, because marginal probabilities are generally intractable.

A naive approach that ignores the dyadic dependence is to apply a generalized linear model (GLM) (McCullagh and Nelder 1989) instead of an ERGM. A GLM allows marginal probabilities to be calculated easily and provides a convenient population-averaged interpretation of the parameters, for example in terms of log odds ratios for the logit link. However, this approach does not account for the dyadic dependence structure of the network, which is likely to result in incorrect standard errors. Covariate effects such as $f(X)$ can also be accounted for in an ERGM by adding statistics depending on $f(X)$ to $Z(y)$. However, when the main interest is in exogenous effects, such as equal gender, the interpretation of parameters in terms of conditional log odds ratios is difficult, as demonstrated in Section 2.

In order to apply a marginal model and to account for the dyadic dependence, we introduce marginalized ERGMs (MERGMs) in Section 2, combining GLMs and ERGMs. Advantages of MERGMs over ERGMs are discussed by means of an example, using the Lazega (2001) data set. In Section 3, we derive two sets of likelihood-based estimating equations and use a Fisher-scoring scheme for solving them. Details of ML estimation are described, including two alternative methods for solving the two sets of likelihood equations in each step of the iterative process. Section 4 illustrates the proposed method on the Lazega data set. This article finishes with a discussion.
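As a minimal sketch of equation (1), and of why marginal probabilities are generally intractable under an ERGM, the following Python code enumerates the full sample space S of an undirected network on n = 4 nodes, evaluates κ(θ) by brute force and sums joint probabilities to obtain Pr(Yij = 1). The canonical case η(θ) = θ, the choice Z(y) = (number of edges, number of triangles), the parameter values and all function names are assumptions made for illustration only; they are not taken from the paper.

```python
import itertools
import math


def dyads(n):
    """All unordered node pairs (i, j), i < j, of an undirected network."""
    return list(itertools.combinations(range(n), 2))


def network_stats(y, pairs, n):
    """Illustrative (assumed) statistics Z(y) = (edges, triangles)."""
    edges = sum(y)
    adj = dict(zip(pairs, y))
    triangles = sum(
        adj[(i, j)] * adj[(i, k)] * adj[(j, k)]
        for i, j, k in itertools.combinations(range(n), 3)
    )
    return (edges, triangles)


def ergm_distribution(theta, n):
    """Brute-force evaluation of (1) for the canonical case eta(theta) = theta."""
    pairs = dyads(n)
    # Sample space S: every 0/1 assignment to the n(n-1)/2 dyads.
    sample_space = list(itertools.product((0, 1), repeat=len(pairs)))
    scores = [
        sum(t * z for t, z in zip(theta, network_stats(y, pairs, n)))
        for y in sample_space
    ]
    kappa = math.log(sum(math.exp(s) for s in scores))
    probs = [math.exp(s - kappa) for s in scores]
    return pairs, sample_space, probs, kappa


def marginal_edge_prob(sample_space, probs, dyad_index):
    """Pr(Y_ij = 1): sum the joint probabilities of all networks containing the edge."""
    return sum(p for y, p in zip(sample_space, probs) if y[dyad_index] == 1)


# Hypothetical parameter values chosen only for illustration.
pairs, space, probs, kappa = ergm_distribution(theta=(-1.0, 0.5), n=4)
print(len(space))                          # 64 networks on 4 nodes
print(marginal_edge_prob(space, probs, 0)) # marginal probability of the first dyad
```

The enumeration has 2^(n(n-1)/2) terms, so this direct computation is feasible only for very small n. For a network with 36 nodes there are 630 dyads and hence 2^630 possible networks, which is why the marginal probabilities of an ERGM cannot be obtained this way in practice.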

2 Marginalized Exponential Random Graph Models

2.1 Limitations of Exponential Random Graph Models

Consider a network of collaborative working relationships between 36 partners in a New England law firm described in detail by Lazega and Pattison (1999) and Lazega (2001). An edge between two partners exists if both partners indicate collaboration with each other. This network is undirected, i.e. $Y = Y'$, which will always be assumed in the remainder. The data also contain a number of attributes of each partner: seniority (rank number of entry into the firm), practice (litigation/corporate law), gender (male/female) and office (3 offices in different cities). One might think of fitting an ERGM to these data. Two examples of typical network statistics Z(y) that describe an ERGM are: the number of edges denoted by E = Σ_i