Methods of ecological inference for disaggregation problems in ... - UFJF

0 downloads 95 Views 204KB Size Report
The EM-EI Software (Alpha Version), developed by the author using the. MatLab Language (Mathworks, Inc.), was used to co
Methods of ecological inference for disaggregation problems in operations research Rogério Silva de Mattos1 Universidade Federal de Juiz de Fora Faculdade de Economia e Administração [email protected] Abstract. Operations research techniques are widely used in business projects planning. Usually, a preliminary task of project planners is to assess the potential market for a business project, which involves defining the target population and evaluating its size. As the potential market is part of a larger aggregate of consumers, techniques for disaggregate data estimation (DDE) are often needed, and for long entropy optimization techniques have been used with this purpose. However, a line of research on DDE techniques displaying promising developments is Ecological Inference (EI). The paper presents a brief review of the recent literature on EI techniques, aiming at diffusing them among operations research analysts, and also a new method proposed by the author to make EI via the statistical technique known as EM Algorithm. Keywords: ecological inference, disaggregation, intensive computing, EM Algorithm. Resumo. Técnicas de pesquisa operacional são muito usadas no planejamento de projetos empresariais. Uma tarefa preliminar dos planejadores consiste em avaliar o mercado potencial, o que implica em conceituar a população alvo e medir seu tamanho. Sendo o mercado potencial parte de um agregado maior de consumidores, técnicas para estimação de dados desagregados (EDD) se fazem necessárias e, há muito, métodos de otimização da entropia vêm sendo usados com essa finalidade. Entretanto, uma linha de pesquisa em EDD que vem apresentando desenvolvimentos promissores é Inferência Ecológica (IE). Este artigo apresenta uma breve revisão da literatura recente sobre técnicas de IE, visando divulgá-las entre analistas de pesquisa operacional, e também um novo método proposto pelo autor para se fazer IE através da técnica estatística conhecida como Algoritmo EM. Palavras-chave: inferência ecológica, desagregação, computação intensiva, Algoritmo EM.

1. Introduction A number of business projects, as the launching of a new product in the market, the location of an industrial plant or of a shopping center, and the creation/expansion of a distribution network, for just a few examples, makes intensive use of operations research (OR) techniques. However, at early stages of project planning, a preliminary task − important in determining project feasibility − is to assess the potential market (or target population) to be attended by the products/services of the business activity. Being the potential market part of a larger aggregate of individuals/institutions, techniques for disaggregate data estimation (DDE) are needed to evaluate a variety of attributes of interest, like, for instance, market size and mean income by consumer groups. By far, the most used technique for market research has been sample survey, which allows for the assessment of a variety of market characteristics and is firmly grounded on statistical theory. A number of reasons, but in particular the high costs involved in making sample surveys, have led scientific literature to devise alternatives. A major among these alternatives was found in entropy optimization (EO), largely used as a DDE tool in urban and transport planning research since the 1960s (Wilson, 1970; Novaes, 1982). However, a line of research on DDE displaying promising developments at the present is Ecological Inference . Researchers on EI techniques use to say that EI is 1

Ph.D. Student in Control Theory and Statistics at The Department of Electrical Engineering/PUC-Rio. This paper is based on an ongoing research being developed under my Ph.D. thesis work, and I gratefully acknowledge fellowship support from the PICDT−UFJF/CAPES.

concerned with inferring individual behavior from aggregate, or "ecological", data (e.g.,King, 1997). As such, EI techniques are suitable to tackle DDE problems in various application areas, particularly in those where socio-economic data are needed (OR analysts may face DDE problems, for instance, when determining efficient ways to supply a line of products to a stratified market: in this case, disaggregate market information is essential for project planning success). The field of EI emerged from American studies on disaggregate voting behavior in the first decades of this century, but just a small number of EI techniques had been proposed until recently. A new perspective was open by a recent book from King (1997), where this author introduced new approaches based on modern resources of mathematical statistics and intensive computing. His study rekindled the interest of various researchers in developing new EI techniques. In this paper, we will make a review of new developments in these literature, aiming at diffusing them among OR analysts. We will focus, though, on King’s EI model, as an alternative approach proposed by this paper’s author to estimate it, based on a statistical technique known as EM Algorithm, will also be presented. The rest of this paper is organized as follows: In section 2, we state in more precise terms the type of DDE problem with which EI Research is concerned; in Section 3, notation for the EI problem is introduced; in Section 4, the recent EI developments are briefly reviewed; in Section 5, the parametric model for EI proposed by King (1997) is described; and in Section 6, an alternative approach based on the EM Algorithm to estimate King’s model with partial results already available are presented; Finally, in Section 8, some remarks are made. 2. Disaggregation problems in OR A variety of disaggregation problems arise in business planning that demands the use of OR techniques. For instance, the well known "Transportation Model", presented in standard textbooks on OR (e.g., Whitehouse and Weschler, 1976) is a tool designed to tackle a problem of distributing resources available from various sources to various destinations, with the target of minimizing the total transportation cost involved in the overall set of distribution flows. This problem is illustrated by Table 1. Table 1 Illustration of the standard transportation problem

SOURCES

DESTINATIONS D1

D2

...

DN

S1

?

?

...

?

S2

?

?

...

?

...

...

...

...

...

SM

?

?

...

?

Information on quantities of resources supplied at sources and demanded at destinations, and the unit cost per travel route are available to us. Our goal is to determine resource flows from each source to each destination: say, the unknown contents of table’s cells, represented by the question marks "?" in Table 1. To attain it, we could use a linear programming technique to find a best solution: we mean, a linear objective function representing total cost of transportation would be minimized subject to a set of linear constraints representing adding-up to row (sources) and adding-up to column (destina2

tions) totals. More sophisticated approaches based on non-linear programming techniques might also be used. A particular feature of this problem is that cells’ contents are decision variables under our control. When determining a best solution, what we are doing is deciding, aided by an OR technique, how much from each source should we allocate to each destination. The solution is thus a deterministic one. Another particular feature is that it is a disaggregation problem: what we know is the aggregate information from the row and column totals. The disaggregate information, say, the unknown contents of table’s cells, we have to seek using an OR technique. The EI problem is a disaggregation problem quite similar to that in Table 1, but with a few important differences. In EI, we also know the row and column (aggregate) totals of a table (or tables), and alike we want to determine the (disaggregate) contents of table’s cells. Formally, the two problems may be represented with the same type of adding−up constraints. However, even when we know with certainty the aggregate information, EI solutions are always subject to uncertainty. When making EI, we do not seek for an efficient decision managing variables under our control, but for a solution that maximizes the chances that variables describing phenomena of our interest assume certain values. Also formally, the two problems have different types of objective functions, as they are solved with different purposes. We mentioned in the Introduction that EO techniques have been largely used to solve OR problems where EI approaches would also be appropriate. However, while EO is a general approach to tackle a variety of disaggregation problems, EI is specific for the estimation of data, usually socio-economic data, unavailable in disaggregate form. Methodologically, EO makes use of pure mathematical programming techniques which usually provide deterministic solutions; EI, by its turn, is committed to statistical methods, where randomness and uncertainty are essential features of the solutions obtained. 3. The EI problem In technical terms, the EI problem represents a situation where the analyst/planner is interested in cell data for one or more contingency tables (or values tables), but he/she knows only the row and column totals of the table(s). These totals are called the aggregate (or ecological) data. Analyst’s goal is to determine the contents of table’s cells. Table 2 depicts this situation for the simplest case of a 2×2 tables problem. Table 2 Representation of the EI problem for the 2×2 tables case

Variable A

Variable B 1

2

Totals

1

β i11

1 − β i11

Xi

2

β i21

1 − β i21 1 − X i

Totals

Ti

3

1 − Ti

1

where:

β i11 = proportion of 1st category of variable B in 1st category of variable A; β i21 = proportion of 1st category of variable B in 2nd category of variable A; X i = proportion of 1st category of variable A in the total of its two categories; Ti = proportion of 1st category of variable B in the total of its two categories. Variables in Table 2 are defined as proportions, instead of absolute values, because it allows for a direct interpretation of results. For instance, in voting behavior studies, variable A might be race, e.g. blacks and whites, and Variable B might be partisan candidate, e.g. Republican or Democrat; thus, β i11 would represent proportion of blacks, and β i21 proportion of whites, voting for the Republican candidate. By their turn, X i would be the proportion of blacks, and Ti the proportion of people voting for the Republican candidate, both in total turnout of voting age population. The notation in Table 2 is general and applicable to a variety of contexts: In economics, variable A might be levels of family income, and variable B number of goods purchased; in sociology, variable A might be number of crimes by city regions, and variable B number of crimes by type; in transport planning, variable A might be number of residents by residential colonies, and variable B number of jobs by trade areas. Thus, EI techniques are suited for a wide range of DDE problems (including some arising in OR applications). Also, EI research is concerned with developing techniques for R×C tables problems, though implementation of such techniques are still limited with this respect (King, 1999). In Table 2, Ti and X i represent the known aggregate data. Subscript i indexes the tables, or sample units, ranging from 1 to P − the number of tables used in the EI analysis. By their turn, β i11 and β i21 represent the unknown disaggregate data and the target of EI problem solving; for this reason, β i11 and β i21 are called the quantities of interest. And once they become known, the contents of all cells in all tables will also become known. Figure 1, illustrate this distinct feature of EI techniques as compared to others that allow for just one table at a time. Figure 1 The use of various tables

β i11

1 − β i11

β i21

1 − β i21 1 − X i

Ti

β111

1 − β 111 X 1

β121

1 − β 121 1 − X 1

T1

1 − T1

1 − Ti

Xi

1

1

Source: Adapted from Mattos and Veiga (1999).

4

β P11

1 − β P11 X P

β P21

1 − β P21 1 − X P

TP

1 − TP

1

The EI problem is solved in such a way that cells’ contents in all tables are determined simultaneously. The proportions appearing outside the 2×2 tables of Figure 1, say, the pairs { (Ti , X i ) : i = 1,...,P}, are known aggregate data used to estimate each quantity of interest in each table. Various tables allows for the use of more observations (each table is a sample unit), and, by another side, for "borrowing strength" from the information in other tables in the estimation of cells’ values of a particular table. This latter aspect brings efficiency to estimation, if it happens that all tables have "something in common" (King, 1997). In practice, not always exist such a communality, at least among all the P tables, and King’s model admits extensions that allows different mean patterns for the quantities of interest in different tables. 4. Approaches to solve the EI problem To solve the EI problem, posed in its 2×2 version, a number of approaches have been suggested. Though the issue is old (Achen and Schively, 1995; King, 1997), a large part of EI approaches were proposed quite recently. Previous to the work of King (1997), the most used technique was Goodman Regression (Goodman, 1953). To understand its basic features, consider the following accounting identity from Table 2: Ti = β i11 X i + β i21 (1 − X i )

(4.1)

Relation (4.1) is a deterministic fact of any contingency or values table, not an assumption. To estimate the quantities of interest, Goodman (1953) assumed they are fixed for all tables, say: { β i11 = ℜ11 , β i21 = ℜ 21 : i = 1,...,P}, where ℜ11 and ℜ 21 are two constants. Any discrepancy between right and left-hand sides of (4.1) should be due to a random error, namely εi, and (4.1) might be re-written as: Ti = ℜ11 X i + ℜ 21 (1 − X i ) + ε i

(4.2)

Expression (4.2) is a linear regression model without a constant, being estimable by ˆ 11 and βˆ i21 = ℜ ˆ 21 ordinary least squares. Constant estimates across sample units βˆ i11 = ℜ may be produced for the quantities of interest, and it is immediate to extend Goodman’s model for problems with R×C tables. Yet, Goodman Regression is a limited technique because parameter constancy along different tables is a strong assumption, easy to be violated in a number of cases (e.g., Cho, 1997 and 1998) and because no restriction exist for the values of parameter estimates, which, as proportions, must lie within the [0,1] interval. As a consequence, it may happens that we find higher than 100% estimates or negative proportions using this approach (cf. King, 1997, p. 16). In an innovative work, King (1997) introduced a method for EI that overcomes all limitations of Goodman Regression. His method allows for quantities of interest to vary along different tables, never produces estimates outside the [0,1] interval, and incorporates a method of deterministic bounds from Duncan and Davis (1953). King’s method also admits extensions that allow for the effects of explanatory variables and uses modern intensive computing resources to reliably estimate parameters and disaggregate data. His work produced a sound impact on research, by reviving the interest on EI and inducing a number of studies: some applying and testing his method, others presenting new approaches (Cho, 1997 and 1998; Rivers and Cho, 1997; King, Rosen and Tanner, 1998; Penurbati and Shuessler, 1998; Rivers, 1998; and Lewis, 1998). Critical works have also been written (Freedman et al,1998; Cho, 1998; Freedman et al, 1999; and Ferree, 1999). The major limitations pointed against King’s model relies, first, on its diagnosis side, since formal statistical tests for checking goodness of fit and distributional assumptions are absent, while the graphical diagnostics suggested may be misleading (Cho, 1998); and second, on that it is restricted to 2×2 5

tables, lacking an implementation for the R×C tables case − actually, King (1997, Chapter 15) generalized his method to RxC tables, but did not implement it. King, Rosen and Tanner (1998), for short KRT, presented a more sophisticated approach: the hierarchical binomial-beta model. KRT claim it can address a variety of EI problems, being more flexible than King’s model. First, because a wide range of shapes are allowed for posterior distributions, an attribute from the beta distribution, which makes KRT’s model capable of recognizing data generated under distributional assumptions of King’s model and, thus, of providing "data analytic checks" (KRT, 1998; p. 1) for the latter. Second, becaise the binomial-beta model can be generalized to a number of EI problem types, including R×C tables, due to the use of modern Markov Chain Monte Carlo (MCMC) methods (Gibbs Sampling and nested Metropolis−Hastings algorithms) in the estimation process. The authors warn, however, that this increased flexibility is paid at the cost of increased computation, as MCMC methods are intensive consumers of computer time. Yet, generalizations to other situations brings new research challenges, and KRT conclude their work opens important paths for future EI research. In this paper, we will be focusing on the EI method of King (1997), referring the reader to the work of KRT (1998) for more details on the hierarchical binomial-beta model for EI. We have chosen to present and discuss King’s method because, as told in the Introduction, it is the object of a research on an alternative estimation approach which will be presented in Section 6. 5. King’s EI normal model The EI normal model proposed by King (1997) is being called this way here because it is based on an assumption of bivariate normality. It was designed to solve the simplest 2×2 EI problem of Table 2. Its first feature is the use of two deterministic facts of that problem: a) The accounting identity Ti = β i11 X i + β i21 (1 − X i ) , and b) the method of bounds, introduced by Duncan and Davis (1953). The importance of this method of bounds is that, given the aggregate data, the quantities of interest may belong to narrower intervals than [0,1], which reduces problem uncertainty regarding their estimation. For more details and examples, see Anchen and Schively (1995, pp. 190-193), King (1997) and Mattos and Veiga (1999). Using the notations L and U to denote lower and upper bounds, respectively, for a quantitiy of interest, the method of bounds is deterministic because, given the aggregate data Ti and 2 11 21 21 21 Xi, it is a true fact that β i11 ∈ [ L11 i ,U i ] and β i ∈ [ Li , U i ] , where :

 Ti − (1 − X i )   ≥ 0  L11 i = max 0, Xi   T  U i11 = min i ,1 ≤ 1  Xi 

 Ti − X i  ≥0  L21 i = max 0, X i    T  U i21 = min i ,1 ≤ 1 1 − X i 

(5.1)

A second feature of the EI normal model consists of the following assumptions: 1. The Xi (i = 1,...,P) variables are fixed, say, they are not random variables; 2. The quantities of interest follows a bivariate normal distribution truncated on the unit square [0,1] × [0,1] ∈ R 2 , as: 2

The formulas for the accounting identity in a) and for the method of bounds in (5.1) formulas can be generalized to R×C tables (King, 1997, Chapter 15).

6

(

 ℜ11   ( 2  β i11    ~ TN [0,1]×[ 0,1]   ( ;  ( (σ 11(  β 21   ℜ w   ρσ 11σ 21  i   

(( ( ρσ 11σ 21    i = 1,...,P ( σ 212  

(5.2)

(The symbol "∪" over the parameters in (5.2) is used to indicate they are from an untruncated distribution which was truncated to produce (5.2)); 3. The quantities of interest are mean independent of the regressors:

β i11 = ℜ11 + ε i11

(5.3)

β i21 = ℜ 21 + ε i21

(5.4)

It means β i11 and β i21 , are random variables with constant means which are thus independent of X i and 1 − X i ; the notations ℜ11 in (5.3) and ℜ 21 in (5.4) are not covered by a "∪" symbol to indicate they are the real truncated means; 4. The conditional random variable Ti | X i is independent across different tables or sample units (it is usually called the spatial independence assumption); From the facts and assumptions above, King (1997) builds his EI method in two stages: First, he estimates the "untruncated" parameters of the truncated bivariate normal in (5.2), say, the elements of the following vector: (

(

(

(

(

(

ψ = [ℜ11 , ℜ 21 ,σ 112 , σ 212 , ρ ]T

(5.5)

Second, he uses the parameters estimates to simulate the marginal posterior distributions for β i11 and β i21 , given the aggregate data T, and the resulting simulated values to compute estimates of these quantities of interest and of their associated standard errors. These stages are detailed next. 1st Stage: Estimating parameters of the truncated bivariate normal Assumption 4 of the EI normal model allows the likelihood function to be written as: (

(

p

P(T | ψ ) = ∏ P(Ti | ψ )

(5.6)

i =1

In (5.6), vector T = [T1 ,..., T p ] (

is the given sample of aggregate data and

P(Ti | ψ ) = TN ( µ i , σ ) represent the truncated normal distribution of each Ti , where (

2 i

(

(

(

((

(

µ i = ℜ b X i + ℜ w (1 − X i ) and σ i2 = σ 112 X i2 + σ 212 (1 − X i2 ) + 2 ρσ 11σ 21 X i (1 − X i ) are, ( respectively, the mean and the variance of the untruncated normal version of P(Ti | ψ ) . Actually, King does not use the likelihood formulation in (5.6), but a reparameterized one based on a one-to-one onto transformation of the parameter vector, given by ( φ = t(ψ ) , where φ = [φ1 , φ 2 , φ 3 , φ 4 , φ 5 ]T is such that: (

ℜ11 − 0.5 φ1 = ( 2 σ 11 + 0.25

(5.7)

ℜ 21 − 0.5 φ2 = ( 2 σ 21 + 0.25

(5.8)

φ 3 = ln σ 11

(5.9)

φ 4 = ln σ 21

(5.10)

(

(

(

7

(

 1+ ρ  φ5 = 0.5 ln (  1− ρ 

(5.11)

This reparameterization have three purposes: facilitate numerical optimization; reduce ( ( the correlation between means and variances (i.e, between ℜ11 and σ 112 , and between ( (2 ℜ 21 and σ 21 ); and allow for a better normal approximation of the likelihood (or posterior) function. The reparameterized likelihood is written: p

P(T | φ ) = ∏ P(Ti | φ )

(5.12)

i =1

Under a bayesian approach, where priors are used for the parameters in (5.5), we should work with the posterior function: P(φ | T ) ∝ P(φ ) P(T | φ ) (5.13) where P(φ ) is the prior distribution for φ . The parameters estimates are obtained by maximizing (5.12) or (5.13) for φ , thus finding:

φˆ = arg max[P(φ | T )]

(5.14)

φ

King implements this maximization via the Constrained Maximum Likelihood (CML) module, developed by Schoenberg (1995) and available for use with the GAUSS Programming Language (Aptech Systems, Inc.) 2nd Stage: Estimating the quantities of interest by simulation The CML/GAUSS module also produces an estimate, namely V (φˆ) , of the variancecovariance matrix of the parameters (cf. Schoenberg, 1995). If P is a large number, meaning that aggregate data is plenty, then a normal approximation can be adopted for the likelihood or posterior function (Tanner, 1996), like: φ | T ~ N φˆ;V (φˆ) (5.15)

(

)

The next step involves computing the posterior distributions for the quantities of interest. Considering the case of β i11 first, if φ were known we would just need to compute its marginal posterior P ( β i11 | T ) ; in such an instance, it would be equivalent to P ( β i11 | T , φ ) . However, by (5.15), φ is a random vector and thus P ( β i11 | T ) should be

obtained by averaging (integrating) P ( β i11 , φ | T ) over the uncertainty in φ, say:

P( β i11 | T ) ∝ ∫ P( β i11 , φ | T )dφ

(5.16)

Θ

where Θ is the parameter space for φ. However, solving analytically the multiple (5−tuple) integral in (5.16) is too a complex task. Alternatively, King obtains (5.16) by Monte Carlo Simulation: for each i-th table, a large number of values of β i11 and β i21 are randomly drawn and used to summarize their respective posteriors. The overall process is briefly described in Box 1. The reader is warned that step 3 in this Box uses the marginal distribution for β i11 conditioned on Ti and the parameters, say: ( ω ω2  ( ( P( β i11 | Ti ,ψ ) = TN  ℜ11 + i2 ε i ; σ i2 − i2  σi σi   (

(( (

(5.17) (

(

where ω i = σ 112 X i + ρσ 11σ 21 (1 − X i ) and ε i = Ti − ℜ11 X i − ℜ 21 (1 − X i ) . 8

Box 1 Estimating the disaggregate data by simulation 1. From the multivariate normal φ | T in (5.15), generate a random value φ~ ; 2. Reparameterize back the simulated value φ~ to the original parameterization using ψ~( = t −1 (φ~) , the inverse transformation of equations (5.7)-(5.11); then use importance sampling to improve the normal approximation (producing a new ψ~( ); ~ ~ 3. Use ψ~( in (5.16) and draw from P( β i11 | Ti ,ψ( ) a random value β i11 ;. if it happens ~

to fall outside [ L11i , U i11 ] (see (5.1)), repeat this step until β i11 ∈ [ L11i , U i11 ] ; 4. Compute a simulated value for the second quantity of interest, say β~i21 , using the accounting identity (4.1) rearranged as follows: T X i ~ 11 ~ (B1) β i21 = i − βi 1− X i

1− X i

Note: a) This relation between β~i 21 and β~i11 is a fixed one, when the aggregates Ti and Xi are given; and b) β~i 21 is generated from the simulated β~i11 and not from ( P( β i21 | Ti ,ψ ) , which is done to preserve model consistency; 5. Repeat steps 1-4 a large number of times until truncated simulated distributions for P( β i11 | Ti ) and P ( β i21 | Ti ) are obtained with a desired degree of precision; 6. Repeat steps 1-5 for each table or sample unit (say, for i = 1,...,P).

Figure 2 displays the types of truncated marginal posteriors that may result from such an exercise. Graph a) shows a situation where truncation is not operant; graph b) shows a posterior with truncation being operant just on one of its side; and graph c) shows a posterior with truncation being operant on both sides; truncation being operant with L = 0 and U = 1 may also happen. Figure 2 Examples of marginal posteriors simulated for the quantities of interest

(a)

(b)

(c)

The 2nd stage ends when means and standard deviations of the marginal posteriors for β i11 and β i21 , respectively, are computed for all P tables, as follows:

1 K ~ βˆ i11 = ∑ β i11( k ) K k =1

SE ( β i11 ) =

1 K

∑ (β

1 K ~ βˆ i21 = ∑ β i21( k ) K k =1

SE ( β i21 ) =

1 K

∑ (β

9

K

~ 11( k )

k =1

K

k =1

i

~ 21( k ) i

− βˆ i11

)

− βˆ i21

2

)

2

(5.18)

(5.19)

where K is the number of simulated values for the quantities of interest in each table. Computing confidence intervals at any desired level for β i11 and β i21 (i = 1,...,P) using expressions in (5.18) and (5.19) is straightforward. The EI normal model admits an extension where means’ patterns are allowed to vary along the sample unit tables. This extension is introduced in model structure by a slight modification of the means’ assumption 3. Instead of remaining constants, the ( ( "untruncated" mean values ℜ11 and ℜ 21 become functions of vectors of explanatory variables Z i11 and Z i21 , respectively, as: (

[ (( ( = [φ (σ

) ] ( + 0.25)+ 0.5]+ (Z

) )α

ℜ11 = φ1 σ 112 + 0.25 + 0.5 + Z i11 − Z 11 α 11 (

ℜ 21

2

2 21

T

21 i

− Z 21

T

21

(5.20) (5.21) (

(

These two expressions are derived by inverting (5.7) and (5.8) for ℜ11 and ℜ 21 , and adding to the right-hand side of each resulting expression the terms ( Z i11 − Z 11 )T α 11 and ( Z i21 − Z 21 )T α 21 , respectively. Vectors α 11 and α 21 are parameters vectors whose elements are the corresponding coefficients of the explanatory variables in Z i11 and Z i21 . The vectors Z 11 and Z 21 contain the respective means of these explanatory variables. According to King (1997; p. 170), the explanatory variables enters as deviations from their means so as not to change the interpretation of the original parameters and so that the vectors Z i11 and Z i21 need not include a constant term. Vectors α 11 and α 21 are jointly estimated with the other parameters of the reparameterized vector φ . For such, the posterior in (5.13) must be re-written as: P(φ ,α 11 , α 21 | T ) ∝ P(φ , α 11 , α 21 ) P(T | φ , α 11 , α 21 )

(5.22)

and then (by using CML/GAUSS) the following estimates are obtained: [φˆ, αˆ 11 , αˆ 21 ] = arg max P (φ , α 11 , α 21 | T ) (5.23) 11 21 φ ,α ,α

^ V [φˆ,αˆ 11 ,αˆ 21 ] = Cov[φˆ,αˆ 11 ,αˆ 21 ]

(5.24)

The procedures for estimating the disaggregate data remain the same adopted for the case with no explanatory variables. The difference is just that the means of the quantities of interest now are given by (5.20) and (5.21), and their estimates may follow different patterns because of the explanatory variables’ effects. 6. An approach based on the EM Algorithm We present now a different approach to estimate the EI normal model. It is based on a technique called Expectation Maximization Algorithm, for short EM Algorithm, that is an alternative to the optimization techniques used for maximizing the likelihood in (5.12) or the posteriors in (5.13) and (5.22). Three reasons for using the EM Algorithm for EI are: a) At the same time it estimates model’s parameters, it also estimates the quantities of interest; b) It can be naturally extended to cope with R×C tables problems; and c) For certain cases, it can be faster. The EM Algorithm will be described here in brief (references with detailed material will be present ahead). Its application to King’s EI normal model is a research being developed by the author of this paper as part of his Ph.D. Thesis (see more at the web address 10

http://www3.ufjf.br/~rmattos). Research findings which will be presented here are preliminary, as they are still for the 2×2 tables case without allowance for explanatory variables effects. They have the only purpose of illustrating possibilities of the EM Algorithm technique in the context of EI . Key concepts of the EM Algorithm formalism are those of complete and incomplete data. In various statistical applications, we might be concerned with a given variable for which observations are not available or measurements are impossible. It represents a variety of situations, like a time series data with some periods missing observations, a survey research database with non-responded items, or a set of values for which its is known only their sum. The complete data is the sample information we should have to estimate a model’s parameters, but actually we cannot observe it. The incomplete data is an associated sample information we can observe but whose information content for parameter estimation is lower than that of the complete data. The two forms of the data are related to each other by a many-to-one mapping: say, to each sample of incomplete data, it is associated many possible samples of complete data. In our EI problem, for instance, we observe a sample of aggregate data (row and column totals for a number of tables): but, in a strict sense, we should have the disaggregate data (contents of tables cells) to estimate model parameters by traditional methods. We know there may be many, indeed an infinity, of samples of unobserved dissaggregate data consistent with our (unique) sample of observed aggregate data. Formally speaking, suppose we are interested in a continuous (or a discrete) random variable X, generated by a probabilistic model f ( x | θ ) , were f is a known density, x an observation from X, and θ a vector of unknown parameters. Suppose also we cannot observe X. However, we can write its likelihood function f (x | θ ) , where

K

x = [ x1 , , x P ]T is a (non-observable) random sample from X. The vector x is called the complete data, and f (x | θ ) the likelihood of the complete data (LCD). However, assume we observe a sample vector y = [ y1 , , y Q ] , which is deterministically related to x, say: y = h(x), such that Q < P. The vector y is called the incomplete data. Now, let X be the sample space for x, and Y the sample space for y. Since Q < P, then h: X→Y is a many-to-one mapping relating X to Y. A single point y ∈Y has associated to it a subset of X − namely X(y), the inverse image of y under h − containing many points x ∈ X. Then, we can write the likelihood of the incomplete data (LID) as:

K

g (y | θ ) = ∫

X(y )

f (x | θ )dx

(6.1)

How does the EM Algorithm works? What the EM Algorithm does is to find estimates of θ using the incomplete data y and the form of the LCD to indirectly maximize the LID. It is not undertaken at once, but in a sequence of iterations, where, in each iteration, two stages are performed: the Expectation Stage (E-Stage), and the Maximization Stage (M-Stage). The general EM scheme is as follows: S1. Assume a guess for θ , say θ k ; S2. E-Stage: Given a guess θ k , compute: Q (θ ,θ k ) = E[log f (x | θ ) | θ k , y ] =∫

X(y)

log f (x | θ ) × f (x | θ k , y )dx

(6.2)

S3. M-Stage: Maximize Q(θ ,θ k ) for θ , finding:

θˆ = arg max Q(θ ,θ k )

(6.3)

θ

11

S4. Set θ k +1 = θˆ ; S5. Repeat steps S1−S4 a number of times until convergence. The EM Algorithm scheme S1-S5 presents good convergence properties: For instance, the LID never decreases in each iteration, and, under general conditions, it will always converge to a stationary point. For a systematic accounting of the EM Algorithm theory and various convergence theorems, see the recent book from McLachlan and Krishnan (1997). The EM Algorithm is appealing for EI because it also produces estimates of the complete data. It is done, according to S1-S5, in every E-Stage performed in every iteration. Of interest are only the complete data estimates from the last iteration, after convergence has been achieved. By this property, a path is open to use the EM Algorithm for EI. For instance, to put the EI normal model (2x2 tables case) from King (1997) under the EM formalism, we may assume the following: a) Complete data = disaggregate data: β = vec( β 11 , β 21 ) ; b) Incomplete data = aggregate data: T = [T1 ,..., T p ]T ; (

(

(

(

(

(

c) Parameter vector: ψ = [ℜ11 , ℜ21 ,σ 112 ,σ 212 , ρ ]T or φ = [φ1 , φ 2 , φ 3 , φ 4 , φ 5 ]T ; d) Many to one mapping = accounting identity: T = Xβ 11 + [I P − X]β 21 ; T 21 where β 11 = [ β 111 ,..., β 11 = [ β 121 ,..., β p21 ]T , IP is the P×P identity matrix, and p ] , β

K

X = diag ([ X 1 . , X P ]T ) . Note the many-to-one mapping is given by the accounting identity (4.1), presented in a generalized form in d). Now, if we let B be the sample space for β and Γ the sample space for T , then B(T) ⊂ B is the inverse image of the point T under the mapping h : B → Γ . The LID, the LCD and the Q−Function can be defined as: ( e) LCD: P( β | ψ ) ; (

(

f) LID: P(T | ψ ) = ∫B (T ) P( β | ψ , T )dβ ; g) Q-Function:

Q (ψ ,ψ k ) = E β [log p ( β | ψ )ψ k , T ] ( (

( ( (

(6.4)

(

= ∫Β (T ) log[ p ( β | ψ )] × p( β | ψ k , T )dβ

To turn operational the methodology, we must know the forms of the LCD, of the conditional (on T) LCD, and of the LID. The latter we already know from (5.6). The formers are given by: − LCD: (

P

(

(

P

P( β | ψ ) = ∏ P( β ib , β iw | ψ ) =

∏ N 2 ( β i11 , β i21 | ψ ) i =1

(

R(ψ ) P

i =1

− Conditional LCD: (

P( β | ψ , T ) = (

(

(

P( β | ψ ) P N 2 ( β i11 , β i21 | ψ ) ( =∏ ( ( P (T | ψ ) i =1 N (Ti | ψ ) S i (ψ )

β ∈ B(T )

(6.5)

β ∈ B(T )

(6.6)

(

N (, | ψ ) indicates the normal distribution and S i (ψ ) the truncation factor of the marginal posterior for β i11 . Given definitions a)-g) and (6.4)-(6.6), we are able to use the EM Algorithm to estimate the EI normal model: we just have to follow steps S1−S5. Let us call it the EM−EI methodology. A major difficulty to implement it is associated with computing 12

the multiple (2P−tuple) integral in (6.4). It is hard to be solved in an analytical closed form, and thus we have been working on two approaches to compute an approximation: numerical integration, and Monte Carlo simulation of the E−Stage (Tanner, 1996). The procedure adopted here is based on the first approach, and consists of: a) Generating, for each quantity of interest, a grid of equally spaced values between their corresponding bounds; b) Estimating those quantities as means of their truncated marginal posteriors, ( given a guess ψ k say: U ( ( βˆi11 = E [β i11 | ψ k , T ]= ∫ β i11 P( β i11 | ψ k , T )dβ i11 b i

Lbi

( βˆi21 = E [β i21 | ψ k , T ]=

Ti X i ˆ 11 + β 1− X i 1− X i

i = 1,...,P

(6.7)

i = 1,...,P

(6.8)

(note that (6.7) is easily computed via numerical integration); and c) Substituting the estimates from (6.7)−(6.8) in the following approximation for the Q-Function: ( ( Qˆ * (ψ ,ψ ) = log P (ψ ) + log Pˆ ( βˆ 11 , βˆ 21 | ψ ) (6.9) k

k

k

where: ( ( ( α Pˆ ( β k11 , β k21 | ψ ) = (R(ψ ) − P ) ∏ N 2 ( βˆ ik11 , βˆ ik21 | ψ )θ P

i =1

(6.10) Part of the approximation to (6.4) that (6.9) represents comes from the fact that (6.10) is an approximation to (6.5). The term inside the production operator in the right−hand side of (6.10) is the untruncated bivariate normal density to θ, and the term outside is ( the volume of this density over the unit square R(ψ ) elevated to −αP. We shall rewrite the approximation (6.9) as: ( ( ( ( ( Qˆ * (ψ ,ψ k ) = log P(ψ ) + α (− P log R(ψ ) )+ θ ∑ log N 2 ( βˆik11 , βˆ ik21 | ψ ) P

(6.11)

i =1

Parameters α and θ should be positive and play the hole of regulating the degree of approximation. They weight the elements of the truncated density, say, α weights the truncation factor log and θ the untruncated density log. At the moment, they are determined manually (research work is in progress regarding their optimization together with the other model parameters). In the exercise discussed next, good results were obtained by setting3 α = 1/P and θ = 2/P. To illustrate the methodology, we ran a simulated exercise. A number of 100 i.i.d. values for the disaggregate data pairs { ( β i11 , β i21 ) ,: i = 1,...,100} were drawn from a bivariate normal distribution truncated4 on [0,1] × [0,1] , assuming for the true parameters that:

A reason for using these weights is that the alternative of setting α = θ = 1 in (6.10) may downweights the prior and the untruncated normal term relative to the truncation factor log in (6.11), which is just an approximation and not the exact Q−Function. We note this problem in preliminary runs, where the priors had little effect in improving estimates. In an attempt to correct this problem, we used the weights. In addition, until now, we are certain only of the concavities of priors and the untruncated normal term; ( nothing we can say in this regard about the log of the truncation factor R (ψ ) . 3

To be strict, the disaggregate data pairs were drawn as follows: β ib values were drawn from the untruncated version of his conditional normal distribution (5.17), parameterized as in (6.12), with values

4

13

(

(

ψ T = [ℜ11 , ℜ 21 ,σ 112 , σ 212 , ρ ] = [0.5,0.5,0.25,0.25,0] (

(

(

(

(6.12)

For the aggregate data X, 100 i.i.d. values were drawn from a uniform distribution defined on the interval [0,1]; then, 100 values for Ti were computed via the accounting identity (4.1). The EM-EI Software (Alpha Version), developed by the author using the MatLab Language (Mathworks, Inc.), was used to compute estimates of the parameter ( vector ψ and of the disaggregate data vectors β 11 and β 21 from the true aggregate data vectors T and X. The basic settings used to run the EM−EI methodology over this simulated dataset were the following: a) Priors: lognormal priors (with means = 0.5 and standard deviations = 0.35) ( ( for5 the variances σ b2 and σ w2 were adopted; as reparameterization (5.7)−(5.11) was used, this implied in normal priors (with means = −0.89 and standard deviations = 0.63) for φ 3 and φ 4 ; in addition, a normal prior (with mean 0 and standard deviation = 0.25) was adopted for φ 5 , the reparameter( ized correlation coefficient ρ ; (

b) EM initialization: the EM scheme is initialized at ψ 0 = [0,0,1,1,0] ; c) Parameter space: the optimization algorithm used in the M−Stage is restricted to search a solution in a subset of the parameter space, described by the following limits imposed to the elements of the parameter vector: ( ( ψ low = [−24,−24,10 −6 ,10 −6 , (10 −5 − 1)] and ψ up = [100,100,5,5, (1 − 10 −5 )] ; d) Termination criteria: for the EM sequence, if || φˆk − φˆk −1 || is less than 10−5 for or if iterations exceeds 100; for the M−Stages, if the norm of the direction vector is less than 10−4 or if iterations exceeds 500;, e) Disaggregate data estimation: a grid with 101 equispaced points were generated for each quantity of interest between their corresponding deterministic bounds; Table 3 displays results of parameters’ estimation: 57 were necessary for convergence, lasting 4.02 mins./secs. on an Intel Celeron 366 MHz processor with 56 MB of RAM. Because of reparameterization, the φˆ57 vector ("FI Parameters") is displayed ( ( first, followed by the ψˆ vector ("PSI Parameters") and the true ψ vector ("True Pa57

rameters"). The methodology provided good estimates for the means and the correlation coefficient; but, for each variance, a large negative deviation was obtained. Figure 3 displays the evolution of the LID along the 57 iterations: A clear convergence pattern is displayed, despite the absence of monotonicity6. By its turn, Figure 4 shows the evolution of parameters estimates: The five graphs also display a clear convergence towards a fixed value, though only three of them converged to a point quite close to their true values. falling outside [0,1] being rejected until 100 valid values ∈ [0,1] were obtained; the 100 values for β iw where computed via the relation (B1) from Box 1. ( ( 5 The EM−EI software uses the transformations φ 3 = log σ( b2 and φ 4 = log σ w2 instead of φ 3 = log σ b and ( φ 4 = log σ w , which are used by King (1997, p.136) and were presented here as (5.9) and (5.10). 6 The absence of monotonicity was to be expected, because, again, we are using an approximation instead of the exact Q−Function. For instance, non−monotonicity of the LID along EM iterations is well known to be lost when the E−Stage is performed using Monte Carlo simulations (McLachlan and Krishnan, 1997). What was unexpected were two peaks of the LID above its limiting value, what happened at the beginning of the EM sequence in Figure 3 .

14

Table 4 displays methodology performance for disaggregate data estimation: 52% of the points for β 11 and 45% for β 21 displayed less than 10 percentage points of absolute deviation from the true value; considering 15 percentage points, the coverage augments to 61% for both. The latter results are visualized in Figure 5. Scatter plots (a) and (b) of true versus estimated values show 61 points for β 11 , and 61 for β 21 falling inside the error region of ± 15 percentage points far from the 45o line (between the dotted lines). Graphs (c) and (d) display the same scatter plots but with vertical lines passing through each point; these lines represent the bounded intervals delimited by the deterministic bounds. These two graphs show a concentration of larger intervals around estimates with value 0.5, indicating a high influence of interval size in methodology performance7. Best predictions are for points with shorter intervals; the bulk of points laying outside the dotted lines (graphs (a) and (b)) have larger intervals which stay around the value 0.5.

Table 3 Parameters estimation results EMEI:

EM ALGORITHM FOR ECOLOGICAL INFERENCE

RESULTS OF ESTIMATING KING´S BASIC NORMAL MODEL * * * * * * *

Data File : datsmpl.mat * Nobs : 100 E-Stage Option : 4 M-Stage Option : 1 Number of iterations : 57 (Max. = 100; Tol. = 1e-005) CPU Time in Minutes.Seconds: 4.02 Reparameterization Used with Priors. FI PARAMETERS VALUES: FI1 = −0.0258 FI2 = −0.0319 FI3 = -1.1224 FI4 = -0.9482 FI5 = * PSI PARAMETERS VALUES: Rb = 0.4908 Rw = 0.4872 Vb = 0.1060 Vw = 0.1501 Rho = * TRUE PSI PARAMETERS VALUES: Rb = 0.5000 Rw = 0.5000 Vb = 0.2500 Vw = 0.2500 Rho = * Likelihood of Incomplete Data: Raw = 8.970e+011 Log = 27.5223

0.0205 0.0205 0.0000

Source: Output from EM-EI Software (Alpha Version)

7

A desired result would be that methodology performance were less dependent on bounded intervals and more dependent on model form, in cases where those intervals are large, say, when they are near or equal to the [0,1] interval. Graphs (c) and (d) of Figure 5 show that, in the present exercise, there is a visible tendency of quantities of interest to be predicted close to the 0.5 value when bounded intervals are near or equal to [0,1].

15

Figure 3

Source: Output from EM-EI Software (Alpha Version)

Figure 4 EM sequences of parameters guesses for the EI normal model

Source: Output from EM-EI Software (Alpha Version)

16

Tabel 4 Disaggregate (complete) data estimation results EM-EI:

EM ALGORITHM FOR ECOLOGICAL INFERENCE

FITTING OF ESTIMATED AGAINST SIMULATED COMPLETE DATA P.ABS. P.ABS. MAE(b) MSE(b) P.ABS. P.ABS.

Suggest Documents