Inferring Individual Level Relationships from Aggregate Data*

Kihong Eom**, Youngjae Jin***

< ABSTRACT >

This paper introduces a technique for inferring individual level relationships from aggregate data. Social scientists encounter the ecological fallacy problem where individual level data are not available, yet the individual level relationship is sought. In particular, social scientists often attempt to examine a subject for which no survey data are available or reliable. A series of cross-level inference techniques have been introduced since Goodman's seminal work (1959). We introduce a technique that significantly improves cross-level inference, Gary King's solution. For the purpose of verification, we examine regional voting in the 16th National Assembly elections of South Korea. The estimates of regional voting are compared with those from survey results. We find that the estimates from King's solution closely match those from survey results if the cell frequency in the survey results is large enough. King's solution produces more reliable estimates when the cell frequency in the survey results is small.

Key words: Cross-level inference, ecological fallacy, aggregate data, individual level relationship

* An earlier version was presented at the 53rd meeting of the International Statistical Institute, Seoul, Korea, August 22-29, 2001.

** Frostburg State University, Lecturer (email: [email protected]) *** Yonsei University, Associate Professor (email: [email protected])

I. Introduction

Who voted for Chung-hee Park in the 1963 South Korean presidential election? What was the level of regional voting in that election? How many abortions has a woman had over her lifetime? These are all interesting questions, yet they are hard to examine, partly because the matter is historical and survey data are not available, and partly because, even when data are available, survey respondents may give politically correct answers and the results may therefore be biased. The purpose of this paper is to introduce a technique for solving these problems, Gary King's ecological regression. To verify this technique, we compare estimates from survey results with those from King's method.

The case selected for the verification is "regional voting" in the 16th Korean National Assembly elections. Regional voting refers to the concentration of votes along regional party lines in a number of Korean regions (Kim 1994; Lee 1998; Lee and Brunn 1996).1) Stated another way, voters whose hometown is in Jeolla mostly vote for the candidate of the party whose leaders were born in the region. Since this appears to happen regardless of the quality of candidates and the ideology of the party, these voting patterns result in parties often being representatives of regions instead of districts or the nation (Shin, Jin, Gross and Eom 2005). However, the level of regional voting has been a difficult topic to examine, because voting is secret and a survey respondent may give a politically correct answer.

1) By definition, regionalism refers to the voters' affective identifications with, and support for, candidates with roots in their respective regions (Kim and Koh 1980, 81).

Since Goodman's seminal work (1959), several techniques for cross-level inference have been developed to solve the ecological fallacy problem (Palmquist 1993). King's solution (1997) is well known for producing efficient and robust estimates. In addition, it contains information on the uncertainty of the estimates at the level of analysis. After explaining King's solution, we analyze regional voting in the 16th National Assembly elections. The estimates from King's solution are compared with those from survey results. We find that the estimates from King's solution closely match those from survey results if the cell frequency in the survey results is large enough. King's solution produces more reliable estimates when the cell frequency in the survey results is small.

II. Ecological Fallacy and Ecological Regressions

Cross-level inference is "the process of using aggregate (i.e., "ecological") data to infer discrete individual level relationships of interest" (King 1997, xv). It provides a solution for the problem of ecological fallacy. In this section, we introduce the problem of ecological fallacy. We then describe a series of efforts to solve this problem.

1. Ecological Fallacy

It is well known that using aggregate level data to infer individual level relationships generates the ecological fallacy problem, which produces biased and inefficient estimates (Palmquist 1993). For example, suppose that our research question is to compare the level of literacy between the foreign born and the native (Robinson 1950). Further, assume that we have three groups (the sophisticated, the regular and the foreign born), and that both the sophisticated and the foreign born prefer to live in a city while the regular prefer to live in a rural area. If a researcher regresses the literacy rate on the percentage of the foreign born at the county level, he or she may find that the greater the percentage of foreign born, the higher the literacy rate. It would be a shocking result, because the foreign born are less likely to be literate than the native. However, if one analyzes the relationship at the individual level, he or she may find a different and more convincing result: the native tend to have a higher literacy rate than the foreign born. This discrepancy occurs because the sophisticated as well as the foreign born reside in the same type of area, i.e., the city. Without consideration of the aggregation unit, findings from aggregate data misrepresent the individual level relationship. This shows the inappropriateness of using aggregate data to examine the individual level relationship.
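The reversal described above can be reproduced with a few made-up counties. The following Python sketch is only an illustration: the county sizes and the group literacy rates (0.60, 0.99, 0.80) are assumptions chosen for the example, not figures from Robinson (1950) or from this paper.

```python
import numpy as np

# Hypothetical counties (all numbers are illustrative assumptions).
# Columns: foreign born, sophisticated natives, regular natives.
counts = np.array([
    [200, 800,   0],   # city A
    [150, 850,   0],   # city B
    [ 20,   0, 980],   # rural C
    [ 10,   0, 990],   # rural D
], dtype=float)

# Assumed individual level literacy rates for the three groups.
literacy = np.array([0.60, 0.99, 0.80])

literate = counts * literacy                  # literate people per cell
pop = counts.sum(axis=1)                      # county population
pct_foreign = counts[:, 0] / pop              # aggregate predictor
literacy_rate = literate.sum(axis=1) / pop    # aggregate outcome

# County level (ecological) regression of literacy rate on percent foreign born.
slope, intercept = np.polyfit(pct_foreign, literacy_rate, 1)

# Individual level literacy rates by birthplace.
foreign_rate = literate[:, 0].sum() / counts[:, 0].sum()
native_rate = literate[:, 1:].sum() / counts[:, 1:].sum()

print(f"aggregate slope: {slope:+.3f}")
print(f"native literacy: {native_rate:.3f}, foreign-born literacy: {foreign_rate:.3f}")
```

In this constructed case the county level slope is positive even though the native are more literate than the foreign born at the individual level, because the foreign born share their counties with the most literate native group.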


2. Ecological Regressions

To solve the aggregation bias, several methods have been introduced. A common assumption the models make can be described in the table below.

Table 1. Robinson's problem

| | Literate (L) | Illiterate (IL) | Marginal |
|---|---|---|---|
| The Native (N) | ? | ? | 20000 |
| The Foreign born (F) | ? | ? | 1000 |
| Marginal | 15000 | 6000 | 21000 |

Suppose that in a total population of 21,000 we observe only the marginal population values for the Native (N) and the Foreign born (F): 20,000 and 1,000. We also know the marginal values for the Literate (L) and the Illiterate (IL). Our research problem is to find the cell frequencies noted as question marks: how many of the native are literate, and how many of the foreign born are literate? We then calculate literacy ratios for the native and the foreign born and examine whether birthplace is related to literacy.

One way to solve this problem is as follows. Suppose that we know the value for the upper left corner by pure luck: the number of the literate among the native is 15,000. Once we have this information, we can calculate the rest of the cell values accordingly. The results are shown in table 2.

Table 2. A solution for Robinson's problem

| | Literate (L) | Illiterate (IL) | Marginal |
|---|---|---|---|
| The Native (N) | 15000 | (5000) | 20000 |
| The Foreign born (F) | (0) | (1000) | 1000 |
| Marginal | 15000 | 6000 | 21000 |

Since the literate native population is 15,000, the number of the illiterate among the native is 5,000 in a native population of 20,000. In addition, since the total number of the literate is 15,000 and the number of the literate native is 15,000, the number of the literate foreign born is zero. For the purpose of comparison, table 2 can be rewritten as table 3.
Table 3. Literacy Ratios

| | Literate (L) | Illiterate (IL) | Marginal |
|---|---|---|---|
| The Native (N) | 0.75 | 0.25 | 0.95 |
| The Foreign born (F) | 0.00 | 1.00 | 0.05 |
| Marginal | 0.71 | 0.29 | 1.00 |
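The arithmetic behind tables 2 and 3 can be written out as a short Python sketch. The function name `complete_table` is hypothetical and introduced only for illustration; the inputs are the marginals and the one known cell from the example.

```python
def complete_table(native_total, foreign_total, literate_total, native_literate):
    """Fill in a 2x2 table from its marginals and one known cell."""
    native_illiterate = native_total - native_literate
    foreign_literate = literate_total - native_literate
    foreign_illiterate = foreign_total - foreign_literate
    cells = (native_literate, native_illiterate, foreign_literate, foreign_illiterate)
    if min(cells) < 0:
        raise ValueError("known cell is inconsistent with the marginals")
    return {
        "native": (native_literate, native_illiterate),
        "foreign": (foreign_literate, foreign_illiterate),
    }

# Marginals and the known upper left cell from the example.
table = complete_table(20000, 1000, 15000, 15000)

# Row-wise literacy ratios, as in table 3.
ratios = {group: literate / (literate + illiterate)
          for group, (literate, illiterate) in table.items()}

print(table)   # native: (15000, 5000), foreign: (0, 1000)
print(ratios)  # native: 0.75, foreign: 0.0
```

A single known cell is enough here because a 2 by 2 table with fixed marginals has only one free cell.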

Among the native, the literacy ratio is 0.75, while it is 0.00 for the foreign born. This leads to the conclusion that the native are more likely to be literate than the foreign born. This example shows that we are able to disaggregate aggregate data if we "correctly" impose some constraints on the parameters of interest. In this case, we assumed that some information on the number of the literate among the native is available. We can generalize table 3 in the following table.
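Table 4. General Form of Ecological Regression

| | Literate (L) | Illiterate (IL) | Marginal |
|---|---|---|---|
| The Native (N) | $\beta_i^N$ | $1-\beta_i^N$ | $X_i$ |
| The Foreign born (F) | $\beta_i^F$ | $1-\beta_i^F$ | $1-X_i$ |
| Marginal | $T_i$ | $1-T_i$ | |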

$X$ is the proportion of the native, $\beta^N$ is the literacy ratio for the native, and $\beta^F$ is the literacy ratio for the foreign born. $T$ is the proportion of the literate and "i" is an aggregation unit. With some constraints, the parameters of interest ($\beta^N$ and $\beta^F$) can be calculated using the aggregate values of $X$ and $T$.

With this general form, two approaches to ecological regression have been developed: the method of bounds and the statistical approach. The method of bounds uses deterministic information in the data (Achen and Shively 1995). Suppose once again that we attempt to estimate the literacy ratios among the native and the foreign born with aggregate information. The relationship in table 4 can be written as follows:

$$T_i = \beta_i^N X_i + \beta_i^F (1 - X_i) \tag{1}$$

Then,

$$\beta_i^N = \frac{T_i}{X_i} - \frac{1 - X_i}{X_i}\,\beta_i^F$$

Since each $\beta$ is a proportion, it must lie between 0 and 1: $0 \le \beta \le 1$. In addition, if $\beta_i^F = 0$, $\frac{T_i}{X_i}$ becomes the maximum value for $\beta_i^N$, while if $\beta_i^F = 1$, $\frac{T_i - 1 + X_i}{X_i}$ becomes the minimum value for $\beta_i^N$. Hence, the lower and upper limits for $\beta_i^N$ are:

$$\left[\max\left(\frac{T_i - 1 + X_i}{X_i},\, 0\right),\ \min\left(\frac{T_i}{X_i},\, 1\right)\right].$$

With the same procedure, we can obtain the lower and upper limits for $\beta_i^F$ as follows:

$$\left[\max\left(\frac{T_i - X_i}{1 - X_i},\, 0\right),\ \min\left(\frac{T_i}{1 - X_i},\, 1\right)\right].$$

With our example, the range of plausible values for $\beta^N$ is [Max(0.7, 0), Min(0.75, 1)] = [0.7, 0.75]. Therefore, we can significantly narrow down the plausible values of the parameter $\beta^N$. Often, however, the method of bounds produces bounds that are too broad to be informative, especially when the distribution of the proportions ($X$ and/or $T$) is considerably skewed. For example, the range of $\beta^F$ values in our example is [Max(-5, 0), Min(15, 1)] = [0, 1]. In this case, the method of bounds does not reduce the range of plausible values for $\beta^F$.
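The bounds for the running example can be checked with a minimal Python sketch. The function name `bounds` is hypothetical; the formulas are the lower and upper limits stated above, and the inputs are the marginals of the Robinson example (X = 20000/21000, T = 15000/21000).

```python
def bounds(T, X):
    """Deterministic bounds on (beta_N, beta_F) for one unit, with 0 < X < 1."""
    beta_n = (max((T - 1 + X) / X, 0.0), min(T / X, 1.0))
    beta_f = (max((T - X) / (1 - X), 0.0), min(T / (1 - X), 1.0))
    return beta_n, beta_f

X = 20000 / 21000   # proportion of the native
T = 15000 / 21000   # proportion of the literate

beta_n, beta_f = bounds(T, X)
print([round(b, 2) for b in beta_n])  # [0.7, 0.75]
print([round(b, 2) for b in beta_f])  # [0.0, 1.0]
```

The native bounds are narrow because the native make up almost the whole population, while the small foreign-born group is left with the uninformative interval from 0 to 1.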

The second approach to ecological regression uses the logic of statistical association. If there is an association between variables, it will occur across units with some fluctuation. The first model was developed by Leo A. Goodman (1959). He argues that if we can reasonably assume three things, we can infer individual behavior from aggregate data. His assumptions are a constant effect of parameters, a linear function, and a normal distribution of residuals. Following his suggestion, equation 1) can be rewritten as follows:

$$T_i = \beta^N X_i + \beta^F (1 - X_i) + e_i \tag{2}$$

where $e$ is the residual. If the three conditions are met, he argues, the parameters ($\beta$s) and their standard errors are correctly inferred.

Goodman's model, however, has several problems (Voss 2000). First, his constant effect assumption is not substantively reasonable. For example, assuming a constant literacy ratio for the native across units is too restrictive. If the parameters ($\beta$s) vary systematically across units, the estimates may over- or underestimate the true $\beta$s due to the aggregation bias. Second, since Goodman's model produces only a single estimate, it is hard to know individual behavior within a unit.

A series of models have been developed to relax these assumptions (Achen and Shively 1995). For example, the homogeneous model utilizes, rather than estimates, information from observed data. That is, with our example, the homogeneous model observes the literacy ratio in units that are entirely native, and then uses this ratio as a benchmark for inferring individual relationships. The same procedure is applied to the literacy ratio in units that are entirely foreign born. It is only useful when units are highly segregated, however. It becomes unreliable when the native and the foreign born are mixed in the same unit.

The informed assumption model uses informed knowledge instead of the observed literacy ratio. For example, we may have prior information that all of the foreign born are illiterate. In this case, $\beta^F$ becomes zero, and thus we can use this information to calculate $\beta^N$ and the rest of the $\beta$s, as shown in table 4. However, in most cases, prior information is unattainable. Moreover, researchers may not receive a warning when this informed knowledge is incorrect, which results in biased estimates of the parameters (Voss 2000).

The final example of ecological regression has a different premise. The neighborhood model assumes that the parameters of interest are the same within a unit ($\beta_i^N = \beta_i^F$), yet vary across units ($\beta_i^N \neq \beta_j^N$, where $i \neq j$). Therefore, equation 2) becomes

$$T_i = \beta_i X_i + \beta_i (1 - X_i) + e_i = \beta_i + e_i,$$

where $\beta_i$ is a function of $X_i$. For example, this model assumes that the literacy ratios of the native and the foreign born are the same within a unit, while the literacy ratio varies across units. As one may notice, the assumption the neighborhood model makes is too strong. Even if it is a plausible assumption, we do not have to estimate a model, because we already have an answer to our research question of whether birthplace is related to the level of literacy.

With the exception of the neighborhood model, a survey of ecological regressions shows some common problems. First, all of the models assume a constant effect of parameters. This seems too restrictive, because the parameters of interest are hardly constant across units. Second, the models produce only a single estimate. Since we attempt to infer the individual level relationship, a single estimate is unlikely to be satisfactory. Finally, if an equation has more than two parameters to be estimated, it is hard to imagine how these methods can be extended.

Gary King (1997) provides an interesting method to solve these problems. First, he does not assume a constant effect; rather, he assumes that a parameter varies with a common underlying dimension. Second, because of the varying parameter, we may have an estimate per unit. In addition, since his method uses additional information from the method of bounds, the estimates become more efficient. His method can be written as follows (King 1997, 93-94):

$$T_i = \beta_i^N X_i + \beta_i^F (1 - X_i) + e_i, \quad \text{where } P(\beta_i^N, \beta_i^F) = TN(\beta_i^N, \beta_i^F \mid \mathrm{B}, \Sigma) \tag{3}$$

The probability density of the parameters ($\beta_i^N$, $\beta_i^F$) follows a truncated normal distribution with limits $\beta_i^N \in [0, 1]$ and $\beta_i^F \in [0, 1]$. With the help of the method of bounds, these limits for ($\beta_i^N$, $\beta_i^F$) can be narrowed down as follows:

$$\beta_i^N \in \left[\max\left(\frac{T_i - 1 + X_i}{X_i},\, 0\right),\ \min\left(\frac{T_i}{X_i},\, 1\right)\right]$$

$$\beta_i^F \in \left[\max\left(\frac{T_i - X_i}{1 - X_i},\, 0\right),\ \min\left(\frac{T_i}{1 - X_i},\, 1\right)\right]$$

The mean and variance matrix of ($\beta_i^N$, $\beta_i^F$) are

$$\mathrm{B} = \begin{pmatrix} \mathrm{B}^N \\ \mathrm{B}^F \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_N^2 & \sigma_{NF} \\ \sigma_{NF} & \sigma_F^2 \end{pmatrix}.$$
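To see why the constant effect assumption in equation 2) matters, and why equation 3) lets the parameters vary by unit, Goodman's regression can be tried on synthetic data. The following Python sketch is illustrative only: the data are simulated, the function name `goodman` is hypothetical, and the parameter values (0.75, 0.30, and the linear dependence of the native parameter on X) are assumptions rather than estimates from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 500
X = rng.uniform(0.1, 0.9, n_units)            # share of natives per unit

def goodman(T, X):
    """Least-squares fit of T_i = bN * X_i + bF * (1 - X_i), no intercept."""
    design = np.column_stack([X, 1 - X])
    coef, *_ = np.linalg.lstsq(design, T, rcond=None)
    return coef                                # (bN_hat, bF_hat)

# Case 1: constant parameters, as Goodman assumes.
T_const = 0.75 * X + 0.30 * (1 - X) + rng.normal(0, 0.01, n_units)
coef_const = goodman(T_const, X)

# Case 2: the native parameter rises with the share of natives, so the
# parameter is correlated with X (aggregation bias).
beta_n = 0.55 + 0.40 * X
T_vary = beta_n * X + 0.30 * (1 - X)
coef_vary = goodman(T_vary, X)

print("constant parameters:", coef_const)
print("varying parameters: ", coef_vary, "true mean bN:", beta_n.mean())
```

In the first case the regression recovers values close to the assumed 0.75 and 0.30. In the second case the single estimate for the native overshoots the average of the unit level parameters and the estimate for the foreign born falls below 0.30, which is the over- and underestimation attributed to aggregation bias in the text.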

If King's three assumptions are met, estimation produces an efficient and robust estimate.2) The estimation procedure of King's solution can be summarized as follows: (1) the first step calculates the bounds of the parameters; (2) the second step estimates the parameters from truncated bivariate normal distributions within the bounds.

2) The three assumptions are a single mode in the distribution of the parameters, the absence of spatial correlation, and no correlation between the marginals and the parameters. King, Rosen, and Tanner (1999, 67-68) show, however, that the violation of the third assumption does not produce biased estimates if the bounds of parameters are low enough.

If one extends a model to more than two parameters, the estimates from the first estimation are used as marginal values. For example, if one is interested in the proportion of regional voting, he or she first estimates the turnout rate among those who were born in a certain region in a given district. The estimated turnout rate is then used to obtain the marginal values for the proportion of regional voting. It can be diagrammed as below:
King's Solution: the first step

| | Vote | Not vote | Marginal |
|---|---|---|---|
| Jeolla | $\beta_i^J$ | $1-\beta_i^J$ | $X_i$ |
| Other Regions | $\beta_i^{J'}$ | $1-\beta_i^{J'}$ | $1-X_i$ |
| Marginal | $T_i$ | $1-T_i$ | |

where $X$ is the proportion of the voting age population who were born in Jeolla, $T$ is the proportion of voters who turn out to vote, $\beta^J$ is the turnout rate among those who were born in Jeolla, $\beta^{J'}$ is the turnout rate among those who were born in a region other than Jeolla, and "i" is a district indicator.

King's Solution: the second step

| | NCNP | Other parties | Marginal |
|---|---|---|---|
| Jeolla | $\lambda_i^J$ | $1-\lambda_i^J$ | $x_i$ |
| Other Regions | $\lambda_i^{J'}$ | $1-\lambda_i^{J'}$ | $1-x_i$ |
| Marginal | $P_i$ | $1-P_i$ | |

where "x" is the estimated proportion of voters whose hometown is in Jeolla and who turn out to vote, $P$ is the vote share for a candidate whose party label is the National Congress for New Politics (NCNP), and $\lambda$ is the proportion of regional voting.

The first step is to estimate the turnout rates ($\beta_i^J$ and $\beta_i^{J'}$) for those who were born in Jeolla ($X_i$) and for those who were born in areas other than Jeolla ($1-X_i$). Once we obtain estimates for the $\beta$s, these estimates are used to calculate the marginal values for the regional voting estimates ($x_i$ and $1-x_i$). The second step starts with the calculation of the bounds of the parameters ($\lambda$s) and then estimates the parameters across units. Note, however, that since $x_i$ has a component to be estimated, it is not a fixed variable. Therefore, extending tables produces more uncertain estimates due to the added uncertainty originating from the first estimation.3) In the next section, we apply King's solution to find the level of regional voting in the Korean National Assembly elections of 2000.
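How the bounds and the truncated bivariate normal combine for a single unit can be sketched in Python. This is a simplified illustration under strong assumptions, not the EzI implementation: the mean vector and variance matrix are taken as given (King's procedure estimates them by maximum likelihood, which is omitted here), the function name `unit_estimate` is hypothetical, and the numerical inputs are made up.

```python
import numpy as np

def unit_estimate(T, X, mean, cov, grid_size=2001):
    """Mean of (beta_N, beta_F) for one unit along T = bN*X + bF*(1 - X)."""
    lo = max((T - 1 + X) / X, 0.0)             # method-of-bounds lower limit
    hi = min(T / X, 1.0)                       # method-of-bounds upper limit
    b_n = np.linspace(lo, hi, grid_size)
    b_f = (T - b_n * X) / (1 - X)              # implied by the accounting identity

    # Unnormalized bivariate normal density evaluated along the line.
    dev = np.column_stack([b_n, b_f]) - np.asarray(mean)
    quad = np.einsum("ij,jk,ik->i", dev, np.linalg.inv(cov), dev)
    weight = np.exp(-0.5 * quad)
    weight /= weight.sum()

    return float(weight @ b_n), float(weight @ b_f)

# Assumed values for illustration only; in practice they are estimated.
mean = [0.7, 0.3]
cov = [[0.02, 0.0], [0.0, 0.02]]

est_n, est_f = unit_estimate(T=0.55, X=0.6, mean=mean, cov=cov)
print(est_n, est_f)
```

Because every grid point satisfies the accounting identity, the pair of unit level estimates reproduces the observed T exactly, and the estimate for the first group can never leave its deterministic bounds. This is the sense in which the bounds add information to the statistical model.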

3) Note that the parameters ($\lambda$ and $1-\lambda$) are weighted by the number of voting age population in a given district.

III. Application: Disaggregating Regional Voting

The 2000 election outcomes in Korea suggest that there are three regions which tend to exhibit partisan regionalism: Jeolla, Gyeongsang, and Chungcheong. The Jeolla region covers the Jeollabuk-do and Jeollanam-do areas, the Gyeongsang region refers to the Gyeongsangbuk-do and Gyeongsangnam-do areas, and the Chungcheong region means the Chungcheongbuk-do and Chungcheongnam-do areas. Regional dominance by a particular party was specified in terms of the birthplace of particular party leaders. A leader of the Grand National Party, Kim Yong Sam, was born in the Gyeongsang region; a leader of the National Congress for New Politics, Kim Dae Jung, in the Jeolla region; and a leader of the United Liberal Democrats, Kim Chong Phil, in the Chungcheong region. This link between the birthplace of a party leader and the dominance of a particular party is well documented in contemporary Korean politics (Kim 1994; Lee 1998; Lee and Brunn 1996).

In this section, using Gary King's ecological regression, we attempt to disaggregate aggregate votes along regional party lines in a district. King's method infers regional voting at the candidate level. The percentage of regional voting at the candidate level will be averaged out across regional blocs and compared to estimates from survey results. The following equations are to be estimated:

$$P_i^J = \lambda_i^J x_i^J + \lambda_i^{J'} (1 - x_i^J) \tag{4}$$

$$P_i^G = \lambda_i^G x_i^G + \lambda_i^{G'} (1 - x_i^G) \tag{5}$$

$$P_i^C = \lambda_i^C x_i^C + \lambda_i^{C'} (1 - x_i^C) \tag{6}$$

where $P$ is the vote share of a candidate, $\lambda$ is the proportion of regional voting, and $\lambda'$ is the proportion of non-regional voting. J indicates Jeolla, G Gyeongsang, and C Chungcheong. "$x_i$" is $\beta_i X_i / T_i$, where $\beta$ is the turnout rate for those who were born in a certain region, and $X$ is the proportion of voters who were born in a certain region. "i" indicates a district. The level of analysis is the candidate level. Estimation is done by the program called "EzI."4)

In the 16th National Assembly elections of Korea (April 13, 2000), 194 incumbent and 449 non-incumbent candidates ran for office (National Election Commission 2000). We focus our analysis on the vote share for candidates of the three major parties.5) The percentage of those who registered their birthplace in a given district was collected with the help of one of the major parties.6) The results are shown in table 5.
Table 5. Regional Voting Estimates from Ecological Inference

Regional Voting Level (Average λs)

| Regional Blocs | GNP | NCNP | ULD |
|---|---|---|---|
| Seoul | 61.00% | 75.27% | 2.57% |
| Busan | 67.29% | 64.66% | 1.62% |
| Daegu | 62.02% | 70.49% | 2.26% |
| Incheon | 61.05% | 73.75% | 2.51% |
| Gwangju | 56.14% | 79.46% | 0.95% |
| Ulsan | 57.61% | 66.25% | 1.23% |
| Gyeonggi-do | 60.43% | 73.92% | 2.48% |
| Gangwon-do | 60.14% | 71.80% | 2.61% |
| Jeollabuk-do | 57.52% | 61.04% | 1.76% |
| Jeollanam-do | 43.85% | 63.86% | 0.00% |
| Gyeongsangbuk-do | 54.88% | 71.17% | 1.83% |
| Gyeongsangnam-do | 55.35% | 68.99% | 2.18% |
| Average | 52.21% | 61.74% | 1.04% |

Source: compiled by the authors. Note: Average λ is calculated by averaging the district level regional voting estimates (λi) within regional blocs. GNP stands for the Grand National Party, NCNP for the National Congress for New Politics and ULD for the United Liberal Democrats.

4) "EzI" was developed by Kenneth Benoit and Gary King (released in 2001). It is available from http://gking.harvard.edu/stats.shtml, visited May 1, 2002.

5) We focus on only these three parties because they comprised over 96% of the single member district seats in the 16th National Assembly Election.

6) At a contributor's request, the source of the data has not been released. Data on districts in Chungcheong-do are not available, so the number of districts in this study is 106.

Table 5 shows that, on average, the percentage of regional voting (61.74%) is the highest among those who were born in Jeolla, which may benefit candidates of the National Congress for New Politics. The percentage of regional voting for those who were born in Gyeongsang ranked second. Not surprisingly, the level of regional voting is the lowest for those who were born in Chungcheong. This resulted in less concentration of votes on candidates running under the United Liberal Democrats (ULD). In the 15th National Assembly elections of 1996, the ULD won 25 of the 28 seats. But in the elections of 2000, the ULD was only able to win 11 of the 24 seats in Chungcheong.

The same pattern holds when one examines the percentage of regional voting within a regional bloc. For example, 79.46 percent of those who were born in Jeolla cast a regional vote if they reside in districts within Gwangju. More than 70 percent of voters born in Jeolla also voted for candidates of the NCNP in districts within Seoul, Daegu, Incheon, Gyeonggi-do, Gangwon-do and Gyeongsangbuk-do.

Those who were born in Gyeongsang tend to cast slightly fewer regional votes, yet still at quite a significant level. On average, more than half of the voters who were born in Gyeongsang cast a regional vote in the 16th National Assembly elections. This is especially the case in districts within Seoul, Busan, Daegu, Incheon, Gyeonggi-do, and Gangwon-do, where more than 60 percent of voters born in Gyeongsang voted for candidates of the GNP. It is also the case, though to a lesser extent, for those who reside in Gwangju, Ulsan, Jeollabuk-do, Gyeongsangbuk-do, and Gyeongsangnam-do.

Not surprisingly, those who were born in Chungcheong cast the fewest regional votes. Only a handful of voters born in Chungcheong cast a regional vote on average. However, it should be noted that data for districts within the Chungcheong area were not available, and thus the level may be underestimated.

In sum, regional voting estimates from King's method provide supportive evidence for the argument that regional voting is a nationwide problem (Kim 1994; Lee 1998; Lee and Brunn 1996). Not only is the level of regional voting significant in districts within the known regional voting blocs (Jeolla and Gyeongsang), but it also appears to be substantial in districts outside these regional blocs. However, there is significant fluctuation in the level of regional voting across regions. For example, the percentage of regional voting for those who were born in Jeolla varies from 61.04% in Jeollabuk-do to 79.46% in Gwangju, while for voters born in Gyeongsang it varies from 43.85% in Jeollanam-do to 67.29% in Busan. We can conclude that the level of regional voting is not constant, but varies across regional blocs.

The results from ecological inference can be verified by survey results. The procedure is the same as above except that the figures are obtained from individual level data. The first step is to identify voters who were born in a certain region, and then calculate how many of these voters turned out to vote for the pertinent party. The Korean Social Science Data Center conducted a survey of the 16th National Assembly elections of April 13, 2000 (Korean Social Science Data Center 2000). A multistage quota sampling technique was used to collect a random sample by regional bloc. In total, 1,100 interviews were completed with a rejection rate of 5 percent. Fortunately, the survey includes questions on the hometown and the vote choice of a respondent. These two questions were used to construct a regional voting measure; for example, if a respondent was born in the Jeolla area and voted for the NCNP, the response is coded as a regional vote for the NCNP. In Seoul, fifty-five respondents were born in Jeolla. Thirty-four of the fifty-five voted for the NCNP. Therefore, the percentage of regional voting for the NCNP is 61.82 percent for Seoul. Table 6 shows the percentage of regional voting in regional blocs.
Table 6. Regional Voting Estimates from Survey Results

Regional Voting (Percentage/Frequency)

| Regional Blocs | GNP | NCNP | ULD |
|---|---|---|---|
| Seoul | 46.88% (32) | 61.82% (55) | 3.03% (33) |
| Incheon/Gyeonggi-do | 65.00% (20) | 57.14% (35) | 7.50% (40) |
| Gangwon-do | 0.00% (1) | 100.00% (1) | 0.00% (3) |
| Daejeon/Chungcheongnam-do | 0.00% (4) | 20.00% (5) | 20.34% (59) |
| Chungcheongbuk-do | 33.33% (3) | 0.00% (1) | 27.59% (29) |
| Gwangju/Jeollanam-do | 25.00% (4) | 50.68% (73) | 0.00% (4) |
| Jeollabuk-do | 0.00% (1) | 60.00% (40) | 0.00% (5) |
| Busan/Ulsan/Gyeongsangnam-do | 64.24% (151) | 31.25% (16) | 0.00% (8) |
| Daegu/Gyeongsangbuk-do | 53.15% (111) | 33.33% (3) | 25.00% (4) |
| Average | 30.09% | 46.02% | 9.27% |

Source: The Korean Social Science Data Center (2000). Figures in parentheses are the number of respondents.

Table 6 shows that regional voting is the most evident for those born in Jeolla, followed by those born in Gyeongsang and in Chungcheong. The level of regional voting is slightly lower than that from King's solution. On average, 46.02 percent voted for the NCNP if they were born in Jeolla, while 30.09 percent voted for the GNP if they were born in Gyeongsang.

Significant fluctuation appears across regional blocs. In particular, if the cell frequency is too small, the variation in regional voting is beyond the acceptable range. For example, in Gangwon-do, where the cell frequency is one, the percentage of regional voting is 100 percent among those born in Jeolla, while it is zero percent in Chungcheongbuk-do, where the cell frequency is also one.

If one focuses on the level of regional voting where the number of respondents is sufficient, we can find similarity in the level of regional voting between estimates from King's method and estimates from survey results. For example, according to survey results, the percentage of regional voting in Seoul is 61.82 percent for the NCNP, while the comparable figure is 75.27 percent by the ecological regression. It is 60 percent in Jeollabuk-do by survey results, while it is 61.04 percent by the ecological regression. We can conclude that estimates from King's solution closely match those from survey results where the cell frequency is large enough.
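The sensitivity of survey percentages to cell frequency can be made concrete with a confidence interval. The Python sketch below uses the Wilson score interval, which is our choice for illustration and not a method used in this paper; the two cells are taken from Table 6 (34 of 55 Jeolla-born respondents in Seoul, and 1 of 1 in Gangwon-do).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(center - half, 0.0), min(center + half, 1.0)

# Jeolla-born respondents voting for the NCNP.
seoul = wilson_interval(34, 55)     # Seoul: 61.82 percent of 55
gangwon = wilson_interval(1, 1)     # Gangwon-do: 100 percent of 1

print("Seoul:     ", seoul)
print("Gangwon-do:", gangwon)
```

The interval for the single Gangwon-do respondent spans most of the range from 0 to 1, while the Seoul interval is much narrower, which is why only the larger cells are suitable for comparison with the ecological estimates.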

IV. Conclusion

Applying aggregate level findings to individual level relationships generates biased estimates, known as the ecological fallacy problem. Social scientists often encounter difficulty in conducting research at the individual level with aggregate data. In particular, if a research question is related to a past event for which survey data are not available, it is almost impossible to pursue the research. Further, if there is a politically correct answer to survey questions, it is hard to obtain unbiased estimates.

In this paper, we introduced a way to infer individual level relationships from aggregate data. We began with the aggregation bias which leads to the ecological fallacy problem. A series of efforts have been made to solve the aggregation bias. The method by Gary King, which combines the method of bounds and statistical association, is emphasized. King's solution is well known as a method that produces robust and efficient estimates, even when there is a severe aggregation bias. The solution was applied to infer regional voting at the candidate level. The percentage of regional voting was averaged out across regional blocs. The average percentages were then compared to estimates from survey results. We found that estimates from King's method closely match estimates from survey results if the cell frequency in the survey results is large enough. We also found that the former are more reliable than the latter if the cell frequency in the survey results is small.

Ecological regressions offer a new avenue to examine previously impossible questions. For example, we can examine who voted for Chung-hee Park in the 1963 Korean presidential election. We can further ask why they voted for him; for example, was the generation effect related to the outcome of the 1963 Korean presidential election? Furthermore, we can use ecological regressions to examine whether or not a voter casts a vote for one party's candidate in a congressional election while casting a vote for a different party's candidate in a presidential election (Burden and Kimball 1998). We should note, however, that ecological regressions also have some limitations. If tables are extended beyond 2 by 2, the uncertainty of the estimates grows. Scholars of ecological regression attempt to reduce this uncertainty (King, Rosen, and Tanner 1999; Rosen, Jiang, King, and Tanner forthcoming).

< REFERENCE >

Achen, Christopher H. and W. Phillips Shively (1995). Cross-Level Inference. Chicago: University of Chicago Press.

Benoit, Kenneth and Gary King (1996). "A Preview of EI and EzI: Program for Ecological Inference." Social Science Computer Review 14: 433-438.

Burden, Barry C. and David C. Kimball (1998). "A New Approach to the Study of Ticket Splitting." American Political Science Review 92: 533-544.

Goodman, Leo (1959). "Some Alternatives to Ecological Correlation." American Journal of Sociology 64: 610-625.

Kim, Jae-On and B.C. Koh (1980). "The Dynamics of Electoral Politics: Social Development, Political Participation, and Manipulation of Electoral Laws." In Political Participation in Korea: Democracy, Mobilization, and Stability, edited by Chong Lim Kim. Santa Barbara: CLIO Books. 59-84.

King, Gary, Ori Rosen, and Martin A. Tanner (1999). "Binomial-Beta Hierarchical Models for Ecological Inference." Sociological Methods & Research 28: 61-90.

King, Gary (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton, NJ: Princeton University Press.

Korean Social Science Data Center (2000). A Survey on Voters Attitudes toward the 16th General Election. Seoul: Korean Social Science Data Center.

Lee, Dong Ok and Stanley D. Brunn (1996). "Politics and regions in Korea: an analysis of the recent presidential election." Political Geography 15: 99-119.

Lee, Nam Young (1998). "Regionalism and Voting Behavior in South Korea." Korea Observer 29: 611-633.

National Election Commission. http://www.nec.go.kr (2000. 6. 1).

Palmquist, Bradley Lowell (1993). Ecological Inference, Aggregate Data Analysis of U. S. Elections, and the Socialist Party of America. Ph. D. dissertation at the University of California, Berkeley.

Robinson, W. S. (1950). "Ecological Correlations and the Behavior of Individuals." American Sociological Review 15: 351-357.

Rosen, Ori, Wenxin Jiang, Gary King, and Martin A. Tanner (Forthcoming). "Bayesian and Frequentist Inference for Ecological Inference: the R X C Case." Statistica Neerlandica.

Shin, Myungsoon, Youngjae Jin, Donald A. Gross, and Kihong Eom (2005). "Money Matters in Party-Centered Politics: Campaign Spending in Korean Congressional Elections." Electoral Studies 24: 85-101.

Voss, D. Stephen (2000). Familiarity Doesn't Breed Contempt: The Political Geography of Racial Polarization. Ph. D. dissertation at Harvard University.
