Inferring Individual Level Relationships from Aggregate Data*
Kihong Eom**, Youngjae Jin***

< ABSTRACT >
This paper introduces a technique for inferring individual level relationships from aggregate data. Social scientists encounter the ecological fallacy problem where individual level data are not available, yet the individual level relationship is sought. In particular, social scientists often attempt to examine a subject for which no survey data are available or reliable. A series of cross-level inference techniques have been introduced since Goodman’s seminal work (1959). We introduce a technique that significantly improves cross-level inference, Gary King’s solution. For the purpose of verification, we examined regional voting in the 16th National Assembly elections of South Korea. The estimates of regional voting are compared with those from survey results. We found that the estimates from King’s solution closely match those from survey results if cell frequency in the survey results is large enough. King’s solution produces more reliable estimates when cell frequency in the survey results is small.

Key words: Cross-level inference, ecological fallacy, aggregate data, individual level relationship

* An earlier version was presented at the 53rd meeting of the International Statistical Institute, Seoul, Korea, August 22-29, 2001.
** Frostburg State University, Lecturer (email: [email protected])
*** Yonsei University, Associate Professor (email: [email protected])
I. Introduction

Who voted for Chung-hee Park in the 1963 presidential election in South Korea? What was the level of regional voting in that election? How many times has a woman had an abortion in her lifetime? These are all interesting questions, yet they are hard to examine, partly because they concern historical matters for which survey data are not available, and partly because, even if data were available, survey respondents tend to give a politically correct answer, so that results may be biased. The purpose of this paper is to introduce a technique for addressing these problems, Gary King’s ecological regression. To verify this technique, we compare estimates from survey results with those from King’s method. The case selected for the verification is "regional voting" in the 16th Korean National Assembly elections.
Regional voting refers to the concentration of votes along regional party lines in a number of Korean regions (Kim 1994; Lee 1998; Lee and Brunn 1996).1) Stated another way, voters whose hometown is in Jeolla mostly vote for the candidate of the party whose leaders were born in that region. Since this appears to happen regardless of the quality of candidates and the ideology of the party, voting patterns result in parties often being representatives of regions instead of districts or the nation (Shin, Jin, Gross and Eom 2005). However, the level of regional voting has been difficult to examine, because voting is secret and a survey respondent may give a politically correct answer.

1) By definition, regionalism refers to the voters’ affective identifications with, and support for, candidates with roots in their respective regions (Kim and Koh 1980, 81).

Since Goodman’s seminal work (1959), several techniques for cross-level inference have been developed to solve the ecological fallacy problem (Palmquist 1993).
King’s solution (1997) is well known for producing efficient and robust estimates. In addition, it contains information on the uncertainty of estimates at the level of analysis. After explaining King’s solution, we analyze regional voting in the 16th National Assembly elections. The estimates from King’s solution are compared with those from survey results. We found that estimates from King’s solution closely match those from survey results if cell frequency in the survey results is large enough. King’s solution produces more reliable estimates when cell frequency in the survey results is small.
II. Ecological Fallacy and Ecological Regressions

Cross-level inference is "the process of using aggregate (i.e., "ecological") data to infer discrete individual level relationships of interest" (King 1997, xv). It provides a solution for the problem of ecological fallacy. In this section, we introduce the problem of ecological fallacy. We then describe a series of efforts to solve this problem.

1. Ecological Fallacy

It is well known that using aggregate level data to figure out individual level relationships generates the ecological fallacy problem, which produces biased and inefficient estimates (Palmquist 1993).
For example, suppose that our research question is to compare the level of literacy between the foreign born and the native (Robinson 1950). Further, assume that there are three groups (the sophisticated, the regular and the foreign born), and that both the sophisticated and the foreign born prefer to live in a city while the regular prefer to live in a rural area. If a researcher regresses the literacy rate on the percentage of the foreign born at the county level, he or she may find that the greater the percentage of foreign born, the higher the literacy rate. It would be a surprising result, because the foreign born are less likely to be literate. However, if one analyzes the relationship at the individual level, he or she may find a different and more convincing result: the native tend to have higher literacy than the foreign born. This discrepancy occurs because the sophisticated and the foreign born reside in the same type of area, i.e., the city. Without consideration of the aggregation unit, findings from aggregate data misrepresent the individual level relationship. This shows the inappropriateness of using aggregate data to examine the individual level relationship.
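The reversal described above can be reproduced with a small numerical sketch. All county populations and group literacy rates below are hypothetical values chosen only to mimic the story (the sophisticated and the foreign born share the cities, the regular live in rural counties); they are not data from Robinson (1950) or from this paper.

```python
import numpy as np

# Hypothetical counties: columns are (sophisticated natives, regular natives, foreign born).
# The first three rows are urban counties, the last three are rural counties.
counties = np.array([
    [800,   0, 200],
    [700,   0, 300],
    [900,   0, 100],
    [  0, 980,  20],
    [  0, 990,  10],
    [  0, 970,  30],
], dtype=float)

# Assumed literacy rate of each group (illustrative values only).
group_literacy = np.array([0.95, 0.60, 0.50])

literate = counties * group_literacy          # expected literate persons per cell
population = counties.sum(axis=1)

# Aggregate (county level) view: share of foreign born versus literacy rate.
pct_foreign = counties[:, 2] / population
literacy_rate = literate.sum(axis=1) / population
aggregate_corr = np.corrcoef(pct_foreign, literacy_rate)[0, 1]

# Individual level view: literacy among all natives versus all foreign born.
native_literacy = literate[:, :2].sum() / counties[:, :2].sum()
foreign_literacy = literate[:, 2].sum() / counties[:, 2].sum()

print(f"county-level correlation: {aggregate_corr:.2f}")
print(f"native literacy: {native_literacy:.2f}, foreign-born literacy: {foreign_literacy:.2f}")
```

In this constructed case the county-level correlation is positive even though the foreign born are less literate than the native at the individual level, which is the pattern the example in the text describes.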
2. Ecological Regressions

To solve the aggregation bias, several methods have been introduced. The common setup that these models share can be described in the table below.

Table 1. Robinson’s problem

| | Literate (L) | Illiterate (IL) | Marginal |
|---|---|---|---|
| The Native (N) | ? | ? | 20000 |
| The Foreign born (F) | ? | ? | 1000 |
| Marginal | 15000 | 6000 | 21000 |
Suppose that in a total population of 21,000 we observe only the marginal population values for the native (N) and the foreign born (F): 20,000 and 1,000. We also know the marginal values for the literate (L) and the illiterate (IL): 15,000 and 6,000. Our research problem is to find the cell frequencies marked with question marks: how many of the native are literate, and how many of the foreign born are literate? We then calculate the literacy ratios for the native and the foreign born and examine whether birthplace is related to literacy.

One way to solve this problem is as follows. Suppose that we know the value for the upper left cell by pure luck: the number of the literate among the native is 15,000. Once we have this information, we can calculate the rest of the cell values accordingly. The results are shown in table 2.
Table 2. A solution for Robinson’s problem

| | Literate (L) | Illiterate (IL) | Marginal |
|---|---|---|---|
| The Native (N) | 15000 | (5000) | 20000 |
| The Foreign born (F) | (0) | (1000) | 1000 |
| Marginal | 15000 | 6000 | 21000 |
Since the number of natives who can read and write is 15,000, the number of the illiterate among the native is 5,000 in a native population of 20,000. In addition, the total number of the literate is 15,000 and the number of literate natives is 15,000, and thus the number of the literate foreign born is zero. The 1,000 foreign born are therefore all illiterate.
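The arithmetic that completes table 2 can be written as a short sketch. The function name and argument layout are illustrative choices, not part of the original paper.

```python
def fill_two_by_two(row_totals, col_totals, known_top_left):
    """Complete a 2x2 table from its marginals and the top-left cell."""
    row1, row2 = row_totals
    col1, col2 = col_totals
    if row1 + row2 != col1 + col2:
        raise ValueError("row and column marginals must sum to the same total")
    top_right = row1 - known_top_left        # rest of the first row
    bottom_left = col1 - known_top_left      # rest of the first column
    bottom_right = row2 - bottom_left        # rest of the second row
    cells = [[known_top_left, top_right], [bottom_left, bottom_right]]
    if any(cell < 0 for row in cells for cell in row):
        raise ValueError("known cell is inconsistent with the marginals")
    return cells


# Marginals from table 1 and the "lucky" knowledge that 15,000 natives are literate.
table2 = fill_two_by_two(row_totals=(20000, 1000),
                         col_totals=(15000, 6000),
                         known_top_left=15000)
print(table2)  # [[15000, 5000], [0, 1000]]
```

One known cell is enough because a 2x2 table with fixed marginals has only one free cell.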
For the purpose of comparison, table 2 can be rewritten as table 3.

Table 3. Literacy Ratios

| | Literate (L) | Illiterate (IL) | Marginal |
|---|---|---|---|
| The Native (N) | 0.75 | 0.25 | 0.95 |
| The Foreign born (F) | 0.00 | 1.00 | 0.05 |
| Marginal | 0.71 | 0.29 | 1.00 |
Among the native, the literacy ratio is 0.75, while it is 0.00 among the foreign born. This leads to the conclusion that the native are more likely to be literate than the foreign born. This example shows that we are able to disaggregate aggregate data if we "correctly" impose some constraints on the parameter of interest. In this case, we assumed that information on the number of the literate among the native is available.
We can generalize table 3 in the following table.

Table 4. General Form of Ecological Regression

| | Literate (L) | Illiterate (IL) | Marginal |
|---|---|---|---|
| The Native (N) | $\beta_i^N$ | $1-\beta_i^N$ | $X_i$ |
| The Foreign born (F) | $\beta_i^F$ | $1-\beta_i^F$ | $1-X_i$ |
| Marginal | $T_i$ | $1-T_i$ | |

$X$ is the proportion of the native, $\beta^N$ is the literacy ratio for the native, and $\beta^F$ is the literacy ratio for the foreign born. $T$ is the proportion of the literate and "i" is an aggregation unit. With some constraints, the parameters of interest ($\beta^N$ and $\beta^F$) can be calculated using aggregate values of $X$ and $T$.

With this general form, two approaches for ecological regression have been developed: method of bounds and the statistical approach.
Method of bounds uses deterministic information in the data (Achen and Shively 1995). Suppose once again that we attempt to estimate the literacy ratio among the native and the foreign born with aggregate information. The relationship in table 4 can be written as follows:

$$T_i = \beta_i^N X_i + \beta_i^F (1 - X_i)$$ ··········1)

Then,

$$\beta_i^N = \frac{T_i}{X_i} - \frac{1 - X_i}{X_i}\,\beta_i^F$$
Since each $\beta$ is a proportion, it must lie between 0 and 1: $0 \le \beta \le 1$. In addition, if $\beta_i^F = 0$, $T_i / X_i$ becomes the maximum value for $\beta_i^N$, while if $\beta_i^F = 1$, $(T_i - 1 + X_i)/X_i$ becomes the minimum value for $\beta_i^N$. Hence, the lower and upper limits for $\beta_i^N$ are:

$$\left[\max\left(\frac{T_i - 1 + X_i}{X_i},\, 0\right),\ \min\left(\frac{T_i}{X_i},\, 1\right)\right].$$

With the same procedure, we can obtain the lower and upper limits for $\beta_i^F$ as follows:

$$\left[\max\left(\frac{T_i - X_i}{1 - X_i},\, 0\right),\ \min\left(\frac{T_i}{1 - X_i},\, 1\right)\right].$$
With our example, the range of plausible values for $\beta^N$ is [Max (0.7, 0), Min (0.75, 1)] = [0.7, 0.75]. Therefore, we can significantly narrow down the plausible values of the parameter $\beta^N$.

However, in many cases, method of bounds produces bounds that are too broad to be informative, especially when the distribution of the proportions (X and/or T) is considerably skewed. For example, the range of $\beta^F$ values in our example is [Max (-5, 0), Min (15, 1)] = [0, 1]. In this case, method of bounds does not reduce the range of plausible values for $\beta^F$.
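The bounds for the worked example can be checked with a minimal sketch. The function name is an illustrative choice; the formulas are the lower and upper limits given in the text, applied to the marginals of table 1.

```python
def method_of_bounds(T, X):
    """Return ((lo_N, hi_N), (lo_F, hi_F)) for one aggregation unit.

    T is the proportion literate and X the proportion native in the unit.
    """
    if not 0 < X < 1:
        raise ValueError("X must lie strictly between 0 and 1")
    beta_n = (max((T - 1 + X) / X, 0.0), min(T / X, 1.0))
    beta_f = (max((T - X) / (1 - X), 0.0), min(T / (1 - X), 1.0))
    return beta_n, beta_f


# Marginals from table 1: 20,000 natives and 15,000 literate out of 21,000.
X = 20000 / 21000
T = 15000 / 21000
(bn_lo, bn_hi), (bf_lo, bf_hi) = method_of_bounds(T, X)
print(f"beta_N in [{bn_lo:.2f}, {bn_hi:.2f}]")  # beta_N in [0.70, 0.75]
print(f"beta_F in [{bf_lo:.2f}, {bf_hi:.2f}]")  # beta_F in [0.00, 1.00]
```

The bounds on the native literacy ratio are tight because natives make up about 95 percent of the population, while the small foreign-born group leaves its own ratio unconstrained.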
The second approach to ecological regression uses the logic of statistical association. If there is an association between variables, it will occur across units with some fluctuation. The first model was developed by Leo A. Goodman (1959). He argues that if we can reasonably assume three things, we can infer individual behavior from aggregate data. His assumptions are a constant effect of parameters, a linear function, and a normal distribution of residuals. Following his suggestion, equation 1) can be rewritten as follows:

$$T_i = \beta^N X_i + \beta^F (1 - X_i) + e_i$$ ··········2)

where $e_i$ is the residual. If the three conditions are met, he argues, the parameters ($\beta$s) and their standard errors are correctly inferred.

Goodman’s model, however, has several problems (Voss 2000).
First, his constant effect assumption is not substantively reasonable. For example, assuming a constant literacy ratio for the native across units is too restrictive. If the parameters ($\beta$s) vary systematically across units, the estimates may over- or underestimate the true $\beta$s due to aggregation bias. Second, since Goodman’s model produces only a single estimate, it is hard to know individual behavior within a unit. A series of models have been developed to relax these assumptions (Achen and Shively 1995).
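Goodman’s regression in equation 2) can be sketched with ordinary least squares. The unit-level values of X and T below are hypothetical and are generated with constant parameters, so that the constant effect assumption holds by construction; the function name is an illustrative choice.

```python
import numpy as np


def goodman_regression(X, T):
    """Least-squares fit of T_i = beta_N * X_i + beta_F * (1 - X_i) + e_i."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    # Two regressors and no intercept, exactly as in equation 2).
    design = np.column_stack([X, 1.0 - X])
    coef, *_ = np.linalg.lstsq(design, T, rcond=None)
    return coef[0], coef[1]


# Hypothetical units generated with constant beta_N = 0.75 and beta_F = 0.20.
X = np.array([0.95, 0.90, 0.80, 0.60, 0.40])
T = 0.75 * X + 0.20 * (1.0 - X)

beta_n, beta_f = goodman_regression(X, T)
print(f"beta_N = {beta_n:.2f}, beta_F = {beta_f:.2f}")  # beta_N = 0.75, beta_F = 0.20
```

The fit returns one pair of estimates for all units, which is the limitation noted above, and nothing in the least-squares step keeps the estimates inside the interval from 0 to 1.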
The homogeneous model, for example, utilizes, rather than estimates, information from observed data. That is, in our example, the homogeneous model observes the literacy ratio in units that are entirely native, and then uses this ratio as a benchmark for inferring individual relationships. The same procedure is applied to the literacy ratio for the foreign born. However, it is only useful when units are highly segregated. It becomes unreliable when the native and the foreign born are mixed in the same unit.

The informed assumption model uses prior knowledge instead of an observed literacy ratio. For example, we may have prior information that all of the foreign born are illiterate. In this case, $\beta^F$ becomes zero, and we can use this information to calculate $\beta^N$ and the rest of the cells, as shown in table 4. However, in most cases, prior information is unattainable. Moreover, researchers receive no warning when this informed knowledge is incorrect, which results in biased estimates of parameters (Voss 2000).

The final example of ecological regression has a different premise.
The neighborhood model assumes that the parameters of interest are the same within a unit ($\beta_i^N = \beta_i^F$), yet vary across units ($\beta_i^N \neq \beta_j^N$, where $i \neq j$). Therefore, equation 2) becomes $T_i = \beta_i X_i + \beta_i (1 - X_i) + e_i = \beta_i + e_i$, where $\beta_i$ is a function of $X_i$. For example, this model assumes that the literacy ratio of the native and that of the foreign born are the same within a unit, while the literacy ratio varies across units. As one may notice, the assumption that the neighborhood model makes is too strong. Even if it were a plausible assumption, there would be nothing left to estimate, because the assumption itself answers our research question of whether birthplace is related to the level of literacy.

With the exception of the neighborhood model, a survey of ecological regressions shows some common problems. First, all of the models assume a constant effect of parameters. This seems too restrictive, because the parameters of interest are hardly constant across units. Second, the models produce only a single estimate. Since we attempt to infer the individual level relationship, a single estimate is unlikely to be satisfactory. Finally, if an equation has more than two parameters to be estimated, it is hard to imagine how these methods can be extended.

Gary King (1997) provides an interesting method to solve these problems.
First, he does not assume a constant effect; rather, he assumes that a parameter varies along a common underlying dimension. Second, because the parameters vary, we obtain an estimate per unit. In addition, since his method uses additional information from method of bounds, the estimates become more efficient. His method can be written as follows (King 1997, 93-94):

$$T_i = \beta_i^N X_i + \beta_i^F (1 - X_i) + e_i, \quad \text{where } P(\beta_i^N, \beta_i^F) = TN(\beta_i^N, \beta_i^F \mid \mathrm{B}, \Sigma)$$ ··········3)
The probability density of the parameters $(\beta_i^N, \beta_i^F)$ follows a truncated normal distribution with limits $\beta_i^N \in [0, 1]$ and $\beta_i^F \in [0, 1]$. With the help of method of bounds, these limits for $(\beta_i^N, \beta_i^F)$ can be narrowed down as follows:

$$\beta_i^N \in \left[\max\left(\frac{T_i - 1 + X_i}{X_i},\, 0\right),\ \min\left(\frac{T_i}{X_i},\, 1\right)\right]$$

$$\beta_i^F \in \left[\max\left(\frac{T_i - X_i}{1 - X_i},\, 0\right),\ \min\left(\frac{T_i}{1 - X_i},\, 1\right)\right].$$

The mean vector and variance matrix of $(\beta_i^N, \beta_i^F)$ are

$$\mathrm{B} = \begin{pmatrix} \mathrm{B}^N \\ \mathrm{B}^F \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_N^2 & \sigma_{NF} \\ \sigma_{NF} & \sigma_F^2 \end{pmatrix}.$$
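The distributional assumption in equation 3) can be illustrated with a rough sketch that draws parameter pairs from a bivariate normal and keeps only those inside the bounds (simple rejection sampling). This is not King’s estimation procedure, which also has to estimate the mean vector and variance matrix from the data; the mean, covariance, and sample size below are hypothetical values, and the bounds are those of the literacy example.

```python
import numpy as np


def sample_truncated_bivariate_normal(mean, cov, lower, upper, size,
                                      seed=0, max_rounds=1000):
    """Rejection sampler: draw from N(mean, cov), keep draws inside the box."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    kept, n_kept = [], 0
    for _ in range(max_rounds):
        draws = rng.multivariate_normal(mean, cov, size=4 * size)
        inside = np.all((draws >= lower) & (draws <= upper), axis=1)
        kept.append(draws[inside])
        n_kept += int(inside.sum())
        if n_kept >= size:
            return np.vstack(kept)[:size]
    raise RuntimeError("acceptance rate too low for rejection sampling")


# Hypothetical mean vector and variance matrix for (beta_N, beta_F).
mean = [0.70, 0.30]
cov = [[0.020, 0.005],
       [0.005, 0.030]]

# Bounds from the literacy example: beta_N in [0.70, 0.75], beta_F in [0, 1].
samples = sample_truncated_bivariate_normal(
    mean, cov, lower=[0.70, 0.0], upper=[0.75, 1.0], size=500)
print(samples.mean(axis=0))  # average (beta_N, beta_F) of the accepted draws
```

Narrower bounds discard more of the untruncated distribution, which is one way to see why the information from method of bounds sharpens the resulting estimates.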
If his three assumptions are met, estimation produces an efficient and robust estimate.2)

2) The three assumptions are a single model of parameters, the absence of spatial correlation, and no correlation between the marginals and the parameters. King, Rosen, and Tanner (1999, 67-68) show, however, that the violation of the third assumption does not produce biased estimates if the bounds of parameters are low enough.

The estimation procedure of King’s solution can be summarized in two steps. The first step calculates the bounds of the parameters. The second step estimates the parameters from truncated bivariate normal distributions within the bounds. If one extends the model to more than two parameters, the estimates from the first estimation are used as marginal values. For example, if one is interested in the proportion of regional voting, he or she first estimates the turnout rate among those who were born in a certain region in a given district. The estimated turnout rate is then used to compute the marginal values for the proportion of regional voting. It can be diagrammed as below:
King’s Solution: the first step

| | Vote | Not vote | Marginal |
|---|---|---|---|
| Jeolla | $\beta_i^J$ | $1-\beta_i^J$ | $X_i$ |
| Other Regions | $\beta_i^{J'}$ | $1-\beta_i^{J'}$ | $1-X_i$ |
| Marginal | $T_i$ | $1-T_i$ | |

where $X$ is the proportion of the voting age population who were born in Jeolla, $T$ is the proportion of those who turn out to vote, $\beta^J$ is the turnout rate among those who were born in Jeolla, $\beta^{J'}$ is the turnout rate among those who were born in a region other than Jeolla, and "i" is a district indicator.
King’s Solution: the second step

| | NCNP | Other parties | Marginal |
|---|---|---|---|
| Jeolla | $\lambda_i^J$ | $1-\lambda_i^J$ | $x_i$ |
| Other Regions | $\lambda_i^{J'}$ | $1-\lambda_i^{J'}$ | $1-x_i$ |
| Marginal | $P_i$ | $1-P_i$ | |

where "x" is the estimated proportion of voters whose hometown is in Jeolla and who turn out to vote, $P$ is the vote share for a candidate whose party label is the National Congress for New Politics (NCNP), and $\lambda$ is the proportion of regional voting.
The first step is to estimate the turnout rates ($\beta_i^J$ and $\beta_i^{J'}$) for those who were born in Jeolla ($X_i$) and for those who were born in areas other than Jeolla ($1-X_i$). Once we obtain estimates for the $\beta$s, these estimates are used to calculate the marginal values for the regional voting estimates ($x_i$ and $1-x_i$). The second step starts with the calculation of the bounds of the parameters ($\lambda$s) and then estimates the parameters across units. Note, however, that since $x_i$ has a component to be estimated, it is not a fixed variable. Therefore, extending tables produces more uncertain estimates due to the added uncertainty originating from the first estimation.3) In the next section, we apply King’s solution to find the level of regional voting in the Korean National Assembly elections of 2000.
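How the first-step estimate feeds the second step can be sketched as follows. Two things here are assumptions rather than statements from the paper: the marginal x_i is taken to be the share of actual voters who were born in the region, beta_J * X / T, which is what the row marginals of the second-step table imply, and the bounds for the lambdas are obtained by applying the same method-of-bounds formulas to P and x. The district values are hypothetical.

```python
def second_step_marginal(beta_j, X, T):
    """Assumed marginal: share of actual voters born in the region (beta_J * X / T)."""
    if not 0 < T <= 1:
        raise ValueError("turnout T must lie in (0, 1]")
    return beta_j * X / T


def lambda_bounds(P, x):
    """Method-of-bounds limits for regional voting, by analogy with the beta bounds."""
    if not 0 < x < 1:
        raise ValueError("x must lie strictly between 0 and 1")
    lam_region = (max((P - 1 + x) / x, 0.0), min(P / x, 1.0))
    lam_other = (max((P - x) / (1 - x), 0.0), min(P / (1 - x), 1.0))
    return lam_region, lam_other


# Hypothetical district: 30% of the voting age population born in the region,
# estimated turnout of 60% among them, overall turnout of 50%, party vote share of 40%.
x_i = second_step_marginal(beta_j=0.60, X=0.30, T=0.50)
lam_region, lam_other = lambda_bounds(P=0.40, x=x_i)
print(f"x_i = {x_i:.2f}")
print(f"lambda (born in region) in [{lam_region[0]:.2f}, {lam_region[1]:.2f}]")
print(f"lambda (born elsewhere) in [{lam_other[0]:.4f}, {lam_other[1]:.4f}]")
```

Because x_i is built from an estimated turnout rate, any error in the first step carries over into these bounds and into the second-step estimates, which is the added uncertainty the text refers to.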
3) Note that the parameters (λ and (1-λ)) are weighted by the number of voting age population in a given district.

III. Application: Disaggregating Regional Voting

The 2000 election outcomes in Korea suggest that there are three regions which tend to exhibit partisan regionalism: Jeolla, Gyeongsang, and Chungcheong. The Jeolla region covers the Jeollabuk-do and Jeollanam-do areas, the Gyeongsang region refers to the Gyeongsangbuk-do and Gyeongsangnam-do areas, and the Chungcheong region means the Chungcheongbuk-do and Chungcheongnam-do areas. Regional dominance by a particular party was specified in terms of the birthplace of particular party leaders. A leader of the Grand National Party, Kim Yong Sam, was born in the Gyeongsang region; a leader of the National Congress for New Politics, Kim Dae Jung, in the Jeolla region; and a leader of the United Liberal Democrats, Kim Chong Phil, in the Chungcheong region. This link between the birthplace of a party leader and the dominance of a particular party is well documented in contemporary Korean politics (Kim 1994; Lee 1998; Lee and Brunn 1996).

In this section, using Gary King’s ecological regression, we attempt to disaggregate aggregate votes along regional party lines in a district. King’s method infers regional voting at the candidate level. The percentage of regional voting at the candidate level will be averaged across regional blocs and compared to estimates from survey results.
… ··········6)

where P is the vote share of a candidate, λ is the proportion of regional voting, and λ' is the proportion of non-regional voting. J indicates Jeolla, G Gyeongsang, and C Chungcheong. "$x_i^J$" is …, where β is the turnout rate for those who were born in a certain region, and X is the proportion of voters who were born in a certain region. "i" indicates a district. The level of analysis is the candidate level. Estimation is done by the program called "EzI."4)
In the 16th National Assembly elections of Korea (April 13, 2000), 194 incumbent and 449 non-incumbent candidates ran for office (National Election Commission 2000). We focus our analysis on the vote share for candidates of the three major parties.5) The percentage of those who registered their birthplace in a given district was collected with the help of one of the major parties.6)

4) "EzI" was developed by Kenneth Benoit and Gary King (released in 2001). It is available from http://gking.harvard.edu/stats.shtml, visited May 1, 2002.
5) We focus on only these three parties because they comprised over 96% of the single member district seats in the 16th National Assembly Election.
6) At the contributor’s request, the source of the data has not been released. Data on districts in Chungcheong-do are not available, so the number of districts in this study is 106.
Table 5.

| Region | GNP | NCNP | ULD |
|---|---|---|---|
| … | … | … | … |
| Gyeongsangbuk-do | 54.88% | 71.17% | 1.83% |
| Gyeongsangnam-do | 55.35% | 68.99% | 2.18% |
| Average | 52.21% | 61.74% | 1.04% |

Source: compiled by the authors.
Note: Average λ is calculated by averaging district level regional voting (λi) within regional blocs. GNP stands for the Grand National Party, NCNP for the National Congress for New Politics and ULD for the United Liberal Democrats.
Table 5 shows that, on average, the percentage of regional voting (61.74%) is the highest among those who were born in Jeolla, which may benefit candidates of the National Congress for New Politics. The percentage of regional voting for those who were born in Gyeongsang ranks second. Not surprisingly, the level of regional voting is the lowest for those who were born in Chungcheong. This resulted in less concentration of votes on candidates running under the United Liberal Democrats (ULD). In the 15th National Assembly Elections of 1996, the ULD won 25 of the 28 seats. But in the elections of 2000, the ULD was only able to win 11 of the 24 seats in Chungcheong.

The same pattern holds when one examines the percentage of regional voting within a regional bloc. For example, 79.46 percent of those who were born in Jeolla cast a regional vote if they reside in districts within Gwangju. More than 70 percent of voters born in Jeolla also voted for candidates of the NCNP in districts within Seoul, Daegu, Incheon, Gyeonggi-do, Gangwon-do and Gyeongsangbuk-do.
Those who were born in Gyeongsang cast slightly fewer regional votes, yet the level is still significant. On average, more than half of the voters who were born in Gyeongsang cast a regional vote in the 16th National Assembly elections. This is especially the case in districts within Seoul, Busan, Daegu, Incheon, Gyeonggi-do, and Gangwon-do, where more than 60 percent of voters born in Gyeongsang voted for candidates of the GNP. It is also the case, though to a lesser extent, for those who reside in Gwangju, Ulsan, Jeollabuk-do, Gyeongsangbuk-do, and Gyeongsangnam-do.

Not surprisingly, those who were born in Chungcheong cast the fewest regional votes. Only a handful of voters who were born in Chungcheong cast a regional vote on average. However, it should be noted that data for districts within the Chungcheong area were not available, and thus regional voting may be underestimated.

In sum, regional voting estimates from King’s method provide supportive evidence for the argument that regional voting is a nationwide problem (Kim 1994; Lee 1998; Lee and Brunn 1996). Not only is the level of regional voting significant in districts within the known regional voting blocs (Jeolla and Gyeongsang), but it also appears to be substantial in districts outside these regional blocs. However, there is significant fluctuation in the level of regional voting across regions. For example, the percentage of regional voting for those who were born in Jeolla varies from 61.04% in Jeollabuk-do to 79.46% in Gwangju, while for those who were born in Gyeongsang it varies from 43.85% in Jeollanam-do to 67.29% in Busan. We can conclude that the level of regional voting is not constant, but varies across regional blocs.

The results from ecological inference can be verified by survey results.
The procedure is the same as above except that the figures are obtained from individual level data. The first step is to identify voters who were born in a certain region, and then calculate how many of these voters turned out to vote for the pertinent party. The Korean Social Science Data Center conducted a survey on the 16th National Assembly Elections of April 13, 2000 (Korean Social Science Data Center 2000). A multistage quota sampling technique was used to collect a random sample by regional bloc. 1,100 interviews were completed with a rejection rate of 5 percent. Fortunately, the survey includes questions on the hometown and the vote choice of a respondent. These two questions were used to construct a regional voting variable; for example, if a respondent was born in the Jeolla area and voted for the NCNP, the response is coded as a regional vote for the NCNP. In Seoul, fifty five respondents were born in Jeolla. Thirty four out of the fifty five voted for the NCNP. Therefore, the percentage of regional voting for the NCNP is 61.82 percent for Seoul. Table 6 shows the percentage of regional voting in regional blocs.
Source: The Korean Social Science Data Center (2000). Figures in parenthesis are the number of respondents.
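The coding rule for the survey figures can be sketched as below. The record layout (the keys "hometown" and "vote") is hypothetical, and the respondent list is a reconstruction of the Seoul cell only: 55 Jeolla-born respondents, 34 of whom voted for the NCNP, plus a few filler respondents from another region to show that they are ignored.

```python
def regional_voting_share(respondents, region, party):
    """Percentage of respondents born in `region` who voted for `party`, and the cell size."""
    born_in_region = [r for r in respondents if r["hometown"] == region]
    if not born_in_region:
        raise ValueError("no respondents born in the requested region")
    regional_votes = sum(1 for r in born_in_region if r["vote"] == party)
    return 100.0 * regional_votes / len(born_in_region), len(born_in_region)


# Reconstructed Seoul cell; the Gyeongsang-born respondents are hypothetical filler.
seoul = (
    [{"hometown": "Jeolla", "vote": "NCNP"}] * 34
    + [{"hometown": "Jeolla", "vote": "other"}] * 21
    + [{"hometown": "Gyeongsang", "vote": "GNP"}] * 10
)

share, cell_size = regional_voting_share(seoul, region="Jeolla", party="NCNP")
print(f"{share:.2f} percent of {cell_size} respondents")  # 61.82 percent of 55 respondents
```

Returning the cell size next to the percentage makes it easy to see when a figure rests on only one or two respondents.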
Table 6 shows that regional voting is most evident for those who were born in Jeolla, followed by those who were born in Gyeongsang and in Chungcheong. The level of regional voting is slightly lower than that from King’s solution. On average, 46.02 percent of those who were born in Jeolla voted for the NCNP, while the figure is 30.09 percent for voters who were born in Gyeongsang.

A significant fluctuation appears across regional blocs. In particular, if cell frequency is too small, the variation of regional voting is beyond the acceptable range. For example, in Gangwon-do, where cell frequency is one, the percentage of regional voting is 100 percent among those who were born in Jeolla, while it is zero percent in Chungcheongbuk-do, where cell frequency is also one.

If one focuses on the level of regional voting where the number of respondents is sufficient, one finds similarity in the level of regional voting between estimates from King’s method and estimates from survey results. For example, according to survey results, the percentage of regional voting in Seoul is 61.82 percent for the NCNP, while the comparable figure is 75.27 percent by the ecological regression. It is 60 percent in Jeollabuk-do by survey results, while it is 61.04 percent by the ecological regression. We can safely conclude that estimates from King’s solution closely match those from survey results.
IV. Conclusion

Applying aggregate level findings to individual level relationships generates biased estimates, known as the ecological fallacy problem. Social scientists often encounter difficulty in conducting research at the individual level with aggregate data. In particular, if a research question is related to a past event for which survey data are not available, it is almost impossible to pursue the research. Further, if there is a politically correct answer to survey questions, it is hard to obtain unbiased estimates.
In this paper, we introduced a way to infer individual level relationships from aggregate data. We began with the aggregation bias which leads to the ecological fallacy problem. A series of efforts have been made to solve the aggregation bias. The method by Gary King, which combines method of bounds and statistical association, is emphasized.

King’s solution is well known as a method that produces robust and efficient estimates, even when there is severe aggregation bias. The solution was applied to infer regional voting at the candidate level. The percentage of regional voting was averaged across regional blocs. The average percentages were then compared to estimates from survey results. We found that estimates from King’s method closely match estimates from survey results if cell frequency in the survey results is large enough. We also found that the former is more reliable than the latter if cell frequency in the survey results is small.

Ecological regressions offer a new avenue to examine previously impossible questions. For example, we can examine who voted for Chung-hee Park in the 1963 Korean presidential election. We can further question why they voted for him; for example, was the generation effect related to the outcome of the 1963 Korean presidential election? Furthermore, we can use ecological regressions to examine whether or not a voter casts a vote for one party’s candidate in a congressional election while casting a vote for a different party’s candidate in a presidential election (Burden and Kimball 1998). We should note, however, that ecological regressions also have some limitations. If tables are extended beyond 2 by 2, the uncertainty of the estimates grows. Scholars of ecological regression attempt to reduce this uncertainty (King, Rosen, and Tanner 1999; Rosen, Jiang, King, and Tanner forthcoming).
< REFERENCE >

Achen, Christopher H. and W. Phillips Shively (1995). Cross-Level Inference. Chicago: University of Chicago Press.

Benoit, Kenneth and Gary King (1996). "A Preview of EI and EzI: Program for Ecological Inference." Social Science Computer Review 14: 433-438.

Burden, Barry C. and David C. Kimball (1998). "A New Approach to the Study of Ticket Splitting." American Political Science Review 92: 533-544.

Goodman, Leo (1959). "Some Alternatives to Ecological Correlation." American Journal of Sociology 64: 610-625.

Kim, Jae-On and B.C. Koh (1980). "The Dynamics of Electoral Politics: Social Development, Political Participation, and Manipulation of Electoral Laws." In Political Participation in Korea: Democracy, Mobilization, and Stability, edited by Chong Lim Kim. Santa Barbara: CLIO Books. 59-84.

King, Gary (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton, NJ: Princeton University Press.

King, Gary, Ori Rosen, and Martin A. Tanner (1999). "Binomial-Beta Hierarchical Models for Ecological Inference." Sociological Methods & Research 28: 61-90.

Korean Social Science Data Center (2000). A Survey on Voters Attitudes toward the 16th General Election. Seoul: Korean Social Science Data Center.

Lee, Dong Ok and Stanley D. Brunn (1996). "Politics and regions in Korea: an analysis of the recent presidential election." Political Geography 15: 99-119.

Lee, Nam Young (1998). "Regionalism and Voting Behavior in South Korea." Korea Observer 29: 611-633.

National Election Commission. http://www.nec.go.kr (2000. 6. 1).

Palmquist, Bradley Lowell (1993). Ecological Inference, Aggregate Data Analysis of U. S. Elections, and the Socialist Party of America. Ph. D. dissertation at the University of California, Berkeley.

Robinson, W. S. (1950). "Ecological Correlations and the Behavior of Individuals." American Sociological Review 15: 351-357.

Rosen, Ori, Wenxin Jiang, Gary King, and Martin A. Tanner (Forthcoming). "Bayesian and Frequentist Inference for Ecological Inference: the R X C Case." Statistica Neerlandica.

Shin, Myungsoon, Youngjae Jin, Donald A. Gross, and Kihong Eom (2005). "Money Matters in Party-Centered Politics: Campaign Spending in Korean Congressional Elections." Electoral Studies 24: 85-101.

Voss, D. Stephen (2000). Familiarity Doesn’t Breed Contempt: The Political Geography of Racial Polarization. Ph. D. dissertation at Harvard University.