TECHNICAL APPENDIX
Estimating Local Redistribution Through Property-Tax-Funded Public School Systems
I.
ITERATIVE RANDOM FITTING (IRF)
Our basic approach to allocating individual households identified by PUMAs to block groups and then school districts has two steps. First we allocate households in a PUMA to a specific block group. The second and much easier step aggregates block groups up to school districts using the MABLE program from the Missouri Census Data Center (MABLE, 2008). The only difficulty in this second step arises in the situation where a block group straddles two or more school districts. MABLE allocates shares of the block group to each district based on the number of housing units at the block level in each district.
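The housing-unit weighting described above can be sketched as follows. This is only an illustration of the share calculation, not MABLE's actual code or interface; the function name `district_shares` and the input layout (one `(district_id, housing_units)` pair per census block) are assumptions made for the example.

```python
from collections import defaultdict

def district_shares(block_units):
    """Split one block group across school districts by housing units.

    block_units: iterable of (district_id, housing_units) pairs, one per
    census block in the block group (hypothetical input layout).
    Returns {district_id: share}, with shares summing to 1.
    """
    units_by_district = defaultdict(float)
    for district_id, housing_units in block_units:
        units_by_district[district_id] += housing_units
    total_units = sum(units_by_district.values())
    if total_units == 0:
        return {}  # no housing units, so nothing to allocate
    return {d: u / total_units for d, u in units_by_district.items()}

# A block group whose blocks fall in two districts (made-up counts).
shares = district_shares([("D1", 30), ("D2", 10), ("D1", 20)])
```

In this made-up case district D1 holds 50 of the 60 housing units, so it would receive five-sixths of the block group's households.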
A.
Basic Technique
The 5% PUMS is rich in detail, with respect to both household characteristics and dwelling characteristics. The task is to use these data to allocate each sample household belonging to a specific suburban PUMA to an appropriate block group falling within that PUMA. For each block group the Census provides considerable data on a range of statistics and distributions in the SF3. We use these block group statistics and distributions to construct a multivariate loss function which is to be minimized.
The standard approach to this minimization problem uses Iterative Random Fitting (IRF).[1] We start from a random allocation. Had we stopped there, we would in effect be assuming that every block group has the same distribution of household characteristics as its PUMA. For each separate PUMA, our heuristic adjusts this random household allocation with the purpose of minimizing an error loss function:

$$\sum_{i} \sum_{j} \left| x^{*}_{ij} - x_{ij} \right|,$$

where $x^{*}_{ij}$ is the value of one of 54 characteristics $i$ for the $j$th block group and $x_{ij}$ is the estimated value from the allocation. The 54 characteristics are: total households, total rental households, number of public school students, number of black headed households, number of Hispanic headed households, 9 household head age categories, 16 household income categories, and 24 house value categories. Within each PUMA, the initial random assignment is adjusted by randomly reassigning a single household to a new block group. If this move improves the loss function, the new assignment is maintained. Otherwise the search reverts to the previous allocation. This process is repeated 500,000 times to complete a local search. To avoid getting stuck in a local minimum, we do 20 repetitions, each seeded by a new overall random assignment. Whichever local search has the smallest loss function is then modified by a series of random swaps between block groups (from 502 to 2002 for each PUMA). Again, any given swap is maintained only if it lowers the loss function. Finally, each household is swapped with whichever household most substantially reduces the loss function. This last step is deterministic in the sense that the order in which households are selected will influence the expected results. In general these final scores were about an eighth the size of those obtained from a single local search.
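The reassignment search and random restarts described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names, the list-of-lists data layout, and the small default iteration count are all hypothetical, households are assumed to carry their sample weights already, and the later swap stages are omitted.

```python
import random

def total_absolute_error(targets, estimates):
    """Sum over block groups j and characteristics i of |x*_ij - x_ij|."""
    return sum(abs(t - e)
               for t_row, e_row in zip(targets, estimates)
               for t, e in zip(t_row, e_row))

def local_search(households, targets, n_iter, rng):
    """One local search starting from a random allocation.

    households[k][i]: contribution of household k to characteristic i.
    targets[j][i]: Census value of characteristic i in block group j.
    """
    n_groups, n_chars = len(targets), len(targets[0])
    assign = [rng.randrange(n_groups) for _ in households]
    est = [[0.0] * n_chars for _ in range(n_groups)]
    for h, j in zip(households, assign):
        for i, v in enumerate(h):
            est[j][i] += v
    loss = total_absolute_error(targets, est)
    for _ in range(n_iter):
        k = rng.randrange(len(households))
        old, new = assign[k], rng.randrange(n_groups)
        if new == old:
            continue
        # Only the two affected block groups change the loss.
        delta = 0.0
        for i, v in enumerate(households[k]):
            delta += (abs(targets[old][i] - (est[old][i] - v))
                      - abs(targets[old][i] - est[old][i]))
            delta += (abs(targets[new][i] - (est[new][i] + v))
                      - abs(targets[new][i] - est[new][i]))
        if delta < 0:  # keep the move only if it lowers the loss
            for i, v in enumerate(households[k]):
                est[old][i] -= v
                est[new][i] += v
            assign[k] = new
            loss += delta
    return assign, loss

def best_of_restarts(households, targets, n_iter=500, n_restarts=20, seed=0):
    """Run several local searches and return the (assignment, loss) with the smallest loss."""
    rng = random.Random(seed)
    runs = (local_search(households, targets, n_iter, rng)
            for _ in range(n_restarts))
    return min(runs, key=lambda run: run[1])
```

Updating the loss incrementally from the two affected block groups, rather than recomputing the full double sum, is one way to keep each of the many trial moves cheap; whether the original programs do this is not stated in the text.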
[1] Our programs are available on request.
B.
Validation and Sensitivity Analysis
Clearly the heuristic employed here has a somewhat arbitrary character, and, because we minimize the error-loss function across multiple dimensions, only one of which is the number of households, we do not expect the synthesized household counts to match block group aggregate counts perfectly. Similarly, because our PUMA observations come from a 5% sample, the weighted number of microdata households will not equal the number of households reported in the Census. However, accuracy along this dimension is clearly a desirable trait. Figure A1 depicts the frequency distribution of the absolute percentage error of total household counts by block group. The synthesized household counts are within 10% of the Census counts for 63% of the block groups.[2]
[2] While we have no absolute criterion to apply to such statistics, these results suggest that our subsequent calculations of redistribution at the level of school districts will be only slightly affected by household misallocations. School districts are considerably larger and more heterogeneous than block groups.
Figure A1 Frequency Distribution of Absolute Error in Household Counts
[Histogram. Horizontal axis: Absolute % Error: Household Count. Vertical axis: Number of Block Groups.]
Figure A1: Household counts are for owner-occupied households by block group. The true values come from SF3 Census data while estimated values are from the synthetic population generation procedures described in the text.
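The accuracy measure plotted in Figure A1, and the share of block groups within a given tolerance, could be computed along the following lines. This is a sketch with made-up counts; the function names and the choice to skip block groups with a zero Census count are assumptions, not details taken from the text.

```python
def abs_pct_errors(census_counts, synthetic_counts):
    """Absolute percentage error of synthetic vs. Census household counts.

    Block groups with a zero Census count are skipped (an assumption),
    since the percentage error is undefined there.
    """
    return [100.0 * abs(s - c) / c
            for c, s in zip(census_counts, synthetic_counts) if c > 0]

def share_within(errors, threshold=10.0):
    """Fraction of block groups whose error is at most `threshold` percent."""
    return sum(e <= threshold for e in errors) / len(errors)

# Made-up counts for four block groups.
census = [100, 200, 50, 400]
synthetic = [105, 150, 50, 450]
errors = abs_pct_errors(census, synthetic)   # [5.0, 25.0, 0.0, 12.5]
print(share_within(errors))                  # 0.5
```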
As additional validation, Figure A2 presents a series of graphs showing synthetic school district characteristics plotted against actual values. Included there are the number of children enrolled in public schools, median home values, median household income, percent of home owners over age 65, percent of heads who are black and percent of heads who are Hispanic. As can be seen there, for these characteristics the IRF technique closely fits the actual data.[3]
[3] The only apparent outlier occurs in the median house value graph in Figure 3. However, the actual error here is relatively small. The median house value for that district falls in the $625,000 bin, while the synthetic value falls in the $425,000 bin, which is the next lowest bin.
Figure A2 Synthetic Estimates vs. Census Values for Key Variables by School District
[Six scatter plots of allocated (synthetic) values against Census values. Note on the panels: marker labels indicate multiple overlapping points.]
Panel: Median Home Value by Quasi-School District (,000s). Axes: Median Census Value vs. Median Allocated Value.
Panel: # of Children Enrolled in Quasi-School Districts (,000s). Axes: # of Census Children vs. # of Allocated Children.
Panel: Median Household Income by Quasi-School District (,000s). Axes: Median Census Income vs. Median Allocated Income.
Panel: % Quasi-School District Households with Head 65+. Axes: % of Census Households vs. % of Allocated Households.
Panel: % Quasi-School District Households with Hispanic Head. Axes: % of Census Households vs. % of Allocated Households.
Panel: % Quasi-School District Households with Black Head. Axes: % of Census Households vs. % of Allocated Households.
Figure A2: These graphs show statistics from the synthetic population estimates plotted against actual Census values for the eighty-five suburban Chicago pseudo-districts. In each case the scatter plots fall quite close to the 45° line.
We also carry out a sensitivity analysis that provides reasonable bounds on the key redistribution payment variable. In a world where households making redistribution payments sought out the block group with the lowest possible tax rates in their PUMA, their redistribution payments would be much smaller than the actual payments. Alternatively, should households making redistribution payments seek out the highest possible tax rates in their PUMA, redistribution payments would be much higher than the actual payments. We can perform this experiment relatively easily for each PUMA. On average, the lower and upper bounds lie about 22% below and 25% above the estimated values (see Table A1). From the perspective of this paper it is the lower bound which is of greatest interest. A 20% error in this direction would imply a 20% reduction in redistribution payments. While significant, such reductions would still leave overall redistribution payments quite large. Moreover, we have no reason to expect our estimates to be biased. The balance of the minimum and maximum bounds supports this proposition.
Table A1 Home Owner Redistribution Payments by PUMA with Bounds

PUMA      Minimum       Best    Maximum   Min % Difference   Max % Difference
3001        $42.3      $46.3      $51.3              -8.6%              10.9%
3002        $40.9      $50.4      $57.6             -19.0%              14.2%
3003        $39.5      $48.5      $53.8             -18.6%              10.8%
3004        $58.7      $61.0      $66.1              -3.8%               8.5%
3005        $23.0      $24.3      $25.7              -5.3%               5.9%
3006        $53.4      $61.6      $72.3             -13.3%              17.5%
3101        $19.2      $42.3      $71.5             -54.7%              68.8%
3102        $18.1      $25.8      $30.2             -29.7%              17.0%
3103        $51.3      $57.9      $62.0             -11.4%               7.0%
3104        $56.9      $62.2      $64.3              -8.6%               3.2%
3201        $79.3      $92.1     $105.2             -13.9%              14.2%
3202        $64.9      $67.0      $73.0              -3.2%               9.0%
3203        $48.5      $53.3      $61.7              -9.0%              15.7%
3204        $71.7      $88.9     $112.4             -19.3%              26.5%
3205        $44.4      $70.8     $119.1             -37.3%              68.2%
3206        $59.7      $79.0     $103.3             -24.4%              30.7%
3301        $58.1      $90.1     $127.6             -35.5%              41.7%
3302        $24.1      $24.8      $28.9              -2.5%              16.6%
3303        $71.7      $89.9     $115.0             -20.2%              28.0%
3304        $50.8      $70.0      $87.3             -27.5%              24.7%
3305        $41.8      $71.2      $91.0             -41.3%              27.7%
3401        $45.3      $65.8      $73.9             -31.1%              12.3%
3402        $62.5      $72.6      $77.4             -13.9%               6.7%
3403        $86.1     $112.5     $163.3             -23.5%              45.1%
3404        $95.0     $122.7     $136.8             -22.6%              11.4%
3405       $106.4     $163.6     $215.2             -35.0%              31.6%
3406        $20.5      $34.9      $44.8             -41.2%              28.3%
3407        $48.1      $66.6      $84.4             -27.7%              26.8%
3408        $47.4      $55.4      $61.8             -14.4%              11.6%
3409        $39.5      $60.2      $80.0             -34.4%              32.9%
3410        $64.3      $81.8     $112.1             -21.4%              37.0%
3411        $45.1      $51.9      $75.1             -13.0%              44.8%
3412        $26.2      $38.2      $57.6             -31.4%              50.9%
3413        $33.8      $44.3      $53.9             -23.7%              21.7%
3414        $48.7      $60.3      $91.2             -19.2%              51.3%
Total:   $1,787.2   $2,308.2   $2,906.8
Ave.:                                               -21.7%              25.1%

Table A1: Columns Minimum and Maximum denote the lower and upper bounds on redistribution payments for each respective PUMA. Column Best provides estimates of actual redistribution payments made using the heuristic allocation described in the text. Min % Difference and Max % Difference provide the percentage difference between the allocated and bounded redistribution payment levels. All monetary amounts are in millions of year 2000 dollars.
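The percentage differences reported in Table A1 can be checked against the printed dollar amounts with a short calculation. The sketch below assumes each difference is taken relative to the Best column; the function name `pct_diff` and the selection of rows are illustrative only.

```python
def pct_diff(bound, best):
    """Percentage difference of a bound relative to the best estimate."""
    return 100.0 * (bound - best) / best

# (Minimum, Best, Maximum) in millions of year 2000 dollars, from Table A1.
table_a1 = {
    "3001": (42.3, 46.3, 51.3),
    "3205": (44.4, 70.8, 119.1),
    "Total": (1787.2, 2308.2, 2906.8),
}

for label, (minimum, best, maximum) in table_a1.items():
    print(f"{label}: {pct_diff(minimum, best):+.1f}% / {pct_diff(maximum, best):+.1f}%")
# 3001: -8.6% / +10.8%
# 3205: -37.3% / +68.2%
# Total: -22.6% / +25.9%
```

Small discrepancies from the printed percentages (for example 10.8% here against 10.9% in the table for PUMA 3001) presumably arise because the table was computed from unrounded dollar amounts. The differences computed from the Total row are also not the same as the Ave. row, which matches the simple mean of the 35 per-PUMA percentages.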
C.
Limitations
The previous sections help confirm the accuracy of our IRF routine. However, some potential limitations must be recognized. Most importantly, all synthetic population procedures depend upon often arbitrary goodness-of-fit statistics (Williamson, Birkin, and Rees, 1998; Voas and Williamson, 2001). Our IRF procedure employs the total absolute error criterion which, although easy to calculate and interpret, does have its weaknesses. In particular, it has been recognized that using the total absolute error as a selection criterion may yield relatively inaccurate estimates for variables whose small-area aggregate counts are not representative of the sampled microdata when compared to other variables that are simultaneously included in the constraint. Because a random assignment is more likely to "fit" the more representative variables, the less representative variables are more likely to be "outvoted" when it comes to assessing an assignment. This issue may be particularly common when synthesizing populations for large college towns or military bases. Thus, while the number and type of variables simultaneously included as aggregate counts in the error loss function are limited only by the number and type of variables available in the microdata, including too many variables may limit the accuracy for certain, particularly less representative, variables. Additionally, the total absolute error does not account for relative error, which may be the preferred measure of goodness-of-fit under certain circumstances. More sophisticated goodness-of-fit measures and fitting procedures not incorporated in our technique have been proposed by Voas and Williamson (2000, 2001) and Huang and Williamson (2001).

Another potential limitation of our IRF procedure is that combinatorial optimization techniques can provide rather weak small-area estimates of variables not included among the small-area aggregate counts. However, as noted by Voas and Williamson (2000), this limitation is of little concern for researchers who can choose constraining variables per their own requirements. Because we are able to include number of children enrolled in public schools, home value, and other appropriate variables as constraints, we are not overly concerned with this particular limitation.

Lastly, there are cases where synthetic reconstruction seems to outperform combinatorial optimization. For example, Ryan, Maoh, and Kanaroglou (2009) suggest synthetic reconstruction may outperform combinatorial optimization in instances when tabular detail is low and microdata sample sizes are large. Additionally, Williamson (2007) speculates that synthetic reconstruction techniques may outperform combinatorial optimization in instances when combinations of household- and individual-level aggregate counts are included in the error loss function. To our knowledge, no research to date has empirically tested Williamson's line of reasoning.
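The earlier point that total absolute error does not account for relative error can be illustrated with a toy comparison. The numbers and function names below are made up for the example, and the mean relative error shown is just one simple alternative, not one of the specific measures proposed in the cited papers.

```python
def total_absolute_error(target, estimate):
    """Sum of absolute deviations across characteristics."""
    return sum(abs(t - e) for t, e in zip(target, estimate))

def mean_relative_error(target, estimate):
    """Average of absolute deviations scaled by the target value."""
    return sum(abs(t - e) / t for t, e in zip(target, estimate)) / len(target)

# Hypothetical block group with one large count and one small count.
target = [1000, 20]
fit_a = [950, 20]    # misses the large count by 50, small count exact
fit_b = [1000, 5]    # large count exact, misses the small count by 15

print(total_absolute_error(target, fit_a), total_absolute_error(target, fit_b))  # 50 15
print(mean_relative_error(target, fit_a), mean_relative_error(target, fit_b))
```

Total absolute error prefers `fit_b`, even though it recovers only a quarter of the small count, while the relative measure prefers `fit_a`. Which ranking is more appropriate depends on the purpose of the synthetic population.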
II.
MAPS
Figure A3 Redistribution Payments’ Share of Property Taxes by Pseudo-School District
[Map; the City of Chicago is labeled.]
Figure A3: Redistribution payments as a share of all owner education property taxes for suburban Chicago pseudo-districts range from about 41% to 80%.
Figure A4 Percentage of Households with No Children Enrolled in the Public Schools by Pseudo-School District
[Map; the City of Chicago is labeled.]
Figure A4: The shares of households in suburban Chicago pseudo-districts with no children enrolled in the public schools vary from 57% to 82% and tend to be lower near the metropolitan periphery.
REFERENCES

Huang, Zengyi, and Paul Williamson, 2001. "A Comparison of Synthetic Reconstruction and Combinatorial Optimisation Approaches to the Creation of Small-Area Microdata." Working Paper 2001/2, Population Microdata Unit, Department of Geography, University of Liverpool, UK, http://pcwww.liv.ac.uk/~william/microdata/workingpapers/hw_wp_2001_2.pdf.

MABLE, 2008. Missouri Census Data Center, http://mcdc2.missouri.edu/websas/geocorr2k.html.

Ryan, Justin, Hanna Maoh, and Pavlos Kanaroglou, 2009. "Population Synthesis: Comparing the Major Techniques Using a Small, Complete Population of Firms." Geographical Analysis 41 (2), 127-148.

Voas, David, and Paul Williamson, 2000. "An Evaluation of the Combinatorial Optimization Approach to the Creation of Synthetic Microdata." International Journal of Population Geography 6 (6), 349-366.

Voas, David, and Paul Williamson, 2001. "Evaluating Goodness-of-Fit Measures for Synthetic Microdata." Geographical and Environmental Modelling 5 (2), 177-200.

Williamson, Paul, 2007. "Confidentiality and Anonymised Survey Records: The UK Experience." In Gupta, Anil, and Ann Harding (eds.), Modelling Our Future: Population Ageing, Health and Aged Care, 387-413. International Symposia in Economic Theory and Econometrics, Volume 16. Elsevier B.V., Amsterdam, Netherlands.

Williamson, Paul, Mark Birkin, and Phil H. Rees, 1998. "The Estimation of Population Microdata by Using Data from Small Area Statistics and Samples of Anonymised Records." Environment and Planning A 30 (5), 785-816.