Predictive Modeling of Population Demographic Information Using Multiple Imagery-Derived Spatial Features in Accra, Ghana
Andrew Copenhaver (Advisor: Dr. Ryan Engstrom), Department of Geography
Background and Objective
According to a 2014 United Nations report, over 50 percent of the global population now lives in urban areas, and forecasts indicate that urban populations will continue to grow, with most of that growth occurring in the global south. Sustainable growth that accounts for the health, economic wellbeing and social equity of a rapidly growing population, as well as the protection of the natural environment, requires timely, high-quality demographic data. Such data are frequently collected through a ten-year census; however, the high cost of a census has often precluded its use in the global south, where the potential negative effects of rapid urbanization are most pronounced. Additionally, the long gaps between censuses make the data hard to use effectively in places undergoing extremely rapid growth. This project examines the efficacy of combining machine learning regression models with satellite imagery processed by a variety of spatial feature extraction algorithms and spectral indices, such as NDVI, to predict population density and housing quality for Ghana’s capital city of Accra.
Methodology
Demographic Variable
Spatial and Spectral Information Extraction
Features were calculated at various block sizes (the size to which extraction outputs are aggregated) and scale sizes (the size of the pixel window from which contextual features are derived) using MapPy, an open-source Python library.
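The relationship between block size and scale can be illustrated with a minimal sketch. It does not use MapPy; the function name `block_scale_mean`, the choice of a simple windowed mean as the contextual feature, and the edge-clipping behavior are all assumptions made for illustration.

```python
from statistics import mean

def block_scale_mean(image, blk, sc):
    """Compute a windowed mean on a coarser block grid (illustrative only).

    image: 2-D list of pixel values (rows of equal length).
    blk:   block size in pixels; one output cell per blk x blk block.
    sc:    scale in pixels; the sc x sc window, centered on the block,
           from which the contextual feature is derived.
    """
    n_rows, n_cols = len(image), len(image[0])
    out = []
    for r0 in range(0, n_rows - blk + 1, blk):
        out_row = []
        for c0 in range(0, n_cols - blk + 1, blk):
            # Block center, then a window of size sc around it,
            # clipped at the image edges (an assumption of this sketch).
            rc, cc = r0 + blk // 2, c0 + blk // 2
            r_lo, r_hi = max(0, rc - sc // 2), min(n_rows, rc + sc // 2)
            c_lo, c_hi = max(0, cc - sc // 2), min(n_cols, cc + sc // 2)
            window = [image[r][c]
                      for r in range(r_lo, r_hi)
                      for c in range(c_lo, c_hi)]
            out_row.append(mean(window))
        out.append(out_row)
    return out

# Hypothetical 8 x 8 single-band image with values 0..63.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
features = block_scale_mean(image, blk=4, sc=8)
```

With `blk=4` the 8 x 8 input yields a 2 x 2 output, while `sc=8` lets each output cell draw on pixels well outside its own block.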
Spatial Feature, Block Size (BLK), Scale (SC)
NDVI, BLK4, SC16 | Mean
Band 4 Mean, BLK8, SC8 | Mean
LBPM Histogram Variance, BLK8, SC16 | Mean
Spatial feature extraction algorithms utilized by the study include: Fourier Transform, Histogram of Oriented Gradients (HoG), Linear Support Regions (LSR), Local Binary Pattern Moments (LBPM), and a texture-derived built-up presence index (PanTex)
A total of 120 spectral / spatial feature images were extracted from the original image mosaic
Map 1: Accra and the neighborhoods included in the study
Pixel Aggregation
Housing Quality
Resulting Image Product
Figure 3: A visualization of spatial feature extraction via the PanTex index (diagram labels: Block, Pixel)
Zonal Statistics
Using a Python script and the neighborhood shapefile, the mean, standard deviation and sum of each feature image were calculated for each neighborhood. This process tripled the total number of variables (120 image processing outputs × 3 zonal statistics = 360 variables).
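A minimal sketch of the zonal statistics step follows. It is not the study's script: it assumes the neighborhood shapefile has already been rasterized to a grid of neighborhood IDs aligned with the feature image, and it uses the population standard deviation, which the poster does not specify. The function name `zonal_statistics` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def zonal_statistics(feature, zones):
    """Mean, standard deviation and sum of feature values per zone.

    feature: 2-D list of feature values.
    zones:   2-D list of the same shape holding a neighborhood ID per
             cell, or None for cells outside every neighborhood.
    """
    values = defaultdict(list)
    for feat_row, zone_row in zip(feature, zones):
        for value, zone_id in zip(feat_row, zone_row):
            if zone_id is not None:
                values[zone_id].append(value)
    # pstdev (population SD) is an assumption of this sketch.
    return {
        zone_id: {"mean": mean(v), "sd": pstdev(v), "sum": sum(v)}
        for zone_id, v in values.items()
    }

# Hypothetical feature image and neighborhood raster.
feature = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
zones = [["A", "A", "B"],
         ["A", "B", None]]
stats = zonal_statistics(feature, zones)
```

Applying the same function to each of the 120 feature images gives three values per image per neighborhood, which is where the 360 variables come from.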
Because the number of variables (n=360) is large relative to the number of samples (n=95), various methods for variable extraction and variable selection were tested in order to combat the “curse of dimensionality”, overfitting and multicollinearity.
Principal Component Analysis (PCA)
• Simple feature extraction method
• Creates new variables by orthogonally transforming correlated data into n linearly uncorrelated principal components
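The orthogonal transformation can be shown on a two-variable toy case, where the rotation that decorrelates the data has a closed form. This is only a sketch: the study reduced 360 variables to 30 components, which requires a general eigendecomposition rather than the single rotation angle used here, and the function name `pca_2d` and the data are hypothetical.

```python
import math

def pca_2d(xs, ys):
    """Rotate two centered, correlated variables into two uncorrelated scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    # Entries of the 2 x 2 sample covariance matrix.
    sxx = sum(a * a for a in cx) / (n - 1)
    syy = sum(b * b for b in cy) / (n - 1)
    sxy = sum(a * b for a, b in zip(cx, cy)) / (n - 1)
    # Rotation angle that diagonalizes the symmetric covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]   # direction of largest variance
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]  # orthogonal direction
    return pc1, pc2

# Two strongly correlated hypothetical variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
pc1, pc2 = pca_2d(xs, ys)
```

The two score vectors have zero covariance, and the first carries most of the variance, which is the property that lets later components be dropped.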
Figure 1: Sabon Zongo, a densely populated neighborhood in Accra
Map 2: Site Location
Figure 4: A visualization of PCA
Data
Remotely Sensed Imagery:
• A high spatial resolution (2.4 m) Quickbird multispectral (blue, green, red, near-infrared) image mosaic captured January 13, 2010 (eastern portion) and February 10, 2010 (western portion)
• Mosaic is radiometrically aligned to a 2002 Quickbird image using a pseudoinvariant feature method
• Imagery is georeferenced to the Universal Transverse Mercator projection, Zone 30N
• Coverage of approximately 87% of the Accra Metropolitan Area (AMA)
Census Data:
• 2010 census data compiled at the scale of the enumeration area (similar to a US census tract), provided by the Ghana Statistical Service (GSS)
• Enumeration area data digitized into a shapefile and aggregated to neighborhoods
• While 100 neighborhoods comprise the AMA, only the 95 covered by the imagery were included
Genetic Algorithm Feature Selection (GAFS)
• Goal is to mimic natural selection
• Process includes subsetting predictors, evaluating the “fitness” of each subset by Root Mean Square Error (RMSE), “reproducing” fit features and outputting the most fit combination of variables
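The select, reproduce and mutate loop described above can be sketched as follows. This is not the study's implementation: the fitness function here is the leave-one-out RMSE of a 1-nearest-neighbor regressor, chosen only to keep the example dependency-free, and the population size, mutation rate, crossover scheme and synthetic data are all assumptions.

```python
import math
import random

def loo_rmse(X, y, mask):
    """Fitness: leave-one-out RMSE of a 1-nearest-neighbor regressor
    restricted to the columns where mask is True (stand-in model)."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return float("inf")  # an empty subset cannot predict anything
    sq_err = 0.0
    for i, row in enumerate(X):
        best_d, best_y = float("inf"), 0.0
        for k, other in enumerate(X):
            if k == i:
                continue
            d = sum((row[j] - other[j]) ** 2 for j in cols)
            if d < best_d:
                best_d, best_y = d, y[k]
        sq_err += (y[i] - best_y) ** 2
    return math.sqrt(sq_err / len(X))

def genetic_feature_selection(X, y, pop_size=20, generations=15,
                              mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    # Initial population: random subsets plus the all-features subset.
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size - 1)]
    population.append([True] * n_features)
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: loo_rmse(X, y, m))
        parents = ranked[: pop_size // 2]       # selection: keep the fitter half
        children = [ranked[0][:]]               # elitism: best subset survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each inclusion flag with small probability.
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    best = min(population, key=lambda m: loo_rmse(X, y, m))
    return best, loo_rmse(X, y, best)

# Synthetic data: only columns 0 and 2 carry signal.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(6)] for _ in range(30)]
y = [5 * row[0] + 3 * row[2] for row in X]
best_mask, best_rmse = genetic_feature_selection(X, y)
```

Because the best subset of each generation is carried forward unchanged, the returned RMSE can never be worse than that of the all-features subset seeded into the initial population.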
Band 3 Mean, BLK4, SC16 | Mean
LSR Line Variance, BLK4, SC8 | Mean
LBPM Histogram Variance, BLK8, SC8 | SD
LBPM Histogram Variance, BLK8, SC32 | Mean
Selects all variables associated with the response variable
Figure 5: The general genetic algorithm feature selection process
Figure 6: The k-fold cross-validation process (All Data; Partition Data; Train on k-1 partitions and test on the remaining one for Test 1, Test 2, ..., Test k; Average R² and RMSE across k folds)
LBPM Histogram Skew, BLK4, SC8
Mean
LBPM Histogram Skew, BLK8, SC8
Mean
LSR Line Length Sum, BLK4, SC8
Mean
Mean
LBPM Histogram Variance, BLK4, SC32
LBPM Histogram Variance, BLK8, SC32
SD
LSR Line Length Mean, BLK4, SC16
NDVI, BLK8, SC8
Band 4 Mean, BLK4, SC32
Population Density
SD
SD
SD
SD Mean Mean
For housing quality, the genetic algorithm selected 128 variables, including at least one output from every spatial feature extraction method and spectral index / filter. For population density, GAFS selected 218 variables, again with at least one output from each image processing method.
Dimension Reduction Technique | Regressor | RMSE | Coefficient of Determination (R²)
Population Density (mean value: 15,161.7)
PCA | Random Forests | 6,839.559 | 0.784
PCA | Support Vector Machine | 7,727.560 | 0.740
VSURF | Random Forests | 5,183.460 | 0.864
GAFS | Support Vector Machine | 7,332.576 | 0.761
Housing Quality (mean value: 1.21)
PCA | Random Forests | 0.158 | 0.553
PCA | Support Vector Machine | 0.153 | 0.577
VSURF | Random Forests | 0.133 | 0.680
GAFS | Support Vector Machine | 0.161 | 0.536
Table 3: The explanatory and predictive performance of the models
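For reference, the two metrics reported in Table 3 are commonly defined as below, where y_i is the observed value for neighborhood i, ŷ_i the predicted value, ȳ the mean of the observed values, and n the number of neighborhoods in the evaluation set. The poster does not state which definition of R² was used (some software reports the squared correlation between observed and predicted values instead), so the second formula is an assumption.

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]
```

RMSE is expressed in the units of the demographic variable, so values for population density and housing quality are not directly comparable with each other.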
Conclusions and Future Work
Model Evaluation
Because of the small number of observations, traditional holdout validation was avoided so that the regressors could be trained on all data points. Instead, an internal validation technique known as repeated k-fold cross-validation (k=10, repeats=50) was used.
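Repeated k-fold cross-validation with k=10 and 50 repeats can be sketched as below. A one-variable least-squares line stands in for the SVM and RF regressors, and the 95 synthetic samples, the helper names and the random seeds are assumptions made for illustration.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for one predictor (stand-in model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def repeated_kfold_rmse(xs, ys, k=10, repeats=50, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    fold_rmses = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)                       # new random partition each repeat
        folds = [order[i::k] for i in range(k)]  # k roughly equal partitions
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            # Train on k-1 partitions, test on the remaining one.
            slope, intercept = fit_line([xs[i] for i in train_idx],
                                        [ys[i] for i in train_idx])
            sq = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
            fold_rmses.append(math.sqrt(sum(sq) / len(sq)))
    # Average the metric across all folds and repeats.
    return sum(fold_rmses) / len(fold_rmses), len(fold_rmses)

# Synthetic data with the same sample count as the study (n=95).
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(95)]
ys = [2 * x + 1 + data_rng.gauss(0, 1) for x in xs]
mean_rmse, n_fits = repeated_kfold_rmse(xs, ys)
```

Every observation is used for both training and testing across the folds, and repeating the partitioning 50 times reduces the dependence of the averaged RMSE on any single random split.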
Mean
SD
Density
LBPM Histogram Kurtosis, BLK4, SC8
LBPM Histogram Kurtosis, BLK4, SC8
Housing Quality
SVM and RF regressions were created for population density and housing quality using all 30 PCA-extracted variables. RF models were also created using the VSURF-selected variables, and SVM models were created using the GAFS-selected variables.
Mean
Mean
Population
Removes redundancy in variables
Using the variables extracted or selected by the dimensionality reduction phase of the project as predictors, eight regression models were generated to predict population density and housing quality
LBPM Histogram Kurtosis, BLK8, SC8
Mean
Demographic Variable | Mean Value of Demographic Variable
Calculates Random Forest permutation importance scores and removes unimportant variables
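The idea of permutation importance can be sketched without a Random Forest: shuffle one predictor at a time and measure how much the prediction error grows. This is not the VSURF procedure itself, which works with Random Forest out-of-bag samples and its own thresholding rules; the 1-nearest-neighbor stand-in model, the held-out split and the synthetic data here are assumptions.

```python
import random

def predict_1nn(train_X, train_y, row):
    """Predict with the target of the closest training sample (stand-in model)."""
    best = min(range(len(train_X)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(train_X[k], row)))
    return train_y[best]

def mse(train_X, train_y, test_X, test_y):
    errs = [(t - predict_1nn(train_X, train_y, row)) ** 2
            for row, t in zip(test_X, test_y)]
    return sum(errs) / len(errs)

def permutation_importance(train_X, train_y, test_X, test_y, seed=0):
    """Increase in held-out MSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(train_X, train_y, test_X, test_y)
    importances = []
    for j in range(len(test_X[0])):
        column = [row[j] for row in test_X]
        rng.shuffle(column)  # break the link between column j and the target
        permuted = [row[:j] + [v] + row[j + 1:]
                    for row, v in zip(test_X, column)]
        importances.append(mse(train_X, train_y, permuted, test_y) - baseline)
    return importances

# Synthetic data: column 0 drives the target, column 1 is noise.
data_rng = random.Random(7)
X = [[data_rng.random(), data_rng.random()] for _ in range(90)]
y = [10 * row[0] for row in X]
importances = permutation_importance(X[:60], y[:60], X[60:], y[60:])
```

A variable whose shuffling barely changes the error contributes little to prediction and is a candidate for removal; an informative variable shows a large increase.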
Two well-known machine learning regression methods were used in the study: Support Vector Machine (SVM) regression and Random Forests (RF) regression. The regression models were created using the “randomForest” and “e1071” R packages.
Zonal Statistic
Fourier Transform Radial Profile Mean, BLK4, SC32
HoG Histogram Skew, BLK4, SC8
LBPM Histogram Kurtosis, BLK8, SC8
Regression Models
Spatial Feature, Block Size (BLK), Scale (SC)
Variables Selected by GAFS
Variable Selection Using Random Forests (VSURF)
Mean
LBPM Histogram Variance, BLK4, SC16
LBPM Histogram Variance, BLK8, SC32
Dimensionality Reduction / Automated Variable Selection
Demographic Variable | Zonal Statistic
NDVI, BLK8, SC8
Spectral information calculated includes band means, global means and the Normalized Difference Vegetation Index (NDVI)
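NDVI is computed per pixel from the red and near-infrared bands as (NIR - Red) / (NIR + Red). The sketch below assumes the usual Quickbird band order, with band 3 as red and band 4 as near-infrared, and returns 0.0 where both bands are zero, which is a convention chosen for this example rather than something the poster specifies.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat empty pixels as zero NDVI
    return (nir - red) / total

def ndvi_image(nir_band, red_band):
    """Apply NDVI to two bands stored as 2-D lists of equal shape."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]

# Hypothetical reflectance values for a 2 x 2 patch.
nir_band = [[0.5, 0.3], [0.0, 0.2]]
red_band = [[0.1, 0.3], [0.0, 0.4]]
result = ndvi_image(nir_band, red_band)
```

Values range from -1 to 1; dense vegetation pushes the index toward 1, while built-up surfaces and bare soil sit near or below zero.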
Scale
Study Area
Figure 2: The general project workflow
Results
Table 2: VSURF selected variables for population density
Table 1: VSURF selected variables for housing quality
• Random Forests regressions built with variables selected by Variable Selection Using Random Forests (VSURF) appear to be the most suitable for predicting and explaining population density and housing quality
• Generally, models built using the PCA-transformed variables were the poorest performers, indicating that the information removed when reducing dimensionality in this manner may be useful
• Future work will include processing imagery with additional feature extraction methods, including lacunarity-based texture, Gabor filters, Structural Feature Sets (SFS), and Speeded Up Robust Features (SURF)
Bibliography
Genuer, R., Poggi, J. M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14), 2225-2236.
Pesaresi, M., Gerhardinger, A., & Kayitakire, F. (2008). A robust built-up area presence index by anisotropic rotation-invariant textural measure. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 1(3), 180-192.
Vandewater, L., Brusic, V., Wilson, W., Macaulay, L., & Zhang, P. (2015). An adaptive genetic algorithm for selection of blood-based biomarkers for prediction of Alzheimer's disease progression. BMC Bioinformatics, 16(Suppl 18), S1.
United Nations. (2014). World Urbanization Prospects 2014: Highlights. United Nations Publications.
Acknowledgements
This research was funded by the NASA grant entitled “The Urban Transition in Ghana and Its Relation to Land Cover and Land Use Change Through Analysis of Multi-scale and Multitemporal Satellite Image Data”, Grant Number: G00009708, NASA Award Number: NNX12AM87G.