Predictive Modeling of Population Demographic Information Using Multiple Imagery-Derived Spatial Features in Accra, Ghana

Andrew Copenhaver (Advisor: Dr. Ryan Engstrom), Department of Geography

Background and Objective

According to a 2014 United Nations report, over 50 percent of the global population now lives in urban areas, and population forecasts indicate that urban populations will continue to grow, with most of that growth occurring in the global south. Sustainable growth that considers the health, economic wellbeing and social equity of a rapidly growing population, as well as the protection of the natural environment, requires timely, high-quality demographic data. Demographic data is frequently collected through a ten-year census; however, the high cost of a census has often precluded its use in the global south, where the potential negative effects of rapid urbanization are most pronounced. Additionally, the long gaps between censuses can make the data difficult to use effectively in places undergoing extremely rapid growth. This project examines the efficacy of combining machine learning regression models with satellite imagery processed using a variety of spatial feature extraction algorithms and spectral indices, such as NDVI, to predict population density and housing quality for Ghana’s capital city of Accra.

Methodology

Demographic Variable

Spatial and Spectral Information Extraction

Features were calculated at various block sizes (the size to which extraction outputs are aggregated) and scale sizes (the size of the pixel window from which contextual features are derived) using MapPy, an open-source Python library
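The relationship between block size and scale can be illustrated with a minimal sketch. It assumes the context window is centered on each block and clipped at the image edge, and it uses a plain window mean as the feature; the function name `block_scale_mean` is hypothetical and this is not MapPy's actual API or implementation.

```python
def block_scale_mean(image, block, scale):
    """Mean of a scale x scale window centered on each block x block cell.

    Assumption: scale >= block, and windows are clipped at the image edges.
    """
    rows, cols = len(image), len(image[0])
    pad = (scale - block) // 2  # extra context pixels on each side of a block
    output = []
    for r0 in range(0, rows - block + 1, block):
        out_row = []
        for c0 in range(0, cols - block + 1, block):
            r_lo, r_hi = max(0, r0 - pad), min(rows, r0 + block + pad)
            c_lo, c_hi = max(0, c0 - pad), min(cols, c0 + block + pad)
            window = [image[r][c]
                      for r in range(r_lo, r_hi)
                      for c in range(c_lo, c_hi)]
            out_row.append(sum(window) / len(window))
        output.append(out_row)
    return output


# Hypothetical 4 x 4 image with pixel values 0..15 (not study data).
image = [[4 * r + c for c in range(4)] for r in range(4)]
print(block_scale_mean(image, block=2, scale=2))  # window equals the block
print(block_scale_mean(image, block=2, scale=4))  # window wider than the block
```

A larger scale at the same block size keeps the output grid identical but lets each output cell summarize more surrounding context, which is why the same feature appears in the tables at several BLK / SC combinations.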



Spatial Feature, Block Size (BLK), Scale (SC)

Mean

NDVI, BLK4, SC16

Band 4 Mean, BLK8, SC8

LBPM Histogram Variance, BLK8, SC16

Mean Mean

Spatial feature extraction algorithms used in the study include: Fourier Transform, Histogram of Oriented Gradients (HoG), Linear Support Regions (LSR), Local Binary Pattern Moments (LBPM), and a texture-derived built-up presence index (PanTex)

A total of 120 spectral / spatial feature images were extracted from the original image mosaic



Map 1: Accra and the neighborhoods included in the study

Pixel Aggregation

Housing Quality

Resulting Image Product

Figure 3: A visualization of spatial feature extraction via the PanTex index (diagram labels: block, pixel)

Zonal Statistics

Using a Python script and the neighborhood shapefile, the mean, standard deviation and sum of each feature image were calculated for each neighborhood. This process tripled the total number of variables (120 image processing outputs × 3 zonal statistics = 360 variables)
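The zonal statistics step can be sketched as follows. This is a minimal illustration, not the script used in the study: it assumes the neighborhood shapefile has already been rasterized to a grid of zone IDs aligned with the feature image, and the function name `zonal_statistics` is hypothetical.

```python
import math
from collections import defaultdict


def zonal_statistics(feature, zones):
    """Return {zone_id: (mean, sd, sum)} for a feature grid and a zone-id grid."""
    values = defaultdict(list)
    for feat_row, zone_row in zip(feature, zones):
        for value, zone in zip(feat_row, zone_row):
            if zone is not None:  # None marks pixels outside every neighborhood
                values[zone].append(value)
    stats = {}
    for zone, vals in values.items():
        total = sum(vals)
        mean = total / len(vals)
        # Population standard deviation; the poster does not state which form was used.
        sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[zone] = (mean, sd, total)
    return stats


# Hypothetical 2 x 3 feature image and matching zone grid (not study data).
feature = [[1.0, 2.0, 5.0],
           [3.0, 4.0, 7.0]]
zones = [[1, 1, 2],
         [1, 1, None]]
print(zonal_statistics(feature, zones))
```

Running this once per feature image yields three values per neighborhood, which is how 120 images become 360 candidate predictors.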





Because the number of variables (n=360) is large relative to the number of samples (n=95), several methods for variable extraction and variable selection were tested in order to combat the “curse of dimensionality”, overfitting and multicollinearity

Principal Component Analysis (PCA) 



Simple feature extraction method

Creates new variables by orthogonally transforming correlated data into n linearly uncorrelated principal components
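The orthogonal transformation behind PCA can be illustrated with a small standard-library sketch. It finds only the first principal component, using power iteration on the sample covariance matrix, whereas the study retained 30 components; the function name and the toy data are hypothetical, and the poster does not state which PCA implementation was used.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, scores) of the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each centered sample onto the component to get its score.
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centered]
    return vec, scores


# Two strongly correlated hypothetical variables (roughly y = 2x).
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]]
direction, scores = first_principal_component(data)
print(direction)
print(scores)
```

The two correlated input columns collapse onto a single direction, which is the sense in which PCA replaces many correlated image features with a smaller set of uncorrelated variables.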

Figure 1: Sabon and Zongo, a densely populated neighborhood in Accra

Map 2: Site Location

Figure 4: A visualization of PCA

Data

Remotely Sensed Imagery:

A high spatial resolution (2.4 m) Quickbird multispectral (blue, green, red, near-infrared) image mosaic captured January 13, 2010 (eastern portion) and February 10, 2010 (western portion)

Mosaic is radiometrically aligned to a 2002 Quickbird image using a pseudo-invariant feature method

Imagery is georeferenced to the Universal Transverse Mercator projection, Zone 30N

Coverage of approximately 87% of the Accra Metropolitan Area (AMA)

Census Data

2010 census data compiled at the scale of the enumeration area (similar to a US census tract), provided by the Ghana Statistical Service (GSS)

Enumeration area data digitized into a shapefile and aggregated to neighborhoods

While 100 neighborhoods comprise the AMA, only the 95 covered by the imagery were included

Genetic Algorithm Feature Selection (GAFS) 



Goal is to mimic natural selection

Process includes subsetting predictors, evaluating the “fitness” of each subset by Root Mean Square Error (RMSE), “reproducing” fit features, and outputting the fittest combination of variables
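The select, evaluate and reproduce loop can be sketched in a few lines. This is a generic genetic algorithm under stated assumptions, not the GAFS configuration used in the study: population size, crossover scheme, mutation rate and the toy fitness function (which stands in for the RMSE of a fitted regressor) are all hypothetical.

```python
import math
import random


def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Search for a feature mask (list of 0/1) that minimizes fitness(mask)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Rank subsets from most fit (lowest error) to least fit.
        population.sort(key=fitness)
        parents = population[:pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit with a small probability (mutation).
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return min(population, key=fitness)


# Toy data: the response depends only on features 0 and 2 (not study data).
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(6)] for _ in range(40)]
y = [row[0] + row[2] for row in X]


def subset_rmse(mask):
    # Stand-in fitness: predict y as the sum of the selected columns.
    pred = [sum(v for v, keep in zip(row, mask) if keep) for row in X]
    return rmse(pred, y)


best = genetic_feature_selection(6, subset_rmse)
print(best, subset_rmse(best))
```

Because the fitter half of each generation is carried over unchanged, the best error found never gets worse from one generation to the next.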





Mean

Band 3 Mean, BLK4, SC16

Mean

LSR Line Variance BLK4, SC8

SD

LBPM Histogram Variance, BLK8, SC8

Mean

LBPM Histogram Variance, BLK8, SC32





Selects all variables associated with the response variable



Figure 5: The general genetic algorithm feature selection process

Figure 6: The k-fold cross-validation process (all data are partitioned into k folds; the model is trained on k-1 partitions and tested on the remaining one for tests 1 through k; R² and RMSE are averaged across the k folds)

LBPM Histogram Skew, BLK4, SC8

Mean

LBPM Histogram Skew, BLK8, SC8

Mean

LSR Line Length Sum, BLK4, SC8

Mean

Mean

LBPM Histogram Variance, BLK4, SC32

LBPM Histogram Variance, BLK8, SC32

SD

LSR Line Length Mean, BLK4, SC16

NDVI, BLK8, SC8

Band 4 Mean, BLK4, SC32

Population Density

SD

SD

SD

SD Mean Mean

For housing quality, the genetic algorithm selected 128 variables, including at least one output from every spatial feature extraction method and spectral index / filter

For population density, GAFS selected 218 variables, again with at least one output from each image processing method

Table 3: The explanatory and predictive performance of the models

Demographic Variable   Mean Value   Dimension Reduction Technique   Regressor                RMSE        Coefficient of Determination (R²)
Population Density     15,161.7     PCA                             Random Forests           6,839.559   0.784
Population Density     15,161.7     PCA                             Support Vector Machine   7,727.560   0.740
Population Density     15,161.7     VSURF                           Random Forests           5,183.460   0.864
Population Density     15,161.7     GAFS                            Support Vector Machine   7,332.576   0.761
Housing Quality        1.21         PCA                             Random Forests           0.158       0.553
Housing Quality        1.21         PCA                             Support Vector Machine   0.153       0.577
Housing Quality        1.21         VSURF                           Random Forests           0.133       0.680
Housing Quality        1.21         GAFS                            Support Vector Machine   0.161       0.536

Conclusions and Future Work 

Model Evaluation

Due to the small number of observations, traditional holdout validation was avoided so that the regressors could be trained on all data points. Instead, an internal validation technique known as repeated k-fold cross-validation (k=10, repeats=50) was used
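Repeated k-fold cross-validation can be sketched as below. The sketch assumes RMSE and R² are computed on each held-out fold and then averaged, which the poster implies but does not spell out; the one-variable least-squares model and all function names are hypothetical stand-ins for the RF and SVM regressors.

```python
import math
import random


def repeated_kfold_indices(n_samples, k=10, repeats=50, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a new random partition for every repeat
        folds = [order[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx


def cross_validate(X, y, fit, predict, k=10, repeats=50):
    """Average RMSE and R^2 over every held-out fold."""
    rmses, r2s = [], []
    for train_idx, test_idx in repeated_kfold_indices(len(y), k, repeats):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        pred = [predict(model, X[i]) for i in test_idx]
        actual = [y[i] for i in test_idx]
        ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
        mean_a = sum(actual) / len(actual)
        ss_tot = sum((a - mean_a) ** 2 for a in actual)  # assumes a non-constant fold
        rmses.append(math.sqrt(ss_res / len(actual)))
        r2s.append(1 - ss_res / ss_tot)
    return sum(rmses) / len(rmses), sum(r2s) / len(r2s)


def fit_line(X, y):
    """Ordinary least squares on the first column only (toy model)."""
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    slope = (sum((x - mx) * (t - my) for x, t in zip(xs, y))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def predict_line(model, row):
    slope, intercept = model
    return slope * row[0] + intercept


# 95 hypothetical samples, matching the number of neighborhoods in the study.
X = [[float(i)] for i in range(95)]
y = [2.0 * row[0] + 1.0 for row in X]
print(cross_validate(X, y, fit_line, predict_line, k=10, repeats=2))
```

With k=10 and 50 repeats, each model is fitted and scored 500 times, and every observation is used for both training and testing without a separate holdout set.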

Mean

SD

Density





LBPM Histogram Kurtosis, BLK4, SC8

LBPM Histogram Kurtosis, BLK4, SC8

Housing Quality

SVM and RF regressions were created for population density and housing quality using all 30 PCA-extracted variables. In addition, RF models were created using the VSURF-selected variables and SVM models were created using the GAFS-selected variables

Mean

Mean

Population

Removes redundancy in variables

Using the variables extracted or selected by the dimensionality reduction phase of the project as predictors, eight regression models were generated to predict population density and housing quality

LBPM Histogram Kurtosis, BLK8, SC8

Mean

Demographic Variable

Mean Value of Demographic Variable

Calculates Random Forest permutation importance scores and removes unimportant variables
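The idea of permutation importance can be sketched independently of Random Forests: shuffle one predictor at a time and measure how much the prediction error grows. The sketch below is a simplification under stated assumptions; it uses a fixed toy predictor and a fixed threshold, whereas VSURF derives importance from a Random Forest and sets its threshold from the data. All names are hypothetical.

```python
import random


def mse(predict, X, y):
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(predict, shuffled, y) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores


# Toy data: y depends on column 0 only; column 1 is pure noise (not study data).
data_rng = random.Random(2)
X = [[data_rng.random(), data_rng.random()] for _ in range(30)]
y = [3.0 * row[0] for row in X]


def toy_model(row):
    return 3.0 * row[0]  # stand-in for a fitted regressor


scores = permutation_importance(toy_model, X, y)
threshold = 1e-9  # hypothetical cut-off for "unimportant"
kept = [j for j, s in enumerate(scores) if s > threshold]
print(scores, kept)
```

Variables whose shuffling leaves the error unchanged carry no information the model relies on, so they are the ones dropped at this stage.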

Two well-known machine learning regression methods were used in the study: Support Vector Machine (SVM) regression and Random Forests (RF) regression. The regression models were created using the “randomForest” and “e1071” R packages

Zonal Statistic

Fourier Transform Radial Profile Mean, BLK4, SC32

HoG Histogram Skew, BLK4, SC8

LBPM Histogram Kurtosis, BLK8, SC8

Regression Models 

Spatial Feature, Block Size (BLK), Scale (SC)

Variables Selected by GAFS

Variable Selection Using Random Forests (VSURF) 

Mean

LBPM Histogram Variance, BLK4, SC16

LBPM Histogram Variance, BLK8, SC32



Dimensionality Reduction / Automated Variable Selection

Zonal Demographic Variable Statistic

NDVI, BLK8, SC8

Spectral information calculated includes band means, global means and the Normalized Difference Vegetation Index (NDVI)
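For reference, NDVI is defined as (NIR - Red) / (NIR + Red). The sketch below applies that definition pixel by pixel; it assumes the red and near-infrared bands are supplied as equally shaped grids (bands 3 and 4 in the band order listed for the Quickbird mosaic), and the sample values are hypothetical rather than taken from the study imagery.

```python
def ndvi(nir, red):
    """Return NDVI = (NIR - Red) / (NIR + Red) for two equally shaped 2-D bands."""
    out = []
    for nir_row, red_row in zip(nir, red):
        row = []
        for n, r in zip(nir_row, red_row):
            total = n + r
            # Guard against division by zero on empty (no-data) pixels.
            row.append((n - r) / total if total != 0 else 0.0)
        out.append(row)
    return out


# Hypothetical reflectance values, not taken from the study imagery.
nir_band = [[0.50, 0.30], [0.20, 0.0]]
red_band = [[0.10, 0.30], [0.40, 0.0]]
print(ndvi(nir_band, red_band))
```

Values near +1 indicate dense vegetation, values near zero indicate bare or built-up surfaces, and negative values typically correspond to water.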



Scale

Study Area

Figure 2: The general project workflow

Results

Table 2: VSURF selected variables for population density

Table 1: VSURF selected variables for housing quality



Random Forests regressions built with variables selected by Variable Selection Using Random Forests (VSURF) appear to be the most suitable for predicting and explaining population density and housing quality

Generally, models built using the PCA-transformed variables were the poorest performers, indicating that the information removed when reducing dimensionality in this manner may be useful

Future work will include processing imagery with additional feature extraction methods, including lacunarity-based texture, Gabor filters, Structural Feature Sets (SFS), and Speeded Up Robust Features (SURF)

Bibliography 

Genuer, R., Poggi, J. M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14), 2225-2236.



Pesaresi, M., Gerhardinger, A., & Kayitakire, F. (2008). A robust built-up area presence index by anisotropic rotation-invariant textural measure. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 1(3), 180-192.



Vandewater, L., Brusic, V., Wilson, W., Macaulay, L., & Zhang, P. (2015). An adaptive genetic algorithm for selection of blood-based biomarkers for prediction of Alzheimer's disease progression. BMC bioinformatics, 16(Suppl 18), S1.



United Nations. (2014). World Urbanization Prospects 2014: Highlights. United Nations Publications.

Acknowledgements

This research was funded by the NASA grant entitled “The Urban Transition in Ghana and Its Relation to Land Cover and Land Use Change Through Analysis of Multi-scale and Multitemporal Satellite Image Data”, Grant Number: G00009708, NASA Award Number: NNX12AM87G