Statistical estimation and inference with aggregated and displaced georeferenced data
This thesis addresses statistical inference for georeferenced data whose coordinates have been randomly displaced, or both aggregated and randomly displaced, as is often done to preserve respondents’ confidentiality. The resulting distortion in the location of the observations may compromise the validity of location-dependent estimates. The thesis explores various situations in which this trade-off between confidentiality and accuracy arises, including: 1) population density estimation, 2) estimation of health and demographic indicators for lower geographical domains, 3) regression analyses involving lower geographical domains (e.g., within multilevel models), and 4) regression analyses incorporating spatial covariates calculated or linked from external data sources based on geocoordinates. A measurement error model (MEM) is developed for the case of random displacement or aggregation. It is demonstrated under the Demographic and Health Survey (DHS) random displacement process by devising a new probability distribution for the displaced coordinates. Two methods, Kernel Density Estimation-based (KDE) and External Data-based Classification (EDC), are proposed within the MEM framework to approximate the conditional distribution of the true coordinates given the displaced ones. In addition, a novel method, KDE-ED, which combines KDE and EDC, is proposed to approximate this conditional distribution when the data are subject to both aggregation and random displacement. The KDE method approximates the unknown marginal distribution of the true coordinates with kernel density estimates and is implemented using the Stochastic Expectation-Maximization (SEM) algorithm. The EDC method approximates the marginal distribution of the true coordinates using external data sources and carries out estimation through numerical integration.
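The idea of approximating the conditional distribution of the true coordinates given the displaced ones can be illustrated with a minimal sketch. It is not the thesis's derivation: it assumes planar coordinates in kilometres, a displacement with a uniform random angle and a uniform random distance in [0, max_dist], and a hypothetical set of candidate locations with prior weights standing in for an external data source. The function names, the candidate grid and the value max_dist = 5.0 are all illustrative assumptions.

```python
import numpy as np

def displace(xy, max_dist, rng):
    """Shift each point by a uniform random angle and a uniform random
    distance in [0, max_dist] (an assumed, simplified displacement rule)."""
    n = xy.shape[0]
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    dist = rng.uniform(0.0, max_dist, size=n)
    offset = np.column_stack((dist * np.cos(angle), dist * np.sin(angle)))
    return xy + offset

def posterior_over_candidates(displaced, candidates, prior, max_dist, eps=1e-6):
    """Discrete approximation of p(true = candidate | displaced),
    proportional to prior(candidate) * f(displaced | candidate)."""
    r = np.linalg.norm(candidates - displaced, axis=1)
    # Under the assumed rule, the density of the displaced point at distance r
    # from the true point is 1 / (2*pi*r*max_dist) for r <= max_dist, else 0.
    # eps guards against division by zero when r == 0.
    lik = np.where(r <= max_dist,
                   1.0 / (2.0 * np.pi * np.maximum(r, eps) * max_dist),
                   0.0)
    weights = prior * lik
    total = weights.sum()
    if total == 0.0:
        raise ValueError("no candidate lies within max_dist of the displaced point")
    return weights / total

# Illustrative use: hypothetical candidate locations with equal prior weight.
rng = np.random.default_rng(42)
true_xy = np.array([[3.0, 4.0]])
observed = displace(true_xy, max_dist=5.0, rng=rng)[0]
grid = np.array([[x, y] for x in range(0, 11) for y in range(0, 11)], dtype=float)
prior = np.full(grid.shape[0], 1.0 / grid.shape[0])
post = posterior_over_candidates(observed, grid, prior, max_dist=5.0)
```

Replacing the equal prior weights with weights taken from external data (for example, gridded population counts) is what lets such a posterior favour plausible true locations; summing over the candidates plays the role of numerical integration in this simplified setting.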
The MEM and the proposed KDE and EDC methods are used to address all four location-dependent estimation problems listed above. KDE and EDC can be used directly to estimate population densities or domain parameters while accounting for random displacement, or for both aggregation and displacement errors. In addition to the EDC (or KDE)-based algorithm, a new method incorporating a parametric bootstrap bias correction (BC) is proposed to obtain improved estimates of the parameters of the linear mixed model by correcting the misplacement error caused by random displacement. Furthermore, EDC (or KDE) can be used within regression calibration (EDC-RC) to improve the estimation of spatial covariate effects in a linear regression model under random displacement. For this last situation, an alternative estimator that applies only a non-parametric bootstrap bias correction to the usual OLS estimator is also proposed. The performance of all the estimators developed, and of the variance estimators proposed for them, is assessed via simulation exercises and illustrated using real data from the 2011 Bangladesh DHS.
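The bootstrap bias correction of an OLS estimator mentioned above can be sketched in its textbook form: resample observations with replacement, refit, estimate the bias as the mean bootstrap estimate minus the original estimate, and subtract it. This is only a generic illustration; how the displacement mechanism enters the resampling in the thesis is not described in this abstract, and the function names, the number of replicates and the simulated data are assumptions.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients via a least-squares solve."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bootstrap_bias_corrected_ols(X, y, n_boot=500, seed=0):
    """Non-parametric (pairs) bootstrap bias correction of OLS.

    Returns the bias-corrected estimate, the original estimate and the
    matrix of bootstrap replicates (one row per replicate)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_hat = ols(X, y)
    boot = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        boot[b] = ols(X[idx], y[idx])
    bias = boot.mean(axis=0) - beta_hat    # bootstrap estimate of the bias
    beta_bc = beta_hat - bias              # equals 2 * beta_hat - mean(boot)
    return beta_bc, beta_hat, boot

# Illustrative use on simulated data (values are arbitrary).
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack((np.ones_like(x), x))
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)
beta_bc, beta_hat, boot = bootstrap_bias_corrected_ols(X, y)
se_boot = boot.std(axis=0, ddof=1)         # bootstrap standard errors
```

The replicates in boot can also feed a simple variance estimate, as in the last line, although the variance estimators proposed in the thesis may differ.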
https://eprints.soton.ac.uk/484015/
https://eprints.soton.ac.uk/484015/1/Final_PhD_Thesis_Submission_Jamal_Hossain_PDF_A.pdf