Small area estimation of non-monetary poverty with geospatial data

A proliferation of geospatial data obtained from satellites and mobile devices – as well as research demonstrating that satellite and mobile phone records are strongly correlated with household welfare – has sparked great interest in statistical methods that combine these geospatial data with survey data [1, 2, 3, 4, 5]. A key motivation for complementing survey data with comprehensive geospatial data is the potential for small area estimation, which can produce more precise and granular estimates of socioeconomic indicators. Geospatial data are well-suited for this application because, like a census, they are geographically comprehensive and not subject to selection bias. Established methods for small area estimation typically use either “unit-level” or “area-level” models, depending on the availability of household-level auxiliary data. In practice, unit-level models typically use household survey data and contemporaneous household census data: a prediction model of household welfare is first estimated in the survey data, and the estimated model parameters are then used to simulate welfare in the census [6, 7, 8]. Simulations of household welfare are then aggregated to the target area level. Because this method relies on census data, however, strong assumptions are typically required to use survey data to update estimates in the absence of a census. Census data are typically dated and collected once per decade at best, which presents a significant hurdle to deriving timely small area poverty estimates.
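The survey-to-census procedure used by unit-level models can be illustrated with a deliberately simplified sketch. It assumes a single covariate, an ordinary least squares prediction model with normally distributed errors, and no area random effects or parameter uncertainty, all of which are simplifications relative to the methods in [6, 7, 8]. The function names and the toy numbers are hypothetical and serve only to illustrate the three steps: estimate in the survey, simulate in the census, aggregate to areas.

```python
import random
from collections import defaultdict


def fit_simple_ols(x, y):
    """Fit welfare = intercept + slope * x by least squares on survey data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return intercept, slope, sigma


def simulate_area_poverty(census, intercept, slope, sigma, poverty_line,
                          n_sims=200, seed=0):
    """Simulate welfare for each census household and average by area.

    `census` is a list of (area_id, covariate) pairs, one per household.
    """
    rng = random.Random(seed)
    poor_share_sum = defaultdict(float)
    household_count = defaultdict(int)
    for area, x in census:
        poor = 0
        for _ in range(n_sims):
            welfare = intercept + slope * x + rng.gauss(0.0, sigma)
            poor += welfare < poverty_line
        poor_share_sum[area] += poor / n_sims
        household_count[area] += 1
    # Aggregate simulated household poverty to the target area level.
    return {a: poor_share_sum[a] / household_count[a] for a in poor_share_sum}


# Hypothetical toy data: survey covariate and welfare, census covariate only.
survey_x = [1, 2, 3, 4, 5, 6]
survey_y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
census = [("A", 1.0), ("A", 1.5), ("B", 5.0), ("B", 5.5)]

intercept, slope, sigma = fit_simple_ols(survey_x, survey_y)
rates = simulate_area_poverty(census, intercept, slope, sigma, poverty_line=3.0)
```

In this toy example, area A, whose households have low covariate values, receives a much higher simulated poverty rate than area B. The sketch also makes the census dependence explicit: without household-level census covariates there is nothing to simulate over.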

To overcome these challenges, we link census data with geospatial and remote-sensing auxiliary data at the village level. Village in this case refers to the lowest geographic level at which spatial auxiliary data can be merged with household surveys containing data on welfare and poverty. Small area refers to the target level for the poverty estimates, and regions are the lowest level for which the household budget survey is representative. In recent household surveys, GPS coordinates of households, enumeration areas, or villages are often available, which allows analysts to link survey data with other auxiliary sources of remote-sensing or geospatial data and improve the precision of poverty estimates.

The exact locations of survey respondents are guarded very closely by statistical agencies to preserve anonymity. However, village IDs or the locations of enumeration areas (EAs) can be obtained in some cases, enabling the survey to be linked to auxiliary data at the EA or village level. Linking surveys with geographically aggregated auxiliary data, whether derived from census or geospatial sources, is less efficient than using household-level data from a census. On the other hand, it offers the key benefit of allowing the auxiliary data to be included directly in the prediction model. This is typically not feasible when using census data at the household level, because of the confidentiality concerns associated with linking survey data with census data for individual households. Therefore, unit-level models using census auxiliary data typically use variables common to the census and survey and assume that these prediction variables are drawn from the same underlying distribution. This assumption becomes problematic when there is a large time gap, or there are differences in questions, between the census and the survey.

Using geospatial auxiliary data in particular offers further advantages: spatial auxiliary data are collected continuously, eliminating any concerns about a time gap between survey and auxiliary data, and are highly predictive of village-level poverty. Both the quality and availability of geospatial data are improving rapidly. Finally, geospatial data are geographically comprehensive and representative, as opposed to mobile phone data, which often represent only a portion of the population and are more difficult to obtain.

While there are many studies seeking to predict welfare, poverty or other development outcomes with a combination of survey data and geospatial data [1, 2, 4, 13, 14, 15], only a few have evaluated and quantified the gain in the precision of small area estimates achieved by supplementing survey data with geospatial data. This study uses data from Sri Lanka and mainland Tanzania to assess the feasibility of combining traditional household survey data with satellite and remote-sensing data to improve the precision and accuracy of small area estimates of non-monetary poverty. These two countries were selected due to the availability of census data with geo-referencing information that can be matched with spatial features at the village level, which in this context refers to GN Divisions in Sri Lanka and villages in Tanzania. The proposed welfare prediction model uses survey data to estimate household welfare as a function of village characteristics, and therefore differs from standard small area estimation models that predict welfare using a mix of household and small area characteristics. The resulting estimates provide a large efficiency gain compared with estimates based solely on the household survey, which we refer to as direct survey estimates.

We mainly consider Empirical Best Predictor (EBP) models, which have a long history in small area estimation and can accommodate village-level auxiliary data to produce estimates of poverty rates and their mean squared error for small areas. When auxiliary data are aggregated at a geographic level such as a village, EBP models have an important advantage over alternative methods, such as the ELL method [6] and the “M-quantile” method [16], because they condition on the household-level survey data and therefore combine them effectively with village-level auxiliary data. Empirical evidence that EBP is far more efficient than ELL in this context is discussed below, in Section 5. When using household-level auxiliary data, ELL can outperform EBP in terms of relative bias and relative root mean squared error in certain situations [17].

We compare the estimates generated by a household EBP model with direct estimates obtained solely from the survey, and with estimates from the well-known Fay-Herriot area-level model [18]. The latter sacrifices precision by discarding the variation in the geospatial indicators across villages within small areas. Once the small area estimates are obtained from the household EBP and Fay-Herriot models, we compare them to non-monetary poverty rates calculated directly from the full census. The availability of census data provides a credible benchmark to establish the feasibility of the method and assess how different methods and their variants perform, in terms of the accuracy of both the small area point estimates and their confidence intervals. We compare the predictions from different methods in terms of their precision, their accuracy, and their coverage rate. The coverage rate is defined as the share of small areas for which the estimated 95 percent confidence interval for the small area non-monetary poverty rate contains the census non-monetary poverty rate.
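The coverage rate defined above can be computed in a few lines. The sketch below is illustrative only: it assumes symmetric normal-approximation intervals of the form estimate ± 1.96 × standard error, with the standard error taken as the square root of the estimated mean squared error, which may differ from how the intervals are constructed in the paper. The function name and the numbers are hypothetical.

```python
def coverage_rate(estimates, standard_errors, census_rates, z=1.96):
    """Share of small areas whose interval contains the census poverty rate.

    Assumes symmetric intervals: estimate +/- z * standard_error.
    """
    covered = 0
    for est, se, truth in zip(estimates, standard_errors, census_rates):
        lower = est - z * se
        upper = est + z * se
        if lower <= truth <= upper:
            covered += 1
    return covered / len(estimates)


# Hypothetical small area estimates, standard errors, and census benchmarks.
estimates = [0.20, 0.35, 0.10, 0.50]
standard_errors = [0.02, 0.03, 0.01, 0.05]
census_rates = [0.22, 0.45, 0.11, 0.48]

rate = coverage_rate(estimates, standard_errors, census_rates)
```

In this toy example the interval for the second area, roughly 0.29 to 0.41, misses the census rate of 0.45, so three of the four areas are covered. A method with well-calibrated uncertainty estimates should yield a coverage rate close to the nominal 95 percent.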

The main result is that incorporating remote sensing data in an EBP framework substantially improves the accuracy and precision of small area estimates of non-monetary poverty relative to direct survey estimates. While the main efficiency improvements occur by incorporating information from non-sampled villages, there are also minor efficiency improvements from combining sample data with synthetic predictions in sampled villages. These gains come at no cost to coverage rates in Sri Lanka and a moderate cost in Tanzania, compared with standard direct survey estimates. The corresponding efficiency improvement is comparable to approximately tripling the size of the survey in Sri Lanka and quintupling it in Tanzania.
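One way to read the comparison with a larger survey is through the ratio of mean squared errors. The sketch below assumes that the variance of a direct estimate falls in proportion to 1/n, as under simple random sampling, so that a model-based estimate whose mean squared error is k times smaller is roughly as precise as a direct estimate from a survey k times as large. This is an illustrative interpretation rather than the paper's exact calculation, and the function name and numbers are hypothetical.

```python
def equivalent_sample_multiplier(direct_mse, model_mse):
    """Factor by which the survey would need to grow to match the model.

    Assumes the variance of the direct estimate is proportional to 1 / n.
    """
    if direct_mse <= 0 or model_mse <= 0:
        raise ValueError("mean squared errors must be positive")
    return direct_mse / model_mse


# Hypothetical values: the model-based MSE is one third of the direct MSE.
multiplier = equivalent_sample_multiplier(direct_mse=0.0036, model_mse=0.0012)
```

With these made-up inputs the multiplier is approximately 3, which corresponds to the "tripling" language used for Sri Lanka.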

EBP is an appealing framework in this context because it has become a popular and widely accepted method, and is straightforward to apply in well-documented software. However, in this context it moderately underestimates mean squared error, for two reasons. First, the EBP estimator fails to account for uncertainty in estimated variance parameters from the model. Second, when conditioning on the sample, the EBP estimator incorrectly assumes that sample observations are independent within small areas. Estimated coverage rates remain respectable, however, at 75 percent in Tanzania and 84 percent in Sri Lanka when the small area estimates are calibrated to ensure that their regional estimates match those derived from the household survey. This is comparable to the 76 percent coverage rate in both countries when using standard direct estimates. In Tanzania, the estimates from the unit-level model are roughly as accurate as, and moderately more efficient than, those from the area-level Fay-Herriot model. In Sri Lanka, where the poverty rate is low and the Fay-Herriot model is less predictive, the estimates from the unit-level model are substantially more accurate and efficient than the area-level Fay-Herriot estimates.
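The calibration step mentioned above can be sketched with simple ratio benchmarking, in which every small area estimate within a region is scaled by a common factor so that the population-weighted regional aggregate equals the direct survey estimate for that region. Ratio benchmarking is assumed here for illustration; the paper may use a different calibration rule. The function name and numbers are hypothetical.

```python
def ratio_benchmark(area_estimates, area_populations, regional_direct_estimate):
    """Scale area estimates so their weighted mean matches the regional estimate."""
    total_population = sum(area_populations)
    model_regional = sum(
        est * pop for est, pop in zip(area_estimates, area_populations)
    ) / total_population
    factor = regional_direct_estimate / model_regional
    return [est * factor for est in area_estimates]


# Hypothetical region with three small areas.
estimates = [0.10, 0.20, 0.40]
populations = [1000, 3000, 1000]
regional_survey_rate = 0.25

benchmarked = ratio_benchmark(estimates, populations, regional_survey_rate)

# Population-weighted aggregate of the benchmarked estimates.
benchmarked_regional = sum(
    est * pop for est, pop in zip(benchmarked, populations)
) / sum(populations)
```

Here the model-based regional rate is 0.22, so each area estimate is scaled up by 0.25 / 0.22 and the benchmarked aggregate matches the survey's regional rate of 0.25. A multiplicative adjustment preserves the ranking of areas, although in general it can push an estimate above one when rates are high.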

These results hold under a variety of robustness checks that explore alternative implementation options, including the omission of sample weights, a different transformation method in EBP, the application of a different model selection algorithm, the absence of benchmarking to survey estimates at higher levels, and the use of a noisier welfare measure. As a further robustness check, we test how EBP performs vis-à-vis the other most commonly used unit-level model, ELL, and show that EBP estimates are much more efficient than ELL estimates. When using the noisier welfare measure, the gain in efficiency in Sri Lanka is smaller, on the order of doubling the size of the sample, but the predictions remain accurate and coverage rates remain high.

This study makes three main contributions. First, it applies a commonly used framework for small area estimation to combine household survey data on well-being with geographically comprehensive geospatial indicators at a national level. To our knowledge, it is the first paper that applies the EBP framework to geospatial data outside of the 12 Iowa counties studied in [13]. Second, it evaluates the extent to which incorporating geospatial variables at the sub-area level improves the precision of small area poverty estimates, compared with direct survey estimates. Finally, it assesses which of two commonly used SAE models – unit-level models [6, 8, 21] and the Fay-Herriot area-level model [18] – is best suited for combining survey and sub-area geospatial data to produce efficient and accurate estimates of both area-level poverty rates and the uncertainty associated with them. The results taken together demonstrate that augmenting survey data with publicly available geospatial data enhances the accuracy and precision of small area estimates, and that unit-level models are preferred to area-level models when geospatial data are available at the sub-area level.

The remainder of the paper is organized as follows. Section 2 describes the data. Section 3 describes the methodology and estimators that are evaluated. Section 4 assesses results in terms of efficiency, accuracy, and coverage. Section 5 considers a variety of robustness checks of the main method. Section 6 concludes.
