The Econometrics of Complex Survey Data: Volume 39

Theory and Applications

Table of contents (13 chapters)

Part I Sampling Design

Abstract

We examine sample characteristics and elicited survey measures in two studies: the Health and Retirement Study (HRS), where interviews are conducted either in person or by phone, and the Understanding America Study (UAS), where surveys are completed online and a replica of the HRS core questionnaire is administered. By considering variables in various domains, our investigation provides a comprehensive assessment of how Internet data collection compares with more traditional interview modes. We document clear demographic differences between the UAS and HRS samples in terms of age and education. Yet sample weights correct for these discrepancies and allow one to match population benchmarks satisfactorily on key socio-demographic variables. Comparison of a variety of survey outcomes with population targets shows a strikingly good fit for both the HRS and the UAS. Outcome distributions in the HRS are only marginally closer to population targets than those in the UAS. These patterns arise regardless of which variables are used to construct post-stratification weights in the UAS, confirming the robustness of these results. We find little evidence of mode effects when comparing the subjective measures of self-reported health and life satisfaction across interview modes. Specifically, we do not observe clear primacy or recency effects for either health or life satisfaction. We do, however, observe a significant social desirability effect on life satisfaction, driven by the presence of an interviewer. By and large, our results suggest that Internet surveys can match high-quality traditional surveys.
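
For readers unfamiliar with the weighting step the abstract leans on, here is a minimal sketch of cell-based post-stratification, assuming a toy age-by-education cross-classification; all shares and names are invented, and the UAS's actual weighting procedure is richer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population cell shares
# (rows: age 50-64 / 65+; cols: <HS, HS, college)
pop_share = np.array([[0.15, 0.25, 0.20],
                      [0.10, 0.20, 0.10]])

# Draw a sample whose composition is skewed relative to the population
n = 2000
cells = rng.choice(6, size=n, p=[0.25, 0.30, 0.15, 0.10, 0.15, 0.05])
samp_share = np.bincount(cells, minlength=6) / n

# Post-stratification weight: population share over sample share, per cell
w = pop_share.ravel()[cells] / samp_share[cells]

# Weighted cell shares now reproduce the population benchmarks
weighted_share = np.bincount(cells, weights=w, minlength=6) / w.sum()
print(np.allclose(weighted_share, pop_share.ravel()))  # True
```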

Abstract

For central banks that study the use of cash, merchants' acceptance of card payments is an important factor. Surveys to measure levels of card acceptance and the costs of payments can be complicated and expensive. In this paper, we exploit a novel data set from Hungary to assess the effect of stratified random sampling on estimates of payment card acceptance and usage. Using the Online Cashier Registry, a database linking the universe of merchant cash registers in Hungary, we create merchant- and transaction-level data sets. We compare county (geographic), industry, and store-size stratifications to simulate the usual stratification criteria for merchant surveys and examine the effect on estimates of card acceptance for different sample sizes. Further, we estimate logistic regression models of card acceptance and usage to assess how stratification biases estimates of their key determinants.
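
As a rough illustration of the simulation logic (not the paper's actual data or code), the sketch below treats a synthetic merchant universe as the truth and compares simple random sampling with size-stratified sampling for estimating card acceptance; all numbers are fabricated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic universe: store size class drives card acceptance
N = 100_000
size_class = rng.choice(3, size=N, p=[0.7, 0.25, 0.05])  # small/medium/large
accepts = rng.random(N) < np.array([0.35, 0.70, 0.95])[size_class]
true_rate = accepts.mean()
strata = [np.flatnonzero(size_class == s) for s in range(3)]

def srs_estimate(n):
    idx = rng.choice(N, size=n, replace=False)
    return accepts[idx].mean()

def stratified_estimate(n):
    # Proportional allocation across size strata, weighted back
    # by each stratum's population share
    est = 0.0
    for stratum in strata:
        share = stratum.size / N
        n_s = max(1, round(n * share))
        idx = rng.choice(stratum, size=n_s, replace=False)
        est += share * accepts[idx].mean()
    return est

reps = 500
srs = np.array([srs_estimate(500) for _ in range(reps)])
strat = np.array([stratified_estimate(500) for _ in range(reps)])
print(f"true {true_rate:.3f}  SRS sd {srs.std():.4f}  "
      f"stratified sd {strat.std():.4f}")
```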

Part II Variance Estimation

Abstract

When there are few treated clusters in a pure treatment or difference-in-differences setting, t tests based on a cluster-robust variance estimator can severely over-reject. Although procedures based on the wild cluster bootstrap often work well when the number of treated clusters is not too small, they can either over-reject or under-reject seriously when it is. In a previous paper, we showed that procedures based on randomization inference (RI) can work well in such cases. However, RI can be impractical when the number of possible randomizations is small. We propose a bootstrap-based alternative to RI, which mitigates the discrete nature of RI p values in the few-clusters case. We also compare it to two other procedures. None of them works perfectly when the number of clusters is very small, but they can work surprisingly well.
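
For orientation, the sketch below implements the standard restricted wild cluster bootstrap-t that this literature builds on, with Rademacher weights drawn at the cluster level; it is not the authors' proposed refinement, and the data-generating process is invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: 20 clusters of 30 observations, 5 treated clusters,
# cluster random effects, and a true treatment effect of zero
G, n_g = 20, 30
g = np.repeat(np.arange(G), n_g)
d = (g < 5).astype(float)
y = rng.normal(size=G)[g] + rng.normal(size=G * n_g)
X = np.column_stack([np.ones_like(d), d])

def ols_t(y, X, g, col=1):
    """OLS t-statistic with a cluster-robust variance (no small-sample factor)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    u = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in range(G):
        s = X[g == c].T @ u[g == c]
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv
    return beta[col] / np.sqrt(V[col, col])

t_hat = ols_t(y, X, g)

# Restricted residuals: impose the null of a zero treatment effect
u_r = y - y.mean()
t_boot = []
for _ in range(999):
    v = rng.choice([-1.0, 1.0], size=G)[g]  # one Rademacher draw per cluster
    t_boot.append(ols_t(y.mean() + u_r * v, X, g))

p = np.mean(np.abs(t_boot) >= abs(t_hat))   # symmetric bootstrap p value
print(f"t = {t_hat:.2f}, wild cluster bootstrap p = {p:.3f}")
```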

Abstract

Sampling units for the 2013 Methods-of-Payment survey were selected through an approximate stratified two-stage sampling design. To compensate for nonresponse and noncoverage and to ensure consistency with external population counts, the observations are weighted through a raking procedure. We apply bootstrap resampling methods to estimate the variance, allowing for randomness from both the sampling design and the raking procedure. We find that the variance is smaller when estimated through bootstrap resampling than through the naive linearization method, as the latter does not take into account the correlation between the variables used for weighting and the outcome variable of interest.
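
A minimal sketch of the raking step, assuming two invented weighting margins; in the chapter, this adjustment is repeated inside every bootstrap replicate so that its randomness propagates into the variance estimate.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented sample with two weighting variables
n = 1000
region = rng.choice(3, size=n, p=[0.5, 0.3, 0.2])
age_grp = rng.choice(2, size=n, p=[0.6, 0.4])
w = np.ones(n)                                  # base (design) weights

# External population counts the weighted margins must hit
pop_region = np.array([0.45, 0.35, 0.20]) * n
pop_age = np.array([0.55, 0.45]) * n

# Raking: alternately scale weights so each margin matches its target
for _ in range(50):
    for var, target in ((region, pop_region), (age_grp, pop_age)):
        totals = np.bincount(var, weights=w, minlength=len(target))
        w *= (target / totals)[var]

print(np.round(np.bincount(region, weights=w), 1), pop_region)  # margins match
print(np.round(np.bincount(age_grp, weights=w), 1), pop_age)
```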

Part III Estimation and Inference

Abstract

We extend Vuong's (1989) model-selection statistic to allow for complex survey samples. As a further extension, we use an M-estimation setting so that the tests apply to general estimation problems (such as linear and nonlinear least squares, Poisson regression, and fractional response models, to name just a few) and not only to maximum likelihood settings. With stratified sampling, we show how the difference in objective functions should be weighted in order to obtain a suitable test statistic. Interestingly, the weights are needed in computing the model-selection statistic even when stratification is appropriately exogenous, in which case the usual unweighted estimators of the parameters are consistent. With cluster samples and panel data, we show how to combine the weighted objective function with a cluster-robust variance estimator in order to expand the scope of the model-selection tests. A small simulation study shows that the weighted test is promising.
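
To convey the shape of the weighted statistic, the sketch below multiplies each unit's difference in per-observation objective values by its sampling weight before forming a Vuong-type t ratio; the data, weights, and pair of non-nested linear models are all invented, and this is not the chapter's full procedure.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)      # non-nested competing regressor
y = 1.0 + 0.5 * x1 + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)       # stand-in inverse-probability weights
sw = np.sqrt(w)

def wls_resid(z):
    """Residuals from a weighted least squares fit of y on (1, z)."""
    X = np.column_stack([np.ones(n), z])
    beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return y - X @ beta

# Per-observation objective values (negative squared errors, M-estimation)
q1 = -wls_resid(x1) ** 2
q2 = -wls_resid(x2) ** 2

d = w * (q1 - q2)                        # weighted contributions
t_stat = np.sqrt(n) * d.mean() / d.std(ddof=1)
print(f"Vuong-type weighted statistic: {t_stat:.2f}")  # large => prefer model 1
```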

Abstract

We show how to use a smoothed empirical likelihood approach to conduct efficient semiparametric inference in models characterized by conditional moment equalities when data are collected by variable probability sampling. Results from a simulation experiment suggest that the smoothed empirical likelihood-based estimator can estimate the model parameters very well in small to moderately sized stratified samples.
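
Smoothed, survey-weighted empirical likelihood for conditional moments is too involved to reproduce here, but the sketch below shows plain empirical likelihood for a single unconditional moment, which is the object the chapter generalizes; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(loc=1.0, size=200)    # toy data with true mean 1.0

def el_statistic(mu0):
    """Empirical likelihood ratio statistic for H0: E[x] = mu0."""
    g = x - mu0
    # Optimal probabilities are p_i = 1 / (n * (1 + lam * g_i)), where
    # lam solves sum(g_i / (1 + lam * g_i)) = 0; find lam by bisection
    lo, hi = -1 / g.max() + 1e-6, -1 / g.min() - 1e-6
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        if np.sum(g / (1 + lam * g)) > 0:
            lo = lam
        else:
            hi = lam
    return 2 * np.sum(np.log1p(lam * g))   # ~ chi2(1) under H0

print(f"EL statistic at the true mean: {el_statistic(1.0):.2f}")
```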

Abstract

Applied econometric analysis is often performed using data collected from large-scale surveys. These surveys use complex sampling plans in order to reduce costs and increase the estimation efficiency for subgroups of the population, and such plans result in unequal inclusion probabilities across units in the population. The purpose of this paper is to derive the asymptotic properties of a design-based nonparametric regression estimator, the local constant estimator, under a combined inference framework. This work contributes to the literature in two ways. First, it derives the asymptotic properties of the estimator, including its asymptotic normality, for the multivariate mixed-data case. Second, it uses least squares cross-validation to select the bandwidths for both continuous and discrete variables. Monte Carlo simulations assess the finite-sample performance of the design-based local constant estimator relative to the traditional local constant estimator for three sampling methods: simple random sampling, exogenous stratification, and endogenous stratification. Simulation results show that the estimator is consistent and that efficiency gains can be achieved by weighting observations by the inverse of their inclusion probabilities when the sampling is endogenous.
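
A minimal sketch of the design-based local constant (Nadaraya-Watson) estimator, with a fixed bandwidth rather than the chapter's cross-validated one: kernel weights are multiplied by survey weights 1/pi so that over-sampled units do not dominate the fit. The sampling scheme below is invented and endogenous by construction.

```python
import numpy as np

rng = np.random.default_rng(5)

# Population and an endogenous sampling scheme (inclusion depends on y)
n = 800
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
pi = np.where(y > 0, 0.8, 0.4)        # inclusion probabilities
keep = rng.random(n) < pi             # realized sample
x_s, y_s, w_s = x[keep], y[keep], 1.0 / pi[keep]

def local_constant(x0, h=0.1):
    """Design-based Nadaraya-Watson fit at x0 with a Gaussian kernel."""
    k = np.exp(-0.5 * ((x_s - x0) / h) ** 2)
    return np.sum(w_s * k * y_s) / np.sum(w_s * k)

grid = np.linspace(0.05, 0.95, 5)
print([round(local_constant(g), 2) for g in grid])
```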

Abstract

Nearest neighbor imputation has a long tradition for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the nearest neighbor imputation estimator for general population parameters, including population means, proportions, and quantiles. For variance estimation, we propose a novel replication variance estimator, which is asymptotically valid and straightforward to implement. The main idea is to construct replicates of the estimator directly from its asymptotically linear terms, instead of from individual records of variables. Simulation results show that nearest neighbor imputation and the proposed variance estimator provide valid inferences for general population parameters.
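
A bare-bones sketch of nearest neighbor imputation on simulated data: each nonrespondent's outcome is replaced by that of the respondent closest in the covariate. The chapter's replication variance estimator is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(6)

n = 500
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)
respond = rng.random(n) < 1 / (1 + np.exp(-x))   # response depends on x

donors_x, donors_y = x[respond], y[respond]
y_imp = y.copy()
for i in np.flatnonzero(~respond):
    j = np.argmin(np.abs(donors_x - x[i]))        # nearest donor in x
    y_imp[i] = donors_y[j]

# Complete-case mean is biased; NN imputation largely removes the bias
print(f"true mean {y.mean():.3f}, NN-imputed mean {y_imp.mean():.3f}, "
      f"complete-case mean {y[respond].mean():.3f}")
```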

Part IV Applications in Business, Household, and Crime Surveys

Abstract

We survey banks to construct national estimates of total noncash payments by type, payments fraud, and related information. The survey is designed to produce aggregate estimates of all payments in the United States from the responses of a representative, random sample. In 2016, the number of questions in the survey doubled compared with the previous survey, raising serious concerns about nonparticipation by smaller banks. To obtain sufficient response data for all questions from smaller banks, we administered a modified survey design which, in addition to randomly sampling banks, also randomly assigned one of several survey forms, each a subset of the full survey. While a variety of factors influenced response outcomes, we find that this planned missing data approach improved response outcomes for smaller banks. Such an approach may be especially important in an optional-participation survey, when reducing costs to respondents may affect success, or when imputation of unplanned missing items is already needed for estimation. The planned missing item design should be considered as a way of reducing survey burden or increasing unit-level and item-level response for individual respondents without reducing the full set of survey items collected.
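
A toy sketch of a planned missing ("matrix sampling") assignment of the kind described, assuming a core block asked of everyone plus three rotating item blocks; the item names and counts are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

items = [f"q{i}" for i in range(1, 13)]          # 12 items total
core = items[:3]                                  # asked of every bank
blocks = [items[3:6], items[6:9], items[9:12]]    # rotating subsets

n_banks = 9
forms = rng.integers(0, 3, size=n_banks)          # random form per bank
for bank, f in enumerate(forms):
    print(f"bank {bank}: {core + blocks[f]}")     # items this bank answers
```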

Abstract

Prior analyses of racial bias in New York City's Stop-and-Frisk program implicitly assumed that any bias of police officers did not vary by crime type and that officers' decisions about which type of crime to report as the basis for a stop were themselves unbiased. In this paper, we first extend the hit rates model to allow for crime-type heterogeneity in racial bias and for police officers' decisions about the reported crime type. Second, we reevaluate the program while accounting for heterogeneity in bias across crime types and for the sample selection that may arise from conditioning on crime type. We present evidence that differences in bias across crime types are substantial, and specification tests support incorporating corrections for selective crime reporting. However, the main findings on racial bias do not change sharply once this choice-based selection is accounted for.
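
A fabricated numerical illustration of why crime-type heterogeneity matters in the hit rates framework: in the counts below, pooled hit rates are identical across the two groups, yet the within-crime-type rates differ.

```python
# Fabricated (stops, hits) counts by group within two crime types
stops = {
    "weapons":  {"A": (1000, 100), "B": (1500, 225)},
    "property": {"A": (1000, 300), "B": (500, 175)},
}

pooled = {}
for crime, groups in stops.items():
    for grp, (n, h) in groups.items():
        tot = pooled.setdefault(grp, [0, 0])
        tot[0] += n
        tot[1] += h
        print(f"{crime:9s} group {grp}: hit rate {h / n:.3f}")

# Pooled rates are equal (0.2 vs 0.2) despite within-type differences
print({grp: round(h / n, 3) for grp, (n, h) in pooled.items()})
```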

Abstract

In 2014, the Colombian Government commissioned a unique national survey on illegal liquor. Interviewers purchased bottles of liquor from interviewees and tested them for authenticity in a laboratory. Two factors predict whether liquor is contraband (smuggled): (1) the absence of a receipt and (2) the presence of a discount offered by the seller. Neither factor predicts whether a bottle is adulterated. The results support an account in which sellers are complicit in a contraband economy, while whether buyers are complicit remains unclear. Buyers are, however, more likely to receive adulterated liquor when they specifically ask for a discount.

DOI: 10.1108/S0731-9053201939
Publication date: 2019-04-10
Book series: Advances in Econometrics
Editors: Kim P. Huynh, David T. Jacho-Chávez, and Gautam Tripathi
Series copyright holder: Emerald Publishing Limited
ISBN: 978-1-78756-726-9
eISBN: 978-1-78756-725-2
Book series ISSN: 0731-9053