Missing Data: data replacement and imputation

Journal of Modelling in Management

ISSN: 1746-5664

Article publication date: 29 June 2012

Citation

Hutcheson, G. (2012), "Missing Data: data replacement and imputation", Journal of Modelling in Management, Vol. 7 No. 2. https://doi.org/10.1108/jm2.2012.29707baa.002

Publisher: Emerald Group Publishing Limited

Copyright © 2012, Emerald Group Publishing Limited


Article Type: Tutorial section From: Journal of Modelling in Management, Volume 7, Issue 2

Datasets with missing values are ubiquitous in the social sciences. Although it is common to just ignore these missing data points and use “techniques” such as list-wise deletion that eliminate entire observations, this is not a particularly good strategy for analysis as it results in the loss of valuable information at best and severe selection bias at worst (King et al., 2001). The simple removal of cases that include missing data is, however, not the only option available to analysts, as individual data points may be replaced using a range of different techniques, for example, replacement by random values (within certain parameters), mean values, values predicted from regression models, or values imputed using a dedicated imputation procedure.

Even though the treatment of missing data is important, it is rarely dealt with or even acknowledged in much published research, and methods for handling it have been very slow to be adopted by non-specialist researchers. Some plausible explanations for this include the ignorance of researchers about the damaging effects that missing data can have and about the options that are available to remedy them. The difficulty associated with the procedures for imputing data, in combination with the lack of training and of efficient tools, has also been cited as a reason for the apparent reluctance shown by applied researchers to deal properly with incomplete datasets. However, recent developments in statistical software have now removed many of these problems.

This tutorial aims to highlight the damaging effects that missing data can have and also to demonstrate how easy it is to replace missing cases using readily available software. The emphasis is on providing examples that readers can analyze themselves and apply to their own data if they wish. An example dataset is made available, and some examples of common data replacement techniques and the effects that these have on the resulting models are also provided. A number of “simple” data replacement techniques are illustrated along with a multiple imputation technique. Although there are a number of packages that can be used to impute data, we use the open-source statistical library “Amelia” (Honaker et al., 2012), which is available free as part of the R statistical software system (R Development Core Team, 2012; for information about R, see the tutorial section in the Journal of Modelling in Management, 2010, Volume 5, Issue 3, which is available online at: www.emeraldinsight.com/products/journals/journals.htm?id=jm2), as this software is freely available to all and will run on any computer platform.

Demonstrating missing data analyses

The data in this tutorial have been made up purely for the purpose of demonstration and show the amount of ice cream sold by a single ice cream seller (kilos per day), the outdoor temperature at mid-day (degrees Fahrenheit) and the location where the ice cream was sold (three locations are compared – A, B and C). This example dataset does not contain any unusual or unexpected relationships between the variables and basically shows a strong positive relationship between sales and outdoor temperature (the higher the temperature, the more ice cream sold) and a much weaker relationship between location and sales. Table I shows the data and also includes a number of data points that are shaded (every third value of sales); these represent the “missing data” that are replaced or imputed later in this tutorial. As there is no particular order in the dataset, the missing data can be regarded as a random selection of points. A scatterplot showing the relationship between sales and temperature is shown in Figure 1, which also identifies the data points selected as missing. The object of this tutorial is to replace the missing data points and “recreate” the relationships in the original data.

Table I An example dataset showing ice cream consumption, outdoor temperature and location of sale

Figure 1 The relationship between sales and temperature

1. Modelling “sales” using all of the data (n=45)

A simple OLS regression model of “sales” using all of the data (n=45) is shown in Table II.

The regression model provides evidence for a significant positive relationship between “sales” and “temperature” and a non-significant relationship between “sales” and “location”.

Table II

2. Modelling “sales” after removing the missing cases (n=30)

After removing the 15 cases selected as missing, a regression model of sales on the 30 remaining cases is shown in Table III.

Using the reduced dataset, it is unsurprising that the relationships in the model are now less significant. This is particularly evident with “temperature”, which now shows a t-value of 2.6 (down from a value of 3.4 obtained for the model on all the data), a result which shows a significance level an order of magnitude below that of the complete dataset. The model computed on the reduced dataset underestimates the significance of the relationship between “sales” and “temperature” in the full dataset. It is not appropriate, therefore, to just leave out the missing cases when we analyse these data.

Table III
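The list-wise deletion used in this section can be sketched in a few lines. The example below is a language-neutral illustration in Python rather than the R workflow used in this tutorial, and the records are hypothetical toy values, not the Table I data; the function name `listwise_delete` is likewise an assumption made for illustration.

```python
# Toy records (hypothetical values, NOT the Table I data).
# None marks a missing value of sales.
records = [
    {"sales": 5.0, "temperature": 60, "location": "A"},
    {"sales": None, "temperature": 72, "location": "B"},
    {"sales": 8.5, "temperature": 80, "location": "C"},
    {"sales": None, "temperature": 65, "location": "A"},
    {"sales": 7.0, "temperature": 75, "location": "B"},
]


def listwise_delete(rows):
    """Keep only the rows in which every variable is observed."""
    return [row for row in rows
            if all(value is not None for value in row.values())]


complete_cases = listwise_delete(records)
print(len(records), "cases before deletion,", len(complete_cases), "after")
```

Note that the observed temperature and location values of the deleted rows are discarded along with the missing sales values, which is the loss of information described above.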

3. Replacing missing cases with the mean

One option that is often used to replace the missing data is to use the mean value of the variable (a best-guess for the data that is missing). Using this method of missing data replacement, all the missing data points for sales are replaced with the mean value of sales computed from the 30 un-shaded cases. For this example, all the shaded numbers are replaced by 6.98, increasing the number of cases back to 45. A regression model of sales on these 45 cases is shown in Table IV.

Table IV

The regression model of sales suggests that replacing the missing data with the mean has not been entirely successful, as the t-value for “temperature” is not a particularly good representation of the relationship between temperature and sales that was found in the model using the complete dataset (the significance level for “temperature” is now about ten times smaller). Although substitution with the mean value is a simple and popular method for replacing missing data, it is not always a particularly good method to use. The problem with replacement by the mean is shown graphically in Figure 2, where the “observed” and “imputed” data are identified on a single plot. The distribution of the imputed values is clearly not representative of the distribution of the observed data.

Figure 2 The relationship between sales and temperature when missing values for sales are replaced by the average (6.98)
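As a minimal sketch of mean replacement, the following Python example fills each missing value with the mean of the observed values. The sales figures are hypothetical (they are not the Table I data) and the function name `impute_mean` is an assumption made for illustration.

```python
from statistics import mean

# Hypothetical sales values; None marks a missing observation.
sales = [5.0, None, 8.5, None, 7.0, 3.5]


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


sales_filled = impute_mean(sales)
print(sales_filled)
```

Every imputed case receives the same value whatever its temperature, which is why the imputed points cannot follow the relationship seen in the observed data.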

4. Replacing missing cases with predictions from regression

Another option is to replace missing data with values for the variable predicted using regression. Basically, a regression model is computed for the available data and predictions are made for the missing data from this model. The regression model using data imputed using the regression procedure is provided in Table V.

The regression model of sales suggests that replacing the missing data with values predicted from regression has not been entirely successful, as the t-value for “temperature” is not a particularly good representation of the relationship between temperature and sales for the original, complete dataset (the significance level for “temperature” is about ten times larger than it should be). Although substitution with the predicted values from regression is a popular method for replacing missing data, it is not always a particularly good method to use. The problem with replacement by predicted values from regression is shown graphically in Figure 3, where the “observed” and “imputed” data are identified on a single plot. The distribution of the imputed values is clearly not representative of the distribution of the observed data.

Table V

Figure 3 The relationship between sales and temperature when missing values for sales are replaced by those predicted from regression
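Regression replacement can be sketched as follows. This Python example fits a one-predictor least-squares line to the observed cases and uses it to predict the missing sales values; it is a simplified illustration (the tutorial's model also includes location), the (temperature, sales) pairs are hypothetical, and the function name `fit_line` is an assumption.

```python
# Hypothetical (temperature, sales) pairs; None marks a missing sales value.
data = [(60, 4.0), (70, None), (80, 8.0), (65, None), (75, 7.0), (55, 3.0)]


def fit_line(pairs):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Fit the model on the observed cases only.
observed = [(x, y) for x, y in data if y is not None]
intercept, slope = fit_line(observed)

# Predict sales for the cases where it is missing.
imputed = [(x, y if y is not None else intercept + slope * x)
           for x, y in data]
print(imputed)
```

The imputed values lie exactly on the fitted line, with none of the scatter around it that real observations show, so the filled-in dataset makes the relationship look stronger than the observed cases alone support.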

5. Replacing missing cases with random values

Another option for replacing missing data is to use random values. There are many methods for computing random values; we have used a very simple one that just computes random values within the range of the data. For our data (the 30 non-missing cases), values of sales range between 0 and 12.58. A simple random number generator was used to provide the 15 missing values for these data which replaced the shaded values in Table I. The regression model using these data is provided in Table VI.

Table VI

The regression model of sales suggests that replacing the missing data with random values has not been entirely successful, as the t-value for “temperature” is not a particularly good representation of the relationship between temperature and sales for the original, complete dataset (the significance level for “temperature” is about 30 times smaller than it should be). Although substitution with random values is a simple and popular method for replacing missing data, it is not always a particularly good method to use. The problem with replacement by random values is shown graphically in Figure 4, where the “observed” and “imputed” data are identified on a single plot. The distribution of the imputed values is clearly not representative of the distribution of the observed data.

Figure 4 The relationship between sales and temperature when missing values for sales are replaced by random values
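A simple version of random replacement, drawing values uniformly within the range of the observed data, might look like the Python sketch below. The tutorial does not say which random number generator or distribution was used, so the uniform draw, the seed, the sales values and the function name `impute_uniform` are all assumptions made for illustration.

```python
import random

# Hypothetical sales values; None marks a missing observation.
sales = [5.0, None, 8.5, None, 7.0, 0.0, 12.5]


def impute_uniform(values, rng):
    """Replace each missing value with a uniform draw from the observed range."""
    observed = [v for v in values if v is not None]
    low, high = min(observed), max(observed)
    return [rng.uniform(low, high) if v is None else v for v in values]


rng = random.Random(2012)  # fixed seed so the sketch is repeatable
sales_filled = impute_uniform(sales, rng)
print(sales_filled)
```

Because the draws ignore temperature entirely, the imputed values add noise that is unrelated to the predictor.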

6. Modelling “sales” after replacing missing cases using multiple imputation

Replacing missing data with mean values, values predicted directly from regression models, or random values has not proved to be an optimal strategy for our missing data. This is shown in the regression models, as the t-values for temperature are not comparable to those found in the analysis of the original dataset. The problem is also shown graphically in Figures 2-4, which show that the distribution of the “imputed” values is clearly not the same as the distribution of the non-missing data.

An alternative method for replacing missing data is to impute the values using multiple imputation, a technique that is becoming increasingly popular. The basic idea of multiple imputation, as proposed by Rubin (1987), involves the following steps:

  1. Impute missing values using an appropriate model that incorporates random variation;

  2. Do this M times, producing M “complete” datasets;

  3. Perform the desired analysis on each dataset using standard complete-data methods;

  4. Average the values of the parameter estimates across the M samples to produce a single-point estimate; and

  5. Calculate the standard errors by:

    • averaging the squared standard errors of the M estimates;

    • calculating the variance of the M parameter estimates across samples; and

    • combining the two quantities using an adjustment term (i.e. 1+1/M).
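Steps 4 and 5 above can be sketched in Python as follows. The sketch assumes the usual form of Rubin's combining rules, in which the total variance is the average squared standard error plus (1+1/M) times the sample variance (divisor M-1) of the estimates. The coefficients and standard errors are hypothetical numbers, not results from this tutorial, and the function name `pool_estimates` is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def pool_estimates(estimates, standard_errors):
    """Combine M per-dataset estimates and standard errors (steps 4 and 5)."""
    m = len(estimates)
    pooled = mean(estimates)                           # step 4: single-point estimate
    within = mean([se ** 2 for se in standard_errors])  # average squared standard error
    between = variance(estimates)                      # sample variance across datasets
    total = within + (1 + 1 / m) * between             # apply the 1 + 1/M adjustment
    return pooled, sqrt(total)


# Hypothetical coefficients and standard errors from M = 5 imputed datasets.
coefs = [0.20, 0.22, 0.18, 0.21, 0.19]
ses = [0.05, 0.06, 0.05, 0.05, 0.06]

estimate, pooled_se = pool_estimates(coefs, ses)
print(round(estimate, 4), round(pooled_se, 4))
```

The between-dataset term is what carries the uncertainty due to the missing data: a single imputed dataset analysed on its own has no such term and so tends to understate the standard errors.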

Multiple imputation may be computed using a variety of software and is demonstrated here using the open-source statistical library “Amelia” (Honaker et al., 2012), which is available free as part of the R statistical software system[1] (R Development Core Team, 2012). Amelia features an easy-to-use graphical user interface, which allows datasets to be easily loaded and the missing data analysed and imputed. A full description of the available options in Amelia is beyond the scope of this tutorial; however, this is unnecessary as Amelia has an excellent manual, example datasets and vignettes (examples of the program in use) available online at: www.cran.r-project.org/web/packages/Amelia. This tutorial will simply use Amelia to impute the missing data in our example and see how well this imputation method works for our simple dataset.

The procedure for imputing the data using Amelia is very simple. First, start R, load the Amelia library and start the graphical interface program using the command AmeliaView(). The AmeliaView opening screen is shown in Figure 5, which shows some of the options available for loading data (top graphic). Once a dataset is loaded, the variables can be defined in a number of different ways (for example, time-series, cross-sectional, nominal, ordinal, ID variables, etc.) and investigated using histograms and missingness plots; the missing values can then be imputed and the results investigated using diagnostics. Figure 5 shows the process of imputing the missing data.

Figure 5 Using Amelia. Load the data file, define “location” as a cross-section variable, press the “Impute!” button

Using the procedure outlined in Figure 5, Amelia produces a number of imputed datasets labelled:

  • DATAmissing-imp1.csv

  • DATAmissing-imp2.csv

  • DATAmissing-imp3.csv

  • DATAmissing-imp4.csv

  • DATAmissing-imp5.csv

These datasets contain the imputed values and may be saved in memory or saved to disk using a number of different formats (for example, csv, tab-delimited, Stata, R). Users can then run the analysis model on each imputed dataset and combine the results using the rules described in King et al. (2001) and Schafer (1997), or by using the Zelig library for R (Imai et al., 2012). The following regression model uses one of the imputed datasets to model “sales” (Table VII).

Table VII

The regression model of sales suggests that replacing the missing data using multiple imputation has been successful, as the t-value for “temperature” is now close to the value obtained from the original, complete dataset (3.565 compared to 3.435). Figure 6 shows plots of the first four imputed datasets and shows that the distribution of the imputed values is now much more representative of the distribution of the observed data.

Figure 6 The relationship between sales and temperature

Amelia does more than just impute missing data; it also offers diagnostics of the imputations which allow users to investigate how good the imputations are likely to be. Two diagnostic tools are to “compare” the distribution of the imputed values with the observed values and to “overimpute”, a technique which involves sequentially treating each of the observed values as if they had actually been missing. These techniques are described in detail in the manual and are shown for our data in Figure 7.

Figure 7 Diagnostics for imputed data; compare: the imputed data for the variable sales shows a similar distribution to the observed data (bottom left graphic)

7. Conclusion

When missing data were just ignored, the resulting model underestimated the significance of temperature, suggesting that the list-wise deletion procedure did not result in a particularly good model. Of all the methods we presented for replacing the missing data, multiple imputation proved to be the best. All of the models presented in this tutorial are summarised in Table VIII in relation to the significance of temperature. We can see that the multiple imputation model has quite accurately estimated the original significance of temperature whereas the other methods have proved to be far less successful.

Table VIII Summary of the results from the different models

This tutorial provided a very simple introduction to data imputation and illustrated how easy it is to impute data using readily available software. There are many packages that can be used to analyse and impute missing data and readers are advised to investigate these. Packages that we have found to be particularly useful are VIM (visualization and imputation of missing values, which offers some alternative methods to impute missing values, including K-nearest neighbour and hot-deck imputation; Templ et al., 2011) and GGOBI (www.ggobi.org; Cook and Swayne, 2007), which are both available for R and offer sophisticated missing data analyses. Given the ready availability of high quality software for analysing and imputing missing data, there should be little excuse for not imputing data that is missing.

Note

1. All analyses were run in R (R Development Core Team, 2012) via the R-commander graphical interface (Fox, 2005). All graphics were produced using the TikZ package (Tantau, 2010) in conjunction with the tikzDevice R library (Sharpsteen and Bracken, 2012) and the Qtikz software to edit (www.hackenberger.at/blog/ktikz-editor-for-the-tikz-language/).

Graeme Hutcheson, Maria Pampaka
Manchester University, Manchester, UK

References

Cook, D. and Swayne, D.F. (2007), Interactive and Dynamic Graphics for Data Analysis: With R and GGobi, Springer, New York, NY

Fox, J. (2005), “The R commander: a basic statistics graphical user interface to R”, Journal of Statistical Software, Vol. 14 No. 9, pp. 1–42

Honaker, J., King, G. and Blackwell, M. (2012), “Amelia II: A Program for Missing Data (version 1.6.1)”, available at: www.gking.harvard.edu/amelia

Imai, K., King, G. and Lau, O. (2012), “Package Zelig: Everyone’s statistical software (version 3.5.5)”, available at: www.cran.r-project.org/web/packages/Zelig/index.html

King, G., Honaker, J., Joseph, A. and Scheve, K. (2001), “Analyzing incomplete political science data: an alternative algorithm for multiple imputation”, American Political Science Review, Vol. 95, pp. 49–69

R Development Core Team (2012), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, available at: www.R-project.org

Rubin, D.B. (1987), Multiple Imputation for Nonresponse in Surveys, Wiley, New York, NY

Schafer, J.L. (1997), Analysis of Incomplete Multivariate Data, Chapman & Hall, London

Sharpsteen, C. and Bracken, C. (2012), “Package tikzDevice: A Device for R Graphics Output in PGF/TikZ Format”, available at: www.cran.r-project.org/web/packages/tikzDevice/index.html

Tantau, T. (2010), “The TikZ and PGF packages (version 2.10)”, available at: www.ctan.org/pkg/pgf, www.spurcforge.net/projects/pgf

Templ, M., Alfons, A., Kowarik, A. and Prantner, B. (2011), “Package VIM: visualization and imputation of missing values (version 3.0.0)”, available at: www.cran.r-project.org/package=VIM

Further Reading

Allison, P.D. (2000), “Multiple imputation for missing data: a cautionary tale”, Sociological Methods and Research, Vol. 28 No. 3, pp. 301–9

Durrant, G.B. (2009), “Imputation methods for handling item non-response in practice: methodological issues and recent debates”, International Journal of Social Research Methodology, Vol. 12 No. 4, pp. 293–304

Fox, J., Andronic, L., Ash, M., Bouchet-Valat, M., Boye, T., Calza, S., Chang, A., Grosjean, P., Heiberger, R., Karimi Pour, K., Jay Kerns, G., Lancelot, R., Lesnoff, M., Ligges, U., Messad, S., Maechler, M., Muenchen, R., Murdoch, D., Neuwirth, E., Putler, D., Ripley, B., Ristic, M. and Wolf, P. (2012), “Package Rcmdr (version 1.8-3)”, available at: www.cran.r-project.org/web/packages/Rcmdr/index.html

Horton, N.J. and Kleinman, K.P. (2007), “Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models”, The American Statistician, Vol. 61 No. 1, pp. 79–90

Schulte Nordholt, E. (1998), “Imputation: methods, simulation experiments and practical examples”, International Statistical Review, Vol. 66 No. 2, pp. 157–80
