Emerald Group Publishing Limited
Copyright © 2011, Emerald Group Publishing Limited
Variable selection: towards a restricted-set multi-model procedure
Article Type: Tutorial section From: Journal of Modelling in Management, Volume 6, Issue 3
A popular research “strategy” is to collect a lot of data (usually from a questionnaire or survey) about a certain research area, identify those variables that are “significant” and then attempt to interpret these results in relation to the population. This type of research is very common, particularly for those projects that are essentially exploratory in nature and involve the analysis of many variables. The goal of such research is often to identify the “important” variables (with respect to describing processes in the population) and only include these in a final parsimonious model. There are many methods for achieving this and there is also a huge literature on variable selection that is worth consulting, particularly as there are a number of thorny problems related to this topic (Agresti, 2007; Harrell, 2001; Fox and Weisberg, 2011; Burnham and Anderson, 2002; Weisberg, 1985).
Even though the problems associated with variable selection are well known and documented, in my experience, researchers are often unaware of them or simply ignore them when reporting and interpreting their results. This is problematic, as the methods used for selecting the variables are crucial if a model is to be correctly interpreted. This tutorial aims simply to demonstrate some of the issues with variable selection and also provide some ideas about how they may be dealt with. In particular, the dangers of selecting a small subset of variables from a much larger set of contender variables are demonstrated, along with the problems associated with reporting a single best-fitting model. This tutorial advocates the use of a restricted-set approach accompanied by multi-model presentations.
It is useful at this point to describe a couple of analyses that are fairly typical of many reported in research papers and reports and where variable selection might be an issue.
Table I An analysis of deviance table for a multi-nomial logit model of the destination of travellers at Portuguese airports
Table II An analysis of deviance table from a binary logit model of union membership
1 Tourism in Portugal
Visitors to Portuguese airports were helped to complete a large questionnaire which contained over 100 questions and investigated their behaviour, profile and preferences for travel. Table I shows an analysis of deviance table for a multi-nomial logit model of the region of Portugal being visited (see Hutcheson and Moutinho, 2008, for a detailed explanation of the use of analysis of deviance tables with logit models). The analysis suggests that the type of tourism (TourType; golf, beach, cultural, etc.) is by far the most important criterion for travellers selecting a destination (it accounts for a reduction in deviance of 370.7), followed by the nationality of the traveller (country). Earnings, age and gender also show significant associations.
2 Union membership
Another study showing a fairly typical exploratory analysis used data from 534 persons to model the probability of someone being a member of a union (see Hutcheson and Moutinho, 2008, Chapter 3, for a detailed discussion of these data). This model suggests that union membership is associated with a number of different variables, primarily a person’s occupation, but also their wage level, age and gender. The analysis of deviance table for the model of union membership (in this case, a binary logit model) is shown in Table II.
Both of these models “make sense” and can easily be described in relation to the research questions the study was designed to address (or, maybe, the research questions that were derived from the analysis). There appears to be little “wrong” with the models, assuming of course that diagnostics have been investigated and any problems dealt with appropriately.
There are, however, some serious potential problems with interpreting these results, as we do not know how these models were derived or how many variables were considered for entry. This information, although crucial for interpreting the models, is often missing from published papers and reports (quite often, the only information presented is the set of statistics from the final model and a vague reference to which selection procedure was used). Of particular concern is the exploratory nature of the research and the use of a “design” that attempts to identify important relationships from a large number of variables.
In order to understand the problems associated with variable selection it is useful to demonstrate some of these. At a very basic level, there are two mistakes that can be made when selecting a model. First, variables that are not really important may be included in the model and, second, variables that are important may be left out. The definition of a model used here is one that is “useful” (Box, 1979): a model should aim to describe and predict the population, rather than the sample. The use of the word “important” here simply indicates a relationship that exists in the population rather than for a specific sample. The models in the examples shown above may well include variables that are not important for the population and may also omit other variables that are.
Problem 1: including non-important variables in the model
An effective demonstration of how unimportant variables can be included in a model can be made using simulated data. Numeric and categorical variables can be derived simply using any number of statistical or spreadsheet packages. For example, the following command (R Development Core Team, 2011):
generates 300 randomised cases from a normal distribution (using the function rnorm) with a mean of 61 and a standard deviation of 12 and saves it as “variable1”. Categorical variables can be derived in a similar fashion using the command:
which generates 300 randomised cases belonging to two groups from a binomial distribution (using the function rbinom) according to the defined probability (one category will have a probability of 0.57) and saves it as the variable “gender”. It is simple to obtain samples from other distributions (for example, the Weibull, Poisson or Gumbel) and also define different numbers of categories. Figure 1 shows the distributions of four variables that have been selected using commands similar to those shown above.
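The two commands referred to above are not reproduced in this text. A minimal sketch of R commands that match the description is given below; the exact form of the original commands is an assumption, as is the call to set.seed, which is added only so that the sketch is repeatable.

```r
# Assumed reconstruction of the kind of commands described in the text;
# the seed value is arbitrary and is not part of the original description.
set.seed(42)

# 300 randomised cases from a normal distribution, mean 61, standard deviation 12
variable1 <- rnorm(300, mean = 61, sd = 12)

# 300 randomised cases in two groups (coded 0/1); the group coded 1 has probability 0.57
gender <- rbinom(300, size = 1, prob = 0.57)

summary(variable1)   # sample mean should be close to, but not exactly, 61
table(gender)        # counts in each of the two categories
```

Because the values are random draws, the sample mean and the category proportions will differ slightly from the population values on each run (unless the seed is fixed).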
Using these simple commands, a number of datasets were constructed containing numeric variables that had a range of mean values and standard deviations, and categorical variables that had different numbers and sizes of categories. A simple automated script then selected a “best model” for each dataset based on one of the many “model-fit criteria” that can be used (e.g. AIC, BIC, Mallows’ Cp, R²). Such automated variable selection is very common and is provided by most statistical software (for example, a stepwise procedure).
Models derived from the simulated data
The following models were derived from datasets composed of 27 simulated variables (i.e. 27 variables that were independently generated and absolutely random). The variable selection procedure used was a backwards/forwards selection process based on the AIC criterion (for information about this, see Fox and Weisberg, 2011). The four models shown below are illustrative of the models obtained. These models are not particularly “unusual” with respect to the levels of significance and the number of variables retained in the models. In order to make things easier for description, the derived variables are given names such as Gender, Ethnicity, MathsScore, EnglishScore, Course, SES, FirstLanguage, etc. (Table III).
We could continue to run models of the “success” variable and obtain many different combinations of significant explanatory variables. The important thing to note is that the significant associations found in the models do not tell us anything about the actual relationship between the variables in the population. For example, although “Language” is highly significant in model 4, it would be wrong to conclude that “Language” and “success” are related “in the population”. These are just two random variables which showed an association purely due to chance. The variables included in the models above are all randomly and independently generated: any associations apply to the sample, rather than the population.
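The simulation described above can be sketched in a few lines of R. The script used to produce the models in this tutorial is not shown, so the following is only an illustration under stated assumptions: all 27 predictors are numeric (the tutorial's datasets also contained categorical variables), the variable names x1 to x27 and the seed are hypothetical, and the base R function step is used for the backwards/forwards AIC search.

```r
# Illustrative sketch only; names, seed and the all-numeric design are assumptions.
set.seed(2011)
n <- 300
n_vars <- 27

# 27 predictors generated independently of each other and of the response
predictors <- as.data.frame(matrix(rnorm(n * n_vars), nrow = n))
names(predictors) <- paste0("x", seq_len(n_vars))

# A binary "success" response that is pure noise
sim <- cbind(success = rbinom(n, size = 1, prob = 0.5), predictors)

# Start from the full binary logit model and let step() search in both directions on AIC
full_model <- glm(success ~ ., data = sim, family = binomial)
selected <- step(full_model, direction = "both", trace = 0)

# Variables retained by the automated search, with their coefficient tests
retained <- attr(terms(selected), "term.labels")
retained
summary(selected)$coefficients
```

Re-running the sketch with different seeds gives different sets of retained variables. Whatever is retained cannot reflect a relationship in the population, because none was built into the data.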
Selecting variables from a large number of candidates in this way is commonly known as data dredging, data trawling, post hoc data analysis, data snooping or data mining, and is recognised as a poor practice that often leads to incorrect interpretations of effects when applied to the population. Miller (1990) in a paper on selection bias warns that:
P-values from subset selection are totally without foundation, and large biases in regression coefficients are often caused by data-based model selection.
In their book on model selection and multi-model inference, Burnham and Anderson identify two types of data dredging. The first type is a highly interactive, data dependent, iterative post hoc approach (models are based on the data):
Standard inferential tests and estimates of precision are invalid when a final model results from data-dredging. Resulting “P-values” are misleading and there is no valid basis to claim “significance”.
A second type of data dredging uses automated selection procedures to trawl through a large number of models and find one that can be considered best fitting:
[…] one might consider Bonferroni adjustments of the α-levels or P-values. However, if there were 1,000 models, then the α-level would be 0.00005, instead of the usual 0.05! […] This approach is hardly satisfactory; thus analysts have ignored the issue and merely pretended that the data dredging is without peril and that the usual inferential methods somehow still apply.
The Bonferroni adjustment argument is very important, as automated selection procedures compare a large number of models. For example, an all-subsets regression model on the 27 variables used above (without taking account of any interactions) will compare over 130,000,000 models. A stepwise regression using a basic backward elimination procedure will compare 648 models in order to select the first model above. The final models given above are clearly the result of a two-stage process, making the parameter estimates and significance levels invalid.
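The arithmetic behind these figures is easy to check. The short sketch below counts the all-subsets models for 27 variables (each variable is either in or out of a model) and applies the Bonferroni division quoted above; the object names are illustrative.

```r
n_vars <- 27

# Each of the 27 variables is either included or excluded, ignoring interactions
n_models <- 2^n_vars
n_models                 # 134217728, i.e. over 130,000,000

# Bonferroni-adjusted significance levels
alpha <- 0.05
alpha / 1000             # 5e-05 for 1,000 models, as in the quotation
alpha / n_models         # the corresponding level for an all-subsets search
```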
Although this method of variable selection is demonstrably wrong (try running multiple models on your own simulated data, or randomise your response variable and re-run models using the actual explanatory variables), it is remarkably similar to how many models are derived. Data trawling is very common in social science research and is also frequently combined with simplistic interpretations of the regression coefficients.
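The second suggestion, randomising the response while keeping the actual explanatory variables, can be sketched as follows. The mtcars dataset that ships with R is used purely as a convenient stand-in and is not one of the datasets analysed in this tutorial; the seed is arbitrary.

```r
# Stand-in data: mtcars is an assumption made for illustration only.
set.seed(1)
shuffled <- mtcars

# Randomly reorder the response so that any real link with the predictors is broken
shuffled$mpg <- sample(shuffled$mpg)

# Automated backwards/forwards selection on the now-meaningless response
full <- lm(mpg ~ ., data = shuffled)
chosen <- step(full, direction = "both", trace = 0)

# Any variables retained here were selected by chance alone
summary(chosen)$coefficients
```

Repeating this with different seeds shows how often an automated procedure retains apparently useful variables for a response that has, by construction, no relationship with them.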
The conclusion I draw from these simulations is that large numbers of variables should not be considered when attempting to derive a model. A restricted set of hypotheses needs to be carefully defined and tested. This point is made clearly by Burnham and Anderson (2002, p. 47):
We cannot overstate the importance of the scientific issues, the careful formulation of multiple working hypotheses, and the building of a small set of models to clearly and uniquely represent these hypotheses […] We try to emphasize a more confirmatory endeavour in the applied sciences, rather than exploratory work that has become so common and has often led to so little (Anderson et al., 2000).
Problem 2: not including important variables in the model
In addition to including unimportant variables in a model, it is also possible to exclude variables which are important and thereby provide a misleading impression of the research findings. Leaving “important” variables out is an issue in social research as we tend to rely on the identification of a single “best model” to describe research findings. One of the major problems with selecting a single best-fitting model is that there are a number of different techniques that can be used to select a final model (forward selection, backward deletion or a combination of these) utilising a number of different model-fit criteria (AIC, BIC, QAIC, FIC, R², adjusted R², Mallows’ Cp, etc.). This often results in many “best-fitting” models each containing a different selection of variables, any of which may be chosen to represent the research findings. If you have enough time (and good enough software) you can probably find a model that you like!
The selection process is further complicated if the explanatory variables are related to one another (multi-collinearity) as model selection can become quite unstable with many models capable of being selected to represent the research findings. Related explanatory variables are extremely common in social science research and it is not unusual to see five or six separate indicators of individual factors (for example, socio-economic status may be “measured” using post code, salary, educational level, being a recipient of free school meals, type of school attended, etc.).
The problem of selecting a single model is clearly demonstrated in the following output from an all-subsets regression procedure conducted on a dataset that is highly interrelated (but not unusual). The Cars93 dataset used here is available as part of the “MASS” library (Venables et al., 2011), which is automatically loaded with Rcmdr (for a discussion of Rcmdr, see Hutcheson, 2010). The Cars93 dataset contains information on 93 new cars including price, miles per gallon (mpg) ratings, engine size, body size and indicators of features. Figure 2 shows how to load this dataset directly from the Rcmdr menus. Information about the dataset can be found using the menu options in Rcmdr (Help on selected dataset) and a full discussion of how it has been used in other analyses is available at: www.amstat.org/publications/jse/v1n1/datasets.lock.html, from where the data can also be downloaded.
Figure 2 Loading a demonstration dataset
The regsubsets function from the leaps package (Lumley, 2009) is used to plot a graphic of all possible models of mpg. A subset of these is shown in Figure 3, which displays the ten best-fitting models for each number of model parameters. The “best-fitting” models are those with the lowest BIC scores. The important thing to note here is that a large number of models have similar fit statistics. ANY of those that fall within the “cloud” of models at the bottom of the graphic are essentially as good fitting as each other according to the BIC criterion. There are, therefore, a multitude of models that can be legitimately selected on the basis of the model-fit criteria. The selection of a particular model is unlikely to be stable: use a different model-fit criterion, a slightly different sample, or remove “unimportant” variables from contention, and in each case you are likely to get a different selection of variables. Describing your research in relation to only one of these models will only include a small subset of the variables that are important. Some variables that are important will inevitably be left out. Selecting a best-fitting model from multi-collinear datasets is a very common problem and is one that has been encountered by most researchers. This issue is often “solved” by the analyst simply choosing to use one particular method to select a model and effectively ignoring any other model that may have been selected using another equally valid method.
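A sketch of this kind of all-subsets search is given below. It is not the code used to produce Figure 3: the text does not say which mpg rating was modelled or which variables were entered, so MPG.city as the response and the eight numeric candidates listed are assumptions made for illustration. The leaps package must be installed.

```r
library(leaps)
data(Cars93, package = "MASS")

# Assumed response and candidate variables (not stated in the text)
candidates <- c("Price", "EngineSize", "Horsepower", "Weight",
                "Length", "Width", "Wheelbase", "Fuel.tank.capacity")
cars <- Cars93[, c("MPG.city", candidates)]

# Keep up to the ten best-fitting models for each model size
subsets <- regsubsets(MPG.city ~ ., data = cars,
                      nbest = 10, nvmax = length(candidates))

# Built-in graphic: one row per model, shaded by BIC
plot(subsets, scale = "bic")

# List the ten models with the lowest BIC and the variables each contains
fit <- summary(subsets)
ranked <- order(fit$bic)
top <- head(ranked, 10)
var_names <- colnames(fit$which)[-1]          # drop the intercept column
data.frame(
  bic   = round(fit$bic[top], 1),
  model = sapply(top, function(i) paste(var_names[fit$which[i, -1]], collapse = " + "))
)
```

Comparing the BIC values in the resulting table gives a quick impression of how many different variable combinations fit about equally well.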
Although a lot of effort may be spent in selecting the “right” model (using any number of techniques), there is a question as to whether ANY single regression model can adequately describe the relationships in this set of data. One solution is to present more than one final model using either a table of models or a graphic. A graphic of multiple models from the Cars93 data is shown in Figure 4, which uses adjusted R² as the model-fit criterion. The variables that have been considered for entry into the model are shown on the x-axis with each individual model shown as a line on the graph. For example, the top line shows a model that includes the intercept, horsepower and weight and has an adjusted R² of 0.71. The next best-fitting model includes the intercept, price and weight and also has an adjusted R² of 0.71. It is not unexpected that these models have similar fit statistics, as horsepower and price are highly correlated (Pearson’s correlation = 0.788). Figure 4 shows many models that have similar model-fits and provides a picture of the analysis that is much richer than any single model.
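The comparison between the two top models can be checked directly. In the sketch below, MPG.city is assumed to be the mpg response (the text does not say which of the two mpg ratings in Cars93 was used), so the adjusted R² values obtained may differ a little from those quoted.

```r
data(Cars93, package = "MASS")

# Two competing models that differ only in one of two correlated predictors
m_horsepower <- lm(MPG.city ~ Horsepower + Weight, data = Cars93)
m_price      <- lm(MPG.city ~ Price + Weight, data = Cars93)

# Adjusted R-squared for each model
c(horsepower_weight = summary(m_horsepower)$adj.r.squared,
  price_weight      = summary(m_price)$adj.r.squared)

# Pearson correlation between the two interchangeable predictors
cor(Cars93$Horsepower, Cars93$Price)
```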
Showing multiple models in a single graphic is a very economical method of displaying analyses and is particularly effective when used on a relatively small subset of candidate variables (including too many models and variables tends to make the graphic confusing). The multiple plot graphic is useful for illustrating competing models and also identifying when variables do not enter into any of the best-fitting models. For example, a graphic showing that “gender” does not enter into any of the best-fitting models provides a convincing argument that it is not important for predicting the response (a much more convincing argument than merely quoting a single model that does not include it).
In conclusion, single models are unlikely to provide a complete picture of the analyses when dealing with models selected from variables that are interrelated. In such situations, analysts should consider reporting multiple models and, possibly, illustrate them using an appropriate graphic.
In the light of the issues dealt with in this tutorial, it is likely that the initial analyses described may be misleading. In all probability, the models were selected from many candidate variables and it is uncertain whether the variables provide useful information about the relationships in the population, or whether they have been selected as the result of chance. We certainly need to know how many variables were considered for the models and how many models were tested. It is also likely that there are a number of competing models that could have been selected. Other variables that were measured but did not enter into the model may have featured in other models with comparable model-fit statistics. Providing information about these competing models would have provided a much richer analysis and one less prone to misinterpretation.
Although providing a full description of how to model is beyond the scope of this tutorial, I would suggest that a reasonable strategy would be to select a small number of theoretically justified models a priori based on a limited number of variables and then display a number of competing models from this subset. Analysts should resist the temptation to trawl through the data looking for significance, as this will likely identify variables on the basis of chance. Any relationships identified in this way may be made the subject of further analysis, but should not themselves be published.
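As a rough illustration of this strategy, the sketch below fits a small set of models specified in advance and reports them side by side rather than selecting one. The four candidate models are hypothetical and are not theoretically justified here; MPG.city as the response is also an assumption.

```r
data(Cars93, package = "MASS")

# Hypothetical a priori candidate set, chosen before looking at the fits
candidate_formulas <- list(
  size        = MPG.city ~ Weight,
  size_power  = MPG.city ~ Weight + Horsepower,
  size_price  = MPG.city ~ Weight + Price,
  size_engine = MPG.city ~ Weight + EngineSize
)

# Fit every candidate model to the same data
fits <- lapply(candidate_formulas, lm, data = Cars93)

# Report all competing models together, ordered by BIC
comparison <- data.frame(
  model  = names(fits),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  aic    = sapply(fits, AIC),
  bic    = sapply(fits, BIC),
  row.names = NULL
)
comparison[order(comparison$bic), ]
```

Presenting the whole table, rather than only its first row, lets the reader see how close the competing models are.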
Accessible introductions to variable selection and model building can be found in Burnham and Anderson (2002), Fox and Weisberg (2011) and Weisberg (1985) and readers are strongly encouraged to consult these for detailed information.
Graeme Hutcheson, Manchester University, Manchester, UK
References
Agresti, A. (2007), An Introduction to Categorical Data Analysis, 2nd ed., Wiley, New York, NY
Anderson, D.R., Burnham, K.P. and Thompson, W.L. (2000), “Null hypothesis testing: problems, prevalence, and an alternative”, Journal of Wildlife Management, Vol. 64, pp. 912–23
Box, G.E.P. (1979), “Robustness in the strategy of scientific model building”, in Launer, R.L. and Wilkinson, G.N. (Eds), Robustness in Statistics, Academic Press, New York, NY
Burnham, K.P. and Anderson, D.R. (2002), Model Selection and Multimodel Inference: A Practical Information-theoretic Approach, 2nd ed., Springer, Berlin
Fox, J. and Weisberg, S. (2011), An R Companion to Applied Regression, 2nd ed., Sage, Thousand Oaks, CA
Harrell, F.E. Jr (2001), Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis, Springer, Berlin
Hutcheson, G.D. (2010), “Open-source statistical software: R and the R Commander”, Journal of Modelling in Management, Vol. 5 No. 3
Hutcheson, G.D. and Moutinho, L. (2008), Statistical Modeling for Management, Sage, Thousand Oaks, CA
Lumley, T. (2009), R Package “Leaps”: Regression Subset Selection (Version 2.9), available at: http://cran.r-project.org/web/packages/leaps
Miller, A.J. (1990), Subset Selection in Regression, Chapman and Hall, London
R Development Core Team (2011), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, available at: www.R-project.org/
Venables, W.N., Ripley, B.D., Hornik, K. and Gebhardt, A. (2011), R Package “MASS”: Functions and Datasets to Support Venables and Ripley, “Modern Applied Statistics with S” (4th ed., 2002, Springer, Berlin) (Version 7.3-13)
Weisberg, S. (1985), Applied Linear Regression, 2nd ed., Wiley, New York, NY