Emerald Group Publishing Limited
Copyright © 2011, Emerald Group Publishing Limited
Categorical explanatory variables
Article Type: Tutorial Section
From: Journal of Modelling in Management, Volume 6, Issue 2
Categorical variables (ordered and unordered) are very common in social science research and are often of primary interest. In order to include these variables appropriately in statistical models, they need to be coded into a number of individual “dummy” categories, which can be entered directly into the model. There are a variety of methods that can be used to code these dummy variables, each of which provides a different set of comparisons between the categories that make up the variable. Some coding techniques compare individual categories, others compare specific categories with mean values, whilst others provide information about possible linear and non-linear trends. These coding methods provide a wealth of information that can be of great benefit to researchers.
Even though many different coding methods are available for categorical data, researchers tend to opt for the simplest method of coding, or use the default method offered by their statistical software. The default or simplest coding is often not, however, the most appropriate or useful way to represent the categorical variable, particularly when the variable is ordered, or specific comparisons are required.
This tutorial provides a demonstration of a number of methods for coding categorical explanatory variables and shows how these can be used to describe ordered as well as unordered categories. The use of these coding methods can greatly improve the interpretation of the results and enhance analyses.
1 Coding unordered data
Figure 1 shows the price of a “standard drink” by the location of the bar where the drink was purchased. Three different locations are shown: the town centre, the seafront and other areas. The box plot clearly shows that seafront bars tend to charge the most, closely followed by bars in the town centre, with those in other locations charging somewhat lower prices.
Figure 1 The relationship between price and location
The relationship between Price and Location can also be described using an OLS regression model of “Price” with “Location” included as an explanatory variable. In order to include the categorical variable “Location” in the regression model, it is necessary to dummy code it (either by hand or by software). Below are shown two popular methods of including unordered categorical variables in statistical models.
1.1 Treatment coding (comparing each category to a reference)
One of the most popular methods for coding categorical data is a technique known as treatment coding (also known as indicator or simple coding) which transforms categorical data into a number of dichotomies. Table I shows how the variable Location may be coded into a series of dichotomies.
Table I Treatment coding of location
The relationship between Price and Location can be investigated using a regression model that substitutes the dummy codes for the original variable. In general, if we have j categories, j−1 dummy variables are entered into the model. Location is, therefore, represented by two dummy variables, each of which indicates a specific location that is compared to the reference category (the location that is not included as a parameter). Although many software packages dummy code automatically “in the background”, dummy codes can also be entered directly into the data frame (the spreadsheet containing the data). The treatment-coded dummy variables are shown in the data frame in Table II. We can either model “Price” using the variable “Location” (if our software allows this), or model “Price” using both the dummy variables “D.SeaFront” and “D.TownCentre”. The resulting models will be identical (try it and see!).
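The equivalence described above can be checked with a minimal sketch. The data frame below is hypothetical (the prices are illustrative values, not the article's BarPrices data), and R's default treatment contrasts are assumed to be in effect.

```r
# Hypothetical prices (illustrative values only, not the article's data)
BarPrices <- data.frame(
  Price = c(2.1, 2.3, 2.2, 2.4,    # Other
            3.0, 3.2, 3.1, 3.3,    # SeaFront
            2.8, 3.0, 2.9, 3.1),   # TownCentre
  Location = factor(rep(c("Other", "SeaFront", "TownCentre"), each = 4))
)

# Treatment-coded dummy variables entered by hand; "Other" is the reference
BarPrices$D.SeaFront   <- as.numeric(BarPrices$Location == "SeaFront")
BarPrices$D.TownCentre <- as.numeric(BarPrices$Location == "TownCentre")

# Model 1 lets R code the factor; model 2 uses the hand-made dummies
m.factor <- lm(Price ~ Location, data = BarPrices)
m.dummy  <- lm(Price ~ D.SeaFront + D.TownCentre, data = BarPrices)

# The estimates agree; only the parameter labels differ
print(coef(m.factor))
print(coef(m.dummy))
```

Because the two models span the same design, their fitted values and fit statistics are also the same.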
Table II A data frame showing treatment coding of location
Running the regression model: the default option
A regression model of Price (an OLS model; Price∼Location) can be computed in R (R Development Core Team, 2011) via the Rcmdr interface (Fox and Weisberg, 2011). Although Location is a single variable, it is represented in the model as two (j−1) dummy variables. The default coding used by R is treatment coding (hence, the letter “T” in the parameter description) with the reference category being the first category alphabetically (the category “Other”). The first Location parameter compares “SeaFront” with “Other” and the second compares “TownCentre” with “Other”. The software has simply recoded “Location” in the background (without saving these codes to the dataset).
In R, it is simple to show the contrasts used in the regression model using the “contrasts()” command. For example, the command contrasts(BarPrices$Location) shows which contrasts are being used for the variable “Location” in the data frame “BarPrices”. The result shows that treatment codes are used (indicated by “T.”) and that “Other” is the reference category, as it is the dummy code that is missing.
This default coding does not provide a complete picture of the relationship between Price and Location. One obvious difficulty with the model above is that it does not allow us to directly assess the difference between the SeaFront and TownCentre locations. To do this, we need to change the reference category.
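A minimal sketch of this check, using a small hypothetical factor in place of the article's data. In plain R the dummy columns are labelled by category name only; the “T.” prefix mentioned above is assumed to come from the contrast settings that Rcmdr applies.

```r
# Hypothetical stand-in for the Location variable (not the article's data)
BarPrices <- data.frame(
  Location = factor(rep(c("Other", "SeaFront", "TownCentre"), each = 4))
)

# Levels are sorted alphabetically, so "Other" comes first
print(levels(BarPrices$Location))

# Rows are categories, columns are the dummy variables entered into a model;
# the reference category is the row containing only zeros
location.codes <- contrasts(BarPrices$Location)
print(location.codes)
```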
Changing the reference category
The reference category for the variable Location (contained within the BarPrices dataset) can easily be changed to TownCentre in Rcmdr using pull-down menus, or directly in R with a single command, and the new coding can be checked using the contrasts() command.
Dummy categories are now provided for “SeaFront” and “Other”, making “TownCentre” the reference category. Changing the reference category is usually very simple to do in most software; refer to the relevant manual for instructions. With TownCentre as the reference category, the model parameters compare SeaFront and Other with TownCentre.
Changing the reference category has made a huge difference to the parameter for bars located on the SeaFront: in the first model it is highly significant and in the second it is non-significant. Although this is to be expected given the different comparisons being made in the model (a quick look at Figure 1 will confirm that the difference between SeaFront and Other is large compared to the difference between SeaFront and TownCentre), it can be misleading if just one model is shown, particularly to audiences not used to dummy-coded explanatory variables. The first model makes Location look much more significant than the second!
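The change of reference category discussed above can be sketched as follows. The use of relevel() is one common way to do this in R and is an assumption here, not necessarily the command used in the original article; the data are hypothetical.

```r
# Hypothetical prices (illustrative values only, not the article's data)
BarPrices <- data.frame(
  Price = c(2.1, 2.3, 2.2, 2.4,    # Other
            3.0, 3.2, 3.1, 3.3,    # SeaFront
            2.8, 3.0, 2.9, 3.1),   # TownCentre
  Location = factor(rep(c("Other", "SeaFront", "TownCentre"), each = 4))
)

# Make "TownCentre" the first level, and therefore the reference category
BarPrices$Location <- relevel(BarPrices$Location, ref = "TownCentre")
print(contrasts(BarPrices$Location))

# Each parameter now compares a location with TownCentre
m.town <- lm(Price ~ Location, data = BarPrices)
print(summary(m.town)$coefficients)
```

The SeaFront estimate is now the SeaFront minus TownCentre difference, which is much smaller than the SeaFront minus Other difference in these illustrative data.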
Showing all comparisons
If the explanatory variable is of particular interest, it is useful to construct a table showing all comparisons. (These have been compiled from information gathered using the two models above.) Table III shows the individual comparisons for the model Price∼Location.
Table III A table of comparisons for location
The significance of Location
The models do not provide direct information about the overall significance of the variable “Location” on “Price”. In order to do this, the effect that both parameters have on Price simultaneously needs to be assessed. Although this is simply achieved in most statistical software, it is often missing in research reports and papers. It is not uncommon for readers to have to come to their own conclusions about significance based on the individual estimates of significance given in the reported model, which, as we have seen, can provide very different impressions of significance. The overall significance of Location can be computed in R with a single test of both parameters.
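A minimal sketch of such an overall test, assuming base R's anova() is an acceptable stand-in for whatever routine produced the original output. The data and the name Location2 are hypothetical. The test gives the same answer whichever reference category is used.

```r
# Hypothetical prices (illustrative values only, not the article's data)
BarPrices <- data.frame(
  Price = c(2.1, 2.3, 2.2, 2.4,    # Other
            3.0, 3.2, 3.1, 3.3,    # SeaFront
            2.8, 3.0, 2.9, 3.1),   # TownCentre
  Location = factor(rep(c("Other", "SeaFront", "TownCentre"), each = 4))
)

# Same variable with a different reference category
BarPrices$Location2 <- relevel(BarPrices$Location, ref = "TownCentre")

m.other <- lm(Price ~ Location,  data = BarPrices)
m.town  <- lm(Price ~ Location2, data = BarPrices)

# One F-test for the whole variable (both dummy parameters together)
print(anova(m.other))
print(anova(m.town))
```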
1.2 Sum coding (comparing each category to the average)
It is sometimes appropriate to compare each category with an average value from all categories, rather than a specific reference. This is possible using a different dummy-coding technique, where the codes are assigned according to the scheme laid out in Table IV.
Table IV Sum coding of location
Using these codes, each category is compared to the average of all categories. Similar to the treatment coding method discussed above, only j−1 categories enter into a model. It is (usually) a simple matter to change the coding technique used for a variable. Rcmdr uses pull-down menus to change the contrast coding method (Figure 3), but this can also be achieved directly in R with a single command and checked using the contrasts() command, which shows that sum coding is used (as indicated by “S.”) with TownCentre as the reference category.
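A minimal sketch of switching to sum coding with base R's contr.sum(), using a hypothetical factor. Plain R labels the resulting columns by number; the “S.” labels described above are assumed to come from Rcmdr's contrast settings.

```r
# Hypothetical stand-in for the Location variable (not the article's data)
BarPrices <- data.frame(
  Location = factor(rep(c("Other", "SeaFront", "TownCentre"), each = 4))
)

# Assign sum (deviation) contrasts; the last level is coded -1 throughout
contrasts(BarPrices$Location) <- contr.sum(3)

sum.codes <- contrasts(BarPrices$Location)
print(sum.codes)
```

With the levels in alphabetical order, TownCentre is the last level and so takes the role of the omitted (reference) category.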
Running the regression model: the default option
A regression model of Price (an OLS model; Price∼Location) can be computed in R using sum coding. The statistics for the overall model are the same as before (see the F-value). The seafront bars charge significantly more than the average of all bars. To compare “TownCentre” bars to the average of all bars, the reference category can be changed (see the instructions above) and the model rerun. The estimate for Other is the same as before (it is still being compared to the overall average). We can see that the outlying bars charge significantly less than the average. These results are summarised in Table V.
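The interpretation of the sum-coded parameters can be illustrated with a short sketch on hypothetical data. Here the “average of all categories” is the unweighted mean of the category means, and each parameter is a category mean minus that average.

```r
# Hypothetical prices (illustrative values only, not the article's data)
BarPrices <- data.frame(
  Price = c(2.1, 2.3, 2.2, 2.4,    # Other
            3.0, 3.2, 3.1, 3.3,    # SeaFront
            2.8, 3.0, 2.9, 3.1),   # TownCentre
  Location = factor(rep(c("Other", "SeaFront", "TownCentre"), each = 4))
)
contrasts(BarPrices$Location) <- contr.sum(3)

m.sum <- lm(Price ~ Location, data = BarPrices)

# Category means and their unweighted average
group.means <- tapply(BarPrices$Price, BarPrices$Location, mean)
grand.mean  <- mean(group.means)

print(coef(m.sum))               # intercept, Other effect, SeaFront effect
print(group.means - grand.mean)  # deviations of each category from the average

# The omitted category's deviation is minus the sum of the others
town.effect <- -sum(coef(m.sum)[-1])
print(town.effect)
```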
Table V A table of comparisons for location
2 Coding ordered data
Ordered categorical explanatory variables are very common and may be used to indicate information such as educational grade, socio-economic status, attitude, experience, management level, etc. Figure 2 shows an ordered variable (called “Variable”) with five levels and a box plot showing its relationship to a numeric variable (called “Score”). Although this variable can be included as an explanatory variable in a model using one of the dummy-variable-coding techniques described above (treatment or sum coding), these methods do not take into account the order in the data. A number of alternative coding methods are explored below that take account of order and offer advantages when analysing ordered categorical explanatory variables.
2.1 Helmert coding (comparing each level to the mean of previous levels)
One method of taking account of order in the data is to use Helmert coding, which compares individual levels to the average of previous levels. Table VI shows the coding method used to obtain Helmert contrasts.
Helmert coding is defined in R by assigning Helmert contrasts to the variable, and the assignment can be checked using the contrasts() command. In the OLS regression model “Score∼Variable” fitted with Helmert contrasts, Variable1 [H.1] compares level2 with level1. We can see that level2 has a higher score than level1, but not significantly so. Variable2 [H.2] compares level3 with the average of level1 and level2, Variable3 [H.3] compares level4 with the average of the first three levels and Variable4 [H.4] compares level5 with the average of the preceding four levels. These parameters show an increasing trend in the data (all estimates are positive, showing that each category is bigger than the average of the preceding categories).
As with the previous coding schemes, the overall significance of the explanatory variable cannot be assessed directly from the individual parameters; an overall test of all four parameters is needed. An analysis of deviance table could be used but, as there is a single explanatory variable, we will just use the overall F-test, which shows a significance of 7.237×10⁻⁵.
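A minimal sketch of Helmert coding with base R's contr.helmert(), using hypothetical scores (not the article's data). One point worth noting is that R scales these parameters: each estimate is the difference between a level and the mean of the preceding levels, divided by the number of levels involved in the comparison. The sign and the significance test are unaffected by this scaling.

```r
# Hypothetical scores for a five-level ordered variable (illustrative only)
Data <- data.frame(
  Score = c(10, 12,  13, 15,  17, 19,  16, 18,  22, 24),
  Variable = factor(rep(paste0("level", 1:5), each = 2))
)

# Assign and inspect Helmert contrasts
contrasts(Data$Variable) <- contr.helmert(5)
print(contrasts(Data$Variable))

m.helmert <- lm(Score ~ Variable, data = Data)
print(coef(m.helmert))

# Reproduce the estimates by hand from the level means:
# (mean of level k+1 minus mean of levels 1..k) / (k + 1)
mu <- as.numeric(tapply(Data$Score, Data$Variable, mean))
by.hand <- sapply(1:4, function(k) (mu[k + 1] - mean(mu[1:k])) / (k + 1))
print(by.hand)
```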
Figure 2. The relationship between score and an ordered variable
Table VI Helmert coding of ordered variable
2.2 Difference coding (comparing each level to its neighbour)
A useful thing to do with ordered data is to compare each level with its neighbour. This provides information about the trend in the variable and quickly identifies levels that do not “follow the trend”. Difference coding is not one of the techniques that is automatically available in R and Rcmdr, but it can easily be implemented by specifying the contrasts manually. The procedure for achieving this in Rcmdr is shown in Figure 3. (For other software packages, please consult the manual.) The coding used for each category is shown in the “Specify Contrasts” window.
In the model of “Score∼Variable” fitted using the difference coding technique, the Variable.1 parameter compares level1 with level2, Variable.2 compares level2 with level3, etc. From these parameters it is immediately obvious that level3 and level4 do not follow the same pattern as the others (this is also evident in Figure 2, as level4 is below level3). On the basis of this evidence, one might want to look more closely at level3 and level4 to see if they might be combined.
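One way to specify difference contrasts by hand in R is sketched below. This is an assumption about how the coding might be built, not a transcription of the codes shown in the Rcmdr window, and the scores are hypothetical. The idea is to write down the comparisons wanted (the average of all levels, then each level minus its predecessor) and invert that matrix to obtain the codes.

```r
# Hypothetical scores for a five-level ordered variable (illustrative only)
Data <- data.frame(
  Score = c(10, 12,  13, 15,  17, 19,  16, 18,  22, 24),
  Variable = factor(rep(paste0("level", 1:5), each = 2))
)
k <- nlevels(Data$Variable)

# Rows of H define the parameters in terms of the level means:
# row 1 is the average of all levels, rows 2..k are neighbour differences
H <- rbind(rep(1 / k, k), diff(diag(k)))

# The contrast codes are the inverse of H without its intercept column
diff.codes <- solve(H)[, -1]
contrasts(Data$Variable) <- diff.codes

m.diff <- lm(Score ~ Variable, data = Data)
print(coef(m.diff))

# Each estimate equals the difference between neighbouring level means
mu <- as.numeric(tapply(Data$Score, Data$Variable, mean))
print(diff(mu))
```

In these illustrative data the third difference is negative, which is the kind of break in the trend that difference coding makes easy to spot.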
Figure 3. Difference coding in Rcmdr
2.3 Orthogonal polynomial coding (identifying linear and non-linear trends)
Polynomial coding is one of the most rarely used coding techniques, but it is also one of the most informative. The purpose of polynomial coding is to try to identify linear and non-linear trends in the relationship between the ordered explanatory variable and the response. This coding should only be used where the categories can be considered to be “more or less” equally spaced. The polynomial coding scheme for the data is shown in Table VII.
Table VII Polynomial coding of ordered variable
Orthogonal polynomial coding is defined in R by assigning polynomial contrasts to the variable, and can be checked using the contrasts() command. In the model of “Score∼Variable” fitted using the polynomial coding technique, the first parameter, Variable.L, tests for a linear trend, the second (Variable.Q) tests for a quadratic trend (a curve) and the third (Variable.C) for a cubic trend. Further parameters test for higher order trends. The model shows that the relationship between Score and the ordered variable is linear, which can be seen in Figure 2.
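A minimal sketch of polynomial coding with base R's contr.poly(). The scores are hypothetical and constructed so that the level means rise by exactly the same amount at each step, so only the linear parameter should be non-zero.

```r
# Hypothetical scores whose level means increase in equal steps
Data <- data.frame(
  Score = c(9, 11,  14, 16,  19, 21,  24, 26,  29, 31),
  Variable = factor(rep(paste0("level", 1:5), each = 2), ordered = TRUE)
)

# Assign and inspect orthogonal polynomial contrasts
poly.codes <- contr.poly(5)
contrasts(Data$Variable) <- poly.codes
print(round(contrasts(Data$Variable), 3))

# .L = linear, .Q = quadratic, .C = cubic, ^4 = quartic
m.poly <- lm(Score ~ Variable, data = Data)
print(round(summary(m.poly)$coefficients, 4))
```

R applies polynomial contrasts to ordered factors by default, so the explicit assignment is shown here mainly to make the coding visible.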
Polynomial coding is particularly useful for identifying curvilinear relationships, such as one where successive increases in level have a decreasing effect. A model of such data shows a curvilinear trend, as parameters Variable.L and Variable.Q are both significant. This is precisely what one would expect from the shape of the relationship shown in Figure 4. It is also evident from the parameter Variable.Q that the quadratic effect decreases as level increases.
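The curvilinear case can be sketched in the same way. The scores below are hypothetical and built so that each step up in level adds less than the one before; they are not the data behind Figure 4.

```r
# Hypothetical scores with diminishing increases between levels
Curve <- data.frame(
  Score = c(9, 11,  15, 17,  19, 21,  21, 23,  22, 24),
  Variable = factor(rep(paste0("level", 1:5), each = 2), ordered = TRUE)
)
contrasts(Curve$Variable) <- contr.poly(5)

m.curve <- lm(Score ~ Variable, data = Curve)
print(round(summary(m.curve)$coefficients, 4))

# A positive linear term with a negative quadratic term describes a
# rising trend that flattens off at the higher levels
linear.est    <- as.numeric(coef(m.curve)["Variable.L"])
quadratic.est <- as.numeric(coef(m.curve)["Variable.Q"])
```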
Polynomial contrasts are also useful for identifying non-linear trends that are difficult to identify from the regression parameters and fit statistics. For example, the relationship shown in Figure 5 is not linear, and a quadratic trend might be more useful in describing it.
Figure 4. A curvilinear relationship
Figure 5. A non-linear relationship
Dummy-variable coding is an important part of data manipulation as it enables categorical variables to be included in a wide variety of statistical models (for example, OLS, proportional odds, survival, multinomial and log-linear). Its use increases the utility of regression models, and understanding how the coding operates greatly helps with the interpretation of the models. Careful selection of a contrast code and a reference category is crucial to effective data analysis.
Graeme Hutcheson, Manchester University, Manchester, UK
Fox, J. and Weisberg, S. (2011), An R Companion to Applied Regression, 2nd ed., Sage, London
Aguinis, H. (2004), Regression Analysis for Categorical Moderators, Guilford, New York, NY
Hardy, M.A. (1993), Regression with Dummy Variables, Sage, London
Hutcheson, G.D. (2011), “Dummy variable coding”, in Moutinho, L. and Hutcheson, G.D. (Eds), The SAGE Dictionary of Quantitative Management Research, Sage, London
Hutcheson, G.D. and Moutinho, L. (2008), Statistical Modelling for Management, Sage, London
R Development Core Team (2011), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, available at: www.R-project.org