Abstract
Purpose
The purpose of this paper is to provide an automatic interpretation of the results of multiple correspondence analysis (MCA) for categorical variables, so that nonexpert users can interpret them immediately and safely. These results concern the categories of variables that interact strongly and determine the trends of the subject under investigation.
Design/methodology/approach
This study is a novel theoretical approach to interpreting the results of the MCA method. The classical interpretation of MCA results is based on three indicators: the projection (F) of the category points of the variables on the factorial axes, the contribution of a point to the creation of an axis (CTR) and the correlation (COR) of a point with an axis. Using these indicators in combination is arduous, particularly for nonexpert users, and frequently results in misinterpretations. The current study synthesizes the three indicators so that the interpretation of the results rests on a single new indicator, just as the interpretation of the well-known principal component analysis (PCA) for continuous variables rests on a single index.
Findings
Two concepts are proposed in the new theoretical approach: the interpretative axis, corresponding to the classical factorial axis, and the interpretative plane, corresponding to the factorial plane. As will be seen, both offer clear and safe interpretative results in MCA.
Research limitations/implications
In the proposed automatic interpretation of the MCA results, the interpretative axes do not carry the actual projections of the points, as the original factorial axes do. This is of little concern to the ordinary user, who is only interested in distinguishing the categories of variables that determine the most pronounced trends of the phenomenon being examined.
Practical implications
The results of this research can have positive implications for the dissemination of MCA as a method and its use as an integrated exploratory data analysis approach.
Originality/value
Interpreting MCA results presents difficulties for the nonexpert user and sometimes leads to misinterpretations. The difficulty persists in other proposals for interpreting MCA. The proposed method allows the MCA results to be interpreted clearly and accurately and thus contributes to the dissemination of MCA as an integrated method of categorical data analysis and exploration.
Citation
Moschidis, S., Markos, A. and Thanopoulos, A.C. (2022), "“Automatic” interpretation of multiple correspondence analysis (MCA) results for nonexpert users, using R programming", Applied Computing and Informatics, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/ACI-07-2022-0191
Publisher
Emerald Publishing Limited
Copyright © 2022, Stratos Moschidis, Angelos Markos and Athanasios C. Thanopoulos
License
Published in Applied Computing and Informatics. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode
1. Introduction
Dimension reduction methods seek to reduce the number of dimensions [1, 2] in the variable space whilst preserving the most important structure or relationships between the variables, i.e. without significant loss of information [3]. They also have the advantage of handling and visualizing complex and massive amounts of data [4–7]. Principal component analysis (PCA) is a popular method for performing dimension reduction [8] on a set of continuous variables and an effective approach to capturing characteristics [9]. Unlike feature selection [10], it aims to identify the variables that contribute most to the creation of new, composite variables, known as principal components or dominant factorial axes. This is achieved via the diagonalization of a symmetric correlation or covariance matrix [11]. To identify which of the original variables contribute most to the creation of each principal component, it suffices to use the coordinates of the projections [1] of the variable points on each factorial axis, since these coordinates express the correlation coefficients of each variable with the axis.
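The link between diagonalization and correlations can be illustrated with a minimal sketch in Python, assuming only two standardized variables so that the correlation matrix can be diagonalized in closed form. The data and the function names `pearson` and `pca_two_variables` are made up for illustration and do not come from the paper or from any package.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pca_two_variables(x, y):
    """Diagonalize the correlation matrix [[1, r], [r, 1]] in closed form.

    Returns (eigenvalues, loadings); loadings[k] holds the coordinates of
    variables x and y on principal component k, which equal their
    correlations with that component.
    """
    r = pearson(x, y)
    s = 1 / math.sqrt(2)
    # Eigenpairs of [[1, r], [r, 1]]: (1 + r, (s, s)) and (1 - r, (s, -s)).
    pairs = sorted([(1 + r, (s, s)), (1 - r, (s, -s))], reverse=True)
    eigenvalues = [lam for lam, _ in pairs]
    # Coordinate = eigenvector entry * sqrt(eigenvalue); max() guards
    # against a tiny negative eigenvalue caused by rounding.
    loadings = [
        tuple(v * math.sqrt(max(lam, 0.0)) for v in vec) for lam, vec in pairs
    ]
    return eigenvalues, loadings

# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
eigenvalues, loadings = pca_two_variables(x, y)
print(eigenvalues)
print(loadings)
```

For each variable, the squared coordinates over the two components sum to one, which is why a single number per axis is enough to judge how strongly a variable is tied to that axis.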
In this paper we focus on multiple correspondence analysis (MCA), a generalization of PCA to categorical data and of simple correspondence analysis [12], which is widely used in scientific fields such as marketing, psychology, health, economics and management [13]. The goal of MCA is to describe the associations between the categories of two or more nominal variables in a low-dimensional space containing these categories. When interpreting the results of MCA, users have to identify which column categories contribute most to the definition of the factorial axes. MCA has been used as a first step for reducing predictors in classification problems [2], for obtaining new coordinates for hierarchical clustering on principal components [14], and even as a method for meta-analyzing literature findings in marketing [15, 16]. Proper interpretation of MCA results is often difficult for nonexpert users and can consequently lead to misinterpretations [17]. The purpose of this paper is to provide nonexpert users with an “automatic” and clear interpretation of the most important points of MCA results, via an alternative visualization scheme based on the construction of the so-called interpretive axes and the corresponding interpretive factorial planes. These proposals narrow the existing research gap and constitute the original contribution of the study. The scheme removes the need for the user to examine and evaluate the tabular MCA output and to look up numbers and statistics that frequently require additional calculations. The proposed scheme is similar to the one used in the context of PCA, so users familiar with PCA can easily comprehend it. The novelty of this study is the discovery of a geometrical locus of points on the so-called interpretive plane that improves on current alternative approaches to interpreting the most important points of a factorial plane.
The paper is organized as follows. Section 2 presents the basic concepts of MCA. Section 3 reviews the corresponding literature and discusses the research gap and alternative approaches to the problem. The interpretive axis and the interpretive plane for visualizing MCA results are introduced in Section 4. Section 5 demonstrates the proposed approach on a real data set and compares the results. Section 6 presents the conclusions, discussion, limitations and implications of this research.
2. Basic MCA concepts
Let X, a
The COR index expresses the amount of inertia of point j explained by axis a, that is:
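In the usual correspondence analysis notation, the CTR and COR indices are commonly written as below. The symbols are assumed notation: $f_j$ is the mass of category point $j$, $F_a(j)$ its coordinate on axis $a$, $\lambda_a$ the inertia (eigenvalue) of axis $a$, and $d^2(j, G)$ the squared chi-square distance of point $j$ from the centroid $G$.

```latex
\mathrm{CTR}_a(j) = \frac{f_j \, F_a(j)^2}{\lambda_a},
\qquad
\mathrm{COR}_a(j) = \frac{F_a(j)^2}{d^2(j, G)}
                  = \frac{F_a(j)^2}{\sum_{b} F_b(j)^2}
```

Under these definitions the CTR values of all points sum to one on each axis, and the COR values of a point sum to one over all axes; both are often reported as percentages.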
3. Literature review – alternative approaches of visualizing important points of the MCA maps
The problem of interpreting the results of MCA with visualizations has been a subject of interest, as a number of published studies show [23–26]. Of special interest is the effort to interpret the most important points on a factorial plane [2, 24–26]. Approaches addressing the correct interpretation of factorial maps (the importance [24] or the proximity of points [2]) vary from the use of symbols and the coloring of points to mathematical transformations. Even though MCA has been implemented in different software and programming languages (e.g. SPSS, SPAD, Python), in this study we focus on R. In R there are packages developed to perform MCA and visualize its results, and others that produce visualizations aiming to help users interpret the results. FactoMineR [14] is a package that computes MCA results and can produce the classical factorial maps. Users can also color points according to their contributions, which shifts the problem of distinguishing the most important points to an additional visual variable. CAinterprTools [23] is an R package that uses additional visual aids (dot plots, scatter plots) to help users. However, the user has to resort to additional graphical and numerical information, which is not a good solution for nonexpert users. The ca package [27] implements the research of [24, 25] on asymmetric biplots [26]; through the parameters of its functions, users can transform the points by multiplying their standard coordinates by the square root of their masses. Consequently, the plotted points carry information about their masses, which, for a single factorial axis, is adequate for immediately extracting the most important points. In the factorial plane the points retain and generalize this informative character, but one issue surfaces.
Since the plotted arrows (those the ca package draws from the origin of the axes to each point) do not lie on the same axis, their lengths must be measured in order to interpret the most important points of the plane safely. This is a drawback, which our proposed interpretive plane resolves. We would also like to underline the findings of [2], where the authors presented a solution to the problem of interpreting the proximity of points on factorial maps by defining a “tolerance distance”.
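The arrow-length issue can be made concrete with a minimal sketch in Python of the contribution-biplot rescaling described in [24], in which standard coordinates are multiplied by the square root of the masses. The points P and Q, their coordinates and masses, and the function names are hypothetical; this is not the ca package's implementation.

```python
import math

def contribution_coordinates(standard_coords, masses):
    """Multiply each point's standard coordinates by the square root of its mass."""
    return {
        name: tuple(math.sqrt(masses[name]) * c for c in coords)
        for name, coords in standard_coords.items()
    }

def arrow_length(point):
    """Euclidean distance of a plotted point from the origin of the plane."""
    return math.hypot(point[0], point[1])

# Hypothetical points: (axis 1, axis 2) standard coordinates and masses.
standard_coords = {"P": (2.0, 0.5), "Q": (1.0, -1.5)}
masses = {"P": 0.04, "Q": 0.09}

scaled = contribution_coordinates(standard_coords, masses)
for name, point in scaled.items():
    print(name, point, round(arrow_length(point), 3))
```

In this made-up example P lies further out than Q along the first axis, yet Q has the longer arrow in the plane. On a single axis the rescaled coordinate can be read off directly, whereas arrows pointing in different directions have to be measured before they can be compared.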
4. Methodology – proposal of interpretive axes and interpretive planes
In this section we introduce the notions of interpretive axis and interpretive plane. We consider that each factorial axis corresponds to an interpretive axis that incorporates or combines all the interpretive information of the coordinates, F, the COR index and the CTR index. This interpretive axis is defined as follows: The coordinate,
1st condition:
Proof: consider the point j in position A which has coordinates
5. Applications
In this section we illustrate the visualizations presented in the previous section and compare them with existing alternatives, using the “wg93” dataset from the ca package [27]. More information on the dataset can be found in chapter 2 of [12]. MCA was performed with the FactoMineR package [14], while the libraries “tidyverse” [28], “ggplot2” [29–31], “ggrepel” [32], “plotly” [33], “CAinterprTools” [23], “factoextra” [34], “shiny” [35], “DT” [36], “this.path” [37], “soc.ca” [38] and “egg” [39] were also used to produce content found either in the manuscript or in the supplementary material. At the end of the manuscript there is a link to the supplementary material, which reproduces all the visualizations discussed in this paper and provides data tables with numerical evidence for verification. We compare the proposed visualizations with the classical factorial maps, and we also compare the proposed first interpretive plane with what we consider the best alternative among the approaches discussed in the literature section. We encourage readers to download and explore the supplementary material, as it is important for the completeness of the paper.
In Figure 3 we observe that, on the first factorial axis, point A_5 is the most distant point on the right side of the axis: it has the maximum F value but not the maximum CTR value among the points on that side (A_5, F: 1.64, CTR: 7.32, e: 2.11). Point B_5, which has only the third-largest F value on the right side, is the one with the maximum CTR value (B_5, F: 1.10, CTR: 9.71, e: 2.80). The same occurs between points B_1 and C_1 on the left side of the axis. This can mislead a nonexpert user into an erroneous interpretation of the points that contribute most to this axis and an erroneous overall characterization of the factorial axis. Our proposed first interpretive axis, by contrast, leads directly to a correct interpretation of the most important points of the first factorial axis.
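The mismatch between distance and contribution can be checked with a short Python sketch that uses only the two sets of values quoted above. The helper name `top_point` is illustrative, and the sketch does not compute the interpretive coordinate e; it only compares the reported numbers.

```python
# Values for the two right-hand-side points quoted in the text (first axis).
points = {
    "A_5": {"F": 1.64, "CTR": 7.32, "e": 2.11},
    "B_5": {"F": 1.10, "CTR": 9.71, "e": 2.80},
}

def top_point(points, key):
    """Name of the point with the largest absolute value of `key`."""
    return max(points, key=lambda name: abs(points[name][key]))

for key in ("F", "CTR", "e"):
    print(key, "->", top_point(points, key))
```

Ranking by F puts A_5 first, while ranking by CTR or by the reported e values puts B_5 first, which is the ordering a reader needs in order to characterize the axis.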
Figure 4 compares the classical first factorial plane with the proposed first interpretive plane, which implements the geometrical locus of points (squares). On the first factorial plane, some of the most distant points are C_5 and B_1. To interpret this plane correctly, a user must manipulate the MCA output and perform additional calculations to extract the inertia of each point on each of the two axes before different points can be compared. A nonexpert user could therefore easily be led to an erroneous interpretation, taking C_5 and B_1 to be the most important points of the factorial plane. Our proposed visualization, on the other hand, enables a fast and accurate evaluation of the most important points of the factorial plane. As can be seen on the first interpretive plane (Figure 4), points C_5 (total inertia on the first factorial plane: 3.39) and B_1 (total inertia on the first factorial plane: 4.75) are less important than, for example, points C_1 (total inertia on the first factorial plane: 5.71) and B_5 (total inertia on the first factorial plane: 5.00), which are in fact the two most important points of the first plane. In our visualization, point C_1 lies on the most distant square and point B_5 on the second most distant square, and this observation “automatically” gives the correct interpretation. Readers can make similar comparisons for other points as well.
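The ordering that the squares are meant to convey can be reproduced from the four total-inertia values quoted above. The following Python sketch simply sorts them; the function name `rank_by_inertia` is illustrative, and the sketch does not construct the interpretive plane itself.

```python
# Total inertia on the first factorial plane, as quoted in the text.
plane_inertia = {"C_5": 3.39, "B_1": 4.75, "C_1": 5.71, "B_5": 5.00}

def rank_by_inertia(inertia):
    """Point names ordered from most to least important on the plane."""
    return sorted(inertia, key=inertia.get, reverse=True)

ranking = rank_by_inertia(plane_inertia)
print(ranking)  # ['C_1', 'B_5', 'B_1', 'C_5']
```

This is the numerical comparison a user would otherwise have to carry out from the tabular MCA output.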
Figure 5 shows two versions of the first factorial plane; the one on the right incorporates asymmetric biplot theory with Greenacre’s transformation (contribution biplots). In this transformation, points in standard coordinates are multiplied by the square root of the corresponding masses. This transformation differs from ours, but we acknowledge that it improves on the basic factorial plane visualization and on the other approaches discussed in the literature section. Compared with ours, however, this visualization lacks the extremely important geometrical locus of points, which eliminates the need for any further calculation to decide on the importance of a point. In Figure 5 (right plot), the arrows of points with similar arrow lengths need to be measured (with some calculation) and then compared. For example, points B_5 and B_1 have visually similar arrow lengths, so it is hard to tell which point contributes more without numerical evidence about those lengths. By contrast, our proposed interpretive plane, which uses the squares as a geometrical locus of points, improves on the approach depicted in Figure 5. In our proposed visualization (Figure 4, middle plot), points B_5 and B_1 are easily compared: B_5 lies on a more distant square than B_1, so it is more important than B_1 in the first factorial plane. For a correct interpretation, our proposed visualization requires no additional calculations, only observation.
6. Conclusions
This paper proposes a new visualization scheme, based on the interpretive coordinate, the interpretive axis and the interpretive plane, which addresses the problem nonexpert users face in finding and interpreting the most important points on MCA factorial maps. Several scholars have pointed out this research gap and provided corresponding solutions. We presented these solutions and compared them with our proposal, and we concluded that our interpretive plane with the squares is a quicker and overall better way to find and interpret the most important points of a factorial plane. The originality of our work is the discovery of the geometrical locus of points, which provides immediate visual identification of the most important points. In short, the further a point is from the origin of the interpretive axis, the more important it is for the factorial axis; likewise, the more distant the square on whose perimeter a point lies on the interpretive plane, the more important that point is to the factorial plane. This work can have practical implications by disseminating the use of MCA to a wider audience, and it opens a new window for theoretical research on the geometrical relations of the points in factorial maps. As the outcome of a transformation, interpretive coordinates cannot be used as input to other analysis methods (e.g. hierarchical clustering); users must use the original MCA coordinates for that purpose, which can be considered a limitation. Future research involves further theoretical investigation of the geometrical relationships of the points on factorial maps and the provision of information technology (IT) tools that will help to educate users about this new visualization scheme.
Supplementary material for this article can be found online at: https://drive.google.com/drive/folders/1Xxz4RxSaltYcwJLi2jiCbR4a-WjNklr4?usp=sharing
References
1. Nguyen S, Golas E, Zywiak W, Kennedy K. Dimension reduction in bankruptcy prediction: a case study of North American companies. Adv Bus Manag Forecast. 2019; 13: 83-92. Emerald Publishing. doi: 10.1108/S1477-407020190000013010.
2. Almeida R, Infantosi A, Suassuna J, Costa J. Multiple correspondence analysis in predictive logistic modelling: application to a living-donor kidney transplantation data. Comput Methods Programs Biomed. 2009; 95(2): 116-28. doi: 10.1016/j.cmpb.2009.02.003.
3. Ghosh D. Sufficient dimension reduction: an information-theoretic viewpoint. Entropy. 2022; 24(2), Art. no. 2. doi: 10.3390/e24020167.
4. Diday E. Principal component analysis for bar charts and metabins tables. Stat Anal Data Mining: ASA Data Sci J. 2013; 6(5): 403-30. doi: 10.1002/sam.11188.
5. Fernstad SJ, Shaw J, Johansson J. Quality-based guidance for exploratory dimensionality reduction. Inf Visualization. 2013; 12(1): 44-64. doi: 10.1177/1473871612460526.
6. Gardner-Lubbe S, Le Roux NJ, Maunders H, Shah V, Patwardhan S. Biplot methodology in exploratory analysis of microarray data. Stat Anal Data Mining: ASA Data Sci J. 2009; 2(2): 135-45. doi: 10.1002/sam.10038.
7. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Phil Trans R Soc A: Math Phys Eng Sci. 2016; 374(2065): 20150202. doi: 10.1098/rsta.2015.0202.
8. Kurita T. Principal component analysis (PCA). In: Computer vision: a reference guide. Cham: Springer International Publishing; 2019. p. 1-4. doi: 10.1007/978-3-030-03243-2_649-1.
9. Alsaqre F, Almathkour O. Moving objects classification via category-wise two-dimensional principal component analysis. Appl Comput Inform. 2020; 18(1/2): 136-50. doi: 10.1016/j.aci.2019.02.001.
10. AlNuaimi N, Masud MM, Serhani MA, Zaki N. Streaming feature selection algorithms for big data: a survey. Appl Comput Inform. 2020; 18(1/2): 113-35. doi: 10.1016/j.aci.2019.01.001.
11. Salem N, Hussein S. Data dimensional reduction and principal components analysis. Proced Comput Sci. 2019; 163: 292-9. doi: 10.1016/j.procs.2019.12.111.
12. Greenacre M, Blasius J, editors. Multiple correspondence analysis and related methods. New York: Chapman and Hall/CRC; 2006. doi: 10.1201/9781420011319.
13. Blasius J, Greenacre M. Visualization and verbalization of data. New York: Chapman and Hall/CRC; 2014.
14. Lê S, Josse J, Husson F. FactoMineR: an R package for multivariate analysis. J Stat Softw. 2008; 25: 1-18. doi: 10.18637/jss.v025.i01.
15. Saura J, Ribeiro-Soriano D, Palacios-Marqués D. Setting B2B digital marketing in artificial intelligence-based CRMs: a review and directions for future research. Ind Marketing Manag. 2021; 98: 161-78. doi: 10.1016/j.indmarman.2021.08.006.
16. Saura J, Palacios-Marqués D, Ribeiro-Soriano D. Digital marketing in SMEs via data-driven strategies: reviewing the current state of research. J Small Business Manag. 2021: 1-36. doi: 10.1080/00472778.2021.1955127.
17. Greenacre MJ. Interpreting multiple correspondence analysis. Appl Stochastic Models Data Anal. 1991; 7(2): 195-210. doi: 10.1002/asm.3150070208.
18. Le Roux B, Rouanet H. Multiple correspondence analysis. SAGE Publications; 2010. 163.
19. Moschidis OE. A different approach to multiple correspondence analysis (MCA) than that of specific MCA. Mathématiques Sciences Humaines Mathematics Soc Sci. 2009; 186. Art. no. 186. doi: 10.4000/msh.11091.
20. Kaciak E, Louviere J. Multiple correspondence analysis of multiple choice experiment data. J Marketing Res. 1990; 27(4): 455-65. doi: 10.1177/002224379002700407.
21. Lebart L. Techniques de la description statistique: Méthodes et logiciels pour l’analyse des grands tableaux. Paris: Dunod; 1977.
22. Abdi H, Valentin D. Multiple correspondence analysis. Encyclopedia Meas Stat. 2007; 2(4): 651-7.
23. Alberti G. CAinterprTools: an R package to help interpreting Correspondence Analysis' results. SoftwareX. 2015; 1(2): 26-31. doi: 10.1016/j.softx.2015.07.001.
24. Greenacre M. Contribution biplots. J Comput Graphical Stat. 2013; 22(1): 107-22. doi: 10.1080/10618600.2012.702494.
25. Gabriel KR, Odoroff CL. Biplots in biomedical research. Stat Med. 1990; 9(5): 469-85. doi: 10.1002/sim.4780090502.
26. Greenacre MJ. Biplots in correspondence analysis. J Appl Stat. 1993; 20(2): 251-69.
27. Nenadic O, Greenacre M. Correspondence analysis in R, with two- and three-dimensional graphics: the ca package. J Stat Softw. 2007; 20: 1-13. doi: 10.18637/jss.v020.i03.
28. Wickham H, et al. Welcome to the tidyverse. J Open Source Softw. 2019; 4(43): 1686.
29. Wickham H. Data analysis. In: ggplot2. Springer; 2016: 189-201.
30. Wickham H. ggplot2: elegant graphics for data analysis. 1st ed. 2009, corr. 3rd printing 2010. New York: Springer; 2010.
31. Wickham H, Grolemund G. R for data science: import, tidy, transform, visualize, and model data. Sebastopol, CA: O’Reilly Media; 2016.
32. Slowikowski K. ggrepel: automatically position non-overlapping text labels with ‘ggplot2’. 2021. R Package Version 0.9, CRAN repository. 2019; 1.
33. Sievert C. Interactive web-based data visualization with R, plotly, and shiny. Plotly-r.com; 2022. Available from: https://plotly-r.com/
34. Kassambara A. Practical guide to principal component methods in R: PCA, M(CA), FAMD, MFA, HCPC, factoextra. 2017; 2. STHDA.
35. Chang W, Cheng J, Allaire J, Xie Y, McPherson J. Package ‘shiny’; 2015. Available from: http://citeseerx.ist.psu.edu/viewdoc/download
36. Xie Y, Cheng J, Tan X. DT: a wrapper of the JavaScript library ‘DataTables’. R Package Version 0.4, CRAN repository. 2018.
37. Simmons A. this.path: get executing script's path, from ‘RStudio’, ‘Rgui’, ‘Rscript’ (shells including Windows command-line/Unix terminal), and ‘source’. 2022; 11. Available from: https://CRAN.R-project.org/package=this.path
38. Larsen AG, Andrade S. Package ‘soc.ca’; 2016. Available from: https://cran.r-project.org/web/packages/soc.ca/soc.ca.pdf
39. Auguie B. Extensions for 'ggplot2': custom geom, custom themes, plot alignment, labelled panels, symmetric scales, and fixed panel size [R package egg version 0.4.5]. Cran.r-project.org. 2022. Available from: https://cran.r-project.org/web/packages/egg/index.html