Predicting student performance in a blended learning environment using learning management system interaction data

Purpose – Student attrition in tertiary educational institutes may play a significant role in the achievement of core values that lead towards their strategic mission and financial well-being. Analysis of data generated from student interaction with learning management systems (LMSs) in blended learning (BL) environments may assist with the identification of students at risk of failing, but to what extent this may be possible is unknown. Existing studies are limited in addressing these issues at a significant scale.

Design/methodology/approach – This study develops a new approach harnessing applications of machine learning (ML) models on a publicly available dataset relevant to student attrition to identify potential students at risk. The dataset consists of the data generated by the interaction of students with an LMS in their BL environment.

Findings – Identifying students at risk through an innovative approach will promote timely intervention in the learning process, for example to improve student academic progress. To evaluate the performance of the proposed approach, the accuracy is compared with other representational ML methods.

Originality/value – The best ML algorithm, random forest, with 85% accuracy, is selected to support educators in implementing various pedagogical practices to improve students' learning.


Introduction
Student attrition [1] is of great concern to, and an extraordinarily challenging issue to address for, higher education (HE) providers. Various factors contribute to student attrition [1,2], such as withdrawal from courses because of academic failure, peer pressure, financial issues, inter-institutional transfer, employment-related factors or myriad personal reasons.
Academic progress is frequently cited as a key factor associated with student attrition [3], with HE providers offering various interventions to improve it. The objective of one form of intervention, the student support program, is to extend additional, tailored academic support, typically involving academic or language services, to students experiencing academic problems. By doing so the student attrition rate is reduced, and the reputation and financial viability of the HE institution are maintained or even improved.
While support programs may address some issues with attrition rates [4], the first step in this process, the identification of at-risk students, is a manual and time-consuming exercise that can be biased by personnel involvement. Moreover, the delay between identification of an at-risk student, the onset of intervention and any assessment of the effect of this intervention can be lengthy. Early, if not real-time, identification of struggling students is preferable because it would enable educators to provide timely and appropriate support to students when it is most needed and effective [5], and would allow almost real-time assessment of the effects of any intervention.
Various techniques have been used to predict student academic progress [6] through the application of different machine learning (ML) algorithms on student demographic, socioeconomic, pre-enrollment, enrollment, academic and learning management system (LMS) data [6,7], the latter automatically generated through student interaction. Approaches to identify these at-risk students based on data related to socioeconomic and cultural factors [4,8] lack precision. We apply ML algorithms to identify struggling students accurately and rapidly from a dataset collected from student LMS interaction. We investigate how the application of existing ML techniques can more accurately and rapidly identify at-risk students.
After a brief review of related work, we define our research question, explain the dataset that we use and any necessary pre-processing of it, the ML algorithms used for data analysis and the various classification techniques that we employ. We conclude this contribution with an evaluation and brief discussion of results from the tree-based classification algorithms, implications of this research and future research directions.

Related works
The application of ML techniques to predict and improve student performance, recommend learning resources and identify at-risk students has increased in recent years. Two main factors affect the identification of students at risk using ML: the dataset and delivery mode, and the type of ML algorithm used. We took stock of recent literature to analyze a wide variety of dataset features, delivery modes and ML techniques for predicting student performance; this analysis is presented as supplementary material at: https://github.com/KFVU/ML-C/blob/b32b95655f13ce4f623e703df6b07b850688d8eb/1.pdf.

Important attributes for predicting student academic performance
Aspects of a student's demographic and socio-economic background (e.g. place of birth, disability, parent academic and job background, residing region, gender, socioeconomic index, health insurance, frequency of going out with friends (weekday and weekend) and financial status) [4, 8-13], pre-enrollment (e.g. high school or level 12 performance and grades, entrance qualification, SAT scores, English and math grades, awards and the school they attended) [4, 8-10, 14, 15], enrollment (e.g. enrollment date, enrollment test marks, the number of courses students previously enrolled in, type of study program and study mode) [16,17], tertiary academic (e.g. attendance, number of assessment submissions, student engagement ratio, major, time left to complete the degree, course credits, semester work marks, placements and count and date of attempted exams) [4, 14-18] and LMS-based data have all been studied in previous analyses regarding the prediction of student academic performance.
Student records such as grade point average (GPA) have frequently been used as a categorical variable, as have the semester or final results of a student [4, 8-12, 14-20] and the graduation or dropout status of a student [16]. These are considered to be significant indicators of academic potential. Therefore, we consider the final result of a student to be the nominal variable on which basis we assess a student's study performance.
Few studies have used LMS-generated data to predict student achievement. Attributes include the frequency of interaction of a student with each module on LMS [21], LMS log data, counts of hits, forum post details, counts of assessments viewed and submitted on LMS [11], start and end dates and assessment submission dates [20]. LMS data are automatically generated and stored by the LMS, which is cost-effective, and the data are accessible and relatively easy to analyze. LMS data provide complete information about a student's engagement in online learning sessions and workshops. Few studies have researched correlations between LMS attributes, selection of relevant attributes and tuning of classifier algorithm parameters for accurate prediction of student progress. To our knowledge, the use of student learning behavior and LMS participation in blended learning (BL) has not been previously investigated. Additionally, the focus of most studies was not on the early detection of at-risk students for the purposes of taking timely action to implement remedial measures to improve their progress.
Most research datasets have been acquired from traditional face-to-face or online classroom settings [19], although several studies have used datasets obtained from BL [11,21]. The BL approach combines traditional classroom environments with online learning, starting from a 10%-25% digital to 90%-75% classroom ratio, to the reverse situation, a 75%-90% digital component. BL represents a transition from synchronous to asynchronous learning and potentially enriches and extends the opportunities for students to learn in ways that were previously unachievable.
We apply decision tree based classification methods because of their simplicity, appropriateness and ease of interpretation. We apply frequently used tree-based algorithms to an LMS-based dataset to identify at-risk students and thereby address our research question: "what is the most feasible ML approach to apply to an LMS-interaction dataset to accurately identify at-risk students?"

Method
We source freely available data from the UCI (University of California, Irvine) ML repository [26] which comprises 230,318 data instances built from the recordings of about 112 students' activities and interactions while learning with LMS in six laboratory sessions conducted in a simulated e-learning environment. Data were collected from LMS logs, transformed and cleaned into a format appropriate for public dissemination.
We build a dimensional vector using student LMS interaction data, which is then transformed to include response features. This transformed dataset is then used to build an algorithm to identify students at risk. To predict at-risk students, we use five tree-based classifiers (random forest, J48, NBTree, OneR, decision stump), which use a series of if-then decisions to generate highly accurate, easily interpretable predictions. We then compare the performance of these different classification methods using various metrics (e.g. accuracy, precision, recall and F-measure). Each classification method is then tuned using a boosting ensemble method. Finally, based on accuracy, we identify which classifier is most appropriate for building our algorithm to identify at-risk students.
The dataset consists of multiple comma-separated value (csv) files. One set of csv files contains information regarding sessions and students. Each folder represents a session, and each csv file contains data for a specific student identified by their student Id. Each csv file contains data about all exercises performed by each student in a specific session. Each record comprises information regarding dimensions, the activities a student attempted during a specific session for a specific exercise, the start and end times of the activity and other related features. Two additional files contain the final exam grade of a student and attendance records for each student for each session. A summary of dataset dimensions may be perused as supplementary material at: https://github.com/KFVU/ML-C/blob/b32b95655f13ce4f623e703df6b07b850688d8eb/2.pdf.
Our simulation research methods involve dataset cleaning and pre-processing, application of different classification algorithms to the dataset and selection of the most accurate model to predict student performance. This step-by-step process is depicted in Figure 1. This framework explains the series of rigorous and iterative phases required to develop an innovative educational artifact (predictive model) for predicting student progress based on ML techniques [27].
Step 1: Data collection and feature exploration. UCI data comprising 230,318 data instances based on activities and interactions of 112 students with an e-learning system in six sessions were sourced. Each csv file in this dataset consists of 13 attributes of text data alongside numerical attributes (i.e. Exercise, SessionID, Activity, StudentID, Start-time, Idle-time, End-time, Mouse-wheel, Mouse-click-left, Mouse-wheel-click, Mouse-click-right, Keystroke and Mouse-movement). Additional files contain intermediate and final student marks. All csv files for sessions were combined into a single csv file to transform the mixed attribute dataset into a numerical feature dataset with nine attributes. The transformed dataset was obtained by aggregating the attributes for each student for all sessions using an algorithm that creates and validates the dimensional vector V; the detailed algorithm is given as supplementary material at: https://github.com/KFVU/ML-C/blob/b32b95655f13ce4f623e703df6b07b850688d8eb/3.pdf.
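The aggregation step can be illustrated with a minimal sketch. It is not the supplementary algorithm itself: the feature names, the choice of summation as the aggregation function and the record layout are all assumptions made for illustration, and only four of the nine numerical attributes are shown.

```python
from collections import defaultdict

# Hypothetical subset of the numerical attributes; the nine attributes
# actually retained in the study are not listed in the text.
FEATURES = ["idle_time", "mouse_wheel", "mouse_click_left", "keystroke"]

def build_vector(records):
    """Aggregate per-activity log records into one row per student.

    `records` is an iterable of dicts holding a "student_id" key and
    numeric feature values. Summation is an assumed aggregation.
    """
    totals = defaultdict(lambda: dict.fromkeys(FEATURES, 0))
    for rec in records:
        values = [rec.get(name) for name in FEATURES]
        # Drop records with missing or negative values before aggregating.
        if any(v is None or v < 0 for v in values):
            continue
        row = totals[rec["student_id"]]
        for name, value in zip(FEATURES, values):
            row[name] += value
    return dict(totals)

# Toy records, invented for illustration only.
records = [
    {"student_id": 1, "idle_time": 10, "mouse_wheel": 2, "mouse_click_left": 5, "keystroke": 40},
    {"student_id": 1, "idle_time": 5, "mouse_wheel": 0, "mouse_click_left": 3, "keystroke": 60},
    {"student_id": 2, "idle_time": 30, "mouse_wheel": 1, "mouse_click_left": 1, "keystroke": 12},
    {"student_id": 2, "idle_time": -1, "mouse_wheel": 1, "mouse_click_left": 1, "keystroke": 9},
]
V = build_vector(records)
```

Each student ends up with a single numeric row, which is the shape a classifier expects; the record with a negative idle time is excluded from the totals.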
The algorithm builds the dimensional vector V and verifies the correctness of the aggregated data it contains. Null, empty, or negative values are removed from the dataset. V is first built using aggregated values of each feature for each student, and the total final marks for students are merged with V using StudentID. This extracts records of students who attended all sessions and the final exam. In theory, V should contain data only about the students who attended all sessions, which can be verified against the attendance data in logs.txt. A Boolean attribute DV is created for this rule and a StudentID attribute is created to store the StudentID attribute of each row of V. A variable (totalAttendance) is computed for all rows of V and compared with n, the total number of sessions a student is expected to have attended, which in this study is 6 (i.e. n = 6). If totalAttendance equals n, the DV value becomes true; if not, it becomes false. This verifies that, for each selected student instance, the sum of attendance equals 6.
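The attendance rule can be sketched as follows. The dictionary layout for attendance flags is an assumed stand-in for the contents of logs.txt, whose actual format is not described here, and the function name is hypothetical.

```python
EXPECTED_SESSIONS = 6  # n = 6 in this study

def verify_attendance(vector, attendance):
    """Return {student_id: DV}; DV is True when total attendance equals n.

    `attendance` maps a student id to a list of 0/1 flags, one per
    session (an assumed layout, not the real log format).
    """
    return {
        student_id: sum(attendance.get(student_id, [])) == EXPECTED_SESSIONS
        for student_id in vector
    }

# Toy data, invented for illustration only.
vector = {1: {"keystroke": 100}, 2: {"keystroke": 12}}
attendance = {1: [1, 1, 1, 1, 1, 1], 2: [1, 0, 1, 1, 1, 1]}

dv = verify_attendance(vector, attendance)
# Keep only the rows whose DV flag is true.
verified = {sid: row for sid, row in vector.items() if dv[sid]}
```

Student 2 missed one session in this toy data, so only student 1 remains in the verified vector.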
Step 2: Dataset pre-processing. Dataset pre-processing involves the cleaning of variables and data instances and converting the dataset into a csv file as an outcome of algorithm 1. After aggregating data instances, the numeric values of final result marks are classified into the categorical variables "Pass" or "Fail," which account for 62% and 38% of the dataset, respectively.
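The conversion of numeric marks into the two classes can be sketched as below. The pass mark is not stated in the text, so the cutoff is left as a caller-supplied parameter and the value used here is purely illustrative, as are the marks.

```python
from collections import Counter

def label_results(marks, cutoff):
    """Map numeric final marks to "Pass"/"Fail" using the given cutoff."""
    return {sid: ("Pass" if mark >= cutoff else "Fail") for sid, mark in marks.items()}

def class_shares(labels):
    """Return the fraction of students in each class."""
    counts = Counter(labels.values())
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy marks and an illustrative cutoff; neither comes from the study.
marks = {1: 72.0, 2: 41.5, 3: 55.0, 4: 88.0, 5: 30.0}
labels = label_results(marks, cutoff=50.0)
shares = class_shares(labels)
```

Checking the class shares after labeling is a quick way to see how imbalanced the two classes are before training.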
Step 3: Feature selection. Suitable features are selected in exploratory data analysis, which affects the prediction result. A correlation heatmap is produced using Pandas, an open-source Python library for data analysis and manipulation; the sign of each value indicates a positive (+ve) or negative (-ve) correlation with the final score. For example, if "keystroke" is high or "idle_time" is low then there is a higher probability that the final score is higher. Attribute correlations are depicted using a heatmap (Figure 2). This exploratory analysis supports the selection of features for building classifiers. Because ensemble techniques effectively improve the performance of early prediction models, we use five classifiers to first train our model. An adaptive boosting (AdaBoost) technique is used in the subsequent iteration to improve classification accuracy; this ensemble boosting technique learns from the previous misclassification of data points by increasing their weights and boosting decision trees.
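The sign-based reading of the correlation values can be illustrated with a plain-Python Pearson coefficient. This is a stand-in for the library call rather than the code used in the study, and the feature values and scores are toy numbers chosen only to show one positive and one negative correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy aggregated features and final scores, invented for illustration.
features = {
    "keystroke": [120, 300, 80, 410, 220],
    "idle_time": [50, 20, 65, 10, 30],
}
final_score = [48, 71, 40, 90, 62]

# One row of the correlation matrix: each feature against the final score.
correlations = {name: pearson(values, final_score) for name, values in features.items()}
```

With these toy values the coefficient for "keystroke" is positive and the one for "idle_time" is negative, matching the interpretation given above.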
Step 4: Machine learning models (classification algorithms). To apply the ML classification techniques we used the Waikato Environment for Knowledge Analysis (WEKA), distributed under the GNU General Public License. This workbench offers a wide collection of classification ML algorithms and visualization features. We loaded cleaned and aggregated data into WEKA to apply the classification ML algorithms. We used supervised learning methods to train the model, where the model learns from labeled classes (e.g. Pass, Fail). Random forest, J48, OneR, NBTree and decision stump were used to classify at-risk students. A 10-fold cross-validation split the student dataset into 10 groups of approximately equal size, wherein each group in turn was treated as the validation group and the classifier was trained on the nine remaining groups (10 repetitions in total). Results for each group are summarized using evaluation scores. Classifier accuracy is presented in Table 1. Classifiers were then tuned with AdaBoost, which sequentially trained several models and combined multiple weak models into a single strong classifier. The tuned classifier was applied to the updated dataset obtained from the former step. The 10-fold cross-validation method is again used to evaluate the tuned classifiers (Table 1). Dataset pre-processing, feature selection by exploratory data analysis, ML classifier training and analysis and comparison of performance metrics are iterative steps required to filter attributes and tune the model.
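To make the two mechanisms in this step concrete, the sketch below implements k-fold cross-validation and AdaBoost over decision stumps in plain Python. It is not the WEKA implementation used in the study. The fold assignment (by index, without shuffling), the number of boosting rounds, the +1/-1 encoding of Pass/Fail and the one-feature toy data are all assumptions.

```python
import math

def train_stump(X, y, weights):
    """Fit a one-split decision stump minimizing weighted error (labels +1/-1)."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[feature] <= threshold else -sign for row in X]
                error = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best

def stump_predict(stump, row):
    _, feature, threshold, sign = stump
    return sign if row[feature] <= threshold else -sign

def adaboost_fit(X, y, rounds=5):
    """AdaBoost: train stumps sequentially, reweighting misclassified points."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, weights)
        error = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Misclassified points gain weight; correctly classified ones lose it.
        weights = [w * math.exp(-alpha * t * stump_predict(stump, row))
                   for w, row, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

def k_fold_accuracy(X, y, k, fit, predict):
    """Hold out each of k folds once; return accuracy over all held-out rows."""
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

# Toy one-feature dataset: +1 ("Pass") when the feature value is at least 10.
X = [[v] for v in range(20)]
y = [1 if v >= 10 else -1 for v in range(20)]

stump_acc = k_fold_accuracy(
    X, y, 10,
    fit=lambda Xt, yt: train_stump(Xt, yt, [1.0 / len(Xt)] * len(Xt)),
    predict=stump_predict,
)
boosted_acc = k_fold_accuracy(X, y, 10, fit=adaboost_fit, predict=adaboost_predict)
```

Every row is used for validation exactly once, so the reported accuracy is computed only on data the model did not see during training. On a toy dataset this simple, boosting has little room to improve on a single stump.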
The process of predicting student progress (steps 1 to 4) using the most accurate classification method is then automated by one more algorithm, which develops and verifies the dimensional vector V (step 1); an overview of this algorithm may be perused as supplementary material at: https://github.com/KFVU/ML-C/blob/b32b95655f13ce4f623e703df6b07b850688d8eb/4.pdf. The same five ML classifiers (random forest, J48, NBTree, OneR, decision stump) are again applied to V using k-fold cross-validation. The boosting ensemble technique is then applied on V using the five classification algorithms with k-fold cross-validation. The accuracy of the different ML methods is saved in vector PM when the ensemble technique is not applied and in vector PME when it is. The performance metrics of the five classification models are then compared and the most accurate method is selected. This algorithm fully automates the process of creating the dimensional vector, selecting the best classifier and identifying students with learning difficulties. The process of remedial activities to improve student learning can then commence.
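The final selection step can be sketched as a comparison over the two accuracy vectors. The dictionary representation of PM and PME and the function name are assumptions, and the accuracies are placeholders rather than results from this study.

```python
def select_best(pm, pme):
    """Return (classifier name, boosted flag, accuracy) for the top accuracy.

    `pm` holds accuracies without the ensemble technique and `pme`
    holds accuracies with it, both keyed by classifier name.
    """
    candidates = [(acc, name, False) for name, acc in pm.items()]
    candidates += [(acc, name, True) for name, acc in pme.items()]
    acc, name, boosted = max(candidates)
    return name, boosted, acc

# Placeholder accuracies for illustration; not results from the study.
pm = {"random forest": 0.70, "J48": 0.65, "OneR": 0.60}
pme = {"random forest": 0.75, "J48": 0.72, "OneR": 0.60}

best_name, best_boosted, best_acc = select_best(pm, pme)
```

Keeping both vectors lets the comparison report whether the winning configuration relied on the ensemble technique.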

Evaluation and discussion
Different performance metrics (classification accuracy, precision, recall, F-measure, root mean square error and incorrectly identified instances) are used to evaluate the five algorithms, as presented in Figure 3. Classification accuracy, which is calculated from the confusion matrix, is used to select the best-performing classifier. All five classifiers perform well with high accuracy, demonstrating the feasibility and effectiveness of dataset pre-processing and feature selection. The objective is to maximize true negatives (TN) and minimize false positives (FP). Confusion matrices for the five classifiers used to evaluate performance are presented in Figure 4.
To provide timely and appropriate support it is important that our model is accurate. Performance metrics for selected tree-based ML classification algorithms with and without boosting ensemble tuning are presented in Table 1. Classification accuracy (%) represents the ratio of correct classifier predictions to the total number of observations (3100). Random forest with and without ensemble tuning outperforms other classifiers in classification accuracy, precision, recall, F-measure, root mean square error and incorrectly identified instances. A comparison of the performance of the five ML models with and without boosting ensemble tuning is presented in Figure 5. ML classification models are more accurate with boosting ensemble tuning, and the random forest method again outperforms other classifiers in both cases.
Of the five classifiers, the precision (the ratio of true positives (TP) to the sum of all positive instances identified by the classifier) and F-measure are highest for random forest. Higher precision is preferable because it means fewer instances of FP. The F-measure indicates that this classifier has low FP and false negatives (FN). Ensemble tuning also reduces the prediction error of FP in the "Fail" class using random forest.
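The relationship between the confusion matrix and these metrics can be written out as a short sketch using the standard definitions. The counts are toy values, not figures from Table 1 or Figure 4, and which class is treated as positive is left to the caller.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Toy counts for illustration only.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
```

Because the F-measure falls when either FP or FN grows, a high value indicates that both error types are low at once.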
Random forest may perform better than other classifiers for a number of reasons. It may improve accuracy because the boosting ensemble method can vote high-ranking instances. It also does not prune trees like other tree-based algorithms. At each tree node, splitting is considered for a random subset of features, resulting in features being split into more and smaller random subsets, increasing the diversity among the forest of trees, leading to its outperformance compared to other decision tree based algorithms. Random forest also uses bagging and generates a forest based on the subset of the model features. The combination of bagging and boosting may reduce overfitting and bias issues, thereby reducing prediction variance.
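The bagging and random feature subset ideas can be sketched in plain Python. To stay short, each learner here is a single-split stump rather than a full unpruned tree, so this is a simplification and not the WEKA random forest; the tree count, subset size, seed and toy data are assumptions.

```python
import random
from collections import Counter

def fit_stump(X, y, features):
    """Best single-feature threshold split, searched over `features` only."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            for sign in (1, -1):
                errors = sum((sign if row[f] <= threshold else -sign) != t
                             for row, t in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, sign)
    return best[1:]

def fit_forest(X, y, n_trees=15, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: draw a bootstrap sample of the rows, with replacement.
        idx = [rng.randrange(len(X)) for _ in X]
        # The split considers only a random subset of the features.
        features = rng.sample(range(len(X[0])), n_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def forest_predict(forest, row):
    """Majority vote over all learners in the forest."""
    votes = Counter(sign if row[f] <= threshold else -sign
                    for f, threshold, sign in forest)
    return votes.most_common(1)[0][0]

# Toy data with two features: [activity level, idle level]; +1 means "Pass".
X = [[i, 20 - i] for i in range(20)]
y = [1 if i >= 10 else -1 for i in range(20)]

forest = fit_forest(X, y)
high_activity = forest_predict(forest, [18, 2])
low_activity = forest_predict(forest, [1, 19])
```

Each learner sees a different sample and a different feature subset, so individual errors tend to differ and are outvoted in the majority decision.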
[Figure 5. Accuracy comparison of five ML techniques, showing accuracy without ensemble-method-based tuning and the increase in accuracy with ensemble-method-based tuning.]
We introduce an algorithmic method to construct and evaluate ML models to develop an educational decision support system (EDSS) that accurately identifies students at risk of failure (thereby increasing retention rate). We build the model to analyze the learning behavior of a student, and it automatically and accurately classifies those students "at risk" of failure.
Our objective is to support educator efforts to improve teaching and learning through the use of an EDSS artifact, instead of using traditional, time-consuming methods that involve a heavy administrative workload. This artifact enables near real-time identification of struggling students and the timely implementation of appropriate interventions to enhance their progress. We consider that this timely detection and measurement of at-risk students will contribute to improved progress of some struggling students, increase retention and decrease attrition rates as a consequence, and have positive and cascading impacts on the student, on institutional reputation and on institutional finances. Cascading effects could include those on a nation's economy because it is anticipated that qualified students would be better positioned to repay study debts.

Conclusion
We outline a new framework based on ML for improving the academic performance of a student, with appropriate intervention. The proposed ML-based EDSS framework offers better options in terms of the accuracy of classification models. Of the five ML classification algorithms that we appraise, we find random forest to be the best at classifying students at risk based on their interaction with the LMS. Our automation of this process enables almost real-time identification of at-risk students, which is beneficial from both academic and administrative perspectives. This framework could be set to alert educators to students who are likely to struggle, triggering support or remedial assistance to help them pass.
Future research might use enhanced datasets that incorporate behavioral attributes such as interaction with other students, teamwork participation and other student academic attributes to further enhance the model application. Additionally, the dataset could be enhanced by adding new grade levels (other than the binary modes of Pass and Fail), treating it as a multi-class classification problem. Deep learning techniques or classification techniques other than tree-based classifiers, and parameter tuning with the Weka class-balancer, could also be applied, which may increase model accuracy.
Note 1. In this study, we relate student attrition to students' readiness for learning and capability development, in terms of the talented effort that may help them succeed in higher education.