Search results
1 – 10 of 77Rahila Umer, Teo Susnjak, Anuradha Mathrani and Suriadi Suriadi
The purpose of this paper is to propose a process mining approach to help in making early predictions to improve students’ learning experience in massive open online courses…
Abstract
Purpose
The purpose of this paper is to propose a process mining approach to help in making early predictions to improve students’ learning experience in massive open online courses (MOOCs). It investigates the impact of various machine learning techniques in combination with process mining features to measure effectiveness of these techniques.
Design/methodology/approach
Student’s data (e.g. assessment grades, demographic information) and weekly interaction data based on event logs (e.g. video lecture interaction, solution submission time, time spent weekly) have guided this design. This study evaluates four machine learning classification techniques used in the literature (logistic regression (LR), Naïve Bayes (NB), random forest (RF) and K-nearest neighbor) to monitor weekly progression of students’ performance and to predict their overall performance outcome. Two data sets – one, with traditional features and second, with features obtained from process conformance testing – have been used.
Findings
The results show that techniques used in the study are able to make predictions on the performance of students. Overall accuracy (F1-score, area under curve) of machine learning techniques can be improved by integrating process mining features with standard features. Specifically, the use of LR and NB classifiers outperforms other techniques in a statistical significant way.
Practical implications
Although MOOCs provide a platform for learning in highly scalable and flexible manner, they are prone to early dropout and low completion rate. This study outlines a data-driven approach to improve students’ learning experience and decrease the dropout rate.
Social implications
Early predictions based on individual’s participation can help educators provide support to students who are struggling in the course.
Originality/value
This study outlines the innovative use of process mining techniques in education data mining to help educators gather data-driven insight on student performances in the enrolled courses.
Details
Keywords
Loan default risk or credit risk evaluation is important to financial institutions which provide loans to businesses and individuals. Loans carry the risk of being defaulted. To…
Abstract
Purpose
Loan default risk or credit risk evaluation is important to financial institutions which provide loans to businesses and individuals. Loans carry the risk of being defaulted. To understand the risk levels of credit users (corporations and individuals), credit providers (bankers) normally collect vast amounts of information on borrowers. Statistical predictive analytic techniques can be used to analyse or to determine the risk levels involved in loans. This paper aims to address the question of default prediction of short-term loans for a Tunisian commercial bank.
Design/methodology/approach
The authors have used a database of 924 files of credits granted to industrial Tunisian companies by a commercial bank in the years 2003, 2004, 2005 and 2006. The naive Bayesian classifier algorithm was used, and the results show that the good classification rate is of the order of 63.85 per cent. The default probability is explained by the variables measuring working capital, leverage, solvency, profitability and cash flow indicators.
Findings
The results of the validation test show that the good classification rate is of the order of 58.66 per cent; nevertheless, the error types I and II remain relatively high at 42.42 and 40.47 per cent, respectively. A receiver operating characteristic curve is plotted to evaluate the performance of the model. The result shows that the area under the curve criterion is of the order of 69 per cent.
Originality/value
The paper highlights the fact that the Tunisian central bank obliged all commercial banks to conduct a survey study to collect qualitative data for better credit notation of the borrowers.
Propósito
El riesgo de incumplimiento de préstamos o la evaluación del riesgo de crédito es importante para las instituciones financieras que otorgan préstamos a empresas e individuos. Existe el riesgo de que el pago de préstamos no se cumpla. Para entender los niveles de riesgo de los usuarios de crédito (corporaciones e individuos), los proveedores de crédito (banqueros) normalmente recogen gran cantidad de información sobre los prestatarios. Las técnicas analíticas predictivas estadísticas pueden utilizarse para analizar o determinar los niveles de riesgo involucrados en los préstamos. En este artículo abordamos la cuestión de la predicción por defecto de los préstamos a corto plazo para un banco comercial tunecino.
Diseño/metodología/enfoque
Utilizamos una base de datos de 924 archivos de créditos concedidos a empresas industriales tunecinas por un banco comercial en 2003, 2004, 2005 y 2006. El algoritmo bayesiano de clasificadores se llevó a cabo y los resultados muestran que la tasa de clasificación buena es del orden del 63.85%. La probabilidad de incumplimiento se explica por las variables que miden el capital de trabajo, el apalancamiento, la solvencia, la rentabilidad y los indicadores de flujo de efectivo.
Hallazgos
Los resultados de la prueba de validación muestran que la buena tasa de clasificación es del orden de 58.66% ; sin embargo, los errores tipo I y II permanecen relativamente altos, siendo de 42.42% y 40.47%, respectivamente. Se traza una curva ROC para evaluar el rendimiento del modelo. El resultado muestra que el criterio de área bajo curva (AUC, por sus siglas en inglés) es del orden del 69%.
Originalidad/valor
El documento destaca el hecho de que el Banco Central tunecino obligó a todas las entidades del sector llevar a cabo un estudio de encuesta para recopilar datos cualitativos para un mejor registro de crédito de los prestatarios.
Palabras clave
Curva ROC, Evaluación de riesgos, Riesgo de incumplimiento, Sector bancario, Algoritmo clasificador bayesiano.
Tipo de artículo
Artículo de investigación
Details
Keywords
Spam emails classification using data mining and machine learning approaches has enticed the researchers' attention duo to its obvious positive impact in protecting internet…
Abstract
Spam emails classification using data mining and machine learning approaches has enticed the researchers' attention duo to its obvious positive impact in protecting internet users. Several features can be used for creating data mining and machine learning based spam classification models. Yet, spammers know that the longer they will use the same set of features for tricking email users the more probably the anti-spam parties might develop tools for combating this kind of annoying email messages. Spammers, so, adapt by continuously reforming the group of features utilized for composing spam emails. For that reason, even though traditional classification methods possess sound classification results, they were ineffective for lifelong classification of spam emails duo to the fact that they might be prone to the so-called “Concept Drift”. In the current study, an enhanced model is proposed for ensuring lifelong spam classification model. For the evaluation purposes, the overall performance of the suggested model is contrasted against various other stream mining classification techniques. The results proved the success of the suggested model as a lifelong spam emails classification method.
Details
Keywords
Financial health of a corporation is a great concern for every investor level and decision-makers. For many years, financial solvency prediction is a significant issue throughout…
Abstract
Purpose
Financial health of a corporation is a great concern for every investor level and decision-makers. For many years, financial solvency prediction is a significant issue throughout academia, precisely in finance. This requirement leads this study to check whether machine learning can be implemented in financial solvency prediction.
Design/methodology/approach
This study analyzed 244 Dhaka stock exchange public-listed companies over the 2015–2019 period, and two subsets of data are also developed as training and testing datasets. For machine learning model building, samples are classified as secure, healthy and insolvent by the Altman Z-score. R statistical software is used to make predictive models of five classifiers and all model performances are measured with different performance metrics such as logarithmic loss (logLoss), area under the curve (AUC), precision recall AUC (prAUC), accuracy, kappa, sensitivity and specificity.
Findings
This study found that the artificial neural network classifier has 88% accuracy and sensitivity rate; also, AUC for this model is 96%. However, the ensemble classifier outperforms all other models by considering logLoss and other metrics.
Research limitations/implications
The major result of this study can be implicated to the financial institution for credit scoring, credit rating and loan classification, etc. And other companies can implement machine learning models to their enterprise resource planning software to trace their financial solvency.
Practical implications
Finally, a predictive application is developed through training a model with 1,200 observations and making it available for all rational and novice investors (Abdullah, 2020).
Originality/value
This study found that, with the best of author expertise, the author did not find any studies regarding machine learning research of financial solvency that examines a comparable number of a dataset, with all these models in Bangladesh.
Details
Keywords
Babitha Philip and Hamad AlJassmi
To proactively draw efficient maintenance plans, road agencies should be able to forecast main road distress parameters, such as cracking, rutting, deflection and International…
Abstract
Purpose
To proactively draw efficient maintenance plans, road agencies should be able to forecast main road distress parameters, such as cracking, rutting, deflection and International Roughness Index (IRI). Nonetheless, the behavior of those parameters throughout pavement life cycles is associated with high uncertainty, resulting from various interrelated factors that fluctuate over time. This study aims to propose the use of dynamic Bayesian belief networks for the development of time-series prediction models to probabilistically forecast road distress parameters.
Design/methodology/approach
While Bayesian belief network (BBN) has the merit of capturing uncertainty associated with variables in a domain, dynamic BBNs, in particular, are deemed ideal for forecasting road distress over time due to its Markovian and invariant transition probability properties. Four dynamic BBN models are developed to represent rutting, deflection, cracking and IRI, using pavement data collected from 32 major road sections in the United Arab Emirates between 2013 and 2019. Those models are based on several factors affecting pavement deterioration, which are classified into three categories traffic factors, environmental factors and road-specific factors.
Findings
The four developed performance prediction models achieved an overall precision and reliability rate of over 80%.
Originality/value
The proposed approach provides flexibility to illustrate road conditions under various scenarios, which is beneficial for pavement maintainers in obtaining a realistic representation of expected future road conditions, where maintenance efforts could be prioritized and optimized.
Details
Keywords
Kumash Kapadia, Hussein Abdel-Jaber, Fadi Thabtah and Wael Hadi
Indian Premier League (IPL) is one of the more popular cricket world tournaments, and its financial is increasing each season, its viewership has increased markedly and the…
Abstract
Indian Premier League (IPL) is one of the more popular cricket world tournaments, and its financial is increasing each season, its viewership has increased markedly and the betting market for IPL is growing significantly every year. With cricket being a very dynamic game, bettors and bookies are incentivised to bet on the match results because it is a game that changes ball-by-ball. This paper investigates machine learning technology to deal with the problem of predicting cricket match results based on historical match data of the IPL. Influential features of the dataset have been identified using filter-based methods including Correlation-based Feature Selection, Information Gain (IG), ReliefF and Wrapper. More importantly, machine learning techniques including Naïve Bayes, Random Forest, K-Nearest Neighbour (KNN) and Model Trees (classification via regression) have been adopted to generate predictive models from distinctive feature sets derived by the filter-based methods. Two featured subsets were formulated, one based on home team advantage and other based on Toss decision. Selected machine learning techniques were applied on both feature sets to determine a predictive model. Experimental tests show that tree-based models particularly Random Forest performed better in terms of accuracy, precision and recall metrics when compared to probabilistic and statistical models. However, on the Toss featured subset, none of the considered machine learning algorithms performed well in producing accurate predictive models.
Details
Keywords
Karlo Puh and Marina Bagić Babac
As the tourism industry becomes more vital for the success of many economies around the world, the importance of technology in tourism grows daily. Alongside increasing tourism…
Abstract
Purpose
As the tourism industry becomes more vital for the success of many economies around the world, the importance of technology in tourism grows daily. Alongside increasing tourism importance and popularity, the amount of significant data grows, too. On daily basis, millions of people write their opinions, suggestions and views about accommodation, services, and much more on various websites. Well-processed and filtered data can provide a lot of useful information that can be used for making tourists' experiences much better and help us decide when selecting a hotel or a restaurant. Thus, the purpose of this study is to explore machine and deep learning models for predicting sentiment and rating from tourist reviews.
Design/methodology/approach
This paper used machine learning models such as Naïve Bayes, support vector machines (SVM), convolutional neural network (CNN), long short-term memory (LSTM) and bidirectional long short-term memory (BiLSTM) for extracting sentiment and ratings from tourist reviews. These models were trained to classify reviews into positive, negative, or neutral sentiment, and into one to five grades or stars. Data used for training the models were gathered from TripAdvisor, the world's largest travel platform. The models based on multinomial Naïve Bayes (MNB) and SVM were trained using the term frequency-inverse document frequency (TF-IDF) for word representations while deep learning models were trained using global vectors (GloVe) for word representation. The results from testing these models are presented, compared and discussed.
Findings
The performance of machine and learning models achieved high accuracy in predicting positive, negative, or neutral sentiments and ratings from tourist reviews. The optimal model architecture for both classification tasks was a deep learning model based on BiLSTM. The study’s results confirmed that deep learning models are more efficient and accurate than machine learning algorithms.
Practical implications
The proposed models allow for forecasting the number of tourist arrivals and expenditure, gaining insights into the tourists' profiles, improving overall customer experience, and upgrading marketing strategies. Different service sectors can use the implemented models to get insights into customer satisfaction with the products and services as well as to predict the opinions given a particular context.
Originality/value
This study developed and compared different machine learning models for classifying customer reviews as positive, negative, or neutral, as well as predicting ratings with one to five stars based on a TripAdvisor hotel reviews dataset that contains 20,491 unique hotel reviews.
Details
Keywords
Ruchi Kejriwal, Monika Garg and Gaurav Sarin
Stock market has always been lucrative for various investors. But, because of its speculative nature, it is difficult to predict the price movement. Investors have been using both…
Abstract
Purpose
Stock market has always been lucrative for various investors. But, because of its speculative nature, it is difficult to predict the price movement. Investors have been using both fundamental and technical analysis to predict the prices. Fundamental analysis helps to study structured data of the company. Technical analysis helps to study price trends, and with the increasing and easy availability of unstructured data have made it important to study the market sentiment. Market sentiment has a major impact on the prices in short run. Hence, the purpose is to understand the market sentiment timely and effectively.
Design/methodology/approach
The research includes text mining and then creating various models for classification. The accuracy of these models is checked using confusion matrix.
Findings
Out of the six machine learning techniques used to create the classification model, kernel support vector machine gave the highest accuracy of 68%. This model can be now used to analyse the tweets, news and various other unstructured data to predict the price movement.
Originality/value
This study will help investors classify a news or a tweet into “positive”, “negative” or “neutral” quickly and determine the stock price trends.
Details
Keywords
Khalid Iqbal and Muhammad Shehrayar Khan
In this digital era, email is the most pervasive form of communication between people. Many users become a victim of spam emails and their data have been exposed.
Abstract
Purpose
In this digital era, email is the most pervasive form of communication between people. Many users become a victim of spam emails and their data have been exposed.
Design/methodology/approach
Researchers contribute to solving this problem by a focus on advanced machine learning algorithms and improved models for detecting spam emails but there is still a gap in features. To achieve good results, features also play an important role. To evaluate the performance of applied classifiers, 10-fold cross-validation is used.
Findings
The results approve that the spam emails are correctly classified with the accuracy of 98.00% for the Support Vector Machine and 98.06% for the Artificial Neural Network as compared to other applied machine learning classifiers.
Originality/value
In this paper, Point-Biserial correlation is applied to each feature concerning the class label of the University of California Irvine (UCI) spambase email dataset to select the best features. Extensive experiments are conducted on selected features by training the different classifiers.
Details
Keywords
Manju Priya Arthanarisamy Ramaswamy and Suja Palaniswamy
The aim of this study is to investigate subject independent emotion recognition capabilities of EEG and peripheral physiological signals namely: electroocoulogram (EOG)…
Abstract
Purpose
The aim of this study is to investigate subject independent emotion recognition capabilities of EEG and peripheral physiological signals namely: electroocoulogram (EOG), electromyography (EMG), electrodermal activity (EDA), temperature, plethysmograph and respiration. The experiments are conducted on both modalities independently and in combination. This study arranges the physiological signals in order based on the prediction accuracy obtained on test data using time and frequency domain features.
Design/methodology/approach
DEAP dataset is used in this experiment. Time and frequency domain features of EEG and physiological signals are extracted, followed by correlation-based feature selection. Classifiers namely – Naïve Bayes, logistic regression, linear discriminant analysis, quadratic discriminant analysis, logit boost and stacking are trained on the selected features. Based on the performance of the classifiers on the test set, the best modality for each dimension of emotion is identified.
Findings
The experimental results with EEG as one modality and all physiological signals as another modality indicate that EEG signals are better at arousal prediction compared to physiological signals by 7.18%, while physiological signals are better at valence prediction compared to EEG signals by 3.51%. The valence prediction accuracy of EOG is superior to zygomaticus electromyography (zEMG) and EDA by 1.75% at the cost of higher number of electrodes. This paper concludes that valence can be measured from the eyes (EOG) while arousal can be measured from the changes in blood volume (plethysmograph). The sorted order of physiological signals based on arousal prediction accuracy is plethysmograph, EOG (hEOG + vEOG), vEOG, hEOG, zEMG, tEMG, temperature, EMG (tEMG + zEMG), respiration, EDA, while based on valence prediction accuracy the sorted order is EOG (hEOG + vEOG), EDA, zEMG, hEOG, respiration, tEMG, vEOG, EMG (tEMG + zEMG), temperature and plethysmograph.
Originality/value
Many of the emotion recognition studies in literature are subject dependent and the limited subject independent emotion recognition studies in the literature report an average of leave one subject out (LOSO) validation result as accuracy. The work reported in this paper sets the baseline for subject independent emotion recognition using DEAP dataset by clearly specifying the subjects used in training and test set. In addition, this work specifies the cut-off score used to classify the scale as low or high in arousal and valence dimensions. Generally, statistical features are used for emotion recognition using physiological signals as a modality, whereas in this work, time and frequency domain features of physiological signals and EEG are used. This paper concludes that valence can be identified from EOG while arousal can be predicted from plethysmograph.
Details