Search results
1 – 10 of 88Son Nguyen, Phyllis Schumacher, Alan Olinsky and John Quinn
We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data set of…
Abstract
We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data set of home loan applications. During the process, we propose our undersampling algorithm to cope with the issues created by the imbalance of the data. Our technique is shown to work competitively against popular resampling techniques such as random oversampling, undersampling, synthetic minority oversampling technique (SMOTE), and random oversampling examples (ROSE). We also investigate the relation between the true positive rate, true negative rate, and the imbalance of the data.
Details
Keywords
Sharifah Heryati Syed Nor, Shafinar Ismail and Bee Wah Yap
Personal bankruptcy is on the rise in Malaysia. The Insolvency Department of Malaysia reported that personal bankruptcy has increased since 2007, and the total accumulated…
Abstract
Purpose
Personal bankruptcy is on the rise in Malaysia. The Insolvency Department of Malaysia reported that personal bankruptcy has increased since 2007, and the total accumulated personal bankruptcy cases stood at 131,282 in 2014. This is indeed an alarming issue because the increasing number of personal bankruptcy cases will have a negative impact on the Malaysian economy, as well as on the society. From the aspect of individual’s personal economy, bankruptcy minimizes their chances of securing a job. Apart from that, their account will be frozen, lost control on their assets and properties and not allowed to start any business nor be a part of any company’s management. Bankrupts also will be denied from any loan application, restricted from travelling overseas and cannot act as a guarantor. This paper aims to investigate this problem by developing the personal bankruptcy prediction model using the decision tree technique.
Design/methodology/approach
In this paper, bankrupt is defined as terminated members who failed to settle their loans. The sample comprised of 24,546 cases with 17 per cent settled cases and 83 per cent terminated cases. The data included a dependent variable, i.e. bankruptcy status (Y = 1(bankrupt), Y = 0 (non-bankrupt)) and 12 predictors. SAS Enterprise Miner 14.1 software was used to develop the decision tree model.
Findings
Upon completion, this study succeeds to come out with the profiles of bankrupts, reliable personal bankruptcy scoring model and significant variables of personal bankruptcy.
Practical implications
This decision tree model is possible for patent and income generation. Financial institutions are able to use this model for potential borrowers to predict their tendency toward personal bankruptcy.
Social implications
Create awareness to society on significant variables of personal bankruptcy so that they can avoid being a bankrupt.
Originality/value
This decision tree model is able to facilitate and assist financial institutions in evaluating and assessing their potential borrower. It helps to identify potential defaulting borrowers. It also can assist financial institutions in implementing the right strategies to avoid defaulting borrowers.
Details
Keywords
Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao and Yuqing Lin
Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this…
Abstract
Purpose
Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper to investigate the use of machine learning techniques for malicious web domain identification by considering the class imbalance issue (i.e. there are more benign web domains than malicious ones).
Design/methodology/approach
The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use the SMOTE for oversampling and PSO for undersampling.
Findings
By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective.
Practical implications
This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification.
Originality/value
Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.
Details
Keywords
Alicia T. Lamere, Son Nguyen, Gao Niu, Alan Olinsky and John Quinn
Predicting a patient's length of stay (LOS) in a hospital setting has been widely researched. Accurately predicting an individual's LOS can have a significant impact on a…
Abstract
Predicting a patient's length of stay (LOS) in a hospital setting has been widely researched. Accurately predicting an individual's LOS can have a significant impact on a healthcare provider's ability to care for individuals by allowing them to properly prepare and manage resources. A hospital's productivity requires a delicate balance of maintaining enough staffing and resources without being overly equipped or wasteful. This has become even more important in light of the current COVID-19 pandemic, during which emergency departments around the globe have been inundated with patients and are struggling to manage their resources.
In this study, the authors focus on the prediction of LOS at the time of admission in emergency departments at Rhode Island hospitals through discharge data obtained from the Rhode Island Department of Health over the time period of 2012 and 2013. This work also explores the distribution of discharge dispositions in an effort to better characterize the resources patients require upon leaving the emergency department.
Details
Keywords
Lu Wang, Jiahao Zheng, Jianrong Yao and Yuangao Chen
With the rapid growth of the domestic lending industry, assessing whether the borrower of each loan is at risk of default is a pressing issue for financial institutions. Although…
Abstract
Purpose
With the rapid growth of the domestic lending industry, assessing whether the borrower of each loan is at risk of default is a pressing issue for financial institutions. Although there are some models that can handle such problems well, there are still some shortcomings in some aspects. The purpose of this paper is to improve the accuracy of credit assessment models.
Design/methodology/approach
In this paper, three different stages are used to improve the classification performance of LSTM, so that financial institutions can more accurately identify borrowers at risk of default. The first approach is to use the K-Means-SMOTE algorithm to eliminate the imbalance within the class. In the second step, ResNet is used for feature extraction, and then two-layer LSTM is used for learning to strengthen the ability of neural networks to mine and utilize deep information. Finally, the model performance is improved by using the IDWPSO algorithm for optimization when debugging the neural network.
Findings
On two unbalanced datasets (category ratios of 700:1 and 3:1 respectively), the multi-stage improved model was compared with ten other models using accuracy, precision, specificity, recall, G-measure, F-measure and the nonparametric Wilcoxon test. It was demonstrated that the multi-stage improved model showed a more significant advantage in evaluating the imbalanced credit dataset.
Originality/value
In this paper, the parameters of the ResNet-LSTM hybrid neural network, which can fully mine and utilize the deep information, are tuned by an innovative intelligent optimization algorithm to strengthen the classification performance of the model.
Details
Keywords
Murtaza Nasir, Carole South-Winter, Srini Ragothaman and Ali Dag
The purpose of this paper is to formulate a framework to construct a patient-specific risk score and therefore to classify these patients into various risk groups that can be used…
Abstract
Purpose
The purpose of this paper is to formulate a framework to construct a patient-specific risk score and therefore to classify these patients into various risk groups that can be used as a decision support mechanism by the medical decision makers to augment their decision-making process, allowing them to optimally use the limited resources available.
Design/methodology/approach
A conventional statistical model (logistic regression) and two machine learning-based (i.e. artificial neural networks (ANNs) and support vector machines) data mining models were employed by also using five-fold cross-validation in the classification phase. In order to overcome the data imbalance problem, random undersampling technique was utilized. After constructing the patient-specific risk score, k-means clustering algorithm was employed to group these patients into risk groups.
Findings
Results showed that the ANN model achieved the best results with an area under the curve score of 0.867, while the sensitivity and specificity were 0.715 and 0.892, respectively. Also, the construction of patient-specific risk scores offer useful insights to the medical experts, by helping them find a trade-off between risks, costs and resources.
Originality/value
The study contributes to the existing body of knowledge by constructing a framework that can be utilized to determine the risk level of the targeted patient, by employing data mining-based predictive approach.
Details
Keywords
Yaotan Xie and Fei Xiang
This study aimed to adapt existing text-mining techniques and propose a novel topic recognition approach for textual patient reviews.
Abstract
Purpose
This study aimed to adapt existing text-mining techniques and propose a novel topic recognition approach for textual patient reviews.
Design/methodology/approach
The authors first transformed multilabel samples for adapting model training forms. Then, an improved method was proposed based on dynamic mixed sampling and transfer learning to improve the learning problem caused by imbalanced samples. Specifically, the training of our model was based on the framework of a convolutional neural network and self-trained Word2Vector on large-scale corpora.
Findings
Compared with the SVM and other CNN-based models, the CNN+ DMS + TL model proposed in this study has made significant improvement in F1 score.
Originality/value
The improved methods based on dynamic mixed sampling and transfer learning can adequately manage the learning problem caused by the skewed distribution of samples and achieve the effective and automatic topic recognition of textual patient reviews.
Peer review
The peer-review history for this article is available at: https://publons.com/publon/10.1108/OIR-01-2021-0059.
Details
Keywords
Bachriah Fatwa Dhini, Abba Suganda Girsang, Unggul Utan Sufandi and Heny Kurniawati
The authors constructed an automatic essay scoring (AES) model in a discussion forum where the result was compared with scores given by human evaluators. This research proposes…
Abstract
Purpose
The authors constructed an automatic essay scoring (AES) model in a discussion forum where the result was compared with scores given by human evaluators. This research proposes essay scoring, which is conducted through two parameters, semantic and keyword similarities, using a SentenceTransformers pre-trained model that can construct the highest vector embedding. Combining these models is used to optimize the model with increasing accuracy.
Design/methodology/approach
The development of the model in the study is divided into seven stages: (1) data collection, (2) pre-processing data, (3) selected pre-trained SentenceTransformers model, (4) semantic similarity (sentence pair), (5) keyword similarity, (6) calculate final score and (7) evaluating model.
Findings
The multilingual paraphrase-multilingual-MiniLM-L12-v2 and distilbert-base-multilingual-cased-v1 models got the highest scores from comparisons of 11 pre-trained multilingual models of SentenceTransformers with Indonesian data (Dhini and Girsang, 2023). Both multilingual models were adopted in this study. A combination of two parameters is obtained by comparing the response of the keyword extraction responses with the rubric keywords. Based on the experimental results, proposing a combination can increase the evaluation results by 0.2.
Originality/value
This study uses discussion forum data from the general biology course in online learning at the open university for the 2020.2 and 2021.2 semesters. Forum discussion ratings are still manual. In this survey, the authors created a model that automatically calculates the value of discussion forums, which are essays based on the lecturer's answers moreover rubrics.
Details