Search results

1 – 10 of over 3000
Article
Publication date: 10 March 2023

Jingyi Li and Shiwei Chao

Abstract

Purpose

Binary classification on imbalanced data is a challenge; due to the imbalance of the classes, the minority class is easily masked by the majority class. However, most existing classifiers are better at identifying the majority class, thereby ignoring the minority class, which leads to classifier degradation. To address this, this paper proposes a twin support vector machine for binary classification on imbalanced data.

Design/methodology/approach

In the proposed method, the authors construct two support vector machines that focus on the majority class and the minority class, respectively. To improve the learning ability of the two support vector machines, a new kernel is derived for them.

Findings

(1) A novel twin support vector machine is proposed for binary classification on imbalanced data, and new kernels are derived. (2) For imbalanced data, the complexity of the data distribution has negative effects on classification results; however, better classification results can be obtained and the desired boundaries can be learned by using optimized kernels. (3) Classifiers based on twin architectures have more advantages than those based on a single architecture for binary classification on imbalanced data.

Originality/value

For imbalanced data, the complexity of the data distribution has negative effects on classification results; however, better classification results can be obtained and the desired boundaries can be learned by using optimized kernels.

Details

Data Technologies and Applications, vol. 57 no. 3
Type: Research Article
ISSN: 2514-9288

Book part
Publication date: 6 September 2019

Son Nguyen, Gao Niu, John Quinn, Alan Olinsky, Jonathan Ormsbee, Richard M. Smith and James Bishop

Abstract

In recent years, the problem of classification with imbalanced data has drawn growing attention in the data-mining and machine-learning communities because imbalanced data has become abundant in many fields. In this chapter, we compare the performance of six classification methods on an imbalanced dataset under the influence of four resampling techniques. These classification methods are the random forest, the support vector machine, logistic regression, k-nearest neighbor (KNN), the decision tree, and AdaBoost. Our study shows that all of the classification methods have difficulty with the imbalanced data, with KNN performing the worst, detecting only 27.4% of the minority class. However, with the help of resampling techniques, all of the classification methods improve in overall performance. In particular, the random forest, in combination with the random over-sampling technique, performs the best, achieving 82.8% balanced accuracy (the average of the true-positive rate and true-negative rate).

We then propose a new procedure to resample the data. Our method is based on the idea of eliminating “easy” majority observations before under-sampling them. It has further improved the balanced accuracy of the Random Forest to 83.7%, making it the best approach for the imbalanced data.
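The balanced accuracy quoted in this abstract (the average of the true-positive rate and the true-negative rate) can be illustrated with a minimal Python sketch. The function name and the toy labels below are hypothetical and are not taken from the chapter; the sketch assumes both classes appear in the true labels.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Average of the true-positive rate and the true-negative rate.

    Assumes y_true contains at least one positive and one negative label.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn)  # share of the minority (positive) class detected
    tnr = tn / (tn + fp)  # share of the majority (negative) class detected
    return (tpr + tnr) / 2


# Hypothetical toy data: 8 majority (0) and 2 minority (1) observations,
# scored by a classifier that always predicts the majority class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, bal_accuracy)
```

On this toy data the plain accuracy looks respectable even though no minority observation is detected, which is why a class-balanced measure is the more informative one for imbalanced problems.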

Details

Advances in Business and Management Forecasting
Type: Book
ISBN: 978-1-78754-290-7

Article
Publication date: 19 August 2022

Anjali More and Dipti Rana

Abstract

Purpose

The referenced data set provides reliable information about network flows and common attacks that meets real-world criteria. Accordingly, this study focusses on the use of the imbalanced intrusion detection benchmark, the knowledge discovery in database (KDD) data set, which many researchers prefer for experimentation and analysis. The proposed algorithm, improvised random forest classification with error tuning factors (IRFCETF), is applied to the KDD data set, and the performance of the complete set of network traffic features is evaluated through IRFCETF.

Design/methodology/approach

Researchers' attention is currently drawn to a diverse range of existing applications that deal with imbalanced data classification (ImDC). Real-time applications, artificial intelligence (AI), the Industrial Internet of Things (IIoT) and other areas that deal with ImDC suffer from distorted classification performance because of skewed data distribution (SkDD), and many data applications in AI and IIoT face a distorted classification rate under SkDD. In recent years, the volume of computer network data and of related application development has expanded exponentially. Intrusion detection is one of the demanding applications of ImDC. The proposed study focusses on an imbalanced intrusion benchmark data set, the KDD data set, and on other benchmark data sets with the proposed IRFCETF approach. IRFCETF demonstrates improved classification performance on imbalanced data sets over the existing approach. The purpose of this work is to review imbalanced data applications in numerous application areas, including AI and IIoT, and to tune performance with respect to principal component analysis. This study also focusses on the out-of-bag error as a performance-tuning factor.

Findings

Experimental results on the KDD data set show that the proposed algorithm gives improved performance. For the referenced intrusion detection data set, the IRFCETF classification accuracy is 99.57% and the error rate is 0.43%.

Research limitations/implications

This research can be extended to improve classification techniques with multiple correspondence analysis (MCA); hierarchical MCA can be explored with classification models for a wide range of skewed data sets.

Practical implications

The improvement in metrics is measurable and helps in dealing with imbalanced applications related to intrusion detection systems in current application domains such as security, AI and IIoT digitization. Analytical results show that the proposed approach improves metrics relative to other traditional machine learning algorithms. Thus, the measurable impact of the error-tuning parameter on classification accuracy is justified with the proposed IRFCETF.

Social implications

The proposed algorithm is useful in numerous IIoT applications such as health care and machinery automation.

Originality/value

This research presents IRFCETF, an approach for enhancing classification metrics. The proposed method yields a test set categorization for each case with an error reduction mechanism.

Details

International Journal of Pervasive Computing and Communications, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 22 October 2018

Sihem Khemakhem, Fatma Ben Said and Younes Boujelbene

Abstract

Purpose

Credit scoring datasets are generally unbalanced. The number of repaid loans is higher than that of defaulted ones. Therefore, the classification of these data is biased toward the majority class, which practically means that it tends to attribute a mistaken “good borrower” status even to “very risky borrowers”. In addition to the use of statistics and machine learning classifiers, this paper aims to explore the relevance and performance of sampling models combined with statistical prediction and artificial intelligence techniques to predict and quantify the default probability based on real-world credit data.

Design/methodology/approach

A real database from a Tunisian commercial bank was used, and the unbalanced data issues were addressed using random over-sampling (ROS) and the synthetic minority over-sampling technique (SMOTE). Performance was evaluated in terms of the confusion matrix and the receiver operating characteristic curve.
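The two resampling strategies named in this abstract can be sketched in a few lines of Python. This is a simplified illustration of the general ideas (ROS duplicates existing minority rows; SMOTE interpolates between a minority row and one of its nearest minority neighbours), not the authors' implementation. The function names, the toy points and the parameter values are assumptions made for the example.

```python
import random


def random_over_sample(minority, target_size, rng):
    """ROS: duplicate randomly chosen minority rows until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def smote(minority, n_new, k, rng):
    """SMOTE-style sketch: create n_new synthetic rows by interpolating between
    a random minority row and one of its k nearest minority neighbours."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class rows with two numeric features.
rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]

ros_rows = random_over_sample(minority, 10, rng)
smote_rows = smote(minority, 6, 2, rng)
print(len(ros_rows), len(smote_rows))
```

ROS only repeats rows that already exist, whereas the SMOTE-style rows are new points that lie between existing minority rows, which is the main practical difference between the two.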

Findings

The results indicated that combining intelligent and statistical techniques with re-sampling approaches is promising for default rate management and provides accurate credit risk estimates.

Originality/value

This paper empirically investigates the effectiveness of ROS and SMOTE in combination with logistic regression, artificial neural networks and support vector machines. The authors address the role of sampling strategies in the Tunisian credit market and its impact on credit risk. These sampling strategies may help financial institutions to reduce the erroneous classification costs in comparison with the unbalanced original data and may serve as a means for improving the bank’s performance and competitiveness.

Details

Journal of Modelling in Management, vol. 13 no. 4
Type: Research Article
ISSN: 1746-5664

Article
Publication date: 15 March 2021

Putta Hemalatha and Geetha Mary Amalanathan

Abstract

Purpose

Adequate resources for learning and training are an important constraint in developing an efficient classifier with outstanding performance. The data usually follows a biased distribution, with classes unequally represented within a dataset. This issue is known as the imbalance problem, one of the most common issues in real-time applications. Learning from imbalanced datasets is a ubiquitous challenge in the field of data mining. Imbalanced data degrades the performance of the classifier by producing inaccurate results.

Design/methodology/approach

In the proposed work, a novel fuzzy-based Gaussian synthetic minority oversampling (FG-SMOTE) algorithm is proposed to process the imbalanced data. The Gaussian SMOTE technique relies on the nearest-neighbour concept to balance the ratio between minority and majority class datasets. The ratio of the datasets belonging to the minority and majority class is balanced using a fuzzy-based Levenshtein distance measure technique.

Findings

The performance and accuracy of the proposed algorithm are evaluated using a deep belief network classifier. The results show that the fuzzy-based Gaussian SMOTE technique achieved an AUC of 93.7%, an F1 score of 94.2% and a geometric mean score of 93.6%, computed from the confusion matrix.

Research limitations/implications

The proposed research leaves some challenges open, such as applying FG-SMOTE to multiclass imbalanced datasets and evaluating the dataset imbalance problem in a distributed environment.

Originality/value

The proposed algorithm fundamentally solves the data imbalance issues and challenges involved in handling the imbalanced data. FG-SMOTE has aided in balancing minority and majority class datasets.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 2
Type: Research Article
ISSN: 1756-378X

Book part
Publication date: 1 September 2021

Son Nguyen, Phyllis Schumacher, Alan Olinsky and John Quinn

Abstract

We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data set of home loan applications. During the process, we propose our undersampling algorithm to cope with the issues created by the imbalance of the data. Our technique is shown to work competitively against popular resampling techniques such as random oversampling, undersampling, synthetic minority oversampling technique (SMOTE), and random oversampling examples (ROSE). We also investigate the relation between the true positive rate, true negative rate, and the imbalance of the data.

Article
Publication date: 29 September 2020

Hari Hara Krishna Kumar Viswanathan, Punniyamoorthy Murugesan, Sundar Rengasamy and Lavanya Vilvanathan

Abstract

Purpose

The purpose of this study is to compare the classification learning ability of our algorithm based on boosted support vector machine (B-SVM), against other classification techniques in predicting the credit ratings of banks. The key feature of this study is the usage of an imbalanced dataset (in the response variable/rating) with a smaller number of observations (number of banks).

Design/methodology/approach

In general, datasets in the banking sector are both small and imbalanced. In this study, 23 Scheduled Commercial Banks (SCBs) in India were chosen, and their corporate ratings were collated from the Indian subsidiary of a reputed global rating agency. The top management of the rating agency provided 12 quantitative input variables that are considered essential for rating a bank within India. To overcome the challenge of an imbalanced dataset with a small number of observations, this study uses an algorithm, namely “Modified Boosted Support Vector Machines” (MBSVMs), proposed by Punniyamoorthy Murugesan and Sundar Rengasamy. This study also compares the classification ability of this algorithm against other classification techniques such as multi-class SVM, back propagation neural networks, multi-class linear discriminant analysis (LDA) and k-nearest neighbors (k-NN) classification, on the basis of the geometric mean (GM).

Findings

The performance of each algorithm was compared on one metric: the geometric mean, also known as GMean (GM). This metric summarizes class-wise sensitivity by taking the product of the per-class values. The findings of the study show that the proposed MBSVM technique outperforms the other techniques.
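As a rough illustration of the geometric mean metric, the Python sketch below uses a common multi-class definition: the k-th root of the product of the k class-wise sensitivities (recalls). The abstract does not spell out its exact formula, so this definition, the function name and the rating labels are assumptions made for the example.

```python
import math


def geometric_mean_score(y_true, y_pred):
    """k-th root of the product of the per-class sensitivities (recalls)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / total)
    return math.prod(recalls) ** (1 / len(recalls))


# Hypothetical ratings for 10 banks: class "C" is rare and is never detected.
y_true = ["A"] * 6 + ["B"] * 3 + ["C"]
y_pred = ["A"] * 6 + ["B"] * 3 + ["B"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
gm = geometric_mean_score(y_true, y_pred)
print(accuracy, gm)
```

Because the per-class values are multiplied, missing one class entirely drives the score to zero even when overall accuracy is high, which makes the metric sensitive to small, rare classes.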

Research limitations/implications

This study provides an algorithm to predict ratings of banks where the dataset is small and imbalanced. One of the limitations of this research study is that subjective factors have not been included in our model; the sole focus is on the results generated by the models (driven by quantitative parameters). In future, studies may be conducted which may include subjective parameters (proxied by relevant and quantifiable variables).

Practical implications

Various stakeholders such as investors, regulators and central banks can predict the credit ratings of banks by themselves, by inputting appropriate data to the model.

Originality/value

In the process of rating banks, the usage of an imbalanced dataset can lessen the performance of the soft-computing techniques. In order to overcome this, the authors have come up with a novel classification approach based on “MBSVMs”, which can be used as a yardstick for such imbalanced datasets. For this purpose, through primary research, 12 features have been identified that are considered essential by the credit rating agencies.

Details

Benchmarking: An International Journal, vol. 28 no. 1
Type: Research Article
ISSN: 1463-5771

Open Access
Article
Publication date: 10 June 2024

Lua Thi Trinh

Abstract

Purpose

The purpose of this paper is to compare nine different models to evaluate consumer credit risk, which are the following: Logistic Regression (LR), Naive Bayes (NB), Linear Discriminant Analysis (LDA), k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), Classification and Regression Tree (CART), Artificial Neural Network (ANN), Random Forest (RF) and Gradient Boosting Decision Tree (GBDT) in Peer-to-Peer (P2P) Lending.

Design/methodology/approach

The author uses data from P2P Lending Club (LC) to assess the efficiency of a variety of classification models across different economic scenarios and to compare the ranking results of credit risk models in P2P lending through three families of evaluation metrics.

Findings

The results from this research indicate that the risk classification models show greater measurement efficiency in the 2013–2019 economic period than in the difficult 2007–2012 period. In addition, the results of ranking models for predicting default risk show that GBDT is the best model for most of the metrics or metric families included in the study. The findings of this study also support the results of Tsai et al. (2014) and Teplý and Polena (2019) that LR, ANN and LDA models classify loan applications quite stably and accurately, while CART, k-NN and NB show the worst performance when predicting borrower default risk on P2P loan data.

Originality/value

The main contribution of the research to the empirical literature is a comparison of nine models for predicting consumer loan application risk, built with statistical and machine learning algorithms and evaluated with performance measures from three separate families of metrics (threshold, ranking and probabilistic metrics). The measures are consistent with the data characteristics of the LC lending platform, and the comparison covers two periods that reflect the economic situation and the platform's development.
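The three families of metrics can be made concrete with one representative from each. The abstract does not list the specific measures used, so the choices below (accuracy at a fixed cut-off as a threshold metric, AUC as a ranking metric and the Brier score as a probabilistic metric) are assumptions, as are the function names and the toy scores.

```python
def accuracy_at(y_true, scores, threshold=0.5):
    """Threshold metric: accuracy after cutting scores at a fixed threshold."""
    hits = sum((s >= threshold) == bool(t) for t, s in zip(y_true, scores))
    return hits / len(y_true)


def auc(y_true, scores):
    """Ranking metric: probability that a random positive outscores a random
    negative, counting ties as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, scores):
    """Probabilistic metric: mean squared error of the predicted probabilities."""
    return sum((s - t) ** 2 for t, s in zip(y_true, scores)) / len(y_true)


# Hypothetical default labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.7, 0.9]

acc = accuracy_at(y_true, scores)
rank_auc = auc(y_true, scores)
brier = brier_score(y_true, scores)
print(acc, rank_auc, brier)
```

In this toy case the scores rank every defaulter above every non-defaulter, yet one loan is misclassified at the 0.5 cut-off, so the same model can look different across the three families. That is one reason to report a measure from each family.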

Details

Journal of Economics, Finance and Administrative Science, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2077-1886

Article
Publication date: 20 August 2018

Sihem Khemakhem and Younes Boujelbene

Abstract

Purpose

Data mining for predicting credit risk is a beneficial tool for financial institutions to evaluate the financial health of companies. However, parameter selection and the presence of unbalanced data sets are typical problems of this technique. This study aims to provide a new method for evaluating credit risk, taking into account not only financial and non-financial variables, but also the class imbalance.

Design/methodology/approach

The most significant financial and non-financial variables were determined to build a credit scoring model and identify the creditworthiness of companies. Moreover, the Synthetic Minority Oversampling Technique was used to solve the problem of class imbalance and improve the performance of the classifier. The artificial neural networks and decision trees were designed to predict default risk.

Findings

Results showed that profitability ratios, repayment capacity, solvency, duration of a credit report, guarantees, size of the company, loan number, ownership structure and the corporate banking relationship duration turned out to be the key factors in predicting default. Also, both algorithms were found to be highly sensitive to class imbalance. However, with balanced data, the decision trees displayed higher predictive accuracy for the assessment of credit risk than artificial neural networks.

Originality/value

Classification results depend on the suitability of the data characteristics and of the analysis algorithm chosen for the data sets. The selection of financial and non-financial variables, together with the resolution of class imbalance, allows companies to assess their credit risk successfully.

Details

Review of Accounting and Finance, vol. 17 no. 3
Type: Research Article
ISSN: 1475-7702

Article
Publication date: 17 September 2021

Liang He, Haiyan Xu and Ginger Y. Ke

Abstract

Purpose

Despite better accessibility and flexibility, peer-to-peer (P2P) lending has suffered from excessive credit risks, which may cause significant losses to the lenders and even lead to the collapse of P2P platforms. The purpose of this research is to construct a hybrid predictive framework that integrates classification, feature selection, and data balance algorithms to cope with the high-dimensional and imbalanced nature of P2P credit data.

Design/methodology/approach

An improved synthetic minority over-sampling technique (IMSMOTE) is developed to incorporate the randomness and probability into the traditional synthetic minority over-sampling technique (SMOTE) to enhance the quality of synthetic samples and the controllability of synthetic processes. IMSMOTE is then implemented along with the grey relational clustering (GRC) and the support vector machine (SVM) to facilitate a comprehensive assessment of the P2P credit risks. To enhance the associativity and functionality of the algorithm, a dynamic selection approach is integrated with GRC and then fed in the SVM's process of parameter adaptive adjustment to select the optimal critical value. A quantitative model is constructed to recognize key criteria via multidimensional representativeness.

Findings

A series of experiments based on real-world P2P data from Prosper Funding LLC demonstrates that our proposed model outperforms other existing approaches. It is also confirmed that the grey-based GRC approach with dynamic selection succeeds in reducing data dimensions, selecting a critical value and identifying key criteria, and that IMSMOTE can efficiently handle the imbalanced data.

Originality/value

The grey-based machine-learning framework proposed in this work can be practically implemented by P2P platforms in predicting the borrowers' credit risks. The dynamic selection approach makes the first attempt in the literature to select a critical value and indicate key criteria in a dynamic, visual and quantitative manner.

Details

Grey Systems: Theory and Application, vol. 12 no. 3
Type: Research Article
ISSN: 2043-9377
