Search results

1 – 10 of over 2000
Article
Publication date: 10 March 2023

Jingyi Li and Shiwei Chao

Binary classification on imbalanced data is a challenge; due to the imbalance of the classes, the minority class is easily masked by the majority class. However, most existing…

Abstract

Purpose

Binary classification on imbalanced data is a challenge: because of the class imbalance, the minority class is easily masked by the majority class. Most existing classifiers are better at identifying the majority class and tend to ignore the minority class, which degrades classification performance. To address this, this paper proposes a twin support vector machine for binary classification on imbalanced data.

Design/methodology/approach

In the proposed method, the authors construct two support vector machines that focus on the majority and minority classes, respectively. To improve the learning ability of the two support vector machines, a new kernel is derived for them.

Findings

(1) A novel twin support vector machine is proposed for binary classification on imbalanced data, and new kernels are derived for it. (2) For imbalanced data, the complexity of the data distribution harms classification results; however, better classification results and the desired decision boundaries can be obtained by optimizing the kernels. (3) Classifiers built on a twin architecture have advantages over single-architecture classifiers for binary classification on imbalanced data.

Originality/value

For imbalanced data, the complexity of the data distribution harms classification results; however, better classification results and the desired decision boundaries can be obtained by optimizing the kernels.

Details

Data Technologies and Applications, vol. 57 no. 3
Type: Research Article
ISSN: 2514-9288


Book part
Publication date: 6 September 2019

Son Nguyen, Gao Niu, John Quinn, Alan Olinsky, Jonathan Ormsbee, Richard M. Smith and James Bishop

In recent years, the problem of classification with imbalanced data has been growing in popularity in the data-mining and machine-learning communities due to the emergence of an…

Abstract

In recent years, the problem of classification with imbalanced data has been growing in popularity in the data-mining and machine-learning communities due to the emergence of an abundance of imbalanced data in many fields. In this chapter, we compare the performance of six classification methods on an imbalanced dataset under the influence of four resampling techniques. These classification methods are the random forest, the support vector machine, logistic regression, k-nearest neighbor (KNN), the decision tree, and AdaBoost. Our study shows that all of the classification methods have difficulty with the imbalanced data, with KNN performing the worst, detecting only 27.4% of the minority class. However, with the help of resampling techniques, all of the classification methods improve in overall performance. In particular, the random forest, combined with the random over-sampling technique, performs best, achieving 82.8% balanced accuracy (the average of the true-positive rate and true-negative rate).

We then propose a new procedure to resample the data. Our method is based on the idea of eliminating “easy” majority observations before under-sampling them. It has further improved the balanced accuracy of the Random Forest to 83.7%, making it the best approach for the imbalanced data.
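The chapter's resampling idea and its headline metric can be sketched in plain Python. This is a simplified stand-in, not the authors' exact procedure: `drop_easy_then_undersample` treats majority points far from every minority point as "easy" and discards them before random under-sampling, and the function names and the keep-half heuristic are illustrative assumptions.

```python
import random

def balanced_accuracy(tp, fn, tn, fp):
    """Average of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

def drop_easy_then_undersample(majority, minority, keep_ratio=1.0, seed=0):
    """Remove 'easy' majority points (far from every minority point),
    then randomly under-sample the remainder to the minority size."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # score each majority point by its distance to the nearest minority point
    scored = sorted(majority, key=lambda m: min(dist(m, p) for p in minority))
    # keep the 'hard' half closest to the minority class, drop the 'easy' rest
    hard = scored[: max(len(scored) // 2, len(minority))]
    rng = random.Random(seed)
    target = int(len(minority) * keep_ratio)
    return rng.sample(hard, min(target, len(hard)))
```

The intuition is that majority points deep inside their own region carry little information about the decision boundary, so discarding them first makes the subsequent under-sampling less likely to throw away boundary-defining observations.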

Details

Advances in Business and Management Forecasting
Type: Book
ISBN: 978-1-78754-290-7


Article
Publication date: 19 August 2022

Anjali More and Dipti Rana

Referred data set produces reliable information about the network flows and common attacks meeting with real-world criteria. Accordingly, this study aims to focus on the use of…

Abstract

Purpose

The referred dataset produces reliable information about network flows and common attacks that meets real-world criteria. Accordingly, this study focuses on the imbalanced intrusion detection benchmark knowledge discovery in databases (KDD) dataset, which is widely used by researchers for experimentation and analysis. The proposed algorithm, improvised random forest classification with error-tuning factors (IRFCETF), is evaluated on the KDD dataset over the complete set of network traffic features.

Design/methodology/approach

Many current applications deal with imbalanced data classification (ImDC): real-time systems, artificial intelligence (AI), the Industrial Internet of Things (IIoT) and similar areas suffer degraded classification performance because of skewed data distributions (SkDD). In addition, the volume of computer network data and related application development has grown exponentially in recent years, and intrusion detection is one of the most demanding ImDC applications. The proposed study applies the IRFCETF approach to the imbalanced KDD intrusion benchmark dataset and other benchmark datasets, and demonstrates enriched classification performance on imbalanced data over the existing approach. The purpose of this work is to review imbalanced data applications in areas including AI and IIoT and to tune performance with respect to principal component analysis; the study also focuses on the out-of-bag error as a performance-tuning factor.
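The out-of-bag (OOB) error used as a performance-tuning factor can be illustrated with a toy bagging ensemble. This is a hedged sketch, not the paper's method: it bags a 1-nearest-neighbour learner rather than a random forest, and `oob_error` and its parameters are invented for illustration. The key idea is that each point is scored only by learners whose bootstrap sample excluded it, giving a validation-like error estimate without a held-out set.

```python
import random

def oob_error(X, y, n_trees=25, seed=0):
    """Out-of-bag error for a bagged 1-NN ensemble: each point is
    predicted only by learners that did not see it during training."""
    rng = random.Random(seed)
    n = len(X)
    votes = [[] for _ in range(n)]

    def predict(train_idx, p):
        # 1-NN prediction using only the indices in this bootstrap bag
        j = min(train_idx, key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], p)))
        return y[j]

    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        oob = set(range(n)) - set(bag)               # points the bag never saw
        for i in oob:
            votes[i].append(predict(bag, X[i]))

    # majority vote of out-of-bag predictions, compared with the true label
    wrong = sum(1 for i in range(n)
                if votes[i] and max(set(votes[i]), key=votes[i].count) != y[i])
    scored = sum(1 for i in range(n) if votes[i])
    return wrong / scored
```

On well-separated data this estimate is near zero; a tuning loop would vary ensemble parameters and keep the setting that minimizes it.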

Findings

Experimental results on the KDD dataset show that the proposed algorithm gives enriched performance. For the referred intrusion detection dataset, IRFCETF achieves a classification accuracy of 99.57% with an error rate of 0.43%.

Research limitations/implications

This research can be extended with further improvements to the classification techniques using multiple correspondence analysis (MCA); hierarchical MCA combined with classification models could be explored for a wide range of skewed datasets.

Practical implications

The metrics enhancement is measurable and helps in dealing with imbalanced intrusion-detection applications in current domains such as security, AI and IIoT digitization. Analytical results show improved metrics for the proposed approach compared with traditional machine learning algorithms, which justifies the measurable impact of the error-tuning parameter on classification accuracy.

Social implications

The proposed algorithm is useful in numerous IIoT applications such as healthcare and machinery automation.

Originality/value

This research work presents the classification-metric enhancement approach IRFCETF. The proposed method yields a test-set categorization for each case with an error-reduction mechanism.

Details

International Journal of Pervasive Computing and Communications, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1742-7371


Article
Publication date: 22 October 2018

Sihem Khemakhem, Fatma Ben Said and Younes Boujelbene

Credit scoring datasets are generally unbalanced. The number of repaid loans is higher than that of defaulted ones. Therefore, the classification of these data is biased toward…


Abstract

Purpose

Credit scoring datasets are generally unbalanced. The number of repaid loans is higher than that of defaulted ones. Therefore, the classification of these data is biased toward the majority class, which practically means that it tends to attribute a mistaken “good borrower” status even to “very risky borrowers”. In addition to the use of statistics and machine learning classifiers, this paper aims to explore the relevance and performance of sampling models combined with statistical prediction and artificial intelligence techniques to predict and quantify the default probability based on real-world credit data.

Design/methodology/approach

A real database from a Tunisian commercial bank was used and unbalanced data issues were addressed by the random over-sampling (ROS) and synthetic minority over-sampling technique (SMOTE). Performance was evaluated in terms of the confusion matrix and the receiver operating characteristic curve.
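The two resampling techniques compared above are simple to sketch. These minimal pure-Python versions (function names are illustrative; production work would typically use a library such as imbalanced-learn) duplicate minority samples for random over-sampling, and interpolate between a minority sample and one of its nearest minority neighbours for SMOTE.

```python
import random

def random_oversample(minority, target_size, seed=0):
    """ROS: duplicate randomly chosen minority samples until target_size."""
    rng = random.Random(seed)
    return minority + [rng.choice(minority)
                       for _ in range(target_size - len(minority))]

def smote(minority, n_synthetic, k=2, seed=0):
    """Minimal SMOTE: a synthetic point lies on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        n = rng.choice(neighbours)
        u = rng.random()  # uniform interpolation weight in [0, 1)
        synthetic.append(tuple(xi + u * (ni - xi) for xi, ni in zip(x, n)))
    return minority + synthetic
```

ROS only repeats existing points, which risks overfitting; SMOTE instead fills in the region between minority points, which is why the two can behave differently when combined with a given classifier.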

Findings

The results indicated that the combination of intelligent and statistical techniques and re-sampling approaches are promising for the default rate management and provide accurate credit risk estimates.

Originality/value

This paper empirically investigates the effectiveness of ROS and SMOTE in combination with logistic regression, artificial neural networks and support vector machines. The authors address the role of sampling strategies in the Tunisian credit market and its impact on credit risk. These sampling strategies may help financial institutions to reduce the erroneous classification costs in comparison with the unbalanced original data and may serve as a means for improving the bank’s performance and competitiveness.

Details

Journal of Modelling in Management, vol. 13 no. 4
Type: Research Article
ISSN: 1746-5664


Open Access
Article
Publication date: 30 July 2020

Alaa Tharwat

Classification techniques have been applied to many applications in various fields of sciences. There are several ways of evaluating classification algorithms. The analysis of…


Abstract

Classification techniques have been applied in many fields of science, and there are several ways of evaluating classification algorithms; the resulting metrics and their significance must be interpreted correctly when comparing learning algorithms. Most of these measures are scalar metrics, while some are graphical methods. This paper introduces a detailed overview of classification assessment measures, with the aim of providing the basics of these measures and showing how they work, to serve as a comprehensive source for researchers interested in this field. The overview starts by defining the confusion matrix for binary and multi-class classification problems. Many classification measures are then explained in detail, and the influence of balanced and imbalanced data on each metric is presented. An illustrative example shows (1) how to calculate these measures in binary and multi-class classification problems, and (2) the robustness of some measures against balanced and imbalanced data. Moreover, graphical measures such as receiver operating characteristic (ROC), precision-recall (PR) and detection error trade-off (DET) curves are presented in detail, and, in a step-by-step approach, different numerical examples demonstrate the preprocessing steps needed to plot ROC, PR and DET curves.
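A few of the scalar measures built on the binary confusion matrix, and the threshold sweep behind a ROC curve, can be made concrete with a short sketch (function and key names here are illustrative, not from the paper):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Common scalar measures derived from a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true-positive rate / sensitivity
    specificity = tn / (tn + fp)     # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy,
            "f1": f1, "balanced_accuracy": (recall + specificity) / 2}

def roc_points(scores, labels):
    """Sweep a decision threshold over the scores to collect the
    (FPR, TPR) pairs that make up a ROC curve."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts
```

Note that accuracy mixes both classes and so is dominated by the majority class on imbalanced data, whereas recall, specificity and their average are computed per class, which is exactly the robustness distinction the paper's examples illustrate.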

Details

Applied Computing and Informatics, vol. 17 no. 1
Type: Research Article
ISSN: 2634-1964


Book part
Publication date: 1 September 2021

Son Nguyen, Phyllis Schumacher, Alan Olinsky and John Quinn

We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data set of…

Abstract

We study the performance of various predictive models, including decision trees, random forests, neural networks, and linear discriminant analysis, on an imbalanced data set of home loan applications. During the process, we propose our undersampling algorithm to cope with the issues created by the imbalance of the data. Our technique is shown to work competitively against popular resampling techniques such as random oversampling, undersampling, the synthetic minority oversampling technique (SMOTE), and random oversampling examples (ROSE). We also investigate the relation between the true positive rate, the true negative rate, and the imbalance of the data.

Article
Publication date: 15 March 2021

Putta Hemalatha and Geetha Mary Amalanathan

Adequate resources for learning and training the data are an important constraint to develop an efficient classifier with outstanding performance. The data usually follows a…

Abstract

Purpose

Adequate resources for learning and training are an important prerequisite for developing an efficient classifier with outstanding performance. Real data usually follow a biased distribution of classes, i.e. an unequal distribution of classes within a dataset. This issue, known as the imbalance problem, is one of the most common issues in real-world applications, and learning from imbalanced datasets is a ubiquitous challenge in the field of data mining: imbalanced data degrade the performance of a classifier by producing inaccurate results.

Design/methodology/approach

In the proposed work, a novel fuzzy-based Gaussian synthetic minority oversampling (FG-SMOTE) algorithm is proposed to process the imbalanced data. The mechanism of the Gaussian SMOTE technique is based on finding the nearest neighbour concept to balance the ratio between minority and majority class datasets. The ratio of the datasets belonging to the minority and majority class is balanced using a fuzzy-based Levenshtein distance measure technique.
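The Gaussian-SMOTE mechanism described above, placing a synthetic point between a minority sample and a nearest neighbour using a Gaussian rather than uniform weight, can be sketched as follows. This is a hedged illustration only: the fuzzy Levenshtein-based balancing step is not reproduced, and the `sigma` default and the clipping of the weight are assumptions, not details from the paper.

```python
import random

def gaussian_smote(minority, n_synthetic, sigma=0.1, seed=0):
    """Gaussian variant of SMOTE: the interpolation weight is drawn from a
    Gaussian centred on the midpoint, then clipped to [0, 1] so the
    synthetic point stays on the segment to the nearest neighbour."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    out = list(minority)
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        n = min((p for p in minority if p is not x), key=lambda p: dist(x, p))
        w = min(max(rng.gauss(0.5, sigma), 0.0), 1.0)  # clipped Gaussian weight
        out.append(tuple(xi + w * (ni - xi) for xi, ni in zip(x, n)))
    return out
```

Compared with uniform interpolation, the Gaussian weight concentrates synthetic points near the middle of each segment, which keeps them away from both originals and can reduce near-duplicate oversampling.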

Findings

The performance and accuracy of the proposed algorithm are evaluated using a deep belief network classifier. The results show the efficiency of the fuzzy-based Gaussian SMOTE technique, which achieved an AUC of 93.7%, an F1 score of 94.2% and a geometric mean score of 93.6%, as computed from the confusion matrix.

Research limitations/implications

The proposed research still retains some challenges that need to be addressed, such as applying FG-SMOTE to multiclass imbalanced datasets and evaluating the dataset imbalance problem in a distributed environment.

Originality/value

The proposed algorithm fundamentally solves the data imbalance issues and challenges involved in handling the imbalanced data. FG-SMOTE has aided in balancing minority and majority class datasets.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 2
Type: Research Article
ISSN: 1756-378X


Article
Publication date: 14 May 2021

Zhenyuan Wang, Chih-Fong Tsai and Wei-Chao Lin

Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques…

Abstract

Purpose

Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies as the minority class from the normal data as the majority class, are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers.

Design/methodology/approach

In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values.
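The two preprocessing steps can be illustrated with simplified stand-ins: an edited-nearest-neighbour-style rule in place of IB3/DROP3/GA for instance selection, and column-mean imputation in place of the paper's CART-based imputation. All names and heuristics below are assumptions for illustration, not the authors' algorithms.

```python
def mean_impute(rows):
    """Replace None values in each column with that column's mean
    (a simple stand-in for CART-based imputation)."""
    cols = list(zip(*rows))
    means = [sum(v for v in c if v is not None) /
             sum(1 for v in c if v is not None) for c in cols]
    return [tuple(m if v is None else v for v, m in zip(r, means))
            for r in rows]

def select_instances(rows, labels, k=3):
    """ENN-style instance selection: keep a point only if at least half of
    its k nearest neighbours share its label, removing likely noise."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    kept = []
    for i, (r, y) in enumerate(zip(rows, labels)):
        nn = sorted((j for j in range(len(rows)) if j != i),
                    key=lambda j: dist(r, rows[j]))[:k]
        agree = sum(1 for j in nn if labels[j] == y)
        if agree * 2 >= k:
            kept.append(i)
    return kept
```

Either order of the two steps is possible, which is precisely the design question the paper investigates: imputing first gives the selection step complete rows to measure distances on, while selecting first avoids imputing rows that will be discarded anyway.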

Findings

The experiments are based on 44 class-imbalanced datasets, three instance selection algorithms (IB3, DROP3 and the GA), the CART decision tree for missing value imputation and three one-class classifiers (OCSVM, IFOREST and LOF). The results show that if the instance selection algorithm is chosen carefully, this step can improve the quality of the training data and make one-class classifiers outperform the baselines without instance selection. Moreover, when class-imbalanced datasets contain missing values, combining missing value imputation and instance selection, regardless of which step is performed first, maintains data quality similar to that of datasets without missing values.

Originality/value

The novelty of this paper is to investigate the effect of performing instance selection on the performance of one-class classifiers, which has never been done before. Moreover, this study is the first attempt to consider the scenario of missing values that exist in the training set for training one-class classifiers. In this case, performing missing value imputation and instance selection with different orders are compared.

Details

Data Technologies and Applications, vol. 55 no. 5
Type: Research Article
ISSN: 2514-9288


Article
Publication date: 29 September 2020

Hari Hara Krishna Kumar Viswanathan, Punniyamoorthy Murugesan, Sundar Rengasamy and Lavanya Vilvanathan

The purpose of this study is to compare the classification learning ability of our algorithm based on boosted support vector machine (B-SVM), against other classification…

Abstract

Purpose

The purpose of this study is to compare the classification learning ability of our algorithm based on boosted support vector machine (B-SVM), against other classification techniques in predicting the credit ratings of banks. The key feature of this study is the usage of an imbalanced dataset (in the response variable/rating) with a smaller number of observations (number of banks).

Design/methodology/approach

In general, datasets in the banking sector are small and imbalanced. In this study, 23 Scheduled Commercial Banks (SCBs) in India were chosen, and their corporate ratings were collated from the Indian subsidiary of a reputed global rating agency. The top management of the rating agency provided 12 input (quantitative) variables considered essential for rating a bank within India. To overcome the challenges of an imbalanced dataset with a small number of observations, this study uses the “Modified Boosted Support Vector Machines” (MBSVMs) algorithm proposed by Punniyamoorthy Murugesan and Sundar Rengasamy. The study also compares the classification ability of this algorithm against other techniques, such as multi-class SVM, back-propagation neural networks, multi-class linear discriminant analysis (LDA) and k-nearest neighbors (k-NN) classification, on the basis of the geometric mean (GM).

Findings

The performance of each algorithm has been compared on one metric, the geometric mean, also known as GMean (GM), which reflects class-wise sensitivity through the product of per-class recall values. The findings of the study show that the proposed MBSVM technique outperforms the other techniques.
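The geometric mean used for comparison is easy to state in code. For the binary case it is the square root of the product of sensitivity and specificity; a common multi-class generalization takes the k-th root of the product of per-class recalls (the exact variant the authors use is not specified here, so this is a sketch of the standard definitions):

```python
def g_mean(tp, fn, tn, fp):
    """Binary GM: geometric mean of sensitivity and specificity.
    Near zero whenever either class is almost entirely misclassified."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity * specificity) ** 0.5

def g_mean_multiclass(recalls):
    """Multi-class GM: k-th root of the product of per-class recalls."""
    prod = 1.0
    for r in recalls:
        prod *= r
    return prod ** (1.0 / len(recalls))
```

Unlike plain accuracy, GM collapses to zero if any single class is ignored, which is why it is a natural yardstick for small, imbalanced rating datasets.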

Research limitations/implications

This study provides an algorithm to predict ratings of banks where the dataset is small and imbalanced. One of the limitations of this research study is that subjective factors have not been included in our model; the sole focus is on the results generated by the models (driven by quantitative parameters). In future, studies may be conducted which may include subjective parameters (proxied by relevant and quantifiable variables).

Practical implications

Various stakeholders such as investors, regulators and central banks can predict the credit ratings of banks by themselves, by inputting appropriate data to the model.

Originality/value

In the process of rating banks, the usage of an imbalanced dataset can lessen the performance of the soft-computing techniques. In order to overcome this, the authors have come up with a novel classification approach based on “MBSVMs”, which can be used as a yardstick for such imbalanced datasets. For this purpose, through primary research, 12 features have been identified that are considered essential by the credit rating agencies.

Details

Benchmarking: An International Journal, vol. 28 no. 1
Type: Research Article
ISSN: 1463-5771


Article
Publication date: 9 April 2024

Lu Wang, Jiahao Zheng, Jianrong Yao and Yuangao Chen

With the rapid growth of the domestic lending industry, assessing whether the borrower of each loan is at risk of default is a pressing issue for financial institutions. Although…

Abstract

Purpose

With the rapid growth of the domestic lending industry, assessing whether the borrower of each loan is at risk of default is a pressing issue for financial institutions. Although there are some models that can handle such problems well, there are still some shortcomings in some aspects. The purpose of this paper is to improve the accuracy of credit assessment models.

Design/methodology/approach

In this paper, three different stages are used to improve the classification performance of LSTM, so that financial institutions can more accurately identify borrowers at risk of default. The first stage uses the K-Means-SMOTE algorithm to eliminate the class imbalance. In the second stage, ResNet is used for feature extraction and a two-layer LSTM for learning, to strengthen the network's ability to mine and use deep information. Finally, model performance is improved by using the IDWPSO algorithm for optimization when tuning the neural network.
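The first stage, K-Means-SMOTE, clusters the minority class and oversamples inside each cluster, so that synthetic points stay within dense minority regions instead of bridging distant ones. A minimal pure-Python sketch follows (a naive k-means plus within-cluster interpolation; function names and defaults are illustrative, not the paper's implementation):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Naive k-means: returns k lists of points (some possibly empty)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [tuple(sum(vals) / len(c) for vals in zip(*c)) if c
                   else centers[i] for i, c in enumerate(clusters)]
    return clusters

def kmeans_smote(minority, n_synthetic, k=2, seed=0):
    """Oversample by interpolating only between points of the same
    minority cluster, keeping synthetic samples in dense regions."""
    rng = random.Random(seed)
    clusters = [c for c in kmeans(minority, k, seed=seed) if len(c) >= 2]
    out = list(minority)
    for _ in range(n_synthetic):
        c = rng.choice(clusters)
        a, b = rng.sample(c, 2)
        u = rng.random()
        out.append(tuple(ai + u * (bi - ai) for ai, bi in zip(a, b)))
    return out
```

The second and third stages (ResNet-LSTM feature learning and IDWPSO hyperparameter optimization) require deep learning frameworks and are not sketched here.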

Findings

On two unbalanced datasets (category ratios of 700:1 and 3:1 respectively), the multi-stage improved model was compared with ten other models using accuracy, precision, specificity, recall, G-measure, F-measure and the nonparametric Wilcoxon test. It was demonstrated that the multi-stage improved model showed a more significant advantage in evaluating the imbalanced credit dataset.

Originality/value

In this paper, the parameters of the ResNet-LSTM hybrid neural network, which can fully mine and utilize the deep information, are tuned by an innovative intelligent optimization algorithm to strengthen the classification performance of the model.

Details

Kybernetes, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0368-492X

