Search results

1 – 10 of 126

View access options

Book part

Publication date: 1 September 2021

Effects of Resampling Techniques on Imbalanced Data Classification: A New Under-resampling Method

Son Nguyen, Phyllis Schumacher, Alan Olinsky and John Quinn

We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data set of…

HTML

PDF (1.3 MB)

EPUB (10.6 MB)

Abstract

We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data set of home loan applications. During the process, we propose our undersampling algorithm to cope with the issues created by the imbalance of the data. Our technique is shown to work competitively against popular resampling techniques such as random oversampling, undersampling, synthetic minority oversampling technique (SMOTE), and random oversampling examples (ROSE). We also investigate the relation between the true positive rate, true negative rate, and the imbalance of the data.

Details

Advances in Business and Management Forecasting

Type: Book

DOI:

ISBN: 978-1-83982-091-5

Keywords

View access options

Article

Publication date: 4 December 2018

Malicious web domain identification using online credibility and performance data by considering the class imbalance issue

Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao and Yuqing Lin

Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this…

HTML

PDF (1.8 MB)

Downloads

540

Abstract

Purpose

Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper to investigate the use of machine learning techniques for malicious web domain identification by considering the class imbalance issue (i.e. there are more benign web domains than malicious ones).

Design/methodology/approach

The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use the SMOTE for oversampling and PSO for undersampling.

Findings

By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective.

Practical implications

This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification.

Originality/value

Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.

Details

Industrial Management & Data Systems, vol. 119 no. 3

Type: Research Article

DOI:

ISSN: 0263-5577

Keywords

View access options

Book part

Publication date: 6 September 2019

Dimension Reduction in Bankruptcy Prediction: A Case Study of North American Companies

Son Nguyen, Edward Golas, William Zywiak and Kristin Kennedy

Bankruptcy prediction has attracted a great deal of research in the data mining/machine learning community, due to its significance in the world of accounting, finance, and…

HTML

PDF (461 KB)

EPUB (306 KB)

Abstract

Bankruptcy prediction has attracted a great deal of research in the data mining/machine learning community, due to its significance in the world of accounting, finance, and investment. This chapter examines the influence of different dimension reduction techniques on decision tree model applied to the bankruptcy prediction problem. The studied techniques are principal component analysis (PCA), sliced inversed regression (SIR), sliced average variance estimation (SAVE), and factor analysis (FA). To focus on the impact of the dimension reduction techniques, we chose only to use decision tree as our predictive model and “undersampling” as the solution to the issue of data imbalance. Our computation shows that the choice of dimension reduction technique greatly affects the performances of predictive models and that one could use dimension reduction techniques to improve the predictive power of the decision tree model. Also, in this study, we propose a method to estimate the true dimension of the data.

Details

Advances in Business and Management Forecasting

Type: Book

DOI:

ISBN: 978-1-78754-290-7

Keywords

View access options

Book part

Publication date: 1 September 2021

Predicting the Length of Stay in Hospital Emergency Rooms in Rhode Island

Alicia T. Lamere, Son Nguyen, Gao Niu, Alan Olinsky and John Quinn

Predicting a patient's length of stay (LOS) in a hospital setting has been widely researched. Accurately predicting an individual's LOS can have a significant impact on a…

HTML

PDF (921 KB)

EPUB (10.6 MB)

Abstract

Predicting a patient's length of stay (LOS) in a hospital setting has been widely researched. Accurately predicting an individual's LOS can have a significant impact on a healthcare provider's ability to care for individuals by allowing them to properly prepare and manage resources. A hospital's productivity requires a delicate balance of maintaining enough staffing and resources without being overly equipped or wasteful. This has become even more important in light of the current COVID-19 pandemic, during which emergency departments around the globe have been inundated with patients and are struggling to manage their resources.

In this study, the authors focus on the prediction of LOS at the time of admission in emergency departments at Rhode Island hospitals through discharge data obtained from the Rhode Island Department of Health over the time period of 2012 and 2013. This work also explores the distribution of discharge dispositions in an effort to better characterize the resources patients require upon leaving the emergency department.

Details

Advances in Business and Management Forecasting

Type: Book

DOI:

ISBN: 978-1-83982-091-5

Keywords

Open Access

Article

Publication date: 6 February 2020

How much information is lost when sampling driving behavior data? Indicators to quantify the extent of information loss

Jun Liu, Asad Khattak, Lee Han and Quan Yuan

Individuals’ driving behavior data are becoming available widely through Global Positioning System devices and on-board diagnostic systems. The incoming data can be sampled at…

HTML

PDF (2.1 MB)

Downloads

1340

Abstract

Purpose

Individuals’ driving behavior data are becoming available widely through Global Positioning System devices and on-board diagnostic systems. The incoming data can be sampled at rates ranging from one Hertz (or even lower) to hundreds of Hertz. Failing to capture substantial changes in vehicle movements over time by “undersampling” can cause loss of information and misinterpretations of the data, but “oversampling” can waste storage and processing resources. The purpose of this study is to empirically explore how micro-driving decisions to maintain speed, accelerate or decelerate, can be best captured, without substantial loss of information.

Design/methodology/approach

This study creates a set of indicators to quantify the magnitude of information loss (MIL). Each indicator is calculated as a percentage to index the extent of information loss (EIL) in different situations. An overall information loss index named EIL is created to combine the MIL indicators. Data from a driving simulator study collected at 20 Hertz are analyzed (N = 718,481 data points from 35,924 s of driving tests). The study quantifies the relationship between information loss indicators and sampling rates.

Findings

The results show that marginally more information is lost as data are sampled down from 20 to 0.5 Hz, but the relationship is not linear. With four indicators of MILs, the overall EIL is 3.85 per cent for 1-Hz sampling rate driving behavior data. If sampling rates are higher than 2 Hz, all MILs are under 5 per cent for importation loss.

Originality/value

This study contributes by developing a framework for quantifying the relationship between sampling rates, and information loss and depending on the objective of their study, researchers can choose the appropriate sampling rate necessary to get the right amount of accuracy.

Details

Journal of Intelligent and Connected Vehicles, vol. 3 no. 1

Type: Research Article

DOI:

ISSN: 2399-9802

Keywords

Content available

Book part

Publication date: 1 September 2021

Index

Free Access

HTML

PDF (35 KB)

EPUB (10.6 MB)

Details

Advances in Business and Management Forecasting

Type: Book

DOI:

ISBN: 978-1-83982-091-5

View access options

Article

Publication date: 22 July 2022

Incremental and parallel proximal SVM algorithm tailored on the Jetson Nano for the ImageNet challenge

Thanh-Nghi Do

This paper aims to propose the new incremental and parallel training algorithm of proximal support vector machines (Inc-Par-PSVM) tailored on the edge device (i.e. the Jetson…

HTML

PDF (535 KB)

Downloads

173

Abstract

Purpose

This paper aims to propose the new incremental and parallel training algorithm of proximal support vector machines (Inc-Par-PSVM) tailored on the edge device (i.e. the Jetson Nano) to handle the large-scale ImageNet challenging problem.

Design/methodology/approach

The Inc-Par-PSVM trains in the incremental and parallel manner ensemble binary PSVM classifiers used for the One-Versus-All multiclass strategy on the Jetson Nano. The binary PSVM model is the average in bagged binary PSVM models built in undersampling training data block.

Findings

The empirical test results on the ImageNet data set show that the Inc-Par-PSVM algorithm with the Jetson Nano (Quad-core ARM A57 @ 1.43 GHz, 128-core NVIDIA Maxwell architecture-based graphics processing unit, 4 GB RAM) is faster and more accurate than the state-of-the-art linear SVM algorithm run on a PC [Intel(R) Core i7-4790 CPU, 3.6 GHz, 4 cores, 32 GB RAM].

Originality/value

The new incremental and parallel PSVM algorithm tailored on the Jetson Nano is able to efficiently handle the large-scale ImageNet challenge with 1.2 million images and 1,000 classes.

Details

International Journal of Web Information Systems, vol. 18 no. 2/3

Type: Research Article

DOI:

ISSN: 1744-0084

Keywords

View access options

Article

Publication date: 29 November 2021

A novel semi-supervised self-training method based on resampling for Twitter fake account identification

Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun and Jie Yin

Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, commercial propaganda or impersonate others. The effective…

HTML

PDF (3.5 MB)

Downloads

277

Abstract

Purpose

Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, commercial propaganda or impersonate others. The effective identification of bot accounts is conducive to accurately judge the disseminated information for the public. However, in actual fake account identification, it is expensive and inefficient to manually label Twitter accounts, and the labeled data are usually unbalanced in classes. To this end, the authors propose a novel framework to solve these problems.

Design/methodology/approach

In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to the real Twitter account data set from Kaggle. Specifically, the authors first train the classifier in the initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, iteratively select high confidence instances from unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. It is worth mentioning that the resampling technique is integrated into the self-training process, and the data class is balanced at the initial stage of the self-training iteration.

Findings

The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results on 6 different base classifiers, especially for the initial small-scale labeled Twitter accounts.

Originality/value

This paper provides novel insights in identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method to automatically label Twitter accounts from the semi-supervised background. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on the identification effect.

Details

Data Technologies and Applications, vol. 56 no. 3

Type: Research Article

DOI:

ISSN: 2514-9288

Keywords

View access options

Article

Publication date: 9 April 2024

A multi-stage integrated model based on deep neural network for credit risk assessment with unbalanced data

Lu Wang, Jiahao Zheng, Jianrong Yao and Yuangao Chen

With the rapid growth of the domestic lending industry, assessing whether the borrower of each loan is at risk of default is a pressing issue for financial institutions. Although…

HTML

PDF (3.7 MB)

Downloads

Abstract

Purpose

With the rapid growth of the domestic lending industry, assessing whether the borrower of each loan is at risk of default is a pressing issue for financial institutions. Although there are some models that can handle such problems well, there are still some shortcomings in some aspects. The purpose of this paper is to improve the accuracy of credit assessment models.

Design/methodology/approach

In this paper, three different stages are used to improve the classification performance of LSTM, so that financial institutions can more accurately identify borrowers at risk of default. The first approach is to use the K-Means-SMOTE algorithm to eliminate the imbalance within the class. In the second step, ResNet is used for feature extraction, and then two-layer LSTM is used for learning to strengthen the ability of neural networks to mine and utilize deep information. Finally, the model performance is improved by using the IDWPSO algorithm for optimization when debugging the neural network.

Findings

On two unbalanced datasets (category ratios of 700:1 and 3:1 respectively), the multi-stage improved model was compared with ten other models using accuracy, precision, specificity, recall, G-measure, F-measure and the nonparametric Wilcoxon test. It was demonstrated that the multi-stage improved model showed a more significant advantage in evaluating the imbalanced credit dataset.

Originality/value

In this paper, the parameters of the ResNet-LSTM hybrid neural network, which can fully mine and utilize the deep information, are tuned by an innovative intelligent optimization algorithm to strengthen the classification performance of the model.

Details

Kybernetes, vol. ahead-of-print no. ahead-of-print

Type: Research Article

DOI:

ISSN: 0368-492X

Keywords

Open Access

Article

Publication date: 5 March 2019

Personal bankruptcy prediction using decision tree model

Sharifah Heryati Syed Nor, Shafinar Ismail and Bee Wah Yap

Personal bankruptcy is on the rise in Malaysia. The Insolvency Department of Malaysia reported that personal bankruptcy has increased since 2007, and the total accumulated…

HTML

PDF (654 KB)

Downloads

4316

Abstract

Purpose

Personal bankruptcy is on the rise in Malaysia. The Insolvency Department of Malaysia reported that personal bankruptcy has increased since 2007, and the total accumulated personal bankruptcy cases stood at 131,282 in 2014. This is indeed an alarming issue because the increasing number of personal bankruptcy cases will have a negative impact on the Malaysian economy, as well as on the society. From the aspect of individual’s personal economy, bankruptcy minimizes their chances of securing a job. Apart from that, their account will be frozen, lost control on their assets and properties and not allowed to start any business nor be a part of any company’s management. Bankrupts also will be denied from any loan application, restricted from travelling overseas and cannot act as a guarantor. This paper aims to investigate this problem by developing the personal bankruptcy prediction model using the decision tree technique.

Design/methodology/approach

In this paper, bankrupt is defined as terminated members who failed to settle their loans. The sample comprised of 24,546 cases with 17 per cent settled cases and 83 per cent terminated cases. The data included a dependent variable, i.e. bankruptcy status (Y = 1(bankrupt), Y = 0 (non-bankrupt)) and 12 predictors. SAS Enterprise Miner 14.1 software was used to develop the decision tree model.

Findings

Upon completion, this study succeeds to come out with the profiles of bankrupts, reliable personal bankruptcy scoring model and significant variables of personal bankruptcy.

Practical implications

This decision tree model is possible for patent and income generation. Financial institutions are able to use this model for potential borrowers to predict their tendency toward personal bankruptcy.

Social implications

Create awareness to society on significant variables of personal bankruptcy so that they can avoid being a bankrupt.

Originality/value

This decision tree model is able to facilitate and assist financial institutions in evaluating and assessing their potential borrower. It helps to identify potential defaulting borrowers. It also can assist financial institutions in implementing the right strategies to avoid defaulting borrowers.

Details

Journal of Economics, Finance and Administrative Science, vol. 24 no. 47

Type: Research Article

DOI:

ISSN: 2077-1886

Keywords

Access

Year

Content type

1 – 10 of 126

Abstract

Details

Keywords

Abstract

Purpose

Design/methodology/approach

Findings

Practical implications

Originality/value

Details

Keywords

Abstract

Details

Keywords

Abstract

Details

Keywords

Abstract

Purpose

Design/methodology/approach

Findings

Originality/value

Details

Keywords

Abstract

Details

Abstract

Purpose

Design/methodology/approach

Findings

Originality/value

Details

Keywords

Abstract

Purpose

Design/methodology/approach

Findings

Originality/value

Details

Keywords

Abstract

Purpose

Design/methodology/approach

Findings

Originality/value

Details

Keywords

Abstract

Purpose

Design/methodology/approach

Findings

Practical implications

Social implications

Originality/value

Details

Keywords

Access

Year

Content type

We’re listening — tell us what you think

Something didn’t work…

All feedback is valuable

Join us on our journey

Platform update page

Questions & More Information