Search results

1 – 10 of 192
Book part
Publication date: 6 September 2019

Son Nguyen, Gao Niu, John Quinn, Alan Olinsky, Jonathan Ormsbee, Richard M. Smith and James Bishop

Abstract

In recent years, the problem of classification with imbalanced data has attracted growing attention in the data-mining and machine-learning communities, owing to the abundance of imbalanced data emerging in many fields. In this chapter, we compare the performance of six classification methods on an imbalanced dataset under the influence of four resampling techniques. These classification methods are the random forest, the support vector machine, logistic regression, k-nearest neighbor (KNN), the decision tree, and AdaBoost. Our study shows that all of the classification methods struggle with the imbalanced data, with KNN performing the worst, detecting only 27.4% of the minority class. With the help of resampling techniques, however, all of the classification methods improve in overall performance. In particular, the random forest, in combination with the random over-sampling technique, performs best, achieving 82.8% balanced accuracy (the average of the true-positive rate and true-negative rate).
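As a rough illustration of the comparison described here, the sketch below pairs a random forest with random over-sampling and scores it by balanced accuracy, using scikit-learn and imbalanced-learn; the synthetic dataset and all settings are placeholders, not the chapter's actual setup.

```python
# Hedged sketch of the comparison described above: a random forest with
# random over-sampling, scored by balanced accuracy. The synthetic data
# and all settings are placeholders, not the chapter's setup.
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder imbalanced data: roughly 95% majority, 5% minority.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Over-sample the minority class in the training set only.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
# Balanced accuracy = mean of the true-positive and true-negative rates.
print(balanced_accuracy_score(y_te, clf.predict(X_te)))
```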

We then propose a new procedure for resampling the data, based on the idea of eliminating "easy" majority observations before under-sampling. It further improves the balanced accuracy of the random forest to 83.7%, making it the best approach for this imbalanced dataset.
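The abstract does not spell out how "easy" majority observations are identified; one plausible reading, sketched below, is to drop majority points that a preliminary probe model already classifies with high confidence, then random under-sample the rest. The probe model and the 0.9 threshold are illustrative assumptions, not the authors' procedure.

```python
# One plausible (assumed) reading of the proposed step: drop majority
# observations that a preliminary probe model classifies with high
# confidence, then random under-sample the remainder. The probe model
# and the 0.9 threshold are illustrative choices, not the authors'.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression

def trim_then_undersample(X, y, majority_label=0, threshold=0.9, seed=0):
    """X, y are NumPy arrays; labels are assumed binary."""
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    col = list(probe.classes_).index(majority_label)
    p_majority = probe.predict_proba(X)[:, col]
    easy = (y == majority_label) & (p_majority > threshold)  # "easy" points
    X_kept, y_kept = X[~easy], y[~easy]
    return RandomUnderSampler(random_state=seed).fit_resample(X_kept, y_kept)
```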

Details

Advances in Business and Management Forecasting
Type: Book
ISBN: 978-1-78754-290-7

Article
Publication date: 28 September 2023

Moh. Riskiyadi

Abstract

Purpose

This study aims to compare machine learning models, dataset treatments and training-testing splits, using data mining methods, to detect financial statement fraud.

Design/methodology/approach

This study uses a quantitative approach based on secondary data from the financial reports of companies listed on the Indonesia Stock Exchange over the last ten years, from 2010 to 2019. The research variables comprise both financial and non-financial variables. Indicators of financial statement fraud are determined based on notes or sanctions from regulators and on financial statement restatements under special supervision.

Findings

The findings show that the Extremely Randomized Trees (ERT) model performs better than other machine learning models, that the original-sampling dataset performs best among the dataset treatments, and that the 80:10 training-testing split is the best among the splitting treatments. The ERT model with an original-sampling dataset and an 80:10 training-testing split is therefore the most appropriate for detecting future financial statement fraud.
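A minimal sketch of the reported best configuration, assuming scikit-learn's ExtraTreesClassifier stands in for the ERT model; note that, as stated, an 80:10 split leaves 10% of the data unused. Data and hyperparameters are placeholders.

```python
# Hedged sketch of the reported best configuration, assuming scikit-learn's
# ExtraTreesClassifier stands in for the ERT model. As stated, an 80:10
# split leaves 10% of the data unused. Data and settings are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, test_size=0.1, stratify=y, random_state=0)

# "Original sampling" is read here as: no resampling step before fitting.
ert = ExtraTreesClassifier(random_state=0).fit(X_tr, y_tr)
print(ert.score(X_te, y_te))
```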

Practical implications

This study can be used by regulators, investors, stakeholders and financial crime experts to add insight into better methods of detecting financial statement fraud.

Originality/value

This study proposes a machine learning model that has not been discussed in previous studies and performs comparisons to obtain the best financial statement fraud detection results. Practitioners and academics can use the findings for further research development.

Details

Asian Review of Accounting, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1321-7348

Article
Publication date: 23 June 2022

Kerim Koc, Ömer Ekmekcioğlu and Asli Pelin Gurgun

Abstract

Purpose

Central to the entire discipline of construction safety management is the concept of construction accidents. Although notable progress has been made in safety management applications over the last decades, the construction industry still accounts for a considerable percentage of all workplace fatalities across the world. This study aims to predict occupational accident outcomes based on national data, using machine learning (ML) methods coupled with several resampling strategies.

Design/methodology/approach

An occupational accident dataset recorded in Turkey was collected. To deal with the class imbalance between the numbers of nonfatal and fatal accidents, the dataset was pre-processed with random under-sampling (RUS), random over-sampling (ROS) and the synthetic minority over-sampling technique (SMOTE). In addition, random forest (RF), Naïve Bayes (NB), k-nearest neighbor (KNN) and artificial neural networks (ANNs) were employed as ML methods to predict accident outcomes.
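The resampler-classifier pairings described here could be wired up as imbalanced-learn pipelines along these lines; the MLP stand-in for the ANNs and all hyperparameters are assumptions, not the study's configuration.

```python
# Hedged sketch of the resampler-classifier pairings named above, wired up
# as imbalanced-learn pipelines. The MLP stand-in for the ANNs and all
# hyperparameters are assumptions, not the study's configuration.
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

samplers = {"RUS": RandomUnderSampler(random_state=0),
            "ROS": RandomOverSampler(random_state=0),
            "SMOTE": SMOTE(random_state=0)}
models = {"RF": RandomForestClassifier(random_state=0),
          "NB": GaussianNB(),
          "KNN": KNeighborsClassifier(),
          "ANN": MLPClassifier(max_iter=500, random_state=0)}

# One pipeline per pairing; the sampler resamples only the data it is fitted on.
pipelines = {f"{s}+{m}": Pipeline([("sampler", clone(samplers[s])),
                                   ("clf", clone(models[m]))])
             for s in samplers for m in models}
```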

Findings

The results highlighted that RF outperformed the other methods when the dataset was preprocessed with RUS. The permutation importance results obtained through the RF showed that the number of past accidents in the company, the worker's age, the material used, the number of workers in the company, the accident year, and the time of the accident were the most significant attributes.
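A sketch of the permutation-importance step reported here: shuffle one feature at a time on held-out data and rank features by the resulting score drop. Synthetic data stands in for the accident records.

```python
# Sketch of the permutation-importance step: shuffle one feature at a time
# on held-out data and rank features by the score drop. Synthetic data
# stands in for the accident records.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=8, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(imp.importances_mean.argsort()[::-1])  # most influential features first
```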

Practical implications

The proposed framework can be used on construction sites on a monthly basis to detect workers who have a high probability of experiencing fatal accidents, providing a valuable decision-making input for safety professionals seeking to reduce the number of fatal accidents.

Social implications

Practitioners and occupational health and safety (OHS) departments of construction firms can focus on the most important attributes identified by analysis results to enhance the workers' quality of life and well-being.

Originality/value

The literature on accident outcome prediction is limited in terms of dealing with imbalanced datasets through integrated resampling techniques and ML methods in the construction safety domain. A novel utilization plan was proposed and enhanced by the analysis results.

Details

Engineering, Construction and Architectural Management, vol. 30 no. 9
Type: Research Article
ISSN: 0969-9988

Article
Publication date: 19 March 2024

Thao-Trang Huynh-Cam, Long-Sheng Chen and Tzu-Chuen Lu

Abstract

Purpose

This study aimed to use enrollment information including demographic, family background and financial status, which can be gathered before the first semester starts, to construct early prediction models (EPMs) and extract crucial factors associated with first-year student dropout probability.

Design/methodology/approach

The real-world samples comprised the enrollment records of 2,412 first-year students of a private university (UNI) in Taiwan. This work utilized decision tree (DT), multilayer perceptron (MLP) and logistic regression (LR) algorithms for constructing EPMs; under-sampling, random over-sampling and the synthetic minority over-sampling technique (SMOTE) for solving the data imbalance problem; and accuracy, precision, recall, F1-score, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) for evaluating the constructed EPMs.
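One EPM configuration from this list might look like the sketch below: a decision tree trained after SMOTE, evaluated with the stated metrics. The synthetic data (sized like the 2,412-student cohort) and the class ratio are assumptions.

```python
# Hedged sketch of one EPM configuration from the list above: a decision
# tree trained after SMOTE, scored with the stated metrics. The synthetic
# data (sized like the 2,412-student cohort) and class ratio are assumptions.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2412, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

epm = Pipeline([("smote", SMOTE(random_state=0)),
                ("tree", DecisionTreeClassifier(random_state=0))]).fit(X_tr, y_tr)
print(classification_report(y_te, epm.predict(X_te)))      # precision/recall/F1
print(roc_auc_score(y_te, epm.predict_proba(X_te)[:, 1]))  # AUC
```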

Findings

DT outperformed MLP and LR with accuracy (97.59%), precision (98%), recall (97%), F1-score (97%) and ROC-AUC (98%). The top-ranking factors comprised “student loan,” “dad occupations,” “mom educational level,” “department,” “mom occupations,” “admission type,” “school fee waiver” and “main sources of living.”

Practical implications

This work only used enrollment information to identify dropout students and crucial factors associated with dropout probability as soon as students enter universities. The extracted rules could be utilized to enhance student retention.

Originality/value

Although first-year student dropout has gained constant attention from researchers in educational practice and theory worldwide, previous studies have relied on in-semester and/or post-semester factors and/or questionnaires for prediction. These methods fail to offer universities early warning systems (EWS) or to help them provide timely assistance to potential dropouts, who face economic difficulties. This work provided universities with an EWS and extracted rules for early dropout prevention and intervention.

Details

Journal of Applied Research in Higher Education, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2050-7003

Book part
Publication date: 10 December 2018

Thomas Keil, Pasi Kuusela and Nils Stieglitz

Abstract

How do organizations respond to negative feedback regarding their innovation activities? In this chapter, the authors reconcile contradictory predictions stemming from the behavioral learning and the escalation of commitment (EoC) perspectives regarding persistence under negative performance feedback. The authors' core argument suggests that the seemingly contradictory psychological processes indicated by these two perspectives occur simultaneously in decision makers, but that the design of organizational roles and reward systems affects their prevalence in decision-making tasks. Specifically, the authors argue that for decision makers responsible for an individual project, responses to negative performance feedback regarding that project are dominated by the self-justification and loss-avoidance mechanisms predicted by the EoC literature, while for decision makers responsible for a portfolio of projects, responses to negative performance feedback regarding a project are dominated by the under-sampling of poorly performing alternatives that behavioral learning theory predicts. In addition to assigning decision-making authority to different organizational roles, organizational designers shape the strength of these mechanisms through the design of reward systems, specifically by setting more or less ambiguous goals, aspiration levels, time horizons for the incentives provided, and levels of failure tolerance.

Book part
Publication date: 15 January 2010

Sean M. Puckett and John M. Rose

Abstract

Currently, the state of practice in experimental design centres on orthogonal designs (Alpizar et al., 2003), which are suitable when applied to surveys with a large sample size. In a stated choice experiment involving interdependent freight stakeholders in Sydney (see Hensher & Puckett, 2007; Puckett et al., 2007; Puckett & Hensher, 2008), one significant empirical constraint was the difficulty of recruiting unique decision-making groups to participate. The expected relatively small sample size led us to seek an alternative experimental design. That is, we decided to construct an optimal design that utilised extant information regarding the preferences and experiences of respondents, to achieve statistically significant parameter estimates under a relatively low sample size (see Bliemer & Rose, 2006).

The D-efficient experimental design developed for the study is unique, in that it centred on the choices of interdependent respondents. Hence, the generation of the design had to account for the preferences of two distinct classes of decision makers: buyers and sellers of road freight transport. This paper discusses the process by which these (non-coincident) preferences were used to seed the generation of the experimental design, and then examines the relative power of the design through an extensive bootstrap analysis of increasingly restricted sample sizes for both decision-making classes in the sample. We demonstrate the strong potential for efficient designs to achieve empirical goals under sampling constraints, whilst identifying limitations to their power as sample size decreases.
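The design-generation step relies on specialized stated-choice software and is not reproduced here; the bootstrap exercise, though, can be sketched in outline: re-estimate a model on resamples of decreasing size and track how often a parameter estimate is recovered. Everything below (the logit stand-in, the data, the stability criterion) is an illustrative assumption.

```python
# Rough sketch of the bootstrap exercise only (the D-efficient design
# generation itself requires specialized stated-choice software and is not
# reproduced). A logit stand-in, placeholder data and a crude stability
# criterion illustrate the idea of shrinking the sample and re-estimating.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                         # placeholder attributes
y = (X @ np.array([0.8, -0.5, 0.3]) + rng.logistic(size=400)) > 0

for n in (400, 200, 100, 50):                         # shrinking sample sizes
    stable = 0
    for _ in range(200):                              # bootstrap replications
        idx = rng.integers(0, len(X), size=n)
        coef = LogisticRegression().fit(X[idx], y[idx]).coef_[0, 0]
        stable += abs(coef) > 0.4                     # crude recovery check
    print(n, stable / 200)
```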

Details

Choice Modelling: The State-of-the-art and The State-of-practice
Type: Book
ISBN: 978-1-84950-773-8

Details

Rutgers Studies in Accounting Analytics: Audit Analytics in the Financial Industry
Type: Book
ISBN: 978-1-78743-086-0

Article
Publication date: 14 May 2020

Byungdae An and Yongmoo Suh

Abstract

Purpose

Financial statement fraud (FSF) committed by companies implies that their current status may not be healthy. As such, it is important to detect FSF, since such companies tend to conceal bad information, which causes great losses to various stakeholders. Thus, the objective of the paper is to propose a novel approach to building a classification model to identify FSF, one that shows high classification performance and from which human-readable rules can be extracted to explain why a company is likely to commit FSF.

Design/methodology/approach

Having prepared multiple sub-datasets to cope with the class imbalance problem, we build a set of decision trees for each sub-dataset; select a subset of the set as a model for the sub-dataset by removing every tree whose performance is less than the average accuracy of all trees in the set; and then select the model which shows the best accuracy among the models. We call the resulting model MRF (Modified Random Forest). Given a new instance, we extract rules from the MRF model to explain whether the company corresponding to the new instance is likely to commit FSF or not.
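A hypothetical reading of the tree-selection step, for one sub-dataset: grow bootstrap trees, score each on a validation split, and keep only trees at or above the set's average accuracy. The tree count and the use of a separate validation split are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of the tree-selection step for one sub-dataset: grow
# bootstrap trees, score each on a validation split, and keep only trees
# at or above the set's average accuracy. Tree count and the validation
# split are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def modified_forest(X_tr, y_tr, X_val, y_val, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    trees, scores = [], []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X_tr), size=len(X_tr))  # bootstrap sample
        tree = DecisionTreeClassifier(random_state=seed).fit(X_tr[idx], y_tr[idx])
        trees.append(tree)
        scores.append(tree.score(X_val, y_val))
    average = np.mean(scores)
    return [t for t, s in zip(trees, scores) if s >= average]

def predict_majority(trees, X):
    votes = np.stack([t.predict(X) for t in trees])   # assumes 0/1 labels
    return (votes.mean(axis=0) >= 0.5).astype(int)
```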

Findings

Experimental results show that the MRF classifier outperformed the benchmark models. The results also revealed that all the variables related to profit belong to the set of the most important indicators of FSF and that two new variables related to gross profit, which had not been appraised in previous studies on FSF, were identified.

Originality/value

This study proposed a method of building a classification model which shows outstanding performance and provides decision rules that can be used to explain the classification results. In addition, a new way to resolve the class imbalance problem was suggested in this paper.

Details

Data Technologies and Applications, vol. 54 no. 2
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 28 May 2021

Subbaraju Pericherla and E. Ilavarasan

Abstract

Purpose

Nowadays, people are connected through social media platforms such as Facebook, Instagram, Twitter and YouTube. Bullies take advantage of these social networks to share their comments. Cyberbullying is a typical kind of harassment in which aggressive and abusive comments are made to hurt netizens. Social media is one of the areas where bullying happens extensively. Hence, it is necessary to develop an efficient and autonomous cyberbullying detection technique.

Design/methodology/approach

In this paper, the authors proposed a transformer network-based word embeddings approach for cyberbullying detection. RoBERTa is used to generate word embeddings and Light Gradient Boosting Machine is used as a classifier.
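In outline, that pipeline could look like the sketch below: mean-pooled RoBERTa token embeddings fed to a LightGBM classifier. The checkpoint, the pooling choice and the toy data are assumptions; the paper's exact configuration may differ.

```python
# Hedged sketch of the described pipeline: mean-pooled RoBERTa token
# embeddings fed to a LightGBM classifier. The checkpoint, pooling choice
# and toy data are assumptions; the paper's exact setup may differ.
import torch
from lightgbm import LGBMClassifier
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = AutoModel.from_pretrained("roberta-base")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state   # (batch, tokens, 768)
    return hidden.mean(dim=1).numpy()             # mean-pool into one vector per text

texts = ["you are awesome", "nobody likes you, leave"]  # toy examples only
labels = [0, 1]                                         # 1 = bullying (toy label)
clf = LGBMClassifier().fit(embed(texts), labels)        # real training needs far more data
```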

Findings

The proposed approach outperforms machine learning algorithms such as logistic regression and support vector machines, as well as deep learning models such as word-level convolutional neural networks (word CNN) and character-level convolutional neural networks with shortcuts (char CNNs), in terms of precision, recall and F1-score.

Originality/value

One limitation of traditional word embedding methods is that they are context-independent. In this work, only text data are utilized to identify cyberbullying. This work can be extended to predict cyberbullying activities in multimedia environments involving images, audio and video.

Details

International Journal of Intelligent Unmanned Systems, vol. 12 no. 1
Type: Research Article
ISSN: 2049-6427

Article
Publication date: 10 March 2023

Jingyi Li and Shiwei Chao

Abstract

Purpose

Binary classification on imbalanced data is a challenge; due to the imbalance of the classes, the minority class is easily masked by the majority class. Most existing classifiers, however, are better at identifying the majority class and thereby ignore the minority class, which leads to classifier degradation. To address this, this paper proposes a twin-support vector machine for binary classification on imbalanced data.

Design/methodology/approach

In the proposed method, the authors construct two support vector machines that focus on the majority and minority classes, respectively. To promote the learning ability of the two support vector machines, a new kernel is derived for them.
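Twin-support vector machines fit one hyperplane per class by solving two smaller problems; scikit-learn has no built-in TWSVM, so the sketch below is only a loose stand-in for the "one machine per class" idea: two RBF SVMs with opposite class weights whose decision scores are averaged. The paper's derived kernel is not reproduced.

```python
# Loose stand-in for the "one machine per class" idea; NOT the authors'
# formulation, and the derived kernel is not reproduced. The class
# weights below are arbitrary illustrative choices.
import numpy as np
from sklearn.svm import SVC

class TwoMachineSVM:
    """Two weighted SVMs, one biased toward each class (labels 0/1)."""

    def fit(self, X, y):
        self.svm_maj = SVC(class_weight={0: 5.0, 1: 1.0}).fit(X, y)  # favors class 0
        self.svm_min = SVC(class_weight={0: 1.0, 1: 5.0}).fit(X, y)  # favors class 1
        return self

    def predict(self, X):
        # Average the two decision scores; a positive combined score -> class 1.
        score = self.svm_maj.decision_function(X) + self.svm_min.decision_function(X)
        return (score > 0).astype(int)
```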

Findings

(1) A novel twin-support vector machine is proposed for binary classification on imbalanced data, and new kernels are derived for it. (2) For imbalanced data, the complexity of the data distribution has negative effects on classification results; however, improved classification results can be gained and the desired boundaries learned by using the optimized kernels. (3) Classifiers based on twin architectures have more advantages than those based on a single architecture for binary classification on imbalanced data.

Originality/value

For imbalanced data, the complexity of the data distribution has negative effects on classification results; however, improved classification results can be gained and the desired boundaries learned by using the optimized kernels.

Details

Data Technologies and Applications, vol. 57 no. 3
Type: Research Article
ISSN: 2514-9288
