Search results

1 – 10 of over 2000
Click here to view access options
Book part
Publication date: 6 September 2019

Son Nguyen, Gao Niu, John Quinn, Alan Olinsky, Jonathan Ormsbee, Richard M. Smith and James Bishop

In recent years, the problem of classification with imbalanced data has been growing in popularity in the data-mining and machine-learning communities due to the emergence…

Abstract

In recent years, the problem of classification with imbalanced data has been growing in popularity in the data-mining and machine-learning communities due to the emergence of an abundance of imbalanced data in many fields. In this chapter, we compare the performance of six classification methods on an imbalanced dataset under the influence of four resampling techniques. These classification methods are the random forest, the support vector machine, logistic regression, k-nearest neighbor (KNN), the decision tree, and AdaBoost. Our study has shown that all of the classification methods have difficulty when working with the imbalanced data, with the KNN performing the worst, detecting only 27.4% of the minority class. However, with the help of resampling techniques, all of the classification methods experience improvement on overall performances. In particular, the Random Forest, in combination with the random over-sampling technique, performs the best, achieving 82.8% balanced accuracy (the average of the true-positive rate and true-negative rate).

We then propose a new procedure to resample the data. Our method is based on the idea of eliminating “easy” majority observations before under-sampling them. It has further improved the balanced accuracy of the Random Forest to 83.7%, making it the best approach for the imbalanced data.

Details

Advances in Business and Management Forecasting
Type: Book
ISBN: 978-1-78754-290-7

Keywords

Click here to view access options

Abstract

Details

Machine Learning and Artificial Intelligence in Marketing and Sales
Type: Book
ISBN: 978-1-80043-881-1

Click here to view access options
Book part
Publication date: 1 September 2021

Son Nguyen, Phyllis Schumacher, Alan Olinsky and John Quinn

We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data…

Abstract

We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data set of home loan applications. During the process, we propose our undersampling algorithm to cope with the issues created by the imbalance of the data. Our technique is shown to work competitively against popular resampling techniques such as random oversampling, undersampling, synthetic minority oversampling technique (SMOTE), and random oversampling examples (ROSE). We also investigate the relation between the true positive rate, true negative rate, and the imbalance of the data.

Click here to view access options
Book part
Publication date: 15 March 2021

Jochen Hartmann

Across disciplines, researchers and practitioners employ decision tree ensembles such as random forests and XGBoost with great success. What explains their popularity…

Abstract

Across disciplines, researchers and practitioners employ decision tree ensembles such as random forests and XGBoost with great success. What explains their popularity? This chapter showcases how marketing scholars and decision-makers can harness the power of decision tree ensembles for academic and practical applications. The author discusses the origin of decision tree ensembles, explains their theoretical underpinnings, and illustrates them empirically using a real-world telemarketing case, with the objective of predicting customer conversions. Readers unfamiliar with decision tree ensembles will learn to appreciate them for their versatility, competitive accuracy, ease of application, and computational efficiency and will gain a comprehensive understanding why decision tree ensembles contribute to every data scientist's methodological toolbox.

Details

The Machine Age of Customer Insight
Type: Book
ISBN: 978-1-83909-697-6

Keywords

Click here to view access options
Article
Publication date: 21 December 2021

Laouni Djafri

This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as…

Abstract

Purpose

This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, clouds computing or other technologies.

Design/methodology/approach

In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors’ proposed works mainly aim at solving the problem of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors are raising in this work is how the authors can make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by Map-Reduce algorithm which in turn depends on random sampling technique. So, the distributed architecture that the authors designed is specially directed to handle big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors have extracted the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning base for the first level (PLBL1) and the partial learning base for the second level (PLBL2). The experimental results show the efficiency of our solution that the authors provided without significant loss of the classification results. Thus, in practical terms, the system DDPML is generally dedicated to big data mining processing, and works effectively in distributed systems with a simple structure, such as client-server networks.

Findings

The authors got very satisfactory classification results.

Originality/value

DDPML system is specially designed to smoothly handle big data mining classification.

Details

Data Technologies and Applications, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2514-9288

Keywords

Click here to view access options
Article
Publication date: 19 August 2021

Hendrik Kohrs, Benjamin Rainer Auer and Frank Schuhmacher

In short-term forecasting of day-ahead electricity prices, incorporating intraday dependencies is vital for accurate predictions. However, it quickly leads to…

Abstract

Purpose

In short-term forecasting of day-ahead electricity prices, incorporating intraday dependencies is vital for accurate predictions. However, it quickly leads to dimensionality problems, i.e. ill-defined models with too many parameters, which require an adequate remedy. This study addresses this issue.

Design/methodology/approach

In an application for the German/Austrian market, this study derives variable importance scores from a random forest algorithm, feeds the identified variables into a support vector machine and compares the resulting forecasting technique to other approaches (such as dynamic factor models, penalized regressions or Bayesian shrinkage) that are commonly used to resolve dimensionality problems.

Findings

This study develops full importance profiles stating which hours of which past days have the highest predictive power for specific hours in the future. Using the profile information in the forecasting setup leads to very promising results compared to the alternatives. Furthermore, the importance profiles provide a possible explanation why some forecasting methods are more accurate for certain hours of the day than others. They also help to explain why simple forecast combination schemes tend to outperform the full battery of models considered in the comprehensive comparative study.

Originality/value

With the information contained in the variable importance scores and the results of the extensive model comparison, this study essentially provides guidelines for variable and model selection in future electricity market research.

Open Access
Article
Publication date: 28 July 2020

R. Shashikant and P. Chetankumar

Cardiac arrest is a severe heart anomaly that results in billions of annual casualties. Smoking is a specific hazard factor for cardiovascular pathology, including…

Abstract

Cardiac arrest is a severe heart anomaly that results in billions of annual casualties. Smoking is a specific hazard factor for cardiovascular pathology, including coronary heart disease, but data on smoking and heart death not earlier reviewed. The Heart Rate Variability (HRV) parameters used to predict cardiac arrest in smokers using machine learning technique in this paper. Machine learning is a method of computing experience based on automatic learning and enhances performances to increase prognosis. This study intends to compare the performance of logistical regression, decision tree, and random forest model to predict cardiac arrest in smokers. In this paper, a machine learning technique implemented on the dataset received from the data science research group MITU Skillogies Pune, India. To know the patient has a chance of cardiac arrest or not, developed three predictive models as 19 input feature of HRV indices and two output classes. These model evaluated based on their accuracy, precision, sensitivity, specificity, F1 score, and Area under the curve (AUC). The model of logistic regression has achieved an accuracy of 88.50%, precision of 83.11%, the sensitivity of 91.79%, the specificity of 86.03%, F1 score of 0.87, and AUC of 0.88. The decision tree model has arrived with an accuracy of 92.59%, precision of 97.29%, the sensitivity of 90.11%, the specificity of 97.38%, F1 score of 0.93, and AUC of 0.94. The model of the random forest has achieved an accuracy of 93.61%, precision of 94.59%, the sensitivity of 92.11%, the specificity of 95.03%, F1 score of 0.93 and AUC of 0.95. The random forest model achieved the best accuracy classification, followed by the decision tree, and logistic regression shows the lowest classification accuracy.

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2634-1964

Keywords

Click here to view access options
Article
Publication date: 23 March 2021

Mostafa El Habib Daho, Nesma Settouti, Mohammed El Amine Bechar, Amina Boublenza and Mohammed Amine Chikh

Ensemble methods have been widely used in the field of pattern recognition due to the difficulty of finding a single classifier that performs well on a wide variety of…

Abstract

Purpose

Ensemble methods have been widely used in the field of pattern recognition due to the difficulty of finding a single classifier that performs well on a wide variety of problems. Despite the effectiveness of these techniques, studies have shown that ensemble methods generate a large number of hypotheses and that contain redundant classifiers in most cases. Several works proposed in the state of the art attempt to reduce all hypotheses without affecting performance.

Design/methodology/approach

In this work, the authors are proposing a pruning method that takes into consideration the correlation between classifiers/classes and each classifier with the rest of the set. The authors have used the random forest algorithm as trees-based ensemble classifiers and the pruning was made by a technique inspired by the CFS (correlation feature selection) algorithm.

Findings

The proposed method CES (correlation-based Ensemble Selection) was evaluated on ten datasets from the UCI machine learning repository, and the performances were compared to six ensemble pruning techniques. The results showed that our proposed pruning method selects a small ensemble in a smaller amount of time while improving classification rates compared to the state-of-the-art methods.

Originality/value

CES is a new ordering-based method that uses the CFS algorithm. CES selects, in a short time, a small sub-ensemble that outperforms results obtained from the whole forest and the other state-of-the-art techniques used in this study.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 2
Type: Research Article
ISSN: 1756-378X

Keywords

Click here to view access options
Article
Publication date: 30 December 2020

Suraj Kulkarni, Suhas Suresh Ambekar and Manoj Hudnurkar

Increasing health-care costs are a major concern, especially in the USA. The purpose of this paper is to predict the hospital charges of a patient before being admitted…

Abstract

Purpose

Increasing health-care costs are a major concern, especially in the USA. The purpose of this paper is to predict the hospital charges of a patient before being admitted. This will help a patient who is getting admitted: “electively” can plan his/her finance. Also, this can be used as a tool by payers (insurance companies) to better forecast the amount that a patient might claim.

Design/methodology/approach

This research method involves secondary data collected from New York state’s patient discharges of 2017. A stratified sampling technique is used to sample the data from the population, feature engineering is done on categorical variables. Different regression techniques are being used to predict the target value “total charges.”

Findings

Total cost varies linearly with the length of stay. Among all the machine learning algorithms considered, namely, random forest, stochastic gradient descent (SGD) regressor, K nearest neighbors regressor, extreme gradient boosting regressor and gradient boosting regressor, random forest regressor had the best accuracy with R2 value 0.7753. “Age group” was the most important predictor among all the features.

Practical implications

This model can be helpful for patients who want to compare the cost at different hospitals and can plan their finances accordingly in case of “elective” admission. Insurance companies can predict how much a patient with a particular medical condition might claim by getting admitted to the hospital.

Originality/value

Health care can be a costly affair if not planned properly. This research gives patients and insurance companies a better prediction of the total cost that they might incur.

Details

International Journal of Innovation Science, vol. 13 no. 1
Type: Research Article
ISSN: 1757-2223

Keywords

Click here to view access options
Article
Publication date: 1 October 2018

Vinod Nistane and Suraj Harsha

In rotary machines, the bearing failure is one of the major causes of the breakdown of machinery. The bearing degradation monitoring is a great anxiety for the prevention…

Abstract

Purpose

In rotary machines, the bearing failure is one of the major causes of the breakdown of machinery. The bearing degradation monitoring is a great anxiety for the prevention of bearing failures. This paper aims to present a combination of the stationary wavelet decomposition and extra-trees regression (ETR) for the evaluation of bearing degradation.

Design/methodology/approach

The higher order cumulants features are extracted from the bearing vibration signals by using the stationary wavelet decomposition (stationary wavelet transform [SWT]). The extracted features are then subjected to the ETR for obtaining normal and failure state. A dominance level curve build using the dissimilarity data of test object and retained as health degradation indicator for the evaluation of bearing health.

Findings

Experiment conducts to verify and assess the effectiveness of ETR for the evaluation of performance of bearing degradation. To justify the preeminence of recommended approach, it is compared with the performance of random forest regression and multi-layer perceptron regression.

Originality/value

The experimental results indicated that the presently adopted method shows better performance for detecting the degradation more accurately at early stage. Furthermore, the diagnostics and prognostics have been getting much attention in the field of vibration, and it plays a significant role to avoid accidents.

1 – 10 of over 2000