Search results
1 – 10 of over 1000
Sara Tavassoli and Hamidreza Koosha
Abstract
Purpose
Customer churn prediction is one of the most well-known approaches to manage and improve customer retention. Machine learning techniques, especially classification algorithms, are very popular tools to predict the churners. In this paper, three ensemble classifiers are proposed based on bagging and boosting for customer churn prediction.
Design/methodology/approach
In this paper, three ensemble classifiers are proposed based on bagging and boosting for customer churn prediction. The first classifier, called boosted bagging, uses boosting for each bagging sample. In this approach, before the bagging algorithm aggregates its final results, the authors try to improve the prediction by applying a boosting algorithm to each bootstrap sample. The second proposed ensemble classifier, called bagged bagging, combines bagging with itself. In other words, the authors apply bagging to each sample of the bagging algorithm. Finally, the third approach uses bagging of neural networks with learning based on a genetic algorithm.
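The nesting idea behind bagged bagging can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional threshold "stump" base learner, the toy data, the ensemble sizes and all function names are assumptions made for illustration. Boosted bagging would follow the same outer loop, with a boosting routine in place of the inner bagging call.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a 1-D threshold rule that predicts 1 when x > threshold."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in sample}):
        err = sum(int(x > t) != y for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x, t=best_t: int(x > t)


def bootstrap(sample, rng):
    """Draw len(sample) points from sample with replacement."""
    return [rng.choice(sample) for _ in sample]


def majority_vote(models):
    """Combine classifiers by returning the most common prediction."""
    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict


def bagging(sample, fit, n_models, rng):
    """Plain bagging: one model per bootstrap sample, combined by voting."""
    return majority_vote([fit(bootstrap(sample, rng)) for _ in range(n_models)])


def bagged_bagging(sample, n_outer, n_inner, rng):
    """Bagging whose base learner is itself a bagging ensemble."""
    def inner_fit(s):
        return bagging(s, fit_stump, n_inner, rng)
    return bagging(sample, inner_fit, n_outer, rng)


# Hypothetical toy data: label is 1 for x >= 10, else 0.
rng = random.Random(0)
data = [(x, int(x >= 10)) for x in range(20)]
model = bagged_bagging(data, n_outer=5, n_inner=5, rng=rng)
print(model(0), model(19))
```

Odd ensemble sizes are used so that the binary majority vote cannot tie.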
Findings
To examine the performance of all proposed ensemble classifiers, they are applied to two datasets. Numerical simulations illustrate that the proposed hybrid approaches outperform the simple bagging and boosting algorithms as well as base classifiers. In particular, bagged bagging provides high accuracy and precision.
Originality/value
In this paper, three novel ensemble classifiers are proposed based on bagging and boosting for customer churn prediction. The proposed approaches can be applied not only to customer churn prediction but also to any other binary classification problem.
Tae-Hwy Lee and Yang Yang
Abstract
Bagging (bootstrap aggregating) is a smoothing method to improve predictive ability under the presence of parameter estimation uncertainty and model uncertainty. In Lee and Yang (2006), we examined how (equal-weighted and BMA-weighted) bagging works for one-step-ahead binary prediction with an asymmetric cost function for time series, where we considered simple cases with particular choices of a linlin tick loss function and an algorithm to estimate a linear quantile regression model. In the present chapter, we examine how bagging predictors work with different aggregating (averaging) schemes, for multi-step forecast horizons, with a general class of tick loss functions, with different estimation algorithms, for nonlinear quantile regression models, and for different data frequencies. Bagging quantile predictors are constructed via (weighted) averaging over predictors trained on bootstrapped training samples, and bagging binary predictors are conducted via (majority) voting on predictors trained on the bootstrapped training samples. We find that median bagging and trimmed-mean bagging can alleviate the problem of extreme predictors from bootstrap samples and have better performance than equally weighted bagging predictors; that bagging works better at longer forecast horizons; that bagging works well with highly nonlinear quantile regression models (e.g., artificial neural network), and with general tick loss functions. We also find that the performance of bagging may be affected by using different quantile estimation algorithms (in small samples, even if the estimation is consistent) and by using different frequencies of time series data.
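The effect of the aggregation scheme on extreme bootstrap predictors can be seen in a small sketch. The prediction values and the 10% trimming fraction are invented for illustration, and the chapter's exact trimming rule is not given in the abstract, so `trimmed_mean` below is only one plausible definition.

```python
from statistics import mean, median


def trimmed_mean(values, trim_fraction=0.1):
    """Average after dropping the lowest and highest trim_fraction of values."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    return mean(ordered[k:len(ordered) - k])


# Hypothetical predictions from ten bootstrap-trained predictors;
# the last one is an extreme value from an unlucky bootstrap sample.
bootstrap_predictions = [1.9, 2.0, 2.1, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 50.0]

equal_weighted = mean(bootstrap_predictions)        # pulled up by the outlier
median_bagged = median(bootstrap_predictions)       # ignores the outlier
trimmed_bagged = trimmed_mean(bootstrap_predictions)  # drops one value per tail

print(equal_weighted, median_bagged, trimmed_bagged)
```

With these numbers the equal-weighted average lands far from the bulk of the predictors, while the median and trimmed mean stay close to 2.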
Christian Nnaemeka Egwim, Hafiz Alaka, Youlu Pan, Habeeb Balogun, Saheed Ajayi, Abdul Hye and Oluwapelumi Oluwaseun Egunjobi
Abstract
Purpose
The study aims to develop a multilayer high-effective ensemble of ensembles predictive model (stacking ensemble) using several hyperparameter optimized ensemble machine learning (ML) methods (bagging and boosting ensembles) trained with high-volume data points retrieved from Internet of Things (IoT) emission sensors, time-corresponding meteorology and traffic data.
Design/methodology/approach
First, the study tested the big data hypothesis by developing sample ensemble predictive models on different data sample sizes and comparing their results. Second, it developed a standalone model and several bagging and boosting ensemble models and compared their results. Finally, it used the best-performing bagging and boosting predictive models as input estimators to develop a novel multilayer high-effective stacking ensemble predictive model.
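The general stacking pattern (base models trained first, then a meta-learner fitted on their held-out predictions) can be sketched as follows. This is a deliberately simplified illustration and not the study's model: the two base learners, the even/odd data split, the grid-searched blending weight used as meta-learner, and the toy data are all assumptions.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Base model 1: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Base model 2: ordinary least-squares line through (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return lambda x: my + slope * (x - mx)


def fit_stack(xs, ys, fitters=(fit_mean, fit_line)):
    """Fit two base models on one half, then a blending weight on the other."""
    train_x, train_y = xs[0::2], ys[0::2]
    hold_x, hold_y = xs[1::2], ys[1::2]
    first, second = (fit(train_x, train_y) for fit in fitters)

    def holdout_error(w):
        return sum(
            (w * first(x) + (1 - w) * second(x) - y) ** 2
            for x, y in zip(hold_x, hold_y)
        )

    # Meta-learner: grid search over the weight given to the first model.
    w = min((i / 100 for i in range(101)), key=holdout_error)
    return (lambda x: w * first(x) + (1 - w) * second(x)), w


# Hypothetical noiseless data on the line y = 2x + 1.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
stacked, weight_on_mean = fit_stack(xs, ys)
print(stacked(10), weight_on_mean)
```

Fitting the meta-learner on data the base models did not see is the design choice that separates stacking from simply averaging the base models.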
Findings
Results showed data size to be one of the main determinants of ensemble ML predictive power. Second, they showed that, compared with using a single algorithm, the cumulative result from ensemble ML algorithms is usually better in terms of predictive accuracy. Finally, they showed the stacking ensemble to be a better model for predicting PM2.5 concentration levels than bagging and boosting ensemble models.
Research limitations/implications
A limitation of this study is the trade-off between the performance of this novel model and the computational time required to train it. Whether this gap can be closed remains an open research question that future research should address. Also, future studies can integrate this novel model into a personal air quality messaging system to inform the public of pollution levels and improve public access to air quality forecasts.
Practical implications
The outcome of this study will help the public to proactively identify highly polluted areas, thus potentially reducing pollution-associated or pollution-triggered COVID-19 (and other lung disease) deaths, complications and transmission by encouraging avoidance behavior and supporting informed lockdown decisions by government bodies when integrated into an air pollution monitoring system.
Originality/value
This study fills a gap in the literature by providing a justification for selecting appropriate ensemble ML algorithms for PM2.5 concentration level predictive modeling. Second, it contributes to the big data hypothesis, which suggests that data size is one of the most important factors of ML predictive capability. Third, it supports the premise that when using ensemble ML algorithms, the cumulative output is usually better in terms of predictive accuracy than using a single algorithm. Finally, it develops a novel multilayer high-performant hyperparameter-optimized ensemble-of-ensembles predictive model that can accurately predict PM2.5 concentration levels with improved model interpretability and enhanced generalizability, and it provides a novel databank of historic pollution data from IoT emission sensors that can be purchased for research, consultancy and policymaking.
Kalyan Nagaraj, Biplab Bhattacharjee, Amulyashree Sridhar and Sharvani GS
Abstract
Purpose
Phishing is one of the major threats affecting businesses worldwide in current times. Organizations and customers face the hazards arising out of phishing attacks because of anonymous access to vulnerable details. Such attacks often result in substantial financial losses. Thus, there is a need for effective intrusion detection techniques to identify and possibly nullify the effects of phishing. Classifying phishing and non-phishing web content is a critical task in information security protocols, and foolproof mechanisms have yet to be implemented in practice. The purpose of the current study is to present an ensemble machine learning model for classifying phishing websites.
Design/methodology/approach
A publicly available data set comprising 10,068 instances of phishing and legitimate websites was used to build the classifier model. Feature extraction was performed by deploying a group of methods, and relevant features extracted were used for building the model. A twofold ensemble learner was developed by integrating results from random forest (RF) classifier, fed into a feedforward neural network (NN). Performance of the ensemble classifier was validated using k-fold cross-validation. The twofold ensemble learner was implemented as a user-friendly, interactive decision support system for classifying websites as phishing or legitimate ones.
Findings
Experimental simulations were performed to assess and compare the performance of the ensemble classifiers. The statistical tests estimated that the RF_NN model gave superior performance, with an accuracy of 93.41 per cent and a minimal mean squared error of 0.000026.
Research limitations/implications
The research data set used in this study is publicly available and easy to analyze. Comparative analysis with other real-time data sets of recent origin must be performed to ensure generalization of the model against various security breaches. Different variants of phishing threats must be detected, rather than focusing only on phishing website detection.
Originality/value
To the authors' knowledge, the twofold ensemble model has not been applied to the classification of phishing websites in any previous study.
Abstract
Purpose
This paper aims to propose a new incremental and parallel training algorithm for proximal support vector machines (Inc-Par-PSVM), tailored to an edge device (i.e. the Jetson Nano), to handle the large-scale ImageNet challenge.
Design/methodology/approach
On the Jetson Nano, Inc-Par-PSVM trains, incrementally and in parallel, an ensemble of binary PSVM classifiers used in the One-Versus-All multiclass strategy. Each binary PSVM model is the average of bagged binary PSVM models, each built on an undersampled block of training data.
Findings
The empirical test results on the ImageNet data set show that the Inc-Par-PSVM algorithm with the Jetson Nano (Quad-core ARM A57 @ 1.43 GHz, 128-core NVIDIA Maxwell architecture-based graphics processing unit, 4 GB RAM) is faster and more accurate than the state-of-the-art linear SVM algorithm run on a PC [Intel(R) Core i7-4790 CPU, 3.6 GHz, 4 cores, 32 GB RAM].
Originality/value
The new incremental and parallel PSVM algorithm tailored to the Jetson Nano is able to efficiently handle the large-scale ImageNet challenge with 1.2 million images and 1,000 classes.
Shrawan Kumar Trivedi and Prabin Kumar Panigrahi
Abstract
Purpose
Email spam classification is now becoming a challenging area in the domain of text classification. Precise and robust classifiers are not only judged by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails) towards the accurate classification, captured by both false positive and false negative rates. This paper aims to present a comparative study between various decision tree classifiers (such as AD tree, decision stump and REP tree) with/without different boosting algorithms (bagging, boosting with re-sample and AdaBoost).
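The evaluation measures named above can be computed from predicted and true labels, as in the following sketch. It follows the abstract's convention that sensitivity refers to legitimate emails (so "ham" is treated as the positive class); the function name, label strings and example data are assumptions for illustration.

```python
def spam_filter_rates(y_true, y_pred, positive="ham"):
    """Return sensitivity, specificity and error rates for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # legitimate email kept
            else:
                fn += 1  # legitimate email flagged as spam
        else:
            if pred == positive:
                fp += 1  # spam let through as legitimate
            else:
                tn += 1  # spam caught
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "accuracy": (tp + tn) / len(y_true),
    }


# Hypothetical labels: five legitimate emails followed by five spam emails.
y_true = ["ham"] * 5 + ["spam"] * 5
y_pred = ["ham", "ham", "ham", "ham", "spam",
          "spam", "spam", "spam", "ham", "ham"]
rates = spam_filter_rates(y_true, y_pred)
print(rates)
```

Note that which error counts as a "false positive" depends on the choice of positive class; many spam-filtering papers treat spam as positive instead, which swaps the two rates.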
Design/methodology/approach
Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier in this study is tested on informative words/features selected from the two publicly available data sets (SpamAssassin and LingSpam) using a greedy step-wise feature search method.
Findings
Outcomes of this study show that without boosting, the REP tree provides high performance accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If the metrics false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails.
Research limitations/implications
This research is focussed on the classification of spam emails written in the English language only. The proposed models work with the content (words/features) of email data, which is mostly found in the body of the mail. Image spam has not been included in this study. Other messages such as short message service or multi-media messaging service were not included in this study.
Practical implications
In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art models used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers.
Originality/value
A comparison of decision tree classifiers with/without ensemble has been presented for spam classification.
Abstract
Purpose
The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification can be an essential task for many purposes in information retrieval, such as document organization, text filtering and sentiment analysis. Ensemble learning has been extensively studied to construct efficient text classification schemes with higher predictive performance and generalization ability. The purpose of this paper is to provide diversity among the classification algorithms of the ensemble, which is a key issue in ensemble design.
Design/methodology/approach
An ensemble scheme based on hybrid supervised clustering is presented for text classification. In the presented scheme, supervised hybrid clustering, which is based on cuckoo search algorithm and k-means, is introduced to partition the data samples of each class into clusters so that training subsets with higher diversities can be provided. Each classifier is trained on the diversified training subsets and the predictions of individual classifiers are combined by the majority voting rule. The predictive performance of the proposed classifier ensemble is compared to conventional classification algorithms (such as Naïve Bayes, logistic regression, support vector machines and C4.5 algorithm) and ensemble learning methods (such as AdaBoost, bagging and random subspace) using 11 text benchmarks.
Findings
The experimental results indicate that the presented classifier ensemble outperforms the conventional classification algorithms and ensemble learning methods for text classification.
Originality/value
The presented ensemble scheme is the first to use supervised clustering to obtain a diverse ensemble for text classification.
Shrawan Kumar Trivedi and Shubhamoy Dey
Abstract
Purpose
The email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with high classification accuracy and good sensitivity towards false positives. In that context, this paper aims to present a combined classifier technique using a committee selection mechanism where the main objective is to identify a set of classifiers so that their individual decisions can be combined by a committee selection procedure for accurate detection of spam.
Design/methodology/approach
For training and testing of the relevant machine learning classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. In the next step, the extracted features are taken through a dimensionality reduction method where non-informative features are removed. Subsequently, an informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results compared with those of other classifiers.
Findings
For building the proposed combined classifier, three different studies have been performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers: Bayesian and Naïve Bayes. In that study, AdaBoost was found to be the best algorithm for performance boosting. The second study examined the effect of different kernel functions on the support vector machine (SVM) classifier, where SVM with a normalized polynomial (NP) kernel was observed to be the best. The last study examined combining classifiers with committee selection, where the committee members were the best classifiers identified by the first study, i.e. Bayesian and Naïve Bayes with AdaBoost, and the committee president was selected from the second study, i.e. SVM with NP kernel. Results show that combining the identified classifiers to form a committee machine gives excellent performance accuracy with a low false positive rate.
Research limitations/implications
This research is focused on the classification of spam emails written in the English language. Only the body (text) parts of the emails have been used. Image spam has not been included in this work. We have restricted our work to email messages only. Other types of messages, such as short message service or multi-media messaging service, were not part of this study.
Practical implications
This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.
Originality/value
The proposed combined classifier is a novel classifier designed for accurate classification of email spam.
Teng Wang, Xiaofeng Hu and Yahui Zhang
Abstract
Purpose
Steam turbine final assembly is a dynamic process in which various interference events occur frequently. Currently, data transmission relies on oral communication, while scheduling depends on the manual experience of managers. This mode has low information transmission efficiency and makes it difficult to respond to emergencies in a timely manner. Besides, it is difficult to consider various factors when manually adjusting the plan, which reduces assembly efficiency. The purpose of this paper is to propose a knowledge-based real-time scheduling system in a cyber-physical system (CPS) environment that can improve the assembly efficiency of steam turbines.
Design/methodology/approach
First, an Internet of Things based CPS framework is proposed to achieve real-time monitoring of turbine assembly and improve the efficiency of information transmission. Second, a knowledge-based real-time scheduling system consisting of three modules is designed to replace manual experience for steam turbine assembly scheduling.
Findings
Experiments show that the scheduling results of the knowledge-based scheduling system outperform heuristic algorithms based on priority rules. Compared with manual scheduling, the delay time is reduced by 43.9%.
Originality/value
A knowledge-based real-time scheduling system in a CPS environment is proposed to improve the assembly efficiency of steam turbines. This paper provides a reference paradigm for the application of knowledge-based systems and CPS in the assembly control of labor-intensive engineering-to-order products.
Oscar F. Bustinza, Luis M. Molina Fernandez and Marlene Mendoza Macías
Abstract
Purpose
Machine learning (ML) analytical tools are increasingly being considered as an alternative quantitative methodology in management research. This paper proposes a new approach for uncovering the antecedents behind product and product–service innovation (PSI).
Design/methodology/approach
The ML approach is novel in the field of innovation antecedents at the country level. A sample of the Ecuadorian National Survey on Technology and Innovation, consisting of more than 6,000 firms, is used to rank the antecedents of innovation.
Findings
The analysis reveals that the antecedents of product and PSI are distinct, yet rooted in the principles of open innovation and competitive priorities.
Research limitations/implications
The analysis is based on a sample of Ecuadorian firms, with the objective of showing how ML techniques are suitable for testing the antecedents of innovation in any other context.
Originality/value
The novel ML approach, in contrast to traditional quantitative analysis of the topic, can consider the full set of antecedent interactions for each of the innovations analyzed.