Search results
1 – 10 of over 5000
Sara Tavassoli and Hamidreza Koosha
Abstract
Purpose
Customer churn prediction is one of the most well-known approaches to manage and improve customer retention. Machine learning techniques, especially classification algorithms, are very popular tools for predicting churners. In this paper, three ensemble classifiers are proposed based on bagging and boosting for customer churn prediction.
Design/methodology/approach
In this paper, three ensemble classifiers are proposed based on bagging and boosting for customer churn prediction. The first classifier, called boosted bagging, uses boosting for each bagging sample: before concluding the final results of the bagging algorithm, the authors try to improve the prediction by applying a boosting algorithm to each bootstrap sample. The second proposed ensemble classifier, called bagged bagging, combines bagging with itself; in other words, the authors apply bagging to each sample of the bagging algorithm. Finally, the third approach uses bagging of neural networks whose learning is based on a genetic algorithm.
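The first two hybrids can be sketched with scikit-learn's generic ensemble wrappers (a minimal illustration assuming scikit-learn, not the authors' exact implementation): boosting each bootstrap sample corresponds to bagging an AdaBoost classifier, and bagging each bootstrap sample corresponds to bagging a bagging classifier.

```python
# Sketch of "boosted bagging" and "bagged bagging" -- illustrative
# stand-ins built from scikit-learn wrappers, not the paper's code.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Boosted bagging: run AdaBoost on every bootstrap sample of the bagging loop.
boosted_bagging = BaggingClassifier(
    AdaBoostClassifier(n_estimators=10), n_estimators=10, random_state=0
)

# Bagged bagging: apply bagging again inside each bootstrap sample.
bagged_bagging = BaggingClassifier(
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0),
    n_estimators=10, random_state=0,
)

for name, model in [("boosted bagging", boosted_bagging),
                    ("bagged bagging", bagged_bagging)]:
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```

The third variant (bagged neural networks trained by a genetic algorithm) has no off-the-shelf equivalent and is omitted here.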
Findings
To examine the performance of all proposed ensemble classifiers, they are applied to two datasets. Numerical simulations illustrate that the proposed hybrid approaches outperform the simple bagging and boosting algorithms as well as the base classifiers. In particular, bagged bagging provides high accuracy and precision.
Originality/value
In this paper, three novel ensemble classifiers are proposed based on bagging and boosting for customer churn prediction. The proposed approaches can be applied not only to customer churn prediction but also to any other binary classification problem.
Thiago Turchetti Maia, Antônio Pádua Braga and André F. de Carvalho
Abstract
Purpose
To create new hybrid algorithms that combine boosting and support vector machines to outperform other known algorithms in selected contexts of binary classification problems.
Design/methodology/approach
Support vector machines (SVM) are known in the literature to be one of the most efficient learning models for tackling classification problems. Boosting algorithms rely on other classification algorithms to produce different weak hypotheses which are later combined into a single strong hypothesis. In this work the authors combine boosting with support vector machines, namely the AdaBoost.M1 and sequential minimal optimization (SMO) algorithms, to create new hybrid algorithms that outperform standard SVMs in selected contexts. This is achieved by integration with different degrees of coupling, where the four algorithms proposed range from simple black‐box integration to modifications and mergers between AdaBoost.M1 and SMO components.
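The loosest coupling level, black-box integration, can be approximated in scikit-learn by using an SVM as AdaBoost's base estimator (libsvm-based `SVC` uses an SMO-style solver internally). This is only a rough analogue under that assumption; the tighter merged AdaBoost.M1/SMO variants described in the paper have no off-the-shelf equivalent.

```python
# Black-box coupling of AdaBoost and an SVM base learner -- a rough
# analogue of the paper's loosest integration level.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# A deliberately weak SVM (small C) so that boosting has room to help.
weak_svm = SVC(kernel="rbf", C=0.1, probability=True, random_state=1)
boosted_svm = AdaBoostClassifier(weak_svm, n_estimators=5, random_state=1)
boosted_svm.fit(X_tr, y_tr)
print(round(boosted_svm.score(X_te, y_te), 3))
```

Weakening the base SVM is the usual prerequisite for boosting SVMs to help: a fully tuned SVM leaves AdaBoost little reweighting signal to exploit.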
Findings
The results show that the proposed algorithms exhibited better performance for most of the problems tested. It is possible to identify behavioral trends tied to specific properties of the problems solved, so one may apply the proposed algorithms in situations where they are known to succeed.
Research limitations/implications
New strategies for combining boosting and SVMs may be further developed using the principles introduced in this paper, possibly resulting in other algorithms with yet superior performance.
Practical implications
The hybrid algorithms proposed in this paper may be used in classification problems with properties that they are known to handle well, thus possibly offering better results than other known algorithms in the literature.
Originality/value
This paper introduces the concept of merging boosting and SVM training algorithms to obtain hybrid solutions with better performance than standard SVMs.
Abstract
Increasing availability of financial data has opened new opportunities for quantitative modeling. It has also exposed limitations of the existing frameworks, such as low accuracy of the simplified analytical models and insufficient interpretability and stability of the adaptive data-driven algorithms. I make the case that boosting (a novel ensemble learning technique) can serve as a simple and robust framework for combining the best features of the analytical and data-driven models. Boosting-based frameworks for typical financial and econometric applications are outlined. The implementation of a standard boosting procedure is illustrated in the context of the problem of symbolic volatility forecasting for IBM stock time series. It is shown that the boosted collection of the generalized autoregressive conditional heteroskedastic (GARCH)-type models is systematically more accurate than both the best single model in the collection and the widely used GARCH(1,1) model.
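The general framework of boosting a pool of analytical models can be illustrated with stagewise L2 boosting over a fixed collection of simple base learners. This is a toy sketch of the procedure only: simple one-feature linear fits stand in for the GARCH-type volatility models, whose estimation is out of scope here.

```python
# Generic L2 boosting over a fixed pool of simple base models -- a toy
# stand-in for a boosted collection of GARCH-type models.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                      # stand-in predictors
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

def ls_fit_predict(x, r):
    """Least-squares fit of residuals r on one feature plus an intercept."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, r, rcond=None)
    return A @ coef

pred = np.full_like(y, y.mean())                   # stage 0: constant model
nu = 0.5                                           # shrinkage factor
mse_path = [float(np.mean((y - pred) ** 2))]
for _ in range(50):
    r = y - pred
    # Greedily select the base model (here, one feature) that best fits r.
    fits = [ls_fit_predict(X[:, j], r) for j in range(X.shape[1])]
    best = min(fits, key=lambda f: np.mean((r - f) ** 2))
    pred = pred + nu * best
    mse_path.append(float(np.mean((y - pred) ** 2)))

print(round(mse_path[0], 3), "->", round(mse_path[-1], 3))
```

Each stage projects the current residuals onto the best-fitting base model, so the training error is non-increasing; the shrinkage factor `nu` controls how aggressively the ensemble commits to any single model.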
Wen Li, Wei Wang and Wenjun Huo
Abstract
Purpose
Inspired by the basic idea of gradient boosting, this study aims to design a novel multivariate regression ensemble algorithm RegBoost by using multivariate linear regression as a weak predictor.
Design/methodology/approach
To achieve nonlinearity after combining all linear regression predictors, the training data is divided into two branches according to the prediction results using the current weak predictor. The linear regression modeling is recursively executed in two branches. In the test phase, test data is distributed to a specific branch to continue with the next weak predictor. The final result is the sum of all weak predictors across the entire path.
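A simplified sketch of this idea follows: linear regression as the weak predictor, training data recursively split by the sign of the residual, and predictions summed along the path. How test points are routed between branches is a detail of the paper; routing by the branch of the nearest training sample is a simplifying assumption made here, not the paper's exact rule.

```python
# Simplified RegBoost-style sketch: recursive linear weak predictors with
# residual-sign branching. The 1-NN routing rule is an assumption.
import numpy as np

def fit_node(X, y, depth, max_depth=3, min_samples=20):
    """Fit a linear weak predictor, then recurse on residuals in two branches."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    node = {"coef": coef, "children": None}
    r = y - A @ coef
    mask = r >= 0                                   # branch by residual sign
    if depth < max_depth and min_samples <= mask.sum() <= len(y) - min_samples:
        node["X"], node["mask"] = X, mask           # kept for routing
        node["children"] = (
            fit_node(X[mask], r[mask], depth + 1, max_depth, min_samples),
            fit_node(X[~mask], r[~mask], depth + 1, max_depth, min_samples),
        )
    return node

def predict_one(node, x):
    """Sum weak predictions along the path; route via nearest training sample."""
    out = np.append(x, 1.0) @ node["coef"]
    if node["children"] is not None:
        nearest = np.argmin(np.linalg.norm(node["X"] - x, axis=1))
        out += predict_one(node["children"][0 if node["mask"][nearest] else 1], x)
    return out

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=400)  # nonlinear target

tree = fit_node(X, y, depth=0)
preds = np.array([predict_one(tree, x) for x in X])
A = np.column_stack([X, np.ones(len(X))])
lin_preds = A @ np.linalg.lstsq(A, y, rcond=None)[0]
mse_reg = float(np.mean((y - preds) ** 2))
mse_lin = float(np.mean((y - lin_preds) ** 2))
print(round(mse_reg, 3), "vs single linear", round(mse_lin, 3))
```

On this nonlinear target the recursive ensemble fits noticeably better than a single linear regression, which is the nonlinearity the branching is designed to recover.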
Findings
Through comparison experiments, it is found that RegBoost can achieve performance similar to that of the gradient boosted decision tree (GBDT). The algorithm is also markedly more effective than plain linear regression.
Originality/value
This paper designs a novel regression algorithm, RegBoost, with reference to GBDT. To the best of the authors' knowledge, RegBoost is the first algorithm to use linear regression as a weak predictor and combine it with gradient boosting to build an ensemble algorithm.
Yuchuan Wu, Shengfeng Qi, Feng Hu, Shuangbao Ma, Wen Mao and Wei Li
Abstract
Purpose
In human action recognition based on wearable sensors, most previous studies have focused on a single type of sensor and single classifier. This study aims to use a wearable sensor based on flexible sensors and a tri-axial accelerometer to collect action data of elderly people. It uses a statistical modeling approach based on the ensemble algorithm to classify actions and verify its validity.
Design/methodology/approach
Nine types of daily actions were collected by the wearable sensor device from a group of elderly volunteers, and the time-domain features of the action sequences were extracted. The dimensionality of the feature vectors was reduced by linear discriminant analysis. An ensemble learning method based on XGBoost was used to build a model of elderly action recognition. Its performance was compared with the action recognition rate of other algorithms based on the Boosting algorithm, and with the accuracy of single classifier models.
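The modeling flow (dimensionality reduction, then a boosted-tree classifier) can be sketched as a pipeline. This sketch assumes scikit-learn and simulated data; `GradientBoostingClassifier` stands in for XGBoost to avoid an extra dependency, and the feature matrix is a synthetic stand-in for the extracted time-domain sensor features.

```python
# Pipeline sketch of the flow: LDA feature reduction, then boosted trees.
# GradientBoostingClassifier stands in for XGBoost; the data is simulated.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in for time-domain features of nine daily action classes.
X, y = make_classification(n_samples=900, n_features=20, n_informative=12,
                           n_classes=9, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(
    LinearDiscriminantAnalysis(n_components=8),   # at most n_classes - 1
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))
```

Putting LDA inside the pipeline keeps the reduction fitted only on training folds, avoiding leakage when the model is cross-validated.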
Findings
The effectiveness of the method was validated by three experiments. The results show that XGBoost is able to classify nine daily actions of the elderly and achieve an average recognition rate of 94.8 per cent, which is superior to single classifiers and to other ensemble algorithms.
Practical implications
The research could have important implications for health care, including the treatment and rehabilitation of the elderly, and the prevention of falls.
Originality/value
Instead of using a single type of sensor, this research used a wearable sensor to obtain daily action data of the elderly. The results show that, by using the appropriate method, the device can obtain detailed data of joint action at a low cost. Comparing differences in performance, it was concluded that XGBoost is the most suitable algorithm for building a model of elderly action recognition. This method, together with a wearable sensor, can provide key data and accurate feedback information to monitor the elderly in their rehabilitation activities.
Shrawan Kumar Trivedi and Prabin Kumar Panigrahi
Abstract
Purpose
Email spam classification is now becoming a challenging area in the domain of text classification. Precise and robust classifiers are judged not only by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails), captured by the false positive and false negative rates. This paper aims to present a comparative study of various decision tree classifiers (AD tree, decision stump and REP tree) with and without different ensemble algorithms (bagging, boosting with re-sampling and AdaBoost).
Design/methodology/approach
Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier is tested on informative words/features selected from two publicly available data sets (SpamAssassin and LingSpam) using a greedy stepwise feature search method.
Findings
Outcomes of this study show that without boosting, the REP tree provides high performance accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If the metrics false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails.
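The plain-vs-boosted comparison can be sketched as follows, under the assumption that scikit-learn is used: AD tree and REP tree have no scikit-learn equivalents, so a decision stump and a depth-limited tree stand in for the weak and stronger tree learners, and a synthetic matrix stands in for the bag-of-words spam features.

```python
# Comparison sketch: a stump vs. a deeper tree, each with and without
# AdaBoost. Depth-limited trees stand in for AD tree / REP tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a spam/ham word-feature matrix.
X, y = make_classification(n_samples=800, n_features=30, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, base in [("stump", DecisionTreeClassifier(max_depth=1)),
                   ("tree", DecisionTreeClassifier(max_depth=5))]:
    plain = base.fit(X_tr, y_tr).score(X_te, y_te)
    boosted = AdaBoostClassifier(base, n_estimators=50,
                                 random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    results[name] = (round(plain, 3), round(boosted, 3))
print(results)  # (plain accuracy, boosted accuracy) per base learner
```

The typical pattern matches the study's finding: the stump is the weakest classifier on its own and gains the most from AdaBoost.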
Research limitations/implications
This research is focussed on the classification of those email spams that are written in the English language only. The proposed models work with content (words/features) of email data that is mostly found in the body of the mail. Image spam has not been included in this study. Other messages such as short message service or multi-media messaging service were not included in this study.
Practical implications
In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art models used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers.
Originality/value
A comparison of decision tree classifiers with/without ensemble has been presented for spam classification.
Shrawan Kumar Trivedi and Shubhamoy Dey
Abstract
Purpose
Email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with high classification accuracy and good sensitivity towards false positives. In that context, this paper aims to present a combined classifier technique using a committee selection mechanism, where the main objective is to identify a set of classifiers whose individual decisions can be combined by a committee selection procedure for accurate detection of spam.
Design/methodology/approach
For training and testing of the relevant machine learning classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. In the next step, the extracted features are taken through a dimensionality reduction method where non-informative features are removed. Subsequently, an informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results compared with those of other classifiers.
Findings
For building the proposed combined classifier, three different studies have been performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers: Bayesian and Naïve Bayes. In that study, AdaBoost was found to be the best algorithm for performance boosting. The second study was on the effect of different kernel functions on the support vector machine (SVM) classifier, where SVM with the normalized polynomial (NP) kernel was observed to be the best. The last study was on combining classifiers with committee selection, where the committee members were the best classifiers identified by the first study (Bayesian and Naïve Bayes with AdaBoost) and the committee president was selected from the second study (SVM with the NP kernel). Results show that combining the identified classifiers to form a committee machine gives excellent performance accuracy with a low false positive rate.
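The resulting committee can be approximated with soft voting over a Naïve Bayes member, a boosted Naïve Bayes member and an SVM "president". This is a hedged sketch: scikit-learn has no normalized-polynomial kernel, so a plain polynomial kernel stands in, and the paper's committee selection procedure itself is not reproduced.

```python
# Committee sketch: soft voting over Naive Bayes members and an SVM
# "president". A plain polynomial kernel stands in for the NP kernel.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=15, n_informative=8,
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

committee = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("ada_nb", AdaBoostClassifier(GaussianNB(), n_estimators=20,
                                      random_state=2)),
        ("svm_poly", SVC(kernel="poly", degree=2, probability=True,
                         random_state=2)),
    ],
    voting="soft",   # average predicted probabilities across members
)
committee.fit(X_tr, y_tr)
print(round(committee.score(X_te, y_te), 3))
```

Soft voting requires every member to expose predicted probabilities, which is why the SVC is built with `probability=True`.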
Research limitations/implications
This research is focused on the classification of email spam written in the English language. Only the body (text) parts of the emails have been used. Image spam has not been included in this work. The work is restricted to email messages only; other types of messages such as short message service or multi-media messaging service were not part of this study.
Practical implications
This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.
Originality/value
The proposed combined classifier is a novel classifier designed for accurate classification of email spam.
Shrawan Kumar Trivedi and Shubhamoy Dey
Abstract
Purpose
Email is a rapid and cheap medium for sharing information, whereas unsolicited email (spam) is a constant source of trouble in email communication. The rapid growth of spam creates a necessity to build a reliable and robust spam classifier. This paper aims to present a study of evolutionary classifiers (genetic algorithm [GA] and genetic programming [GP]) with and without the help of an ensemble-of-classifiers method. In this research, the classifier ensemble has been developed with the adaptive boosting technique.
Design/methodology/approach
Text mining methods are applied for classifying spam emails and legitimate emails. Two data sets (Enron and SpamAssassin) are taken to test the concerned classifiers. Initially, pre-processing is performed to extract the features/words from the email files. An informative feature subset is then selected using the greedy stepwise feature search method. With the help of these informative features, a comparative study is performed, first within the evolutionary classifiers and then against other popular machine learning classifiers (Bayesian, naive Bayes and support vector machine).
Findings
This study reveals that evolutionary algorithms are promising in classification and prediction applications, where genetic programming with adaptive boosting turns out to be not only an accurate classifier but also a sensitive one. Results show that initially GA performs better than GP, but after an ensemble of classifiers (a large number of iterations), GP overtakes GA with significantly higher accuracy. Amongst all classifiers, boosted GP proves good not only in classification accuracy but also in its low false positive (FP) rate, which is considered an important criterion in email spam classification. Also, greedy stepwise feature search is found to be an effective method for feature selection in this application domain.
Research limitations/implications
The research implication of this work is the reduction in cost incurred because of spam/unsolicited bulk email. Email is a fundamental necessity for sharing information among the units of an organization seeking to stay competitive with business rivals. It is also a continual hurdle for internet service providers to provide the best emailing services to their customers. Although organizations and internet service providers continuously adopt novel spam filtering approaches to reduce the number of unwanted emails, the desired effect has not been significant because of the cost of installation, limited customizability and the threat of misclassification of important emails. This research deals with these issues and challenges faced by internet service providers and organizations.
Practical implications
In this research, the proposed models not only provide excellent performance accuracy and sensitivity with a low FP rate and customizable capability, but also work towards reducing the cost of spam. The same models may be used in other text mining applications such as sentiment analysis, blog mining or news mining.
Originality/value
A comparison between GP and GAs has been shown with/without ensemble in spam classification application domain.
Christian Nnaemeka Egwim, Hafiz Alaka, Youlu Pan, Habeeb Balogun, Saheed Ajayi, Abdul Hye and Oluwapelumi Oluwaseun Egunjobi
Abstract
Purpose
The study aims to develop a multilayer high-effective ensemble of ensembles predictive model (stacking ensemble) using several hyperparameter optimized ensemble machine learning (ML) methods (bagging and boosting ensembles) trained with high-volume data points retrieved from Internet of Things (IoT) emission sensors, time-corresponding meteorology and traffic data.
Design/methodology/approach
For a start, the study tested the big data hypothesis by developing sample ensemble predictive models on different data sample sizes and comparing their results. Second, it developed a standalone model and several bagging and boosting ensemble models and compared their results. Finally, it used the best-performing bagging and boosting predictive models as input estimators to develop a novel multilayer high-effective stacking ensemble predictive model.
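The final step can be sketched as a stacking regressor whose input estimators are a bagging model and a boosting model. Everything here is illustrative under the assumption of scikit-learn: the data is a synthetic stand-in for the IoT sensor, meteorology and traffic features, and the hyperparameters are not the study's optimized ones.

```python
# Stacking sketch: best bagging and boosting regressors feed a meta-learner.
# Data and hyperparameters are illustrative, not the study's tuned models.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Stand-in for sensor + meteorology + traffic features predicting PM2.5.
X, y = make_regression(n_samples=800, n_features=12, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[("bagging", RandomForestRegressor(n_estimators=50,
                                                  random_state=0)),
                ("boosting", GradientBoostingRegressor(random_state=0))],
    final_estimator=RidgeCV(),   # meta-learner over base predictions
)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))  # R^2 on held-out data
```

`StackingRegressor` trains the base estimators on cross-validated folds and fits the meta-learner on their out-of-fold predictions, which limits the leakage that naive stacking would suffer.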
Findings
Results proved data size to be one of the main determinants of ensemble ML predictive power. Second, they proved that, compared with using a single algorithm, the cumulative result from ensemble ML algorithms is consistently better in terms of prediction accuracy. Finally, they proved the stacking ensemble to be a better model for predicting PM2.5 concentration levels than the bagging and boosting ensemble models.
Research limitations/implications
A limitation of this study is the trade-off between performance of this novel model and the computational time required to train it. Whether this gap can be closed remains an open research question. As a result, future research should attempt to close this gap. Also, future studies can integrate this novel model to a personal air quality messaging system to inform public of pollution levels and improve public access to air quality forecast.
Practical implications
The outcome of this study will aid the public in proactively identifying highly polluted areas, thus potentially reducing pollution-associated/triggered COVID-19 (and other lung disease) deaths, complications and transmission by encouraging avoidance behavior, and will support informed lockdown decisions by government bodies when integrated into an air pollution monitoring system.
Originality/value
This study fills a gap in the literature by providing a justification for selecting appropriate ensemble ML algorithms for PM2.5 concentration level predictive modeling. Second, it contributes to the big data hypothesis, which suggests that data size is one of the most important factors of ML predictive capability. Third, it supports the premise that, when using ensemble ML algorithms, the cumulative output is consistently better in terms of prediction accuracy than that of a single algorithm. Finally, it develops a novel multilayer high-performance hyperparameter-optimized ensemble-of-ensembles predictive model that can accurately predict PM2.5 concentration levels with improved model interpretability and enhanced generalizability, and provides a novel databank of historic pollution data from IoT emission sensors that can be purchased for research, consultancy and policymaking.