Search results

1 – 10 of over 1000
Article
Publication date: 6 June 2008

Norbert Tóth and Béla Pataki

Abstract

Purpose

The purpose of this paper is to provide classification confidence value to every individual sample classified by decision trees and use this value to combine the classifiers.

Design/methodology/approach

The proposed system is first theoretically explained, and then the use and effectiveness of the proposed system is demonstrated on sample datasets.

Findings

In this paper, a novel method is proposed to combine decision tree classifiers using calculated classification confidence values. This confidence in the classification is based on distance calculation to the relevant decision boundary, (distance conditional) probability density estimation and (distance conditional) classification confidence estimation. It is shown that these values, provided by the individual classification trees, can be integrated to derive a consensus decision.
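As a rough illustration of how per-tree confidence values could be integrated into a consensus decision, the sketch below sums the confidences per class and picks the class with the largest total. The summation rule, the function name `consensus` and the example votes are assumptions made for illustration; the abstract does not specify the combination rule, and the confidence estimation itself (distance to the decision boundary, density estimation) is not reproduced here.

```python
from collections import defaultdict

def consensus(votes):
    """Combine (label, confidence) votes by summing confidence per label.

    `votes` holds one (label, confidence) pair per tree. How each tree
    derives its confidence is outside this sketch (assumed to be given).
    """
    totals = defaultdict(float)
    for label, confidence in votes:
        totals[label] += confidence
    # The consensus is the label with the largest accumulated confidence.
    return max(totals, key=totals.get)

# Hypothetical votes from three trees for a single sample.
votes = [("A", 0.55), ("B", 0.95), ("A", 0.30)]
winner = consensus(votes)
print(winner)
```

With these hypothetical votes, plain majority voting would choose "A" (two trees against one), whereas the confidence-weighted rule chooses "B" because the single confident tree outweighs the two uncertain ones.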

Research limitations/implications

The proposed method is not limited to axis-parallel trees; it is applicable to oblique trees and, more generally, to any kind of classifier system that uses hyperplanes to cluster the input space.

Originality/value

A novel method is presented to extend decision-tree-like classifiers with confidence calculation, and a voting system is proposed that uses this confidence information. The proposed system possesses several novelties (e.g. it not only gives class probabilities, but also classification confidences) and advantages over previous (traditional) approaches. The voting system does not require an auxiliary combiner or gating network, as in the mixture-of-experts structure, and the method is not limited to decision trees with axis-parallel splits; it is applicable to any kind of classifier that uses hyperplanes to cluster the input space.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 1 no. 2
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 30 October 2018

Shrawan Kumar Trivedi and Prabin Kumar Panigrahi

Abstract

Purpose

Email spam classification is now becoming a challenging area in the domain of text classification. Precise and robust classifiers are not only judged by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails) towards the accurate classification, captured by both false positive and false negative rates. This paper aims to present a comparative study between various decision tree classifiers (such as AD tree, decision stump and REP tree) with/without different ensemble algorithms (bagging, boosting with re-sample and AdaBoost).

Design/methodology/approach

Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier in this study is tested on informative words/features selected from the two publicly available data sets (SpamAssassin and LingSpam) using a greedy step-wise feature search method.

Findings

Outcomes of this study show that without boosting, the REP tree provides high performance accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If the metrics false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails.
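To make the idea of boosting a weak decision tree concrete, the following is a minimal sketch of AdaBoost over decision stumps on a toy one-dimensional data set. It is not the study's implementation: the data, the function names and the single numeric feature are hypothetical, and labels are encoded as +1/-1 purely for convenience.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: one threshold test on a single feature."""
    return polarity if x <= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train up to `rounds` stumps, reweighting misclassified samples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        if error >= 0.5:          # no better than chance: stop early
            break
        error = max(error, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
errors_single = sum(predict(single, x) != y for x, y in zip(xs, ys))
errors_boosted = sum(predict(boosted, x) != y for x, y in zip(xs, ys))
print(errors_single, errors_boosted)
```

On this toy data a single stump misclassifies three of the ten samples, while three boosted stumps classify all of them correctly, which mirrors in miniature why a weak learner such as a decision stump can benefit from boosting.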

Research limitations/implications

This research is focussed on the classification of those email spams that are written in the English language only. The proposed models work with content (words/features) of email data that is mostly found in the body of the mail. Image spam has not been included in this study. Other messages such as short message service or multi-media messaging service were not included in this study.

Practical implications

In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art models used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers.

Originality/value

A comparison of decision tree classifiers with/without ensemble has been presented for spam classification.

Details

Journal of Systems and Information Technology, vol. 20 no. 3
Type: Research Article
ISSN: 1328-7265

Book part
Publication date: 31 January 2015

Davy Janssens and Geert Wets

Abstract

Several activity-based transportation models are now becoming operational and are entering the stage of application for the modelling of travel demand. In our application, we will use decision rules to support the decision-making of the model instead of principles of utility maximization, which means our work can be interpreted as an application of the concept of bounded rationality in the transportation domain. In this chapter we explored a novel idea of combining decision trees and Bayesian networks to improve decision-making in order to maintain the potential advantages of both techniques. The results of this study suggest that integrated Bayesian networks and decision trees can be used for modelling the different choice facets of a travel demand model with better predictive power than CHAID decision trees. Another conclusion is that there are initial indications that the new way of integrating decision trees and Bayesian networks has produced a decision tree that is structurally more stable.

Details

Bounded Rational Choice Behaviour: Applications in Transport
Type: Book
ISBN: 978-1-78441-071-1

Article
Publication date: 31 July 2019

Zhe Zhang and Yue Dai

Abstract

Purpose

For classification problems of customer relationship management (CRM), the purpose of this paper is to propose a method with interpretability of the classification results that combines multiple decision trees based on a genetic algorithm.

Design/methodology/approach

In the proposed method, multiple decision trees are combined in parallel. Subsequently, a genetic algorithm is used to optimize the weight matrix in the combination algorithm.
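One way to picture a genetic algorithm tuning the combination weights of several trees is sketched below. It is a simplified illustration under stated assumptions, not the paper's algorithm: it uses one weight per tree rather than a weight matrix, binary 0/1 labels, precomputed tree predictions, and hypothetical names (`weighted_vote`, `fitness`, `evolve`) and data.

```python
import random

def weighted_vote(weights, tree_preds):
    """Predict 1 if the trees voting 1 hold more than half the total weight."""
    total = sum(weights)
    score = sum(w for w, p in zip(weights, tree_preds) if p == 1)
    return 1 if score > total / 2 else 0

def fitness(weights, all_preds, labels):
    """Fraction of samples the weighted vote classifies correctly."""
    correct = sum(
        weighted_vote(weights, preds) == y
        for preds, y in zip(all_preds, labels)
    )
    return correct / len(labels)

def evolve(all_preds, labels, n_trees, generations=30, pop_size=20, seed=0):
    """Evolve tree weights with elitism, one-point crossover and mutation."""
    rng = random.Random(seed)
    # Start from uniform weights plus random candidates.
    population = [[1.0] * n_trees]
    population += [
        [rng.random() for _ in range(n_trees)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(
            key=lambda w: fitness(w, all_preds, labels), reverse=True
        )
        parents = population[: pop_size // 2]  # elitism: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_trees)     # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_trees)          # mutate one weight
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.2)))
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: fitness(w, all_preds, labels))

# Hypothetical predictions of three trees on six samples (one row per sample).
# Tree 0 is always right; trees 1 and 2 are wrong together on two samples.
ALL_PREDS = [[1, 1, 1], [0, 1, 1], [1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 0]]
LABELS = [1, 0, 1, 0, 1, 0]

best = evolve(ALL_PREDS, LABELS, n_trees=3)
print(fitness([1.0, 1.0, 1.0], ALL_PREDS, LABELS), fitness(best, ALL_PREDS, LABELS))
```

Because the best half of each generation is carried over unchanged and the uniform weights are part of the initial population, the evolved weights can never score worse than unweighted majority voting on the data used for fitness. Interpretability is preserved in the sense that each tree remains readable and only its voting weight changes.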

Findings

The method is applied to customer credit rating assessment and customer response behavior pattern recognition. The results demonstrate that compared to a single decision tree, the proposed combination method improves the predictive accuracy and optimizes the classification rules, while maintaining interpretability of the classification results.

Originality/value

The findings of this study contribute to research methodologies in CRM. It specifically focuses on a new method with interpretability by combining multiple decision trees based on genetic algorithms for customer classification.

Details

Asia Pacific Journal of Marketing and Logistics, vol. 32 no. 5
Type: Research Article
ISSN: 1355-5855

Article
Publication date: 4 April 2022

Shrawan Kumar Trivedi, Amrinder Singh and Somesh Kumar Malhotra

Abstract

Purpose

There is a need to predict whether the consumers liked the stay in the hotel rooms or not, and to remove the aspects the customers did not like. Many customers leave a review after staying in the hotel. These reviews are mostly given on the website used to book the hotel. They can be considered valuable data, which can be analyzed to provide better services in the hotels. The purpose of this study is to use machine learning techniques for analyzing the given data to determine different sentiment polarities of the consumers.

Design/methodology/approach

The data consist of reviews given by hotel customers on the Tripadvisor website, which were made publicly available by Kaggle. Out of 10,000 reviews in the data, a sample of 3,000 negative polarity reviews (customers with bad experiences in the hotel) and 3,000 positive polarity reviews (customers with good experiences in the hotel) is taken to prepare the data set. A two-stage feature selection was applied, which first involved a greedy selection method and then a wrapper method, to generate the 37 most relevant features. An improved stacked decision tree (ISD) classifier is built, which is further compared with state-of-the-art machine learning algorithms. All the tests are done using R-Studio.

Findings

The results of the in-depth study showed that the new model performed satisfactorily overall, with 80.77% accuracy for a 50–50 split, 80.74% accuracy for a 66–34 split and 80.25% accuracy for an 80–20 split, when predicting whether the customers’ experience in the hotel was positive or negative.

Research limitations/implications

The implication of this research is to showcase how the polarity of potentially popular reviews can be predicted. This can help the hotel industry to take corrective measures for the betterment of business and to promote useful positive reviews. This study also has some limitations: only English reviews are considered, and the data are restricted to the Tripadvisor website; however, new data may be generated to test the credibility of the model. Only aspect-based sentiment classification is considered in this study.

Originality/value

Stacking of machine learning techniques is proposed. First, state-of-the-art classifiers are tested on the given data; then, the three best-performing classifiers (decision tree C5.0, random forest and support vector machine) are stacked to create the ISD classifier.

Book part
Publication date: 30 September 2020

Hera Khan, Ayush Srivastav and Amit Kumar Mishra

Abstract

This chapter provides a detailed description of the classification algorithms that have been widely used in medical science. It first lays the foundation with a comprehensive overview of the background and history of classification algorithms, follows this with an extensive discussion of classification techniques in machine learning (ML), and concludes with their applications to data analysis in medical science and health care. The opening part of the chapter covers the fundamentals required to understand classification techniques in ML, including the differences between unsupervised and supervised learning, the basic terminology of classification and its history. It then covers the types of classification algorithms, ranging from linear classifiers such as Logistic Regression and Naïve Bayes to Nearest Neighbour, Support Vector Machine, Tree-based Classifiers and Neural Networks, together with their respective mathematics. Ensemble algorithms such as Majority Voting, Boosting, Bagging and Stacking are also discussed at length, along with their applications. The chapter further explains the areas of application of such classification algorithms in biomedicine and health care and their contribution to decision-making systems and predictive analysis. In conclusion, the chapter aims to contribute to research and development by providing thorough insight into classification algorithms and their applications in the healthcare development sector.

Details

Big Data Analytics and Intelligence: A Perspective for Health Care
Type: Book
ISBN: 978-1-83909-099-8

Article
Publication date: 3 May 2016

Mohammad Fathian, Yaser Hoseinpoor and Behrouz Minaei-Bidgoli

Abstract

Purpose

Churn management is a fundamental process in firms to keep their customers. Therefore, predicting the customer’s churn is essential to facilitate such processes. The literature has introduced data mining approaches for this purpose. On the other hand, results indicate that performance of classification models increases by combining two or more techniques. The purpose of this paper is to propose a combined model based on clustering and ensemble classifiers.

Design/methodology/approach

Based on the Cell2Cell churn data set, single baseline classifiers and ensemble classifiers are used for comparison. Specifically, the self-organizing map (SOM) clustering technique and four classifier techniques, namely decision tree, artificial neural networks, support vector machine and K-nearest neighbors, were used. Moreover, principal component analysis (PCA) was employed to reduce the dimensionality of the features.

Findings

In total, 14 models are compared with each other in terms of accuracy, sensitivity, specificity, F-measure and AUC. The results showed that the combination of SOM, PCA and heterogeneous boosting achieved the best performance compared with the other classification models.

Originality/value

This study examined the performance of classifier ensembles in predicting customer churn. In particular, heterogeneous classifier ensembles such as bagging and boosting are compared.

Details

Kybernetes, vol. 45 no. 5
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 29 October 2018

Shrawan Kumar Trivedi and Shubhamoy Dey

Abstract

Purpose

To be sustainable and competitive in the current business environment, it is useful to understand users’ sentiment towards products and services. This critical task can be achieved via natural language processing and machine learning classifiers. This paper aims to propose a novel probabilistic committee selection classifier (PCC) to analyse and classify the sentiment polarities of movie reviews.

Design/methodology/approach

An Indian movie review corpus is assembled for this study. Another publicly available movie review polarity corpus is also involved with regard to validating the results. The greedy stepwise search method is used to extract the features/words of the reviews. The performance of the proposed classifier is measured using different metrics, such as F-measure, false positive rate, receiver operating characteristic (ROC) curve and training time. Further, the proposed classifier is compared with other popular machine-learning classifiers, such as Bayesian, Naïve Bayes, Decision Tree (J48), Support Vector Machine and Random Forest.

Findings

The results of this study show that the proposed classifier is good at predicting the positive or negative polarity of movie reviews. The performance accuracy and ROC curve value of the PCC are found to be the most favourable among all classifiers tested in this study. This classifier is also found to be efficient at identifying positive sentiments of reviews, where it gives low false positive rates for both the Indian Movie Review and Review Polarity corpora used in this study. The training time of the proposed classifier is found to be slightly higher than that of Bayesian, Naïve Bayes and J48.

Research limitations/implications

Only movie review sentiments written in English are considered. In addition, the proposed committee selection classifier is prepared only using the committee of probabilistic classifiers; however, other classifier committees can also be built, tested and compared with the present experiment scenario.

Practical implications

In this paper, a novel probabilistic approach is proposed and used for classifying movie reviews, and is found to be highly effective in comparison with other state-of-the-art classifiers. This classifier may be tested for different applications and may provide new insights for developers and researchers.

Social implications

The proposed PCC may be used to classify different product reviews, and hence may be beneficial to organizations to justify users’ reviews about specific products or services. By using authentic positive and negative sentiments of users, the credibility of the specific product, service or event may be enhanced. PCC may also be applied to other applications, such as spam detection, blog mining, news mining and various other data-mining applications.

Originality/value

The constructed PCC is novel and was tested on Indian movie review data.

Article
Publication date: 11 June 2018

Deepika Kishor Nagthane and Archana M. Rajurkar

Abstract

Purpose

One of the main reasons for the increase in the mortality rate among women is breast cancer. Accurate early detection of breast cancer seems to be the only solution for diagnosis. In the field of breast cancer research, many new computer-aided diagnosis systems have been developed to reduce the diagnostic test false positives because of the subtle appearance of breast cancer tissues. The purpose of this study is to develop a diagnosis technique for breast cancer using the LCFS algorithm and the TreeHiCARe classifier model.

Design/methodology/approach

The proposed diagnosis methodology begins with a pre-processing procedure. Feature extraction is then performed, in which the image features that preserve the characteristics of the breast tissues are extracted. Next, feature selection is performed by the proposed least-mean-square (LMS)-Cuckoo search feature selection (LCFS) algorithm: features are selected from the large set extracted from the images with the help of the optimal cut point provided by the LCFS algorithm. Then, the image transaction database table is developed using the keywords of the training images and the feature vectors. Each transaction resembles an itemset, and association rules with high conviction ratio and lift are generated from the transaction representation using the Apriori algorithm. After association rule generation, the proposed TreeHiCARe classifier model is applied. In the TreeHiCARe classifier, a new feature index is developed to select a central feature for the decision tree, around which the images are classified as normal or abnormal.

Findings

The performance of the proposed method is validated over existing works using accuracy, sensitivity and specificity measures. The experimentation of proposed method on Mammographic Image Analysis Society database resulted in classification of normal and abnormal cancerous mammogram images with an accuracy of 0.8289, sensitivity of 0.9333 and specificity of 0.7273.
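For readers unfamiliar with the three measures, the sketch below shows how accuracy, sensitivity and specificity follow from a confusion matrix. The counts are hypothetical and are not taken from the paper; they were chosen only so that the resulting ratios round to the figures quoted above. The function name is likewise an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion counts.

    tp/fn: abnormal cases detected / missed.
    tn/fp: normal cases correctly passed / wrongly flagged.
    """
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts (not the paper's data).
acc, sens, spec = classification_metrics(tp=70, fn=5, tn=56, fp=21)
print(round(acc, 4), round(sens, 4), round(spec, 4))
```

A high sensitivity paired with a lower specificity, as in the reported results, means few abnormal images are missed at the cost of more normal images being flagged.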

Originality/value

This paper proposes a new approach for the breast cancer diagnosis system by using mammogram images. The proposed method uses two new algorithms: LCFS and TreeHiCARe. LCFS is used to select optimal feature split points, and TreeHiCARe is the decision tree classifier model based on association rule agreements.

Details

Sensor Review, vol. 39 no. 1
Type: Research Article
ISSN: 0260-2288

Article
Publication date: 21 December 2023

Majid Rahi, Ali Ebrahimnejad and Homayun Motameni

Abstract

Purpose

Taking into consideration the current human need for agricultural produce such as rice that requires water for growth, the optimal consumption of this valuable liquid is important. Unfortunately, the traditional use of water by humans for agricultural purposes contradicts the concept of optimal consumption. Therefore, designing and implementing a mechanized irrigation system is of the highest importance. This system includes hardware equipment such as liquid altimeter sensors, valves and pumps which have a failure phenomenon as an integral part, causing faults in the system. Naturally, these faults occur at probable time intervals, and the probability function with exponential distribution is used to simulate this interval. Thus, before the implementation of such high-cost systems, its evaluation is essential during the design phase.
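The exponentially distributed intervals between faults mentioned above can be simulated along the following lines. This is a minimal sketch under stated assumptions: the failure rate, the time horizon, the time unit and the function name `simulate_fault_times` are hypothetical and do not come from the paper.

```python
import random

def simulate_fault_times(rate, horizon, seed=None):
    """Generate fault times up to `horizon` for one component.

    Gaps between consecutive faults are drawn from an exponential
    distribution with the given rate (mean gap = 1 / rate).
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

# Hypothetical component: on average one fault every 2 time units,
# observed over 100 time units.
faults = simulate_fault_times(rate=0.5, horizon=100.0, seed=42)
print(len(faults))
```

Running one such generator per sensor, valve or pump would yield a labelled fault stream of the kind a classifier could later be trained on.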

Design/methodology/approach

The proposed approach included two main steps: offline and online. The offline phase included the simulation of the studied system (i.e. the irrigation system of paddy fields) and the acquisition of a data set for training machine learning algorithms such as decision trees to detect, locate (classification) and evaluate faults. In the online phase, C5.0 decision trees trained in the offline phase were used on a stream of data generated by the system.

Findings

The proposed approach is a comprehensive online component-oriented method, which is a combination of supervised machine learning methods to investigate system faults. Each of these methods is considered a component determined by the dimensions and complexity of the case study (to discover, classify and evaluate fault tolerance). These components are placed together in the form of a process framework so that the appropriate method for each component is obtained based on comparison with other machine learning methods. As a result, depending on the conditions under study, the most efficient method is selected in the components. Before the system implementation phase, its reliability is checked by evaluating the predicted faults (in the system design phase). Therefore, this approach avoids the construction of a high-risk system. Compared to existing methods, the proposed approach is more comprehensive and has greater flexibility.

Research limitations/implications

By expanding the dimensions of the problem, the model verification space grows exponentially using automata.

Originality/value

Unlike the existing methods that only examine one or two aspects of fault analysis such as fault detection, classification and fault-tolerance evaluation, this paper proposes a comprehensive process-oriented approach that investigates all three aspects of fault analysis concurrently.

Details

International Journal of Intelligent Computing and Cybernetics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1756-378X
