Search results

1 – 10 of over 2000
Article
Publication date: 14 November 2016

Shrawan Kumar Trivedi and Shubhamoy Dey

Abstract

Purpose

The email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with high classification accuracy and good sensitivity towards false positives. In that context, this paper aims to present a combined classifier technique using a committee selection mechanism where the main objective is to identify a set of classifiers so that their individual decisions can be combined by a committee selection procedure for accurate detection of spam.

Design/methodology/approach

For training and testing of the relevant machine learning classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. In the next step, the extracted features are taken through a dimensionality reduction method where non-informative features are removed. Subsequently, an informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results compared with those of other classifiers.

Findings

For building the proposed combined classifier, three different studies have been performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers: Bayesian and Naïve Bayes. In that study, AdaBoost was found to be the best algorithm for performance boosting. The second study examined the effect of different kernel functions on the support vector machine (SVM) classifier, where SVM with a normalized polynomial (NP) kernel was observed to be the best. The last study combined classifiers through committee selection, where the committee members were the best classifiers identified by the first study, i.e. Bayesian and Naïve Bayes with AdaBoost, and the committee president was selected from the second study, i.e. SVM with NP kernel. Results show that combining the identified classifiers to form a committee machine gives excellent accuracy with a low false positive rate.
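
The abstract does not state how the committee combines its decisions, so the following is only a minimal sketch under one assumed rule: the members vote, and the president's label is used whenever the members are not unanimous. The function name `committee_decide` and the labels "spam" and "ham" are hypothetical and not taken from the paper.

```python
from typing import Sequence


def committee_decide(member_votes: Sequence[str], president_vote: str) -> str:
    """Combine member votes and defer to the president on disagreement.

    Assumed rule (not from the paper): a unanimous committee decides,
    otherwise the president's label is returned.
    """
    if not member_votes:
        # No members available: only the president can decide.
        return president_vote
    first = member_votes[0]
    if all(vote == first for vote in member_votes):
        return first
    return president_vote


# Members stand in for the boosted Bayesian and Naive Bayes classifiers,
# the president for the SVM with NP kernel.
print(committee_decide(["spam", "spam"], "ham"))
print(committee_decide(["spam", "ham"], "ham"))
```

Other combination rules, such as weighting the president's vote more heavily, would fit the same interface.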

Research limitations/implications

This research is focused on the classification of email spam written in the English language. Only the body (text) parts of the emails have been used. Image spam has not been included in this work. We have restricted our work to email messages only. Other types of messages, such as short message service or multimedia messaging service, were not part of this study.

Practical implications

This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.

Originality/value

The proposed combined classifier is a novel classifier designed for accurate classification of email spam.

Details

VINE Journal of Information and Knowledge Management Systems, vol. 46 no. 4
Type: Research Article
ISSN: 2059-5891

Article
Publication date: 3 May 2016

Mohammad Fathian, Yaser Hoseinpoor and Behrouz Minaei-Bidgoli

Abstract

Purpose

Churn management is a fundamental process in firms to keep their customers. Therefore, predicting the customer’s churn is essential to facilitate such processes. The literature has introduced data mining approaches for this purpose. On the other hand, results indicate that performance of classification models increases by combining two or more techniques. The purpose of this paper is to propose a combined model based on clustering and ensemble classifiers.

Design/methodology/approach

Using the Cell2Cell churn data set, single baseline classifiers and ensemble classifiers are compared. Specifically, the self-organizing map (SOM) clustering technique and four classifier techniques (decision tree, artificial neural networks, support vector machine and K-nearest neighbors) were used. Moreover, principal component analysis (PCA) was employed to reduce the dimensionality of the features.

Findings

In the results, 14 models are compared with each other regarding accuracy, sensitivity, specificity, F-measure and AUC. The results showed that the combination of SOM, PCA and heterogeneous boosting achieved the best performance compared with the other classification models.
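
As a reminder of how the threshold-based measures named above relate to one another, here is a small sketch that derives accuracy, sensitivity, specificity and F-measure from a binary confusion matrix. The function name `binary_metrics` and the counts in the usage line are illustrative assumptions, not figures from the study. AUC is omitted because it needs ranked scores rather than counts.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common measures from confusion-matrix counts.

    Any ratio with an empty denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on churners
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-churners
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only.
print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```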

Originality/value

This study examined the performance of classifier ensembles in predicting customer churn. In particular, heterogeneous classifier ensembles such as bagging and boosting are compared.

Details

Kybernetes, vol. 45 no. 5
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 1 June 2005

Linh Tran Hoai and Stanislaw Osowski

Abstract

Purpose

This paper presents a new approach to the integration of neural classifiers. Typically only the best trained network is chosen, while the rest are discarded. However, combining the trained networks helps to integrate the knowledge acquired by the component classifiers and in this way improves the accuracy of the final classification. The aim of the research is to develop and compare methods of combining neural classifiers for heartbeat recognition.

Design/methodology/approach

Two methods of integration of the results of individual classifiers are proposed. One is based on the statistical reliability of post‐processing performance on the trained data and the second uses the least mean square method in adjusting the weights of the weighted voting integrating network.
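
The second method can be illustrated with a minimal sketch of weighted voting whose weights are fitted by a least mean square rule. This assumes the per-sample Widrow-Hoff update, two-class outputs coded as +1 and -1, and a hypothetical learning rate; the paper's actual network and training procedure are not given in the abstract. The names `lms_weights` and `weighted_vote` are illustrative.

```python
from typing import List, Sequence


def lms_weights(outputs: Sequence[Sequence[float]],
                targets: Sequence[float],
                lr: float = 0.05,
                epochs: int = 200) -> List[float]:
    """Fit one weight per classifier with the LMS (Widrow-Hoff) rule.

    outputs[n][i] is classifier i's output on sample n (+1 or -1 here),
    targets[n] is the desired combined output for sample n.
    """
    k = len(outputs[0])
    w = [1.0 / k] * k  # start from plain averaging
    for _ in range(epochs):
        for x, y in zip(outputs, targets):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = y - pred
            # Move each weight along the negative gradient of the squared error.
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w


def weighted_vote(w: Sequence[float], x: Sequence[float]) -> int:
    """Return +1 or -1 from the sign of the weighted sum of outputs."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Toy data: classifier 0 always agrees with the target, classifier 1 does not.
outs = [[1, -1], [-1, 1], [1, 1], [-1, -1]]
tgts = [1, -1, 1, -1]
weights = lms_weights(outs, tgts)
print(weights, weighted_vote(weights, [1, -1]))
```

On this toy data the reliable classifier ends up with nearly all of the weight, which is the behaviour a learned voting scheme is meant to produce.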

Findings

The experimental results of the recognition of six types of arrhythmias and normal sinus rhythm have shown that the performance of individual classifiers could be improved significantly by the integration proposed in this paper.

Practical implications

The presented application should be regarded as the first step in the direction of automatic recognition of the heart rhythms on the basis of the registered ECG waveforms.

Originality/value

The results mean that instead of designing one high performance classifier one can build a number of classifiers, each of not superb performance. The appropriate combination of them may produce a performance of much higher quality.

Details

COMPEL - The international journal for computation and mathematics in electrical and electronic engineering, vol. 24 no. 2
Type: Research Article
ISSN: 0332-1649

Article
Publication date: 29 July 2014

Chih-Fong Tsai and Chihli Hung

Abstract

Purpose

Credit scoring is important for financial institutions in order to accurately predict the likelihood of business failure. Related studies have shown that machine learning techniques, such as neural networks, outperform many statistical approaches to solving this type of problem, and advanced machine learning techniques, such as classifier ensembles and hybrid classifiers, provide better prediction performance than single machine learning based classification techniques. However, it is not known which type of advanced classification technique performs better in terms of financial distress prediction. The paper aims to discuss these issues.

Design/methodology/approach

This paper compares neural network ensembles and hybrid neural networks over three benchmarking credit scoring related data sets, which are Australian, German, and Japanese data sets.

Findings

The experimental results show that hybrid neural networks and neural network ensembles outperform the single neural network. Although hybrid neural networks perform slightly better than neural network ensembles in terms of prediction accuracy and errors with two of the data sets, there is no significant difference between the two types of prediction models.

Originality/value

The originality of this paper is in comparing two types of advanced classification techniques, i.e. hybrid and ensemble learning techniques, in terms of financial distress prediction.

Details

Kybernetes, vol. 43 no. 7
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 10 May 2022

Arghya Ray, Pradip Kumar Bala, Nripendra P. Rana and Yogesh K. Dwivedi

Abstract

Purpose

The widespread acceptance of various social platforms has increased the number of users posting about various services based on their experiences about the services. Finding out the intended ratings of social media (SM) posts is important for both organizations and prospective users since these posts can help in capturing the user’s perspectives. However, unlike merchant websites, the SM posts related to the service-experience cannot be rated unless explicitly mentioned in the comments. Additionally, predicting ratings can also help to build a database using recent comments for testing recommender algorithms in various scenarios.

Design/methodology/approach

In this study, the authors have predicted the ratings of SM posts using linear (Naïve Bayes, max-entropy) and non-linear (k-nearest neighbor, k-NN) classifiers utilizing combinations of different features, sentiment scores and emotion scores.

Findings

Overall, the results of this study reveal that the non-linear classifier (k-NN classifier) performed better than the linear classifiers (Naïve Bayes, max-entropy classifier). Results also show an improvement in performance where the classifier was combined with sentiment and emotion scores. Introducing the feature “factors of importance” or “the latent factors” also improves classifier performance.

Originality/value

This study provides a new avenue of predicting ratings of SM feeds by the use of machine learning algorithms along with a combination of different features like emotional aspects and latent factors.

Details

Aslib Journal of Information Management, vol. 74 no. 6
Type: Research Article
ISSN: 2050-3806

Article
Publication date: 22 November 2010

Yun‐Sheng Chung, D. Frank Hsu, Chun‐Yi Liu and Chun‐Yi Tang

Abstract

Purpose

Multiple classifier systems have been used widely in computing, communications, and informatics. Combining multiple classifier systems (MCS) has been shown to outperform a single classifier system. It has been demonstrated that improvement in ensemble performance depends on either the diversity among or the performance of individual systems. A variety of diversity measures and ensemble methods have been proposed and studied. However, it remains a challenging problem to estimate the ensemble performance in terms of the performance of and the diversity among individual systems. The purpose of this paper is to study the general problem of estimating ensemble performance for various combination methods using the concept of a performance distribution pattern (PDP).

Design/methodology/approach

In particular, the paper establishes upper and lower bounds for majority voting ensemble performance with disagreement diversity measure Dis, weighted majority voting performance in terms of weighted average performance and weighted disagreement diversity, and plurality voting ensemble performance with entropy diversity measure D.
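
To make the two quantities in these bounds concrete, the sketch below computes majority voting accuracy and an average pairwise disagreement from per-classifier correctness records. It assumes the common pairwise definition of disagreement (the fraction of samples on which exactly one of two classifiers is correct) and a two-class setting; whether this matches the paper's Dis exactly is an assumption. The function names are illustrative.

```python
from itertools import combinations
from typing import Sequence


def majority_vote_accuracy(correct: Sequence[Sequence[bool]]) -> float:
    """correct[i][n] is True if classifier i is right on sample n.

    In a two-class problem the majority vote is right exactly when
    more than half of the classifiers are right.
    """
    k = len(correct)
    n = len(correct[0])
    hits = sum(1 for s in range(n) if 2 * sum(c[s] for c in correct) > k)
    return hits / n


def pairwise_disagreement(correct: Sequence[Sequence[bool]]) -> float:
    """Average over classifier pairs of the share of samples on which
    exactly one of the two classifiers is correct."""
    n = len(correct[0])
    pairs = list(combinations(correct, 2))
    total = sum(
        sum(1 for a, b in zip(ci, cj) if a != b) / n for ci, cj in pairs
    )
    return total / len(pairs)


# Three classifiers, each 75% accurate, erring on different samples.
records = [
    [True, True, False, True],
    [True, False, True, True],
    [False, True, True, True],
]
print(majority_vote_accuracy(records), pairwise_disagreement(records))
```

In this toy case the individual errors never coincide, so the ensemble is correct on every sample even though no single classifier is.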

Findings

Bounds for these three cases are shown to be tight using the PDP for the input set.

Originality/value

As a consequence of the authors' previous results on diversity equivalence, the results of majority voting ensemble performance can be extended to several other diversity measures. Moreover, the paper shows, in the case of majority voting ensemble performance, that when the average individual system performance P is large enough, the ensemble performance Pm resulting from a maximum (information‐theoretic) entropy PDP is an increasing function of the disagreement diversity Dis. Eight experiments using data sets from various application domains are conducted to demonstrate the complexity, richness and diversity of the problem of estimating ensemble performance.

Details

International Journal of Pervasive Computing and Communications, vol. 6 no. 4
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 5 December 2017

Rabeb Faleh, Sami Gomri, Mehdi Othman, Khalifa Aguir and Abdennaceur Kachouri

Abstract

Purpose

This paper proposes a novel hybrid approach to the problem of cross-selectivity of gases in an electronic nose (E-nose), combining support vector machine (SVM) and k-nearest neighbors (KNN) classifiers.

Design/methodology/approach

First, an E-nose system with three WO3 sensors was used for data acquisition to detect three gases, namely, ozone, ethanol and acetone. Then, two transient parameters, the derivative and the integral, were extracted for each gas response. Next, principal component analysis (PCA) was applied to extract the most relevant sensor data and reduce dimensionality. The new coordinates calculated by PCA were used as inputs for classification by the SVM method. Finally, KNN classification was carried out on the support vectors (SVs) only, rather than on all the data.

Findings

The proposed fusion method achieved the highest classification rate (100 per cent), compared with 89, 75.2, 80 and 79.9 per cent for the individual KNN, SVM-linear, SVM-RBF and SVM-polynomial classifiers, respectively.

Originality/value

The authors propose a fusion classifier approach to improve the classification rate. In this method, the extracted features are projected into the PCA subspace to reduce the dimensionality. Then, the obtained principal components are passed to the SVM classifier, and the resulting SVs are used in the KNN method.

Details

Sensor Review, vol. 38 no. 1
Type: Research Article
ISSN: 0260-2288

Article
Publication date: 3 January 2023

Saleem Raja A., Sundaravadivazhagan Balasubaramanian, Pradeepa Ganesan, Justin Rajasekaran and Karthikeyan R.

Abstract

Purpose

The internet has completely merged into contemporary life. People are addicted to using internet services for everyday activities. Consequently, an abundance of information about people and organizations is available online, which encourages the proliferation of cybercrimes. Cybercriminals often use malicious links for large-scale cyberattacks, which are disseminated via email, SMS and social media. Recognizing malicious links online can be exceedingly challenging. The purpose of this paper is to present a strong security system that can detect malicious links in cyberspace using natural language processing techniques.

Design/methodology/approach

Researchers have recommended a variety of approaches, including blacklisting and rules-based machine/deep learning, for automatically recognizing malicious links. But these approaches generally necessitate the generation of a set of features to generalize the detection process. Most of the features are generated by processing URLs and the content of the web page, as well as some external features such as the ranking of the web page and domain name system information. This process of feature extraction and selection typically takes more time and demands a high level of expertise in the domain. Sometimes the generated features may not leverage the full potential of the data set. In addition, the majority of the currently deployed systems make use of a single classifier for the classification of malicious links. However, prediction accuracy may vary widely depending on the data set and the classifier used.

Findings

To address the issue of generating feature sets, the proposed method uses natural language processing techniques (term frequency and inverse document frequency) that vectorize URLs. To build a robust system for the classification of malicious links, the proposed system implements a weighted soft voting classifier, an ensemble classifier that combines the predictions of base classifiers. The ability or skill of each classifier serves as the basis for the weight that is assigned to it.
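
Weighted soft voting, as described above, can be sketched as a weighted average of the class probabilities returned by the base classifiers, followed by picking the class with the highest average. The function name `weighted_soft_vote`, the probability values and the weights below are illustrative assumptions and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def weighted_soft_vote(probas: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> Tuple[int, List[float]]:
    """Combine per-classifier class probabilities by weighted averaging.

    probas[i][c] is classifier i's probability for class c.
    Returns the winning class index and the averaged probabilities.
    """
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg


# Class 0 = benign, class 1 = malicious (hypothetical encoding).
probas = [[0.8, 0.2], [0.3, 0.7], [0.3, 0.7]]
print(weighted_soft_vote(probas, [1, 1, 1]))  # equal weights
print(weighted_soft_vote(probas, [3, 1, 1]))  # first classifier trusted more
```

With equal weights the two weaker opinions outvote the first classifier, while giving the first classifier a larger weight flips the decision, which is why the choice of weights matters.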

Originality/value

The proposed method performs better when the optimal weights are assigned. The performance of the proposed method was assessed using two different data sets (D1 and D2) and compared against base machine learning classifiers and previous research results. The resulting accuracy shows that the proposed method is superior to the existing methods, offering 91.4% and 98.8% accuracy for data sets D1 and D2, respectively.

Details

International Journal of Pervasive Computing and Communications, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 23 September 2020

Z.F. Zhang, Wei Liu, Egon Ostrosi, Yongjie Tian and Jianping Yi

Abstract

Purpose

During the production process of steel strip, some defects may appear on the surface, and traditional manual inspection cannot meet the requirements of low-cost and high-efficiency production. The purpose of this paper is to propose a method of feature selection based on filter methods combined with a hidden Bayesian classifier to improve the efficiency of defect recognition and reduce the complexity of calculation. The method can select the optimal hybrid model for realizing the accurate classification of steel strip surface defects.

Design/methodology/approach

A large image feature set was initially obtained based on the discrete wavelet transform feature extraction method. Three feature selection methods (including correlation-based feature selection, consistency subset evaluator [CSE] and information gain) were then used to optimize the feature space. Parameters for the feature selection methods were based on the classification accuracy results of hidden Naive Bayes (HNB) algorithm. The selected feature subset was then applied to the traditional NB classifier and leading extended NB classifiers.
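
Of the three filter methods listed, information gain is the simplest to show in a few lines. The sketch below scores a single discrete feature against class labels using entropy. It assumes the feature has already been discretized (wavelet features would need binning first), and the names `entropy` and `information_gain` and the toy data are illustrative only.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: one feature separates the classes, the other does not.
labels = ["defect", "defect", "ok", "ok"]
print(information_gain(["hi", "hi", "lo", "lo"], labels))
print(information_gain(["hi", "lo", "hi", "lo"], labels))
```

Features would then be ranked by this score and the top-ranked subset passed to the classifier.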

Findings

The experimental results demonstrated that the HNB model combined with feature selection approaches has better classification performance than other models of defect recognition. Among the models studied, the proposed hybrid model of CSE + HNB is the most robust and effective, and achieves the highest classification accuracy, in identifying the optimal subset of the surface defect database.

Originality/value

The main contribution of this paper is the development of a hybrid model combining feature selection and multi-class classification algorithms for steel strip surface inspection. The proposed hybrid model is primarily robust and effective for steel strip surface inspection.

Details

Engineering Computations, vol. 38 no. 4
Type: Research Article
ISSN: 0264-4401

Article
Publication date: 15 March 2013

Zied Kechaou, Ali Wali, Mohamed Ben Ammar, Hichem Karray and Adel M. Alimi

Abstract

Purpose

Despite the current prevalence of diverse types of multimedia information, research on video news is still at an early stage. Improving the accessibility of video news seems worth investigating. The purpose of this paper is therefore to present a new combination mode of video news text clustering and selection. This method is useful for sorting out and classifying various types of news videos and media texts based on sentiment analysis.

Design/methodology/approach

A novel system is proposed, whereby video news items are identified and categorized as good or bad via the authors' suggested Hidden Markov Model (HMM) and Support Vector Machine (SVM) hybrid learning method. An exploratory video news sentiment analysis case study, conducted on various news databases, has shown that the feature‐selection‐combining method, encompassing Information Gain (IG), Mutual Information (MI) and CHI‐statistic (CHI), gives the best classification, which highlights the value of the designed framework.

Findings

In fact, the system turns out to be applicable to several areas, especially video news, where annotation and personal perspectives affect the accuracy aspect.

Research limitations/implications

The present work shows the way for further research pertaining to the personal attitudes and the application of different linguistic techniques during the classification.

Originality/value

The results achieved are promising and encouraging, and they highlight the originality and efficiency of the authors' approach as an effective tool for providing easy access to video news and multi‐media texts.

Details

Journal of Systems and Information Technology, vol. 15 no. 1
Type: Research Article
ISSN: 1328-7265
