Search results

1 – 10 of over 3000
Article
Publication date: 29 July 2014

Chih-Fong Tsai and Chihli Hung

Credit scoring is important for financial institutions in order to accurately predict the likelihood of business failure. Related studies have shown that machine learning…

1135

Abstract

Purpose

Credit scoring is important for financial institutions in order to accurately predict the likelihood of business failure. Related studies have shown that machine learning techniques, such as neural networks, outperform many statistical approaches to solving this type of problem, and advanced machine learning techniques, such as classifier ensembles and hybrid classifiers, provide better prediction performance than single machine learning based classification techniques. However, it is not known which type of advanced classification technique performs better in terms of financial distress prediction. The paper aims to discuss these issues.

Design/methodology/approach

This paper compares neural network ensembles and hybrid neural networks over three benchmarking credit scoring related data sets, which are Australian, German, and Japanese data sets.

Findings

The experimental results show that hybrid neural networks and neural network ensembles outperform the single neural network. Although hybrid neural networks perform slightly better than neural network ensembles in terms of predication accuracy and errors with two of the data sets, there is no significant difference between the two types of prediction models.

Originality/value

The originality of this paper is in comparing two types of advanced classification techniques, i.e. hybrid and ensemble learning techniques, in terms of financial distress prediction.

Details

Kybernetes, vol. 43 no. 7
Type: Research Article
ISSN: 0368-492X

Keywords

Article
Publication date: 5 August 2014

Theodoros Anagnostopoulos and Christos Skourlas

The purpose of this paper is to understand the emotional state of a human being by capturing the speech utterances that are used during common conversation. Human beings except of…

Abstract

Purpose

The purpose of this paper is to understand the emotional state of a human being by capturing the speech utterances that are used during common conversation. Human beings except of thinking creatures are also sentimental and emotional organisms. There are six universal basic emotions plus a neutral emotion: happiness, surprise, fear, sadness, anger, disgust and neutral.

Design/methodology/approach

It is proved that, given enough acoustic evidence, the emotional state of a person can be classified by an ensemble majority voting classifier. The proposed ensemble classifier is constructed over three base classifiers: k nearest neighbors, C4.5 and support vector machine (SVM) polynomial kernel.

Findings

The proposed ensemble classifier achieves better performance than each base classifier. It is compared with two other ensemble classifiers: one-against-all (OAA) multiclass SVM with radial basis function kernels and OAA multiclass SVM with hybrid kernels. The proposed ensemble classifier achieves better performance than the other two ensemble classifiers.

Originality/value

The current paper performs emotion classification with an ensemble majority voting classifier that combines three certain types of base classifiers which are of low computational complexity. The base classifiers stem from different theoretical background to avoid bias and redundancy. It gives to the proposed ensemble classifier the ability to be generalized in the emotion domain space.

Details

Journal of Systems and Information Technology, vol. 16 no. 3
Type: Research Article
ISSN: 1328-7265

Keywords

Article
Publication date: 3 January 2023

Saleem Raja A., Sundaravadivazhagan Balasubaramanian, Pradeepa Ganesan, Justin Rajasekaran and Karthikeyan R.

The internet has completely merged into contemporary life. People are addicted to using internet services for everyday activities. Consequently, an abundance of information about…

Abstract

Purpose

The internet has completely merged into contemporary life. People are addicted to using internet services for everyday activities. Consequently, an abundance of information about people and organizations is available online, which encourages the proliferation of cybercrimes. Cybercriminals often use malicious links for large-scale cyberattacks, which are disseminated via email, SMS and social media. Recognizing malicious links online can be exceedingly challenging. The purpose of this paper is to present a strong security system that can detect malicious links in the cyberspace using natural language processing technique.

Design/methodology/approach

The researcher recommends a variety of approaches, including blacklisting and rules-based machine/deep learning, for automatically recognizing malicious links. But the approaches generally necessitate the generation of a set of features to generalize the detection process. Most of the features are generated by processing URLs and content of the web page, as well as some external features such as the ranking of the web page and domain name system information. This process of feature extraction and selection typically takes more time and demands a high level of expertise in the domain. Sometimes the generated features may not leverage the full potentials of the data set. In addition, the majority of the currently deployed systems make use of a single classifier for the classification of malicious links. However, prediction accuracy may vary widely depending on the data set and the classifier used.

Findings

To address the issue of generating feature sets, the proposed method uses natural language processing techniques (term frequency and inverse document frequency) that vectorize URLs. To build a robust system for the classification of malicious links, the proposed system implements weighted soft voting classifier, an ensemble classifier that combines predictions of base classifiers. The ability or skill of each classifier serves as the base for the weight that is assigned to it.

Originality/value

The proposed method performs better when the optimal weights are assigned. The performance of the proposed method was assessed by using two different data sets (D1 and D2) and compared performance against base machine learning classifiers and previous research results. The outcome accuracy shows that the proposed method is superior to the existing methods, offering 91.4% and 98.8% accuracy for data sets D1 and D2, respectively.

Details

International Journal of Pervasive Computing and Communications, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1742-7371

Keywords

Article
Publication date: 30 October 2018

Shrawan Kumar Trivedi and Prabin Kumar Panigrahi

Email spam classification is now becoming a challenging area in the domain of text classification. Precise and robust classifiers are not only judged by classification accuracy…

Abstract

Purpose

Email spam classification is now becoming a challenging area in the domain of text classification. Precise and robust classifiers are not only judged by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails) towards the accurate classification, captured by both false positive and false negative rates. This paper aims to present a comparative study between various decision tree classifiers (such as AD tree, decision stump and REP tree) with/without different boosting algorithms (bagging, boosting with re-sample and AdaBoost).

Design/methodology/approach

Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier in this study is tested on informative words/features selected from the two publically available data sets (SpamAssassin and LingSpam) using a greedy step-wise feature search method.

Findings

Outcomes of this study show that without boosting, the REP tree provides high performance accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If the metrics false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails.

Research limitations/implications

This research is focussed on the classification of those email spams that are written in the English language only. The proposed models work with content (words/features) of email data that is mostly found in the body of the mail. Image spam has not been included in this study. Other messages such as short message service or multi-media messaging service were not included in this study.

Practical implications

In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art modes used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers.

Originality/value

A comparison of decision tree classifiers with/without ensemble has been presented for spam classification.

Details

Journal of Systems and Information Technology, vol. 20 no. 3
Type: Research Article
ISSN: 1328-7265

Keywords

Article
Publication date: 3 May 2016

Mohammad Fathian, Yaser Hoseinpoor and Behrouz Minaei-Bidgoli

Churn management is a fundamental process in firms to keep their customers. Therefore, predicting the customer’s churn is essential to facilitate such processes. The literature…

1009

Abstract

Purpose

Churn management is a fundamental process in firms to keep their customers. Therefore, predicting the customer’s churn is essential to facilitate such processes. The literature has introduced data mining approaches for this purpose. On the other hand, results indicate that performance of classification models increases by combining two or more techniques. The purpose of this paper is to propose a combined model based on clustering and ensemble classifiers.

Design/methodology/approach

Based on churn data set in Cell2Cell, single baseline classifiers, ensemble classifiers are used for comparisons. Specifically, self-organizing map (SOM) clustering technique, and four other classifier techniques including decision tree, artificial neural networks, support vector machine, and K-nearest neighbors were used. Moreover, for reduced dimensions of the features, principal component analysis (PCA) method was employed.

Findings

As results 14 models are compared with each other regarding accuracy, sensitivity, specification, F-measure, and AUC. The results showed that combination of SOM, PCA, and heterogeneous boosting achieved the best performance comparing with other classification models.

Originality/value

This study examined the performance of classifier ensembles in predicting customers churn. In particular, heterogeneous classifier ensembles such as bagging and boosting are compared.

Details

Kybernetes, vol. 45 no. 5
Type: Research Article
ISSN: 0368-492X

Keywords

Article
Publication date: 10 May 2022

Arghya Ray, Pradip Kumar Bala, Nripendra P. Rana and Yogesh K. Dwivedi

The widespread acceptance of various social platforms has increased the number of users posting about various services based on their experiences about the services. Finding out…

Abstract

Purpose

The widespread acceptance of various social platforms has increased the number of users posting about various services based on their experiences about the services. Finding out the intended ratings of social media (SM) posts is important for both organizations and prospective users since these posts can help in capturing the user’s perspectives. However, unlike merchant websites, the SM posts related to the service-experience cannot be rated unless explicitly mentioned in the comments. Additionally, predicting ratings can also help to build a database using recent comments for testing recommender algorithms in various scenarios.

Design/methodology/approach

In this study, the authors have predicted the ratings of SM posts using linear (Naïve Bayes, max-entropy) and non-linear (k-nearest neighbor, k-NN) classifiers utilizing combinations of different features, sentiment scores and emotion scores.

Findings

Overall, the results of this study reveal that the non-linear classifier (k-NN classifier) performed better than the linear classifiers (Naïve Bayes, Max-entropy classifier). Results also show an improvement of performance where the classifier was combined with sentiment and emotion scores. Introduction of the feature “factors of importance” or “the latent factors” also show an improvement of the classifier performance.

Originality/value

This study provides a new avenue of predicting ratings of SM feeds by the use of machine learning algorithms along with a combination of different features like emotional aspects and latent factors.

Details

Aslib Journal of Information Management, vol. 74 no. 6
Type: Research Article
ISSN: 2050-3806

Keywords

Article
Publication date: 9 March 2015

Ahmed Ahmim and Nacira Ghoualmi Zine

The purpose of this paper is to build a new hierarchical intrusion detection system (IDS) based on a binary tree of different types of classifiers. The proposed IDS model must…

Abstract

Purpose

The purpose of this paper is to build a new hierarchical intrusion detection system (IDS) based on a binary tree of different types of classifiers. The proposed IDS model must possess the following characteristics: combine a high detection rate and a low false alarm rate, and classify any connection in a specific category of network connection.

Design/methodology/approach

To build the binary tree, the authors cluster the different categories of network connections hierarchically based on the proportion of false-positives and false-negatives generated between each of the two categories. The built model is a binary tree with multi-levels. At first, the authors use the best classifier in the classification of the network connections in category A and category G2 that clusters the rest of the categories. Then, in the second level, they use the best classifier in the classification of G2 network connections in category B and category G3 that represents the different categories clustered in G2 without category B. This process is repeated until the last two categories of network connections. Note that one of these categories represents the normal connection, and the rest represent the different types of abnormal connections.

Findings

The experimentation on the labeled data set for flow-based intrusion detection, NSL-KDD and KDD’99 shows the high performance of the authors' model compared to the results obtained by some well-known classifiers and recent IDS models. The experiments’ results show that the authors' model gives a low false alarm rate and the highest detection rate. Moreover, the model is more accurate than some well-known classifiers like SVM, C4.5 decision tree, MLP neural network and naïve Bayes with accuracy equal to 83.26 per cent on NSL-KDD and equal to 99.92 per cent on the labeled data set for flow-based intrusion detection. As well, it is more accurate than the best of related works and recent IDS models with accuracy equal to 95.72 per cent on KDD’99.

Originality/value

This paper proposes a novel hierarchical IDS based on a binary tree of classifiers, where different types of classifiers are used to create a high-performance model. Therefore, it confirms the capacity of the hierarchical model to combine a high detection rate and a low false alarm rate.

Details

Information & Computer Security, vol. 23 no. 1
Type: Research Article
ISSN: 2056-4961

Keywords

Article
Publication date: 14 November 2016

Shrawan Kumar Trivedi and Shubhamoy Dey

The email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with…

Abstract

Purpose

The email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with high classification accuracy and good sensitivity towards false positives. In that context, this paper aims to present a combined classifier technique using a committee selection mechanism where the main objective is to identify a set of classifiers so that their individual decisions can be combined by a committee selection procedure for accurate detection of spam.

Design/methodology/approach

For training and testing of the relevant machine learning classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. In the next step, the extracted features are taken through a dimensionality reduction method where non-informative features are removed. Subsequently, an informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results compared with those of other classifiers.

Findings

For building the proposed combined classifier, three different studies have been performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers: Bayesian and Naïve Bayes. In that study, AdaBoost has been found to be the best algorithm for performance boosting. The second study was on the effect of different Kernel functions on support vector machine (SVM) classifier, where SVM with normalized polynomial (NP) kernel was observed to be the best. The last study was on combining classifiers with committee selection where the committee members were the best classifiers identified by the first study i.e. Bayesian and Naïve bays with AdaBoost, and the committee president was selected from the second study i.e. SVM with NP kernel. Results show that combining of the identified classifiers to form a committee machine gives excellent performance accuracy with a low false positive rate.

Research limitations/implications

This research is focused on the classification of email spams written in English language. Only body (text) parts of the emails have been used. Image spam has not been included in this work. We have restricted our work to only emails messages. None of the other types of messages like short message service or multi-media messaging service were a part of this study.

Practical implications

This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.

Originality/value

The proposed combined classifier is a novel classifier designed for accurate classification of email spam.

Details

VINE Journal of Information and Knowledge Management Systems, vol. 46 no. 4
Type: Research Article
ISSN: 2059-5891

Keywords

Article
Publication date: 1 June 2005

Linh Tran Hoai and Stanislaw Osowski

This paper presents new approach to the integration of neural classifiers. Typically only the best trained network is chosen, while the rest is discarded. However, combining the…

Abstract

Purpose

This paper presents new approach to the integration of neural classifiers. Typically only the best trained network is chosen, while the rest is discarded. However, combining the trained networks helps to integrate the knowledge acquired by the component classifiers and in this way improves the accuracy of the final classification. The aim of the research is to develop and compare the methods of combining neural classifiers of the heart beat recognition.

Design/methodology/approach

Two methods of integration of the results of individual classifiers are proposed. One is based on the statistical reliability of post‐processing performance on the trained data and the second uses the least mean square method in adjusting the weights of the weighted voting integrating network.

Findings

The experimental results of the recognition of six types of arrhythmias and normal sinus rhythm have shown that the performance of individual classifiers could be improved significantly by the integration proposed in this paper.

Practical implications

The presented application should be regarded as the first step in the direction of automatic recognition of the heart rhythms on the basis of the registered ECG waveforms.

Originality/value

The results mean that instead of designing one high performance classifier one can build a number of classifiers, each of not superb performance. The appropriate combination of them may produce a performance of much higher quality.

Details

COMPEL - The international journal for computation and mathematics in electrical and electronic engineering, vol. 24 no. 2
Type: Research Article
ISSN: 0332-1649

Keywords

Article
Publication date: 4 April 2022

Shrawan Kumar Trivedi, Amrinder Singh and Somesh Kumar Malhotra

There is a need to predict whether the consumers liked the stay in the hotel rooms or not, and to remove the aspects the customers did not like. Many customers leave a review…

Abstract

Purpose

There is a need to predict whether the consumers liked the stay in the hotel rooms or not, and to remove the aspects the customers did not like. Many customers leave a review after staying in the hotel. These reviews are mostly given on the website used to book the hotel. These reviews can be considered as a valuable data, which can be analyzed to provide better services in the hotels. The purpose of this study is to use machine learning techniques for analyzing the given data to determine different sentiment polarities of the consumers.

Design/methodology/approach

Reviews given by hotel customers on the Tripadvisor website, which were made available publicly by Kaggle. Out of 10,000 reviews in the data, a sample of 3,000 negative polarity reviews (customers with bad experiences) in the hotel and 3,000 positive polarity reviews (customers with good experiences) in the hotel is taken to prepare data set. The two-stage feature selection was applied, which first involved greedy selection method and then wrapper method to generate 37 most relevant features. An improved stacked decision tree (ISD) classifier) is built, which is further compared with state-of-the-art machine learning algorithms. All the tests are done using R-Studio.

Findings

The results showed that the new model was satisfactory overall with 80.77% accuracy after doing in-depth study with 50–50 split, 80.74% accuracy for 66–34 split and 80.25% accuracy for 80–20 split, when predicting the nature of the customers’ experience in the hotel, i.e. whether they are positive or negative.

Research limitations/implications

The implication of this research is to provide a showcase of how we can predict the polarity of potentially popular reviews. This helps the authors’ perspective to help the hotel industries to take corrective measures for the betterment of business and to promote useful positive reviews. This study also has some limitations like only English reviews are considered. This study was restricted to the data from trip-adviser website; however, a new data may be generated to test the credibility of the model. Only aspect-based sentiment classification is considered in this study.

Originality/value

Stacking machine learning techniques have been proposed. At first, state-of-the-art classifiers are tested on the given data, and then, three best performing classifiers (decision tree C5.0, random forest and support vector machine) are taken to build stack and to create ISD classifier.

1 – 10 of over 3000