Search results

1 – 10 of 800
Article
Publication date: 14 November 2016

Shrawan Kumar Trivedi and Shubhamoy Dey

Email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with…

Abstract

Purpose

Email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with high classification accuracy and good sensitivity towards false positives. In that context, this paper aims to present a combined classifier technique using a committee selection mechanism, in which a set of classifiers is identified and their individual decisions are combined through a committee selection procedure for accurate detection of spam.

Design/methodology/approach

For training and testing of the relevant machine learning classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. In the next step, the extracted features are taken through a dimensionality reduction method where non-informative features are removed. Subsequently, an informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results compared with those of other classifiers.

Findings

For building the proposed combined classifier, three different studies have been performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers: Bayesian and Naïve Bayes. In that study, AdaBoost was found to be the best algorithm for performance boosting. The second study examined the effect of different kernel functions on the support vector machine (SVM) classifier, where SVM with the normalized polynomial (NP) kernel was observed to be the best. The last study combined classifiers with committee selection, where the committee members were the best classifiers identified in the first study, i.e. Bayesian and Naïve Bayes with AdaBoost, and the committee president was selected from the second study, i.e. SVM with the NP kernel. Results show that combining the identified classifiers to form a committee machine gives excellent performance accuracy with a low false positive rate.

Research limitations/implications

This research is focused on the classification of spam emails written in the English language. Only the body (text) parts of the emails have been used. Image spam has not been included in this work. The work has been restricted to email messages only; other types of messages, such as short message service or multimedia messaging service, were not part of this study.

Practical implications

This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.

Originality/value

The proposed combined classifier is a novel classifier designed for accurate classification of email spam.

Details

VINE Journal of Information and Knowledge Management Systems, vol. 46 no. 4
Type: Research Article
ISSN: 2059-5891

Keywords

Article
Publication date: 29 October 2018

Shrawan Kumar Trivedi and Shubhamoy Dey

To be sustainable and competitive in the current business environment, it is useful to understand users’ sentiment towards products and services. This critical task can be…

Abstract

Purpose

To be sustainable and competitive in the current business environment, it is useful to understand users’ sentiment towards products and services. This critical task can be achieved via natural language processing and machine learning classifiers. This paper aims to propose a novel probabilistic committee selection classifier (PCC) to analyse and classify the sentiment polarities of movie reviews.

Design/methodology/approach

An Indian movie review corpus is assembled for this study. Another publicly available movie review polarity corpus is also involved with regard to validating the results. The greedy stepwise search method is used to extract the features/words of the reviews. The performance of the proposed classifier is measured using different metrics, such as F-measure, false positive rate, receiver operating characteristic (ROC) curve and training time. Further, the proposed classifier is compared with other popular machine-learning classifiers, such as Bayesian, Naïve Bayes, Decision Tree (J48), Support Vector Machine and Random Forest.

Findings

The results of this study show that the proposed classifier is good at predicting the positive or negative polarity of movie reviews. The performance accuracy and ROC curve value of the PCC are found to be the best of all the classifiers tested in this study. The classifier is also efficient at identifying positive sentiments in reviews, giving low false positive rates for both the Indian Movie Review and Review Polarity corpora used in this study. The training time of the proposed classifier is slightly higher than that of Bayesian, Naïve Bayes and J48.

Research limitations/implications

Only movie review sentiments written in English are considered. In addition, the proposed committee selection classifier is prepared only using the committee of probabilistic classifiers; however, other classifier committees can also be built, tested and compared with the present experiment scenario.

Practical implications

In this paper, a novel probabilistic approach is proposed and used for classifying movie reviews, and is found to be highly effective in comparison with other state-of-the-art classifiers. This classifier may be tested for different applications and may provide new insights for developers and researchers.

Social implications

The proposed PCC may be used to classify different product reviews, and hence may be beneficial to organizations to justify users’ reviews about specific products or services. By using authentic positive and negative sentiments of users, the credibility of the specific product, service or event may be enhanced. PCC may also be applied to other applications, such as spam detection, blog mining, news mining and various other data-mining applications.

Originality/value

The constructed PCC is novel and was tested on Indian movie review data.

Article
Publication date: 25 October 2018

Shrawan Kumar Trivedi, Shubhamoy Dey and Anil Kumar

Sentiment analysis and opinion mining are emerging areas of research for analyzing Web data and capturing users’ sentiments. This research aims to present sentiment analysis of an…

Abstract

Purpose

Sentiment analysis and opinion mining are emerging areas of research for analyzing Web data and capturing users’ sentiments. This research aims to present sentiment analysis of an Indian movie review corpus using natural language processing and various machine learning classifiers.

Design/methodology/approach

In this paper, a comparative study between three machine learning classifiers (Bayesian, naïve Bayesian and support vector machine [SVM]) was performed. All the classifiers were trained on the words/features of the corpus extracted, using five different feature selection algorithms (Chi-square, info-gain, gain ratio, one-R and relief-F [RF] attributes), and a comparative study was performed between them. The classifiers and feature selection approaches were evaluated using different metrics (F-value, false-positive [FP] rate and training time).
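The chi-square branch of the feature selection step can be sketched with scikit-learn; the toy corpus below is a stand-in for the movie review data, and `SelectKBest`/`chi2` are assumed tooling, not what the authors necessarily used.

```python
# Sketch of chi-square feature selection for text sentiment data
# (assumes scikit-learn; the four reviews are hypothetical examples).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

reviews = ["great acting and a great story", "boring plot, weak acting",
           "wonderful direction", "terrible pacing and a weak script"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(reviews)          # bag-of-words counts
selector = SelectKBest(chi2, k=5).fit(X, labels)      # keep 5 best features
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (4, 5): four reviews, five selected words
```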

Findings

The results of this study show that, for the maximum number of features, the RF feature selection approach was found to be the best, with better F-values, a low FP rate and less time needed to train the classifiers, whereas for the least number of features, one-R was better than RF. When the evaluation was performed for machine learning classifiers, SVM was found to be superior, although the Bayesian classifier was comparable with SVM.

Originality/value

This is novel research in which Indian review data were collected and a classification model for sentiment polarity (positive/negative) was constructed.

Details

The Electronic Library, vol. 36 no. 4
Type: Research Article
ISSN: 0264-0473

Keywords

Article
Publication date: 1 November 2019

Shrawan Kumar Trivedi and Shubhamoy Dey

Email is a rapid and inexpensive medium for sharing information, whereas unsolicited email (spam) is a constant problem in email communication. The rapid growth of spam creates…

Abstract

Purpose

Email is a rapid and inexpensive medium for sharing information, whereas unsolicited email (spam) is a constant problem in email communication. The rapid growth of spam creates a need to build a reliable and robust spam classifier. This paper aims to present a study of evolutionary classifiers (genetic algorithm [GA] and genetic programming [GP]) with and without the help of an ensemble of classifiers method. In this research, the classifier ensemble has been developed with the adaptive boosting technique.

Design/methodology/approach

Text mining methods are applied for classifying spam emails and legitimate emails. Two data sets (Enron and SpamAssassin) are used to test the classifiers. Initially, pre-processing is performed to extract the features/words from the email files. An informative feature subset is then selected using the greedy stepwise feature search method. Using these informative features, a comparative study is performed, first among the evolutionary classifiers and then against other popular machine learning classifiers (Bayesian, naive Bayes and support vector machine).

Findings

This study reveals that evolutionary algorithms are promising in classification and prediction applications, with genetic programming combined with adaptive boosting turning out to be not only an accurate but also a sensitive classifier. Results show that GA initially performs better than GP, but after an ensemble of classifiers is formed (a large number of iterations), GP overtakes GA with significantly higher accuracy. Amongst all classifiers, boosted GP is good not only in terms of classification accuracy but also in its low false positive (FP) rate, which is considered an important criterion in email spam classification. Greedy stepwise feature search is also found to be an effective feature selection method in this application domain.

Research limitations/implications

The implication of this research is a reduction in the cost incurred because of spam/unsolicited bulk email. Email is a fundamental necessity for sharing information among the units of an organization competing with business rivals, and it remains a continual challenge for internet service providers to offer the best email services to their customers. Although organizations and internet service providers continuously adopt novel spam filtering approaches to reduce the number of unwanted emails, the desired effect has not been significant because of installation costs, limited customizability and the threat of misclassifying important emails. This research addresses the issues and challenges faced by internet service providers and organizations.

Practical implications

In this research, the proposed models not only provide excellent accuracy and sensitivity with a low FP rate, together with customizable capability, but also help reduce the cost of spam. The same models may also be applied to other text mining tasks, such as sentiment analysis, blog mining and news mining.

Originality/value

A comparison between GP and GA has been shown, with and without an ensemble, in the spam classification application domain.

Open Access
Article
Publication date: 11 December 2020

Balamurugan Souprayen, Ayyasamy Ayyanar and Suresh Joseph K

The purpose of food traceability is to retain the quality of the raw material supply, diminish losses and reduce system complexity.


Abstract

Purpose

The purpose of food traceability is to retain the quality of the raw material supply, diminish losses and reduce system complexity.

Design/methodology/approach

A hybrid algorithm is proposed for food traceability to make accurate predictions over extended periods of data. The operation of the internet of things is used to track and trace food quality, checking the data acquired from manufacturers and consumers.

Findings

To cope with existing financial circumstances and the growth of the global food supply chain, the authors propose efficient food traceability techniques using the internet of things and obtain a solution for data prediction.

Originality/value

The operation of the internet of things is used to track and trace food quality, checking the data acquired from manufacturers and consumers. The experimental analysis shows that the proposed algorithm has a high accuracy rate, a short execution time and a low error rate.

Details

Modern Supply Chain Research and Applications, vol. 3 no. 1
Type: Research Article
ISSN: 2631-3871

Keywords

Article
Publication date: 7 June 2021

Carol K.H. Hon, Chenjunyan Sun, Bo Xia, Nerina L. Jimmieson, Kïrsten A. Way and Paul Pao-Yen Wu

Bayesian approaches have been widely applied in construction management (CM) research due to their capacity to deal with uncertain and complicated problems. However, to date…

Abstract

Purpose

Bayesian approaches have been widely applied in construction management (CM) research due to their capacity to deal with uncertain and complicated problems. However, to date, there has been no systematic review of applications of Bayesian approaches in existing CM studies. This paper systematically reviews applications of Bayesian approaches in CM research and provides insights into potential benefits of this technique for driving innovation and productivity in the construction industry.

Design/methodology/approach

A total of 148 articles were retrieved for systematic review through two literature selection rounds.

Findings

Bayesian approaches have been widely applied to safety management and risk management. The Bayesian network (BN) was the most frequently employed Bayesian method. Elicitation from expert knowledge and case studies were the primary methods for BN development and validation, respectively. Prediction was the most popular type of reasoning with BNs. Limitations of existing studies mainly relate to not fully realizing the potential of Bayesian approaches in CM functional areas, over-reliance on expert knowledge for BN model development and a lack of guidance on BN model validation; pertinent recommendations for future research are provided.
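The prediction and diagnostic reasoning that BNs automate can be illustrated with a minimal, hand-computed two-node example; the construction-delay scenario and probabilities below are hypothetical, not drawn from the reviewed studies.

```python
# Minimal Bayesian-network-style reasoning in plain Python (hypothetical
# CM example: expert-elicited probabilities linking bad weather to a
# schedule delay). Prediction marginalizes the parent out; diagnosis
# applies Bayes' rule - the kind of inference BN tools automate.
p_weather = 0.3                           # P(bad weather), expert-elicited
p_delay_given = {True: 0.8, False: 0.2}   # P(delay | bad weather?)

# Prediction: P(delay) = sum over parent states
p_delay = sum(p_delay_given[w] * (p_weather if w else 1 - p_weather)
              for w in (True, False))

# Diagnosis: P(bad weather | delay observed), by Bayes' rule
p_weather_given_delay = p_delay_given[True] * p_weather / p_delay

print(round(p_delay, 3), round(p_weather_given_delay, 3))  # 0.38 0.632
```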

Originality/value

This systematic review contributes to providing a comprehensive understanding of the application of Bayesian approaches in CM research and highlights implications for future research and practice.

Details

Engineering, Construction and Architectural Management, vol. 29 no. 5
Type: Research Article
ISSN: 0969-9988

Keywords

Article
Publication date: 9 March 2015

Ahmed Ahmim and Nacira Ghoualmi Zine

The purpose of this paper is to build a new hierarchical intrusion detection system (IDS) based on a binary tree of different types of classifiers. The proposed IDS model must…

Abstract

Purpose

The purpose of this paper is to build a new hierarchical intrusion detection system (IDS) based on a binary tree of different types of classifiers. The proposed IDS model must possess the following characteristics: combine a high detection rate with a low false alarm rate, and classify any connection into a specific network connection category.

Design/methodology/approach

To build the binary tree, the authors cluster the different categories of network connections hierarchically, based on the proportion of false positives and false negatives generated between each pair of categories. The resulting model is a multi-level binary tree. At the first level, the best classifier separates network connections into category A and group G2, which clusters the remaining categories. At the second level, the best classifier separates G2 connections into category B and group G3, which represents the categories clustered in G2 without category B. This process is repeated down to the last two categories of network connections. Note that one of these categories represents normal connections, and the rest represent the different types of abnormal connections.
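The level-by-level splitting described above can be sketched as a two-level tree of classifiers; scikit-learn decision trees and the integer category labels are hypothetical stand-ins for the authors' per-level best classifiers and connection categories.

```python
# Sketch of a two-level binary tree of classifiers (assumes scikit-learn;
# labels 0/1/2 stand in for connection categories A, B and the remainder).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Level 1: category 0 ("A") vs the grouped rest ("G2" = {1, 2})
top = DecisionTreeClassifier(random_state=0).fit(X, y == 0)
# Level 2: within G2, category 1 ("B") vs the remainder
mask = y != 0
bottom = DecisionTreeClassifier(random_state=0).fit(X[mask], y[mask] == 1)

def tree_predict(X):
    is_a = top.predict(X)      # level-1 decision
    is_b = bottom.predict(X)   # level-2 decision, only used when not "A"
    return np.where(is_a, 0, np.where(is_b, 1, 2))

print((tree_predict(X) == y).mean())  # training accuracy of the tree
```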

Findings

The experimentation on the labeled data set for flow-based intrusion detection, NSL-KDD and KDD’99 shows the high performance of the authors’ model compared with the results obtained by some well-known classifiers and recent IDS models. The experimental results show that the model gives a low false alarm rate and the highest detection rate. Moreover, the model is more accurate than well-known classifiers such as SVM, the C4.5 decision tree, the MLP neural network and naïve Bayes, with accuracy equal to 83.26 per cent on NSL-KDD and 99.92 per cent on the labeled data set for flow-based intrusion detection. It is also more accurate than the best of the related works and recent IDS models, with accuracy equal to 95.72 per cent on KDD’99.

Originality/value

This paper proposes a novel hierarchical IDS based on a binary tree of classifiers, where different types of classifiers are used to create a high-performance model. Therefore, it confirms the capacity of the hierarchical model to combine a high detection rate and a low false alarm rate.

Details

Information & Computer Security, vol. 23 no. 1
Type: Research Article
ISSN: 2056-4961

Keywords

Article
Publication date: 13 October 2020

Bijitaswa Chakraborty and Titas Bhattacharjee

The purpose of this paper is to give a comprehensive review and synthesis of automated textual analysis of corporate disclosure to show how the accuracy of disclosure tone has…


Abstract

Purpose

The purpose of this paper is to give a comprehensive review and synthesis of automated textual analysis of corporate disclosure, showing how the accuracy of disclosure tone has improved with the evolution of the automated methods used to calculate tone in prior studies.

Design/methodology/approach

This study conducted a survey on “automated textual analysis of corporate disclosure and its impact” by searching the Google Scholar and Scopus research databases for papers published after the year 2000. After classifying the prior literature into dictionary-based and machine learning-based approaches, the study sub-classified those papers along two further dimensions, namely, the information sources of disclosure and the impact of tone on the market.

Findings

This study found literature on how the value relevance of tone varies with the use of different automated methods and different information sources, as well as literature on the impact of such tone on the market. These findings contribute to investors’ decision-making and to researchers’ earnings and returns predictions. The literature survey shows that a research gap lies in developing methodologies that calculate tone more accurately. The study also discusses how different information sources and methodologies can change the disclosure tone for the same firm, which, in turn, may change market performance. A further research gap lies in finding the determinants of disclosure tone with large-scale data.

Originality/value

After reviewing papers based on automated textual analysis of corporate disclosure, this study shows how the accuracy of the results has improved with the evolution of automated methodology. Apart from the methodological research gaps, the study also identifies other research gaps related to determinants (corporate governance, firm-level and macroeconomic factors, etc.) and the transparency or credibility of disclosure, which could stimulate new research agendas in automated textual analysis of corporate disclosure.

Details

Journal of Financial Reporting and Accounting, vol. 18 no. 4
Type: Research Article
ISSN: 1985-2517

Keywords

Article
Publication date: 6 September 2021

Sivaraman Eswaran, Vakula Rani, Daniel D., Jayabrabu Ramakrishnan and Sadhana Selvakumar

In the recent era, banking infrastructure has built various remotely accessed platforms for users. However, the security risk to the banking sector has also risen, as is…

Abstract

Purpose

In the recent era, banking infrastructure has built various remotely accessed platforms for users. However, the security risk to the banking sector has also risen, as is visible from the growing number of reported attacks against these security systems. Intelligence shows that crawler-based cyberattacks are increasing. Malicious crawlers can crawl Web pages, crack passwords and harvest users’ private data. In addition, intrusion detection systems in dynamic environments generate many false positives. The purpose of this research paper is to propose an efficient methodology for detecting attacks while keeping false positives low.

Design/methodology/approach

In this research, the authors have developed an efficient approach for malicious crawler detection and correlated the security alerts. The behavioral features of the crawlers are examined for the recognition of malicious crawlers, and a novel methodology is proposed to improve bank user portal security. The authors have compared various machine learning strategies, including Bayesian network, support vector machine (SVM) and decision tree.

Findings

The proposed work spans several stages. Initially, outcomes are reported for a mixture of different kinds of log files. Then, distinct portions of the various log files are selected to construct acceptable data sets. Session identification, attribute extraction, session labeling and classification are performed. Moreover, the approach clusters meta-alerts into higher-level meta-alerts to fuse the multiple stages and varied types of attacks.

Originality/value

The methodology uses incremental clustering techniques and analyzes the probability of existing topologies in SVM classifiers for more deterministic classification. It also enhances the taxonomy for various domains.

Details

International Journal of Pervasive Computing and Communications, vol. 18 no. 1
Type: Research Article
ISSN: 1742-7371

Keywords

Article
Publication date: 12 June 2017

Ali Hasan Alsaffar

The purpose of this paper is to present an empirical study on the effect of two synthetic attributes to popular classification algorithms on data originating from student…

Abstract

Purpose

The purpose of this paper is to present an empirical study on the effect of two synthetic attributes to popular classification algorithms on data originating from student transcripts. The attributes represent past performance achievements in a course, which are defined as global performance (GP) and local performance (LP). GP of a course is an aggregated performance achieved by all students who have taken this course, and LP of a course is an aggregated performance achieved in the prerequisite courses by the student taking the course.
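The two synthetic attributes defined above can be sketched with pandas; the transcripts, grading scale and prerequisite map below are hypothetical. GP of a course is the mean grade over all students who took it, and LP is the given student's mean grade over that course's prerequisites.

```python
# Sketch of the GP and LP synthetic attributes (assumes pandas; course
# codes, grades and the prerequisite map are hypothetical examples).
import pandas as pd

transcripts = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s1"],
    "course":  ["CS101", "CS102", "CS101", "CS102", "CS201"],
    "grade":   [3.0, 3.5, 2.0, 2.5, 3.5],
})
prereqs = {"CS201": ["CS101", "CS102"]}  # hypothetical prerequisite map

# Global performance: aggregated grade per course over all students
gp = transcripts.groupby("course")["grade"].mean()

def local_performance(student, course):
    """Student's mean grade over the course's prerequisites."""
    rows = transcripts[(transcripts["student"] == student) &
                       (transcripts["course"].isin(prereqs.get(course, [])))]
    return rows["grade"].mean()

print(gp["CS201"], local_performance("s1", "CS201"))  # 3.5 3.25
```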

Design/methodology/approach

The paper uses Educational Data Mining techniques to predict student performance in courses, identifying the attributes that most strongly influence the final grade (performance) and reporting the effect of the two suggested attributes on the classification algorithms. As a research paradigm, the paper follows the Cross-Industry Standard Process for Data Mining, using the RapidMiner Studio software tool. Six classification algorithms are evaluated: C4.5 and CART decision trees, naive Bayes, k-nearest neighbors, rule-based induction and support vector machines.

Findings

The outcomes of the paper show that the synthetic attributes improve the performance of the classification algorithms, and that they rank highly according to their influence on the target variable.

Originality/value

This paper proposes two synthetic attributes that are integrated into a real data set. The key motivation is to improve the quality of the data and make the classification algorithms perform better. The paper also presents empirical results showing the effect of these attributes on the selected classification algorithms.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 10 no. 2
Type: Research Article
ISSN: 1756-378X

Keywords
