Search results

1 – 10 of 11
Article
Publication date: 30 October 2018

Shrawan Kumar Trivedi and Prabin Kumar Panigrahi

Abstract

Purpose

Email spam classification is now becoming a challenging area in the domain of text classification. Precise and robust classifiers are not only judged by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails) towards the accurate classification, captured by both false positive and false negative rates. This paper aims to present a comparative study between various decision tree classifiers (such as AD tree, decision stump and REP tree) with/without different boosting algorithms (bagging, boosting with re-sample and AdaBoost).
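The evaluation measures named above can be computed from a confusion matrix. The following is a minimal sketch, not the authors' code. It assumes legitimate email ("ham") is the positive class, which matches the parenthetical definitions of sensitivity and specificity in the abstract; many spam papers treat spam as the positive class instead, which swaps the two error rates. The function name and label strings are illustrative.

```python
def classification_rates(y_true, y_pred, positive="ham"):
    """Accuracy, sensitivity, specificity and error rates for a two-class task.

    Assumption: `positive` is the legitimate-email label, following the
    abstract's definition of sensitivity.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),          # legitimate emails kept
        "specificity": tn / (tn + fp),          # unsolicited emails caught
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # equals 1 - sensitivity
    }


# Toy labels, invented for illustration only.
truth = ["ham"] * 4 + ["spam"] * 4
guess = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham"]
rates = classification_rates(truth, guess)
```

Reporting both error rates alongside accuracy shows which kind of mistake a classifier makes, which accuracy alone hides.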

Design/methodology/approach

Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier in this study is tested on informative words/features selected from two publicly available data sets (SpamAssassin and LingSpam) using a greedy step-wise feature search method.
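A greedy step-wise search of this kind can be sketched as forward selection: start with no features and repeatedly add the single feature that most improves a subset score, stopping when nothing improves it. The abstract does not say which evaluator scored the candidate subsets, so the scoring rule, the toy emails and all names below are assumptions for illustration.

```python
def greedy_stepwise(features, score, min_gain=1e-9):
    """Forward selection: add the best-scoring feature until no gain remains."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        top_score, top_feature = max(
            ((score(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if top_score - best <= min_gain:
            break  # no candidate improves the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical toy corpus: (set of words in the email, is_spam).
EMAILS = [
    ({"free", "winner"}, True),
    ({"offer", "click"}, True),
    ({"winner", "click"}, True),
    ({"meeting", "report"}, False),
    ({"free", "report"}, False),
    ({"meeting", "agenda"}, False),
]


def rule_accuracy(words):
    """Assumed subset score: accuracy of 'spam if any selected word occurs'."""
    hits = 0
    for tokens, is_spam in EMAILS:
        predicted_spam = any(w in tokens for w in words)
        hits += predicted_spam == is_spam
    return hits / len(EMAILS)


vocabulary = sorted(set().union(*(tokens for tokens, _ in EMAILS)))
subset, accuracy = greedy_stepwise(vocabulary, rule_accuracy)
```

The search is greedy, so it never revisits an earlier choice; that keeps it cheap on large vocabularies at the cost of possibly missing a better subset.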

Findings

Outcomes of this study show that without boosting, the REP tree provides high performance accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If the metrics false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails.

Research limitations/implications

This research is focussed on the classification of spam emails written in the English language only. The proposed models work with the content (words/features) of email data, which is mostly found in the body of the mail. Image spam has not been included in this study. Other messages such as short message service or multi-media messaging service were not included in this study.

Practical implications

In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art models used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers.
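To illustrate what boosting a weak tree means, here is a minimal sketch of discrete AdaBoost over decision stumps (one-split trees). It is a generic textbook formulation in plain Python, not the authors' implementation; the function names and the one-dimensional toy data are assumptions, and labels are encoded as +1 and -1.

```python
import math


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign


def adaboost(X, y, rounds=3):
    """Train a weighted ensemble of stumps; returns [(alpha, stump), ...]."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = max(stump[0], 1e-10)  # guard against division by zero
        if err >= 0.5:
            break  # the weak learner is no better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    vote = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if vote >= 0 else -1


# Toy data that no single threshold can separate.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(X, y, rounds=3)
train_accuracy = sum(predict(model, r) == t for r, t in zip(X, y)) / len(y)
```

On this toy set a single stump misclassifies three of the ten points, while three boosted stumps fit the training data; training accuracy on ten points says nothing about performance on real email corpora.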

Originality/value

A comparison of decision tree classifiers with/without ensemble has been presented for spam classification.

Details

Journal of Systems and Information Technology, vol. 20 no. 3
Type: Research Article
ISSN: 1328-7265

Article
Publication date: 25 October 2018

Shrawan Kumar Trivedi, Shubhamoy Dey and Anil Kumar

Abstract

Purpose

Sentiment analysis and opinion mining are emerging areas of research for analyzing Web data and capturing users’ sentiments. This research aims to present sentiment analysis of an Indian movie review corpus using natural language processing and various machine learning classifiers.

Design/methodology/approach

In this paper, a comparative study between three machine learning classifiers (Bayesian, naïve Bayesian and support vector machine [SVM]) was performed. All the classifiers were trained on the words/features of the corpus extracted, using five different feature selection algorithms (Chi-square, info-gain, gain ratio, one-R and relief-F [RF] attributes), and a comparative study was performed between them. The classifiers and feature selection approaches were evaluated using different metrics (F-value, false-positive [FP] rate and training time).
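As an illustration of the first of the five feature selection measures, the chi-square score of a word against a class can be computed from a 2x2 table of document counts. This is the standard formula for a 2x2 contingency table; the function name and the counts are invented for illustration and are not taken from the paper's corpus.

```python
def chi_square(in_with, out_with, in_without, out_without):
    """Chi-square statistic for one word and one class.

    in_with:     documents in the class that contain the word
    out_with:    documents outside the class that contain the word
    in_without:  documents in the class that lack the word
    out_without: documents outside the class that lack the word
    """
    n = in_with + out_with + in_without + out_without
    numerator = n * (in_with * out_without - out_with * in_without) ** 2
    denominator = (
        (in_with + in_without)
        * (out_with + out_without)
        * (in_with + out_with)
        * (in_without + out_without)
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical counts over 20 reviews, 10 positive and 10 negative.
scores = {
    "superb": chi_square(10, 0, 0, 10),  # appears only in positive reviews
    "movie": chi_square(5, 5, 5, 5),     # spread evenly across both classes
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Words are then ranked by score and the top-ranked ones kept as features; a word spread evenly across both classes scores zero and carries no class information.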

Findings

The results of this study show that, for the maximum number of features, the RF feature selection approach was found to be the best, with better F-values, a low FP rate and less time needed to train the classifiers, whereas for the least number of features, one-R was better than RF. When the evaluation was performed for machine learning classifiers, SVM was found to be superior, although the Bayesian classifier was comparable with SVM.

Originality/value

This is novel research in which Indian review data were collected and a classification model for sentiment polarity (positive/negative) was then constructed.

Details

The Electronic Library, vol. 36 no. 4
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 November 2019

Shrawan Kumar Trivedi and Shubhamoy Dey

Abstract

Purpose

Email is a rapid and inexpensive medium for sharing information, whereas unsolicited email (spam) is a constant problem in email communication. The rapid growth of spam creates a need to build a reliable and robust spam classifier. This paper aims to present a study of evolutionary classifiers (genetic algorithm [GA] and genetic programming [GP]) with and without the help of an ensemble of classifiers method. In this research, the classifier ensemble has been developed with the adaptive boosting technique.

Design/methodology/approach

Text mining methods are applied for classifying spam emails and legitimate emails. Two data sets (Enron and SpamAssassin) are taken to test the concerned classifiers. Initially, pre-processing is performed to extract the features/words from email files. An informative feature subset is then selected using the greedy stepwise feature subset search method. With the help of the informative features, a comparative study is performed initially within the evolutionary classifiers and then with other popular machine learning classifiers (Bayesian, naive Bayes and support vector machine).

Findings

This study reveals that evolutionary algorithms are promising in classification and prediction applications, where genetic programming with adaptive boosting turns out to be not only an accurate classifier but also a sensitive one. Results show that initially GA performs better than GP but after an ensemble of classifiers (a large number of iterations), GP overtakes GA with significantly higher accuracy. Amongst all classifiers, boosted GP turns out to be good not only in classification accuracy but also in its low false positive (FP) rate, which is considered an important criterion in email spam classification. Also, greedy stepwise feature search is found to be an effective method for feature selection in this application domain.

Research limitations/implications

The implication of this research is a reduction in the cost incurred because of spam/unsolicited bulk email. Email is a fundamental necessity for sharing information among the units of an organization that wants to stay competitive with its business rivals. In addition, providing the best email services to customers is a continual hurdle for internet service providers. Although organizations and internet service providers are continuously adopting novel spam filtering approaches to reduce the number of unwanted emails, the desired effect has not been seen to a significant degree because of the cost of installation, customizability and the threat of misclassifying important emails. This research addresses these issues and challenges faced by internet service providers and organizations.

Practical implications

In this research, the proposed models have not only provided excellent performance accuracy and sensitivity with a low FP rate and customizable capability, but also helped to reduce the cost of spam. The same models may also be used for other text mining applications, such as sentiment analysis, blog mining and news mining.

Originality/value

A comparison between GP and GA, with and without an ensemble, has been presented for the spam classification application domain.

Article
Publication date: 29 October 2018

Shrawan Kumar Trivedi and Shubhamoy Dey

Abstract

Purpose

To be sustainable and competitive in the current business environment, it is useful to understand users’ sentiment towards products and services. This critical task can be achieved via natural language processing and machine learning classifiers. This paper aims to propose a novel probabilistic committee selection classifier (PCC) to analyse and classify the sentiment polarities of movie reviews.

Design/methodology/approach

An Indian movie review corpus is assembled for this study. Another publicly available movie review polarity corpus is also involved with regard to validating the results. The greedy stepwise search method is used to extract the features/words of the reviews. The performance of the proposed classifier is measured using different metrics, such as F-measure, false positive rate, receiver operating characteristic (ROC) curve and training time. Further, the proposed classifier is compared with other popular machine-learning classifiers, such as Bayesian, Naïve Bayes, Decision Tree (J48), Support Vector Machine and Random Forest.

Findings

The results of this study show that the proposed classifier is good at predicting the positive or negative polarity of movie reviews. The performance accuracy and the ROC curve value of the PCC are found to be the most suitable of all the classifiers tested in this study. This classifier is also found to be efficient at identifying positive sentiments of reviews, where it gives low false positive rates for both the Indian Movie Review and Review Polarity corpora used in this study. The training time of the proposed classifier is found to be slightly higher than that of Bayesian, Naïve Bayes and J48.

Research limitations/implications

Only movie review sentiments written in English are considered. In addition, the proposed committee selection classifier is prepared only using the committee of probabilistic classifiers; however, other classifier committees can also be built, tested and compared with the present experiment scenario.

Practical implications

In this paper, a novel probabilistic approach is proposed and used for classifying movie reviews, and is found to be highly effective in comparison with other state-of-the-art classifiers. This classifier may be tested for different applications and may provide new insights for developers and researchers.

Social implications

The proposed PCC may be used to classify different product reviews, and hence may be beneficial to organizations to justify users’ reviews about specific products or services. By using authentic positive and negative sentiments of users, the credibility of the specific product, service or event may be enhanced. PCC may also be applied to other applications, such as spam detection, blog mining, news mining and various other data-mining applications.

Originality/value

The constructed PCC is novel and was tested on Indian movie review data.

Article
Publication date: 4 April 2022

Shrawan Kumar Trivedi, Amrinder Singh and Somesh Kumar Malhotra

Abstract

Purpose

There is a need to predict whether consumers liked their stay in hotel rooms or not, and to remove the aspects the customers did not like. Many customers leave a review after staying in a hotel. These reviews are mostly given on the website used to book the hotel. They can be considered valuable data, which can be analyzed to provide better services in hotels. The purpose of this study is to use machine learning techniques to analyze the given data and determine the different sentiment polarities of the consumers.

Design/methodology/approach

The data consist of reviews given by hotel customers on the Tripadvisor website, which were made publicly available by Kaggle. Out of 10,000 reviews in the data, a sample of 3,000 negative polarity reviews (customers with bad experiences in the hotel) and 3,000 positive polarity reviews (customers with good experiences in the hotel) is taken to prepare the data set. A two-stage feature selection was applied, which involved first a greedy selection method and then a wrapper method, to generate the 37 most relevant features. An improved stacked decision tree (ISD) classifier is built, which is further compared with state-of-the-art machine learning algorithms. All the tests are done using R-Studio.

Findings

The results showed that the new model was satisfactory overall with 80.77% accuracy after doing in-depth study with 50–50 split, 80.74% accuracy for 66–34 split and 80.25% accuracy for 80–20 split, when predicting the nature of the customers’ experience in the hotel, i.e. whether they are positive or negative.
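The three evaluation splits can be reproduced with a simple shuffled holdout. The sketch below is in Python rather than the R-Studio environment the study used, and it assumes that "50–50", "66–34" and "80–20" are train/test percentages; the function name, the seed and the integer stand-ins for reviews are illustrative.

```python
import random


def holdout_split(rows, train_fraction, seed=0):
    """Shuffle a copy of `rows` and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 6,000 stand-in review ids (3,000 positive plus 3,000 negative in the study).
reviews = list(range(6000))
sizes = {
    fraction: tuple(len(part) for part in holdout_split(reviews, fraction))
    for fraction in (0.50, 0.66, 0.80)
}
```

Comparing accuracy across several split ratios, as the abstract does, gives a rough check that the result does not depend on one particular partition.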

Research limitations/implications

The implication of this research is to showcase how the polarity of potentially popular reviews can be predicted. From the authors’ perspective, this helps the hotel industry to take corrective measures for the betterment of business and to promote useful positive reviews. This study also has some limitations: only English reviews are considered, and the study was restricted to data from the Tripadvisor website; however, new data may be generated to test the credibility of the model. Only aspect-based sentiment classification is considered in this study.

Originality/value

Stacking machine learning techniques have been proposed. At first, state-of-the-art classifiers are tested on the given data, and then the three best-performing classifiers (decision tree C5.0, random forest and support vector machine) are taken to build the stack and create the ISD classifier.

Article
Publication date: 14 November 2016

Shrawan Kumar Trivedi and Shubhamoy Dey

Abstract

Purpose

The email is an important medium for sharing information rapidly. However, spam, being a nuisance in such communication, motivates the building of a robust filtering system with high classification accuracy and good sensitivity towards false positives. In that context, this paper aims to present a combined classifier technique using a committee selection mechanism where the main objective is to identify a set of classifiers so that their individual decisions can be combined by a committee selection procedure for accurate detection of spam.

Design/methodology/approach

For training and testing of the relevant machine learning classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. In the next step, the extracted features are taken through a dimensionality reduction method where non-informative features are removed. Subsequently, an informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results compared with those of other classifiers.

Findings

For building the proposed combined classifier, three different studies have been performed. The first study identifies the effect of boosting algorithms on two probabilistic classifiers: Bayesian and Naïve Bayes. In that study, AdaBoost has been found to be the best algorithm for performance boosting. The second study was on the effect of different kernel functions on the support vector machine (SVM) classifier, where SVM with the normalized polynomial (NP) kernel was observed to be the best. The last study was on combining classifiers with committee selection, where the committee members were the best classifiers identified by the first study, i.e. Bayesian and Naïve Bayes with AdaBoost, and the committee president was selected from the second study, i.e. SVM with NP kernel. Results show that combining the identified classifiers to form a committee machine gives excellent performance accuracy with a low false positive rate.
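The abstract does not state how the members' and the president's decisions are combined. The sketch below shows one plausible rule, purely as an assumption: a unanimous verdict from the members stands, and the president decides when they disagree. With two members and one president this is equivalent to a majority vote among the three. The function name and labels are hypothetical.

```python
def committee_decision(member_votes, president_vote):
    """Assumed rule: unanimous members win, otherwise the president decides."""
    if len(set(member_votes)) == 1:
        return member_votes[0]
    return president_vote


# Hypothetical per-email outputs of the two boosted members and the president.
agreed = committee_decision(["spam", "spam"], president_vote="ham")
disputed = committee_decision(["spam", "ham"], president_vote="ham")
```

Under this assumed rule the president's opinion only matters for emails on which the members split, so its influence is concentrated on the hard cases.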

Research limitations/implications

This research is focused on the classification of spam emails written in the English language. Only the body (text) parts of the emails have been used. Image spam has not been included in this work. We have restricted our work to email messages only. None of the other types of messages, like short message service or multi-media messaging service, were a part of this study.

Practical implications

This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.

Originality/value

The proposed combined classifier is a novel classifier designed for accurate classification of email spam.

Details

VINE Journal of Information and Knowledge Management Systems, vol. 46 no. 4
Type: Research Article
ISSN: 2059-5891

Article
Publication date: 18 May 2020

Shrawan Kumar Trivedi and Mohit Yadav

Abstract

Purpose

Research on online businesses has focused on the adoption of e-commerce and initial purchase behavior; repurchase intention and its antecedents remain underresearched. The present study develops an empirical model to explore the extent to which trust and e-satisfaction mediate the effect of vendor-specific attributes and customer intention to repurchase from the same online platform.

Design/methodology/approach

The proposed model is tested and validated in the context of Generation Y in India. A self-administered online survey was employed, and students aged between 20 and 35 at universities in Northern India are selected as subjects. The data is analyzed using SPSS 20.0 and AMOS 20.0, where structural equation modeling is used to examine the model and test the hypothesis.

Findings

The results of this study suggest that trust mediates fully between security concerns, privacy concerns, and repurchase intention. E-satisfaction mediates between security and ease of use (EOU).

Practical implications

This study reveals that security, EOU and privacy concerns are the critical determinants that have the most impact on consumers’ purchasing behavior. Gen Y consumers of India need strong security features, an easy-to-use interface and a trusted privacy policy. Furthermore, it may be beneficial to observe e-satisfaction and trust as mediators when identifying potential problems; online satisfaction is essential for the group in this study, and the results show that it impacts on the relation between repurchase intention and some of its determinants.

Originality/value

This research determines the impact of security, privacy concerns and EOU on the online repurchasing behavior of Gen Y in India. The mediation effect of e-satisfaction and trust has also been determined.

Details

Marketing Intelligence & Planning, vol. 38 no. 4
Type: Research Article
ISSN: 0263-4503

Article
Publication date: 25 October 2021

Shrawan Kumar Trivedi, Pradipta Patra and Saumya Singh

Abstract

Purpose

Social media sites are one of the vital technological developments of Web 2.0. This study aims to build an empirical model to investigate the potential determinants of users’ intention to use social media sites in higher education. Drawing on existing theories, such as the social media acceptance model, the e-learning acceptance model and the unified theory of acceptance and use of technology, and on the existing literature, determinants such as “performance,” “communication functionality” and “self” have been identified for testing. Further, the mediation effect of “peer influence” on these relationships has also been tested.

Design/methodology/approach

A total of 310 students of different private and public Indian institutions participated in an online survey. Exploratory factor analysis and multiple linear regression analysis were performed and the results analyzed.

Findings

The results of this study demonstrated substantial evidence of the impact of performance, communication functionality and self on the intention to use social media in higher education. Mediation by peer influence was also observed in all of these relationships.

Originality/value

An empirical model of intention to use social media in higher education is built. The mediation effect of peer influence is tested.

Details

Global Knowledge, Memory and Communication, vol. 71 no. 1/2
Type: Research Article
ISSN: 2514-9342

Article
Publication date: 13 August 2018

Shrawan Kumar Trivedi and Mohit Yadav

Abstract

Purpose

Shopping online is a fast-growing phenomenon. A look into the rapid exponential growth of the primary players in this sector shows huge market potential for e-commerce. Given the convenience of internet shopping, e-commerce is seen as an emerging trend among consumers, specifically the younger generation (Gen Y). The popularity of e-commerce and online shopping has captured the attention of e-retailers, encouraging researchers to focus on this area. This paper aims to examine the relationship between online repurchase intention and other variables such as security, privacy concerns, trust and ease of use (EOU), mediated by e-satisfaction.

Design/methodology/approach

A self-administered survey method is used, and students aged between 20 and 35 years at universities in northern India are selected as subjects. To test the hypotheses of this study, an online questionnaire is distributed to participants, with 309 legitimate responses received. The data are analyzed using SPSS version 20.0 and AMOS version 20.0. Structural equation modeling is used to examine the model and to test the hypotheses.

Findings

The results of this study show that security, privacy concerns, trust and EOU have a positive significant relationship with repurchase intention. The findings also reveal that e-satisfaction has a full mediation effect between security and repurchase intention and also between trust and repurchase intention. In addition, a partial mediation effect of e-satisfaction is noted between EOU and repurchase intention and between privacy concerns and repurchase intention.

Practical implications

The results show that security, trust, EOU and privacy concerns are the factors that have most impact on consumer purchasing behavior. In terms of the repurchase intention of Gen Y consumers, what is needed are strong security features, an easy-to-use interface, a trusted privacy policy and the creation of trust. Furthermore, it may be beneficial to observe e-satisfaction as a mediator when identifying potential problems; online satisfaction is important for the group in this study, and the results show that it impacts on the relation between repurchase intention and other factors.

Social implications

In terms of the repurchase intention of Gen Y consumers, what is needed are strong security features, an easy-to-use interface, a trusted privacy policy and the creation of trust. Furthermore, it may be beneficial to observe e-satisfaction as a mediator when identifying potential problems; online satisfaction is important for the group in this study, and the results show that it impacts on the relation between repurchase intention and other factors.

Originality/value

This research determines the impact of security, privacy concerns, EOU and trust on the online repurchasing behavior of Gen Y in India. The mediation effect of e-satisfaction is also determined.

Details

VINE Journal of Information and Knowledge Management Systems, vol. 48 no. 3
Type: Research Article
ISSN: 2059-5891

Article
Publication date: 26 February 2021

Shrawan Kumar Trivedi and Amrinder Singh

Abstract

Purpose

There is a strong need for companies to monitor customer-generated content on social media, not only about themselves but also about competitors, to deal with competition and to assess the competitive environment of the business. The purpose of this paper is to help companies with social media competitive analysis and the transformation of social media data into knowledge for decision-makers, specifically for app-based food delivery companies.

Design/methodology/approach

Three online app-based food delivery companies, i.e. Swiggy, Zomato and UberEats, were considered in this study. Twitter was used as the data collection platform: customers’ tweets related to all three companies were fetched using R-Studio, and a lexicon-based sentiment analysis method was applied to them. A descriptive analytical method was used to compute the scores of the different sentiments. Lists of negative and positive sentiment words were created and matched against the words present in the tweets; positive, negative and neutral sentiment scores were decided on the basis of these matches. Sentiment analysis is one of the best methods for analyzing the sentiment of consumers’ text. Lexicon-based sentiment classification is preferred here to machine learning or other models because it gives the flexibility to build a custom sentiment dictionary for classifying emotions. Both the lexicon-based classification and the text mining were performed on the R-Studio platform.
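The word-matching step can be sketched as follows. The study itself worked in R-Studio; this Python version is only an illustration, and the two word lists, the example tweets and the function names are invented here rather than taken from the paper's dictionary or data.

```python
import re

# Hypothetical sentiment word lists, not the study's dictionary.
POSITIVE = {"good", "great", "fast", "fresh", "love"}
NEGATIVE = {"bad", "late", "cold", "rude", "worst"}


def tweet_sentiment(text, positive=POSITIVE, negative=NEGATIVE):
    """Label a tweet by counting matches against the two word lists."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_shares(tweets):
    """Percentage of tweets per sentiment label."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for tweet in tweets:
        counts[tweet_sentiment(tweet)] += 1
    return {label: 100 * count / len(tweets) for label, count in counts.items()}


# Invented example tweets.
sample = [
    "Great food, fast delivery",
    "Order was late and cold",
    "Ordered lunch today",
    "Love the app but rude rider",
]
shares = sentiment_shares(sample)
```

Because the dictionary is just two sets of words, it can be extended with domain terms such as delivery or packaging vocabulary without retraining anything, which is the flexibility the paragraph above refers to.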

Findings

Results suggest that Zomato (26% positive sentiments) received more positive sentiment than the other two companies (25% for Swiggy and 24% for UberEats). Negative sentiment for Zomato was also lower (12%) than for Swiggy and UberEats (13% for both). Further, tweets expressing negative sentiment about all three food delivery companies were analyzed and recommendations for business provided.

Research limitations/implications

The results of this study reveal the value of social media competitive analysis and show the power of text mining and sentiment analysis in extracting business value and competitive advantage. Suggestions, business and research implications are also provided to help companies in developing a social media competitive analysis strategy.

Originality/value

Twitter analysis of food-based companies has been performed.

Details

Global Knowledge, Memory and Communication, vol. 70 no. 8/9
Type: Research Article
ISSN: 2514-9342
