Search results

1 – 10 of over 1000
Open Access
Article
Publication date: 9 May 2022

Khalid Iqbal and Muhammad Shehrayar Khan

In this digital era, email is the most pervasive form of communication between people. Many users become a victim of spam emails and their data have been exposed.

9082

Abstract

Purpose

In this digital era, email is the most pervasive form of communication between people. Many users become a victim of spam emails and their data have been exposed.

Design/methodology/approach

Researchers contribute to solving this problem by a focus on advanced machine learning algorithms and improved models for detecting spam emails but there is still a gap in features. To achieve good results, features also play an important role. To evaluate the performance of applied classifiers, 10-fold cross-validation is used.

Findings

The results approve that the spam emails are correctly classified with the accuracy of 98.00% for the Support Vector Machine and 98.06% for the Artificial Neural Network as compared to other applied machine learning classifiers.

Originality/value

In this paper, Point-Biserial correlation is applied to each feature concerning the class label of the University of California Irvine (UCI) spambase email dataset to select the best features. Extensive experiments are conducted on selected features by training the different classifiers.

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2634-1964

Keywords

Article
Publication date: 1 November 2019

Shrawan Kumar Trivedi and Shubhamoy Dey

Email is a rapid and cheapest medium of sharing information, whereas unsolicited email (spam) is constant trouble in the email communication. The rapid growth of the spam creates…

Abstract

Purpose

Email is a rapid and cheapest medium of sharing information, whereas unsolicited email (spam) is constant trouble in the email communication. The rapid growth of the spam creates a necessity to build a reliable and robust spam classifier. This paper aims to presents a study of evolutionary classifiers (genetic algorithm [GA] and genetic programming [GP]) without/with the help of an ensemble of classifiers method. In this research, the classifiers ensemble has been developed with adaptive boosting technique.

Design/methodology/approach

Text mining methods are applied for classifying spam emails and legitimate emails. Two data sets (Enron and SpamAssassin) are taken to test the concerned classifiers. Initially, pre-processing is performed to extract the features/words from email files. Informative feature subset is selected from greedy stepwise feature subset search method. With the help of informative features, a comparative study is performed initially within the evolutionary classifiers and then with other popular machine learning classifiers (Bayesian, naive Bayes and support vector machine).

Findings

This study reveals the fact that evolutionary algorithms are promising in classification and prediction applications where genetic programing with adaptive boosting is turned out not only an accurate classifier but also a sensitive classifier. Results show that initially GA performs better than GP but after an ensemble of classifiers (a large number of iterations), GP overshoots GA with significantly higher accuracy. Amongst all classifiers, boosted GP turns out to be not only good regarding classification accuracy but also low false positive (FP) rates, which is considered to be the important criteria in email spam classification. Also, greedy stepwise feature search is found to be an effective method for feature selection in this application domain.

Research limitations/implications

The research implication of this research consists of the reduction in cost incurred because of spam/unsolicited bulk email. Email is a fundamental necessity to share information within a number of units of the organizations to be competitive with the business rivals. In addition, it is continually a hurdle for internet service providers to provide the best emailing services to their customers. Although, the organizations and the internet service providers are continuously adopting novel spam filtering approaches to reduce the number of unwanted emails, the desired effect could not be significantly seen because of the cost of installation, customizable ability and the threat of misclassification of important emails. This research deals with all the issues and challenges faced by internet service providers and organizations.

Practical implications

In this research, the proposed models have not only provided excellent performance accuracy, sensitivity with low FP rate, customizable capability but also worked on reducing the cost of spam. The same models may be used for other applications of text mining also such as sentiment analysis, blog mining, news mining or other text mining research.

Originality/value

A comparison between GP and GAs has been shown with/without ensemble in spam classification application domain.

Article
Publication date: 25 October 2018

Shrawan Kumar Trivedi, Shubhamoy Dey and Anil Kumar

Sentiment analysis and opinion mining are emerging areas of research for analyzing Web data and capturing users’ sentiments. This research aims to present sentiment analysis of an…

Abstract

Purpose

Sentiment analysis and opinion mining are emerging areas of research for analyzing Web data and capturing users’ sentiments. This research aims to present sentiment analysis of an Indian movie review corpus using natural language processing and various machine learning classifiers.

Design/methodology/approach

In this paper, a comparative study between three machine learning classifiers (Bayesian, naïve Bayesian and support vector machine [SVM]) was performed. All the classifiers were trained on the words/features of the corpus extracted, using five different feature selection algorithms (Chi-square, info-gain, gain ratio, one-R and relief-F [RF] attributes), and a comparative study was performed between them. The classifiers and feature selection approaches were evaluated using different metrics (F-value, false-positive [FP] rate and training time).

Findings

The results of this study show that, for the maximum number of features, the RF feature selection approach was found to be the best, with better F-values, a low FP rate and less time needed to train the classifiers, whereas for the least number of features, one-R was better than RF. When the evaluation was performed for machine learning classifiers, SVM was found to be superior, although the Bayesian classifier was comparable with SVM.

Originality/value

This is a novel research where Indian review data were collected and then a classification model for sentiment polarity (positive/negative) was constructed.

Details

The Electronic Library, vol. 36 no. 4
Type: Research Article
ISSN: 0264-0473

Keywords

Article
Publication date: 18 October 2018

Kalyan Nagaraj, Biplab Bhattacharjee, Amulyashree Sridhar and Sharvani GS

Phishing is one of the major threats affecting businesses worldwide in current times. Organizations and customers face the hazards arising out of phishing attacks because of…

Abstract

Purpose

Phishing is one of the major threats affecting businesses worldwide in current times. Organizations and customers face the hazards arising out of phishing attacks because of anonymous access to vulnerable details. Such attacks often result in substantial financial losses. Thus, there is a need for effective intrusion detection techniques to identify and possibly nullify the effects of phishing. Classifying phishing and non-phishing web content is a critical task in information security protocols, and full-proof mechanisms have yet to be implemented in practice. The purpose of the current study is to present an ensemble machine learning model for classifying phishing websites.

Design/methodology/approach

A publicly available data set comprising 10,068 instances of phishing and legitimate websites was used to build the classifier model. Feature extraction was performed by deploying a group of methods, and relevant features extracted were used for building the model. A twofold ensemble learner was developed by integrating results from random forest (RF) classifier, fed into a feedforward neural network (NN). Performance of the ensemble classifier was validated using k-fold cross-validation. The twofold ensemble learner was implemented as a user-friendly, interactive decision support system for classifying websites as phishing or legitimate ones.

Findings

Experimental simulations were performed to access and compare the performance of the ensemble classifiers. The statistical tests estimated that RF_NN model gave superior performance with an accuracy of 93.41 per cent and minimal mean squared error of 0.000026.

Research limitations/implications

The research data set used in this study is publically available and easy to analyze. Comparative analysis with other real-time data sets of recent origin must be performed to ensure generalization of the model against various security breaches. Different variants of phishing threats must be detected rather than focusing particularly toward phishing website detection.

Originality/value

The twofold ensemble model is not applied for classification of phishing websites in any previous studies as per the knowledge of authors.

Details

Journal of Systems and Information Technology, vol. 20 no. 3
Type: Research Article
ISSN: 1328-7265

Keywords

Article
Publication date: 29 September 2020

Hari Hara Krishna Kumar Viswanathan, Punniyamoorthy Murugesan, Sundar Rengasamy and Lavanya Vilvanathan

The purpose of this study is to compare the classification learning ability of our algorithm based on boosted support vector machine (B-SVM), against other classification…

Abstract

Purpose

The purpose of this study is to compare the classification learning ability of our algorithm based on boosted support vector machine (B-SVM), against other classification techniques in predicting the credit ratings of banks. The key feature of this study is the usage of an imbalanced dataset (in the response variable/rating) with a smaller number of observations (number of banks).

Design/methodology/approach

In general, datasets in banking sector are small and imbalanced too. In this study, 23 Scheduled Commercial Banks (SCBs) have been chosen (in India), and their corresponding corporate ratings have been collated from the Indian subsidiary of reputed global rating agency. The top management of the rating agency provided 12 input (quantitative) variables that are considered essential for rating a bank within India. In order to overcome the challenge of dataset being imbalanced and having small number of observations, this study uses an algorithm, namely “Modified Boosted Support Vector Machines” (MBSVMs) proposed by Punniyamoorthy Murugesan and Sundar Rengasamy. This study also compares the classification ability of the aforementioned algorithm against other classification techniques such as multi-class SVM, back propagation neural networks, multi-class linear discriminant analysis (LDA) and k-nearest neighbors (k-NN) classification, on the basis of geometric mean (GM).

Findings

The performances of each algorithm have been compared based on one metric—the geometric mean, also known as GMean (GM). This metric typically indicates the class-wise sensitivity by using the values of products. The findings of the study prove that the proposed MBSVM technique outperforms the other techniques.

Research limitations/implications

This study provides an algorithm to predict ratings of banks where the dataset is small and imbalanced. One of the limitations of this research study is that subjective factors have not been included in our model; the sole focus is on the results generated by the models (driven by quantitative parameters). In future, studies may be conducted which may include subjective parameters (proxied by relevant and quantifiable variables).

Practical implications

Various stakeholders such as investors, regulators and central banks can predict the credit ratings of banks by themselves, by inputting appropriate data to the model.

Originality/value

In the process of rating banks, the usage of an imbalanced dataset can lessen the performance of the soft-computing techniques. In order to overcome this, the authors have come up with a novel classification approach based on “MBSVMs”, which can be used as a yardstick for such imbalanced datasets. For this purpose, through primary research, 12 features have been identified that are considered essential by the credit rating agencies.

Details

Benchmarking: An International Journal, vol. 28 no. 1
Type: Research Article
ISSN: 1463-5771

Keywords

Article
Publication date: 1 May 2007

Fuchun Peng and Xiangji Huang

The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese where no word…

Abstract

Purpose

The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task.

Design/methodology/approach

Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation‐based approach was compared with the non‐segmentation‐based approach.

Findings

There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy.

Practical implications

Apply the findings to real web text classification is ongoing work.

Originality/value

The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification, web search.

Details

Journal of Documentation, vol. 63 no. 3
Type: Research Article
ISSN: 0022-0418

Keywords

Article
Publication date: 8 June 2021

Jyoti Godara, Rajni Aron and Mohammad Shabaz

Sentiment analysis has observed a nascent interest over the past decade in the field of social media analytics. With major advances in the volume, rationality and veracity of…

Abstract

Purpose

Sentiment analysis has observed a nascent interest over the past decade in the field of social media analytics. With major advances in the volume, rationality and veracity of social networking data, the misunderstanding, uncertainty and inaccuracy within the data have multiplied. In the textual data, the location of sarcasm is a challenging task. It is a different way of expressing sentiments, in which people write or says something different than what they actually intended to. So, the researchers are showing interest to develop various techniques for the detection of sarcasm in the texts to boost the performance of sentiment analysis. This paper aims to overview the sentiment analysis, sarcasm and related work for sarcasm detection. Further, this paper provides training to health-care professionals to make the decision on the patient’s sentiments.

Design/methodology/approach

This paper has compared the performance of five different classifierssupport vector machine, naïve Bayes classifier, decision tree classifier, AdaBoost classifier and K-nearest neighbour on the Twitter data set.

Findings

This paper has observed that naïve Bayes has performed the best having the highest accuracy of 61.18%, and decision tree performed the worst with an accuracy of 54.27%. Accuracy of AdaBoost, K-nearest neighbour and support vector machine measured were 56.13%, 54.81% and 59.55%, respectively.

Originality/value

This research work is original.

Details

World Journal of Engineering, vol. 19 no. 1
Type: Research Article
ISSN: 1708-5284

Keywords

Article
Publication date: 29 December 2023

Thanh-Nghi Do and Minh-Thu Tran-Nguyen

This study aims to propose novel edge device-tailored federated learning algorithms of local classifiers (stochastic gradient descent, support vector machines), namely, FL-lSGD…

Abstract

Purpose

This study aims to propose novel edge device-tailored federated learning algorithms of local classifiers (stochastic gradient descent, support vector machines), namely, FL-lSGD and FL-lSVM. These algorithms are designed to address the challenge of large-scale ImageNet classification.

Design/methodology/approach

The authors’ FL-lSGD and FL-lSVM trains in a parallel and incremental manner to build an ensemble local classifier on Raspberry Pis without requiring data exchange. The algorithms load small data blocks of the local training subset stored on the Raspberry Pi sequentially to train the local classifiers. The data block is split into k partitions using the k-means algorithm, and models are trained in parallel on each data partition to enable local data classification.

Findings

Empirical test results on the ImageNet data set show that the authors’ FL-lSGD and FL-lSVM algorithms with 4 Raspberry Pis (Quad core Cortex-A72, ARM v8, 64-bit SoC @ 1.5GHz, 4GB RAM) are faster than the state-of-the-art LIBLINEAR algorithm run on a PC (Intel(R) Core i7-4790 CPU, 3.6 GHz, 4 cores, 32GB RAM).

Originality/value

Efficiently addressing the challenge of large-scale ImageNet classification, the authors’ novel federated learning algorithms of local classifiers have been tailored to work on the Raspberry Pi. These algorithms can handle 1,281,167 images and 1,000 classes effectively.

Details

International Journal of Web Information Systems, vol. 20 no. 1
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 14 December 2021

Arijit Maji and Indrajit Mukherjee

The purpose of this study is to propose an effective unsupervised one-class-classifier (OCC) support vector machine (SVM)-based single multivariate control chart (OCC-SVM) to…

Abstract

Purpose

The purpose of this study is to propose an effective unsupervised one-class-classifier (OCC) support vector machine (SVM)-based single multivariate control chart (OCC-SVM) to simultaneously monitor “location” and “scale” shifts of a manufacturing process.

Design/methodology/approach

The step-by-step approach to developing, implementing and fine-tuning the intrinsic parameters of the OCC-SVM chart is demonstrated based on simulation and two real-life case examples.

Findings

A comparative study, considering varied known and unknown response distributions, indicates that the OCC-SVM is highly effective in detecting process shifts of samples with individual observations. OCC-SVM chart also shows promising results for samples with a rational subgroup of observations. In addition, the results also indicate that the performance of OCC-SVM is unaffected by the small reference sample size.

Research limitations/implications

The sample responses are considered identically distributed with no significant multivariate autocorrelation between sample observations.

Practical implications

The proposed easy-to-implement chart shows satisfactory performance to detect an out-of-control signal with known or unknown response distributions.

Originality/value

Various multivariate (e.g. parametric or nonparametric) control chart(s) are recommended to monitor the mean (e.g. location) and variance (e.g. scale) of multiple correlated responses in a manufacturing process. However, real-life implementation of a parametric control chart may be complex due to its restrictive response distribution assumptions. There is no evidence of work in the open literature that demonstrates the suitability of an unsupervised OCC-SVM chart to simultaneously monitor “location” and “scale” shifts of multivariate responses. Thus, a new efficient OCC-SVM single chart approach is proposed to address this gap to monitor a multivariate manufacturing process with unknown response distributions.

Details

International Journal of Quality & Reliability Management, vol. 40 no. 2
Type: Research Article
ISSN: 0265-671X

Keywords

Article
Publication date: 28 November 2018

Ning Zhang, Ruru Pan, Lei Wang, Shanshan Wang, Jun Xiang and Weidong Gao

The purpose of this paper is to propose a novel method using support vector machine (SVM) classifiers for objective seam pucker evaluation. Features are extracted using wavelet…

Abstract

Purpose

The purpose of this paper is to propose a novel method using support vector machine (SVM) classifiers for objective seam pucker evaluation. Features are extracted using wavelet analysis and gray-level co-occurrence matrix (GLCM), and the samples are evaluated using SVM classifiers. The study aims to solve the problem of inappropriate parameters and large required samples in objective seam pucker evaluation.

Design/methodology/approach

Initially, seam pucker image was captured, and Edge detection and Hough transform were utilized to normalize the seam position and orientation. After cropping the image, the intensity was adjusted to the same identical level through histogram specification. Then, the standard deviations of the horizontal image and diagonal image, reconstructed using wavelet decomposition and reconstruction, were calculated based on parameter optimization. Meanwhile, GLCM was extracted from the restructured horizontal detail image, then the contrast and correlation of GLCM were calculated. Finally, these four features were imported to SVM classifiers based on genetic algorithm for evaluation.

Findings

The four extracted features reflected linear relationships among five grades. The experimental results showed that the classification accuracy was 96 percent, which catches up to the performance of human vision, and resolves ambiguity and subjective of the manual evaluation.

Originality/value

There are large required samples in current research. This paper provides a novel method using finite samples, and the parameters of the methods were discussed for parameter optimization. The evaluation results can provide references for analyzing the reason of wrinkles during garment manufacturing.

Details

International Journal of Clothing Science and Technology, vol. 31 no. 1
Type: Research Article
ISSN: 0955-6222

Keywords

1 – 10 of over 1000