Exploring the effectiveness of word embedding based deep learning model for improving email classification

Deepak Suresh Asudani (Department of Computer Science and Engineering, National Institute of Technology Raipur, Raipur, India)
Naresh Kumar Nagwani (Department of Computer Science and Engineering, National Institute of Technology Raipur, Raipur, India)
Pradeep Singh (Department of Computer Science and Engineering, National Institute of Technology Raipur, Raipur, India)

Data Technologies and Applications

ISSN: 2514-9288

Article publication date: 2 February 2022

Issue publication date: 23 August 2022

Abstract

Purpose

Classifying emails as ham or spam based on their content is essential. The most difficult challenge in email categorization is capturing the semantic and syntactic meaning of words and representing them as high-dimensional feature vectors for processing. The purpose of this paper is to examine the effectiveness of pre-trained embedding models for the classification of emails using deep learning classifiers such as the long short-term memory (LSTM) model and the convolutional neural network (CNN) model.

Design/methodology/approach

In this paper, Global Vectors (GloVe) and Bidirectional Encoder Representations from Transformers (BERT) pre-trained word embeddings are used to identify relationships between words, which helps to classify emails into their relevant categories using machine learning and deep learning models. Two benchmark datasets, SpamAssassin and Enron, are used in the experimentation.
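As a rough illustration of how pre-trained word vectors can feed an email classifier, the sketch below averages per-word vectors into one feature vector per email and assigns the label of the closest class centroid. It is not the authors' method: the paper uses LSTM and CNN classifiers, whereas a nearest-centroid rule is swapped in here to keep the example dependency-free. The three-dimensional `TOY_VECTORS` table, the training emails and all function names are hypothetical stand-ins; real GloVe vectors are loaded from a file and have far more dimensions.

```python
import math

# Hypothetical 3-dimensional vectors standing in for pre-trained GloVe vectors.
TOY_VECTORS = {
    "free": [0.9, 0.1, 0.0],
    "winner": [0.8, 0.2, 0.1],
    "prize": [0.85, 0.15, 0.05],
    "meeting": [0.1, 0.9, 0.2],
    "report": [0.05, 0.8, 0.3],
    "schedule": [0.1, 0.85, 0.25],
}


def embed_email(text, vectors=TOY_VECTORS):
    """Average the vectors of known words; words missing from the table are skipped."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def train_centroids(labeled_emails):
    """Compute one mean feature vector per label from (text, label) pairs."""
    by_label = {}
    for text, label in labeled_emails:
        by_label.setdefault(label, []).append(embed_email(text))
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }


def classify(text, centroids):
    """Return the label whose centroid is most similar to the email's vector."""
    vec = embed_email(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training emails, for illustration only.
TRAIN = [
    ("free prize winner", "spam"),
    ("winner free", "spam"),
    ("meeting schedule", "ham"),
    ("report meeting", "ham"),
]
CENTROIDS = train_centroids(TRAIN)

if __name__ == "__main__":
    print(classify("free prize", CENTROIDS))
    print(classify("schedule report", CENTROIDS))
```

Averaging discards word order, which is one reason sequence models such as an LSTM or a CNN over the embedded token sequence are typically preferred when order matters.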

Findings

In the first set of experiments, the support vector machine (SVM) model performs better than the other machine learning classifiers. The second set of experiments compares deep learning model performance without embedding, with GloVe embedding and with BERT embedding. The experiments show that GloVe embedding can be helpful for faster execution with better performance on large-sized datasets.

Originality/value

The experiments reveal that the CNN model with GloVe embedding gives slightly better accuracy than the model with BERT embedding and than traditional machine learning algorithms when classifying an email as ham or spam. It is concluded that word embedding models improve the accuracy of email classifiers.

Acknowledgements

The authors thank the editor and the anonymous reviewers for their insightful and valuable comments and suggestions. The authors gratefully acknowledge the National Institute of Technology, Raipur, for providing the GPU server used for this research.

Citation

Asudani, D.S., Nagwani, N.K. and Singh, P. (2022), "Exploring the effectiveness of word embedding based deep learning model for improving email classification", Data Technologies and Applications, Vol. 56 No. 4, pp. 483-505. https://doi.org/10.1108/DTA-07-2021-0191

Publisher: Emerald Publishing Limited

Copyright © 2022, Emerald Publishing Limited
