Search results
1 – 10 of 720
Shrawan Kumar Trivedi and Shubhamoy Dey
Abstract
Purpose
Email is an important medium for sharing information rapidly. However, spam is a nuisance in such communication and motivates the building of a robust filtering system with high classification accuracy and good sensitivity towards false positives. In that context, this paper presents a combined classifier technique using a committee selection mechanism, where the main objective is to identify a set of classifiers whose individual decisions can be combined by a committee selection procedure for accurate detection of spam.
Design/methodology/approach
For training and testing the relevant machine learning classifiers, text mining approaches are used in this research. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. Next, the extracted features pass through a dimensionality reduction step in which non-informative features are removed. Subsequently, an informative feature subset is selected using a genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results are compared with those of other classifiers.
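As a rough illustration of this pipeline, the sketch below runs a bag-of-words extraction and then a deliberately tiny mutation-only genetic search over feature subsets; the toy corpus, the Naive Bayes fitness function and the search settings are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the Enron/SpamAssassin/LingSpam emails.
emails = ["win free money now", "meeting at noon today",
          "free offer click now", "project schedule update",
          "cheap meds online now", "lunch tomorrow with the team"]
labels = np.array([1, 0, 1, 0, 1, 0])          # 1 = spam, 0 = ham

X = TfidfVectorizer().fit_transform(emails).toarray()
n_features = X.shape[1]
rng = np.random.default_rng(0)

def fitness(mask):
    """Cross-validated accuracy of Naive Bayes on the selected features."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(MultinomialNB(), X[:, mask.astype(bool)],
                           labels, cv=2).mean()

# Tiny mutation-only genetic search over binary feature masks.
best = rng.integers(0, 2, n_features)
for _ in range(20):
    child = best.copy()
    child[rng.integers(n_features)] ^= 1        # flip one feature in/out
    if fitness(child) >= fitness(best):
        best = child

print("features kept:", int(best.sum()), "of", n_features)
```

A real genetic search would keep a population of masks with crossover and evaluate fitness on a held-out split of the full corpus.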
Findings
For building the proposed combined classifier, three different studies were performed. The first study examined the effect of boosting algorithms on two probabilistic classifiers, Bayesian and Naïve Bayes, and found AdaBoost to be the best algorithm for performance boosting. The second study examined the effect of different kernel functions on the support vector machine (SVM) classifier, where SVM with the normalized polynomial (NP) kernel was observed to be the best. The last study combined the classifiers through committee selection: the committee members were the best classifiers identified in the first study, i.e. Bayesian and Naïve Bayes with AdaBoost, and the committee president was selected from the second study, i.e. SVM with the NP kernel. Results show that combining the identified classifiers into a committee machine gives excellent classification accuracy with a low false positive rate.
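The committee mechanism could be sketched along these lines, with scikit-learn's AdaBoost over two naïve Bayes variants standing in for the Bayesian/Naïve Bayes members and a plain polynomial-kernel SVC standing in for the normalized-polynomial president (both substitutions are assumptions, as is the tie-breaking rule).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized email features.
X, y = make_classification(n_samples=200, random_state=0)

members = [AdaBoostClassifier(GaussianNB(), n_estimators=10, random_state=0),
           AdaBoostClassifier(BernoulliNB(), n_estimators=10, random_state=0)]
president = SVC(kernel="poly", degree=2, gamma="scale")

for model in members + [president]:
    model.fit(X, y)

def committee_predict(x):
    votes = [m.predict(x)[0] for m in members]
    # Members agree -> take their decision; otherwise the president decides.
    return votes[0] if votes[0] == votes[1] else president.predict(x)[0]

preds = np.array([committee_predict(X[i:i + 1]) for i in range(len(X))])
print("training accuracy:", (preds == y).mean())
```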
Research limitations/implications
This research is focused on the classification of email spam written in the English language. Only the body (text) parts of the emails have been used; image spam has not been included in this work. The work is restricted to email messages, so other message types, such as short message service (SMS) or multimedia messaging service (MMS), were not part of this study.
Practical implications
This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.
Originality/value
The proposed combined classifier is a novel classifier designed for accurate classification of email spam.
Details
Keywords
Hari Hara Krishna Kumar Viswanathan, Punniyamoorthy Murugesan, Sundar Rengasamy and Lavanya Vilvanathan
Abstract
Purpose
The purpose of this study is to compare the classification learning ability of our algorithm based on boosted support vector machine (B-SVM), against other classification techniques in predicting the credit ratings of banks. The key feature of this study is the usage of an imbalanced dataset (in the response variable/rating) with a smaller number of observations (number of banks).
Design/methodology/approach
In general, datasets in the banking sector are small and imbalanced. In this study, 23 Scheduled Commercial Banks (SCBs) in India were chosen, and their corporate ratings were collated from the Indian subsidiary of a reputed global rating agency. The top management of the rating agency provided 12 quantitative input variables that are considered essential for rating a bank within India. To overcome the challenge of a dataset that is imbalanced and has a small number of observations, this study uses the "Modified Boosted Support Vector Machines" (MBSVM) algorithm proposed by Punniyamoorthy Murugesan and Sundar Rengasamy. The study also compares the classification ability of this algorithm against other classification techniques, such as multi-class SVM, back-propagation neural networks, multi-class linear discriminant analysis (LDA) and k-nearest neighbors (k-NN) classification, on the basis of the geometric mean (GM).
Findings
The performance of each algorithm has been compared on one metric, the geometric mean, also known as GMean (GM). This metric captures class-wise sensitivity: it is the root of the product of the per-class sensitivities, so a low score on any single class drags the whole metric down. The findings of the study show that the proposed MBSVM technique outperforms the other techniques.
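Concretely, GM is the geometric mean of the per-class sensitivities, so a classifier that neglects the minority class drives the score toward zero. A minimal sketch with an invented prediction vector:

```python
import numpy as np

def gmean(y_true, y_pred):
    """Geometric mean of per-class sensitivities (recalls)."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0])   # misses one minority sample
print(gmean(y_true, y_pred))            # sqrt(1.0 * 0.5) ≈ 0.707
```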
Research limitations/implications
This study provides an algorithm to predict ratings of banks where the dataset is small and imbalanced. One of the limitations of this research study is that subjective factors have not been included in our model; the sole focus is on the results generated by the models (driven by quantitative parameters). In future, studies may be conducted which may include subjective parameters (proxied by relevant and quantifiable variables).
Practical implications
Various stakeholders such as investors, regulators and central banks can predict the credit ratings of banks by themselves, by inputting appropriate data to the model.
Originality/value
In the process of rating banks, the usage of an imbalanced dataset can lessen the performance of the soft-computing techniques. In order to overcome this, the authors have come up with a novel classification approach based on “MBSVMs”, which can be used as a yardstick for such imbalanced datasets. For this purpose, through primary research, 12 features have been identified that are considered essential by the credit rating agencies.
Abstract
Purpose
Syntax-based text classification (TC) mechanisms have been largely replaced by semantic-based systems in recent years. Semantic-based TC systems are particularly useful in scenarios where similarity among documents is computed by considering semantic relationships among their terms. Kernel functions have received major attention because of the unprecedented popularity of SVMs in the field of TC. Most kernel functions exploit the syntactic structure of the text, but quite a few also use a priori semantic information for knowledge extraction. The purpose of this paper is to investigate semantic kernel functions in the context of TC.
Design/methodology/approach
This work presents a performance and accuracy analysis of seven semantic kernel functions (Semantic Smoothing Kernel, Latent Semantic Kernel, Semantic WordNet-based Kernel, Semantic Smoothing Kernel with Implicit Superconcept Expansions, Compactness-based Disambiguation Kernel Function, Omiotis-based S-VSM semantic kernel function and Top-k S-VSM semantic kernel) implemented with SVM as the kernel method. All seven semantic kernels are implemented in the SVM-Light tool.
Findings
Performance and accuracy parameters of the seven semantic kernel functions have been evaluated and compared. The experimental results show that the Top-k S-VSM semantic kernel has the highest performance and accuracy among all evaluated kernel functions, making it a preferred building block for kernel methods in TC and retrieval.
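The idea shared by several of these kernels is to inject a term-similarity matrix S into the document inner product, k(d1, d2) = d1·S·d2ᵀ (the semantic smoothing construction). The sketch below uses scikit-learn's precomputed-kernel SVC in place of SVM-Light, with toy documents and an invented similarity matrix rather than WordNet or Omiotis values.

```python
import numpy as np
from sklearn.svm import SVC

# Toy doc-term counts over the vocabulary ["car", "auto", "truck"].
D = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 0.],
              [0., 0., 1.]])
y = np.array([0, 1, 0, 1])

# Invented term-term semantic similarity matrix (symmetric, positive definite).
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

K = D @ S @ D.T                  # semantic kernel (Gram) matrix
clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K))            # predictions on the training documents
```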
Research limitations/implications
A combination of semantic and syntactic kernel functions needs to be investigated, as there is scope for further improvement in accuracy and performance in all seven semantic kernel functions.
Practical implications
This research provides insight into TC using a priori semantic knowledge. Three commonly used data sets are exploited. It would be interesting to explore these kernel functions on live web data, which would test their actual utility in real business scenarios.
Originality/value
Comparison of performance and accuracy parameters is the novel point of this research paper. To the best of the authors’ knowledge, this type of comparison has not been done previously.
Abstract
Purpose
The purpose of this paper is to present a new pattern recognition‐based algorithm to detect high‐impedance faults (HIFs), including those involving only broken conductors and arcs, in distribution networks.
Design/methodology/approach
In the proposed method, time-frequency features of the current waveform are calculated using the discrete wavelet transform. Then, to extract the best subset of the generated time-frequency features, principal component analysis (PCA) is applied, and finally a support vector machine (SVM) is used as a classifier to distinguish HIFs, including those involving only broken conductors and arcs, from similar phenomena such as capacitor bank switching, no-load transformer switching, load switching, insulator leakage current and harmonic loads.
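A rough sketch of this pipeline, with a one-level Haar transform and synthetic signals standing in for the paper's wavelet features and real distribution-network waveforms (both are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def haar_features(sig):
    # One-level Haar DWT: approximation and detail coefficients.
    a = (sig[0::2] + sig[1::2]) / np.sqrt(2)
    d = (sig[0::2] - sig[1::2]) / np.sqrt(2)
    return np.concatenate([a, d])

t = np.linspace(0, 1, 64, endpoint=False)

def make_event(arcing):
    sig = np.sin(2 * np.pi * 5 * t)          # fundamental current
    if arcing:                               # HIF proxy: high-frequency arc noise
        sig += 0.3 * rng.standard_normal(t.size)
    return haar_features(sig)

X = np.array([make_event(i % 2 == 0) for i in range(40)])
y = np.array([i % 2 == 0 for i in range(40)], dtype=int)

clf = make_pipeline(StandardScaler(), PCA(n_components=5),
                    SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```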
Findings
The experimental results show that an SVM with PCA as the feature extraction method and a radial basis function (RBF) kernel offers acceptable security and dependability in distinguishing HIFs, including those involving only broken conductors and arcs, from similar phenomena, and is superior to Bayes and multi‐layer perceptron neural network classifiers.
Originality/value
A new combination of time-frequency features with an SVM yields an algorithm for detecting HIFs, including those involving only broken conductors and arcs, with acceptable security and dependability.
Abdullah Alharbi, Wajdi Alhakami, Sami Bourouis, Fatma Najar and Nizar Bouguila
Abstract
This paper proposes a novel, reliable detection method to recognize forged inpainted images. Detecting potential forgeries and authenticating the content of digital images is extremely challenging and important for many applications. The proposed approach develops new probabilistic support vector machine (SVM) kernels from a flexible generative statistical model, the "bounded generalized Gaussian mixture model". The learning framework has the advantage of properly combining the benefits of both discriminative and generative models and of incorporating prior knowledge about the nature of the data. It can effectively recognize whether an image has been tampered with, identifying both forged and authentic images. The results confirm that the framework performs well across numerous inpainted images.
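As a toy stand-in for deriving an SVM kernel from the bounded generalized Gaussian mixture model, the sketch below builds a Bhattacharyya (probability product) kernel from per-image feature histograms and feeds it to a precomputed-kernel SVC; the "images" and the histogram model are invented purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

def histogram(img, bins=8):
    h, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    return h / h.sum()

# Toy "images": authentic ~ varied texture, inpainted ~ over-smooth patch.
authentic = [rng.uniform(0, 1, 256) for _ in range(10)]
forged = [np.clip(0.5 + 0.05 * rng.standard_normal(256), 0, 1)
          for _ in range(10)]
H = np.array([histogram(x) for x in authentic + forged])
y = np.array([0] * 10 + [1] * 10)

# Bhattacharyya (probability product) kernel between the histograms.
K = np.sqrt(H) @ np.sqrt(H).T
clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```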
Abstract
Purpose
Intelligent prediction of node localization in wireless sensor networks (WSNs) is a major concern for researchers. The huge amount of data generated by modern sensor array systems requires computationally efficient calibration techniques. This paper aims to improve localization accuracy by identifying obstacles in the optimization process and network scenarios.
Design/methodology/approach
The proposed method incorporates distance estimation between nodes and packet-transmission hop counts. This estimation feeds the proposed support vector machine (SVM), which finds the network path using a time difference of arrival (TDoA)-based SVM. However, if the data set is noisy, the SVM is prone to poor optimization, which leads to overlap between the target classes and the TDoA pathways. The enhanced grey wolf optimization (EGWO) technique is therefore introduced to eliminate the overlapping target classes in the SVM.
Findings
The performance and efficacy of the model are analyzed against existing TDoA methodologies. The simulation results show that the proposed TDoA-EGWO achieves a higher detection efficiency of 98%, a control overhead of 97.8% and a better packet delivery ratio than other traditional methods.
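For context, TDoA localization rests on the fact that the arrival-time difference between two anchors fixes the difference of the node's distances to them, d1 − d2 = c·Δt. A minimal numeric check with assumed anchor and node positions:

```python
import numpy as np

C = 3e8                               # propagation speed (m/s), RF assumption
node = np.array([30.0, 40.0])         # unknown node (assumed for the check)
anchor1 = np.array([0.0, 0.0])
anchor2 = np.array([100.0, 0.0])

d1 = np.linalg.norm(node - anchor1)   # 50 m
d2 = np.linalg.norm(node - anchor2)   # ~80.6 m
dt = (d1 - d2) / C                    # measured arrival-time difference
print("range difference from TDoA:", C * dt, "m")
```

Each anchor pair constrains the node to one hyperbola; intersecting several such constraints (the part the SVM/EGWO stage handles in the paper) pins down the position.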
Originality/value
The proposed method is successful in detecting the unknown position of the sensor node with a detection rate greater than that of other methods.
Xiaoguang Tian, Robert Pavur, Henry Han and Lili Zhang
Abstract
Purpose
Studies on mining text and generating intelligence from human resource documents are rare. This research aims to use artificial intelligence and machine learning techniques to facilitate the employee selection process through latent semantic analysis (LSA), bidirectional encoder representations from transformers (BERT) and support vector machines (SVM). The research also compares the performance of different machine learning, text vectorization and sampling approaches on human resource (HR) resume data.
Design/methodology/approach
LSA and BERT are used to discover and understand the hidden patterns from a textual resume dataset, and SVM is applied to build the screening model and improve performance.
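A rough sketch of such a screening model, using TF-IDF plus truncated SVD for the LSA step and a linear SVM with cross-validation; the resume snippets, labels and component count are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy resume snippets; 1 = shortlisted for a data-science role.
resumes = ["python machine learning sql", "java spring microservices",
           "deep learning pytorch nlp", "react javascript frontend",
           "data science pandas statistics", "kotlin android mobile"]
hired = np.array([1, 0, 1, 0, 1, 0])

model = make_pipeline(TfidfVectorizer(),
                      TruncatedSVD(n_components=2, random_state=0),  # LSA
                      LinearSVC())
scores = cross_val_score(model, resumes, hired, cv=2)
print("cross-validated accuracy:", scores.mean())
```

A BERT variant would swap the TF-IDF/SVD front end for sentence embeddings while keeping the same SVM back end.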
Findings
Based on the results of this study, LSA and BERT prove useful in retrieving critical topics, and SVM can optimize the prediction model's performance with the help of cross-validation and variable selection strategies.
Research limitations/implications
The techniques and their empirical conclusions provide a practical and theoretical basis, and a reference point, for HR research.
Practical implications
The novel methods proposed in the study can assist HR practitioners in designing and improving their existing recruitment process. The topic detection techniques used in the study provide HR practitioners insights to identify the skill set of a particular recruiting position.
Originality/value
To the best of the authors’ knowledge, this research is the first study that uses LSA, BERT, SVM and other machine learning models in human resource management and resume classification. Compared with the existing machine learning-based resume screening system, the proposed system can provide more interpretable insights for HR professionals to understand the recommendation results through the topics extracted from the resumes. The findings of this study can also help organizations to find a better and effective approach for resume screening and evaluation.
D. K. Malhotra, Kunal Malhotra and Rashmi Malhotra
Abstract
Traditionally, loan officers use different credit scoring models to complement judgmental methods to classify consumer loan applications. This study explores the use of decision trees, AdaBoost, and support vector machines (SVMs) to identify potential bad loans. Our results show that AdaBoost does provide an improvement over simple decision trees as well as SVM models in predicting good credit clients and bad credit clients. To cross-validate our results, we use k-fold classification methodology.
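The comparison could be sketched as below on simulated "loan" data; the dataset, model settings and fold count are illustrative assumptions, not the study's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Simulated applicant features; 1 = good credit, 0 = bad credit.
X, y = make_classification(n_samples=300, n_informative=6, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "svm": SVC(kernel="rbf", gamma="scale"),
}
# k-fold cross-validated accuracy for each classifier.
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```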
Rabeb Faleh, Sami Gomri, Khalifa Aguir and Abdennaceur Kachouri
Abstract
Purpose
The purpose of this paper is to improve the classification of pollutants using WO3 gas sensors. To evaluate the discrimination capacity, experiments were performed with three gases (ozone, ethanol and acetone) and a mixture of ozone and ethanol, via four WO3 sensors.
Design/methodology/approach
To improve classification accuracy and enhance selectivity, combined features configured through principal component analysis were used. First, to evaluate the discrimination capacity, experiments were performed with three gases (ozone, ethanol and acetone) and a mixture of ozone and ethanol, via four WO3 sensors. To this end, three features, the derivative, the integral and the time corresponding to the peak of the derivative, were extracted from each transient response of the four WO3 gas sensors. These extracted parameters were then used in a combined array.
Findings
The results show that the proposed feature extraction method extracts robust information. The extreme learning machine (ELM) was used to identify the studied gases and was compared with the support vector machine (SVM). The experimental results prove the superiority of the combined-features method in this e-nose application: it achieves classification rates of 90% using the ELM and 93.03% using an SVM with a radial basis function kernel (SVM-RBF).
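For context, an ELM trains only its output layer: the hidden weights are random and fixed, and the output weights come from a single least-squares solve. A compact sketch on simulated four-sensor responses (the data, class means and network size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy features: 3 gas classes of 4-sensor responses around different means.
X = np.vstack([rng.normal(mu, 0.1, size=(20, 4)) for mu in (0.2, 0.5, 0.8)])
y = np.repeat([0, 1, 2], 20)
T = np.eye(3)[y]                       # one-hot targets

n_hidden = 30
W = rng.standard_normal((4, n_hidden)) # random, untrained hidden weights
b = rng.standard_normal(n_hidden)
H = np.tanh(X @ W + b)                 # hidden-layer activations

beta = np.linalg.pinv(H) @ T           # least-squares output weights
pred = np.argmax(H @ beta, axis=1)
print("training accuracy:", (pred == y).mean())
```

The absence of iterative training is what makes the ELM fast to fit, at the cost of needing enough random hidden units to span the problem.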
Originality/value
Combined features have been configured from transient response to improve the classification accuracy. The achieved results show that the proposed feature extraction method could extract robust information. The ELM and SVM were used to identify the studied gases.