Search results
1 – 10 of 604
Deepti Sisodia and Dilip Singh Sisodia
Abstract
Purpose
The problem of choosing the most useful features from hundreds of features in time-series user click data arises in online advertising when classifying fraudulent publishers. Selecting feature subsets is a key issue in such classification tasks. In practice, filter approaches are commonly used; however, they neglect the correlations among features. Conversely, wrapper approaches are often inapplicable because of their computational complexity. Moreover, existing feature selection methods in particular cannot handle such data, which is one of the major causes of instability in feature selection.
Design/methodology/approach
To overcome these issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing publishers' fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where an accumulated evaluation of the relevant feature subset is performed to search for an optimal feature subset using effective machine learning (ML) models.
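The feature distillation phase can be illustrated generically. The sketch below is not the authors' FDAS implementation; it only shows the majority-voting idea, with assumed selectors (two scikit-learn filter methods and one wrapper method), an assumed per-method budget of k features and an assumed vote threshold of two out of three.

```python
# Generic majority-voting feature selection sketch (assumed selectors and
# thresholds; not the FDAS pipeline itself).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
k = 8  # number of features each method keeps (an assumption)

votes = np.zeros(X.shape[1], dtype=int)
# Filter methods: univariate relevance scores.
for score in (f_classif, mutual_info_classif):
    votes += SelectKBest(score, k=k).fit(X, y).get_support()
# Wrapper method: recursive feature elimination with a linear model.
votes += RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y).get_support()

# Keep the features picked by a majority (2 of 3) of the methods.
selected = np.where(votes >= 2)[0]
print(selected)
```

With 24 votes spread over 20 features, at least one feature always reaches the majority threshold; the surviving subset would then feed the accumulated-selection phase.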
Findings
Empirical results show enhanced classification performance with the proposed features in terms of average precision, recall, F1-score and AUC for publisher identification and classification.
Originality/value
The FDAS is evaluated on the FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics: first, considering the original features; second, with relevant feature subsets selected by feature selection (FS) methods; and third, with the optimal feature subset obtained by the proposed approach. An ANOVA significance test is conducted to demonstrate significant differences between independent features.
Syed Haroon Abdul Gafoor and Padma Theagarajan
Abstract
Purpose
Conventional diagnostic techniques may be prone to subjectivity since they depend on the assessment of motions that are often subtle to the individual eye and hence hard to classify, potentially resulting in misdiagnosis. Meanwhile, early nonmotor signs of Parkinson's disease (PD) can be mild and may be due to a variety of other conditions. As a result, these signs are usually ignored, making early PD diagnosis difficult. Machine learning approaches for classifying PD against healthy controls or individuals with similar medical symptoms (such as movement disorders or other Parkinsonian syndromes) have been introduced to solve these problems and to enhance the diagnostic and assessment processes of PD.
Design/methodology/approach
Medical observations and the evaluation of medical symptoms, including the characterization of a wide range of motor indications, are commonly used to diagnose PD. As the quantity of data being processed has grown over the last five years, feature selection has become a prerequisite before any classification. This study introduces a feature selection method based on the score-based artificial fish swarm algorithm (SAFSA) to overcome this issue.
Findings
This study adds to the accuracy of PD identification by reducing the number of chosen vocal features while using the most recent and largest publicly accessible database. Feature subset selection in PD detection techniques starts by eliminating features that are irrelevant or redundant. According to a few objective functions, the chosen feature subset should provide the best performance.
Research limitations/implications
In many situations, this is a nondeterministic polynomial time (NP-hard) problem. This method enhances the PD detection rate by selecting the most essential features from the database. To begin, the data set's dimensionality is reduced using the singular value decomposition (SVD) technique. Next, biogeography-based optimization (BBO) is applied for feature selection; the weight value is a vital parameter for finding the best features in PD classification.
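The SVD reduction step described above can be sketched as follows. This is only an illustration of truncated SVD on a stand-in matrix; the component count and data here are assumptions, not the paper's settings.

```python
# Illustrative truncated-SVD dimensionality reduction (stand-in data and an
# assumed number of components, not the study's actual configuration).
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))          # stand-in for a vocal-feature matrix

svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)        # 100 samples projected onto 10 components
print(X_reduced.shape)
```

The reduced matrix would then be handed to the feature selection stage (BBO in the paper), which searches over the remaining dimensions.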
Originality/value
PD classification is done using ensemble learning classification approaches such as a hybrid classifier combining fuzzy k-nearest neighbor, kernel support vector machines, a fuzzy convolutional neural network and random forest. The suggested classifiers are trained using data from the UCI ML repository, and their results are verified using leave-one-person-out cross-validation. The measures employed to assess classifier efficiency include accuracy, F-measure and the Matthews correlation coefficient.
Chandra Sekhar Kolli and Uma Devi Tatavarthi
Abstract
Purpose
Fraud transaction detection has become a significant factor in communication technologies and electronic commerce systems, as it affects the usage of electronic payment. Even though various fraud detection methods have been developed, enhancing the performance of electronic payment by detecting fraudsters remains a great challenge in bank transactions.
Design/methodology/approach
This paper aims to design a fraud detection mechanism using the proposed Harris water optimization-based deep recurrent neural network (HWO-based deep RNN). The proposed fraud detection strategy includes three phases, namely pre-processing, feature selection and fraud detection. Initially, the input transactional data are subjected to the pre-processing phase, where the data are pre-processed using the Box-Cox transformation to remove redundant and noisy values. The pre-processed data are passed to the feature selection phase, where the essential and suitable features are selected using the wrapper model. The selected features enable the classifier to achieve better detection performance. Finally, the selected features are fed to the detection phase, where a deep recurrent neural network classifier performs fraud detection; the training of the classifier is done by the proposed Harris water optimization algorithm, which is an integration of water wave optimization and Harris hawks optimization.
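The Box-Cox pre-processing step can be sketched in isolation. The synthetic skewed "transaction amounts" below are a stand-in; the paper's actual data and pipeline are not reproduced here.

```python
# Box-Cox pre-processing sketch on synthetic positive-valued amounts
# (stand-in data; the transform requires strictly positive inputs).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=1000)  # heavily right-skewed

transformed, lam = stats.boxcox(amounts)  # lam is the fitted lambda
# The transform pulls the distribution toward a more symmetric, normal shape.
print(round(float(stats.skew(amounts)), 2), round(float(stats.skew(transformed)), 2))
```

After a step like this, the less-skewed data would go to wrapper-based feature selection and then to the RNN classifier.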
Findings
The proposed HWO-based deep RNN obtained better performance in terms of metrics such as accuracy, sensitivity and specificity, with values of 0.9192, 0.7642 and 0.9943, respectively.
Originality/value
An effective fraud detection method named HWO-based deep RNN is designed to detect frauds in bank transactions. The optimal features selected using the wrapper model enable the classifier to find fraudulent activities more efficiently. The detection result is evaluated through the optimization model based on a fitness measure, such that the solution with the minimal error value is declared the best, as it yields better detection results.
Farshid Abdi, Kaveh Khalili-Damghani and Shaghayegh Abolmakarem
Abstract
Purpose
The customer insurance coverage sales plan problem, in which loyal customers are recognized and offered special plans, is an essential problem facing insurance companies. Loyal customers who have enough potential to renew their insurance contracts at the end of the contract term should be persuaded to repurchase or renew their contracts. The aim of this paper is to propose a three-stage data-mining approach to recognize high-potential loyal insurance customers and to predict/plan special insurance coverage sales.
Design/methodology/approach
The first stage addresses data cleansing. In the second stage, several filter and wrapper methods are implemented to select proper features. In the third stage, the k-nearest neighbor algorithm is used to cluster the customers. The approach aims to select a compact feature subset with maximal prediction capability, and it can detect the customers who are more likely to buy a specific insurance coverage at the end of a contract term.
Findings
The proposed approach has been applied in a real case study of an insurance company in Iran. On the basis of the findings, the approach is capable of recognizing customer clusters and planning suitable insurance coverage sales plans for loyal customers with a proper accuracy level. It can therefore help the insurance company identify its potential clients. Consequently, insurance managers can adopt appropriate marketing tactics, allocate the company's resources to their high-potential loyal customers and prevent those customers from switching to competitors.
Originality/value
Despite the importance of recognizing high-potential loyal insurance customers, little research has been done in this area. In this paper, data-mining techniques were developed for the prediction of special insurance coverage sales on the basis of customers' characteristics. The method allows the insurance company to prioritize its customers and focus attention on high-potential loyal ones. Using the outputs of the proposed approach, insurance companies can offer the most productive/economic insurance coverage contracts to their customers. The approach proposed by this study can be customized and used in other service companies.
Taehoon Ko, Je Hyuk Lee, Hyunchang Cho, Sungzoon Cho, Wounjoo Lee and Miji Lee
Abstract
Purpose
Quality management of products is an important part of the manufacturing process. One way to manage and assure product quality is to use machine learning algorithms based on relationships among various process steps. The purpose of this paper is to integrate manufacturing, inspection and after-sales service data to make full use of machine learning algorithms for estimating product quality in a supervised fashion. The proposed frameworks and methods are applied to actual data associated with heavy machinery engines.
Design/methodology/approach
By following Lenzerini’s formula, manufacturing, inspection and after-sales service data from various sources are integrated. The after-sales service data are used to label each engine as normal or abnormal. In this study, one-class classification algorithms are used due to the class imbalance problem. To address the multi-dimensionality of the time series data, the symbolic aggregate approximation (SAX) algorithm is used for data segmentation. Then, a binary genetic algorithm-based wrapper approach is applied to the segmented data to find the optimal feature subset.
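The symbolic aggregate approximation step can be sketched minimally: z-normalize a series, average it into segments (piecewise aggregate approximation) and map each segment mean to a letter via Gaussian breakpoints. The alphabet size, segment count and sine-wave input below are illustrative assumptions, not the paper's settings.

```python
# Minimal SAX sketch: z-normalization -> PAA -> symbol mapping.
# Series length must be divisible by n_segments in this simplified version.
import numpy as np

def sax(series, n_segments=8, breakpoints=(-0.67, 0.0, 0.67), alphabet="abcd"):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                    # z-normalization
    paa = x.reshape(n_segments, -1).mean(axis=1)    # piecewise aggregate approx.
    symbols = np.searchsorted(breakpoints, paa)     # segment mean -> letter index
    return "".join(alphabet[i] for i in symbols)

signal = np.sin(np.linspace(0, 2 * np.pi, 64))      # stand-in sensor trace
word = sax(signal)
print(word)                                          # an 8-letter SAX word
```

In the paper's pipeline, symbolic words like this become candidate features over which the genetic-algorithm wrapper searches.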
Findings
By employing machine learning-based anomaly detection models, an anomaly score for each engine is calculated. Experimental results show that the proposed method can detect defective engines with a high probability before they are shipped.
Originality/value
Through data integration, the actual customer-perceived quality from after-sales service is linked to data from manufacturing and inspection process. In terms of business application, data integration and machine learning-based anomaly detection can help manufacturers establish quality management policies that reflect the actual customer-perceived quality by predicting defective engines.
G.L. Infant Cyril and J.P. Ananth
Abstract
Purpose
The bank is an imperative part of the market economy. The failure or success of an institution relies on the industry's ability to compute credit risk. A loan eligibility prediction model utilizes analysis methods that adapt past and current information about a credit user to make predictions. However, precise loan prediction with risk and assessment analysis is a major challenge in loan eligibility prediction.
Design/methodology/approach
The aim of this research is to present a new method, namely the Social Border Collie Optimization (SBCO)-based deep neuro fuzzy network, for loan eligibility prediction. In this method, the Box-Cox transformation is employed on the input loan data to make the data apt for further processing. The transformed data undergo wrapper-based feature selection to choose suitable features and boost the performance of loan eligibility calculation. Once the features are chosen, naive Bayes (NB) is adopted for feature fusion. In NB training, the classifier builds a probability index table with the help of the input data's feature and group values. The testing of the NB classifier is done using the posterior probability ratio, considering the conditional probability of the normalization constant with the class evidence. Finally, loan eligibility prediction is achieved by a deep neuro fuzzy network trained with the designed SBCO, which is devised by combining the social ski driver (SSD) algorithm and Border Collie Optimization (BCO) to produce the most precise result.
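The naive Bayes "probability index table" and posterior-ratio test can be illustrated on a toy scale. The feature names, loan records and Laplace smoothing constant below are all hypothetical; the paper's exact table construction and fusion step are not reproduced.

```python
# Toy naive Bayes sketch: a frequency table per (class, feature) and a
# posterior probability ratio for a query (all data and names hypothetical).
from collections import Counter, defaultdict

samples = [({"income": "high", "debt": "low"},  "eligible"),
           ({"income": "high", "debt": "high"}, "eligible"),
           ({"income": "low",  "debt": "high"}, "ineligible"),
           ({"income": "low",  "debt": "low"},  "ineligible")]

priors = Counter(label for _, label in samples)
table = defaultdict(Counter)              # (class, feature) -> value counts
for feats, label in samples:
    for f, v in feats.items():
        table[(label, f)][v] += 1

def posterior(feats, label, alpha=1.0):
    # Unnormalized posterior with Laplace smoothing (alpha assumed;
    # the 2 reflects two possible values per feature in this toy data).
    p = priors[label] / len(samples)
    for f, v in feats.items():
        counts = table[(label, f)]
        p *= (counts[v] + alpha) / (sum(counts.values()) + alpha * 2)
    return p

query = {"income": "high", "debt": "low"}
ratio = posterior(query, "eligible") / posterior(query, "ineligible")
print(ratio)   # a ratio > 1 favours the "eligible" class
```

A ratio test like this avoids computing the normalization constant explicitly, since it cancels between the two classes.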
Findings
The analysis is carried out using the accuracy, sensitivity and specificity parameters. The designed method performs with the highest accuracy of 95%, and sensitivity and specificity of 95.4% and 97.3%, when compared to existing methods such as the fuzzy neural network (Fuzzy NN), multiple partial least squares regression model (Multi_PLS), instance-based entropy fuzzy support vector machine (IEFSVM), deep recurrent neural network (Deep RNN) and whale social optimization algorithm-based deep RNN (WSOA-based Deep RNN).
Originality/value
This paper devises the SBCO-based deep neuro fuzzy network for predicting loan eligibility. Here, the deep neuro fuzzy network is trained with the proposed SBCO, which is devised by combining the SSD and BCO to produce the most precise results for loan eligibility prediction.
Guan Yuan, Zhaohui Wang, Fanrong Meng, Qiuyan Yan and Shixiong Xia
Abstract
Purpose
Currently, ubiquitous smartphones embedded with various sensors provide a convenient way to collect raw sequence data. These data bridge the gap between human activity and multiple sensors. Human activity recognition has been widely used in many aspects of daily life, such as medical security, personal safety and living assistance.
Design/methodology/approach
To provide an overview, the authors survey and summarize some important technologies and key issues of human activity recognition, including activity categorization, feature engineering and typical algorithms presented in recent years. In this paper, the authors first introduce the characteristics of embedded sensors and discuss their features, and survey some data labeling strategies for obtaining ground truth labels. Then, following the process of human activity recognition, the authors discuss the methods and techniques of raw data preprocessing and feature extraction, and summarize some popular algorithms used in model training and activity recognition. Third, they introduce some interesting application scenarios of human activity recognition and provide some available data sets as ground truth data to validate proposed algorithms.
Findings
The authors summarize their viewpoints on human activity recognition, discuss the main challenges and point out some potential research directions.
Originality/value
It is hoped that this work will serve as the steppingstone for those interested in advancing human activity recognition.
Yong Gui and Lanxin Zhang
Abstract
Purpose
Influenced by the constantly changing manufacturing environment, no single dispatching rule (SDR) can consistently obtain better scheduling results than other rules for the dynamic job-shop scheduling problem (DJSP). Although the dynamic SDR selection classifier (DSSC) mined by traditional data-mining-based scheduling methods has shown some improvement over an SDR, the enhancement is not significant since the rule selected by the DSSC is still an SDR.
Design/methodology/approach
This paper presents a novel data-mining-based scheduling method for the DJSP with machine failure aiming at minimizing the makespan. Firstly, a scheduling priority relation model (SPRM) is constructed to determine the appropriate priority relation between two operations based on the production system state and the difference between their priority values calculated using multiple SDRs. Subsequently, a training sample acquisition mechanism based on the optimal scheduling schemes is proposed to acquire training samples for the SPRM. Furthermore, feature selection and machine learning are conducted using the genetic algorithm and extreme learning machine to mine the SPRM.
Findings
Results from numerical experiments demonstrate that the SPRM, mined by the proposed method, not only achieves better scheduling results in most manufacturing environments but also maintains a higher level of stability in diverse manufacturing environments than an SDR and the DSSC.
Originality/value
This paper constructs an SPRM and mines it using data-mining technologies to obtain better results than an SDR and the DSSC in various manufacturing environments.
S B Kotsiantis and P E Pintelas
Abstract
Machine Learning algorithms fed with data sets which include information such as attendance data, test scores and other student information can provide tutors with powerful tools for decision‐making. Until now, much of the research has been limited to the relation between single variables and student performance. Combining multiple variables as possible predictors of dropout has generally been overlooked. The aim of this work is to present a high-level architecture and a case study for a prototype machine learning tool which can automatically recognize dropout‐prone students in university-level distance learning classes. Tracking student progress is a time‐consuming job which can be handled automatically by such a tool. While the tutors will still have an essential role in monitoring and evaluating student progress, the tool can compile the data required for reasonable and efficient monitoring. What is more, the application of the tool is not restricted to predicting dropout‐prone students: it can also be used for the prediction of students’ marks, the prediction of how many students will submit a written assignment, and so on. It can also help tutors explore data and build models for prediction, forecasting and classification. Finally, the underlying architecture is independent of the data set and as such it can be used to develop other similar tools.
Shrawan Kumar Trivedi, Amrinder Singh and Somesh Kumar Malhotra
Abstract
Purpose
There is a need to predict whether consumers liked their stay in the hotel rooms or not, and to remove the aspects the customers did not like. Many customers leave a review after staying in a hotel. These reviews are mostly given on the website used to book the hotel and can be considered valuable data, which can be analyzed to provide better services in hotels. The purpose of this study is to use machine learning techniques to analyze the given data and determine the different sentiment polarities of the consumers.
Design/methodology/approach
The data comprise reviews given by hotel customers on the Tripadvisor website, which were made publicly available on Kaggle. Out of 10,000 reviews in the data, a sample of 3,000 negative polarity reviews (customers with bad experiences in the hotel) and 3,000 positive polarity reviews (customers with good experiences in the hotel) is taken to prepare the data set. A two-stage feature selection was applied, which first involved a greedy selection method and then a wrapper method, to generate the 37 most relevant features. An improved stacked decision tree (ISD) classifier is built and compared with state-of-the-art machine learning algorithms. All the tests are done using R-Studio.
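The stacking idea behind the ISD classifier can be sketched generically. The authors' actual work uses R (C5.0, random forest, SVM) on the 37 selected review features; the Python stand-in below only illustrates stacking base learners under a meta-learner, with assumed models and synthetic data.

```python
# Generic stacking sketch (stand-in models and data; not the ISD classifier
# or the study's 37 review features).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Three base learners feed their predictions to a meta-level decision tree.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=DecisionTreeClassifier(random_state=0))
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 2))
```

The meta-learner sees the base models' cross-validated predictions rather than the raw features, which is what lets a stack correct systematic mistakes of its members.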
Findings
The results showed that the new model was satisfactory overall, with 80.77% accuracy for the 50–50 split, 80.74% for the 66–34 split and 80.25% for the 80–20 split, when predicting the nature of the customers’ experience in the hotel, i.e. whether it was positive or negative.
Research limitations/implications
The implication of this research is to provide a showcase of how the polarity of potentially popular reviews can be predicted. From the authors’ perspective, this can help the hotel industry take corrective measures for the betterment of business and promote useful positive reviews. This study also has some limitations: only English reviews are considered, and the study was restricted to data from the Tripadvisor website, although new data may be generated to test the credibility of the model. Only aspect-based sentiment classification is considered in this study.
Originality/value
A stacking of machine learning techniques has been proposed. At first, state-of-the-art classifiers are tested on the given data; then, the three best-performing classifiers (decision tree C5.0, random forest and support vector machine) are taken to build the stack and create the ISD classifier.