Search results

1 – 10 of 15
Article
Publication date: 15 March 2021

Putta Hemalatha and Geetha Mary Amalanathan

Abstract

Purpose

Adequate resources for learning from and training on the data are an important prerequisite for developing an efficient classifier with outstanding performance. The data usually follow a biased distribution of classes, reflecting an unequal distribution of classes within a dataset. This issue is known as the imbalance problem, one of the most common issues occurring in real-time applications. Learning from imbalanced datasets is a ubiquitous challenge in the field of data mining, as imbalanced data degrade the performance of a classifier by producing inaccurate results.

Design/methodology/approach

In the proposed work, a novel fuzzy-based Gaussian synthetic minority oversampling (FG-SMOTE) algorithm is proposed to process the imbalanced data. The mechanism of the Gaussian SMOTE technique is based on finding the nearest neighbour concept to balance the ratio between minority and majority class datasets. The ratio of the datasets belonging to the minority and majority class is balanced using a fuzzy-based Levenshtein distance measure technique.
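The core Gaussian-SMOTE idea, interpolating between a minority sample and one of its nearest neighbours and adding Gaussian jitter, can be sketched as below. This is a minimal illustration only, not the authors' FG-SMOTE implementation; the fuzzy Levenshtein weighting is omitted and the toy minority points are invented.

```python
import math
import random

def gaussian_smote(minority, k=2, n_new=4, sigma=0.1, seed=42):
    """Generate synthetic minority samples: pick a point, choose one of its
    k nearest neighbours, interpolate between them, then add Gaussian noise."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x (excluding x itself) by Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(
            xi + t * (ni - xi) + rng.gauss(0.0, sigma)
            for xi, ni in zip(x, nb)
        ))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
new_points = gaussian_smote(minority)
print(len(new_points))  # 4 synthetic samples near the minority cluster
```

Generating synthetic points until the minority count matches the majority count is what balances the class ratio.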

Findings

The performance and accuracy of the proposed algorithm were evaluated using a deep belief network classifier. The results showed the efficiency of the fuzzy-based Gaussian SMOTE technique, which achieved an AUC of 93.7%, an F1 score of 94.2% and a geometric mean score of 93.6%, all computed from the confusion matrix.

Research limitations/implications

The proposed research still retains some challenges that need to be addressed, such as applying FG-SMOTE to multiclass imbalanced datasets and evaluating the dataset imbalance problem in a distributed environment.

Originality/value

The proposed algorithm fundamentally solves the data imbalance issues and challenges involved in handling the imbalanced data. FG-SMOTE has aided in balancing minority and majority class datasets.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 2
Type: Research Article
ISSN: 1756-378X

Book part
Publication date: 18 July 2022

Yakub Kayode Saheed, Usman Ahmad Baba and Mustafa Ayobami Raji

Abstract

Purpose: This chapter aims to examine machine learning (ML) models for predicting credit card fraud (CCF).

Need for the study: With the advance of technology, the world increasingly relies on credit cards rather than cash in daily life. This creates a slew of new opportunities for fraudulent individuals to abuse these cards. As of December 2020, global card losses reached $28.65 billion, up 2.9% from $27.85 billion in 2018, according to the 2019 Nilson research. To safeguard credit card users, the card issuer should include a service that protects customers from potential risks. CCF has become a severe threat as internet buying has grown. To this end, various studies in the field of automatic and real-time fraud detection are required. Owing to their advantageous properties, the most recent ones employ a variety of ML algorithms and techniques to construct a well-fitting model for detecting fraudulent transactions. Because credit card data are huge and high-dimensional, feature selection (FS) is critical for improving classification accuracy and fraud detection.

Methodology/design/approach: The objectives of this chapter are to construct a new model for credit card fraud detection (CCFD) based on principal component analysis (PCA) for FS and using supervised ML techniques such as K-nearest neighbour (KNN), ridge classifier, gradient boosting, quadratic discriminant analysis, AdaBoost, and random forest for classification of fraudulent and legitimate transactions. When compared to earlier experiments, the suggested approach demonstrates a high capacity for detecting fraudulent transactions. To be more precise, our model’s resilience is constructed by integrating the power of PCA for determining the most useful predictive features. The experimental analysis was performed on German credit card and Taiwan credit card data sets.
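The PCA-for-feature-selection step followed by a KNN classifier can be sketched as follows. This is a minimal illustration on synthetic data, not the chapter's experimental setup, and the cluster layout is invented.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project X onto its top principal components via SVD of the
    mean-centred data matrix; also return the mean and components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = Vt[:n_components]
    return Xc @ comps.T, mean, comps

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

rng = np.random.default_rng(0)
# toy 'transactions': legitimate class around the origin, fraud shifted, in 5-D
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
Z, mean, comps = pca_fit_transform(X, 2)          # reduce 5-D -> 2-D
z_query = (rng.normal(3, 1, 5) - mean) @ comps.T  # project a new point
print(knn_predict(Z, y, z_query))
```

The same projected space would feed any of the other listed classifiers (ridge, gradient boosting, AdaBoost, random forest) in place of KNN.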

Findings: The experimental findings revealed that KNN achieved an accuracy of 96.29%, recall of 100% and precision of 96.29%, making it the best-performing model on the German data set, while the ridge classifier was the best-performing model on the Taiwan credit data, with an accuracy of 81.75%, recall of 34.89% and precision of 66.61%.

Practical implications: The poor performance of the models on the Taiwan data revealed that it is an imbalanced credit card data set. The comparison of our proposed models with state-of-the-art credit card ML models showed that our results were competitive.

Article
Publication date: 22 October 2018

Sihem Khemakhem, Fatma Ben Said and Younes Boujelbene

Abstract

Purpose

Credit scoring datasets are generally unbalanced. The number of repaid loans is higher than that of defaulted ones. Therefore, the classification of these data is biased toward the majority class, which practically means that it tends to attribute a mistaken “good borrower” status even to “very risky borrowers”. In addition to the use of statistics and machine learning classifiers, this paper aims to explore the relevance and performance of sampling models combined with statistical prediction and artificial intelligence techniques to predict and quantify the default probability based on real-world credit data.

Design/methodology/approach

A real database from a Tunisian commercial bank was used and unbalanced data issues were addressed by the random over-sampling (ROS) and synthetic minority over-sampling technique (SMOTE). Performance was evaluated in terms of the confusion matrix and the receiver operating characteristic curve.
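Random over-sampling, the simpler of the two re-sampling strategies, duplicates randomly chosen minority samples until the classes balance; a minimal sketch, with made-up loan labels rather than the paper's Tunisian bank data:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    reaches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(cls)
    return X_out, y_out

# 8 repaid loans vs 2 defaults, the typical credit-scoring skew
X = [[i] for i in range(10)]
y = ["repaid"] * 8 + ["default"] * 2
Xb, yb = random_oversample(X, y)
print(Counter(yb))  # both classes now have 8 samples
```

SMOTE differs in that it interpolates new synthetic points between minority neighbours instead of duplicating existing ones.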

Findings

The results indicated that combining intelligent and statistical techniques with re-sampling approaches is promising for default rate management and provides accurate credit risk estimates.

Originality/value

This paper empirically investigates the effectiveness of ROS and SMOTE in combination with logistic regression, artificial neural networks and support vector machines. The authors address the role of sampling strategies in the Tunisian credit market and its impact on credit risk. These sampling strategies may help financial institutions to reduce the erroneous classification costs in comparison with the unbalanced original data and may serve as a means for improving the bank’s performance and competitiveness.

Details

Journal of Modelling in Management, vol. 13 no. 4
Type: Research Article
ISSN: 1746-5664

Article
Publication date: 5 April 2021

Seungpeel Lee, Honggeun Ji, Jina Kim and Eunil Park

Abstract

Purpose

With the rapid increase in internet use, most people tend to purchase books through online stores. Several such stores also provide book recommendations for buyer convenience, and both collaborative and content-based filtering approaches have been widely used for building these recommendation systems. However, both approaches have significant limitations, including cold start and data sparsity. To overcome these limitations, this study aims to investigate whether user satisfaction can be predicted based on easily accessible book descriptions.

Design/methodology/approach

The authors collected a large-scale Kindle Books data set containing book descriptions and ratings, and calculated whether a specific book will receive a high rating. For this purpose, several feature representation methods (bag-of-words, term frequency–inverse document frequency [TF-IDF] and Word2vec) and machine learning classifiers (logistic regression, random forest, naive Bayes and support vector machine) were used.
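As a rough illustration of the TF-IDF representation, one common idf variant is sketched below; this is not the paper's exact preprocessing, and the toy book descriptions are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Term frequency x inverse document frequency for a small corpus,
    using idf = log(N / df) + 1 (one common variant)."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        total = sum(tf.values())
        vectors.append({
            term: (count / total) * (math.log(N / df[term]) + 1)
            for term, count in tf.items()
        })
    return vectors

descriptions = [
    "a gripping thriller novel",
    "a heartwarming family novel",
    "technical manual for engineers",
]
vecs = tfidf(descriptions)
# 'novel' appears in 2 of 3 descriptions, so it is down-weighted
# relative to the rarer, more discriminative 'thriller'
print(vecs[0]["thriller"] > vecs[0]["novel"])  # True
```

These sparse vectors are what a classifier such as random forest would then consume, one dimension per vocabulary term.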

Findings

The classifiers used showed substantial accuracy in predicting reader satisfaction. Among them, the random forest classifier combined with the TF-IDF feature representation exhibited the highest accuracy, at 96.09%.

Originality/value

This study revealed that user satisfaction can be predicted based on book descriptions and shed light on the limitations of existing recommendation systems. Further, both practical and theoretical implications have been discussed.

Details

The Electronic Library, vol. 39 no. 1
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 3 July 2023

James L. Sullivan, David Novak, Eric Hernandez and Nick Van Den Berg

Abstract

Purpose

This paper introduces a novel quality measure, the percent-within-distribution, or PWD, for acceptance and payment in a quality control/quality assurance (QC/QA) performance specification (PS).

Design/methodology/approach

The new quality measure takes any sample size or distribution and uses a Bayesian updating process to re-estimate parameters of a design distribution as sample observations are fed through the algorithm. This methodology can be employed in a wide range of applications, but the authors demonstrate the use of the measure for a QC/QA PS with upper and lower bounds on 28-day compressive strength of in-place concrete for bridge decks.
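The Bayesian re-estimation step can be sketched with the standard conjugate update for a Normal mean (known variance), followed by the percent-within computation. The prior, bounds and strength figures below are hypothetical, not taken from the paper.

```python
import math

def update_normal_mean(mu0, tau0_sq, sigma_sq, observations):
    """Conjugate Bayesian update of a Normal mean with known variance:
    prior N(mu0, tau0_sq), likelihood N(mu, sigma_sq) per observation."""
    n = len(observations)
    xbar = sum(observations) / n
    post_var = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
    post_mean = post_var * (mu0 / tau0_sq + n * xbar / sigma_sq)
    return post_mean, post_var

def percent_within(mu, sigma, lower, upper):
    """Share of a N(mu, sigma^2) distribution lying between lower and upper."""
    cdf = lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
    return 100.0 * (cdf(upper) - cdf(lower))

# hypothetical design: 28-day strength ~ N(30, 2^2) MPa; a sample lot of
# three cores pulls the posterior mean estimate upward
mu, var = update_normal_mean(mu0=30.0, tau0_sq=4.0, sigma_sq=4.0,
                             observations=[31.0, 32.5, 31.8])
pwd = percent_within(mu, 2.0, lower=27.0, upper=36.0)
print(round(mu, 2), round(pwd, 1))
```

Feeding each new lot's observations through the update keeps the distribution estimate current, which is the mechanism the PWD measure relies on.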

Findings

The authors demonstrate the use of this new quality measure to illustrate how it addresses the shortcomings of the percent-within-limits (PWL), which is the current industry standard quality measure. The authors then use the PWD to develop initial pay factors through simulation regimes. The PWD is shown to function better than the PWL with realistic sample lots simulated to represent a variety of industry responses to a new QC/QA PS.

Originality/value

The analytical contribution of this work is the introduction of the new quality measure. However, the practical and managerial contributions of this work are of equal significance.

Details

International Journal of Quality & Reliability Management, vol. 41 no. 2
Type: Research Article
ISSN: 0265-671X

Article
Publication date: 11 November 2021

Sandeep Kumar Hegde and Monica R. Mundada

Abstract

Purpose

Chronic diseases are considered one of the most serious concerns and threats to public health across the globe. Diseases such as chronic diabetes mellitus (CDM), cardiovascular disease (CVD) and chronic kidney disease (CKD) are major chronic diseases responsible for millions of deaths. Each of these diseases is considered a risk factor for the other two. Therefore, noteworthy attention is being paid to reducing the risk of these diseases. A gigantic amount of medical data is generated in digital form from smart healthcare appliances in the current era. Although numerous machine learning (ML) algorithms have been proposed for the early prediction of chronic diseases, these algorithmic models are neither generalized nor adaptive when imposed on new disease datasets. Hence, these algorithms have to process a huge amount of disease data iteratively until the model converges. This limitation may make it difficult for ML models to fit the data and may produce imprecise results, and a single algorithm may not yield accurate results. Nonetheless, an ensemble of classifiers built from multiple models, working on a voting principle, has been successfully applied to many classification tasks. The purpose of this paper is to make early predictions of chronic diseases using the hybrid generative regression-based deep intelligence network (HGRDIN) model.

Design/methodology/approach

In the proposed paper, a generative regression (GR) model is used in combination with a deep neural network (DNN) for the early prediction of chronic disease. The GR model obtains prior knowledge about the labelled data by analyzing the correlation between features and class labels. Hence, the weight assignment process of the DNN is influenced by the relationship between attributes rather than by random assignment. The knowledge obtained through these processes is passed as input to the DNN for further prediction. Since the inference about the input data instances is drawn at the DNN through the GR model, the model is named the hybrid generative regression-based deep intelligence network (HGRDIN).

Findings

The credibility of the implemented approach is rigorously validated using various parameters such as accuracy, precision, recall, F score and area under the curve (AUC) score. During the training phase, the proposed algorithm is constantly regularized using the elastic net regularization technique and also hyper-tuned using the various parameters such as momentum and learning rate to minimize the misprediction rate. The experimental results illustrate that the proposed approach predicted the chronic disease with a minimal error by avoiding the possible overfitting and local minima problems. The result obtained with the proposed approach is also compared with the various traditional approaches.

Research limitations/implications

Usually, diagnostic data are multi-dimensional in nature, and the performance of an ML algorithm degrades due to overfitting and curse-of-dimensionality issues. The experiment achieved an average accuracy of 95%. Hence, further analysis can be made to improve predictive accuracy by overcoming the curse-of-dimensionality issues.

Practical implications

The proposed ML model can mimic the behaviour of a doctor's brain, and such algorithms have the capability to automate routine clinical tasks. The accurate results obtained through these innovative algorithms can free physicians from mundane care and practices so that they can focus more on complex issues.

Social implications

Utilizing the proposed predictive model at the decision-making level for the early prediction of the disease is considered as a promising change towards the healthcare sector. The global burden of chronic disease can be reduced at an exceptional level through these approaches.

Originality/value

In the proposed HGRDIN model, a transfer learning approach is used in which the knowledge acquired through the GR process is applied to the DNN, identifying possible relationships between the dependent and independent feature variables by mapping the chronic data instances to their corresponding target classes before they are passed as input to the DNN. Hence, the experiments illustrated that the proposed approach obtained superior performance in terms of the various validation parameters compared with existing conventional techniques.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 15 no. 1
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 24 July 2020

Angelica Lo Duca and Andrea Marchetti

Abstract

Purpose

Ship route prediction (SRP) is a quite complicated task that determines the next position of a ship after a given period of time, given its current position. This paper aims to describe a study that compares five families of multiclass classification algorithms for performing SRP.

Design/methodology/approach

Tested algorithm families include: Naive Bayes (NB), nearest neighbors, decision trees, linear algorithms and extension from binary. A common structure for all the algorithm families was implemented and adapted to the specific case, according to the test to be done. The tests were done on one month of real data extracted from automatic identification system messages, collected around the island of Malta.
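A nearest-neighbour formulation of SRP, one of the families that performed best here, can be sketched as follows; the track data and grid-cell labels are invented for illustration, not drawn from the Malta AIS dataset.

```python
import math
from collections import Counter

def predict_next_cell(history, current, k=3):
    """KNN-style SRP: find the k past positions closest to the current one
    and vote on the grid cell each of those ships moved to next."""
    neighbours = sorted(history, key=lambda rec: math.dist(rec[0], current))[:k]
    votes = Counter(next_cell for _, next_cell in neighbours)
    return votes.most_common(1)[0][0]

# (position, next-cell) pairs from two toy tracks, as lat/lon tuples
history = [
    ((35.90, 14.50), "E"), ((35.91, 14.52), "E"), ((35.89, 14.51), "E"),
    ((36.10, 14.20), "N"), ((36.12, 14.22), "N"),
]
print(predict_next_cell(history, (35.90, 14.51)))  # "E"
```

Casting the next position as a discrete grid cell is what turns SRP into the multiclass classification problem the paper benchmarks.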

Findings

Experiments show that K-nearest neighbors and decision trees algorithms outperform all the other algorithms. Experiments also demonstrate that linear algorithms and NB have a very poor performance.

Research limitations/implications

This study is limited to the area surrounding Malta. Thus, findings cannot be generalized to every context. However, the methodology presented is general and can help other researchers in this area to choose appropriate methods for their problems.

Practical implications

The results of this study can be exploited by maritime surveillance applications to build decision support systems that monitor and predict ship routes in a given area. For example, SRP techniques could be used to protect areas at risk, such as marine protected areas, from illegal fishing.

Originality/value

The paper proposes a solid methodology to perform tests on SRP, based on a series of important machine learning algorithms for the prediction.

Details

Journal of Systems and Information Technology, vol. 22 no. 3
Type: Research Article
ISSN: 1328-7265

Article
Publication date: 28 September 2021

Nageswara Rao Eluri, Gangadhara Rao Kancharla, Suresh Dara and Venkatesulu Dondeti

Abstract

Purpose

Gene selection is considered the fundamental process in the bioinformatics field. Existing methodologies pertaining to cancer classification are mostly clinically based, and their diagnostic capability is limited. Nowadays, significant problems of cancer diagnosis are solved by utilizing gene expression data. Researchers have been introducing many possibilities to diagnose cancer appropriately and effectively. This paper aims to develop cancer data classification using gene expression data.

Design/methodology/approach

The proposed classification model involves three main phases: “(1) Feature extraction, (2) Optimal Feature Selection and (3) Classification”. Initially, five benchmark gene expression datasets are collected. From the collected gene expression data, feature extraction is performed. To diminish the length of the feature vectors, optimal feature selection is performed, for which a new meta-heuristic algorithm termed the quantum-inspired immune clone optimization algorithm (QICO) is used. Once the relevant features are selected, classification is performed by a deep learning model called a recurrent neural network (RNN). Finally, the experimental analysis reveals that the proposed QICO-based feature selection model outperforms the other heuristic-based feature selection methods, and the optimized RNN outperforms the other machine learning methods.

Findings

The proposed QICO-RNN acquires the best outcomes at any learning percentage. At a learning percentage of 85, the accuracy of the proposed QICO-RNN was 3.2% better than RNN, 4.3% better than RF, 3.8% better than NB and 2.1% better than KNN for Dataset 1. For Dataset 2, at a learning percentage of 35, the accuracy of the proposed QICO-RNN was 13.3% better than RNN, 8.9% better than RF and 14.8% better than NB and KNN. Hence, the developed QICO algorithm performs well in accurately classifying cancer data using gene expression data.

Originality/value

This paper introduces a new optimal feature selection model using QICO together with a QICO-based RNN for the effective classification of cancer data using gene expression data; to the best of the authors' knowledge, this is the first work to do so.

Article
Publication date: 1 February 2022

Yaotan Xie and Fei Xiang

Abstract

Purpose

This study aimed to adapt existing text-mining techniques and propose a novel topic recognition approach for textual patient reviews.

Design/methodology/approach

The authors first transformed the multilabel samples to adapt them to the model's training format. Then, an improved method based on dynamic mixed sampling and transfer learning was proposed to address the learning problem caused by imbalanced samples. Specifically, the training of the model was based on a convolutional neural network framework and Word2Vec embeddings self-trained on large-scale corpora.

Findings

Compared with the SVM and other CNN-based models, the CNN + DMS + TL model proposed in this study achieves a significant improvement in F1 score.

Originality/value

The improved methods based on dynamic mixed sampling and transfer learning can adequately manage the learning problem caused by the skewed distribution of samples and achieve the effective and automatic topic recognition of textual patient reviews.

Peer review

The peer-review history for this article is available at: https://publons.com/publon/10.1108/OIR-01-2021-0059.

Details

Online Information Review, vol. 46 no. 6
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 1 March 2022

Larissa Arakawa Martins, Veronica Soebarto, Terence Williamson and Dino Pisaniello

Abstract

Purpose

This paper presents the development of personal thermal comfort models for older adults and assesses the models’ performance compared to aggregate approaches. This is necessary as individual thermal preferences can vary widely between older adults, and the use of aggregate thermal comfort models can result in thermal dissatisfaction for a significant number of older occupants. Personalised thermal comfort models hold the promise of a more targeted and accurate approach.

Design/methodology/approach

Twenty-eight personal comfort models were developed, using deep learning and environmental and personal parameters. The data were collected through a nine-month monitoring study of people aged 65 and over in South Australia who lived independently. Modelling comprised dataset balancing and normalisation, followed by model tuning to test and select the best sets of hyperparameters. Finally, the models were evaluated on an unseen dataset. Accuracy, Cohen's Kappa Coefficient and Area Under the Receiver Operating Characteristic Curve (AUC) were used to measure model performance.
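Cohen's Kappa, one of the evaluation metrics used here, corrects raw agreement for the agreement expected by chance. A minimal sketch with hypothetical thermal-preference labels (not the study's data):

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Agreement between two label sequences, corrected for chance:
    kappa = (observed - expected) / (1 - expected)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # chance agreement from the marginal label frequencies
    expected = sum(true_counts[c] * pred_counts.get(c, 0)
                   for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

y_true = ["warm", "ok", "ok", "cool", "ok", "warm", "cool", "ok"]
y_pred = ["warm", "ok", "ok", "ok",   "ok", "warm", "cool", "cool"]
print(round(cohens_kappa(y_true, y_pred), 3))  # 0.6
```

Kappa of 0 means no better than chance and 1 means perfect agreement, which is why it complements plain accuracy on imbalanced preference labels.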

Findings

On average, the individualised models present an accuracy of 74%, a Cohen’s Kappa Coefficient of 0.61 and an AUC of 0.83, representing a significant improvement in predictive performance when compared to similar studies and the “Converted” Predicted Mean Vote (PMVc) model.

Originality/value

While the current literature on personal comfort models has focussed solely on younger adults and offices, this study explored a methodology for older people and their dwellings. Additionally, it introduced health perception as a predictor of thermal preference – a variable often overlooked by the architectural sciences and building engineering. The study also provided insights into the use of deep learning for future studies.

Details

Smart and Sustainable Built Environment, vol. 11 no. 2
Type: Research Article
ISSN: 2046-6099
