Search results

1 – 10 of 298
Book part
Publication date: 1 September 2021

Son Nguyen, Phyllis Schumacher, Alan Olinsky and John Quinn

Abstract

We study the performances of various predictive models including decision trees, random forests, neural networks, and linear discriminant analysis on an imbalanced data set of home loan applications. During the process, we propose our undersampling algorithm to cope with the issues created by the imbalance of the data. Our technique is shown to work competitively against popular resampling techniques such as random oversampling, undersampling, synthetic minority oversampling technique (SMOTE), and random oversampling examples (ROSE). We also investigate the relation between the true positive rate, true negative rate, and the imbalance of the data.
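
The chapter's own undersampling algorithm is not reproduced in this listing; as a hedged illustration of the kind of baseline comparison the abstract describes, the sketch below contrasts no resampling, random oversampling, random undersampling and SMOTE (ROSE is omitted, as it is an R package) on a synthetic stand-in for the loan data, using scikit-learn and imbalanced-learn.

```python
# Minimal sketch (not the authors' algorithm): comparing the standard resampling
# baselines named in the abstract on a synthetic imbalanced data set.
# Requires scikit-learn and imbalanced-learn; the data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for an imbalanced loan-application data set (5% positives).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {
    "none": None,
    "random oversampling": RandomOverSampler(random_state=0),
    "random undersampling": RandomUnderSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
}

for name, sampler in samplers.items():
    # Resample only the training split, then evaluate on untouched test data.
    Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(Xr, yr)
    score = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"{name:>22}: balanced accuracy = {score:.3f}")
```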

Article
Publication date: 4 December 2018

Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao and Yuqing Lin

Abstract

Purpose

Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper is to investigate the use of machine learning techniques for malicious web domain identification, taking into account the class imbalance issue (i.e. there are more benign web domains than malicious ones).

Design/methodology/approach

The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) with particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use SMOTE for oversampling and PSO for undersampling.
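
As a rough illustration of such an oversampling-plus-undersampling pipeline (not the authors' implementation), the sketch below chains SMOTE with a random undersampler standing in for the PSO step, which in the paper searches for the subset of majority-class instances to retain; the data set and sampling ratios are illustrative.

```python
# Hedged sketch: SMOTE oversampling followed by undersampling, then a classifier.
# RandomUnderSampler is only a stand-in for the paper's PSO-based undersampler.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Illustrative imbalanced data set (roughly 10% "malicious" instances).
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=1)

resample_then_classify = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.5, random_state=1)),            # oversample minority
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=1)),  # stand-in for PSO
    ("clf", GradientBoostingClassifier(random_state=1)),
])
print(cross_val_score(resample_then_classify, X, y, scoring="roc_auc", cv=5).mean())
```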

Findings

By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective.

Practical implications

This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification.

Originality/value

Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.

Article
Publication date: 29 September 2020

G. Sreeram, S. Pradeep, K. Sreenivasa Rao, B. Deevana Raju and Parveen Nikhat

Abstract

Purpose

The paper aims at precise and fast classification of network traffic, which has become indispensable. The main performance difficulty of today's intrusion detection systems (IDS) is a low detection rate for less frequent attack classes together with a high false alarm rate.

Design/methodology/approach

Network intrusion detection examines and analyses the traffic on a system in order to uncover dangerous and potentially harmful exchanges and to safeguard the confidentiality, availability and integrity of the network. Precise and fast classification of this traffic has therefore become indispensable, yet the main performance difficulty of today's intrusion detection systems (IDS) is a low detection rate for less frequent attack classes together with a high false alarm rate. With this motivation, a hybrid methodology is proposed for IDS that pairs the discrete wavelet transform with an artificial neural network (ANN). The class imbalance in the data set is removed through synthetic minority oversampling technique (SMOTE)-based oversampling of the infrequent attack classes and undersampling of the dominant class. A three-layer ANN is used for classification, and experimental results on the knowledge discovery databases (KDD) data are reported in terms of accuracy and detection rate, together with true positive and false positive rates.
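
A rough sketch of this reconstructed pipeline (DWT features, SMOTE rebalancing, a small feed-forward ANN) is given below; the data, wavelet and network sizes are illustrative stand-ins rather than the authors' settings, and the sketch requires PyWavelets, scikit-learn and imbalanced-learn.

```python
# Hedged sketch of the described pipeline; the synthetic data stands in for KDD traffic.
import numpy as np
import pywt
from imblearn.over_sampling import SMOTE
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=4000, n_features=32, weights=[0.95, 0.05],
                           random_state=0)

def dwt_features(X, wavelet="haar"):
    """Concatenate approximation and detail coefficients of a 1-level DWT per sample."""
    cA, cD = pywt.dwt(X, wavelet, axis=1)
    return np.hstack([cA, cD])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# Rebalance the (transformed) training data only.
X_tr_b, y_tr_b = SMOTE(random_state=0).fit_resample(dwt_features(X_tr), y_tr)

# One hidden layer gives an input-hidden-output ("three layer") network.
ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
ann.fit(X_tr_b, y_tr_b)
print(classification_report(y_te, ann.predict(dwt_features(X_te))))
```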

Findings

With this motivation, a hybrid methodology pairing the discrete wavelet transform with an artificial neural network (ANN) is proposed for IDS. The class imbalance in the data set is removed through synthetic minority oversampling technique (SMOTE)-based oversampling of the infrequent attack classes and undersampling of the dominant class.

Originality/value

Because network intrusion detection must safeguard the confidentiality, availability and integrity of networked systems, precise classification of connections becomes necessary. The central performance issue of current intrusion detection systems is their low detection rate for infrequent attack classes and their high false alarm rate, which the proposed hybrid approach addresses.

Details

International Journal of Pervasive Computing and Communications, vol. 17 no. 1
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 28 February 2019

Gabrijela Dimic, Dejan Rancic, Nemanja Macek, Petar Spalevic and Vida Drasute

Abstract

Purpose

This paper aims to deal with the previously unknown prediction accuracy of students’ activity pattern in a blended learning environment.

Design/methodology/approach

To extract the most relevant subset of activity features, different feature-selection methods were applied. Classification models were then compared across feature subsets of different cardinality.
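
As an illustration of this kind of comparison (not the paper's own feature-selection methods or activity data), the sketch below evaluates classifiers on feature subsets of different cardinality chosen by a univariate selector.

```python
# Hypothetical sketch: feature subsets of varying cardinality, compared with two classifiers.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=30, n_informative=8, random_state=0)

for k in (5, 10, 20, 30):                      # subset cardinalities to compare
    for clf in (GaussianNB(), DecisionTreeClassifier(random_state=0)):
        pipe = make_pipeline(SelectKBest(mutual_info_classif, k=k), clf)
        acc = cross_val_score(pipe, X, y, cv=5).mean()
        print(f"k={k:>2} {type(clf).__name__:>22}: accuracy={acc:.3f}")
```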

Findings

Experimental evaluation contradicts the hypothesis that reducing the dimensionality of the feature vector increases prediction accuracy.

Research limitations/implications

Improving prediction accuracy in the described learning environment was based on applying the synthetic minority oversampling technique (SMOTE), which affected the results of the correlation-based feature-selection method.

Originality/value

The major contribution of the research is the proposed methodology for selecting the optimal low-cardinal subset of students’ activities and significant prediction accuracy improvement in a blended learning environment.

Details

Information Discovery and Delivery, vol. 47 no. 2
Type: Research Article
ISSN: 2398-6247

Article
Publication date: 29 November 2021

Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun and Jie Yin

Abstract

Purpose

Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, spread commercial propaganda or impersonate others. The effective identification of bot accounts helps the public accurately judge the information being disseminated. However, in practice, manually labeling Twitter accounts is expensive and inefficient, and the labeled data are usually imbalanced across classes. To this end, the authors propose a novel framework to solve these problems.

Design/methodology/approach

In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to the real Twitter account data set from Kaggle. Specifically, the authors first train the classifier on the initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, high-confidence instances are iteratively selected from the unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. It is worth mentioning that the resampling technique is integrated into the self-training process, and the data classes are balanced at the initial stage of the self-training iterations.
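
A simplified, hypothetical sketch of this self-training loop with resampling at the initial stage is shown below; the classifier, confidence threshold and synthetic data are illustrative, not the authors' settings.

```python
# Hedged sketch: balance a small labelled pool, then iteratively pseudo-label
# high-confidence unlabelled instances and add them to the training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
labelled = rng.choice(len(y), size=200, replace=False)          # small labelled pool
unlabelled = np.setdiff1d(np.arange(len(y)), labelled)

# Balance the initial labelled data (small neighbourhood since the pool is tiny).
X_lab, y_lab = SMOTE(k_neighbors=3, random_state=0).fit_resample(X[labelled], y[labelled])

for _ in range(5):                                              # self-training iterations
    clf = RandomForestClassifier(random_state=0).fit(X_lab, y_lab)
    proba = clf.predict_proba(X[unlabelled])
    confident = proba.max(axis=1) >= 0.9                        # high-confidence instances
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X[unlabelled][confident]])
    y_lab = np.concatenate([y_lab, clf.classes_[proba[confident].argmax(axis=1)]])
    unlabelled = unlabelled[~confident]
```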

Findings

The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results with six different base classifiers, especially when the initial set of labeled Twitter accounts is small.

Originality/value

This paper provides novel insights in identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method to automatically label Twitter accounts from the semi-supervised background. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on the identification effect.

Details

Data Technologies and Applications, vol. 56 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 5 March 2024

Sana Ramzan and Mark Lokanan

Abstract

Purpose

This study aims to objectively synthesize the volume of accounting literature on financial statement fraud (FSF) using a systematic literature review research method (SLRRM). This paper analyzes the vast FSF literature based on inclusion and exclusion criteria. These criteria filter articles that are present in the accounting fraud domain and are published in peer-reviewed quality journals based on Australian Business Deans Council (ABDC) journal ranking. Lastly, a reverse search, analyzing the articles' abstracts, further narrows the search to 88 peer-reviewed articles. After examining these 88 articles, the results imply that the current literature is shifting from traditional statistical approaches towards computational methods, specifically machine learning (ML), for predicting and detecting FSF. This evolution of the literature is influenced by the impact of micro and macro variables on FSF and the inadequacy of audit procedures to detect red flags of fraud. The findings also concluded that A* peer-reviewed journals accepted articles that showed a complete picture of performance measures of computational techniques in their results. Therefore, this paper contributes to the literature by providing insights to researchers about why ML articles on fraud do not make it to top accounting journals and which computational techniques are the best algorithms for predicting and detecting FSF.

Design/methodology/approach

This paper chronicles the cluster of narratives surrounding the inadequacy of current accounting and auditing practices in preventing and detecting Financial Statement Fraud. The primary objective of this study is to objectively synthesize the volume of accounting literature on financial statement fraud. More specifically, this study will conduct a systematic literature review (SLR) to examine the evolution of financial statement fraud research and the emergence of new computational techniques to detect fraud in the accounting and finance literature.

Findings

The storyline of this study illustrates how the literature has evolved from conventional fraud detection mechanisms to computational techniques such as artificial intelligence (AI) and machine learning (ML). The findings also concluded that A* peer-reviewed journals accepted articles that showed a complete picture of performance measures of computational techniques in their results. Therefore, this paper contributes to the literature by providing insights to researchers about why ML articles on fraud do not make it to top accounting journals and which computational techniques are the best algorithms for predicting and detecting FSF.

Originality/value

This paper contributes to the literature by providing researchers with insights into the evolution of the accounting fraud literature from traditional statistical methods to machine learning algorithms for fraud detection and prediction.

Details

Journal of Accounting Literature, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0737-4607

Article
Publication date: 26 August 2014

Bilal M’hamed Abidine, Belkacem Fergani, Mourad Oussalah and Lamya Fergani

Abstract

Purpose

The task of identifying activity classes from sensor information in a smart home is very challenging because of the imbalanced nature of such data sets, where some activities occur more frequently than others. Probabilistic models such as the Hidden Markov Model (HMM) and Conditional Random Fields (CRF) are commonly employed for this purpose. The paper aims to discuss these issues.

Design/methodology/approach

In this work, the authors propose a robust strategy that combines the Synthetic Minority Over-sampling Technique (SMOTE) with Cost-Sensitive Support Vector Machines (CS-SVM) and an adaptive tuning of the cost parameter in order to handle the imbalanced data problem.
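
As a hedged illustration (not the authors' implementation), the sketch below chains SMOTE with a class-weighted SVM, with a plain grid search standing in for the paper's adaptive tuning of the cost parameter; the data set is simulated.

```python
# Hedged sketch: SMOTE followed by a cost-sensitive SVM, cost parameter tuned by grid search.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("svm", SVC(kernel="rbf", class_weight="balanced")),  # cost-sensitive via class weights
])
search = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10, 100]}, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```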

Findings

The results demonstrate the usefulness of the approach through comparison with state-of-the-art approaches, including HMM, CRF, traditional C-Support Vector Machines (C-SVM) and Cost-Sensitive SVM (CS-SVM), for classifying activities using binary and ubiquitous sensors.

Originality/value

Performance metrics in the experiments include accuracy, precision/recall and the F-measure.

Details

Kybernetes, vol. 43 no. 8
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 8 February 2021

Xin Tian, Jing Selena He and Meng Han

Abstract

Purpose

This paper aims to explore the latest study of the emerging data-driven approach in the area of FinTech. This paper attempts to provide comprehensive comparisons, including the advantages and disadvantages of different data-driven algorithms applied to FinTech. This paper also attempts to point out the future directions of data-driven approaches in the FinTech domain.

Design/methodology/approach

This paper explores and summarizes the latest data-driven approaches and algorithms applied in FinTech, grouped into the following categories: risk management, data privacy protection, portfolio management and sentiment analysis.

Findings

This paper details comparisons between existing works in FinTech, contrasting traditional data analytics techniques with the latest developments. A framework for the analysis process is developed, and insights regarding implementation, regulation and workforce development in this area are provided.

Originality/value

To the best of the authors' knowledge, this paper is the first to consider broad aspects of data-driven approaches applied to the FinTech industry and to explore the potential, challenges and limitations of this area. This study provides a valuable reference for both current and future participants.

Details

Information Discovery and Delivery, vol. 49 no. 2
Type: Research Article
ISSN: 2398-6247

Book part
Publication date: 26 October 2017

Son Nguyen, John Quinn and Alan Olinsky

Abstract

We propose an oversampling technique to increase the true positive rate (sensitivity) in classifying imbalanced datasets (i.e., those with a value for the target variable that occurs with a small frequency) and hence boost the overall performance measurements such as balanced accuracy, G-mean and area under the receiver operating characteristic (ROC) curve, AUC. This oversampling method is based on the idea of applying the Synthetic Minority Oversampling Technique (SMOTE) on only a selective portion of the dataset instead of the entire dataset. We demonstrate the effectiveness of our oversampling method with four real and simulated datasets generated from three models.
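
The chapter's selection rule is not given in this listing; the sketch below illustrates only the general idea of applying SMOTE to a chosen portion of the data rather than all of it, using a hypothetical uncertainty-based selection rule as a stand-in.

```python
# Hedged sketch of "selective SMOTE": resample only part of the data, keep the rest untouched.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.93, 0.07],
                           random_state=0)

# Hypothetical selection rule: the 1,000 samples a preliminary model is least sure about.
prelim = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
p = prelim.predict_proba(X)[:, 1]
selected = np.zeros(len(y), dtype=bool)
selected[np.argsort(np.abs(p - 0.5))[:1000]] = True

# SMOTE is applied only to the selected portion; the remainder is left as-is.
X_sel, y_sel = SMOTE(k_neighbors=3, random_state=0).fit_resample(X[selected], y[selected])
X_aug = np.vstack([X[~selected], X_sel])
y_aug = np.concatenate([y[~selected], y_sel])
print("class counts before:", np.bincount(y), "after:", np.bincount(y_aug))
```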

Details

Advances in Business and Management Forecasting
Type: Book
ISBN: 978-1-78743-069-3

Article
Publication date: 30 November 2021

Minh Thanh Vo, Anh H. Vo and Tuong Le

Abstract

Purpose

Medical images are increasingly common; therefore, deep learning-based analysis of these images to help diagnose diseases has become more and more essential. Recently, the shoulder implant X-ray image classification (SIXIC) dataset, which includes X-ray images of implanted shoulder prostheses produced by four manufacturers, was released. Detecting the implant's model helps select the correct equipment and procedures for the upcoming surgery.

Design/methodology/approach

This study proposes a robust model named X-Net to improve predictive performance for shoulder implant X-ray image classification on the SIXIC dataset. The X-Net model utilizes a Squeeze-and-Excitation (SE) block integrated into a Residual Network (ResNet) module. The SE module weighs each feature map extracted by ResNet, which aids in improving performance. Feature extraction in X-Net is therefore performed by both the ResNet and SE modules, and the final feature is obtained by combining the features extracted in these steps, capturing more of the important characteristics of the X-ray images in the input dataset. Next, X-Net uses this fine-grained feature to classify the input images into the four classes (Cofield, Depuy, Zimmer and Tornier) of the SIXIC dataset.
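
X-Net's exact architecture is not reproduced in this listing; the following minimal PyTorch sketch shows only the general pattern the abstract describes, an SE block reweighting the feature maps of a residual block, with illustrative layer sizes.

```python
# Hedged sketch of an SE-augmented residual block; sizes and depth are illustrative.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise reweighting: squeeze (global pooling) then excitation (two FC layers)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x).view(x.size(0), -1, 1, 1)   # per-channel weights in [0, 1]
        return x * w

class SEResidualBlock(nn.Module):
    """A basic ResNet-style block whose output feature maps are reweighted by an SE block."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SEBlock(channels)

    def forward(self, x):
        return torch.relu(x + self.se(self.conv(x)))

block = SEResidualBlock(64)
print(block(torch.randn(2, 64, 56, 56)).shape)   # quick shape check on a dummy batch
```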

Findings

Experiments are conducted to show the proposed approach's effectiveness compared with other state-of-the-art methods for SIXIC. The experimental results indicate that the approach outperforms the other methods tested across several performance metrics. In addition, the proposed approach provides new state-of-the-art results on all performance metrics, namely accuracy, precision, recall, F1-score and area under the curve (AUC), for the experimental dataset.

Originality/value

The proposed method with high predictive performance can be used to assist in the treatment of injured shoulder joints.

Details

Data Technologies and Applications, vol. 56 no. 3
Type: Research Article
ISSN: 2514-9288
