Search results
1 – 10 of over 5,000
Putta Hemalatha and Geetha Mary Amalanathan
Abstract
Purpose
Adequate resources for learning and for training on the data are an important prerequisite for developing an efficient classifier with outstanding performance. The data usually follow a biased class distribution, that is, an unequal representation of classes within a dataset. This issue is known as the imbalance problem and is one of the most common issues occurring in real-time applications. Learning from imbalanced datasets is a ubiquitous challenge in the field of data mining, as imbalanced data degrade classifier performance and produce inaccurate results.
Design/methodology/approach
In the proposed work, a novel fuzzy-based Gaussian synthetic minority oversampling (FG-SMOTE) algorithm is proposed to process the imbalanced data. The mechanism of the Gaussian SMOTE technique is based on finding the nearest neighbour concept to balance the ratio between minority and majority class datasets. The ratio of the datasets belonging to the minority and majority class is balanced using a fuzzy-based Levenshtein distance measure technique.
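The nearest-neighbour oversampling mechanism described above can be sketched in a few lines. This is an illustrative reconstruction only, not the paper's FG-SMOTE: the fuzzy Levenshtein distance component is omitted, and the function name and the `k` and `sigma` defaults are assumptions.

```python
import numpy as np

def gaussian_smote(X_min, n_new, k=5, sigma=0.1, rng=None):
    """Generate synthetic minority samples: pick a minority point, pick one of
    its k nearest minority neighbours, interpolate between them, then add a
    small Gaussian perturbation (the 'Gaussian' part of Gaussian SMOTE)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per point
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)                       # random minority sample
        j = nbrs[i, rng.integers(min(k, n - 1))]  # one of its neighbours
        lam = rng.random()                        # interpolation factor in [0, 1]
        point = X_min[i] + lam * (X_min[j] - X_min[i])
        synth.append(point + rng.normal(0.0, sigma, size=X_min.shape[1]))
    return np.array(synth)

# toy minority cluster
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = gaussian_smote(X_min, n_new=6, k=2, sigma=0.05, rng=0)
print(X_new.shape)  # (6, 2)
```

The synthetic points stay close to the convex hull of the minority class, which is what lets the classifier see a balanced ratio without fabricating samples far from the observed distribution.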
Findings
The performance and accuracy of the proposed algorithm were evaluated using a deep belief network classifier. The results show the efficiency of the fuzzy-based Gaussian SMOTE technique, which achieved an AUC of 93.7%, an F1 score of 94.2% and a geometric mean score of 93.6%, all computed from the confusion matrix.
Research limitations/implications
The proposed research still leaves open challenges that need attention, such as applying FG-SMOTE to multiclass imbalanced datasets and evaluating the dataset imbalance problem in a distributed environment.
Originality/value
The proposed algorithm addresses the data imbalance issue and the challenges involved in handling imbalanced data. FG-SMOTE aids in balancing the minority and majority class datasets.
Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun and Jie Yin
Abstract
Purpose
Twitter fake accounts are bot accounts created by third-party organizations to influence public opinion, spread commercial propaganda or impersonate others. Effective identification of bot accounts helps the public accurately judge disseminated information. However, in practice, manually labeling Twitter accounts is expensive and inefficient, and the labeled data are usually class-imbalanced. To this end, the authors propose a novel framework to solve these problems.
Design/methodology/approach
In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to a real Twitter account data set from Kaggle. Specifically, the authors first train the classifier on the initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, high-confidence instances are iteratively selected from the unlabeled data to expand the labeled data, yielding an expanded Twitter account training set. It is worth mentioning that the resampling technique is integrated into the self-training process, so the data classes are balanced at the initial stage of the self-training iteration.
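The self-training loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a nearest-centroid model stands in for the base classifier, and the helper names and the confidence threshold are assumptions.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid 'classifier': one mean vector per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_conf(centroids, X):
    """Predicted labels plus a confidence score (softmax over negative
    distances to the class centroids)."""
    classes = sorted(centroids)
    d = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    p = np.exp(-d) / np.exp(-d).sum(axis=0)       # (n_classes, n_samples)
    return np.array(classes)[p.argmax(axis=0)], p.max(axis=0)

def self_train(X_lab, y_lab, X_unl, threshold=0.9, max_iter=10):
    """Iteratively pseudo-label high-confidence unlabeled points and add them
    to the labeled pool, as in semi-supervised self-training."""
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(max_iter):
        model = fit_centroids(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        pred, conf = predict_conf(model, X_unl)
        keep = conf >= threshold
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unl[keep]])   # expand the labeled pool
        y_lab = np.concatenate([y_lab, pred[keep]])
        X_unl = X_unl[~keep]
    return X_lab, y_lab

# one labeled point per class, plus unlabeled points near each cluster
X_lab = np.array([[0.0, 0.0], [5.0, 5.0]])
y_lab = np.array([0, 1])
X_unl = np.array([[0.2, 0.1], [4.8, 5.1], [0.1, 0.3], [5.2, 4.9]])
X2, y2 = self_train(X_lab, y_lab, X_unl, threshold=0.8)
print(len(X2), sorted(y2.tolist()))  # all four unlabeled points get pseudo-labeled
```

In the paper's framework, a resampling step would be applied to the labeled pool at the start of the iterations so the base classifier never trains on a badly skewed class ratio.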
Findings
The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results with six different base classifiers, especially when the initial labeled set of Twitter accounts is small.
Originality/value
This paper provides novel insights into identifying Twitter fake accounts. First, the authors are the first to introduce a self-training method to automatically label Twitter accounts in a semi-supervised setting. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on identification.
Sihem Khemakhem, Fatma Ben Said and Younes Boujelbene
Abstract
Purpose
Credit scoring datasets are generally unbalanced. The number of repaid loans is higher than that of defaulted ones. Therefore, the classification of these data is biased toward the majority class, which practically means that it tends to attribute a mistaken “good borrower” status even to “very risky borrowers”. In addition to the use of statistics and machine learning classifiers, this paper aims to explore the relevance and performance of sampling models combined with statistical prediction and artificial intelligence techniques to predict and quantify the default probability based on real-world credit data.
Design/methodology/approach
A real database from a Tunisian commercial bank was used and unbalanced data issues were addressed by the random over-sampling (ROS) and synthetic minority over-sampling technique (SMOTE). Performance was evaluated in terms of the confusion matrix and the receiver operating characteristic curve.
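As a sketch of the evaluation measures mentioned above, the confusion matrix and the area under the ROC curve can be computed directly; the toy labels, scores and 0.5 threshold below are illustrative, not the paper's data.

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix: rows = actual class (0/1), columns = predicted."""
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

def auc_score(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability that
    a randomly chosen positive outranks a randomly chosen negative."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y_true = np.array([0, 0, 0, 0, 1, 1])        # imbalanced: few defaulted loans
scores = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.7])
y_pred = (scores >= 0.5).astype(int)
print(confusion_matrix(y_true, y_pred))
print(auc_score(y_true, scores))  # 1.0: every default outranks every repaid loan
```

The AUC is threshold-free, which is why it is the usual companion to the confusion matrix when classes are as unbalanced as repaid versus defaulted loans.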
Findings
The results indicated that combining intelligent and statistical techniques with re-sampling approaches is promising for default rate management and provides accurate credit risk estimates.
Originality/value
This paper empirically investigates the effectiveness of ROS and SMOTE in combination with logistic regression, artificial neural networks and support vector machines. The authors address the role of sampling strategies in the Tunisian credit market and its impact on credit risk. These sampling strategies may help financial institutions to reduce the erroneous classification costs in comparison with the unbalanced original data and may serve as a means for improving the bank’s performance and competitiveness.
M'hamed Bilal Abidine, Mourad Oussalah, Belkacem Fergani and Hakim Lounis
Abstract
Purpose
Mobile phone-based human activity recognition (HAR) consists of inferring the user's activity type from the analysis of inertial mobile sensor data. This paper aims mainly to introduce a new classification approach called adaptive k-nearest neighbors (AKNN) for intelligent HAR using smartphone inertial sensors, with a potential real-time implementation on the smartphone platform.
Design/methodology/approach
The proposed method puts forward several modifications to the KNN baseline, using kernel discriminant analysis for feature reduction and hybridizing weighted support vector machines and KNN to tackle imbalanced class data sets.
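The weighted-voting idea behind tackling class imbalance can be illustrated with a class-weighted KNN vote. This is a hedged sketch with inverse-frequency weights, not the authors' AKNN; the function name and toy data are assumptions.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, class_weight=None):
    """KNN vote where each neighbour's vote is scaled by its class weight,
    so rare-class neighbours count more under class imbalance."""
    if class_weight is None:
        # inverse-frequency weights, a common default for imbalanced data
        classes, counts = np.unique(y_train, return_counts=True)
        class_weight = {c: len(y_train) / cnt for c, cnt in zip(classes, counts)}
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]                      # k nearest neighbours
    votes = {}
    for i in idx:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + class_weight[y_train[i]]
    return max(votes, key=votes.get)

# five samples of class 0, two of class 1: an imbalanced toy set
X_train = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [1.0], [1.1]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1])
print(weighted_knn_predict(X_train, y_train, np.array([0.7]), k=3))  # prints 1
```

An unweighted 3-NN vote on this query would pick the majority class; the inverse-frequency weighting flips the decision toward the rare class, which is the behaviour one wants on imbalanced activity data.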
Findings
Extensive experiments on five large-scale daily activity recognition data sets were performed to demonstrate the effectiveness of the method in terms of error rate, recall, precision, F1-score and computational/memory resources, with several comparisons against state-of-the-art methods and other hybridization modes. The results showed that the proposed method achieves more than 50% improvement in the error rate metric and up to 5.6% in F1-score. The training phase is also shown to be reduced by a factor of six compared to the baseline, which provides solid assets for smartphone implementation.
Practical implications
This work builds a bridge to the growing body of machine learning work on learning with small data sets. Besides, the availability of systems able to perform on-the-fly activity recognition on a smartphone will have a significant impact in the field of pervasive health care, supporting a variety of practical applications such as elderly care, ambient assisted living and remote monitoring.
Originality/value
The purpose of this study is to build and test an accurate offline model using only compact training data, which can reduce the computational and memory complexity of the system. This provides grounds for developing new innovative hybridization modes in the context of daily activity recognition and smartphone-based implementation. The study demonstrates that the new AKNN can classify the data without a separate training step, because it uses no fitted model and only uses memory resources to store the corresponding support vectors.
Femi Emmanuel Ayo, Olusegun Folorunso, Friday Thomas Ibharalu and Idowu Ademola Osinuga
Abstract
Purpose
Hate speech is an expression of intense hatred. Twitter has become a popular analytical tool for the prediction and monitoring of abusive behaviors. Hate speech detection with social media data has witnessed special research attention in recent studies; hence the need to design a generic metadata architecture and an efficient feature extraction technique to enhance hate speech detection.
Design/methodology/approach
This study proposes hybrid embeddings enhanced with a topic inference method, together with an improved cuckoo search neural network, for hate speech detection in Twitter data. The hybrid embeddings technique includes Term Frequency-Inverse Document Frequency (TF-IDF) for word-level feature extraction and Long Short-Term Memory (LSTM), a variant of the recurrent neural network architecture, for sentence-level feature extraction. The extracted features from the hybrid embeddings then serve as input to the improved cuckoo search neural network, which predicts whether a tweet is hate speech, offensive language or neither.
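The word-level TF-IDF component of the hybrid embeddings can be sketched from first principles. The LSTM sentence-level part is omitted here, and the `+1` smoothing and toy tweets are assumptions, not the authors' exact formulation.

```python
import math

def tf_idf(docs):
    """Word-level TF-IDF features: term frequency scaled by inverse document
    frequency, producing one feature vector per document."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    df = {w: sum(w in d.split() for d in docs) for w in vocab}
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}  # +1 keeps common words
    vectors = []
    for d in docs:
        words = d.split()
        vec = [words.count(w) / len(words) * idf[w] for w in vocab]
        vectors.append(vec)
    return vocab, vectors

docs = ["this tweet is fine", "this tweet is hateful hateful"]
vocab, X = tf_idf(docs)
print(vocab)
print([round(v, 3) for v in X[1]])  # 'hateful' dominates the second vector
```

Words that appear in every tweet get low IDF and contribute little, while rarer, more discriminative terms dominate the vector handed to the downstream classifier.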
Findings
The proposed method showed better results than other related methods when tested on the collected Twitter datasets. To validate its performance, paired-sample t-tests and post hoc multiple comparisons were conducted to compare the means of the proposed method and the other related hate speech detection methods.
Research limitations/implications
The evaluation results showed that the proposed method outperforms other related methods, with a mean F1-score of 91.3.
Originality/value
The main novelty of this study is the use of an automatic topic spotting measure based on naïve Bayes model to improve features representation.
Vishakha Pareek, Santanu Chaudhury and Sanjay Singh
Abstract
Purpose
The electronic nose (e-nose) is an array of chemical or gas sensors coupled with a pattern-recognition framework capable of identifying and classifying odorant or non-odorant and simple or complex gases. Despite more than 30 years of research, robust e-nose devices remain limited. Most of the challenges to reliable e-nose devices are associated with the non-stationary environment and non-stationary sensor behaviour: the data distribution of the sensor array response evolves with time, which is referred to as non-stationarity. The purpose of this paper is to provide a comprehensive introduction to the challenges related to non-stationarity in e-nose design and to review the existing literature from an application, system and algorithm perspective to provide an integrated and practical view.
Design/methodology/approach
The authors discuss non-stationary data in general and the challenges related to the non-stationary environment and non-stationary sensor behaviour in e-nose design. The challenges are categorised and discussed from the perspective of learning with data obtained from sensor systems. E-nose technology is then reviewed from a system, application and algorithmic point of view to assess its current status.
Findings
The discussed challenges in e-nose design will benefit researchers as well as practitioners, as the paper presents a comprehensive view of multiple aspects of non-stationary learning, systems, algorithms and applications for the e-nose. The paper reviews the pattern-recognition techniques and public data sets commonly used in olfactory research, and presents generic techniques for learning in non-stationary environments. The authors also discuss future research directions and major open problems related to handling non-stationarity in e-nose design.
Originality/value
The authors are the first to review the existing literature on learning with the e-nose in a non-stationary environment alongside generic pattern-recognition algorithms for learning in non-stationary environments, bridging the gap between the two. They also present details of publicly available sensor array data sets, which will benefit upcoming researchers in this field, and emphasise several open problems and future directions that should be considered to provide efficient solutions that can handle non-stationarity and make the e-nose the next everyday device.
Waqar Ahmed Khan, S.H. Chung, Muhammad Usman Awan and Xin Wen
Abstract
Purpose
The purpose of this paper is to conduct a comprehensive review of the noteworthy contributions made in the area of the Feedforward neural network (FNN) to improve its generalization performance and convergence rate (learning speed); to identify new research directions that will help researchers to design new, simple and efficient algorithms and users to implement optimal designed FNNs for solving complex problems; and to explore the wide applications of the reviewed FNN algorithms in solving real-world management, engineering and health sciences problems and demonstrate the advantages of these algorithms in enhancing decision making for practical operations.
Design/methodology/approach
The FNN has gained much popularity during the last three decades. Therefore, the authors have focused on algorithms proposed during the last three decades. The selected databases were searched with popular keywords: “generalization performance,” “learning rate,” “overfitting” and “fixed and cascade architecture.” Combinations of the keywords were also used to get more relevant results. Duplicated articles in the databases, non-English language, and matched keywords but out of scope, were discarded.
Findings
The authors studied a total of 80 articles and classified them into six categories according to the nature of the algorithms proposed in these articles, all of which aim at improving the generalization performance and convergence rate of FNNs. Reviewing and discussing all six categories would make the paper too long, so the authors further divided the six categories into two parts (i.e. Part I and Part II). The current paper, Part I, investigates the two categories that focus on learning algorithms (i.e. gradient learning algorithms for network training and gradient-free learning algorithms). The remaining four categories, which mainly explore optimization techniques, are reviewed in Part II (i.e. optimization algorithms for learning rate, bias and variance (underfitting and overfitting) minimization algorithms, constructive topology neural networks and metaheuristic search algorithms). For the sake of simplicity, the paper entitled "Machine learning facilitated business intelligence (Part II): Neural networks optimization techniques and applications" is referred to as Part II. This results in a division of the 80 articles into 38 for Part I and 42 for Part II. After discussing the FNN algorithms with their technical merits and limitations, along with real-world management, engineering and health sciences applications for each category, the authors suggest seven new future directions (three in Part I and the other four in Part II) which can contribute to strengthening the literature.
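As a minimal illustration of the gradient learning algorithms this first category covers, a one-hidden-layer FNN can be trained by plain gradient descent. The toy data, layer sizes and learning rate below are assumptions for illustration, not drawn from the reviewed works.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy 2-class data: logical OR of the two inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])

# one hidden layer of 4 units, trained by full-batch gradient descent
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
lr, losses = 1.0, []
for _ in range(200):
    h = sigmoid(X @ W1 + b1)                  # forward pass
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    g_out = 2 * (out - y) * out * (1 - out)   # backprop through MSE + sigmoid
    g_h = (g_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ g_out
    b2 -= lr * g_out.sum(axis=0)
    W1 -= lr * X.T @ g_h
    b1 -= lr * g_h.sum(axis=0)

print(round(losses[0], 3), round(losses[-1], 3))  # the loss shrinks as the FNN learns
```

The convergence rate of exactly this kind of loop (how fast `losses` shrinks per iteration) and its generalization beyond the four training points are the two properties the reviewed algorithms try to improve.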
Research limitations/implications
The FNN contributions are numerous and cannot be covered in a single study. The authors remain focused on learning algorithms and optimization techniques, along with their application to real-world problems, proposing to improve the generalization performance and convergence rate of FNNs with the characteristics of computing optimal hyperparameters, connection weights, hidden units, selecting an appropriate network architecture rather than trial and error approaches and avoiding overfitting.
Practical implications
This study will help researchers and practitioners to deeply understand the merits and limitations of existing FNN algorithms, the research gaps, the application areas and the changes in research focus over the last three decades. Moreover, users, after gaining in-depth knowledge of how the algorithms are applied in the real world, may apply appropriate FNN algorithms to get optimal results in the shortest possible time, with less effort, for their specific application problems.
Originality/value
The existing literature surveys are limited in scope: they compare algorithms, study algorithm application areas or focus on specific techniques. That is, existing surveys concentrate on some specific algorithms or their applications (e.g. pruning algorithms, constructive algorithms, etc.). In this work, the authors propose a comprehensive review of the different categories of algorithms that may affect FNN generalization performance and convergence rate, along with their real-world applications. This makes the classification scheme novel and significant.
Wen-Qian Lou, Bin Wu and Bo-Wen Zhu
Abstract
Purpose
This study aims to clarify influencing factors of overcapacity of new energy enterprises in China and accurately predict whether these enterprises have overcapacity.
Design/methodology/approach
Based on relevant data, including experience and evidence from the capital market in China, the research establishes a generic univariate selection-comparative machine learning model to study the factors that affect overcapacity of new energy enterprises across five dimensions: governmental intervention, market demand, corporate finance, corporate governance and corporate decision-making. Moreover, the bridging approach is used to strengthen the findings from the quantitative studies with the results from the qualitative studies.
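The generic univariate selection idea (score each feature on its own, keep the best) can be sketched as follows. The correlation-based score, function name and toy data are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def univariate_select(X, y, k):
    """Generic univariate selection sketch: score each feature independently
    (here by |Pearson correlation| with the label) and keep the top-k."""
    scores = []
    for j in range(X.shape[1]):
        c = np.corrcoef(X[:, j], y)[0, 1]
        scores.append(abs(c))
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top), np.array(scores)

rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, n)                    # 1 = overcapacity (toy label)
informative = y + rng.normal(0, 0.3, n)      # strongly related to the label
noise1, noise2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
X = np.column_stack([noise1, informative, noise2])
selected, scores = univariate_select(X, y, k=1)
print(selected)  # the informative feature (index 1) is ranked first
```

Because each feature is scored in isolation, the method is cheap and interpretable, which is what makes the ten selected features easy to discuss through the bridging approach; the trade-off is that it cannot detect features that matter only in combination.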
Findings
The authors' results show that the overcapacity of new energy enterprises in China is brought about by the combined effect of governmental intervention, corporate governance and corporate decision-making. Governmental interventions increase the overcapacity risk of new energy enterprises mainly by distorting the investment behaviour of enterprises. Corporate decision and corporate governance factors affect overcapacity mainly by regulating the degree of overconfidence of the management team and the agency cost. Among the eight comparable integrated models, generic univariate selection-bagging exhibits the best comprehensive generalization performance: its area under the receiver operating characteristic curve (AUC), accuracy, precision and recall are 0.719, 0.960, 0.975 and 0.983, respectively.
Originality/value
The proposed integrated model analyzes the causes and predicts the presence of overcapacity of new energy enterprises, helping governments formulate appropriate strategies to deal with overcapacity and helping new energy enterprises optimize resource allocation. Ten main features that affect the overcapacity of new energy enterprises in China are identified through the generic univariate selection model. Through the bridging approach, the impact of these main features on overcapacity, and the mechanism of that influence, are analyzed.
Abstract
Purpose
The rise of cryptocurrencies and other digital assets has triggered concerns about regulation and security. Governments and regulatory bodies are challenged to create frameworks that protect consumers, combat money laundering and address the risks linked to digital assets. Conventional approaches to confiscation and anti-money laundering are deemed insufficient in this evolving landscape. The absence of a central authority and the use of encryption hinder the identification of asset owners and the tracking of illicit activities. Moreover, the international and cross-border nature of digital assets complicates matters, demanding global coordination. The purpose of this study is to highlight that, to combat money laundering effectively, legislative action, innovative investigative techniques and public–private partnerships are crucial.
Design/methodology/approach
The focal point of this paper is Australia’s approach to law enforcement in the realm of digital assets. It underscores the pivotal role of robust confiscation mechanisms in disrupting criminal networks operating through digital means. The paper firmly asserts that staying ahead of the curve and maintaining an agile stance is paramount. Criminals are quick to embrace emerging technologies, necessitating proactive measures from policymakers and law enforcement agencies.
Findings
It is argued that an agile and comprehensive approach is vital in countering money laundering, as criminals adapt to new technologies. Policymakers and law enforcement agencies must remain proactively ahead of these developments to efficiently identify, trace and seize digital assets involved in illicit activities, thereby safeguarding the integrity of the global financial system.
Originality/value
This paper provides a distinctive perspective by examining Australia’s legal anti-money laundering and counterterrorism financing framework, along with its law enforcement strategies within the realm of the digital asset landscape. While there is a plethora of literature on both asset confiscation and digital assets, there is a noticeable absence of exploration into their interplay, especially within the Australian context.
Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao and Yuqing Lin
Abstract
Purpose
Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper is to investigate the use of machine learning techniques for malicious web domain identification while considering the class imbalance issue (i.e. there are more benign web domains than malicious ones).
Design/methodology/approach
The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use the SMOTE for oversampling and PSO for undersampling.
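The combined resampling idea can be sketched as below. This is an illustration only: SMOTE-style interpolation handles the oversampling, while plain random subsampling stands in for the paper's PSO-guided undersampling, and the meet-in-the-middle target size is an assumption.

```python
import numpy as np

def resample(X, y, rng=None):
    """Balance a binary data set by SMOTE-style interpolation on the minority
    class and subsampling of the majority class, meeting in the middle.
    (The paper guides the undersampling with PSO; plain random subsampling
    stands in here for illustration.)"""
    rng = np.random.default_rng(rng)
    X_min, X_maj = X[y == 1], X[y == 0]
    target = (len(X_min) + len(X_maj)) // 2
    # SMOTE-style oversampling: interpolate between random minority pairs
    synth = []
    while len(X_min) + len(synth) < target:
        i, j = rng.integers(len(X_min), size=2)
        lam = rng.random()
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    X_min = np.vstack([X_min] + synth) if synth else X_min
    # undersampling: keep a random subset of the majority class
    keep = rng.choice(len(X_maj), size=target, replace=False)
    X_maj = X_maj[keep]
    Xb = np.vstack([X_min, X_maj])
    yb = np.concatenate([np.ones(len(X_min)), np.zeros(len(X_maj))])
    return Xb, yb

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)            # 10 malicious vs 90 benign domains
Xb, yb = resample(X, y, rng=3)
print(int(yb.sum()), int((yb == 0).sum()))   # both classes end up at 50
```

Meeting in the middle avoids the two failure modes of using either technique alone: pure oversampling multiplies minority noise, while pure undersampling throws away most of the benign examples.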
Findings
By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective.
Practical implications
This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification.
Originality/value
Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.