Search results
1 – 10 of 88
Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao and Yuqing Lin
Abstract
Purpose
Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper is to investigate the use of machine learning techniques for malicious web domain identification while considering the class imbalance issue (i.e. there are more benign web domains than malicious ones).
Design/methodology/approach
The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) with particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use SMOTE for oversampling and PSO for undersampling.
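The combination can be illustrated with a minimal pure-Python sketch. The interpolation step below is the core SMOTE idea; plain random undersampling stands in for the paper's PSO-guided selection, and all data and names are hypothetical:

```python
import random

def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def undersample(majority, target, seed=0):
    """Plain random undersampling; the paper instead selects instances
    with PSO."""
    return random.Random(seed).sample(majority, target)

# Hypothetical 2-D feature vectors: many benign, few malicious domains.
benign = [(i * 0.1, i * 0.2) for i in range(20)]
malicious = [(5.0, 1.0), (5.2, 1.1), (5.1, 0.9), (4.9, 1.2)]

target = 8
minority_balanced = malicious + smote_oversample(malicious, target - len(malicious))
majority_balanced = undersample(benign, target)
```

After resampling, both classes have the same size, so any of the eight classifiers mentioned in the findings can be trained on a balanced set.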
Findings
By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective.
Practical implications
This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification.
Originality/value
Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.
Jun Liu, Asad Khattak, Lee Han and Quan Yuan
Abstract
Purpose
Individuals’ driving behavior data are becoming available widely through Global Positioning System devices and on-board diagnostic systems. The incoming data can be sampled at rates ranging from one Hertz (or even lower) to hundreds of Hertz. Failing to capture substantial changes in vehicle movements over time by “undersampling” can cause loss of information and misinterpretations of the data, but “oversampling” can waste storage and processing resources. The purpose of this study is to empirically explore how micro-driving decisions to maintain speed, accelerate or decelerate, can be best captured, without substantial loss of information.
Design/methodology/approach
This study creates a set of indicators to quantify the magnitude of information loss (MIL). Each indicator is calculated as a percentage to index the extent of information loss in different situations, and an overall index named the extent of information loss (EIL) combines the MIL indicators. Data from a driving simulator study collected at 20 Hertz are analyzed (N = 718,481 data points from 35,924 s of driving tests). The study quantifies the relationship between the information loss indicators and sampling rates.
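One MIL-style indicator can be sketched in a few lines: compare how much total speed variation a downsampled trace retains relative to the full-rate trace. This is a simplified illustration on synthetic data, not the paper's exact indicator set:

```python
import math

def speed_changes(speeds):
    """Total absolute speed change in a trace: a crude stand-in for the
    micro-driving information the paper's MIL indicators quantify."""
    return sum(abs(b - a) for a, b in zip(speeds, speeds[1:]))

def downsample(speeds, full_hz, target_hz):
    """Keep every n-th sample to simulate a lower sampling rate."""
    return speeds[::int(full_hz / target_hz)]

# Hypothetical 20-Hz speed trace (km/h) with gentle oscillations.
full_hz = 20
trace = [30 + math.sin(t / 3.0) for t in range(200)]

full_info = speed_changes(trace)
loss = {}  # per-rate information loss, in per cent
for hz in (10, 2, 1):
    kept = speed_changes(downsample(trace, full_hz, hz))
    loss[hz] = 100 * (1 - kept / full_info)
```

Because a coarser trace is a subsequence of a finer one, the triangle inequality guarantees the retained variation only shrinks as the rate drops, mirroring the nonlinear loss curve reported in the findings.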
Findings
The results show that marginally more information is lost as data are sampled down from 20 to 0.5 Hz, but the relationship is not linear. With four indicators of MILs, the overall EIL is 3.85 per cent for driving behavior data sampled at 1 Hz. If sampling rates are higher than 2 Hz, all MILs for information loss are under 5 per cent.
Originality/value
This study contributes by developing a framework for quantifying the relationship between sampling rates and information loss. Depending on the objective of their study, researchers can choose the appropriate sampling rate necessary to obtain the right amount of accuracy.
Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun and Jie Yin
Abstract
Purpose
Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, commercial propaganda or impersonate others. The effective identification of bot accounts is conducive to accurately judge the disseminated information for the public. However, in actual fake account identification, it is expensive and inefficient to manually label Twitter accounts, and the labeled data are usually unbalanced in classes. To this end, the authors propose a novel framework to solve these problems.
Design/methodology/approach
In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to a real Twitter account data set from Kaggle. Specifically, the authors first train the classifier on an initial small amount of labeled account data and then use the trained classifier to automatically label large-scale unlabeled account data. Next, high-confidence instances are iteratively selected from the unlabeled data to expand the labeled data, yielding an expanded Twitter account training set. Notably, a resampling technique is integrated into the self-training process so that the data classes are balanced at the initial stage of the self-training iteration.
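The self-training loop can be sketched with a toy nearest-centroid classifier on one-dimensional features; the classifier, the confidence measure and all data below are illustrative stand-ins, not the authors' pipeline:

```python
def centroids(X, y):
    """Per-class mean of 1-D features: a minimal trainable classifier."""
    out = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        out[label] = sum(pts) / len(pts)
    return out

def predict_with_conf(c, x):
    """Predict the nearest centroid; confidence is the relative gap
    between the distances to the two centroids."""
    d = {label: abs(x - mu) for label, mu in c.items()}
    best = min(d, key=d.get)
    other = max(d, key=d.get)
    conf = (d[other] - d[best]) / (d[other] + d[best] + 1e-9)
    return best, conf

# Small labeled seed set and a pool of unlabeled accounts (hypothetical).
labelled_X = [0.1, 0.2, 0.9, 1.0]
labelled_y = ["human", "human", "bot", "bot"]
unlabelled = [0.15, 0.05, 0.95, 0.5, 0.85]

for _ in range(3):  # a few self-training iterations
    c = centroids(labelled_X, labelled_y)
    remaining = []
    for x in unlabelled:
        label, conf = predict_with_conf(c, x)
        if conf > 0.6:  # keep only high-confidence pseudo-labels
            labelled_X.append(x)
            labelled_y.append(label)
        else:
            remaining.append(x)
    unlabelled = remaining
```

The ambiguous point (0.5) never clears the confidence threshold and stays unlabeled, which is exactly the behavior that keeps pseudo-label noise out of the expanded training set.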
Findings
The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It achieves excellent identification results with six different base classifiers, especially when the initial set of labeled Twitter accounts is small.
Originality/value
This paper provides novel insights in identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method to automatically label Twitter accounts from the semi-supervised background. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on the identification effect.
Abstract
Purpose
This paper aims to propose the new incremental and parallel training algorithm of proximal support vector machines (Inc-Par-PSVM) tailored on the edge device (i.e. the Jetson Nano) to handle the large-scale ImageNet challenging problem.
Design/methodology/approach
Inc-Par-PSVM trains, in an incremental and parallel manner on the Jetson Nano, an ensemble of binary PSVM classifiers used for the One-Versus-All multiclass strategy. Each binary PSVM model is the average of bagged binary PSVM models built on undersampled training data blocks.
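The ensemble structure can be sketched as follows. A tiny perceptron-style trainer stands in for the PSVM solver (the actual PSVM solves a regularized least-squares system), and the data are hypothetical; what the sketch shows is the One-Versus-All decomposition and the averaging of bagged binary models:

```python
import random

def train_linear(X, y, epochs=30, lr=0.1):
    """Tiny perceptron trainer standing in for a binary PSVM solver;
    labels are in {+1, -1}."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            if t * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

def bagged_binary(X, y, n_bags=5, seed=0):
    """Average of models trained on bootstrap blocks, mirroring the
    'average of bagged binary PSVMs built on data blocks' idea."""
    rng = random.Random(seed)
    ws, bs = [], []
    for _ in range(n_bags):
        idx = [rng.randrange(len(X)) for _ in X]
        w, b = train_linear([X[j] for j in idx], [y[j] for j in idx])
        ws.append(w)
        bs.append(b)
    w_avg = [sum(col) / n_bags for col in zip(*ws)]
    return w_avg, sum(bs) / n_bags

def one_vs_all(X, labels):
    """One binary scorer per class (the One-Versus-All strategy)."""
    models = {}
    for c in set(labels):
        y = [1 if l == c else -1 for l in labels]
        models[c] = bagged_binary(X, y)
    return models

def predict(models, x):
    """Predict the class whose binary scorer gives the highest score."""
    def score(m):
        w, b = m
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=lambda c: score(models[c]))

X = [(0.0, 0.0), (0.1, 0.2), (1.0, 0.0), (1.1, 0.1), (0.0, 1.0), (0.1, 1.1)]
labels = ["a", "a", "b", "b", "c", "c"]
models = one_vs_all(X, labels)
```

In the paper's setting, each of the 1,000 ImageNet classes gets one such averaged binary model, and the bags correspond to data blocks processed incrementally so the whole set never has to fit in the Jetson Nano's 4 GB of RAM.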
Findings
The empirical test results on the ImageNet data set show that the Inc-Par-PSVM algorithm with the Jetson Nano (Quad-core ARM A57 @ 1.43 GHz, 128-core NVIDIA Maxwell architecture-based graphics processing unit, 4 GB RAM) is faster and more accurate than the state-of-the-art linear SVM algorithm run on a PC [Intel(R) Core i7-4790 CPU, 3.6 GHz, 4 cores, 32 GB RAM].
Originality/value
The new incremental and parallel PSVM algorithm tailored on the Jetson Nano is able to efficiently handle the large-scale ImageNet challenge with 1.2 million images and 1,000 classes.
Sharifah Heryati Syed Nor, Shafinar Ismail and Bee Wah Yap
Abstract
Purpose
Personal bankruptcy is on the rise in Malaysia. The Insolvency Department of Malaysia reported that personal bankruptcy has increased since 2007, with total accumulated cases standing at 131,282 in 2014. This is an alarming issue because the increasing number of personal bankruptcy cases will have a negative impact on the Malaysian economy, as well as on society. From the perspective of an individual's personal economy, bankruptcy minimizes the chances of securing a job. Apart from that, bankrupts' accounts are frozen, they lose control of their assets and properties, and they are not allowed to start any business or be part of any company's management. Bankrupts are also denied any loan application, restricted from travelling overseas and cannot act as guarantors. This paper aims to investigate this problem by developing a personal bankruptcy prediction model using the decision tree technique.
Design/methodology/approach
In this paper, a bankrupt is defined as a terminated member who failed to settle their loans. The sample comprised 24,546 cases with 17 per cent settled cases and 83 per cent terminated cases. The data included a dependent variable, i.e. bankruptcy status (Y = 1 (bankrupt), Y = 0 (non-bankrupt)), and 12 predictors. SAS Enterprise Miner 14.1 software was used to develop the decision tree model.
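The core of any decision tree is repeatedly picking the split that minimizes class impurity. A minimal sketch of one Gini-based root split on hypothetical records (the real model uses 12 predictors and SAS Enterprise Miner, not this code):

```python
def gini(groups):
    """Weighted Gini impurity of a candidate binary split for 0/1 labels
    (the criterion decision trees typically optimise)."""
    total = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        p1 = sum(g) / len(g)
        score += (1 - p1 * p1 - (1 - p1) ** 2) * len(g) / total
    return score

def best_split(X, y, feature):
    """Find the threshold on one predictor that gives the purest split."""
    best = (None, 1.0)
    for t in sorted(set(x[feature] for x in X)):
        left = [l for x, l in zip(X, y) if x[feature] <= t]
        right = [l for x, l in zip(X, y) if x[feature] > t]
        g = gini([left, right])
        if g < best[1]:
            best = (t, g)
    return best

# Hypothetical records: (age, loan_amount) with bankruptcy status 1/0.
X = [(25, 5), (30, 40), (45, 8), (50, 60), (35, 55), (28, 7)]
y = [0, 1, 0, 1, 1, 0]
threshold, impurity = best_split(X, y, feature=1)
```

On this toy data, splitting on loan amount at 8 separates the classes perfectly (impurity 0.0); a full tree recurses on each side until a stopping rule is met.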
Findings
Upon completion, this study delivers profiles of bankrupts, a reliable personal bankruptcy scoring model and the significant variables of personal bankruptcy.
Practical implications
This decision tree model has potential for patenting and income generation. Financial institutions can use the model to predict potential borrowers' tendency toward personal bankruptcy.
Social implications
The study creates awareness in society of the significant variables of personal bankruptcy so that individuals can avoid becoming bankrupt.
Originality/value
This decision tree model can facilitate and assist financial institutions in evaluating and assessing potential borrowers. It helps to identify potential defaulting borrowers and can assist financial institutions in implementing the right strategies to avoid defaults.
Emmanuel Blanchard, Adrian Sandu and Corina Sandu
Abstract
Purpose
The purpose of this paper is to propose a new computational approach for parameter estimation in the Bayesian framework. A posteriori probability density functions are obtained using the polynomial chaos theory for propagating uncertainties through system dynamics. The new method has the advantage of being able to deal with large parametric uncertainties, non‐Gaussian probability densities and nonlinear dynamics.
Design/methodology/approach
The maximum likelihood estimates are obtained by minimizing a cost function derived from the Bayesian theorem. Direct stochastic collocation is used as a less computationally expensive alternative to the traditional Galerkin approach to propagate the uncertainties through the system in the polynomial chaos framework.
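A cost function derived from Bayes' theorem is typically the negative log-posterior. The form below is a sketch under assumed notation, not the paper's exact expression; its second (prior) term is the regularization that matters when the system is non-identifiable:

```latex
% Negative log-posterior cost. Assumed symbols (not from the paper):
% y_i: measurements, \hat{y}_i(\theta): model response at sample i,
% \Sigma_v: measurement-noise covariance,
% \theta_0, \Sigma_\theta: prior mean and covariance of the parameters.
J(\theta) \;=\; \frac{1}{2} \sum_{i=1}^{N}
  \bigl[y_i - \hat{y}_i(\theta)\bigr]^{\mathsf{T}} \Sigma_v^{-1}
  \bigl[y_i - \hat{y}_i(\theta)\bigr]
  \;+\; \frac{1}{2} \,(\theta - \theta_0)^{\mathsf{T}}
  \Sigma_\theta^{-1} (\theta - \theta_0)
```

Minimizing the first term alone gives the maximum likelihood estimate; in the polynomial chaos framework, $\hat{y}_i(\theta)$ is evaluated cheaply from the chaos expansion rather than by re-simulating the dynamics.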
Findings
The new approach is explained and applied to very simple mechanical systems to illustrate how the Bayesian cost function can be affected by the noise level in the measurements, undersampling, non-identifiability of the system, non-observability and excitation signals that are not rich enough. When the system is non-identifiable and a priori knowledge of the parameter uncertainties is available, regularization techniques can still yield the most likely values among the possible combinations of uncertain parameters resulting in the same time responses as the ones observed.
Originality/value
The polynomial chaos method has been shown to be considerably more efficient than Monte Carlo in the simulation of systems with a small number of uncertain parameters. This is believed to be the first time the polynomial chaos theory has been applied to Bayesian estimation.
Yaotan Xie and Fei Xiang
Abstract
Purpose
This study aimed to adapt existing text-mining techniques and propose a novel topic recognition approach for textual patient reviews.
Design/methodology/approach
The authors first transformed multilabel samples for adapting model training forms. Then, an improved method was proposed based on dynamic mixed sampling and transfer learning to improve the learning problem caused by imbalanced samples. Specifically, the training of our model was based on the framework of a convolutional neural network and self-trained Word2Vector on large-scale corpora.
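The mixed-sampling component can be sketched in pure Python: classes above a target size are undersampled and classes below it are oversampled with replacement. "Dynamic" mixed sampling would recompute the target as training proceeds; one static step is shown here, and the data are hypothetical:

```python
import random

def mixed_sample(data, target, seed=0):
    """Resample each class toward `target` items: undersample classes
    above it, oversample (with replacement) classes below it."""
    rng = random.Random(seed)
    by_class = {}
    for x, label in data:
        by_class.setdefault(label, []).append(x)
    balanced = []
    for label, items in by_class.items():
        if len(items) > target:
            chosen = rng.sample(items, target)  # undersample
        else:
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra              # oversample
        balanced.extend((x, label) for x in chosen)
    return balanced

# Skewed toy data set: 50 majority-topic reviews, 5 minority-topic reviews.
data = [(i, "majority") for i in range(50)] + [(i, "minority") for i in range(5)]
balanced = mixed_sample(data, target=20)
```

The CNN would then be trained on the balanced batches, with its embedding layer initialized from the self-trained Word2Vector vectors (the transfer learning step).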
Findings
Compared with the SVM and other CNN-based models, the CNN + DMS + TL model proposed in this study achieves a significant improvement in F1 score.
Originality/value
The improved methods based on dynamic mixed sampling and transfer learning can adequately manage the learning problem caused by the skewed distribution of samples and achieve the effective and automatic topic recognition of textual patient reviews.
Peer review
The peer-review history for this article is available at: https://publons.com/publon/10.1108/OIR-01-2021-0059.
Bachriah Fatwa Dhini, Abba Suganda Girsang, Unggul Utan Sufandi and Heny Kurniawati
Abstract
Purpose
The authors constructed an automatic essay scoring (AES) model in a discussion forum whose results were compared with scores given by human evaluators. This research proposes essay scoring based on two parameters, semantic and keyword similarities, using a pre-trained SentenceTransformers model that constructs the best-performing vector embeddings. The two similarity measures are combined to optimize the model and increase accuracy.
Design/methodology/approach
The development of the model in the study is divided into seven stages: (1) data collection, (2) pre-processing data, (3) selected pre-trained SentenceTransformers model, (4) semantic similarity (sentence pair), (5) keyword similarity, (6) calculate final score and (7) evaluating model.
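Stages (4)–(6) can be sketched with toy stand-ins. A real system would obtain the vectors from SentenceTransformers embeddings; the vectors, keywords and weights below are hypothetical, and the weighted combination is one plausible reading of "calculate final score":

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (semantic part)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def keyword_similarity(answer_words, rubric_keywords):
    """Fraction of rubric keywords found in the student answer."""
    hits = sum(1 for k in rubric_keywords if k in answer_words)
    return hits / len(rubric_keywords)

def final_score(sem, kw, w_sem=0.7, w_kw=0.3):
    """Weighted combination of the two parameters (weights assumed)."""
    return w_sem * sem + w_kw * kw

# Hypothetical sentence embeddings for student and reference answers.
student_vec, reference_vec = [0.2, 0.8, 0.1], [0.25, 0.75, 0.05]
sem = cosine(student_vec, reference_vec)
kw = keyword_similarity({"photosynthesis", "chlorophyll", "light"},
                        ["photosynthesis", "chlorophyll", "glucose", "light"])
score = final_score(sem, kw)
```

The keyword channel rewards coverage of rubric terms even when phrasing diverges, while the semantic channel rewards paraphrases that embeddings place near the reference answer.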
Findings
The multilingual paraphrase-multilingual-MiniLM-L12-v2 and distilbert-base-multilingual-cased-v1 models obtained the highest scores in a comparison of 11 pre-trained multilingual SentenceTransformers models on Indonesian data (Dhini and Girsang, 2023). Both multilingual models were adopted in this study. The combination of the two parameters is obtained by comparing the keyword extraction responses with the rubric keywords. Based on the experimental results, the proposed combination increases the evaluation results by 0.2.
Originality/value
This study uses discussion forum data from the general biology course in online learning at the open university for the 2020.2 and 2021.2 semesters. Forum discussion ratings are still assigned manually. In this study, the authors created a model that automatically scores discussion forum essays based on the lecturer's answers and rubrics.
Murtaza Nasir, Carole South-Winter, Srini Ragothaman and Ali Dag
Abstract
Purpose
The purpose of this paper is to formulate a framework to construct a patient-specific risk score and thereby classify patients into various risk groups. Medical decision makers can use these groups as a decision support mechanism to augment their decision-making process and optimally use the limited resources available.
Design/methodology/approach
A conventional statistical model (logistic regression) and two machine learning-based data mining models (artificial neural networks (ANNs) and support vector machines) were employed, with five-fold cross-validation in the classification phase. To overcome the data imbalance problem, the random undersampling technique was utilized. After constructing the patient-specific risk score, the k-means clustering algorithm was employed to group the patients into risk groups.
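The final grouping step can be sketched with a tiny one-dimensional k-means over model-output risk scores; the scores and cluster count below are hypothetical stand-ins for the paper's setup:

```python
import random

def kmeans_1d(scores, k=3, iters=20, seed=0):
    """Tiny 1-D k-means: assign each score to its nearest centre, then
    move each centre to the mean of its cluster, and repeat."""
    rng = random.Random(seed)
    centres = rng.sample(scores, k)
    clusters = {}
    for _ in range(iters):
        clusters = {i: [] for i in range(k)}
        for s in scores:
            i = min(range(k), key=lambda j: abs(s - centres[j]))
            clusters[i].append(s)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in clusters.items()]
    return centres, clusters

# Hypothetical patient-specific risk scores in [0, 1] from a classifier.
scores = [0.05, 0.1, 0.12, 0.45, 0.5, 0.55, 0.9, 0.92, 0.95]
centres, clusters = kmeans_1d(scores, k=3)
```

Each resulting cluster corresponds to a risk group (e.g. low, medium, high), which is the form in which the score becomes actionable for allocating limited resources.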
Findings
Results showed that the ANN model achieved the best results, with an area under the curve score of 0.867 and sensitivity and specificity of 0.715 and 0.892, respectively. Also, the construction of patient-specific risk scores offers useful insights to medical experts by helping them find a trade-off between risks, costs and resources.
Originality/value
The study contributes to the existing body of knowledge by constructing a framework that can be utilized to determine the risk level of the targeted patient, by employing data mining-based predictive approach.
Abstract
Purpose
The purpose of this paper is to investigate the relationship between individual investors’ level of sophistication and their expectations of risk and return in the stock market.
Design/methodology/approach
The author combines survey and registry data on individual investors in Sweden to obtain 11 sophistication proxies that previous research has related to individuals’ financial decisions. These proxies are related to a survey measure regarding individual investors’ expectations of risk and return in an index fund using linear regressions.
Findings
The findings in this paper indicate that sophisticated investors have lower risk and higher return expectations that are closer to objective measures than those of less-sophisticated investors.
Originality/value
These results are important, since they enhance the understanding of the underlying mechanisms through which sophistication can influence financial decisions.