Search results

1 – 10 of 88
Article
Publication date: 4 December 2018

Zhongyi Hu, Raymond Chiong, Ilung Pranata, Yukun Bao and Yuqing Lin

Abstract

Purpose

Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper is to investigate the use of machine learning techniques for malicious web domain identification while considering the class imbalance issue (i.e. there are more benign web domains than malicious ones).

Design/methodology/approach

The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use SMOTE for oversampling and PSO for undersampling.
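
To make the idea concrete, here is a minimal sketch (not the authors' implementation; the base classifier, fitness function and PSO hyperparameters are assumptions): SMOTE from imbalanced-learn oversamples the minority class, while a simple binary PSO searches for a subset of majority-class instances that maximises cross-validated F1.

```python
# Sketch: SMOTE oversampling combined with PSO-based undersampling.
# Illustrative only; fitness, classifier and PSO settings are assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def pso_undersample(X_maj, y_maj, X_min, y_min,
                    n_particles=10, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_maj)
    pos = rng.random((n_particles, n))      # keep-probability per instance
    vel = np.zeros_like(pos)
    pbest, pbest_fit = pos.copy(), np.full(n_particles, -np.inf)
    gbest, gbest_fit = pos[0].copy(), -np.inf

    def fitness(p):
        mask = p > 0.5
        if mask.sum() < 5:                  # keep at least a few instances
            return -np.inf
        X = np.vstack([X_maj[mask], X_min])
        y = np.concatenate([y_maj[mask], y_min])
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        return cross_val_score(clf, X, y, cv=3, scoring="f1").mean()

    for _ in range(n_iter):
        for i in range(n_particles):
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest_fit[i], pbest[i] = fit, pos[i].copy()
            if fit > gbest_fit:
                gbest_fit, gbest = fit, pos[i].copy()
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
    return gbest > 0.5                      # final keep-mask

# Usage sketch: undersample the majority class with PSO, then rebalance
# the remainder with SMOTE:
#   keep = pso_undersample(X_maj, y_maj, X_min, y_min)
#   X_bal, y_bal = SMOTE(random_state=0).fit_resample(
#       np.vstack([X_maj[keep], X_min]),
#       np.concatenate([y_maj[keep], y_min]))
```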

Findings

By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective.

Practical implications

This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification.

Originality/value

Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.

Open Access
Article
Publication date: 6 February 2020

Jun Liu, Asad Khattak, Lee Han and Quan Yuan

Abstract

Purpose

Individuals’ driving behavior data are becoming widely available through Global Positioning System devices and on-board diagnostic systems. The incoming data can be sampled at rates ranging from one Hertz (or even lower) to hundreds of Hertz. Failing to capture substantial changes in vehicle movements over time by “undersampling” can cause loss of information and misinterpretations of the data, but “oversampling” can waste storage and processing resources. The purpose of this study is to empirically explore how micro-driving decisions to maintain speed, accelerate or decelerate can best be captured without substantial loss of information.

Design/methodology/approach

This study creates a set of indicators to quantify the magnitude of information loss (MIL). Each indicator is calculated as a percentage of information lost in different situations. An overall index, the extent of information loss (EIL), is created to combine the MIL indicators. Data from a driving simulator study collected at 20 Hertz are analyzed (N = 718,481 data points from 35,924 s of driving tests). The study quantifies the relationship between the information loss indicators and sampling rates.
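
As a toy illustration of the general idea (using a hypothetical indicator, the share of acceleration variance lost after downsampling, rather than the paper's MIL/EIL definitions):

```python
# Downsample a 20 Hz speed trace and report a hypothetical
# information-loss indicator as a percentage per target rate.
import numpy as np

def downsample(speed_20hz, target_hz):
    step = int(20 / target_hz)
    return speed_20hz[::step]

def accel_variance(speed, hz):
    return (np.diff(speed) * hz).var()   # acceleration via finite differences

rng = np.random.default_rng(0)
speed = 20 + np.cumsum(rng.normal(0, 0.05, 20 * 600))  # 10 min at 20 Hz

base = accel_variance(speed, 20)
for hz in (10, 5, 2, 1, 0.5):
    s = downsample(speed, hz)
    loss = 100 * (1 - accel_variance(s, hz) / base)
    print(f"{hz:>4} Hz: {loss:5.1f}% of acceleration variance lost")
```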

Findings

The results show that marginally more information is lost as data are sampled down from 20 to 0.5 Hz, but the relationship is not linear. With four indicators of MIL, the overall EIL is 3.85 per cent for driving behavior data sampled at 1 Hz. If sampling rates are higher than 2 Hz, all MILs are under 5 per cent of information loss.

Originality/value

This study contributes by developing a framework for quantifying the relationship between sampling rates and information loss. Depending on the objective of their study, researchers can use it to choose the sampling rate necessary to achieve the right level of accuracy.

Details

Journal of Intelligent and Connected Vehicles, vol. 3 no. 1
Type: Research Article
ISSN: 2399-9802

Article
Publication date: 29 November 2021

Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun and Jie Yin

Abstract

Purpose

Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, spread commercial propaganda or impersonate others. Effective identification of bot accounts helps the public judge disseminated information accurately. However, in practice, manually labeling Twitter accounts is expensive and inefficient, and the labeled data are usually class-imbalanced. To this end, the authors propose a novel framework to solve these problems.

Design/methodology/approach

In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to a real Twitter account data set from Kaggle. Specifically, the authors first train the classifier on the initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, high-confidence instances are iteratively selected from the unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. Notably, the resampling technique is integrated into the self-training process, and the data classes are balanced at the initial stage of the self-training iteration.
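
A minimal sketch of such a self-training loop with initial resampling (the classifier, confidence threshold and iteration cap are illustrative assumptions, not the framework's settings):

```python
# Self-training with class rebalancing of the initial labeled seed set.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, conf_thresh=0.95, max_iter=10):
    # Balance the small labeled set before the first iteration.
    X_lab, y_lab = SMOTE(random_state=0).fit_resample(X_lab, y_lab)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        pick = proba.max(axis=1) >= conf_thresh
        if not pick.any():
            break
        # Move high-confidence pseudo-labeled instances into the training set.
        X_lab = np.vstack([X_lab, X_unlab[pick]])
        y_lab = np.concatenate([y_lab, proba[pick].argmax(axis=1)])
        X_unlab = X_unlab[~pick]
    return clf, X_lab, y_lab
```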

Findings

The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results with six different base classifiers, especially when the initial set of labeled Twitter accounts is small.

Originality/value

This paper provides novel insights into identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method to automatically label Twitter accounts in a semi-supervised setting. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on identification performance.

Details

Data Technologies and Applications, vol. 56 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 22 July 2022

Thanh-Nghi Do

Abstract

Purpose

This paper aims to propose a new incremental and parallel training algorithm for proximal support vector machines (Inc-Par-PSVM), tailored to an edge device (the Jetson Nano), to handle the challenging large-scale ImageNet problem.

Design/methodology/approach

The Inc-Par-PSVM trains, in an incremental and parallel manner, an ensemble of binary PSVM classifiers used in the one-versus-all multiclass strategy on the Jetson Nano. Each binary PSVM model is the average of bagged binary PSVM models built on undersampled training data blocks.
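
One plausible sketch of this scheme, assuming the common closed-form PSVM formulation (regularized least squares on ±1 labels); the block size, bag count and regularizer below are illustrative:

```python
# One-versus-all ensemble of binary PSVMs, each averaged over bagged,
# undersampled training blocks.
import numpy as np

def psvm_fit(X, y, nu=1.0):
    # y in {-1, +1}; append a bias column and solve in closed form.
    A = np.hstack([X, -np.ones((len(X), 1))])
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + np.eye(d) / nu, A.T @ y)

def bagged_ovr_fit(X, y, n_classes, n_bags=5, block=1000, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for c in range(n_classes):                 # one-versus-all
        yc = np.where(y == c, 1.0, -1.0)
        ws = []
        for _ in range(n_bags):                # undersampled data blocks
            idx = rng.choice(len(X), size=min(block, len(X)), replace=False)
            ws.append(psvm_fit(X[idx], yc[idx]))
        models.append(np.mean(ws, axis=0))     # average the bagged models
    return np.array(models)

def predict(models, X):
    A = np.hstack([X, -np.ones((len(X), 1))])
    return (A @ models.T).argmax(axis=1)
```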

Findings

The empirical test results on the ImageNet data set show that the Inc-Par-PSVM algorithm with the Jetson Nano (Quad-core ARM A57 @ 1.43 GHz, 128-core NVIDIA Maxwell architecture-based graphics processing unit, 4 GB RAM) is faster and more accurate than the state-of-the-art linear SVM algorithm run on a PC [Intel(R) Core i7-4790 CPU, 3.6 GHz, 4 cores, 32 GB RAM].

Originality/value

The new incremental and parallel PSVM algorithm tailored to the Jetson Nano is able to efficiently handle the large-scale ImageNet challenge with 1.2 million images and 1,000 classes.

Details

International Journal of Web Information Systems, vol. 18 no. 2/3
Type: Research Article
ISSN: 1744-0084

Open Access
Article
Publication date: 5 March 2019

Sharifah Heryati Syed Nor, Shafinar Ismail and Bee Wah Yap

Abstract

Purpose

Personal bankruptcy is on the rise in Malaysia. The Insolvency Department of Malaysia reported that personal bankruptcy has increased since 2007, with total accumulated personal bankruptcy cases standing at 131,282 in 2014. This is an alarming issue, because the increasing number of personal bankruptcy cases will have a negative impact on the Malaysian economy as well as on society. For the individual, bankruptcy reduces the chance of securing a job. Apart from that, bankrupts' accounts are frozen, they lose control of their assets and properties, and they are not allowed to start a business or be part of any company's management. Bankrupts are also denied any loan application, restricted from travelling overseas and cannot act as guarantors. This paper aims to investigate this problem by developing a personal bankruptcy prediction model using the decision tree technique.

Design/methodology/approach

In this paper, a bankrupt is defined as a terminated member who failed to settle their loan. The sample comprised 24,546 cases, with 17 per cent settled cases and 83 per cent terminated cases. The data included a dependent variable, i.e. bankruptcy status (Y = 1 (bankrupt), Y = 0 (non-bankrupt)), and 12 predictors. SAS Enterprise Miner 14.1 software was used to develop the decision tree model.
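
A minimal scikit-learn analogue of the modeling step (the study used SAS Enterprise Miner; the split and tree settings here are assumptions):

```python
# Decision tree for personal bankruptcy prediction: X holds the 12
# predictors, y the bankruptcy status (1 = bankrupt, 0 = non-bankrupt).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

def fit_bankruptcy_tree(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50,
                                  class_weight="balanced", random_state=0)
    tree.fit(X_tr, y_tr)
    print(classification_report(y_te, tree.predict(X_te)))
    return tree
```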

Findings

Upon completion, this study produces profiles of bankrupts, a reliable personal bankruptcy scoring model and the significant variables of personal bankruptcy.

Practical implications

This decision tree model has potential for patenting and income generation. Financial institutions can use this model to predict potential borrowers' tendency toward personal bankruptcy.

Social implications

The study creates awareness in society of the significant variables of personal bankruptcy so that individuals can avoid becoming bankrupt.

Originality/value

This decision tree model can facilitate and assist financial institutions in evaluating and assessing potential borrowers. It helps identify potential defaulting borrowers and can assist financial institutions in implementing the right strategies to avoid defaulting borrowers.

Details

Journal of Economics, Finance and Administrative Science, vol. 24 no. 47
Type: Research Article
ISSN: 2077-1886

Article
Publication date: 17 July 2009

Emmanuel Blanchard, Adrian Sandu and Corina Sandu

Abstract

Purpose

The purpose of this paper is to propose a new computational approach for parameter estimation in the Bayesian framework. A posteriori probability density functions are obtained using the polynomial chaos theory for propagating uncertainties through system dynamics. The new method has the advantage of being able to deal with large parametric uncertainties, non‐Gaussian probability densities and nonlinear dynamics.

Design/methodology/approach

The maximum likelihood estimates are obtained by minimizing a cost function derived from Bayes' theorem. Direct stochastic collocation is used as a less computationally expensive alternative to the traditional Galerkin approach to propagate the uncertainties through the system in the polynomial chaos framework.
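
A toy sketch of the overall idea for a single uncertain parameter (the first-order dynamics, prior and noise level are invented for illustration, not the paper's test cases): build a polynomial surrogate of the response by direct collocation, then minimize the resulting Bayesian cost.

```python
# Direct stochastic collocation surrogate + minimization of a Bayesian cost.
import numpy as np
from scipy.optimize import minimize_scalar

def simulate(theta, t):
    return np.exp(-theta * t)                  # toy first-order response

t = np.linspace(0, 5, 50)
nodes = np.polynomial.legendre.leggauss(7)[0]  # collocation nodes in [-1, 1]
thetas = 1.0 + 0.5 * nodes                     # map to the prior range
Y = np.array([simulate(th, t) for th in thetas])
coeffs = [np.polynomial.polynomial.polyfit(thetas, Y[:, k], 4)
          for k in range(len(t))]              # surrogate per time step

def surrogate(theta):
    return np.array([np.polynomial.polynomial.polyval(theta, c)
                     for c in coeffs])

rng = np.random.default_rng(1)
data = simulate(1.2, t) + rng.normal(0, 0.02, t.size)   # synthetic data

def bayes_cost(theta, sigma=0.02, mu0=1.0, s0=0.5):
    misfit = ((data - surrogate(theta)) ** 2).sum() / (2 * sigma ** 2)
    prior = (theta - mu0) ** 2 / (2 * s0 ** 2)
    return misfit + prior                      # negative log posterior + const

theta_map = minimize_scalar(bayes_cost, bounds=(0.5, 1.5),
                            method="bounded").x
print(f"MAP estimate: {theta_map:.3f}")        # should recover ~1.2
```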

Findings

The new approach is explained and applied to very simple mechanical systems in order to illustrate how the Bayesian cost function can be affected by the noise level in the measurements, by undersampling, by non-identifiability of the system, by non-observability and by excitation signals that are not rich enough. When the system is non-identifiable and a priori knowledge of the parameter uncertainties is available, regularization techniques can still yield the most likely values among the possible combinations of uncertain parameters that result in the same time responses as the ones observed.

Originality/value

The polynomial chaos method has been shown to be considerably more efficient than Monte Carlo in the simulation of systems with a small number of uncertain parameters. This is believed to be the first time the polynomial chaos theory has been applied to Bayesian estimation.

Details

Engineering Computations, vol. 26 no. 5
Type: Research Article
ISSN: 0264-4401

Article
Publication date: 1 February 2022

Yaotan Xie and Fei Xiang

Abstract

Purpose

This study aimed to adapt existing text-mining techniques and propose a novel topic recognition approach for textual patient reviews.

Design/methodology/approach

The authors first transformed multilabel samples to adapt them to the model's training form. Then, an improved method based on dynamic mixed sampling and transfer learning was proposed to address the learning problem caused by imbalanced samples. Specifically, the model was trained within a convolutional neural network framework, with Word2Vec self-trained on large-scale corpora.
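
One plausible reading of the mixed-sampling step is sketched below (the per-epoch re-draw and the class-size target are assumptions):

```python
# Dynamic mixed sampling: each epoch, undersample large classes and
# oversample small ones to a common size, so the network sees a freshly
# balanced sample mix every epoch.
import numpy as np

def mixed_sample(X, y, rng, target=None):
    classes, counts = np.unique(y, return_counts=True)
    target = target or int(counts.mean())
    idx = []
    for c in classes:
        pool = np.flatnonzero(y == c)
        idx.append(rng.choice(pool, size=target,
                              replace=len(pool) < target))
    idx = rng.permutation(np.concatenate(idx))
    return X[idx], y[idx]

# Inside a training loop:
#   for epoch in range(n_epochs):
#       X_ep, y_ep = mixed_sample(X_train, y_train, rng)
#       model.train_on(X_ep, y_ep)   # hypothetical training call
```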

Findings

Compared with the SVM and other CNN-based models, the CNN + DMS (dynamic mixed sampling) + TL (transfer learning) model proposed in this study achieves a significant improvement in F1 score.

Originality/value

The improved methods based on dynamic mixed sampling and transfer learning can adequately manage the learning problem caused by the skewed distribution of samples and achieve effective, automatic topic recognition of textual patient reviews.

Peer review

The peer-review history for this article is available at: https://publons.com/publon/10.1108/OIR-01-2021-0059.

Details

Online Information Review, vol. 46 no. 6
Type: Research Article
ISSN: 1468-4527

Open Access
Article
Publication date: 11 October 2023

Bachriah Fatwa Dhini, Abba Suganda Girsang, Unggul Utan Sufandi and Heny Kurniawati

Abstract

Purpose

The authors constructed an automatic essay scoring (AES) model for a discussion forum and compared its results with scores given by human evaluators. This research proposes essay scoring based on two parameters, semantic and keyword similarity, using pre-trained SentenceTransformers models that construct the best vector embeddings. These models are combined to optimize the approach and increase accuracy.

Design/methodology/approach

The development of the model in this study is divided into seven stages: (1) data collection, (2) data pre-processing, (3) selection of a pre-trained SentenceTransformers model, (4) semantic similarity (sentence pair), (5) keyword similarity, (6) final score calculation and (7) model evaluation.
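
Stages (4) to (6) could be sketched as follows; the model name is one of the two adopted in the study, while the keyword-overlap measure and the 0.7/0.3 weights are assumptions:

```python
# Semantic similarity via SentenceTransformers plus keyword similarity,
# combined into a weighted final score.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_similarity(answer, reference):
    emb = model.encode([answer, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def keyword_similarity(answer, rubric_keywords):
    words = set(answer.lower().split())
    hits = sum(kw.lower() in words for kw in rubric_keywords)
    return hits / len(rubric_keywords)

def final_score(answer, reference, rubric_keywords, w_sem=0.7, w_kw=0.3):
    return (w_sem * semantic_similarity(answer, reference)
            + w_kw * keyword_similarity(answer, rubric_keywords))
```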

Findings

The paraphrase-multilingual-MiniLM-L12-v2 and distilbert-base-multilingual-cased-v1 models obtained the highest scores in a comparison of 11 pre-trained multilingual SentenceTransformers models on Indonesian data (Dhini and Girsang, 2023). Both multilingual models were adopted in this study. The combination of the two parameters is obtained by comparing the keyword extraction responses with the rubric keywords. Based on the experimental results, the proposed combination increases the evaluation results by 0.2.

Originality/value

This study uses discussion forum data from the general biology course in online learning at the open university for the 2020.2 and 2021.2 semesters. Forum discussion ratings are still assigned manually. In this study, the authors created a model that automatically scores discussion forum responses, which are essays, based on the lecturer's answers and rubrics.

Details

Asian Association of Open Universities Journal, vol. 18 no. 3
Type: Research Article
ISSN: 1858-3431

Article
Publication date: 23 August 2018

Murtaza Nasir, Carole South-Winter, Srini Ragothaman and Ali Dag

Abstract

Purpose

The purpose of this paper is to formulate a framework to construct a patient-specific risk score and thereby classify patients into risk groups, which can be used as a decision support mechanism by medical decision makers to augment their decision-making process, allowing them to optimally use the limited resources available.

Design/methodology/approach

A conventional statistical model (logistic regression) and two machine learning-based data mining models (artificial neural networks (ANNs) and support vector machines) were employed, using five-fold cross-validation in the classification phase. To overcome the data imbalance problem, the random undersampling technique was utilized. After constructing the patient-specific risk score, the k-means clustering algorithm was employed to group patients into risk groups.
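
A compact sketch of this pipeline in scikit-learn and imbalanced-learn terms (logistic regression stands in for all three models; three risk groups are an assumption):

```python
# Undersample, obtain out-of-fold risk scores via five-fold CV, then
# group patients by risk score with k-means.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.cluster import KMeans

def risk_groups(X, y, n_groups=3):
    X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X, y)
    clf = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities serve as patient-specific risk scores.
    scores = cross_val_predict(clf, X_bal, y_bal, cv=5,
                               method="predict_proba")[:, 1]
    groups = KMeans(n_clusters=n_groups, n_init=10,
                    random_state=0).fit_predict(scores.reshape(-1, 1))
    return scores, groups
```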

Findings

Results showed that the ANN model achieved the best results, with an area under the curve score of 0.867, while the sensitivity and specificity were 0.715 and 0.892, respectively. Also, the constructed patient-specific risk scores offer useful insights to medical experts by helping them find a trade-off between risks, costs and resources.

Originality/value

The study contributes to the existing body of knowledge by constructing a framework that can be utilized to determine the risk level of a targeted patient by employing a data mining-based predictive approach.

Details

Industrial Management & Data Systems, vol. 119 no. 1
Type: Research Article
ISSN: 0263-5577

Article
Publication date: 24 May 2019

Oscar Stålnacke

Abstract

Purpose

The purpose of this paper is to investigate the relationship between individual investors’ level of sophistication and their expectations of risk and return in the stock market.

Design/methodology/approach

The author combines survey and registry data on individual investors in Sweden to obtain 11 sophistication proxies that previous research has related to individuals’ financial decisions. These proxies are related, using linear regressions, to a survey measure of individual investors’ expectations of risk and return in an index fund.
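
A minimal sketch of the regression step (the proxy and outcome variable names are hypothetical):

```python
# Regress a survey-based expectation measure on sophistication proxies.
import statsmodels.api as sm

def relate_expectations(df, proxies, outcome="expected_return"):
    X = sm.add_constant(df[proxies])
    return sm.OLS(df[outcome], X).fit()

# Example (hypothetical column names):
#   model = relate_expectations(df, ["education", "trading_experience",
#                                    "portfolio_value"])
#   print(model.summary())
```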

Findings

The findings in this paper indicate that sophisticated investors have lower risk and higher return expectations that are closer to objective measures than those of less-sophisticated investors.

Originality/value

These results are important, since they enhance the understanding of the underlying mechanisms through which sophistication can influence financial decisions.

Details

Review of Behavioral Finance, vol. 11 no. 1
Type: Research Article
ISSN: 1940-5979
