Search results

1–10 of over 21,000
Book part
Publication date: 29 September 2023

Torben Juul Andersen

Abstract

This chapter outlines how the comprehensive North American and European datasets were collected and explains the ensuing data-cleaning process, outlining three alternative methods applied to deal with missing values: the complete case, multiple imputation (MI), and K-nearest neighbor (KNN) methods. The complete case method is the conventional approach adopted in many mainstream management studies. We further discuss the implied assumption underlying use of this technique, which is rarely assessed or tested in practice, and explain the alternative imputation approaches in detail. Use of North American data is the norm, but we also collected a European dataset, which is rarely done, to enable subsequent comparative analysis between these geographical regions. We introduce the structure of firms organized within different industry classification schemes for use in the ensuing comparative analyses and provide base information on missing values in the original and cleaned datasets. The calculated performance indicators derived from the sampled data are defined and presented. We show how the three alternative approaches to dealing with missing values have significantly different effects on the calculated performance measures in terms of extreme estimate ranges and mean performance values.
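
For readers who want to reproduce this kind of comparison, the sketch below illustrates the three treatments with scikit-learn; the imputers (IterativeImputer as a MICE-style multiple imputation, KNNImputer for K-nearest neighbor) and all column names, missingness rates and settings are illustrative assumptions rather than the chapter's own implementation.

```python
# Minimal sketch of the three missing-value treatments outlined above, applied
# to a hypothetical firm-level dataset with a return-on-assets column. Column
# names, the ~15% missingness rate and imputer settings are illustrative
# assumptions, not the chapter's actual procedure.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.normal(100, 30, 500),
    "assets": rng.normal(200, 50, 500),
    "roa": rng.normal(0.05, 0.10, 500),
})
df = df.mask(rng.random(df.shape) < 0.15)  # inject ~15% missing values

# 1) Complete case: drop every observation with any missing value.
complete_case = df.dropna()

# 2) Multiple imputation (MICE-style): chained regressions with posterior draws.
mice = IterativeImputer(sample_posterior=True, random_state=0)
mi_data = pd.DataFrame(mice.fit_transform(df), columns=df.columns)

# 3) KNN imputation: fill each gap from the k most similar observations.
knn_data = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df),
                        columns=df.columns)

# Compare how each treatment shifts the calculated performance measure.
for name, data in [("complete case", complete_case),
                   ("multiple imputation", mi_data),
                   ("KNN imputation", knn_data)]:
    print(f"{name}: n={len(data)}, mean ROA={data['roa'].mean():.4f}")
```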

Details

A Study of Risky Business Outcomes: Adapting to Strategic Disruption
Type: Book
ISBN: 978-1-83797-074-2

Book part
Publication date: 12 July 2023

Sahan Savas Karatasli

Abstract

This paper discusses data-collection strategies that use digitized historical newspaper archives to study social conflicts and social movements from a global and historical perspective, focusing on nationalist movements. I present an analysis of the State-Seeking Nationalist Movements (SSNM) dataset I, which includes news articles reporting on state-seeking activities throughout the world from 1804 to 2013, drawn from the New York Times and the Guardian/Observer. In discussing this new source of data and its relative value, I explain the various benefits and challenges involved in using digitized historical newspaper archives for world-historical analysis of social movements. I also introduce strategies that can be used to detect and minimize some potential sources of bias. I demonstrate the utility of the strategies introduced in this paper by assessing the reliability of the SSNM dataset I and by comparing it to alternative datasets. The analysis presented in the paper also compares labor-intensive manual data-coding strategies to automated approaches. In doing so, it explains why labor-intensive manual coding strategies will continue to be an invaluable tool for world-historical sociologists in a world of big data.

Details

Methodological Advances in Research on Social Movements, Conflict, and Change
Type: Book
ISBN: 978-1-80117-887-7

Book part
Publication date: 26 October 2017

Son Nguyen, John Quinn and Alan Olinsky

Abstract

We propose an oversampling technique to increase the true positive rate (sensitivity) in classifying imbalanced datasets (i.e., those in which one value of the target variable occurs with small frequency) and hence boost overall performance measures such as balanced accuracy, G-mean and the area under the receiver operating characteristic (ROC) curve (AUC). This oversampling method is based on the idea of applying the Synthetic Minority Oversampling Technique (SMOTE) to only a selective portion of the dataset instead of the entire dataset. We demonstrate the effectiveness of our oversampling method with four real and simulated datasets generated from three models.
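
As a rough illustration of the idea, the sketch below applies imbalanced-learn's SMOTE to only part of the training data and recombines it with the untouched remainder; since the abstract does not say how the portion is selected, a stratified random split is used purely for illustration.

```python
# Hedged sketch of "selective" SMOTE: oversample only a chosen portion of the
# training data rather than the whole set. The selection rule below (a random
# 50% stratified split) is an assumption made for illustration only.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Split the training data into the portion to be oversampled and the portion
# left untouched (both keep examples of each class).
X_sel, X_rest, y_sel, y_rest = train_test_split(
    X_train, y_train, train_size=0.5, stratify=y_train, random_state=0)

smote = SMOTE(random_state=0)
X_sel_res, y_sel_res = smote.fit_resample(X_sel, y_sel)

# Recombine the oversampled portion with the untouched remainder.
X_balanced = np.vstack([X_sel_res, X_rest])
y_balanced = np.concatenate([y_sel_res, y_rest])

print("class counts before:", np.bincount(y_train))
print("class counts after :", np.bincount(y_balanced))
```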

Details

Advances in Business and Management Forecasting
Type: Book
ISBN: 978-1-78743-069-3

Book part
Publication date: 29 September 2023

Torben Juul Andersen

Abstract

This chapter first analyzes how the data-cleaning process affects the share of missing values in the extracted European and North American datasets. It then examines how three different approaches to treating missing values, the Complete Case, Multiple Imputation by Chained Equations (MICE), and K-Nearest Neighbor (KNN) imputation methods, affect the number of firms and their average lifespan in the datasets compared to the original sample, assessed across different SIC industry divisions. This is extended to consider the implied effects on the distribution of a key performance indicator, return on assets (ROA), calculating skewness and kurtosis measures for each of the treatment methods and across industry contexts. This consistently shows highly negatively skewed distributions with high positive excess kurtosis across all the industries, where the KNN imputation treatment creates results with distribution characteristics that are closest to the original untreated data. We further analyze the persistency of the (extreme) left-skewed tails measured in terms of the share of outliers and extreme outliers, which shows consistently high percentages of outliers, around 15% of the full sample, and extreme outliers, around 7.5%, indicating pervasive skewness in the data. Of the three alternative approaches to dealing with missing values, the KNN imputation treatment is found to be the method that generates final datasets that most closely resemble the original data, even though the Complete Case approach remains the norm in mainstream studies. One consequence of this is that most empirical studies are likely to underestimate the prevalence of extreme negative performance outcomes.
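
The following sketch shows how such distributional diagnostics for ROA could be computed; the outlier rule is not stated in the abstract, so Tukey's boxplot fences (1.5x and 3x the interquartile range) are assumed, and the synthetic left-skewed series is purely illustrative.

```python
# Sketch of the distributional diagnostics described above for a ROA series:
# skewness, excess kurtosis and the shares of (extreme) outliers. The abstract
# does not state the outlier rule; Tukey's fences (1.5x and 3x IQR) are assumed.
import numpy as np
from scipy import stats

def tail_diagnostics(roa: np.ndarray) -> dict:
    q1, q3 = np.percentile(roa, [25, 75])
    iqr = q3 - q1
    outlier = (roa < q1 - 1.5 * iqr) | (roa > q3 + 1.5 * iqr)
    extreme = (roa < q1 - 3.0 * iqr) | (roa > q3 + 3.0 * iqr)
    return {
        "skewness": stats.skew(roa),
        "excess_kurtosis": stats.kurtosis(roa),   # Fisher definition: normal = 0
        "outlier_share": outlier.mean(),
        "extreme_outlier_share": extreme.mean(),
    }

# Example with a synthetic left-skewed ROA distribution.
rng = np.random.default_rng(1)
roa = 0.08 - 0.05 * rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
print(tail_diagnostics(roa))
```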

Details

A Study of Risky Business Outcomes: Adapting to Strategic Disruption
Type: Book
ISBN: 978-1-83797-074-2

Article
Publication date: 25 January 2024

Besiki Stvilia and Dong Joon Lee

Abstract

Purpose

This study addresses the need for a theory-guided, rich, descriptive account of research data repositories' (RDRs) understanding of data quality and the structures of their data quality assurance (DQA) activities. Its findings can help develop operational DQA models and best-practice guides and identify opportunities for innovation in DQA activities.

Design/methodology/approach

The study analyzed 122 data repositories' applications for the Core Trustworthy Data Repositories certification, interview transcripts of 32 curators and repository managers, and data curation-related webpages of their repository websites. The combined dataset represented 146 unique RDRs. The study was guided by a theoretical framework comprising activity theory and an information quality evaluation framework.

Findings

The study provided a theory-based examination of the DQA practices of RDRs, summarized as a conceptual model. The authors identified three DQA activities, evaluation, intervention and communication, and their structures, including activity motivations, roles played, and mediating tools, rules and standards. When defining data quality, study participants went beyond the traditional definition of data quality and referenced seven facets of ethical and effective information systems in addition to data quality. Furthermore, the participants and RDRs referenced 13 dimensions in their DQA models. The study revealed that DQA activities were prioritized by data value, level of quality, available expertise, cost and funding incentives.

Practical implications

The study's findings can inform the design and construction of digital research data curation infrastructure components on university campuses that aim to provide access not just to big data but to trustworthy data. Communities of practice focused on repositories and archives could consider adding FAIR operationalizations, extensions and metrics focused on data quality. The availability of such metrics and associated measurements can help reusers determine whether they can trust and reuse a particular dataset. The findings of this study can help to develop such data quality assessment metrics and intervention strategies in a sound and systematic way.

Originality/value

To the best of the authors' knowledge, this paper is the first data quality theory guided examination of DQA practices in RDRs.

Details

Journal of Documentation, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 28 September 2023

Moh. Riskiyadi

Abstract

Purpose

This study aims to compare machine learning models, datasets and training-testing splits using data mining methods to detect financial statement fraud.

Design/methodology/approach

This study uses a quantitative approach based on secondary data from the financial reports of companies listed on the Indonesia Stock Exchange in the last ten years, from 2010 to 2019. The research variables comprise financial and non-financial variables. Indicators of financial statement fraud are determined based on notes or sanctions from regulators and financial statement restatements with special supervision.

Findings

The findings show that the Extremely Randomized Trees (ERT) model performs better than other machine learning models. The original-sampling dataset performs best compared to other dataset treatments, and the 80:10 training-testing split is the best among the splitting treatments tested. The ERT model with an original-sampling dataset and an 80:10 training-testing split is therefore the most appropriate for detecting future financial statement fraud.
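
A minimal sketch of an Extremely Randomized Trees classifier in scikit-learn is shown below for orientation; the input file, column names and evaluation metric are placeholders rather than the study's actual variables or protocol.

```python
# Minimal sketch of an Extremely Randomized Trees (ERT) fraud classifier in
# scikit-learn. The CSV file, feature columns and label column are hypothetical
# placeholders; the study's actual variables are financial and non-financial
# indicators of listed companies.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("financial_statements.csv")   # hypothetical input file
X = df.drop(columns=["fraud_label"])            # financial + non-financial variables
y = df["fraud_label"]                           # 1 = fraud indicator, 0 = otherwise

# Hold out a test set; the study compares several training-testing splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

ert = ExtraTreesClassifier(n_estimators=500, random_state=0)
ert.fit(X_train, y_train)

proba = ert.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
```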

Practical implications

This study can be used by regulators, investors, stakeholders and financial crime experts to add insight into better methods of detecting financial statement fraud.

Originality/value

This study proposes a machine learning model that has not been discussed in previous studies and performs comparisons to obtain the best financial statement fraud detection results. Practitioners and academics can use the findings for further research development.

Details

Asian Review of Accounting, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1321-7348

Open Access
Article
Publication date: 18 July 2022

Youakim Badr

Abstract

Purpose

In this research, the authors demonstrate the advantage of reinforcement learning (RL) based intrusion detection systems (IDS) in solving very complex problems (e.g. selecting input features, considering scarce resources and constraints) that cannot be solved by classical machine learning. The authors include a comparative study building intrusion detection based on statistical machine learning and representational learning, using the Knowledge Discovery in Databases (KDD) Cup 99 and Information Security Centre of Excellence (ISCX) 2012 datasets.

Design/methodology/approach

The methodology applies a data analytics approach consisting of data exploration and machine learning model training and evaluation. To build a network-based intrusion detection system, the authors apply a dueling double deep Q-networks architecture enabled with costly features, k-nearest neighbors (K-NN), support vector machines (SVM) and convolutional neural networks (CNN).
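
For orientation, the sketch below shows the standard dueling Q-network head (separate value and advantage streams) in PyTorch; the layer sizes are arbitrary and the paper's cost-aware ("costly features") mechanism is not reproduced.

```python
# Minimal PyTorch sketch of a dueling Q-network head as used in dueling double
# DQN agents. Layer sizes are arbitrary; the paper's cost-aware feature
# selection is not reproduced here.
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, n_features: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.value = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))              # state value V(s)
        self.advantage = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))  # advantages A(s, a)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x)
        v = self.value(h)
        a = self.advantage(h)
        # Subtracting the mean advantage keeps the V/A decomposition identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

# Example: Q-values for a batch of 4 flow-feature vectors (41 features, as in KDD Cup 99).
net = DuelingQNetwork(n_features=41, n_actions=2)
q_values = net(torch.randn(4, 41))
print(q_values.shape)  # torch.Size([4, 2])
```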

Findings

Machine learning-based intrusion detection systems are trained on historical datasets, which leads to model drift and a lack of generalization, whereas RL is trained with data collected through interactions. RL learns from its interactions with a stochastic environment in the absence of a training dataset, whereas supervised learning simply learns from collected data and requires fewer computational resources.

Research limitations/implications

All machine learning models achieved high accuracy values and performance. One potential reason is that both datasets are simulated rather than realistic. It was not clear whether validation was ever performed to show that the data were collected from real network traffic.

Practical implications

The study provides guidelines to implement IDS with classical supervised learning, deep learning and RL.

Originality/value

The research applied the dueling double deep Q-networks architecture enabled with costly features to build network-based intrusion detection from network traffic. This research presents a comparative study of reinforcement learning-based intrusion detection with counterparts built with statistical and representational machine learning.

Article
Publication date: 8 August 2022

Ean Zou Teoh, Wei-Chuen Yau, Thian Song Ong and Tee Connie

Abstract

Purpose

This study aims to develop a regression-based machine learning model to predict housing prices and to determine and interpret the factors that contribute to housing prices using different publicly available data sets. The significant determinants that affect housing prices are first identified by using multinomial logistic regression (MLR) based on the level of relative importance. A comprehensive study is then conducted by using SHapley Additive exPlanations (SHAP) analysis to examine the features that cause the major changes in housing prices.

Design/methodology/approach

Predictive analytics is an effective way to deal with uncertainties in process modelling and to improve decision-making for housing price prediction. The focus of this paper is two-fold: the authors first apply regression analysis to investigate how well the housing independent variables contribute to housing price prediction. Two data sets are used for this study, namely, the Ames Housing dataset and the Melbourne Housing dataset. For both data sets, random forest regression performs the best, achieving an average R2 of 86% for the Ames dataset and 85% for the Melbourne dataset, respectively. Second, multinomial logistic regression is adopted to investigate and identify the factor determinants of housing sales price. For the Ames dataset, the authors find that the top three most significant factor variables determining the housing price are the general living area, basement size and age of remodelling. As for the Melbourne dataset, properties having more rooms/bathrooms, larger land size and a closer distance to the central business district (CBD) are priced higher. This is followed by a comprehensive analysis of how these determinants contribute to the predictability of the selected regression model by using explainable SHAP values. These prominent factors can be used to determine the optimal price range of a property, which is useful for decision-making by both buyers and sellers.
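
A condensed sketch of the two building blocks, a random forest price model and SHAP-based feature attribution, is given below; the file name and column names are placeholders and do not reflect the Ames or Melbourne schemas.

```python
# Sketch of the two ingredients described above: a random forest regressor for
# price prediction and SHAP values to rank feature contributions. The CSV file
# and column names are hypothetical placeholders.
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("housing.csv")                    # hypothetical dataset
X = df.drop(columns=["sale_price"])
y = df["sale_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))

# SHAP values explain how each feature pushes an individual prediction up or down.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)             # global feature-importance view
```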

Findings

By using the combination of MLR and SHAP analysis, it is noticeable that general living area, basement size and age of remodelling are the top three most important variables in determining the house’s price in the Ames dataset, while properties with more rooms/bathrooms, larger land area and closer proximity to the CBD or to the South of Melbourne are more expensive in the Melbourne dataset. These important factors can be used to estimate the best price range for a housing property for better decision-making.

Research limitations/implications

A limitation of this study is that the distribution of the housing prices is highly skewed; it is normal for property prices to cluster at the lower end, with only a few houses being highly priced. As mentioned before, MLR can effectively help in evaluating the likelihood ratio of each variable towards these categories. However, housing price is originally continuous, and there is a need to convert the price to a categorical type. Nonetheless, the most effective method to categorize the data is still questionable.

Originality/value

The key point of this paper is the use of an explainable machine learning approach to identify the prominent factors of housing price determination, which can be used to determine the optimal price range of a property and is useful for decision-making by both buyers and sellers.

Details

International Journal of Housing Markets and Analysis, vol. 16 no. 5
Type: Research Article
ISSN: 1753-8270

Open Access
Article
Publication date: 18 April 2023

Worapan Kusakunniran, Pairash Saiviroonporn, Thanongchai Siriapisith, Trongtum Tongdee, Amphai Uraiverotchanakorn, Suphawan Leesakul, Penpitcha Thongnarintr, Apichaya Kuama and Pakorn Yodprom

Abstract

Purpose

Cardiomegaly can be determined by the cardiothoracic ratio (CTR), which can be measured in a chest x-ray image. It is calculated based on the relationship between the size of the heart and the transverse dimension of the chest. Cardiomegaly is identified when the ratio is larger than a cut-off threshold. This paper aims to propose a solution for calculating the ratio in order to classify cardiomegaly in chest x-ray images.

Design/methodology/approach

The proposed method begins with constructing lung and heart segmentation models based on the U-Net architecture using publicly available datasets with ground-truth heart and lung masks. The ratio is then calculated using the sizes of the segmented lung and heart areas. In addition, Progressive Growing of GANs (PGAN) is adopted here to construct a new dataset containing chest x-ray images of three classes: male normal, female normal and cardiomegaly. This dataset is then used for evaluating the proposed solution. Also, the proposed solution is used to evaluate the quality of the chest x-ray images generated by PGAN.
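
The sketch below shows one way to turn binary heart and lung masks into a CTR value, assuming the conventional width-based definition (maximal heart width over maximal transverse thoracic width); the paper's exact computation from the segmented areas may differ.

```python
# Hedged sketch of a CTR computation from binary segmentation masks, assuming
# the conventional width-based definition. The paper's exact area/width
# computation is not specified in the abstract and may differ.
import numpy as np

def horizontal_width(mask: np.ndarray) -> int:
    """Widest left-to-right extent (in pixels) of a binary mask."""
    cols = np.where(mask.any(axis=0))[0]
    return int(cols.max() - cols.min() + 1) if cols.size else 0

def cardiothoracic_ratio(heart_mask: np.ndarray, lung_mask: np.ndarray) -> float:
    heart_width = horizontal_width(heart_mask)
    thorax_width = horizontal_width(lung_mask)   # both lungs in one mask
    return heart_width / thorax_width

# Example with toy masks; cardiomegaly would be flagged when CTR > 0.50.
heart = np.zeros((256, 256), dtype=bool); heart[120:180, 90:160] = True
lungs = np.zeros((256, 256), dtype=bool); lungs[60:220, 40:220] = True
print(f"CTR = {cardiothoracic_ratio(heart, lungs):.2f}")
```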

Findings

In the experiments, the trained models are applied to segment regions of the heart and lungs in chest x-ray images from the self-collected dataset. The calculated CTR values are compared with values measured manually by human experts; the average error is 3.08%. The models are then also applied to segment heart and lung regions for the CTR calculation on the dataset generated by PGAN, and cardiomegaly is determined using various cut-off threshold values. With the standard cut-off at 0.50, the proposed method achieves 94.61% accuracy, 88.31% sensitivity and 94.20% specificity.

Originality/value

The proposed solution is demonstrated to be robust across unseen datasets for the segmentation, CTR calculation and cardiomegaly classification, including the dataset generated from PGAN. The cut-off value can be adjusted below 0.50 to increase sensitivity; for example, a sensitivity of 97.04% can be achieved at a cut-off of 0.45, although the specificity then decreases from 94.20% to 79.78%.

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2634-1964

Article
Publication date: 27 October 2020

Lokesh Singh, Rekh Ram Janghel and Satya Prakash Sahu

Abstract

Purpose

The study aims to cope with the problems posed by skin lesion datasets with little training data for the classification of melanoma. The vital, challenging issue is the insufficiency of training data encountered when classifying lesions as melanoma or non-melanoma.

Design/methodology/approach

In this work, a transfer learning (TL) framework, Transfer Constituent Support Vector Machine (TrCSVM), is designed for melanoma classification based on feature-based domain adaptation (FBDA), leveraging the support vector machine (SVM) and Transfer AdaBoost (TrAdaBoost). The working of the framework is twofold: first, SVM is utilized for domain adaptation to learn a transferable representation between the source and target domains. In the first phase, for homogeneous domain adaptation, it augments features by transforming the data from the source and target (different but related) domains into a shared subspace. In the second phase, for heterogeneous domain adaptation, it leverages knowledge by augmenting features from the source to the target (different and unrelated) domains into a shared subspace. Second, TrAdaBoost is utilized to adjust the weights of wrongly classified data in the newly generated source and target datasets.
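
Because the abstract does not specify the augmentation transform, the sketch below uses the well-known feature-augmentation scheme of Daume III (2007) as a stand-in for the shared-subspace step and fits an SVM on the augmented space; the TrAdaBoost reweighting stage is omitted.

```python
# Illustrative stand-in for the feature-augmentation step described above.
# The exact transformation is not given in the abstract, so this sketch uses
# the simple [shared, source-only, target-only] augmentation of Daume III
# (2007) and fits an SVM on the augmented space. TrAdaBoost reweighting and
# the actual dermoscopy features are omitted; all data here are synthetic.
import numpy as np
from sklearn.svm import SVC

def augment(X: np.ndarray, domain: str) -> np.ndarray:
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])   # shared + source-specific + (empty target)
    return np.hstack([X, zeros, X])       # shared + (empty source) + target-specific

rng = np.random.default_rng(0)
X_src, y_src = rng.normal(0.0, 1.0, (200, 16)), rng.integers(0, 2, 200)  # large source set
X_tgt, y_tgt = rng.normal(0.5, 1.0, (40, 16)), rng.integers(0, 2, 40)    # small target set

X_aug = np.vstack([augment(X_src, "source"), augment(X_tgt, "target")])
y_aug = np.concatenate([y_src, y_tgt])

clf = SVC(kernel="rbf").fit(X_aug, y_aug)
print("train accuracy:", clf.score(X_aug, y_aug))
```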

Findings

The experimental results empirically demonstrate the superiority of TrCSVM over state-of-the-art TL methods on small datasets, with an accuracy of 98.82%.

Originality/value

Experiments are conducted on six skin lesion datasets, and performance is compared based on accuracy, precision, sensitivity and specificity. The effectiveness of TrCSVM is evaluated on ten other datasets to test its generalization behavior. Its performance is also compared with two existing TL frameworks (TrResampling, TrAdaBoost) for the classification of melanoma.

Details

Data Technologies and Applications, vol. 55 no. 1
Type: Research Article
ISSN: 2514-9288
