Search results

1 – 10 of over 1000
Article
Publication date: 2 April 2021

Tressy Thomas and Enayat Rajabi

Abstract

Purpose

The primary aim of this study is to review studies of novel data imputation approaches, particularly in the machine learning (ML) area, along several dimensions, including the type of method, the experimentation setup and the evaluation metrics used. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness they address. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How were the experimentation setup, data set characteristics and missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods?

Design/methodology/approach

The review followed the standard identification, screening and selection process. The initial search of electronic databases for missing value imputation (MVI) based on ML algorithms returned 2,883 papers, most of which did not describe an MVI technique relevant to this study. Titles were first scanned for relevance, and 306 papers were identified as appropriate. After reviewing the abstracts, 151 ineligible papers were dropped, leaving 155 papers for full-text review. Of these, 117 papers were used to assess the review questions.

Findings

This study shows that clustering- and instance-based algorithms are the most frequently proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of studies sourced their data sets from publicly available repositories. A common approach is to treat the complete data set as the baseline and evaluate the effectiveness of imputation on test sets with artificially induced missingness. Data set size and missingness ratio varied across experimentations, while the missing data type and mechanism determine what an imputation method can handle. Computational expense is a concern, and experimentation with large data sets appears to be a challenge.

Originality/value

The review makes clear that there is no single universal solution to the missing data problem. Variants of ML approaches work well with missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalizability in their applicability. Another concern related to applicability is the complexity of an algorithm's formulation and implementation. Imputations based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, are therefore popular across various domains.
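The kNN imputation the review singles out as popular is available off the shelf; a minimal sketch using scikit-learn's `KNNImputer` (the toy matrix and `n_neighbors` value are illustrative, not drawn from any reviewed study):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled with the mean of that feature over the
# k nearest rows; distances use only the coordinates both rows share.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)  # same shape, no NaNs remaining
```

Its simplicity, with no model to train beyond storing the data, is precisely why the review finds kNN-based imputation popular across domains.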

Details

Data Technologies and Applications, vol. 55 no. 4
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 24 August 2018

Jewoo Kim and Jongho Im

Abstract

Purpose

The purpose of this paper is to introduce a new multiple imputation method that can effectively manage missing values in online review data, thereby allowing the online review analysis to yield valid results by using all available data.

Design/methodology/approach

This study develops a missing data method based on multivariate imputation by chained equations (MICE) to generate imputed values for online reviews. Sentiment analysis is used to incorporate customers' textual opinions as auxiliary information in the imputation procedure. To check the validity of the proposed method, the authors apply it to missing sub-ratings on hotel attributes in both simulated and real Honolulu hotel review data sets. The estimation results are compared with those of two other missing data techniques: listwise deletion and conventional multiple imputation that does not consider text reviews.
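As a sketch of the general idea, chained-equation imputation with a fully observed auxiliary covariate might look as follows; scikit-learn's `IterativeImputer` stands in for the MICE engine, and the `sentiment` column is a hypothetical numeric stand-in for the paper's text-derived score:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200
sentiment = rng.normal(size=n)                     # auxiliary score, fully observed
rating = 3.0 + 0.8 * sentiment + rng.normal(scale=0.5, size=n)

X = np.column_stack([sentiment, rating])
X[rng.random(n) < 0.3, 1] = np.nan                 # ~30% of ratings missing

# Each incomplete column is regressed on the others and the fits are
# iterated to convergence, in the spirit of chained equations.
X_imp = IterativeImputer(random_state=0).fit_transform(X)
print(np.isnan(X_imp).sum())  # 0
```

Note that this yields a single completed data set; proper multiple imputation would repeat the fill with different random draws and pool the resulting estimates.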

Findings

The simulation analysis shows that the authors' imputation method produces more efficient and less biased estimates than the other two missing data techniques when text reviews are potentially associated with the rating scores and the response mechanism. When the method is applied to the real hotel review data, the text sentiment-based propensity score effectively explains the missingness of sub-ratings on hotel attributes, and the imputation method that incorporates those propensity scores again yields better estimates than the alternative techniques, as in the simulation analysis.

Originality/value

This study extends multiple imputation to online data, accounting for its spontaneous and unstructured nature. The new method helps make fuller use of the observed online data while avoiding potential missing-data problems.

Details

International Journal of Contemporary Hospitality Management, vol. 30 no. 11
Type: Research Article
ISSN: 0959-6119

Article
Publication date: 27 July 2021

Sonia Goel and Meena Tushir

Abstract

Purpose

In real-world decision-making, high-accuracy data analysis is essential in a ubiquitous environment. However, user-related data collected under various privacy constraints often contain missing values. This paper deals with incomplete data in fuzzy model identification, presenting a new method for estimating the parameters of a Takagi–Sugeno model in the presence of missing features.

Design/methodology/approach

In this work, the authors propose a three-fold approach to fuzzy model identification: an imputation-based linear interpolation technique estimates the missing features of the data; fuzzy c-means clustering determines the optimal number of rules and the parameters of the membership functions; and finally, all antecedent and consequent parameters, along with the widths of the antecedent (Gaussian) membership functions, are optimized by a gradient descent algorithm that minimizes the root mean square error.
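The first step, linear-interpolation imputation, is a standard operation; a minimal sketch with pandas (toy series; the fuzzy c-means and gradient descent steps are not shown):

```python
import numpy as np
import pandas as pd

# A feature with two missing observations between known values.
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

# Linear interpolation fills each gap on the straight line between
# its nearest observed neighbours.
filled = s.interpolate(method="linear")
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```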

Findings

The proposed method is tested on two well-known simulation examples as well as on a real data set, and its performance is compared with that of several traditional methods. The result and statistical analyses show that the proposed model achieves a considerable improvement in accuracy under varying degrees of data incompleteness.

Originality/value

The proposed method performs well for fuzzy model identification, estimating the parameters of a Takagi–Sugeno model in the presence of missing features across varying degrees of missingness, compared with several well-known methods.

Details

International Journal of Pervasive Computing and Communications, vol. 17 no. 4
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 30 October 2018

Darryl Ahner and Luke Brantley

Abstract

Purpose

This paper addresses the reasons behind the varying levels of volatile conflict and peace seen during the Arab Spring of 2011 to 2015. During this period, certain Middle Eastern and North African countries experienced higher rates of conflict transition than previous studies had normally observed.

Design/methodology/approach

Previous prediction models lose accuracy during times of volatile conflict transition, and proper strategies for handling the Arab Spring have been highly debated. This paper identifies which countries were affected by the Arab Spring and then applies data analysis techniques to predict a country's tendency to suffer from high-intensity, violent conflict. A large number of open-source variables are incorporated via an imputation methodology useful to future conflict prediction studies. The imputed variables feed four model-building techniques: purposeful selection of covariates, logical selection of covariates, principal component regression and representative principal component regression, yielding modeling accuracies exceeding 90 per cent.
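Of the four techniques, principal component regression is the most mechanical to sketch: compress many covariates into a few components, then fit a classifier on them. The data, component count and model below are illustrative, not the paper's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))          # many open-source covariates per country-year
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)  # conflict indicator

# Reduce 30 covariates to 5 principal components, then classify on them.
pcr = make_pipeline(PCA(n_components=5), LogisticRegression())
pcr.fit(X, y)
acc = pcr.score(X, y)
print(round(acc, 2))
```

Regressing on a handful of components rather than all raw covariates is what keeps such models stable when many variables are imputed.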

Findings

Analysis of the models produced by the four techniques supports hypotheses that identify political opportunity and quality-of-life factors as causes of increased instability following the Arab Spring.

Originality/value

The paper's particular contribution is addressing, through data analytics, the reasons behind the varying levels of volatile conflict and peace seen during the Arab Spring of 2011 to 2015. It considers various open-source, readily available data for inclusion in multiple models of the identified Arab Spring nations, in addition to implementing a novel imputation methodology useful to future conflict prediction studies.

Details

Journal of Defense Analytics and Logistics, vol. 2 no. 2
Type: Research Article
ISSN: 2399-6439

Article
Publication date: 24 October 2023

Jared Nystrom, Raymond R. Hill, Andrew Geyer, Joseph J. Pignatiello and Eric Chicken

Abstract

Purpose

This paper presents a method to impute missing data in a chaotic time series, in this case lightning prediction data, and then uses the completed dataset to create lightning prediction forecasts.

Design/methodology/approach

Spatiotemporal kriging is used to estimate data that are autocorrelated in both space and time. Using the estimated values in an imputation methodology completes a dataset used for lightning prediction.
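Kriging is formally equivalent to Gaussian-process regression, so the idea can be sketched with scikit-learn's GP as a stand-in spatiotemporal kriging engine; the coordinates are (x, y, t), and the data and kernel choice are illustrative, not the squadron's:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 10.0, size=(50, 3))   # observed (x, y, t) locations
values = np.sin(coords[:, 0]) + 0.1 * rng.normal(size=50)

# Fit a covariance model over space-time, then predict at the gaps.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), alpha=0.01)
gp.fit(coords, values)

missing_coords = rng.uniform(0.0, 10.0, size=(5, 3))  # where the sensor dropped out
estimates, std = gp.predict(missing_coords, return_std=True)
print(estimates.shape, std.shape)
```

The predictive standard deviation is the kriging variance: it flags which imputed values are well supported by nearby observations and which are extrapolations.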

Findings

The techniques provided prove robust to the chaotic nature of the data, and the resulting time series displays evidence of smoothing while also preserving the signal of interest for lightning prediction.

Research limitations/implications

The research is limited to the data collected in support of weather prediction work through the 45th Weather Squadron of the United States Air Force.

Practical implications

These methods are important due to the increasing reliance on sensor systems. These systems often provide incomplete and chaotic data, which must be used despite collection limitations. This work establishes a viable data imputation methodology.

Social implications

Improved lightning prediction, as with any improved prediction of natural weather events, can save lives and resources by enabling timely, cautious behavior.

Originality/value

To the authors' knowledge, this is a novel application of these imputation and forecasting methods.

Details

Journal of Defense Analytics and Logistics, vol. 7 no. 2
Type: Research Article
ISSN: 2399-6439

Book part
Publication date: 29 September 2023

Torben Juul Andersen

Abstract

This chapter outlines how the comprehensive North American and European datasets were collected and explains the ensuing data cleaning process, outlining three alternative methods for dealing with missing values: the complete case, multiple imputation (MI) and k-nearest neighbor (KNN) methods. The complete case method is the conventional approach in many mainstream management studies; we discuss its implied underlying assumption, which is rarely assessed or tested in practice, and explain the alternative imputation approaches in detail. Use of North American data is the norm, but we also collected a European dataset, which is rarely done, to enable subsequent comparison between these geographical regions. We introduce the structure of firms organized within different industry classification schemes for use in the ensuing comparative analyses and provide base information on missing values in the original and cleaned datasets. The performance indicators derived from the sampled data are defined and presented. We show that the three approaches to missing values have significantly different effects on the calculated performance measures, in terms of both extreme estimate ranges and mean performance values.
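On synthetic data the three strategies can be contrasted in a few lines; scikit-learn's `IterativeImputer` stands in for MI (it yields a single chained-equation fill rather than multiple pooled draws), and all numbers are illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X_miss = X.copy()
X_miss[rng.random(X_miss.shape) < 0.2] = np.nan       # ~20% missing at random

# Complete case: discard any row with a missing value.
complete_case = X_miss[~np.isnan(X_miss).any(axis=1)]

# Chained-equation fill (a single-imputation stand-in for MI) and kNN fill.
X_mi = IterativeImputer(random_state=0).fit_transform(X_miss)
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_miss)

print(len(complete_case), len(X_mi), len(X_knn))      # complete case keeps fewer rows
```

The row counts make the chapter's point concrete: complete-case analysis silently discards a large share of firms, and any statistic computed on the survivors reflects that selection.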

Details

A Study of Risky Business Outcomes: Adapting to Strategic Disruption
Type: Book
ISBN: 978-1-83797-074-2

Article
Publication date: 12 October 2020

Ibrahim Said Ahmad, Azuraliza Abu Bakar, Mohd Ridzwan Yaakub and Mohammad Darwich

Abstract

Purpose

Sequel movies are very popular; however, there are few studies on sequel movie revenue prediction. The purpose of this paper is to propose a sentiment analysis-based model for sequel movie revenue prediction and a missing value imputation method for the sequel revenue prediction dataset.

Design/methodology/approach

A sequel of a successful movie will most likely also be successful. Therefore, we propose a supervised learning approach in which data are created from sequel movies to predict the box-office revenue of an upcoming sequel. The algorithms used in the prediction are multiple linear regression, support vector machine and multilayer perceptron neural network.
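The three learners named above are all standard; a sketch with scikit-learn, where the features are hypothetical stand-ins for prior-sequel revenues:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 4))  # e.g. normalized revenues of 4 prior sequels
y = X.sum(axis=1) + rng.normal(scale=0.1, size=100)   # next sequel's revenue

models = [
    LinearRegression(),                                   # multiple linear regression
    SVR(),                                                # support vector regression
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0),
]
preds = [m.fit(X, y).predict(X[:1]) for m in models]
print([round(float(p[0]), 2) for p in preds])
```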

Findings

The results show that using four sequel movies in a franchise to predict the box-office revenue of a fifth sequel achieved better prediction than using three sequels, which was also better than using two sequel movies.

Research limitations/implications

The model produced will be beneficial to movie producers and other stakeholders in the movie industry in deciding the viability of producing a movie sequel.

Originality/value

Previous studies have not prioritized sequel movies in movie revenue prediction. Additionally, a new missing value imputation method is introduced, and a sequel movie revenue prediction dataset is prepared.

Details

Data Technologies and Applications, vol. 54 no. 5
Type: Research Article
ISSN: 2514-9288

Book part
Publication date: 23 November 2011

Denis Conniffe and Donal O'Neill

Abstract

A common approach to dealing with missing data is to estimate the model on the common subset of data, by necessity throwing away potentially useful data. We derive a new probit type estimator for models with missing covariate data where the dependent variable is binary. For the benchmark case of conditional multinormality we show that our estimator is efficient and provide exact formulae for its asymptotic variance. Simulation results show that our estimator outperforms popular alternatives and is robust to departures from the parametric assumptions adopted in the benchmark case. We illustrate our estimator by examining the portfolio allocation decision of Italian households.

Details

Missing Data Methods: Cross-sectional Methods and Applications
Type: Book
ISBN: 978-1-78052-525-9

Article
Publication date: 2 September 2014

John R. Nofsinger and Abhishek Varma

Abstract

Purpose

The purpose of this paper is to explore some commonly held beliefs about individuals investing in over-the-counter (OTC) stocks (those traded on Over-the-counter Bulletin Board (OTCBB) and Pink Sheets), a fairly pervasive activity. The authors frame the analysis within the context of direct gambling, aspirational preferences in behavioral portfolios, and private information.

Design/methodology/approach

Contrary to popular perceptions, modeling the deliberate act of buying OTC stocks at a discount brokerage house shows that, unlike typical lottery buyers/gamblers, OTC investors are older, wealthier, more experienced at investing and more diversified in their portfolios than their non-OTC investing counterparts.

Findings

Behavioral portfolio investors (Shefrin and Statman, 2000) invest their money in layers, each of which corresponds to an aspiration or goal. Consistent with sensation seeking and aspirations in behavioral portfolios, OTC investors also display higher trading activity. Penny stocks seem to have different characteristics and trading behavior than other OTC stocks priced over one dollar. Irrespective of the price of OTC stocks, the authors find little evidence of information content in OTC trades.

Originality/value

The paper provides insight into individual investor decision making by empirically exploring the demographic and portfolio characteristics of individuals trading in OTC stocks.

Details

Review of Behavioral Finance, vol. 6 no. 1
Type: Research Article
ISSN: 1940-5979

Article
Publication date: 14 August 2017

Panagiotis Loukopoulos, George Zolkiewski, Ian Bennett, Pericles Pilidis, Fang Duan and David Mba

Abstract

Purpose

Centrifugal compressors are integral components in the oil industry, so effective maintenance is required. Condition-based maintenance and prognostics and health management (CBM/PHM) have been gaining popularity and can also be performed remotely, leading to e-maintenance. Their success depends on the quality of the data used for analysis and decision making, and a major associated issue is missing data: their presence may compromise the information within a set, causing bias or misleading results, so addressing this matter is crucial. The purpose of this paper is to review and compare the most widely used imputation techniques in a case study using condition monitoring measurements from an operational industrial centrifugal compressor.

Design/methodology/approach

The paper provides a brief overview and comparison of the most widely used imputation techniques, using a complete data set with artificially introduced missing values. The techniques were tested with respect to the amount of missing data, its location within the set and the variable containing it.
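The evaluation protocol described (blank out entries in a complete set, impute, then score against the known truth) can be sketched as follows; `SimpleImputer` and `IterativeImputer` stand in for a univariate and a multivariate technique, and all data are synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.1, size=n)   # channel strongly tied to x0
X_true = np.column_stack([x0, x1])

X_miss = X_true.copy()
holes = rng.random(n) < 0.1                     # artificial missingness in x1 only
X_miss[holes, 1] = np.nan

def rmse(X_hat):
    # Score imputed entries against the known ground truth.
    return float(np.sqrt(np.mean((X_hat[holes, 1] - X_true[holes, 1]) ** 2)))

uni = rmse(SimpleImputer(strategy="mean").fit_transform(X_miss))
multi = rmse(IterativeImputer(random_state=0).fit_transform(X_miss))
print(f"univariate RMSE={uni:.3f}  multivariate RMSE={multi:.3f}")
```

With correlated sensor channels like these, the multivariate imputer exploits the relationship between variables that a univariate fill ignores, consistent with the paper's finding.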

Findings

Univariate and multivariate imputation techniques were compared, with the latter yielding the smallest errors. The multivariate techniques appeared unaffected by the amount or location of the missing data, although they were affected by which variable contained it.

Research limitations/implications

During the analysis, it was assumed that at any time only one variable contained missing data. Further research is still required to address this point.

Originality/value

This study can serve as a guide for selecting the appropriate imputation method for missing values in centrifugal compressor condition monitoring data.

Details

Journal of Quality in Maintenance Engineering, vol. 23 no. 3
Type: Research Article
ISSN: 1355-2511
