Search results
1 – 10 of over 84000This chapter outlines how the comprehensive North American and European datasets were collected and explains the ensuing data cleaning process outlining three alternative methods…
Abstract
This chapter outlines how the comprehensive North American and European datasets were collected and explains the ensuing data cleaning process outlining three alternative methods applied to deal with missing values, the complete case, the multiple imputation (MI), and the K-nearest neighbor (KNN) methods. The complete case method is the conventional approach adopted in many mainstream management studies. We further discuss the implied assumption underlying use of this technique, which is rarely assessed, or tested in practice and explain the alternative imputation approaches in detail. Use of North American data is the norm but we also collected a European dataset, which is rarely done to enable subsequent comparative analysis between these geographical regions. We introduce the structure of firms organized within different industry classification schemes for use in the ensuing comparative analyses and provide base information on missing values in the original and cleaned datasets. The calculated performance indicators derived from the sampled data are defined and presented. We show how the three alternative approaches considered to deal with missing values have significantly different effects on the calculated performance measures in terms of extreme estimate ranges and mean performance values.
Details
Keywords
Kedong Yin, Yun Cao, Shiwei Zhou and Xinman Lv
The purposes of this research are to study the theory and method of multi-attribute index system design and establish a set of systematic, standardized, scientific index systems…
Abstract
Purpose
The purposes of this research are to study the theory and method of multi-attribute index system design and establish a set of systematic, standardized, scientific index systems for the design optimization and inspection process. The research may form the basis for a rational, comprehensive evaluation and provide the most effective way of improving the quality of management decision-making. It is of practical significance to improve the rationality and reliability of the index system and provide standardized, scientific reference standards and theoretical guidance for the design and construction of the index system.
Design/methodology/approach
Using modern methods such as complex networks and machine learning, a system for the quality diagnosis of index data and the classification and stratification of index systems is designed. This guarantees the quality of the index data, realizes the scientific classification and stratification of the index system, reduces the subjectivity and randomness of the design of the index system, enhances its objectivity and rationality and lays a solid foundation for the optimal design of the index system.
Findings
Based on the ideas of statistics, system theory, machine learning and data mining, the focus in the present research is on “data quality diagnosis” and “index classification and stratification” and clarifying the classification standards and data quality characteristics of index data; a data-quality diagnosis system of “data review – data cleaning – data conversion – data inspection” is established. Using a decision tree, explanatory structural model, cluster analysis, K-means clustering and other methods, classification and hierarchical method system of indicators is designed to reduce the redundancy of indicator data and improve the quality of the data used. Finally, the scientific and standardized classification and hierarchical design of the index system can be realized.
Originality/value
The innovative contributions and research value of the paper are reflected in three aspects. First, a method system for index data quality diagnosis is designed, and multi-source data fusion technology is adopted to ensure the quality of multi-source, heterogeneous and mixed-frequency data of the index system. The second is to design a systematic quality-inspection process for missing data based on the systematic thinking of the whole and the individual. Aiming at the accuracy, reliability, and feasibility of the patched data, a quality-inspection method of patched data based on inversion thought and a unified representation method of data fusion based on a tensor model are proposed. The third is to use the modern method of unsupervised learning to classify and stratify the index system, which reduces the subjectivity and randomness of the design of the index system and enhances its objectivity and rationality.
Details
Keywords
The purpose of this paper is to introduce a new multiple imputation method that can effectively manage missing values in online review data, thereby allowing the online review…
Abstract
Purpose
The purpose of this paper is to introduce a new multiple imputation method that can effectively manage missing values in online review data, thereby allowing the online review analysis to yield valid results by using all available data.
Design/methodology/approach
This study develops a missing data method based on the multivariate imputation chained equation to generate imputed values for online reviews. Sentiment analysis is used to incorporate customers’ textual opinions as the auxiliary information in the imputation procedures. To check the validity of the proposed imputation method, the authors apply this method to missing values of sub-ratings on hotel attributes in both the simulated and real Honolulu hotel review data sets. The estimation results are compared to those of different missing data techniques, namely, listwise deletion and conventional multiple imputation which does not consider text reviews.
Findings
The findings from the simulation analysis show that the imputation method of the authors produces more efficient and less biased estimates compared to the other two missing data techniques when text reviews are possibly associated with the rating scores and response mechanism. When applying the imputation method to the real hotel review data, the findings show that the text sentiment-based propensity score can effectively explain the missingness of sub-ratings on hotel attributes, and the imputation method considering those propensity scores has better estimation results than the other techniques as in the simulation analysis.
Originality/value
This study extends multiple imputation to online data considering its spontaneous and unstructured nature. This new method helps make the fuller use of the observed online data while avoiding potential missing problems.
Details
Keywords
Zhenyuan Wang, Chih-Fong Tsai and Wei-Chao Lin
Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques…
Abstract
Purpose
Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies as the minority class from the normal data as the majority class, are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers.
Design/methodology/approach
In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values.
Findings
The experimental results are based on 44 class imbalanced datasets; three instance selection algorithms, including IB3, DROP3 and the GA, the CART decision tree for missing value imputation, and three one-class classifiers, which include OCSVM, IFOREST and LOF, show that if the instance selection algorithm is carefully chosen, performing this step could improve the quality of the training data, which makes one-class classifiers outperform the baselines without instance selection. Moreover, when class imbalanced datasets contain some missing values, combining missing value imputation and instance selection, regardless of which step is first performed, can maintain similar data quality as datasets without missing values.
Originality/value
The novelty of this paper is to investigate the effect of performing instance selection on the performance of one-class classifiers, which has never been done before. Moreover, this study is the first attempt to consider the scenario of missing values that exist in the training set for training one-class classifiers. In this case, performing missing value imputation and instance selection with different orders are compared.
Details
Keywords
Craig Enders, Samantha Dietz, Marjorie Montague and Jennifer Dixon
Missing data are a pervasive problem in special education research. The purpose of this chapter is to provide researchers with an overview of two “modern” alternatives for…
Abstract
Missing data are a pervasive problem in special education research. The purpose of this chapter is to provide researchers with an overview of two “modern” alternatives for handling missing data, full information maximum likelihood (FIML) and multiple imputation (MI). These techniques are currently considered to be the methodological “state of the art”, and generally provide more accurate parameter estimates than the traditional methods that are still common in published educational studies. The chapter begins with an overview of missing data theory, and provides brief descriptions of some traditional missing data techniques and their requisite assumptions. Detailed descriptions of FIML and MI are given, and the chapter concludes with an analytic example from a longitudinal study of depression.
Marvin L. Brown and John F. Kros
The actual data mining process deals significantly with prediction, estimation, classification, pattern recognition and the development of association rules. Therefore, the…
Abstract
The actual data mining process deals significantly with prediction, estimation, classification, pattern recognition and the development of association rules. Therefore, the significance of the analysis depends heavily on the accuracy of the database and on the chosen sample data to be used for model training and testing. Data mining is based upon searching the concatenation of multiple databases that usually contain some amount of missing data along with a variable percentage of inaccurate data, pollution, outliers and noise. The issue of missing data must be addressed since ignoring this problem can introduce bias into the models being evaluated and lead to inaccurate data mining conclusions. The objective of this research is to address the impact of missing data on the data mining process.
Details
Keywords
Jeremy N.V Miles and Priscillia Hunt
In applied psychology research settings, such as criminal psychology, missing data are to be expected. Missing data can cause problems with both biased estimates and lack of…
Abstract
Purpose
In applied psychology research settings, such as criminal psychology, missing data are to be expected. Missing data can cause problems with both biased estimates and lack of statistical power. The paper aims to discuss these issues.
Design/methodology/approach
Recently, sophisticated methods for appropriately dealing with missing data, so as to minimize bias and to maximize power have been developed. In this paper the authors use an artificial data set to demonstrate the problems that can arise with missing data, and make naïve attempts to handle data sets where some data are missing.
Findings
With the artificial data set, and a data set comprising of the results of a survey investigating prices paid for recreational and medical marijuana, the authors demonstrate the use of multiple imputation and maximum likelihood estimation for obtaining appropriate estimates and standard errors when data are missing.
Originality/value
Missing data are ubiquitous in applied research. This paper demonstrates that techniques for handling missing data are accessible and should be employed by researchers.
Details
Keywords
Panagiotis Loukopoulos, George Zolkiewski, Ian Bennett, Pericles Pilidis, Fang Duan and David Mba
Centrifugal compressors are integral components in oil industry, thus effective maintenance is required. Condition-based maintenance and prognostics and health management…
Abstract
Purpose
Centrifugal compressors are integral components in oil industry, thus effective maintenance is required. Condition-based maintenance and prognostics and health management (CBM/PHM) have been gaining popularity. CBM/PHM can also be performed remotely leading to e-maintenance. Its success depends on the quality of the data used for analysis and decision making. A major issue associated with it is the missing data. Their presence may compromise the information within a set, causing bias or misleading results. Addressing this matter is crucial. The purpose of this paper is to review and compare the most widely used imputation techniques in a case study using condition monitoring measurements from an operational industrial centrifugal compressor.
Design/methodology/approach
Brief overview and comparison of most widely used imputation techniques using a complete set with artificial missing values. They were tested regarding the effects of the amount, the location within the set and the variable containing the missing values.
Findings
Univariate and multivariate imputation techniques were compared, with the latter offering the smallest error levels. They seemed unaffected by the amount or location of the missing data although they were affected by the variable containing them.
Research limitations/implications
During the analysis, it was assumed that at any time only one variable contained missing data. Further research is still required to address this point.
Originality/value
This study can serve as a guide for selecting the appropriate imputation method for missing values in centrifugal compressor condition monitoring data.
Details
Keywords
Gayaneh Kyureghian, Oral Capps and Rodolfo M. Nayga
The objective of this research is to examine, validate, and recommend techniques for handling the problem of missingness in observational data. We use a rich observational data…
Abstract
The objective of this research is to examine, validate, and recommend techniques for handling the problem of missingness in observational data. We use a rich observational data set, the Nielsen HomeScan data set, which allows us to effectively combine elements from simulated data sets: large numbers of observations, large number of data sets and variables, allowing elements of “design” that typically come with simulated data, and its observational nature. We created random 20% and 50% uniform missingness in our data sets and employed several widely used methods of single imputation, such as mean, regression, and stochastic regression imputations, and multiple imputation methods to fill in the data gaps. We compared these methods by measuring the error of predicting the missing values and the parameter estimates from the subsequent regression analysis using the imputed values. We also compared coverage or the percentages of intervals that covered the true parameter in both cases. Based on our results, the method of single regression or conditional mean imputation provided the best predictions of the missing price values with 28.34 and 28.59 mean absolute percent errors in 20% and 50% missingness settings, respectively. The imputation from conditional distribution method had the best rate of coverage. The parameter estimates based on data sets imputed by conditional mean method were consistently unbiased and had the smallest standard deviations. The multiple imputation methods had the best coverage of both the parameter estimates and predictions of the dependent variable.
Details
Keywords
Ibrahim Said Ahmad, Azuraliza Abu Bakar, Mohd Ridzwan Yaakub and Mohammad Darwich
Sequel movies are very popular; however, there are limited studies on sequel movie revenue prediction. The purpose of this paper is to propose a sentiment analysis based model for…
Abstract
Purpose
Sequel movies are very popular; however, there are limited studies on sequel movie revenue prediction. The purpose of this paper is to propose a sentiment analysis based model for sequel movie revenue prediction and to propose a missing value imputation method for the sequel revenue prediction dataset.
Design/methodology/approach
A sequel of a successful movie will most likely also be successful. Therefore, we propose a supervised learning approach in which data are created from sequel movies to predict the box-office revenue of an upcoming sequel. The algorithms used in the prediction are multiple linear regression, support vector machine and multilayer perceptron neural network.
Findings
The results show that using four sequel movies in a franchise to predict the box-office revenue of a fifth sequel achieved better prediction than using three sequels, which was also better than using two sequel movies.
Research limitations/implications
The model produced will be beneficial to movie producers and other stakeholders in the movie industry in deciding the viability of producing a movie sequel.
Originality/value
Previous studies do not give priority to sequel movies in movie revenue prediction. Additionally, a new missing value imputation method was introduced. Finally, sequel movie revenue prediction dataset was prepared.
Details