Search results
Abstract
Purpose
The purpose of this paper is to introduce a new multiple imputation method that can effectively manage missing values in online review data, thereby allowing the online review analysis to yield valid results by using all available data.
Design/methodology/approach
This study develops a missing data method based on multivariate imputation by chained equations (MICE) to generate imputed values for online reviews. Sentiment analysis is used to incorporate customers' textual opinions as auxiliary information in the imputation procedures. To check the validity of the proposed imputation method, the authors apply it to missing values of sub-ratings on hotel attributes in both simulated and real Honolulu hotel review data sets. The estimation results are compared to those of other missing data techniques, namely listwise deletion and conventional multiple imputation, which does not consider text reviews.
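As a rough illustration of the chained-equation setup described above, the sketch below imputes a missing hotel sub-rating using a text-derived sentiment score as an auxiliary column. scikit-learn's IterativeImputer stands in for the authors' method; the data, column names and coefficients are synthetic assumptions, not the paper's.

```python
# Sketch: MICE-style imputation of a missing sub-rating, with a review
# sentiment score as auxiliary information (all names are illustrative).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200
sentiment = rng.uniform(-1, 1, n)                          # text-derived sentiment score
cleanliness = 3 + 1.5 * sentiment + rng.normal(0, 0.3, n)  # sub-rating driven by sentiment
service = 3 + 1.2 * sentiment + rng.normal(0, 0.3, n)

X = np.column_stack([sentiment, cleanliness, service])
mask = rng.uniform(size=n) < 0.3          # ~30% of cleanliness ratings missing
X_miss = X.copy()
X_miss[mask, 1] = np.nan

# Chained equations: each incomplete column is regressed on the others
# (including the sentiment auxiliary) over several rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X_miss)

rmse = np.sqrt(np.mean((X_imp[mask, 1] - X[mask, 1]) ** 2))
print(f"imputation RMSE: {rmse:.3f}")
```

Because the sentiment column predicts the masked ratings, the chained-equation estimates recover them far better than deleting or mean-filling the rows would.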
Findings
The findings from the simulation analysis show that the authors' imputation method produces more efficient and less biased estimates than the other two missing data techniques when text reviews are possibly associated with the rating scores and the response mechanism. When the imputation method is applied to the real hotel review data, the findings show that the text sentiment-based propensity score can effectively explain the missingness of sub-ratings on hotel attributes, and that the imputation method incorporating those propensity scores yields better estimates than the other techniques, as in the simulation analysis.
Originality/value
This study extends multiple imputation to online data, considering its spontaneous and unstructured nature. The new method helps make fuller use of the observed online data while avoiding potential missing-data problems.
Tressy Thomas and Enayat Rajabi
Abstract
Purpose
The primary aim of this study is to review the studies along several dimensions, including the type of methods, the experimentation setup and the evaluation metrics used in novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness are addressed in the proposals. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How are the experimentation setup, the characteristics of the data sets and the missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods?
Design/methodology/approach
The review went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage did not describe an MVI technique relevant to this study. Titles were first scanned for relevance, and 306 papers were identified as appropriate. Upon reviewing the abstracts, 151 papers not eligible for this study were dropped, leaving 155 research papers suitable for full-text review. Of these, 117 papers were used to assess the review questions.
Findings
This study shows that clustering- and instance-based algorithms are the most frequently proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available repositories. A common approach is to set the complete data set as the baseline and evaluate the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experiments, while the missing data type and mechanism pertain to the capability of the imputation method. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.
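The evaluation protocol described above, masking values in a complete data set, imputing, then scoring with RMSE and PCP, can be sketched as follows. The data and the mean/mode baseline imputations are illustrative assumptions:

```python
# Sketch: artificially induce missingness in a complete data set, impute,
# then score with RMSE (numeric column) and PCP (categorical column).
import numpy as np

rng = np.random.default_rng(42)
true_num = rng.normal(size=100)          # complete numeric column
true_cat = rng.integers(0, 3, size=100)  # complete categorical column

mask = rng.uniform(size=100) < 0.2       # induce ~20% missingness

# Baseline imputations: column mean for numeric, column mode for categorical
imp_num = np.where(mask, true_num[~mask].mean(), true_num)
vals, counts = np.unique(true_cat[~mask], return_counts=True)
imp_cat = np.where(mask, vals[counts.argmax()], true_cat)

rmse = np.sqrt(np.mean((imp_num[mask] - true_num[mask]) ** 2))
pcp = np.mean(imp_cat[mask] == true_cat[mask])  # fraction predicted correctly
print(f"RMSE={rmse:.3f}  PCP={pcp:.2%}")
```

Any imputation method can be swapped into the masked positions and scored against the held-back true values in exactly the same way.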
Originality/value
It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of an algorithm. Imputations based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, are popular across various domains.
Pooja Rani, Rajneesh Kumar and Anurag Jain
Abstract
Purpose
Decision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by the missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently.
Design/methodology/approach
The proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), mean and mode imputation methods in an optimum way. The performance of HIOC has been compared to that of MICE, KNN, and the mean and mode methods. Four classifiers, namely support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT), have been used to evaluate the performance of the imputation methods.
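A minimal sketch of the classifier-guided selection idea, assuming scikit-learn implementations of the candidate imputers and a random forest as the evaluating classifier. The selection criterion used here (cross-validated accuracy) is an assumption, not necessarily the authors' exact HIOC procedure:

```python
# Sketch: pick among MICE, KNN and mean imputation by downstream
# classifier accuracy on a medical data set with induced missingness.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.uniform(size=X.shape) < 0.1] = np.nan  # ~10% values missing

candidates = {
    "mice": IterativeImputer(random_state=0),
    "knn": KNNImputer(n_neighbors=5),
    "mean": SimpleImputer(strategy="mean"),
}

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = {}
for name, imp in candidates.items():
    X_imp = imp.fit_transform(X_miss)
    scores[name] = cross_val_score(clf, X_imp, y, cv=3).mean()

best = max(scores, key=scores.get)  # imputer the classifier ranks highest
print("chosen imputer:", best)
```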
Findings
The results show that HIOC performed efficiently even with a high rate of missing values. It reduced root mean square error (RMSE) by up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases, increasing classification accuracy by up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset.
Originality/value
The proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.
Sanna Sintonen, Anssi Tarkiainen, John W. Cadogan, Olli Kuivalainen, Nick Lee and Sanna Sundqvist
Abstract
Purpose
The purpose of this paper is to focus on the case where, by design, one needs to impute cross-country cross-survey (CCCS) data, a situation typical, for example, among multinational firms confronted with the need to carry out comparative marketing surveys with respondents located in several countries. Importantly, while some work demonstrates approaches for single-item direct measures, no prior research has examined the common situation in international marketing in which the researcher needs to use multi-item scales of latent constructs. The paper presents problem areas related to the choices international marketers have to make when doing cross-country/cross-survey research and provides guidance for future research.
Design/methodology/approach
A multi-country sample of real international entrepreneurial orientation (IEO) data (292 New Zealand exporters and 302 Finnish exporters) is used as an example of cross-sample imputation. Three variations of the input data are tested: first, imputation based on all the data available for the measurement model; second, imputation based on the set of joint items shared across the two groups, selected according to their invariance structure; and third, imputation based on both the invariance structures of the joint items and the performance of the measurement model in the group where the full data were originally available.
Findings
Based on distribution comparisons, imputation for New Zealand after completing the measurement model with Finnish data (Model C) gave the most promising results. Consequently, using knowledge of between-country measurement qualities may improve the imputation results, but this benefit comes with a downside, since it simultaneously reduces the amount of data used for imputation. However, none of the imputation models leads to the same statistical inferences about covariances between latent constructs as the original full data.
Research limitations/implications
Considering multiple imputation, the present exploratory study suggests that there are several concerns and issues that should be taken into account when planning CCCS (or split-questionnaire or sub-sampling) designs. Even though well-implemented CCCS designs offer several advantages, such as shorter questionnaires and improved response rates, these concerns lead us to question the appropriateness of the CCCS approach in general, given the need to impute across the samples.
Originality/value
The combination of cross-country and cross-survey approaches is novel to international marketing, and it is not known how the different procedures utilized in imputation affect the results and their validity and reliability. The authors demonstrate the consequences of various imputation strategy choices using a real two-country sample. The exploration may have significant implications for international marketing researchers, and the paper offers stimulus for further research in the area.
Min Bai, Yafeng Qin and Feng Bai
Abstract
Purpose
The primary goal of this paper is to investigate the relationship between stock market liquidity and firm dividend policy in a market implementing the tax imputation system, and to understand how that system influences this relationship within a cross-sectional framework.
Design/methodology/approach
This paper investigates the relationship between stock market liquidity and dividend payout policy under the full tax imputation system in the Australian market. This study uses generalized least squares (GLS) regressions with firm- and year-fixed effects.
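The fixed-effects design can be sketched with synthetic data as below. The variable names (`payout`, `liquidity`) are illustrative assumptions, and dummy-variable OLS via statsmodels stands in for the paper's GLS estimation:

```python
# Sketch: regress payout on liquidity with firm- and year-fixed effects
# (synthetic panel; the true liquidity coefficient is set to 0.5).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
firms, years = 20, 5
df = pd.DataFrame({
    "firm": np.repeat(np.arange(firms), years),
    "year": np.tile(np.arange(years), firms),
})
firm_fe = rng.normal(0, 1, firms)[df["firm"]]   # unobserved firm effects
year_fe = rng.normal(0, 1, years)[df["year"]]   # unobserved year effects
df["liquidity"] = rng.normal(0, 1, len(df))
df["payout"] = 0.5 * df["liquidity"] + firm_fe + year_fe + rng.normal(0, 0.1, len(df))

# Firm- and year-fixed effects enter as categorical dummies
fit = smf.ols("payout ~ liquidity + C(firm) + C(year)", data=df).fit()
print(round(fit.params["liquidity"], 2))
```

Because the fixed effects absorb the firm- and year-level heterogeneity, the estimated liquidity coefficient lands close to the 0.5 used to generate the data.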
Findings
In contrast to the negative relationship between the liquidity of common shares and firms' dividends documented in countries with a double tax system, the study reveals that in Australia dividend payout ratios are positively associated with liquidity after controlling for various explanatory variables in both contemporaneous and lagged time periods. This finding is robust to alternative liquidity proxies and to sub-period tests, and it holds during the COVID-19 pandemic period.
Research limitations/implications
The insights derived from this study have significant implications for various stakeholders within the economy. The findings provide regulators with valuable insights to conduct a more holistic assessment of how the tax system impacts the economy, especially concerning the dividend choices of firms. Within the context of a full tax imputation system, investors can make investment decisions without factoring in the taxation impact. Simultaneously, firms can be relieved of concerns about losing investors who prioritize liquidity, particularly when a high dividend payout might not align optimally with their financial strategy.
Originality/value
This study contributes to the literature on tax clientele effects on dividend policy by providing evidence that the tax imputation system can moderate the impact of liquidity on dividend policy, examining how the dividend tax imputation system affects the substitution effect between dividends and liquidity.
Panagiotis Loukopoulos, George Zolkiewski, Ian Bennett, Pericles Pilidis, Fang Duan and David Mba
Abstract
Purpose
Centrifugal compressors are integral components in the oil industry, so effective maintenance is required. Condition-based maintenance and prognostics and health management (CBM/PHM) have been gaining popularity, and CBM/PHM can also be performed remotely, leading to e-maintenance. Its success depends on the quality of the data used for analysis and decision making, and a major issue is missing data: their presence may compromise the information within a data set, causing bias or misleading results. Addressing this matter is crucial. The purpose of this paper is to review and compare the most widely used imputation techniques in a case study using condition monitoring measurements from an operational industrial centrifugal compressor.
Design/methodology/approach
The paper gives a brief overview and comparison of the most widely used imputation techniques, using a complete data set with artificial missing values. The techniques were tested with regard to the effects of the amount of missing data, their location within the set and the variable containing them.
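A sketch of this kind of comparison, assuming synthetic stand-in data for the compressor measurements and scikit-learn's univariate (column mean) and multivariate (chained-equation) imputers:

```python
# Sketch: univariate vs multivariate imputation error on a complete set
# with artificial missing values confined to one variable, as in the study.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(3)
n = 500
t = rng.normal(size=n)                       # shared latent condition signal
X = np.column_stack([t,
                     2 * t + rng.normal(0, 0.2, n),
                     -t + rng.normal(0, 0.2, n)])
mask = rng.uniform(size=n) < 0.2
X_miss = X.copy()
X_miss[mask, 1] = np.nan                     # 20% of one variable missing

def rmse(est):
    X_imp = est.fit_transform(X_miss)
    return np.sqrt(np.mean((X_imp[mask, 1] - X[mask, 1]) ** 2))

err_uni = rmse(SimpleImputer(strategy="mean"))      # univariate baseline
err_multi = rmse(IterativeImputer(random_state=0))  # multivariate
print(f"univariate RMSE={err_uni:.2f}  multivariate RMSE={err_multi:.2f}")
```

With correlated sensor channels like these, the multivariate imputer exploits the other variables and achieves the smaller error, matching the paper's finding.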
Findings
Univariate and multivariate imputation techniques were compared, with the latter offering the smallest error levels. They seemed unaffected by the amount or location of the missing data, although they were affected by the variable containing them.
Research limitations/implications
During the analysis, it was assumed that at any time only one variable contained missing data. Further research is still required to address this point.
Originality/value
This study can serve as a guide for selecting the appropriate imputation method for missing values in centrifugal compressor condition monitoring data.
Zhenyuan Wang, Chih-Fong Tsai and Wei-Chao Lin
Abstract
Purpose
Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies as the minority class from the normal data as the majority class, are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers.
Design/methodology/approach
In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values.
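The second method (instance selection combined with imputation) might be sketched as below. The centroid-distance filter is a simplified stand-in for IB3/DROP3/GA, and the CART-based chained-equation imputer follows the paper's use of a decision tree; both are assumptions about details the abstract leaves open:

```python
# Sketch: impute an incomplete majority-class training set with a CART
# regressor, then drop the most atypical instances before training a
# one-class classifier on the cleaned data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
normal = rng.normal(0, 1, size=(300, 4))                 # majority (normal) class
normal[rng.uniform(size=normal.shape) < 0.05] = np.nan   # 5% values missing

# Step 1: missing value imputation with a CART estimator inside
# chained equations.
imp = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5),
                       random_state=0)
X = imp.fit_transform(normal)

# Step 2: instance selection -- discard the 10% of points farthest from
# the centroid as presumed noise.
d = np.linalg.norm(X - X.mean(axis=0), axis=1)
keep = d <= np.quantile(d, 0.9)
X_clean = X[keep]
print(X_clean.shape)
```

The two steps can be run in either order, which is exactly the comparison the paper carries out.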
Findings
The experimental results are based on 44 class imbalanced datasets, three instance selection algorithms (IB3, DROP3 and the GA), the CART decision tree for missing value imputation and three one-class classifiers (OCSVM, IFOREST and LOF). They show that if the instance selection algorithm is carefully chosen, this step can improve the quality of the training data, making one-class classifiers outperform the baselines without instance selection. Moreover, when class imbalanced datasets contain missing values, combining missing value imputation and instance selection, regardless of which step is performed first, can maintain data quality similar to that of datasets without missing values.
Originality/value
The novelty of this paper is to investigate the effect of performing instance selection on the performance of one-class classifiers, which has never been done before. Moreover, this study is the first attempt to consider the scenario of missing values that exist in the training set for training one-class classifiers. In this case, performing missing value imputation and instance selection with different orders are compared.
Abstract
Purpose
In real-world decision-making, high accuracy data analysis is essential in a ubiquitous environment. However, missing data are encountered when collecting user-related information because of users' various privacy concerns. This paper aims to deal with incomplete data for fuzzy model identification by proposing a new method of parameter estimation for a Takagi–Sugeno model in the presence of missing features.
Design/methodology/approach
In this work, the authors propose a three-fold approach to fuzzy model identification: an imputation-based linear interpolation technique is used to estimate missing features of the data; fuzzy c-means clustering is then used to determine the optimal number of rules and the parameters of the membership functions of the fuzzy model; and finally, all antecedent and consequent parameters, along with the widths of the antecedent (Gaussian) membership functions, are optimized by a gradient descent algorithm that minimizes the root mean square error.
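The first stage (linear-interpolation imputation) can be illustrated in isolation; the clustering and gradient-descent stages are omitted here, and the toy series is an assumption:

```python
# Sketch: linear interpolation fills each gap on the straight line
# between its nearest observed neighbors.
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
filled = s.interpolate(method="linear")
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

The completed feature matrix then feeds the fuzzy c-means and gradient descent stages described above.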
Findings
The proposed method is tested on two well-known simulation examples as well as on a real data set, and its performance is compared with that of some traditional methods. The result analysis and statistical analysis show that the proposed model achieves a considerable improvement in accuracy under varying degrees of data incompleteness.
Originality/value
Compared to some well-known methods, the proposed approach performs well for parameter estimation of a Takagi–Sugeno model under varying degrees of missing features.
Abstract
Purpose
The purpose of this paper is to illustrate the argument that scholars' imputations of agency serve modern professional/institutional purposes other than the refinement of testable theories.
Design/methodology/approach
Data include articles from twenty‐first century issues of four gerontological journals. Content analysis involved coding articles for imputations of agency, constructivist analysis thereof, and the parties to whom authors directed their imputations.
Findings
Most authors rehearse theories of "structuration" and call for more imputations of agency to old people. They do this without imputing agency to privileged groups or to policy makers, and without settling the theoretical question of how much agency people have or how scientists could demonstrate it. One article in ten provides a constructivist critique.
Research limitations/implications
Patterns in imputations of agency in other scholarly realms (such as books) may support another interpretation.
Practical implications
Scholars should treat their imputations of agency as political activities rather than refinements of testable theories. Such imputations position professional scholars as advocates for an oppressed group.
Originality/value
This paper provides a sociological context for interpreting routine imputations of agency in social scientific and humanist scholarship.
Marvin L. Brown and John F. Kros
Abstract
The data mining process deals significantly with prediction, estimation, classification, pattern recognition and the development of association rules. Therefore, the significance of the analysis depends heavily on the accuracy of the database and on the sample data chosen for model training and testing. Data mining is based upon searching the concatenation of multiple databases, which usually contain some amount of missing data along with a variable percentage of inaccurate data, pollution, outliers and noise. The issue of missing data must be addressed, since ignoring it can introduce bias into the models being evaluated and lead to inaccurate data mining conclusions. The objective of this research is to address the impact of missing data on the data mining process.