Search results

1–10 of over 51,000
Open Access
Article
Publication date: 9 February 2022

Anja Perry and Sebastian Netscher

Budgeting data curation tasks in research projects is difficult. In this paper, we investigate the time spent on data curation, more specifically on cleaning and documenting…

Abstract

Purpose

Budgeting data curation tasks in research projects is difficult. In this paper, we investigate the time spent on data curation, more specifically on cleaning and documenting quantitative data for data sharing. We develop recommendations on cost factors in research data management.

Design/methodology/approach

We make use of a pilot study conducted at the GESIS Data Archive for the Social Sciences in Germany between December 2016 and September 2017. During this period, data curators at GESIS - Leibniz Institute for the Social Sciences documented their working hours while cleaning and documenting data from ten quantitative survey studies. We analyse the recorded times and discuss them with the data curators involved in order to identify and examine important cost factors in data curation, that is, aspects that increase the hours spent and factors that reduce the workload.
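A minimal sketch of the kind of analysis described above, assuming the recorded hours are available as a simple table; the column names (study, task, hours) are illustrative assumptions, not the GESIS pilot's actual log format:

```python
# Sketch only: aggregate logged curation hours per study and per task.
# Column names (study, task, hours) are assumed for illustration; the
# actual GESIS pilot-study log format is not specified in the abstract.
import pandas as pd

log = pd.DataFrame({
    "study": ["A", "A", "A", "B", "B"],
    "task":  ["cleaning", "documentation", "anonymisation",
              "cleaning", "documentation"],
    "hours": [12.5, 8.0, 6.5, 20.0, 11.0],
})

# Total effort per study and the relative weight of each curation task.
per_study = log.groupby("study")["hours"].sum()
per_task = log.groupby("task")["hours"].sum().sort_values(ascending=False)

print(per_study)
print(per_task / per_task.sum())  # share of time spent on each cost factor
```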

Findings

We identify two major drivers of time spent on data curation: the size of the data and the personal information contained in the data. Learning effects can occur when data are similar, that is, when they contain the same variables. Important interdependencies exist between individual tasks in data curation and in connection with certain data characteristics.

Originality/value

The different tasks of data curation, time spent on them and interdependencies between individual steps in curation have so far not been analysed.

Details

Journal of Documentation, vol. 78 no. 7
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 21 May 2021

Burak Cankaya, Berna Eren Tokgoz, Ali Dag and K.C. Santosh

This paper aims to propose a machine learning-based automatic labeling methodology for chemical tanker activities that can be applied to any port with any number of active tankers…

Abstract

Purpose

This paper aims to propose a machine learning-based automatic labeling methodology for chemical tanker activities that can be applied to any port with any number of active tankers and the identification of important predictors. The methodology can be applied to any type of activity tracking that is based on automatically generated geospatial data.

Design/methodology/approach

The proposed methodology uses three machine learning algorithms (artificial neural networks, support vector machines (SVMs) and random forest) along with information fusion (IF)-based sensitivity analysis to classify chemical tanker activities. The data set is split into training and test data based on vessels, with two vessels in the training data and one in the test data set. Important predictors were identified using a receiver operating characteristic comparative approach, and overall variable importance was calculated using IF from the top models.
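A rough sketch of the design described above, assuming synthetic AIS-style features (speed, acceleration, course change) and a vessel-based train/test split; the information-fusion step for variable importance is omitted, and none of the names below come from the paper itself:

```python
# Sketch of a vessel-based split and comparison of the three model families
# mentioned above (SVM, random forest, neural network). Feature names, the
# toy label and the data are illustrative assumptions only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 300
ais = pd.DataFrame({
    "vessel": rng.choice(["V1", "V2", "V3"], size=n),
    "speed": rng.uniform(0, 15, n),
    "acceleration": rng.normal(0, 1, n),
    "course_change": rng.normal(0, 10, n),
})
ais["activity"] = (ais["speed"] < 2).astype(int)  # toy label: 1 = stationary activity

features = ["speed", "acceleration", "course_change"]
train = ais[ais["vessel"].isin(["V1", "V2"])]   # two vessels for training
test = ais[ais["vessel"] == "V3"]               # one vessel held out

models = {
    "svm": SVC(probability=True),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "ann": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(train[features], train["activity"])
    auc = roc_auc_score(test["activity"], model.predict_proba(test[features])[:, 1])
    print(name, round(auc, 3))  # ROC-based comparison across models
```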

Findings

Results show that an SVM model has the best balance between sensitivity and specificity, at 93.5% and 91.4%, respectively. Speed, acceleration and change in course over ground are identified as the most important predictors for classifying vessel activity.

Research limitations/implications

The study evaluates vessel movements while waiting between different terminals in the same port, but not movements between different ports for tank-cleaning activities.

Practical implications

The findings in this study can be used by port authorities, shipping companies, vessel operators and other stakeholders for decision support and performance tracking, as well as for automated alerts.

Originality/value

This analysis makes original contributions to the existing literature by defining and demonstrating a methodology that can automatically label vehicle activity based on location data and identify certain characteristics of the activity by finding important location-based predictors that effectively classify the activity status.

Details

Journal of Modelling in Management, vol. 16 no. 4
Type: Research Article
ISSN: 1746-5664

Article
Publication date: 19 November 2021

Samir Al-Janabi and Ryszard Janicki

Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data…

Abstract

Purpose

Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violation of business rules. However, because of the huge amount of data, manual cleaning alone is infeasible; methods are required to automatically detect, repair and clean dirty data. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.

Design/methodology/approach

A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.
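As a rough illustration of what a conditional functional dependency (CFD) check might look like, the sketch below flags violating rows in a table; the table, columns and condition are assumptions for illustration, and the density-based repair step itself is not reproduced:

```python
# Sketch: detecting violations of a conditional functional dependency (CFD).
# A CFD here is a functional dependency (lhs -> rhs) restricted to rows that
# match a pattern condition; data and rule are illustrative assumptions only.
import pandas as pd

df = pd.DataFrame({
    "country": ["UK", "UK", "UK", "DE"],
    "zip":     ["EC1A", "EC1A", "EC1A", "10115"],
    "city":    ["London", "London", "Leeds", "Berlin"],
})

def cfd_violations(df, lhs, rhs, condition):
    """Rows that violate the FD lhs -> rhs on the subset matching `condition`."""
    scope = df[condition(df)]
    # Within each lhs group, every rhs value should be identical.
    counts = scope.groupby(lhs)[rhs].nunique()
    bad_keys = counts[counts > 1].index
    return scope[scope.set_index(lhs).index.isin(bad_keys)]

# CFD: for UK rows, zip uniquely determines city.
violations = cfd_violations(df, lhs=["zip"], rhs="city",
                            condition=lambda d: d["country"] == "UK")
print(violations)
```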

Findings

This new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.
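For context, the F-measure of a repair can be computed by comparing changed cells against a known ground truth, as in the following sketch; the cell-level definition used here is an assumption and may differ from the paper's exact evaluation protocol:

```python
# Sketch: F-measure of a data repair against ground truth (cell-level view).
def repair_f_measure(dirty, repaired, truth):
    changed = {k for k in dirty if repaired[k] != dirty[k]}      # cells the repair touched
    should_change = {k for k in dirty if truth[k] != dirty[k]}   # cells that were actually wrong
    correct = {k for k in changed if repaired[k] == truth[k]}
    precision = len(correct) / len(changed) if changed else 1.0
    recall = len(correct) / len(should_change) if should_change else 1.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

dirty    = {("r1", "city"): "Leeds",  ("r2", "city"): "Berlin"}
truth    = {("r1", "city"): "London", ("r2", "city"): "Berlin"}
repaired = {("r1", "city"): "London", ("r2", "city"): "Berlin"}
print(repair_f_measure(dirty, repaired, truth))  # (1.0, 1.0, 1.0)
```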

Originality/value

Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.

Details

Data Technologies and Applications, vol. 56 no. 3
Type: Research Article
ISSN: 2514-9288

Book part
Publication date: 11 November 2019

Manoj Kumar Jena and Brajaballav Kar

Data, either in primary or secondary form, represent the core strength of quantitative research. However, there is a significant difference between collected data and the final…

Abstract

Data, either in primary or secondary form, represent the core strength of quantitative research. However, there is a significant difference between collected data and the final researchable data. Data collection is driven by the objectives of the research. The data may also exist in various formats at different sources. The collected data in its original form may contain systematic and random errors. Such errors need to be removed from the data, a process termed data cleaning.

The present chapter discusses the different methodologies and steps that may be helpful for fine-tuning the data into a researchable format. The discussions are illustrated by applying the methodologies to a set of financial data on companies listed on the Bombay Stock Exchange. The various steps involved in transforming collected data into researchable data are presented. A schematic model covering data collection, data cleaning, working with variables, outlier treatment and testing the assumptions of statistical tests, such as normality and heteroscedasticity, is presented for the benefit of research scholars. Beyond this generic model, the chapter focuses exclusively on financial data of companies listed on the Bombay Stock Exchange. The challenges involved in the various sources, data gathering and other pre-analysis stages are also considered. The approach is also applicable to research based on secondary data sources in other fields.
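A minimal sketch of the generic pre-analysis checks named in the schematic model (outlier treatment, normality, heteroscedasticity), on placeholder data rather than the chapter's actual BSE dataset:

```python
# Sketch of generic pre-analysis checks: outlier treatment, a normality test
# and a heteroscedasticity test. Variables are placeholders, not BSE data.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
df = pd.DataFrame({"revenue": rng.lognormal(3, 1, 200),
                   "profit": rng.normal(10, 3, 200)})

# Outlier treatment: clip values outside 1.5 * IQR.
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
df["revenue"] = df["revenue"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Normality check (Shapiro-Wilk).
print("Shapiro p-value:", stats.shapiro(df["revenue"]).pvalue)

# Heteroscedasticity check (Breusch-Pagan) on a simple OLS model.
X = sm.add_constant(df["revenue"])
resid = sm.OLS(df["profit"], X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)
```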

Details

Methodological Issues in Management Research: Advances, Challenges, and the Way Ahead
Type: Book
ISBN: 978-1-78973-973-2

Article
Publication date: 11 January 2021

Gui Yuan, Shali Huang, Jing Fu and Xinwei Jiang

This study aims to assess the default risk of borrowers in peer-to-peer (P2P) online lending platforms. The authors propose a novel default risk classification model based on data…

Abstract

Purpose

This study aims to assess the default risk of borrowers in peer-to-peer (P2P) online lending platforms. The authors propose a novel default risk classification model based on data cleaning and feature extraction, which increases risk assessment accuracy.

Design/methodology/approach

The authors use borrower data from the Lending Club and propose the risk assessment model based on low-rank representation (LRR) and discriminant analysis. First, the authors use three LRR models to clean the high-dimensional borrower data by removing outliers and noise; they then adopt a discriminant analysis algorithm to reduce the dimension of the cleaned data. In the dimension-reduced feature space, machine learning classifiers, including k-nearest neighbour, support vector machine and artificial neural network, are used to assess and classify default risks.
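The overall pipeline shape can be sketched as below. LRR and local Fisher discriminant analysis are not standard scikit-learn components, so TruncatedSVD and LinearDiscriminantAnalysis are used as simplified stand-ins, and the data are synthetic placeholders rather than Lending Club records:

```python
# Sketch of the pipeline shape only: low-rank "cleaning", supervised
# dimension reduction, then classification. TruncatedSVD and LDA stand in
# for the LRR and local Fisher discriminant analysis steps of the paper.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                   # high-dimensional borrower features (toy)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # toy default label

# Step 1: low-rank "cleaning" (stand-in for the LRR models).
X_clean = TruncatedSVD(n_components=10, random_state=0).fit_transform(X)

# Step 2: supervised dimension reduction (stand-in for local Fisher DA).
X_tr, X_te, y_tr, y_te = train_test_split(X_clean, y, random_state=0)
lda = LinearDiscriminantAnalysis(n_components=1).fit(X_tr, y_tr)
Z_tr, Z_te = lda.transform(X_tr), lda.transform(X_te)

# Step 3: classify default risk in the reduced feature space.
for name, clf in [("knn", KNeighborsClassifier()),
                  ("svm", SVC()),
                  ("ann", MLPClassifier(max_iter=2000, random_state=0))]:
    clf.fit(Z_tr, y_tr)
    print(name, round(accuracy_score(y_te, clf.predict(Z_te)), 3))
```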

Findings

The results reveal significant noise and redundancy in the borrower data. LRR models can effectively clean such data, particularly the two LRR models with local manifold regularisation. In addition, the supervised discriminant analysis model, termed the local Fisher discriminant analysis model, can extract low-dimensional and discriminative features, which further increases the accuracy of the final risk assessment models.

Originality/value

The originality of this study is that it proposes a novel default risk assessment model, based on data cleaning and feature extraction, for P2P online lending platforms. The proposed approach is innovative and efficient in the P2P online lending field.

Details

Journal of Systems and Information Technology, vol. 24 no. 2
Type: Research Article
ISSN: 1328-7265

Article
Publication date: 1 March 2012

Satu Kalliola and Jukka Niemelä

Outsourcing has gained favor since the 1980s, and Finnish paper companies used it as late as 2006, when a group of female cleaners were outsourced from the case plant of this…

Abstract

Outsourcing has gained favor since the 1980s, and Finnish paper companies used it as late as 2006, when a group of female cleaners were outsourced from the case plant of this study. This article focuses on the context of outsourcing, characterized by the bargaining power and choices made by the bargaining parties, the responses of the cleaners over time and the potential theoretical explanations of the outcome. The responses, such as disappointment and anger, mental and physical tiredness, sickness absenteeism, and starting to get adjusted, were interpreted in the frameworks of occupational culture, the job characteristics model, old and new craftsmanship, and relational and transactional psychological contracts. The method was a combination of naturalistic inquiry and abduction. The study points out that more than one theoretical framework was needed to gain an understanding of the situation.

Details

International Journal of Organization Theory & Behavior, vol. 15 no. 3
Type: Research Article
ISSN: 1093-4537

Article
Publication date: 27 March 2020

Vitaly Brazhkin

The purpose of this paper is to provide a comprehensive review of the respondents’ fraud phenomenon in online panel surveys, delineate data quality issues from surveys of broad…

Abstract

Purpose

The purpose of this paper is to provide a comprehensive review of the respondents’ fraud phenomenon in online panel surveys, delineate data quality issues from surveys of broad and narrow populations, alert fellow researchers to the higher incidence of respondents’ fraud in online panel surveys of narrow populations, such as logistics professionals, and recommend ways to protect the quality of data received from such surveys.

Design/methodology/approach

This general review paper has two parts, namely, descriptive and instructional. The current state of online survey and panel data use in supply chain research is examined first through a survey method literature review. Then, a more focused understanding of the phenomenon of fraud in surveys is provided through an analysis of online panel industry literature and psychological academic literature. Common survey design and data cleaning recommendations are critically assessed for their applicability to narrow populations. A survey of warehouse professionals is used to illustrate fraud detection techniques and glean additional, supply chain-specific data protection recommendations.
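As a rough illustration of what survey data cleaning screens can look like, the sketch below flags speeders, straight-liners and duplicate respondents; the column names and thresholds are assumptions for illustration, not the specific procedure the article recommends:

```python
# Sketch of generic online-panel screening checks (speeding, straight-lining,
# duplicate respondents). Column names and thresholds are illustrative only.
import pandas as pd

responses = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "duration_sec":  [620, 95, 540, 610],
    "ip":            ["1.1.1.1", "2.2.2.2", "2.2.2.2", "3.3.3.3"],
    "q1": [4, 3, 5, 2], "q2": [5, 3, 5, 4], "q3": [3, 3, 5, 4],
})
items = ["q1", "q2", "q3"]

flags = pd.DataFrame(index=responses.index)
flags["speeder"] = responses["duration_sec"] < 0.4 * responses["duration_sec"].median()
flags["straight_liner"] = responses[items].nunique(axis=1) == 1
flags["duplicate_ip"] = responses["ip"].duplicated(keep=False)

responses["suspect"] = flags.any(axis=1)
print(responses[["respondent_id", "suspect"]])
```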

Findings

Surveys of narrow populations, such as those typically targeted by supply chain researchers, are much more prone to respondents’ fraud. To protect and clean survey data, supply chain researchers need to use many measures that are different from those commonly recommended in methodological survey literature.

Research limitations/implications

For the first time, the need to distinguish between narrow and broad population surveys has been stated when it comes to data quality issues. The confusion and previously reported “mixed results” from literature reviews on the subject have been explained and a clear direction for future research is suggested: the two categories should be considered separately.

Practical implications

Specific fraud protection advice is provided to supply chain researchers on the strategic choices and specific aspects for all phases of surveying narrow populations, namely, survey preparation, administration and data cleaning.

Originality/value

This paper can greatly benefit researchers in several ways. It provides a comprehensive review and analysis of respondents’ fraud in online surveys, an issue poorly understood and rarely addressed in academic research. Drawing on literature from several fields, this paper offers, for the first time, a systematic set of recommendations for narrow population surveys by clearly contrasting them with general population surveys.

Details

Supply Chain Management: An International Journal, vol. 25 no. 4
Type: Research Article
ISSN: 1359-8546

Article
Publication date: 3 October 2016

Charlotte M. Karam and David A. Ralston

A large and growing number of researchers set out to cross-culturally examine empirical relationships. The purpose of this paper is to provide researchers, who are new to…

Abstract

Purpose

A large and growing number of researchers set out to cross-culturally examine empirical relationships. The purpose of this paper is to provide researchers, who are new to multicountry investigations, a discussion of the issues that one needs to address in order to be properly prepared to begin the cross-cultural analyses of relationships.

Design/methodology/approach

The authors consider two distinct but integrally connected challenges in getting ready to conduct the relevant analyses for such multicountry studies. The first challenge is to collect the data. The second challenge is to prepare (clean) the collected data for analysis. Accordingly, the authors divide the paper into two parts to discuss the steps involved in each for multicountry studies.

Findings

The authors highlight the fact that, in the process of collecting data, there are a number of key issues that should be kept in mind, including building trust with new team members, leading the team and determining sufficient contribution of team members for authorship. Subsequently, the authors draw the reader’s attention to the equally important, but often-overlooked, data cleaning process and the steps that constitute it. This is important because failing to take the quality of the data seriously can lead to violations of assumptions and mis-estimations of parameters and effects.

Originality/value

This paper provides a useful guide to assist researchers who are engaged in data collection and cleaning efforts with multiple country data sets. The review of the literature indicated how truly important a guideline of this nature is, given the expanding nature of cross-cultural investigations.

Details

Cross Cultural & Strategic Management, vol. 23 no. 4
Type: Research Article
ISSN: 2059-5794

Article
Publication date: 9 May 2016

Melinda Hodkiewicz and Mark Tien-Wei Ho

The purpose of this paper is to identify quality issues with using historical work order (WO) data from computerised maintenance management systems for reliability analysis; and…

Abstract

Purpose

The purpose of this paper is to identify quality issues with using historical work order (WO) data from computerised maintenance management systems for reliability analysis; and develop an efficient and transparent process to correct these data quality issues to ensure data is fit for purpose in a timely manner.

Design/methodology/approach

This paper develops a rule-based approach to data cleansing and demonstrates the process on data for heavy mobile equipment from a number of organisations.

Findings

Although historical WO records frequently contain missing or incorrect functional location, failure mode, maintenance action and WO status fields, the authors demonstrate it is possible to make these records fit for purpose by using data in the freeform text fields; an understanding of the maintenance tactics and practices at the operation; and knowledge of where the asset is in its life cycle. They also demonstrate that a repeatable and transparent process can be used for these data cleaning activities.
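The kind of rule-based cleansing described can be sketched as follows: keyword rules applied to the freeform work-order text fill in a missing failure-mode field. The keyword rules and field names are illustrative assumptions, not the authors' actual rule set:

```python
# Sketch of a rule-based cleansing step: infer a missing failure-mode field
# from keywords in the work order's freeform text. Rules are illustrative.
import pandas as pd

rules = {                     # keyword -> inferred failure mode
    "leak": "leaking",
    "crack": "cracked",
    "overheat": "overheating",
    "worn": "worn",
}

def infer_failure_mode(text):
    text = str(text).lower()
    for keyword, mode in rules.items():
        if keyword in text:
            return mode
    return None  # leave as missing if no rule fires

wo = pd.DataFrame({
    "description": ["Hydraulic hose leak on boom", "Bucket tooth worn out",
                    "Engine overheated during shift", "Scheduled 500h service"],
    "failure_mode": [None, None, None, None],
})
wo["failure_mode"] = wo["failure_mode"].fillna(wo["description"].map(infer_failure_mode))
print(wo)
```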

Originality/value

How engineers deal with raw maintenance data and the decisions they make in order to produce a data set for reliability analysis is seldom discussed in detail. Assumptions and actions are often left undocumented. This paper describes typical data cleaning decisions we all have to make as a routine part of the analysis and presents a process to support the data cleaning decisions in a repeatable and transparent fashion.

Details

Journal of Quality in Maintenance Engineering, vol. 22 no. 2
Type: Research Article
ISSN: 1355-2511

Article
Publication date: 24 March 2022

Mahmoud El Samad, Sam El Nemar, Georgia Sakka and Hani El-Chaarani

The purpose of this paper is to propose a new conceptual framework for big data analytics (BDA) in the healthcare sector for the European Mediterranean region. The objective of…

Abstract

Purpose

The purpose of this paper is to propose a new conceptual framework for big data analytics (BDA) in the healthcare sector for the European Mediterranean region. The objective of this new conceptual framework is to improve the health conditions in a dynamic region characterized by the appearance of new diseases.

Design/methodology/approach

This study presents a new conceptual framework that could be employed in the European Mediterranean healthcare sector. In practice, the framework can enhance medical services, support smart decisions based on accurate healthcare data and, ultimately, reduce medical treatment costs thanks to data quality control.

Findings

This research proposes a new conceptual framework for BDA in the healthcare sector that could be integrated in the European Mediterranean region. This framework introduces the big data quality (BDQ) module to filter and clean data provided by different European data sources. The BDQ module acts in a loop mode where bad data are redirected to their data source (e.g. European Centre for Disease Prevention and Control, university hospitals) to be corrected, improving the overall data quality in the proposed framework. Finally, clean data are directed to the BDA to support quick, efficient decisions involving all concerned stakeholders.
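The "loop mode" idea can be sketched as a simple routing step: records failing basic quality checks are sent back towards their originating source, while clean records pass on to the analytics stage. The record fields and validation rules below are assumptions for illustration, not part of the proposed framework itself:

```python
# Sketch of the loop-mode routing: invalid records go back for correction,
# clean records go on to analytics. Fields and rules are illustrative only.
def is_valid(record):
    return (record.get("patient_age") is not None
            and 0 <= record["patient_age"] <= 120
            and record.get("diagnosis_code"))

def bdq_module(records):
    clean, rejected = [], []
    for rec in records:
        (clean if is_valid(rec) else rejected).append(rec)
    # In the framework, rejected records would be returned to their data
    # source (e.g. a hospital system) to be corrected and resubmitted.
    return clean, rejected

incoming = [
    {"source": "hospital_A", "patient_age": 54, "diagnosis_code": "J10"},
    {"source": "hospital_B", "patient_age": -3, "diagnosis_code": "J10"},
    {"source": "ECDC_feed",  "patient_age": 67, "diagnosis_code": ""},
]
clean, rejected = bdq_module(incoming)
print(len(clean), "clean,", len(rejected), "sent back for correction")
```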

Practical implications

This study proposes a new conceptual framework for executives in the healthcare sector to improve the decision-making process, decrease operational costs, enhance management performance and save human lives.

Originality/value

This study focused on big data management and BDQ in the European Mediterranean healthcare sector, broadly considered a fundamental condition for the quality of medical services and health conditions.

Details

EuroMed Journal of Business, vol. 17 no. 3
Type: Research Article
ISSN: 1450-2194
