A clustering approach for data quality results of research information systems

Reza Edris Abadi (Central Tehran Branch, Islamic Azad University, Tehran, Iran)
Mohammad Javad Ershadi (Information Technology Department, Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran)
Seyed Taghi Akhavan Niaki (Industrial Engineering Department, Sharif University of Technology, Tehran, Iran)

Information Discovery and Delivery

ISSN: 2398-6247

Article publication date: 3 November 2022

Issue publication date: 24 November 2023

Abstract

Purpose

The overall goal of the data mining process is to extract information from an extensive data set and make it understandable for further use. When working with large volumes of unstructured data in research information systems, the quality of the information must be examined and the information divided into logical groupings before it is analyzed. In addition, data quality results are valuable resources for defining quality excellence programs for any information system. Hence, the purpose of this study is to discover and extract knowledge to evaluate and improve data quality in research information systems.

Design/methodology/approach

Clustering data and exploiting the resulting outputs gives practitioners an in-depth and extensive view of their information, from which they can form logical structures. In this study, data are first extracted from an information system. The data quality results are then classified into an organized structure based on data quality dimension standards. Finally, a partitioning algorithm (K-Means), a density-based algorithm (density-based spatial clustering of applications with noise [DBSCAN]) and a hierarchical algorithm (balanced iterative reducing and clustering using hierarchies [BIRCH]) are applied and compared to find the most appropriate clustering algorithm for the research information system.
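A comparison of this kind can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins for per-record quality scores, and the cluster count, `eps`, `min_samples` and `threshold` values are arbitrary choices rather than settings reported in the paper. The silhouette coefficient is used as the comparison score.

```python
from sklearn.cluster import Birch, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data quality scores: 300 records, six columns
# (one per quality dimension). These are NOT the paper's data.
X, _ = make_blobs(n_samples=300, centers=3, n_features=6,
                  cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

# Hyperparameters below are illustrative assumptions.
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "BIRCH": Birch(n_clusters=3, threshold=0.5),
}


def compare(X, models):
    """Fit each model and return its silhouette score (None if undefined)."""
    scores = {}
    for name, model in models.items():
        labels = model.fit_predict(X)
        mask = labels != -1  # DBSCAN labels noise points as -1; drop them
        n_clusters = len(set(labels[mask]))
        if n_clusters < 2:
            scores[name] = None  # silhouette needs at least two clusters
        else:
            scores[name] = float(silhouette_score(X[mask], labels[mask]))
    return scores


scores = compare(X, models)
for name, score in scores.items():
    print(name, score)
```

Excluding DBSCAN noise points before scoring is one possible convention; keeping them as their own label would generally lower the DBSCAN score, so the choice should be stated whenever results are reported.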

Findings

This paper shows that the quality control results of an information system can be categorized by well-known data quality dimensions, including precision, accuracy, completeness, consistency, reputation and timeliness. Furthermore, among the clustering approaches compared, the hierarchical BIRCH algorithm clusters the data best and yields the highest silhouette coefficient. DBSCAN ranks second, ahead of K-Means.
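For reference, the silhouette coefficient is conventionally defined as below. This is the standard textbook definition; the abstract does not state which distance measure or variant the authors used, so the symbols are assumptions. For a record $i$, $a(i)$ is its mean distance to the other members of its own cluster and $b(i)$ is its mean distance to the members of the nearest other cluster:

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1, \qquad S = \frac{1}{n} \sum_{i=1}^{n} s(i)
\]

Values of $S$ close to 1 indicate compact, well-separated clusters, which is the sense in which a higher value favors one algorithm over another.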

Research limitations/implications

In the data quality assessment process, the identified discrepancies and the lack of a proper classification for inconsistent data have led to unstructured reports. This makes statistical analysis of qualitative metadata problems difficult and, in turn, makes it impossible to find the root causes of the observed errors. Therefore, in this study, the data quality evaluation results are categorized into data quality dimensions, and multiple data mining analyses are performed on that basis.

Originality/value

Although several studies have assessed the data quality of research information systems, extracting knowledge from the resulting data quality scores is an important task that has rarely been studied in the literature. In addition, clustering data quality results and exploiting the outputs gives practitioners an in-depth and extensive view of their information, from which they can form logical structures.

Citation

Edris Abadi, R., Ershadi, M.J. and Niaki, S.T.A. (2023), "A clustering approach for data quality results of research information systems", Information Discovery and Delivery, Vol. 51 No. 4, pp. 337-348. https://doi.org/10.1108/IDD-07-2022-0063

Publisher

Emerald Publishing Limited

Copyright © 2022, Emerald Publishing Limited
