
Deep learning based approach to unstructured record linkage

Anna Jurek-Loughrey (Department of Computer Science, Queen’s University Belfast, Belfast, UK)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 18 October 2021

Issue publication date: 1 December 2021


Abstract

Purpose

In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is the task of identifying and linking records from multiple sources that describe the same real-world object (e.g. a person), and it plays a crucial role in the data integration process. RL is challenging because different data sources rarely share a unique identifier. Hence, records must be matched by comparing their corresponding values. Most existing RL techniques assume that records across different data sources are structured and represented by the same schema (i.e. set of attributes). Given the increasing number of heterogeneous data sources, those assumptions are rather unrealistic. The purpose of this paper is to propose a novel RL model for unstructured data.

Design/methodology/approach

In previous work (Jurek-Loughrey, 2020), the authors proposed a novel approach to linking unstructured data based on the Siamese Multilayer Perceptron model. It was demonstrated that the method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands that work, originally presented at iiWAS2020 [16], by exploring new architectures of the Siamese Neural Network that improve the generalisation of the RL model and make it less sensitive to parameter selection.
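As an illustration of the general Siamese idea only, the following is a minimal sketch and not the architecture evaluated in the paper. It assumes that each unstructured record is a free-text string, that records are featurised by hashing character trigrams (an assumption made here for simplicity), and that both records pass through one shared two-layer perceptron before their embeddings are compared by Euclidean distance. The names `featurise` and `SiameseMLP`, the layer sizes and the random weights are all hypothetical; in practice the weights would be learned, for example with a contrastive loss over labelled matching and non-matching pairs.

```python
import math
import random


def featurise(text, dim=64):
    """Hash character trigrams of a free-text record into a count vector.

    Assumption for this sketch: the paper's actual input representation
    is not described in the abstract.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        # Deterministic rolling hash (built-in hash() is salted per process).
        h = 0
        for ch in padded[i:i + 3]:
            h = (h * 31 + ord(ch)) % dim
        vec[h] += 1.0
    return vec


class SiameseMLP:
    """Two records share ONE encoder; similarity is a distance between embeddings."""

    def __init__(self, in_dim=64, hidden_dim=16, out_dim=8, seed=0):
        rng = random.Random(seed)
        self.in_dim = in_dim
        # Untrained, randomly initialised weights (illustrative only).
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)]
                   for _ in range(out_dim)]

    def encode(self, x):
        # Hidden layer with ReLU activation, then a linear output layer.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]

    def distance(self, text_a, text_b):
        # The same weights encode both inputs: this is the "Siamese" property.
        ea = self.encode(featurise(text_a, self.in_dim))
        eb = self.encode(featurise(text_b, self.in_dim))
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(ea, eb)))


model = SiameseMLP()
d_same = model.distance("John Smith, Belfast", "John Smith, Belfast")
d_diff = model.distance("John Smith, Belfast", "Acme Ltd, 12 High Street")
```

Because both inputs are encoded with the same weights, identical records always map to distance zero, and the distance is symmetric in its arguments. With untrained weights the distance between different records carries no linkage meaning; a trained model would be paired with a threshold on the distance to decide whether two records match.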

Findings

The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results than the Siamese Multilayer Perceptron model proposed in (Jurek et al., 2020), with improvements on three out of four data sets. Furthermore, the second proposed (hybrid) architecture, which integrates the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable with respect to parameter selection.

Originality/value

To address the problem of unstructured RL, this paper presents a new deep learning based approach that improves the generalisation of the Siamese Multilayer Perceptron model and makes it less sensitive to parameter selection.

Citation

Jurek-Loughrey, A. (2021), "Deep learning based approach to unstructured record linkage", International Journal of Web Information Systems, Vol. 17 No. 6, pp. 607-621. https://doi.org/10.1108/IJWIS-05-2021-0058

Publisher

Emerald Publishing Limited

Copyright © 2021, Emerald Publishing Limited
