“At scale” author name matching with Hadoop/MapReduce

James Powell (Research Library, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)
Linn Collins (Research Library, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)
Ariane Eberhardt (International Research and Analysis Group, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)
David Izraelevitz (Risk Analysis & Decision Support, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)
Jorge Roman (Scientific Software Engineering, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)
Thomas Dufresne (Risk Analysis & Decision Support, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)
Mark Scott (International Research and Analysis Group, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)
Miriam Blake (Research Library, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)
Gary Grider (High Performance Computing, Los Alamos National Laboratory, Los Alamos, New Mexico, USA)

Library Hi Tech News

ISSN: 0741-9058

Article publication date: 1 June 2012

Abstract

Purpose

The purpose of this paper is to describe a process for extracting and matching author names from large collections of bibliographic metadata using the Hadoop implementation of MapReduce. It considers the challenges and risks associated with name matching on such a large scale and proposes simple matching heuristics for the reduce process. The resulting semantic graphs of authors link names to publications and include additional features such as phonetic representations of author last names. The authors believe that this achieves an appropriate level of matching at scale and enables further matching to be performed with graph analysis tools.
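As an illustration of what a phonetic representation of a last name can look like, the following is a minimal Python sketch of American Soundex. The abstract does not say which phonetic algorithm the authors used, so the choice of Soundex and the function name `soundex` are assumptions made here for illustration only.

```python
# Illustrative sketch only: the abstract does not name the phonetic
# algorithm used, so American Soundex is an assumption.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(last_name: str) -> str:
    """Return the 4-character Soundex code for a last name ('' if no letters)."""
    letters = [ch for ch in last_name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = [first]
    prev = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(encoded) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, which is why a feature of this kind can help later graph analysis resolve near-matches that strict string rules leave apart.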

Design/methodology/approach

A topically focused collection of metadata records describing peer‐reviewed papers was generated based upon a search. The matching records were harvested and stored in the Hadoop Distributed File System (HDFS) for processing by Hadoop. A MapReduce job was written to perform coarse‐grained author name matching, and multiple papers were matched with authors when the names were very similar or identical. Semantic graphs were generated so that they could be analyzed to perform finer‐grained matching, for example by using other metadata such as subject headings.
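The shape of such a job can be sketched in a few lines of Python that simulate the map, shuffle, and reduce phases in a single process. This is not the authors' code: the record layout (`id`, `authors`), the "Last, Given" name format, the function names, and the deliberately strict matching rule (names match only when they are identical after removing case, periods, and extra whitespace) are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical sketch: record layout, name format, and matching rule are
# assumptions, not details taken from the paper.


def normalize(name):
    """Turn 'Powell, James E.' into the key ('powell', 'james e')."""
    last, _, given = name.partition(",")

    def clean(part):
        return " ".join(part.replace(".", " ").lower().split())

    return clean(last), clean(given)


def map_record(record):
    """Map phase: emit one (author_key, (raw_name, paper_id)) pair per author."""
    for name in record["authors"]:
        yield normalize(name), (name, record["id"])


def reduce_author(key, values):
    """Reduce phase: collect the name variants and papers that share a key."""
    variants = sorted({name for name, _ in values})
    papers = sorted({paper_id for _, paper_id in values})
    return {"author_key": key, "variants": variants, "papers": papers}


def run_job(records):
    """Group mapper output by key (standing in for Hadoop's shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return [reduce_author(key, values) for key, values in sorted(groups.items())]


records = [
    {"id": "p1", "authors": ["Powell, James", "Collins, L."]},
    {"id": "p2", "authors": ["powell,  james", "Collins, L"]},
    {"id": "p3", "authors": ["Powell, John"]},
]
result = run_job(records)
for author in result:
    print(author)
```

In this sketch the two spellings of "Powell, James" collapse onto one author node linked to both papers, while "Powell, John" stays separate. Each reducer output corresponds to one author node with edges to its papers, which is the kind of structure a later graph analysis step could refine.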

Findings

When performing author name matching at scale using MapReduce, the heuristics that determine whether names match should be limited to the rules that yield the most reliable results. Unreliable rules produce many errors at scale. MapReduce can also be used to generate or extract other data that might help resolve similar names when stricter rules fail to do so. The authors also found that matching is more reliable within a well‐defined topic domain.

Originality/value

Libraries face some of the same big data challenges found in data‐driven science. Big data tools such as Hadoop can be used to explore large metadata collections, and these collections can serve as surrogates for other real‐world big data problems. MapReduce activities need to be appropriately scoped to yield good results, and code should be checked for problems, which can be magnified in the output of a MapReduce job.

Citation

Powell, J., Collins, L., Eberhardt, A., Izraelevitz, D., Roman, J., Dufresne, T., Scott, M., Blake, M. and Grider, G. (2012), "“At scale” author name matching with Hadoop/MapReduce", Library Hi Tech News, Vol. 29 No. 4, pp. 6-12. https://doi.org/10.1108/07419051211249455

Publisher

Emerald Group Publishing Limited

Copyright © 2012, Emerald Group Publishing Limited
