Identification of social scientifically relevant topics in an interview repository: a natural language processing experiment

Judit Gárdos (Centre for Social Sciences, Budapest, Hungary)
Julia Egyed-Gergely (Centre for Social Sciences, Budapest, Hungary)
Anna Horváth (Centre for Social Sciences, Budapest, Hungary)
Balázs Pataki (Department of Distributed Systems, Institute for Computer Science and Control, Budapest, Hungary)
Roza Vajda (Centre for Social Sciences, Budapest, Hungary)
András Micsik (Department of Distributed Systems, Institute for Computer Science and Control, Budapest, Hungary)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 13 October 2023

Issue publication date: 22 February 2024

Abstract

Purpose

This study concerns the generation of metadata to enhance thematic transparency and facilitate research on the interview collections of the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing social science data, and its potential to generate useful metadata describing the contents of such archives on a large scale.

Design/methodology/approach

The authors combined manual and automated/semi-automated methods of metadata development and curation. The authors developed a suitable domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, the authors adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant in social sciences. The authors identified and tested the most promising natural language processing (NLP) tools supporting the Hungarian language. The results of manual and machine coding will be presented in a user interface.

Findings

The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored for use by automated NLP tools. The authors show the potential and limitations of existing and new NLP methods for thematic assignment. They also discuss the current possibilities of multi-label classification in social scientific metadata assignment, that is, the automated selection of relevant labels from a large pool.

Originality/value

Interview materials have not yet been used for building manually annotated training datasets for automated indexing of scientifically relevant topics in a data repository. Comparing various automated-indexing methods, this study shows a possible implementation of a researcher tool supporting custom visualizations and the faceted search of interview collections.

Acknowledgements

The project presented in this publication, implemented by the Research Documentation Centre of the Centre for Social Sciences (TK KDK) and the Department of Distributed Systems of the Institute for Computer Science and Control (SZTAKI DSD), was supported by the European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artificial Intelligence National Laboratory. The authors are thankful to Veronika Lipp, Dániel Martin, Attila Marx, Márton Matyasovszky-Németh, Enikő Meiszterics, Mária Neményi, Tamás P. Tóth, Bálint Sass and Melinda Siket for their valuable contributions to this project.

Citation

Gárdos, J., Egyed-Gergely, J., Horváth, A., Pataki, B., Vajda, R. and Micsik, A. (2024), "Identification of social scientifically relevant topics in an interview repository: a natural language processing experiment", Journal of Documentation, Vol. 80 No. 2, pp. 354-377. https://doi.org/10.1108/JD-12-2022-0269

Publisher

Emerald Publishing Limited

Copyright © 2023, Emerald Publishing Limited
