Automatic extraction of metadata from scientific publications for CRIS systems

Aleksandar Kovačević (Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia)
Dragan Ivanović (Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia)
Branko Milosavljević (Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia)
Zora Konjović (Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia)
Dušan Surla (Faculty of Science, University of Novi Sad, Novi Sad, Serbia)

Program: electronic library and information systems

ISSN: 0033-0337

Article publication date: 27 September 2011

Abstract

Purpose

The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the scientific research activity of the University of Novi Sad (CRIS UNS).

Design/methodology/approach

The system uses machine learning to extract metadata automatically and classify it into eight pre‐defined categories. Extraction is treated as a classification task: each line of text is represented by a feature vector covering formatting, position, word‐related characteristics, etc. Experiments were performed with standard classification models. Two configurations were tested: a single classifier covering all eight categories, and eight individual classifiers. The classifiers were evaluated using five‐fold cross‐validation on a manually annotated corpus of 100 scientific papers in PDF format, collected from various conferences, journals and authors' personal web pages.
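As an illustration only, the line representation and the two experimental set-ups could be sketched as follows. The abstract does not list the actual features or category names, so the `Line` fields, the five features in `featurize`, the labels such as "title", and the helpers `one_vs_rest_labels` and `k_fold_indices` are all assumptions made for this sketch, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of text from a converted PDF (hypothetical structure)."""
    text: str
    font_size: float   # formatting feature, assumed to come from the PDF converter
    index: int         # zero-based position of the line in the document
    total_lines: int   # number of lines in the document


def featurize(line):
    """Map a line to a numeric feature vector (illustrative features only)."""
    words = line.text.split()
    n_words = len(words)
    capitalised = sum(1 for w in words if w[:1].isupper())
    return [
        line.font_size,                              # formatting
        line.index / max(line.total_lines - 1, 1),   # relative position
        float(n_words),                              # word count
        capitalised / n_words if n_words else 0.0,   # share of capitalised words
        1.0 if "@" in line.text else 0.0,            # e-mail hint
    ]


def one_vs_rest_labels(labels, category):
    """Binary targets for one of the individual per-category classifiers."""
    return [1 if label == category else 0 for label in labels]


def k_fold_indices(n_items, k=5):
    """Split item indices into k disjoint folds for cross-validation."""
    return [list(range(start, n_items, k)) for start in range(k)]
```

In the single-classifier set-up the original multi-class labels would be used directly; in the eight-classifier set-up `one_vs_rest_labels` would be called once per category and a separate model trained on each binary target.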

Findings

Based on the performance obtained in the classification experiments, eight separate support vector machine (SVM) models were chosen, each recognising its own category. All eight models performed well: the F‐measure was over 85 per cent for almost all of the classifiers and over 90 per cent for most of them.
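For reference, the F‐measure of a binary classifier can be computed from its true positives, false positives and false negatives. The sketch below assumes the balanced form (F1, the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_measure` is chosen here for illustration.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1), assumed here; returns a value in [0, 1]."""
    predicted = true_pos + false_pos   # lines the classifier assigned to the category
    actual = true_pos + false_neg      # lines that truly belong to the category
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A per-category score would be obtained by calling the function once for each of the eight classifiers with that classifier's counts.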

Research limitations/implications

Automatically extracted metadata cannot be entered directly into CRIS UNS; it must first be checked by curators.

Practical implications

The proposed system for automatic metadata extraction using support vector machine models was integrated into the CRIS UNS software system. Metadata extraction was tested on the publications of researchers from the Department of Mathematics and Informatics of the Faculty of Sciences in Novi Sad. Analysis of the metadata extracted from these publications showed that the system's performance on previously unseen data is in line with the cross‐validation results for the eight separate SVM classifiers. The system will help in synchronising metadata from CRIS UNS with other institutional repositories.

Originality/value

The paper documents the development of a fully automated system for metadata extraction from scientific papers. The system is based on SVM classifiers and open source tools, and can extract eight types of metadata from scientific articles in any format that can be converted to PDF. Although developed as part of CRIS UNS, the proposed system can be integrated into other CRIS systems, as well as institutional repositories and library management systems.

Citation

Kovačević, A., Ivanović, D., Milosavljević, B., Konjović, Z. and Surla, D. (2011), "Automatic extraction of metadata from scientific publications for CRIS systems", Program: electronic library and information systems, Vol. 45 No. 4, pp. 376-396. https://doi.org/10.1108/00330331111182094

Publisher

Emerald Group Publishing Limited

Copyright © 2011, Emerald Group Publishing Limited