The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the scientific research activity of the University of Novi Sad (CRIS UNS).
The system is based on machine learning and performs automatic extraction and classification of metadata in eight pre‐defined categories. The extraction task is realised as a classification process. For the purpose of classification each row of text is represented with a vector that comprises different features: formatting, position, characteristics related to the words, etc. Experiments were performed with standard classification models. Both a single classifier with all eight categories and eight individual classifiers were tested. Classifiers were evaluated using the five‐fold cross validation, on a manually annotated corpus comprising 100 scientific papers in PDF format, collected from various conferences, journals and authors' personal web pages.
Based on the performances obtained on classification experiments, eight separate support vector machines (SVM) models (each of which recognises its corresponding category) were chosen. All eight models were established to have a good performance. The F‐measure was over 85 per cent for almost all of the classifiers and over 90 per cent for most of them.
Automatically extracted metadata cannot be directly entered into CRIS UNS but requires control of the curators.
The proposed system for automatic metadata extraction using support vector machines model was integrated into the software system, CRIS UNS. Metadata extraction has been tested on the publications of researchers from the Department of Mathematics and Informatics of the Faculty of Sciences in Novi Sad. Analysis of extracted metadata from these publications showed that the performance of the system for the previously unseen data is in accordance with that obtained by the cross‐validation from eight separate SVM classifiers. This system will help in the process of synchronising metadata from CRIS UNS with other institutional repositories.
The paper documents a fully automated system for metadata extraction from scientific papers that was developed. The system is based on the SVM classifier and open source tools, and is capable of extracting eight types of metadata from scientific articles of any format that can be converted to PDF. Although developed as part of CRIS UNS, the proposed system can be integrated into other CRIS systems, as well as institutional repositories and library management systems.
Kovačević, A., Ivanović, D., Milosavljević, B., Konjović, Z. and Surla, D. (2011), "Automatic extraction of metadata from scientific publications for CRIS systems", Program: electronic library and information systems, Vol. 45 No. 4, pp. 376-396. https://doi.org/10.1108/00330331111182094
Emerald Group Publishing Limited
Copyright © 2011, Emerald Group Publishing Limited