Search results

1 – 10 of over 1000
Article
Publication date: 27 September 2011

Aleksandar Kovačević, Dragan Ivanović, Branko Milosavljević, Zora Konjović and Dušan Surla

Abstract

Purpose

The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the scientific research activity of the University of Novi Sad (CRIS UNS).

Design/methodology/approach

The system is based on machine learning and performs automatic extraction and classification of metadata into eight pre‐defined categories. The extraction task is realised as a classification process. For the purpose of classification, each line of text is represented by a vector that comprises different features: formatting, position, characteristics related to the words, etc. Experiments were performed with standard classification models. Both a single classifier covering all eight categories and eight individual classifiers were tested. Classifiers were evaluated using five‐fold cross‐validation on a manually annotated corpus comprising 100 scientific papers in PDF format, collected from various conferences, journals and authors' personal web pages.
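
As an illustration of this line-classification setup, here is a minimal sketch assuming scikit-learn; the feature set, toy data and category are invented for the example and are not the authors' exact configuration.

```python
# Minimal sketch of per-line metadata classification, assuming scikit-learn.
# Features and data are illustrative, not the authors' setup.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def line_features(text, y_pos, rel_font_size):
    """One line of text as a feature dict: formatting, page position,
    and word-level characteristics (cf. the abstract above)."""
    words = text.split()
    return {
        "rel_font_size": rel_font_size,   # line font size / body font size
        "y_pos": y_pos,                   # normalised vertical position on page
        "n_words": len(words),
        "has_at_sign": "@" in text,       # e-mail hint (authors/affiliations)
        "cap_ratio": sum(w[:1].isupper() for w in words) / max(len(words), 1),
    }

# Toy annotated lines; label 1 = "title", 0 = anything else. The paper
# trains one such binary SVM per metadata category.
annotated = [
    ("A System for Metadata Extraction", 0.05, 1.8, 1),
    ("John Doe and Jane Roe", 0.10, 1.1, 0),
    ("jdoe@example.org", 0.12, 1.0, 0),
    ("Abstract. We present a system for extracting metadata.", 0.20, 1.0, 0),
] * 10  # repeated so five-fold cross-validation has enough samples

X = [line_features(text, y, f) for text, y, f, _ in annotated]
y = [label for *_, label in annotated]

model = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
print("five-fold F1:", cross_val_score(model, X, y, cv=5, scoring="f1").mean())
```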

Findings

Based on the performance obtained in the classification experiments, eight separate support vector machine (SVM) models (each of which recognises its corresponding category) were chosen. All eight models were found to perform well: the F‐measure was over 85 per cent for almost all of the classifiers and over 90 per cent for most of them.
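
For reference, the F-measure reported here (and in several of the results below) is, assuming the usual balanced variant, the harmonic mean of precision P and recall R:

```latex
F_{1} = \frac{2\,P\,R}{P + R},
\qquad
P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},
\qquad
R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
```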

Research limitations/implications

Automatically extracted metadata cannot be entered into CRIS UNS directly but requires review by curators.

Practical implications

The proposed system for automatic metadata extraction using SVM models was integrated into the software system CRIS UNS. Metadata extraction was tested on the publications of researchers from the Department of Mathematics and Informatics of the Faculty of Sciences in Novi Sad. Analysis of the metadata extracted from these publications showed that the system's performance on previously unseen data is consistent with that obtained by cross‐validation of the eight separate SVM classifiers. This system will help in synchronising metadata between CRIS UNS and other institutional repositories.

Originality/value

The paper documents a fully automated system for metadata extraction from scientific papers. The system is based on SVM classifiers and open source tools, and is capable of extracting eight types of metadata from scientific articles in any format that can be converted to PDF. Although developed as part of CRIS UNS, the proposed system can be integrated into other CRIS systems, as well as institutional repositories and library management systems.

Article
Publication date: 1 July 2014

Wen-Feng Hsiao, Te-Min Chang and Erwin Thomas

Abstract

Purpose

The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in portable document formats (PDFs).

Design/methodology/approach

The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a hidden Markov model (HMM) to extract the titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS and Google Scholar) to retrieve the rest of the metadata.
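
To make the HMM step concrete, the following toy sketch labels header tokens as title/author/other with Viterbi decoding. The states, transition probabilities and emission heuristics are hand-set illustrative values, not the authors' model trained on the Cora header set.

```python
# Toy HMM decoding for header-field labelling; all probabilities are
# illustrative, not trained values.
import math

STATES = ["title", "author", "other"]
START = {"title": 0.8, "author": 0.1, "other": 0.1}
TRANS = {
    "title":  {"title": 0.7,  "author": 0.2,  "other": 0.1},
    "author": {"title": 0.05, "author": 0.75, "other": 0.2},
    "other":  {"title": 0.05, "author": 0.15, "other": 0.8},
}

def emission(state, token):
    """Toy word-feature emissions: '@' or digits favour 'other'
    (e-mails, affiliations); capitalised tokens favour title/author."""
    if "@" in token or any(c.isdigit() for c in token):
        return {"title": 0.05, "author": 0.05, "other": 0.9}[state]
    if token[:1].isupper():
        return {"title": 0.5, "author": 0.4, "other": 0.1}[state]
    return {"title": 0.4, "author": 0.1, "other": 0.5}[state]

def viterbi(tokens):
    """Most likely state sequence for the observed tokens."""
    V = [{s: math.log(START[s]) + math.log(emission(s, tokens[0])) for s in STATES}]
    back = []
    for tok in tokens[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            scores[s] = V[-1][best] + math.log(TRANS[best][s]) + math.log(emission(s, tok))
            ptr[s] = best
        V.append(scores)
        back.append(ptr)
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

tokens = "Automatic Metadata Extraction John Doe jdoe@example.org".split()
print(list(zip(tokens, viterbi(tokens))))
```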

Findings

Four experiments are conducted to examine the feasibility of the proposed system. The first compares two HMM variants: a multi-state model and the proposed one-state model. The results show that the one-state model achieves performance comparable to the multi-state model but is better suited to handling real-world unknown states. The second experiment shows that the proposed model (without the aid of online queries) performs as well as other researchers' models on the Cora paper header dataset. The third experiment examines the system's performance on a small dataset of 43 real PDF research papers; the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, a fourth experiment on a larger dataset of 103 papers compares the system with Zotero 4.0, which it significantly outperforms. The feasibility of the proposed model is thus justified.

Research limitations/implications

In terms of academic implications, the system is unique in two respects: first, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which keeps the system light and scalable. Second, the system is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.

Practical implications

In terms of practical implications, the system outperforms an existing tool, Zotero v4.0. This gives practitioners a good opportunity to develop similar products for real applications, though doing so may require some knowledge of HMM implementation.

Originality/value

The HMM implementation is not novel; what is innovative is that it combines two HMM models. The main model is adapted from Freitag and McCallum (1999), to which the authors add the word features of the Nymble HMM (Bikel et al., 1997). The system works without manually tagging the datasets before training the model (the authors use only the Cora dataset for training and then test on real-world PDF papers), which differs significantly from what other works have done so far. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.

Details

Program, vol. 48 no. 3
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 8 August 2008

Alexander Ivanyukovich, Maurizio Marchese and Fausto Giunchiglia

Abstract

Purpose

The purpose of this paper is to provide support for automation of the annotation process of large corpora of digital content.

Design/methodology/approach

The paper presents and discusses an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed.

Findings

The proposed pipeline is implemented in a working prototype of an autonomous digital library (A‐DL) system called ScienceTreks that: supports a broad range of methods for document acquisition; does not rely on any external information sources, being based solely on the information present in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (APIs) to support easy integration of external systems and tools into the existing pipeline.
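
A rough sketch of such a staged pipeline with a registration point for external tools might look as follows; the class and function names are illustrative and are not the ScienceTreks API.

```python
# Illustrative staged pipeline: acquisition -> extraction -> management,
# with a plug-in point for external tools. Names are hypothetical.
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

class Pipeline:
    """Ordered stages applied to a document record; external systems can
    register new stages, mirroring the API-integration idea above."""
    def __init__(self) -> None:
        self.stages: List[Stage] = []

    def register(self, stage: Stage) -> "Pipeline":
        self.stages.append(stage)
        return self

    def run(self, doc: Dict) -> Dict:
        for stage in self.stages:
            doc = stage(doc)
        return doc

def acquire(doc):   # e.g. fetch the document bytes from a crawler or archive
    doc["raw"] = b"%PDF-..."
    return doc

def extract(doc):   # derive metadata purely from the document itself
    doc["metadata"] = {"title": "Example Title"}
    return doc

def manage(doc):    # hand the annotated record to archive management
    print("stored:", doc["metadata"])
    return doc

Pipeline().register(acquire).register(extract).register(manage).run({"id": "doc-1"})
```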

Practical implications

The proposed A‐DL system can be used in automating end‐to‐end information retrieval and processing, supporting the control and elimination of error‐prone human intervention in the process.

Originality/value

High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value‐added services within existing and future semantic‐enabled digital library systems.

Details

Online Information Review, vol. 32 no. 4
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 19 June 2017

Mohammed Ourabah Soualah, Yassine Ait Ali Yahia, Abdelkader Keita and Abderrezak Guessoum

Abstract

Purpose

The purpose of this paper is to provide online access to digitised images of Arabic manuscripts, which requires a catalogue. Bibliographic cataloguing is unsuitable for old Arabic manuscripts, so it is imperative to establish a new cataloguing model. The authors propose a new cataloguing model based on manuscript annotations and transcriptions, which can be an effective solution for dynamically cataloguing old Arabic manuscripts. In this context, the authors use automatic extraction of metadata based on the structural similarity of the documents.

Design/methodology/approach

This work is based on an experimental methodology. All of the proposed concepts and formulas were tested for validation, which allows the authors to draw concise conclusions.

Findings

Cataloguing old Arabic manuscripts faces the problem of unavailable information. However, this information may be found elsewhere, in a copy of the original manuscript. Thus, cataloguing an Arabic manuscript cannot be done in one pass; it is a continual process that requires updating the information. The idea is to pre-catalogue a manuscript and then complete and improve the record through a specific platform. Consequently, the authors propose a new cataloguing model, which they call “dynamic cataloguing”.

Research limitations/implications

The success of the proposed model depends on the involvement of all of its actors; it relies on the conviction and motivation of the actors in the collaborative platform.

Practical implications

The model can be used in several cataloguing fields where the encoding model is based on XML. It is innovative, implements smart cataloguing, is delivered through a web platform and allows automatic updating of a catalogue, as sketched below.
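
As a minimal sketch of this idea, the snippet below merges metadata extracted from a user annotation into an XML-encoded record; the element names are hypothetical, not a real cataloguing schema.

```python
# "Dynamic cataloguing" sketch: metadata from a validated user annotation
# is merged into an XML catalogue record. Element names are hypothetical.
import xml.etree.ElementTree as ET

record = ET.fromstring(
    "<manuscript id='ms-042'>"
    "<title>Untitled</title>"
    "<copyist/>"               # unknown at pre-cataloguing time
    "</manuscript>"
)

def apply_annotation(record, field, value):
    """Fill or update one catalogue field from a user annotation."""
    el = record.find(field)
    if el is None:
        el = ET.SubElement(record, field)
    if not (el.text or "").strip():   # only fill fields still empty
        el.text = value
    return record

# A transcriber identifies the copyist in a note found in another copy:
apply_annotation(record, "copyist", "attributed copyist name")
print(ET.tostring(record, encoding="unicode"))
```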

Social implications

The model prompts the user to participate in and enrich the catalogue. Users can thereby move from a passive to an active role.

Originality/value

The dynamic cataloguing model is a new concept that has not previously been proposed in the literature. It is based on automatic extraction of metadata from user annotations and transcriptions, and it is a smart system that automatically updates or fills the catalogue with the extracted metadata.

Article
Publication date: 20 June 2016

Götz Hatop

Abstract

Purpose

The academic tradition of adding a reference section, with references to cited and otherwise related academic material, provides a natural starting point for finding links to other publications. These links can then be published as linked data. Natural language processing technologies available today can perform the task of bibliographical reference extraction from text. Publishing references by means of semantic web technologies is a prerequisite for a broader study and analysis of citations, and thus can help to improve academic communication in a general sense. The paper aims to discuss these issues.

Design/methodology/approach

This paper examines the overall workflow required to extract, analyze and semantically publish bibliographical references within an Institutional Repository with the help of open source software components.
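
As a small illustration of the semantic publishing step, the sketch below expresses one extracted reference as linked data with rdflib and the CiTO ontology; the article and DOI URIs are placeholders, and this is not necessarily the exact component stack used in the paper.

```python
# Publishing an extracted reference as linked data, assuming rdflib and
# the CiTO citation ontology. URIs are illustrative placeholders.
from rdflib import Graph, Namespace, URIRef

CITO = Namespace("http://purl.org/spar/cito/")

g = Graph()
g.bind("cito", CITO)

citing = URIRef("https://repository.example.org/article/123")
cited = URIRef("https://doi.org/10.1000/xyz456")  # resolved from the reference string
g.add((citing, CITO.cites, cited))

print(g.serialize(format="turtle"))
```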

Findings

A publication infrastructure where references are available to software agents would enable additional benefits such as citation analysis, e.g. collecting the citations of a known paper and investigating citation sentiment. The publication of reference information as demonstrated in this article is possible with existing semantic web technologies, based on established ontologies and open source software components.

Research limitations/implications

Only a limited number of metadata extraction programs were considered for performance evaluation, and reference extraction was tested on journal articles only, whereas Institutional Repositories usually contain a large amount of other material, such as monographs. Also, citation analysis is in an experimental state, and citation sentiment is currently not published at all. For future work, distributing reference information between repositories is an important problem that needs to be tackled.

Originality/value

Publishing reference information as linked data is new within the academic publishing domain.

Details

Library Hi Tech, vol. 34 no. 2
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 28 October 2020

Ivana Tanasijević and Gordana Pavlović-Lažetić

Abstract

Purpose

The purpose of this paper is to provide a methodology for automatic annotation of a multimedia collection of intangible cultural heritage mostly in the form of interviews. Assigned annotations provide a way to search the collection.

Design/methodology/approach

Annotation is based on automatic extraction of metadata and is conducted by named entity and topic extraction from textual descriptions with a rule-based approach supported by vocabulary resources, a compiled domain-specific classification scheme and domain-oriented corpus analysis.
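
A toy fragment of such a rule-based annotator, with a small gazetteer standing in for the vocabulary resources and regular expressions standing in for the finite state transducers, might look like this; all entries and labels are invented for the example.

```python
# Toy rule-based named entity and topic annotator; the gazetteer and
# topic rules are invented stand-ins for the paper's resources.
import re

PLACES = {"Prizren", "Sirinić"}            # gazetteer (vocabulary resource)
TOPIC_RULES = [
    (re.compile(r"\b(wedding|bride|dowry)\w*", re.I), "wedding customs"),
    (re.compile(r"\b(song|sing|danc)\w*", re.I), "music and dance"),
]

def annotate(text):
    """Tag place entities via the gazetteer and topics via the rules."""
    entities = [w for w in re.findall(r"\w+", text, re.UNICODE) if w in PLACES]
    topics = sorted({label for pat, label in TOPIC_RULES if pat.search(text)})
    return {"entities": entities, "topics": topics}

desc = "An interview recorded in Prizren about wedding songs and dances."
print(annotate(desc))
# {'entities': ['Prizren'], 'topics': ['music and dance', 'wedding customs']}
```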

Findings

The proposed methodology for automatic annotation of a collection of intangible cultural heritage, applied to the cultural heritage of the Balkans, achieves very good results in terms of F‐measure: 0.87 for named entity annotation and 0.90 for topic annotation. The overall methodology encapsulates domain-specific and language-specific knowledge in collections of finite state transducers and allows further improvements.

Originality/value

Although cultural heritage plays a significant role in the development of the identity of a group or an individual, it is one of those specific domains that have not yet been fully explored for many languages. A methodology is proposed that can be used to incorporate natural language processing techniques into digital libraries of cultural heritage.

Details

The Electronic Library, vol. 38 no. 5/6
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 26 February 2019

Parvaneh Westerlund, Ingemar Andersson, Tero Päivärinta and Jörgen Nilsson

Abstract

Purpose

This paper aims to automate pre-ingest workflow for preserving digital content, such as records, through middleware that integrates potentially many information systems with potentially several alternative digital preservation services.

Design/methodology/approach

This design research approach resulted in a design for model- and component-based software for such a workflow. A proof-of-concept prototype was implemented and demonstrated in the context of a European research project, ForgetIT.

Findings

The study identifies design issues of automated pre-ingest for digital preservation while using middleware as a design choice for this purpose. The resulting model and solution suggest functionalities and interaction patterns based on open interface protocols between the source systems of digital content, middleware and digital preservation services. The resulting workflow automates the tasks of fetching digital objects from the source system with metadata extraction, preservation preparation and transfer to a selected preservation service. The proof-of-concept verified that the suggested model for pre-ingest workflow and the suggested component architecture was technologically implementable. Future research and development needs to include new solutions to support context-aware preservation management with increased support for configuring submission agreements as a basis for dynamic automation of pre-ingest and more automated error handling.
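
A condensed sketch of that pre-ingest flow is given below; the package structure and service interface are hypothetical stand-ins for the open interface protocols the paper relies on.

```python
# Pre-ingest sketch: fetch an object from a source system, extract
# metadata, build a submission package, transfer it to a chosen
# preservation service. All interfaces are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, Protocol

@dataclass
class SubmissionPackage:
    object_id: str
    payload: bytes
    metadata: Dict[str, str] = field(default_factory=dict)

class PreservationService(Protocol):
    def ingest(self, package: SubmissionPackage) -> str: ...

class LoggingService:
    """Stand-in for one of several alternative preservation services."""
    def ingest(self, package: SubmissionPackage) -> str:
        print(f"ingesting {package.object_id}: {package.metadata}")
        return f"aip:{package.object_id}"

def pre_ingest(source: Dict[str, bytes], object_id: str,
               service: PreservationService) -> str:
    payload = source[object_id]               # fetch from the source system
    metadata = {"size": str(len(payload))}    # (stub) metadata extraction
    package = SubmissionPackage(object_id, payload, metadata)
    return service.ingest(package)            # transfer to the service

print(pre_ingest({"rec-7": b"record bytes"}, "rec-7", LoggingService()))
```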

Originality/value

The paper addresses design issues for middleware as a design choice to support automated pre-ingest in digital preservation. The suggested middleware architecture supports many-to-many relationships between the source information systems and digital preservation services through open interface protocols, thus enabling dynamic digital preservation solutions for records management.

Details

Records Management Journal, vol. 29 no. 3
Type: Research Article
ISSN: 0956-5698

Article
Publication date: 29 May 2009

Philip Young

Abstract

Purpose

The purpose of this paper is to explore dissemination, broadly considered, of an open access (OA) database as part of a librarian‐faculty collaboration currently in progress.

Design/methodology/approach

Dissemination of an online database by librarians is broadly considered, including metadata optimization for multiple access points and user notification methods.

Findings

Librarians address OA dissemination challenges by investigating search engine optimization and seeking new opportunities for dissemination on the web. Differences in library metadata formats inhibit metadata optimization and need resolution.

Research limitations/implications

The collaboration is in progress, and many of the ideas and conclusions listed have not yet been implemented.

Practical implications

Libraries should consider their role in scholarly publishing, develop workflows to enable it, and extend their efforts to the web.

Originality/value

This paper contributes to the scant literature on dissemination by libraries, and discusses dissemination challenges encountered by a non‐peer reviewed, dynamic scholarly resource.

Details

OCLC Systems & Services: International digital library perspectives, vol. 25 no. 2
Type: Research Article
ISSN: 1065-075X

Open Access
Article
Publication date: 15 February 2022

Martin Nečaský, Petr Škoda, David Bernhauer, Jakub Klímek and Tomáš Skopal

Abstract

Purpose

Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemas, shared attributes, vocabulary, structure and semantics. Existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets, so the search results are often insufficient. However, there exist many ways of improving dataset discovery, such as content-based retrieval, machine learning tools, third-party (external) knowledge bases, and countless feature extraction methods and description models.

Design/methodology/approach

In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
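
As a toy instance of the kind of pipeline such a framework composes, the sketch below chains a representation component (bag of words over textual metadata) with a similarity-based discovery component; the component names are illustrative, not the framework's actual catalog.

```python
# Toy similarity-based dataset discovery pipeline: representation
# component + discovery component. Names and data are illustrative.
from collections import Counter
import math

def represent(dataset):
    """Representation component: bag of words over the textual metadata."""
    return Counter(dataset["title"].lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def discover(query, catalog):
    """Discovery component: rank catalog datasets against the query."""
    q = represent(query)
    return sorted(catalog, key=lambda d: cosine(q, represent(d)), reverse=True)

catalog = [
    {"title": "City air quality measurements 2021"},
    {"title": "National budget expenditures"},
    {"title": "Air pollution sensor readings"},
]
for d in discover({"title": "air quality sensors"}, catalog):
    print(d["title"])
```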

Findings

The study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework.

Originality/value

To the best of the authors' knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework has the ambition to establish a platform for reproducible and comparable research in the area of dataset discovery. A prototype implementation of the framework is available on GitHub.

Details

Data Technologies and Applications, vol. 56 no. 4
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 1 March 2000

Cliff Glaviano

Abstract

Exploration and use of Dublin Core metadata tools in a graduate library course, Organization of Information, suggests applications of these tools in similar courses as an introduction to cataloging. Students used the Nordic DC metadata creator to catalog and classify both Internet and traditional information objects. Because links available from the Nordic DC metadata creator and Dublin Core home pages make cataloging resource materials easily available, students had less difficulty integrating their knowledge and those resources to create cataloging records than their counterparts in prior courses did. Practice with the Dublin Core tools, observation of their use by student catalogers, and the potential for gaining experience with more sophisticated tools available in the Cooperative Online Resource Catalog (CORC) project encouraged Bowling Green State University to participate in CORC.

Details

OCLC Systems & Services: International digital library perspectives, vol. 16 no. 1
Type: Research Article
ISSN: 1065-075X
