Search results

1 – 10 of 556
Article
Publication date: 27 September 2011

Aleksandar Kovačević, Dragan Ivanović, Branko Milosavljević, Zora Konjović and Dušan Surla

Abstract

Purpose

The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the scientific research activity of the University of Novi Sad (CRIS UNS).

Design/methodology/approach

The system is based on machine learning and performs automatic extraction and classification of metadata into eight pre-defined categories. The extraction task is realised as a classification process: each line of text is represented by a vector of features covering formatting, position, word characteristics, etc. Experiments were performed with standard classification models, testing both a single classifier over all eight categories and eight individual classifiers. Classifiers were evaluated using five-fold cross-validation on a manually annotated corpus of 100 scientific papers in PDF format, collected from various conferences, journals and authors' personal web pages.
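
A minimal sketch of this per-category setup in Python with scikit-learn; the category labels and the feature subset are assumptions for illustration, and the paper's actual feature set is richer:

```python
# Minimal sketch of line classification with one binary SVM per category.
# Category labels and features are assumed; the paper's set is richer.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

CATEGORIES = ["title", "authors", "affiliation", "abstract",
              "keywords", "email", "journal", "date"]  # assumed labels

def line_features(line, page_height):
    """Encode one line of text as a feature vector (illustrative subset)."""
    words = line["text"].split()
    return [
        line["font_size"],                       # formatting
        float(line["bold"]),                     # formatting
        line["y"] / page_height,                 # position on the page
        len(words),                              # word count
        sum(w[:1].isupper() for w in words),     # capitalised words
    ]

def evaluate(X, y_by_category):
    """Five-fold cross-validation of one binary SVM per category."""
    for cat in CATEGORIES:
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        scores = cross_val_score(clf, X, y_by_category[cat], cv=5, scoring="f1")
        print(f"{cat}: mean F1 = {scores.mean():.2f}")
```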

Findings

Based on the performance obtained in the classification experiments, eight separate support vector machine (SVM) models were chosen, each recognising its corresponding category. All eight models performed well: the F-measure was over 85 per cent for almost all of the classifiers and over 90 per cent for most of them.
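
For reference, the F-measure quoted here (and elsewhere in these results) is the standard harmonic mean of precision P and recall R:

```latex
F = \frac{2PR}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
```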

Research limitations/implications

Automatically extracted metadata cannot be entered into CRIS UNS directly; it requires review by curators.

Practical implications

The proposed system for automatic metadata extraction using SVM models was integrated into the CRIS UNS software system. Metadata extraction was tested on the publications of researchers from the Department of Mathematics and Informatics of the Faculty of Sciences in Novi Sad. Analysis of the metadata extracted from these publications showed that the system's performance on previously unseen data is in accordance with that obtained by cross-validation of the eight separate SVM classifiers. The system will help in synchronising metadata between CRIS UNS and other institutional repositories.

Originality/value

The paper documents the development of a fully automated system for metadata extraction from scientific papers. The system is based on SVM classifiers and open-source tools, and can extract eight types of metadata from scientific articles in any format that can be converted to PDF. Although developed as part of CRIS UNS, the system can be integrated into other CRIS systems, as well as institutional repositories and library management systems.

Article
Publication date: 19 June 2017

Mohammed Ourabah Soualah, Yassine Ait Ali Yahia, Abdelkader Keita and Abderrezak Guessoum

Abstract

Purpose

The purpose of this paper is to provide online access to digitised images of Arabic manuscripts, which requires a catalogue. Bibliographic cataloguing is unsuitable for old Arabic manuscripts, so a new cataloguing model must be established. The authors propose a new cataloguing model based on manuscript annotations and transcriptions, which can be an effective solution for dynamically cataloguing old Arabic manuscripts. In this work, the authors use automatic extraction of metadata based on the structural similarity of the documents.

Design/methodology/approach

This work is based on an experimental methodology. All of the proposed concepts and formulas were tested for validation, allowing the authors to draw concise conclusions.

Findings

Cataloguing old Arabic manuscripts faces the problem of unavailable information. However, this information may be found elsewhere, in a copy of the original manuscript. Thus, cataloguing an Arabic manuscript cannot be done all at once; it is a continual process that requires updating the information. The idea is to pre-catalogue a manuscript and then complete and improve the record through a dedicated platform. Consequently, the authors propose a new cataloguing model, which they call "dynamic cataloguing".

Research limitations/implications

The success of the proposed model depends on the involvement of all of its actors; it relies on the conviction and motivation of the actors on the collaborative platform.

Practical implications

The model can be used in several cataloguing fields where the encoding model is based on XML. It is innovative, implements smart cataloguing and is accessed through a web platform, which allows a catalogue to be updated automatically.
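
The abstract specifies the encoding model only as XML-based; a minimal, hypothetical sketch of the automatic-update idea, filling a missing catalogue field from a user contribution (element names invented):

```python
# Hypothetical sketch of the automatic-update step: merge a user-contributed
# field into an XML catalogue record only if it is still missing. The element
# names are invented; the abstract does not specify the XML encoding model.
import xml.etree.ElementTree as ET

def update_record(record_xml: str, field: str, value: str) -> str:
    root = ET.fromstring(record_xml)
    if root.find(field) is None:          # fill only missing metadata
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")

record = "<manuscript><title>Kitab al-Hikma</title></manuscript>"
print(update_record(record, "copyist", "unknown copyist, 14th c."))
```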

Social implications

The model prompts users to participate and enrich the catalogue, allowing them to move from a passive to an active role.

Originality/value

The dynamic cataloguing model is a new concept that has not previously been proposed in the literature. The model is based on automatic extraction of metadata from user annotations and transcriptions. It is a smart system that automatically updates or fills the catalogue with the extracted metadata.

Article
Publication date: 1 July 2014

Wen-Feng Hsiao, Te-Min Chang and Erwin Thomas

Abstract

Purpose

The purpose of this paper is to propose an automatic metadata extraction and retrieval system to extract bibliographical information from digital academic documents in portable document format (PDF).

Design/methodology/approach

The authors use PDFBox to extract text and font size information, a rule-based method to identify titles, and a hidden Markov model (HMM) to extract titles and authors. Finally, the extracted titles and authors (possibly incorrect or incomplete) are sent as query strings to digital libraries (e.g. ACM, IEEE, CiteSeerX, SDOS and Google Scholar) to retrieve the rest of the metadata.
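
A minimal sketch of the rule-based title step in Python; the input dicts stand in for PDFBox output (the paper uses PDFBox, a Java library), and the position cut-off is an assumption:

```python
# Sketch of rule-based title identification: on the first page, take the run
# of lines near the top that share the largest font size. The input format
# and the 300-point cut-off are assumptions.
def find_title(lines):
    """lines: dicts with 'text', 'font_size' and 'y' (top of page = 0)."""
    top = [ln for ln in lines if ln["y"] < 300]
    if not top:
        return None
    max_size = max(ln["font_size"] for ln in top)
    return " ".join(ln["text"] for ln in top
                    if ln["font_size"] == max_size).strip()

lines = [
    {"text": "Metadata Extraction from", "font_size": 18.0, "y": 72},
    {"text": "Academic PDFs", "font_size": 18.0, "y": 92},
    {"text": "Jane Doe, John Smith", "font_size": 11.0, "y": 120},
]
print(find_title(lines))  # Metadata Extraction from Academic PDFs
```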

Findings

Four experiments were conducted to examine the feasibility of the proposed system. The first compares two HMM architectures: a multi-state model and the proposed one-state model. The results show that the one-state model performs comparably to the multi-state model but is better suited to dealing with real-world unknown states. The second experiment shows that the proposed model (without the aid of online queries) achieves performance on the Cora paper-header dataset as good as that of other researchers' models. The third experiment examines the system's performance on a small dataset of 43 real PDF research papers; the proposed system (with online queries) performs well on bibliographical data extraction and even outperforms the free citation management tool Zotero 3.0. Finally, a fourth experiment on a larger dataset of 103 papers compares the system with Zotero 4.0, which it significantly outperforms. The feasibility of the proposed model is thus justified.

Research limitations/implications

For academic implications, the system is unique in two respects: first, it uses only the Cora header set for HMM training, without other tagged datasets or gazetteer resources, which keeps the system light and scalable; second, it is workable and can be applied to extracting metadata from real-world PDF files. The extracted bibliographical data can then be imported into citation software such as EndNote or RefWorks to increase researchers' productivity.

Practical implications

For practical implications, the system outperforms an existing tool, Zotero 4.0. This gives practitioners a good opportunity to develop similar products for real applications, though doing so may require some knowledge of HMM implementation.

Originality/value

The HMM implementation itself is not novel; what is innovative is the combination of two HMM models. The main model is adapted from Freitag and McCallum (1999), to which the authors add the word features of the Nymble HMM (Bikel et al., 1997). The system works without manually tagging datasets before training the model (only the Cora dataset is used for training, with testing on real-world PDF papers), which differs significantly from prior work. The experimental results provide sufficient evidence of the feasibility of the proposed method in this respect.

Details

Program, vol. 48 no. 3
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 8 August 2008

Alexander Ivanyukovich, Maurizio Marchese and Fausto Giunchiglia

Abstract

Purpose

The purpose of this paper is to provide support for automation of the annotation process of large corpora of digital content.

Design/methodology/approach

The paper presents and discusses an information extraction pipeline from digital document acquisition to information extraction, processing and management. An overall architecture that supports such an extraction pipeline is detailed and discussed.

Findings

The proposed pipeline is implemented in a working prototype of an autonomous digital library (A-DL) system called ScienceTreks that: supports a broad range of methods for document acquisition; relies on no external information sources, using only the information in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (APIs) for easy integration of external systems and tools into the existing pipeline.
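
The abstract describes the pipeline and its integration APIs only at the architectural level; a minimal sketch of a pluggable pipeline in that spirit, with all names assumed:

```python
# Minimal sketch of a pluggable extraction pipeline: each stage is a callable
# from document dict to document dict, and external tools integrate through
# the same register() interface. All names are assumptions; ScienceTreks'
# actual API is not shown in the abstract.
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

class Pipeline:
    def __init__(self) -> None:
        self.stages: List[Stage] = []

    def register(self, stage: Stage) -> None:   # the integration API
        self.stages.append(stage)

    def run(self, doc: Dict) -> Dict:
        for stage in self.stages:
            doc = stage(doc)
        return doc

def acquire(doc: Dict) -> Dict:
    doc["raw"] = f"fetched:{doc['url']}"        # placeholder acquisition
    return doc

def extract(doc: Dict) -> Dict:
    doc["title"] = doc["raw"].split(":", 1)[1]  # placeholder extraction
    return doc

pipeline = Pipeline()
pipeline.register(acquire)
pipeline.register(extract)
print(pipeline.run({"url": "http://example.org/paper.pdf"}))
```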

Practical implications

The proposed A-DL system can be used to automate end-to-end information retrieval and processing, controlling and eliminating error-prone human intervention in the process.

Originality/value

High quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value‐added services within existing and future semantic‐enabled digital library systems.

Details

Online Information Review, vol. 32 no. 4
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 28 October 2020

Ivana Tanasijević and Gordana Pavlović-Lažetić

Abstract

Purpose

The purpose of this paper is to provide a methodology for automatic annotation of a multimedia collection of intangible cultural heritage mostly in the form of interviews. Assigned annotations provide a way to search the collection.

Design/methodology/approach

Annotation is based on automatic extraction of metadata and is conducted through named entity and topic extraction from textual descriptions, using a rule-based approach supported by vocabulary resources, a compiled domain-specific classification scheme and domain-oriented corpus analysis.
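
A much-simplified sketch of vocabulary-supported annotation: the paper compiles such resources into finite-state transducers, whereas here a plain longest-match gazetteer lookup stands in, with invented entries:

```python
# Simplified sketch of gazetteer-based annotation. A longest-match dictionary
# lookup stands in for the paper's finite-state transducers; entries invented.
import re

GAZETTEER = {
    "Skadar Lake": "NAMED_ENTITY:location",
    "slava": "TOPIC:customs",
    "gusle": "TOPIC:music",
}

def annotate(text):
    pattern = "|".join(re.escape(k)
                       for k in sorted(GAZETTEER, key=len, reverse=True))
    return [(m.group(0), GAZETTEER[m.group(0)], m.start())
            for m in re.finditer(pattern, text)]

print(annotate("The interview mentions the slava custom near Skadar Lake."))
# [('slava', 'TOPIC:customs', 27), ('Skadar Lake', 'NAMED_ENTITY:location', 45)]
```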

Findings

The proposed methodology for automatic annotation of a collection of intangible cultural heritage, applied to the cultural heritage of the Balkans, achieves very good results according to the F-measure: 0.87 for named entity annotation and 0.90 for topic annotation. The overall methodology encapsulates domain-specific and language-specific knowledge in collections of finite-state transducers and allows further improvements.

Originality/value

Although cultural heritage plays a significant role in the development of the identity of a group or an individual, it is one of those specific domains that have not yet been fully explored for many languages. A methodology is proposed that can be used to incorporate natural language processing techniques into digital libraries of cultural heritage.

Details

The Electronic Library, vol. 38 no. 5/6
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 17 August 2015

Matthias Elser, Ronald Mies, Peter Altendorf, Alberto Messina, Fulvio Negro, Werner Bailer, Albert Hofmann and Georg Thallinger

Abstract

Purpose

This paper aims to propose a service-oriented framework for performing content annotation and search, adapted to the task context. Media production workflows are becoming increasingly distributed and heterogeneous. The tasks of professionals in media production can be supported by automatic content analysis and by search and retrieval services.

Design/methodology/approach

The processes of the framework are derived top-down, starting from business goals and scenarios in audiovisual media production. Formal models of tasks in the production workflow are defined, and business processes are derived from the task models. A software framework enabling the orchestrated execution of these models is developed.
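
A minimal sketch of the reification idea, with task models as declarative data bound to service implementations by configuration; all names here are invented, and the actual model language is not shown in the abstract:

```python
# Sketch: task models are declarative data; the executable process is derived
# by binding each task to a service implementation from configuration. All
# names are invented illustrations.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    name: str
    service: str          # abstract service type, not an implementation

TASK_MODEL: List[Task] = [
    Task("shot_detection", "video-analysis"),
    Task("speech_to_text", "asr"),
]

# Configuration decides which implementation backs each service type, so
# implementations can be swapped without touching the process definition.
SERVICE_REGISTRY: Dict[str, Callable[[str], str]] = {
    "video-analysis": lambda asset: f"shots({asset})",
    "asr": lambda asset: f"transcript({asset})",
}

def reify_and_run(model: List[Task], asset: str) -> Dict[str, str]:
    return {t.name: SERVICE_REGISTRY[t.service](asset) for t in model}

print(reify_and_run(TASK_MODEL, "interview.mxf"))
```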

Findings

This paper presents a framework implementing the proposed approach, called the Metadata Production Management Framework (MPMF). The authors show how a media production workflow for a real-world scenario is implemented using the MPMF.

Research limitations/implications

The authors have demonstrated the feasibility of a model-based approach for media processing. In the reification step, there is still information that needs to be provided in addition to the task models to obtain executable processes. Future research should target the further automation of this process.

Practical implications

By means of this approach, the implementation of the business process defines the workflow, whereas the services that are actually used are defined by the configuration. Thus, the processes are stable and, at the same time, the services can be managed very flexibly. If necessary, service implementations can also be completely replaced by others without changing the business process implementation.

Originality/value

The authors introduce a model-based approach to media processing and define a reification process from business-driven task models to executable workflows. This enables a more task-oriented design of media processing workflows and adaptive use of automatic information extraction tools.

Details

International Journal of Web Information Systems, vol. 11 no. 3
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 9 February 2015

Bhaskar Sinha, Somnath Chandra and Megha Garg

Abstract

Purpose

The purpose of this explorative research study is to focus on the implementation of semantic Web technology in the agriculture domain of e-governance data. The study contributes to an understanding of the problems and difficulties in implementations involving the unstructured and unformatted datasets of a unique multilingual, local-language electronic dictionary (IndoWordNet).

Design/methodology/approach

The approach proceeds from a conceptual, logical model to the realisation of agriculture-based terms and terminology extracted from the linked multilingual IndoWordNet, while maintaining support for World Wide Web Consortium (W3C) semantic Web standards, to generate an ontology and uniform Unicode-structured datasets.
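
A sketch of encoding extracted synsets as SKOS concepts with rdflib; the synset records and namespace are invented, and the paper's extraction from the linked multilingual IndoWordNet is considerably richer:

```python
# Sketch: encode extracted synsets as SKOS concepts with rdflib. The synset
# records and namespace are invented illustrations.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

AGRO = Namespace("http://example.org/agro#")   # hypothetical namespace
g = Graph()
g.bind("skos", SKOS)
g.bind("agro", AGRO)

synsets = [  # (id, preferred label, language, broader id or None)
    ("s1", "crop", "en", None),
    ("s2", "kharif crop", "en", "s1"),
]

for sid, label, lang, broader in synsets:
    concept = AGRO[sid]
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(label, lang=lang)))
    if broader:
        g.add((concept, SKOS.broader, AGRO[broader]))  # hierarchy link

print(g.serialize(format="turtle"))
```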

Findings

The findings reveal partial support for the extraction of terms, relations and concepts when linking to IndoWordNet, yielding SynSets, lexical relations between words and relations among the SynSets themselves. This helped in generating an ontology, hierarchical modelling and the creation of structured metadata datasets.

Research limitations/implications

IndoWordNet has limitations: it is not a fully revised version, owing to India's diverse cultural base, and the new version is yet to be released. As mentioned in Section 5, the implications of these ideas and experiments should encourage further exploration and better applications using such wordnets.

Practical implications

Language-developer tools and frameworks were used to process tagged, annotated raw data and obtain intermediate results, which serve as a source for the generation of the ontology and dynamic metadata.

Social implications

The results are expected to be applicable to other e-governance applications and to enable better use of applications in social and government departments.

Originality/value

The authors have worked out experimental facts and raw-information source datasets, obtaining satisfactory results such as SynSets, sense counts, semantic and lexical relations and the class-concept hierarchy. These helped in developing an ontology of the domain of interest and, hence, in creating dynamic metadata that can be used globally to support various applications.

Details

Journal of Knowledge Management, vol. 19 no. 1
Type: Research Article
ISSN: 1367-3270

Article
Publication date: 6 November 2007

Alan Burk, Muhammad Al‐Digeil, Dominic Forest and Jennifer Whitney

Abstract

Purpose

The purpose of this paper is to develop automated methods for creating metadata for documents in an institutional repository.

Design/methodology/approach

Two methods are examined for automatically building metadata in an institutional repository context. Text mining techniques are employed to discover relationships among documents with similar content, from which possible values for missing or incomplete metadata elements are inferred. Machine learning techniques are used to identify and extract specific metadata element values from document content.
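
A sketch of the first method, clustering documents by content and projecting keywords from cluster-mates that already carry them; the corpus, parameters and the choice of TF-IDF with k-means are illustrative assumptions:

```python
# Sketch of metadata projection by content clustering: cluster documents on
# TF-IDF vectors, then propose keywords for unlabelled documents from the
# labelled members of the same cluster. Data and parameters are illustrative.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["wetland bird migration patterns", "bird habitats in wetland areas",
        "superconductor lattice structure", "defects in superconductor lattice"]
keywords = [["ornithology"], None, ["physics"], None]  # None = missing metadata

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for i, kw in enumerate(keywords):
    if kw is None:  # project keywords from labelled cluster-mates
        proposed = {k for j, other in enumerate(keywords)
                    if other and labels[j] == labels[i] for k in other}
        print(f"doc {i}: proposed keywords {proposed}")
```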

Findings

Text mining techniques can be used to cluster documents with similar content. This allows values for metadata elements, such as keywords, to be projected from documents with established metadata onto related documents. Machine learning techniques are found to be reasonably accurate for extracting values for metadata elements such as title, author and abstract from documents. The results show sufficient promise to support the next phase of the project: the development of assistive tools for metadata specialists to create or edit document metadata.

Originality/value

This paper focuses on the use of automated metadata extraction techniques to assist metadata creation, lessening the time and effort required to add documents to institutional repositories.

Details

OCLC Systems & Services: International digital library perspectives, vol. 23 no. 4
Type: Research Article
ISSN: 1065-075X

Article
Publication date: 20 June 2016

Götz Hatop

Abstract

Purpose

The academic tradition of adding a reference section, with references to cited and otherwise related academic material, to an article provides a natural starting point for finding links to other publications. These links can then be published as linked data. Natural language processing technologies available today can perform the task of extracting bibliographical references from text. Publishing references by means of semantic Web technologies is a prerequisite for a broader study and analysis of citations and can thus help to improve academic communication in a general sense. The paper aims to discuss these issues.

Design/methodology/approach

This paper examines the overall workflow required to extract, analyze and semantically publish bibliographical references within an Institutional Repository with the help of open source software components.
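
A sketch of the final publishing step, expressing one extracted reference as linked data with rdflib and the CiTO citation ontology; the article and cited-work URIs are invented:

```python
# Sketch: publish one extracted reference as linked data with rdflib and the
# CiTO citation ontology. The article and cited-work URIs are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

CITO = Namespace("http://purl.org/spar/cito/")

g = Graph()
g.bind("cito", CITO)
g.bind("dcterms", DCTERMS)

article = URIRef("http://repository.example.org/article/42")
cited = URIRef("https://doi.org/10.1000/xyz123")

g.add((article, CITO.cites, cited))                 # the citation link
g.add((cited, DCTERMS.title, Literal("Title parsed from the reference")))

print(g.serialize(format="turtle"))
```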

Findings

A publication infrastructure in which references are available to software agents would enable additional benefits such as citation analysis, e.g. the collection of citations of a known paper and the investigation of citation sentiment. The publication of reference information as demonstrated in this article is possible with existing semantic Web technologies, based on established ontologies and open-source software components.

Research limitations/implications

Only a limited number of metadata extraction programs were considered for performance evaluation, and reference extraction was tested for journal articles only, whereas institutional repositories usually contain a large amount of other material, such as monographs. Also, citation analysis is in an experimental state, and citation sentiment is currently not published at all. For future work, distributing reference information between repositories is an important problem that needs to be tackled.

Originality/value

Publishing reference information as linked data is new within the academic publishing domain.

Details

Library Hi Tech, vol. 34 no. 2
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 7 August 2009

Arash Joorabchi and Abdulhussain E. Mahdi

Abstract

Purpose

With the significant growth in electronic education materials such as syllabus documents and lecture notes, available on the internet and intranets, there is a need for robust central repositories of such materials to allow both educators and learners to conveniently share, search and access them. The purpose of this paper is to report on the work to develop a national repository for course syllabi in Ireland.

Design/methodology/approach

The paper describes a prototype syllabus repository system for higher education in Ireland, developed using a number of information extraction and document classification techniques, including a new, fully unsupervised document classification method that uses a web search engine to automatically collect a training set for the classification algorithm.
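
A sketch of that unsupervised idea, collecting per-class training text via a web search engine and training a standard classifier on it; `search_snippets` is a hypothetical stand-in for a real search-engine API, and the class labels are illustrative:

```python
# Sketch of unsupervised training-set collection: for each class label, query
# a web search engine and use the returned snippets as training text.
# `search_snippets` is a hypothetical stand-in for a real search-engine API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def search_snippets(query: str, n: int = 50) -> list[str]:
    raise NotImplementedError("call a search-engine API here")

CLASSES = ["mathematics", "computer science", "engineering"]  # illustrative

def build_classifier():
    texts, labels = [], []
    for cls in CLASSES:
        for snippet in search_snippets(f'"{cls}" course syllabus'):
            texts.append(snippet)
            labels.append(cls)
    clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
    clf.fit(texts, labels)
    return clf   # then: clf.predict([syllabus_text])
```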

Findings

Preliminary experimental results for evaluating the performance of the system and its various units, particularly the information extractor and the classifier, are presented and discussed.

Originality/value

In this paper, three major obstacles associated with creating a large-scale syllabus repository are identified, and a comprehensive review of published research addressing these problems is provided. Two different types of syllabus documents are identified, and a rule-based information extraction system capable of extracting structured information from unstructured syllabus documents is described. Finally, the importance of classifying resources in a syllabus digital library is highlighted, a number of standard education classification schemes are introduced, and the unsupervised automated document classification system, which classifies syllabus documents based on an extended version of the International Standard Classification of Education, is described.

Details

The Electronic Library, vol. 27 no. 4
Type: Research Article
ISSN: 0264-0473
