A semi-automatic indexing system based on embedded information in HTML documents

Mari Vállez (Department of Communication, Universitat Pompeu Fabra, Barcelona, Spain)
Rafael Pedraza-Jiménez (Department of Communication, Universitat Pompeu Fabra, Barcelona, Spain)
Lluís Codina (Department of Communication, Universitat Pompeu Fabra, Barcelona, Spain)
Saúl Blanco (Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Madrid, Spain)
Cristòfol Rovira (Department of Communication, Universitat Pompeu Fabra, Barcelona, Spain)

Library Hi Tech

ISSN: 0737-8831

Article publication date: 15 June 2015

Abstract

Purpose

The purpose of this paper is to describe and evaluate DigiDoc MetaEdit, a tool that allows the semi-automatic indexing of HTML documents. The tool works by identifying and suggesting keywords from a thesaurus according to the information embedded in each HTML document. Keyword assignment can be parameterized by how frequently the terms appear in the document, by the relevance of their position, or by a combination of both.
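The three parameterizations described above (frequency, position, and their combination) can be illustrated with a minimal sketch. This is not the DigiDoc MetaEdit implementation: the zone weights in `POSITION_WEIGHTS`, the names `ZoneCollector` and `suggest_keywords`, the use of plain substring matching, and the choice to multiply frequency by the best position weight are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical position weights; the values used by the actual tool are not given here.
POSITION_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "body": 1.0}


class ZoneCollector(HTMLParser):
    """Collects the lower-cased text of an HTML document, grouped by zone."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open weighted tags
        self.zones = defaultdict(list)  # zone name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag in POSITION_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text outside any weighted tag is treated as ordinary body text.
        zone = self.stack[-1] if self.stack else "body"
        self.zones[zone].append(data.lower())


def suggest_keywords(html, thesaurus, mode="combined", top_n=5):
    """Rank thesaurus terms found in the document.

    mode: "frequency", "position", or "combined" (assumed here to be
    frequency multiplied by the highest position weight where the term occurs).
    """
    collector = ZoneCollector()
    collector.feed(html)
    scores = {}
    for term in thesaurus:
        needle = term.lower()
        frequency = 0
        position = 0.0
        for zone, chunks in collector.zones.items():
            hits = sum(chunk.count(needle) for chunk in chunks)
            frequency += hits
            if hits:
                position = max(position, POSITION_WEIGHTS[zone])
        if frequency == 0:
            continue  # only terms present in the document are suggested
        if mode == "frequency":
            scores[term] = float(frequency)
        elif mode == "position":
            scores[term] = position
        else:
            scores[term] = frequency * position
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Toy document and thesaurus, invented for this example.
SAMPLE_HTML = (
    "<html><head><title>Information retrieval</title></head>"
    "<body><h1>Indexing</h1>"
    "<p>Thesaurus and indexing. Indexing with a thesaurus. Thesaurus.</p>"
    "</body></html>"
)
SAMPLE_THESAURUS = ["information retrieval", "indexing", "thesaurus", "e-commerce"]

if __name__ == "__main__":
    for mode in ("frequency", "position", "combined"):
        print(mode, suggest_keywords(SAMPLE_HTML, SAMPLE_THESAURUS, mode=mode))
```

Keeping the mode as a parameter mirrors the idea that an indexer can switch between the three criteria and compare the suggestions each one produces.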

Design/methodology/approach

To evaluate the efficiency of the indexing tool, the descriptors/keywords it suggests are compared with the keywords assigned manually by human experts. For this comparison, a corpus of HTML documents is randomly selected from a journal devoted to Library and Information Science.

Findings

The results of the evaluation show that: first, there is close to a 50 per cent match or overlap between the two indexing systems, although the matches can reach 73 per cent if related terms and narrow terms are also taken into consideration; and second, the first terms identified by the tool are the most relevant.
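The difference between an exact match and a match that also accepts related or narrow terms can be made concrete with a short sketch. The abstract does not state the exact overlap measure, so this assumes it is the share of manually assigned keywords that the tool also suggests; the function name `match_rate`, the `related` mapping, and the toy data are hypothetical.

```python
def match_rate(suggested, manual, related=None):
    """Share of manual keywords matched by the suggested keywords.

    related: optional dict mapping a lower-cased manual keyword to its related
    or narrow thesaurus terms; a suggestion of any of them counts as a match.
    """
    related = related or {}
    manual_set = {term.lower() for term in manual}
    if not manual_set:
        return 0.0
    suggested_set = {term.lower() for term in suggested}
    matched = 0
    for term in manual_set:
        # Accept the term itself or any of its related/narrow terms.
        candidates = {term} | {r.lower() for r in related.get(term, ())}
        if candidates & suggested_set:
            matched += 1
    return matched / len(manual_set)


# Toy data invented for this example; not taken from the paper's corpus.
MANUAL = ["indexing", "thesauri", "metadata", "html"]
SUGGESTED = ["indexing", "html", "controlled vocabularies", "dublin core"]
RELATED = {"thesauri": ["controlled vocabularies"]}

if __name__ == "__main__":
    print(match_rate(SUGGESTED, MANUAL))           # exact matches only
    print(match_rate(SUGGESTED, MANUAL, RELATED))  # related terms also accepted
```

On this toy data the exact rate is 2 of 4 keywords and the expanded rate is 3 of 4, which shows why relaxing the comparison to related and narrow terms can only raise the reported overlap.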

Originality/value

The tool presented identifies the most important keywords in an HTML document from the information embedded in the document itself. Representing the contents of documents with keywords is now an essential practice in areas such as information retrieval and e-commerce.

Acknowledgements

This paper is part of the projects “Audiencias activas y periodismo” (Active audiences and journalism), CSO2012-39518-C04-02, and “Comunicación online de los destinos turísticos” (Online communication of tourist destinations), CSO 2011-22691. Plan Nacional de I+D+i, Ministerio de Economía y Competitividad (Spain).

Citation

Vállez, M., Pedraza-Jiménez, R., Codina, L., Blanco, S. and Rovira, C. (2015), "A semi-automatic indexing system based on embedded information in HTML documents", Library Hi Tech, Vol. 33 No. 2, pp. 195-210. https://doi.org/10.1108/LHT-12-2014-0114

Publisher

Emerald Group Publishing Limited

Copyright © 2015, Emerald Group Publishing Limited
