Predicting the quality of health web documents using their characteristics

Melinda Oroszlányová (Department of Informatics Engineering, University of Porto, Porto, Portugal)
Carla Teixeira Lopes (Department of Informatics Engineering, University of Porto, Porto, Portugal)
Sérgio Nunes (Department of Informatics Engineering, University of Porto, Porto, Portugal)
Cristina Ribeiro (Department of Informatics Engineering, University of Porto, Porto, Portugal)

Online Information Review

ISSN: 1468-4527

Article publication date: 5 September 2018

Issue publication date: 16 October 2018



The quality of consumer-oriented health information on the web has been defined and evaluated in several studies. These evaluations usually rely on criteria identified by the researchers and, so far, there is no agreed standard for the quality indicators to use. Based on such indicators, tools have been developed to evaluate the quality of web information; the HONcode (Health on the Net code) is one such tool. The purpose of this paper is to investigate how the features of web documents influence their quality, using HONcode as ground truth, and to determine whether the quality of a document can be predicted from its characteristics.


The present work uses a set of health documents and analyzes how their characteristics (e.g. web domain, last update, type, mention of places of treatment and prevention strategies) are associated with their quality. Based on these features, statistical models are built to predict whether health-related web documents have certification-level quality. Multivariate analysis is performed, using classification to estimate the probability that a document has quality given its characteristics. This approach also shows which predictors are important. Three types of logistic regression models are built and evaluated, each in a full and a reduced version. The first type includes every feature without exclusion; the second disregards the Utilization Review Accreditation Commission (URAC) variable, because it is itself a quality indicator; and the third excludes the variables related to the HONcode principles, which might also be indicators of quality. The reduced models were built to see whether they reach similar results with fewer features.
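
The comparison between a model with every feature and one that leaves out a quality-indicator variable can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the feature names and the labelling rule are hypothetical, and the model is fitted with plain batch gradient descent rather than with a statistics package.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic data is reproducible

# Hypothetical binary features; the names are illustrative, not the paper's variables.
FEATURES = ["urac_accredited", "has_revision_process",
            "mentions_treatment_place", "split_content"]

def make_document():
    """Return (features, label) for one synthetic document."""
    revision, place, split = (random.randint(0, 1) for _ in range(3))
    label = 1 if revision + place + split >= 2 else 0  # made-up labelling rule
    # Assumption: the accreditation flag agrees with the label 90% of the time.
    urac = label if random.random() < 0.9 else 1 - label
    return [urac, revision, place, split], label

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    """Estimated probability of the positive class; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression weights by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def accuracy(w, X, y):
    """Share of documents classified correctly at a 0.5 probability threshold."""
    hits = sum((predict_proba(w, xi) >= 0.5) == (yi == 1) for xi, yi in zip(X, y))
    return hits / len(y)

def select(X, cols):
    """Keep only the given feature columns."""
    return [[row[c] for c in cols] for row in X]

data = [make_document() for _ in range(300)]
X = [features for features, _ in data]
y = [label for _, label in data]
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

full_cols = [0, 1, 2, 3]   # every feature
reduced_cols = [1, 2, 3]   # same model without the accreditation flag

w_full = fit_logistic(select(X_train, full_cols), y_train)
w_reduced = fit_logistic(select(X_train, reduced_cols), y_train)

acc_full = accuracy(w_full, select(X_test, full_cols), y_test)
acc_reduced = accuracy(w_reduced, select(X_test, reduced_cols), y_test)

print("full model accuracy:   ", acc_full)
print("reduced model accuracy:", acc_reduced)
```

The fitted weights play the role of the predictors' estimated effects, and comparing the two held-out accuracies mirrors the question of whether a model with fewer features reaches similar results. The numbers printed here say nothing about real health documents, since the labels come from an invented rule.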


The prediction models achieve high accuracy, even when the characteristics related to the HONcode principles are excluded. The most informative prediction model considers characteristics that can be assessed automatically (e.g. split content, type, process of revision and place of treatment) and has an accuracy of 89 percent.


This paper proposes models that automatically predict whether a document has quality or not. Some of the features used (e.g. prevention, prognosis or treatment) have not yet been explicitly considered in this context. Search engines may use the findings of the present study to promote high-quality documents, which would improve health information retrieval and may help to reduce the problems caused by inaccurate information.



This work was supported by Project “NORTE-01-0145-FEDER-000016” (NanoSTIMA), financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement and through the European Regional Development Fund (ERDF).


Oroszlányová, M., Teixeira Lopes, C., Nunes, S. and Ribeiro, C. (2018), "Predicting the quality of health web documents using their characteristics", Online Information Review, Vol. 42 No. 7, pp. 1024-1047.



Emerald Publishing Limited

Copyright © 2018, Emerald Publishing Limited