Predicting the quality of health web documents using their characteristics

Melinda Oroszlányová (Department of Informatics Engineering, University of Porto, Porto, Portugal)
Carla Teixeira Lopes (Department of Informatics Engineering, University of Porto, Porto, Portugal)
Sérgio Nunes (Department of Informatics Engineering, University of Porto, Porto, Portugal)
Cristina Ribeiro (Department of Informatics Engineering, University of Porto, Porto, Portugal)

Online Information Review

ISSN: 1468-4527

Publication date: 12 November 2018



The quality of consumer-oriented health information on the web has been defined and evaluated in several studies. Usually it is based on evaluation criteria identified by the researchers and, so far, there is no agreed standard for the quality indicators to use. Based on such indicators, tools have been developed to evaluate the quality of web information. The HONcode is one of such tools. The purpose of this paper is to investigate the influence of web document features on their quality, using HONcode as ground truth, with the aim of finding whether it is possible to predict the quality of a document using its characteristics.


The present work uses a set of health documents and analyzes how their characteristics (e.g. web domain, last update, type, mention of places of treatment and prevention strategies) are associated with their quality. Based on these features, statistical models are built which predict whether health-related web documents have certification-level quality. Multivariate analysis is performed, using classification to estimate the probability of a document having quality given its characteristics. This approach tells us which predictors are important. Three types of full and reduced logistic regression models are built and evaluated. The first one includes every feature, without any exclusion, the second one disregards the Utilization Review Accreditation Commission variable, due to it being a quality indicator, and the third one excludes the variables related to the HONcode principles, which might also be indicators of quality. The reduced models were built with the aim to see whether they reach similar results with a smaller number of features.


The prediction models have high accuracy, even without including the characteristics of Health on the Net code principles in the models. The most informative prediction model considers characteristics that can be assessed automatically (e.g. split content, type, process of revision and place of treatment). It has an accuracy of 89 percent.


This paper proposes models that automatically predict whether a document has quality or not. Some of the used features (e.g. prevention, prognosis or treatment) have not yet been explicitly considered in this context. The findings of the present study may be used by search engines to promote high-quality documents. This will improve health information retrieval and may contribute to reduce the problems caused by inaccurate information.



Melinda Oroszlányová, Carla Teixeira Lopes, Sérgio Nunes and Cristina Ribeiro (2018) "Predicting the quality of health web documents using their characteristics", Online Information Review, Vol. 42 No. 7, pp. 1024-1047

Download as .RIS





Emerald Publishing Limited

Copyright © 2018, Emerald Publishing Limited

Please note you might not have access to this content

You may be able to access this content by login via Shibboleth, Open Athens or with your Emerald account.
If you would like to contact us about accessing this content, click the button and fill out the form.
To rent this content from Deepdyve, please click the button.