Using visual content‐based analysis with textual and structural analysis for improving web filtering

Mohamed Hammami (LIRIS UMR CNRS 5205, Ecole Centrale de Lyon, 36 Av. Guy de Collongue, 69134 Ecully, France)
Youssef Chahir (LIRIS UMR CNRS 5205, Ecole Centrale de Lyon, 36 Av. Guy de Collongue, 69134 Ecully, France)
Liming Chen (GREYC, Campus II ‐ BP 5186 Université de Caen, 14032 Caen Cedex, France)

International Journal of Web Information Systems

ISSN: 1744-0084

Publication date: 1 November 2005

Abstract

The ever‐growing Web has brought with it a proliferation of objectionable content, such as sex, violence, and racism, and efficient tools are needed for classifying and filtering undesirable web content. In this paper, we investigate this problem through WebGuard, our automatic, machine learning‐based system for classifying and filtering pornographic websites. As the Internet becomes increasingly visual and multimedia, as exemplified by pornographic websites, we focus on using skin color‐related visual content‐based analysis together with textual and structural content‐based analysis to improve pornographic website filtering. Whereas most commercial filtering products on the market rely mainly on textual content‐based analysis, such as detecting indicative keywords or checking manually collected blacklists, the originality of our work lies in adding structural and visual content‐based analysis to the classical textual content‐based analysis, together with several major data mining techniques for learning and classification. On a testbed of 400 websites, comprising 200 adult sites and 200 non‐pornographic ones, WebGuard achieved a classification accuracy of 96.1% when only textual and structural content‐based analysis was used, and 97.4% when skin color‐related visual content‐based analysis was added. Further experiments on a blacklist of 12,311 adult websites, manually collected and classified by the French Ministry of Education, showed that WebGuard achieved a classification accuracy of 87.82% using only textual and structural content‐based analysis, and 95.62% when visual content‐based analysis was added. The basic framework of WebGuard can be applied to other website categorization problems in which, as on most websites today, textual and visual content are combined.

Citation

Hammami, M., Chahir, Y. and Chen, L. (2005), "Using visual content‐based analysis with textual and structural analysis for improving web filtering", International Journal of Web Information Systems, Vol. 1 No. 4, pp. 241-254. https://doi.org/10.1108/17440080580000096

Publisher: Emerald Group Publishing Limited

Copyright © 2005, Emerald Group Publishing Limited
