Search results

1 – 10 of over 3000
Article
Publication date: 1 November 2005

Mohamed Hammami, Youssef Chahir and Liming Chen

Abstract

Along with the ever-growing Web comes a proliferation of objectionable content, such as sex, violence and racism, and we need efficient tools for classifying and filtering undesirable web content. In this paper, we investigate this problem through WebGuard, our automatic machine-learning-based pornographic-website classification and filtering system. As the Internet becomes increasingly visual and multimedia-rich, as exemplified by pornographic websites, we focus our attention on the use of skin-color-related visual content-based analysis, alongside textual and structural content-based analysis, for improving pornographic-website filtering. While most commercial filtering products on the marketplace rely mainly on textual content-based analysis, such as detection of indicative keywords or checking against manually collected black lists, the originality of our work resides in adding structural and visual content-based analysis to the classical textual content-based analysis, along with several major data-mining techniques for learning and classification. Tested on a testbed of 400 websites comprising 200 adult sites and 200 non-pornographic ones, WebGuard, our web filtering engine, scored a 96.1% classification accuracy rate when only textual and structural content-based analysis was used, and a 97.4% classification accuracy rate when skin-color-related visual content-based analysis was applied in addition. Further experiments on a black list of 12,311 adult websites manually collected and classified by the French Ministry of Education showed that WebGuard scored an 87.82% classification accuracy rate when using only textual and structural content-based analysis, and 95.62% when visual content-based analysis was applied in addition. The basic framework of WebGuard can be applied to other website categorization problems which combine, as most do today, textual and visual content.
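The skin-color cue described in this abstract can be illustrated with a simple per-pixel rule. The sketch below uses a widely cited explicit RGB skin heuristic (in the style of Peer et al.) — an assumption for illustration, not WebGuard's actual model, which the abstract does not specify. The resulting skin ratio of an image would serve as one visual feature alongside the textual and structural ones.

```python
def is_skin_rgb(r, g, b):
    """Heuristic explicit RGB skin-colour rule (Peer et al. style).
    Illustrative only; the paper's actual skin model is not given here."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)

def skin_ratio(pixels):
    """Fraction of pixels classified as skin, usable as one visual feature
    for a website classifier. `pixels` is a list of (r, g, b) tuples."""
    if not pixels:
        return 0.0
    hits = sum(1 for (r, g, b) in pixels if is_skin_rgb(r, g, b))
    return hits / len(pixels)
```

A page-level classifier could then threshold or combine such ratios over the images found on a site.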

Details

International Journal of Web Information Systems, vol. 1 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 1 May 2006

Koraljka Golub

Abstract

Purpose

To provide an integrated perspective on the similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and to point out problems with these approaches and with automated classification as such.

Design/methodology/approach

A range of works dealing with automated classification of full‐text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages.

Findings

The paper identifies the major similarities and differences between the three approaches: document pre-processing and the use of web-specific document characteristics are common to all of them, while the major differences lie in the algorithms applied and in whether the vector space model and controlled vocabularies are employed. Problems of automated classification are also recognized.

Research limitations/implications

The paper does not attempt to provide an exhaustive bibliography of related resources.

Practical implications

As an integrated overview of approaches from different research communities, with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners, and it gives researchers from one community insight into how similar tasks are conducted in the others.

Originality/value

To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.

Details

Journal of Documentation, vol. 62 no. 3
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 18 June 2018

Efthimia Mavridou, Konstantinos M. Giannoutakis, Dionysios Kehagias, Dimitrios Tzovaras and George Hassapis

Abstract

Purpose

Semantic categorization of Web services comprises a fundamental requirement for enabling more efficient and accurate search and discovery of services in the semantic Web era. However, to efficiently deal with the growing presence of Web services, more automated mechanisms are required. This paper aims to introduce an automatic Web service categorization mechanism, by exploiting various techniques that aim to increase the overall prediction accuracy.

Design/methodology/approach

The paper proposes the use of Error Correcting Output Codes on top of a Logistic Model Trees-based classifier, in conjunction with a data pre-processing technique that reduces the original feature-space dimension without affecting data integrity. The proposed technique is generalized so as to adhere to all Web services with a description file. A semantic matchmaking scheme is also proposed for enabling the semantic annotation of the input and output parameters of each operation.
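The Error Correcting Output Codes idea described above can be sketched in its decoding step: each class gets a binary codeword, one binary classifier predicts each bit, and the class whose codeword is nearest in Hamming distance wins, so a few misclassified bits can still be corrected. The codebook below is an illustrative assumption; the base Logistic Model Trees classifiers are abstracted away.

```python
def ecoc_decode(bits, codebook):
    """Error-Correcting Output Codes decoding: return the class whose
    codeword has the smallest Hamming distance to the predicted bits.
    The per-bit binary classifiers (Logistic Model Trees in the paper)
    are assumed to have produced `bits` already."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codebook, key=lambda cls: hamming(bits, codebook[cls]))

# Illustrative 4-bit codewords for three hypothetical service categories.
codebook = {
    "travel":  [0, 0, 1, 1],
    "finance": [0, 1, 0, 1],
    "medical": [1, 1, 1, 0],
}
```

With this codebook, a prediction with one flipped bit (e.g. [1, 0, 1, 1]) still decodes to "travel", which is the error-correcting property the scheme buys.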

Findings

The proposed Web service categorization framework was tested with the OWLS-TC v4.0, as well as a synthetic data set with a systematic evaluation procedure that enables comparison with well-known approaches. After conducting exhaustive evaluation experiments, categorization efficiency in terms of accuracy, precision, recall and F-measure was measured. The presented Web service categorization framework outperformed the other benchmark techniques, which comprise different variations of it and also third-party implementations.

Originality/value

The proposed three-level categorization approach is a significant contribution to the Web service community, as it allows the automatic semantic categorization of all functional elements of Web services that are equipped with a service description file.

Details

International Journal of Web Information Systems, vol. 14 no. 2
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 22 August 2011

Rose Marie Santini

Abstract

Purpose

This paper aims to discuss how collaborative classification works in online music information retrieval systems and its impacts on the construction, fixation and orientation of the social uses of popular music on the internet.

Design/methodology/approach

Using a comparative method, the paper examines the logic behind music classification in recommender systems through the case of Last.fm, one of the most popular websites of its type. Data collected about users' ritual classifications are compared with the classification used by the music industry, represented by the AllMusic website.

Findings

The paper identifies the differences between the criteria used for the collaborative classification of popular music, which is defined by users, and the traditional standards of commercial classification, used by the cultural industries, and discusses why commercial and non‐commercial classification methods vary.

Practical implications

Collaborative ritual classification reveals a shift in the demand for cultural information that may affect the way in which this demand is organized, as well as the classification criteria for works on the digital music market.

Social implications

Collective creation of a music classification in recommender systems represents a new model of cultural mediation that might change the way of building new uses, tastes and patterns of musical consumption in online environments.

Originality/value

The paper highlights the way in which the classification process might influence the behavior of the users of music information retrieval systems, and vice versa.

Details

OCLC Systems & Services: International digital library perspectives, vol. 27 no. 3
Type: Research Article
ISSN: 1065-075X

Article
Publication date: 1 September 2006

Haichao Dong, Siu Cheung Hui and Yulan He

Abstract

Purpose

The purpose of this research is to study the characteristics of chat messages by analysing a collection of 33,121 sample messages gathered from 1,700 conversation sessions of 72 pairs of MSN Messenger users over a four-month period from June to September 2005. The primary objective of chat message characterization is to understand the properties of chat messages for effective message analysis, such as message topic detection.

Design/methodology/approach

From the study on chat message characteristics, an indicative term‐based categorization approach for chat topic detection is proposed. In the proposed approach, different techniques such as sessionalisation of chat messages and extraction of features from icon texts and URLs are incorporated for message pre‐processing. Naïve Bayes, Associative Classification, and Support Vector Machine are employed as classifiers for categorizing topics from chat sessions.
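One of the three classifiers named above, naïve Bayes over indicative terms, can be sketched compactly. The toy training data and labels below are illustrative assumptions; messages are assumed to be already sessionalised and tokenised into term lists by the pre-processing steps the abstract describes.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_msgs):
    """Fit a multinomial naive Bayes model on (terms, label) pairs,
    where `terms` is a pre-tokenised list of indicative terms."""
    term_counts = defaultdict(Counter)   # per-class term frequencies
    class_counts = Counter()             # class priors
    vocab = set()
    for terms, label in labelled_msgs:
        class_counts[label] += 1
        term_counts[label].update(terms)
        vocab.update(terms)
    return term_counts, class_counts, vocab

def classify_nb(terms, model):
    """Return the class with the highest Laplace-smoothed log posterior."""
    term_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    def log_post(label):
        score = math.log(class_counts[label] / total)
        denom = sum(term_counts[label].values()) + len(vocab)
        for t in terms:
            score += math.log((term_counts[label][t] + 1) / denom)
        return score
    return max(class_counts, key=log_post)
```

The associative-classification and SVM alternatives the paper compares would consume the same term features.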

Findings

The indicative term-based approach is superior to the traditional document-frequency-based approach for feature selection in chat topic categorization.

Originality/value

This paper studies the characteristics of chat messages and proposes an indicative term‐based categorization approach for chat topic detection.

Details

Online Information Review, vol. 30 no. 5
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 24 January 2018

Deborah Lee and Lyn Robinson

Abstract

Purpose

The purpose of this paper is to understand the classification of musical medium, which is a critical part of music classification. It considers how musical medium is currently classified, provides a theoretical understanding of what is currently problematic, and proposes a model which rethinks the classification of medium and resolves these issues.

Design/methodology/approach

The analysis is drawn from existing classification schemes, additionally using musicological and knowledge organization literature where relevant. The paper culminates in the design of a model of musical medium.

Findings

The analysis elicits sub-facets, orders and categorizations of medium: there is a strict categorization between vocal and instrumental music, a categorization based on broad size, and important sub-facets for multiples, accompaniment and arrangement. Problematically, there is a mismatch between the definitiveness of library and information science vocal/instrumental categorization and the blurred nature of real musical works; arrangements and accompaniments are limited by other categorizations; multiple voices and groups are not accommodated. So, a model with a radical new structure is proposed which resolves these classification issues.

Research limitations/implications

The results could be used to further understanding of music classification generally, for Western art music and other types of music.

Practical implications

The resulting model could be used to improve and design new classification schemes and to improve understanding of music retrieval.

Originality/value

Deep theoretical analysis of music classification is rare, so this paper’s approach is original. Furthermore, the paper’s value lies in studying a vital area of music classification which is not currently understood, and providing explanations and solutions. The proposed model is novel in structure and concept, and its original structure could be adapted for other knotty subjects.

Details

Journal of Documentation, vol. 74 no. 2
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 May 2007

Fuchun Peng and Xiangji Huang

Abstract

Purpose

The purpose of this research is to compare several machine learning techniques on the task of classifying text in Asian languages such as Chinese and Japanese, where no word-boundary information is available in written text. The paper advocates a simple language-modeling-based approach for this task.

Design/methodology/approach

Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation‐based approach was compared with the non‐segmentation‐based approach.
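The language-modeling approach can be illustrated at the character level, where it needs no word segmentation at all: build one character n-gram model per class and assign a text to the class whose model gives it the highest likelihood. The bigram order, add-one smoothing and vocabulary-size constant below are illustrative assumptions, not the paper's exact configuration.

```python
import math
from collections import Counter

def char_bigram_model(texts):
    """Character-bigram counts for one class. Working on raw characters
    sidesteps word segmentation entirely."""
    bigrams, chars = Counter(), Counter()
    for t in texts:
        padded = "^" + t                 # "^" marks the start of text
        chars.update(padded[:-1])        # context (previous-char) counts
        bigrams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return bigrams, chars

def log_likelihood(text, model, vocab_size=5000):
    """Add-one-smoothed log P(text) under a class's bigram model;
    vocab_size is an assumed smoothing constant."""
    bigrams, chars = model
    padded = "^" + text
    score = 0.0
    for i in range(len(padded) - 1):
        bg, prev = padded[i:i + 2], padded[i]
        score += math.log((bigrams[bg] + 1) / (chars[prev] + vocab_size))
    return score

def classify_lm(text, models):
    """Pick the class whose language model scores the text highest."""
    return max(models, key=lambda c: log_likelihood(text, models[c]))
```

The same scheme extends to higher-order n-grams, which is where the paper's reported gains over standard classifiers come from.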

Findings

There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy.

Practical implications

Applying the findings to real web text classification is ongoing work.

Originality/value

The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.

Details

Journal of Documentation, vol. 63 no. 3
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 11 January 2011

Olga Godlevskaja, Jos van Iwaarden and Ton van der Wiele

Abstract

Purpose

This paper aims to propose a framework that can be used for analysing services in the automotive industry.

Design/methodology/approach

Existing categorisation schemes for services are investigated and evaluated in terms of their applicability to services in the automotive industry.

Findings

Services categorisation schemes are grouped under eight service paradigms, expressing the understanding that various authors had about services in different times and contexts.

Research limitations/implications

The remarks are limited to the automotive industry.

Practical implications

The paper suggests services classification schemes, which can be effectively applied to automotive services in order to generate valuable managerial insights.

Originality/value

This paper provides an overview over multiple services categorisation schemes existing in the literature.

Details

International Journal of Quality & Reliability Management, vol. 28 no. 1
Type: Research Article
ISSN: 0265-671X

Article
Publication date: 19 June 2009

Dion Hoe‐Lian Goh, Alton Chua, Chei Sian Lee and Khasfariyati Razikin

Abstract

Purpose

Social tagging systems allow users to assign keywords (tags) to useful resources, facilitating their future access by the tag creator and possibly by other users. Social tagging has both proponents and critics, and this paper aims to investigate if tags are an effective means of resource discovery.

Design/methodology/approach

The paper adopts techniques from text categorisation: webpages and their associated tags are downloaded from del.icio.us, and Support Vector Machine (SVM) classifiers are trained to determine whether the documents can be assigned to their associated tags. Two text categorisation experiments were conducted. The first used only the terms from the documents as features, while the second included tags in addition to terms in the feature set. The performance metrics used were precision, recall, accuracy and F1 score. A content analysis was also conducted to uncover characteristics of effective and ineffective tags for resource discovery.
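The metrics used to compare the terms-only and terms-plus-tags feature sets can be computed directly from the counts of true positives, false positives and false negatives. A minimal sketch for a single positive class (the function name is illustrative):

```python
def prf1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, from parallel lists of
    true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred)
             if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A statistically significant difference between the two feature sets would then be tested over many such per-class scores, which is the comparison the Findings section reports as inconclusive.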

Findings

Results from the classifiers were mixed, and the inclusion of tags as part of the feature set did not result in a statistically significant improvement (or degradation) of the performance of the SVM classifiers. This suggests that not all tags can be used for resource discovery by public users, confirming earlier work that there are many dynamic reasons for tagging documents that may not be apparent to others.

Originality/value

The authors extend their understanding of social classification and its utility in sharing and accessing resources. Results of this work may be used to guide development in social tagging systems as well as social tagging practices.

Details

Online Information Review, vol. 33 no. 3
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 30 July 2020

V. Srilakshmi, K. Anuradha and C. Shoba Bindu

Abstract

Purpose

This paper aims to model a technique that categorizes texts from huge document collections. Progress in internet technologies has greatly increased the number of documents accessible online. These text documents comprise research articles, journal papers, newspapers, technical reports and blogs, and such large collections are useful and valuable for real-time applications and for several retrieval methods. Text classification plays a vital role in information-retrieval technologies and is an active field for processing massive collections; its aim is to assign large documents to different categories on the basis of their contents. Numerous text-related tasks, such as profiling users, sentiment analysis and spam identification, are treated as supervised learning problems and addressed with a text classifier.

Design/methodology/approach

At first, the input documents are pre-processed using stop-word removal and stemming, so that the input is suitable for feature extraction. In the feature-extraction process, features are extracted using the vector space model (VSM), and feature selection is then performed to retain the most relevant features for text categorization. Once the features are selected, text categorization proceeds using a deep belief network (DBN). The DBN is trained with the proposed grasshopper crow optimization algorithm (GCOA), an integration of the grasshopper optimization algorithm (GOA) and the crow search algorithm (CSA). Moreover, a hybrid weight-bounding model is devised using the proposed GCOA and range degree. Thus, the proposed GCOA + DBN is used for classifying the text documents.
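The pre-processing and VSM stages described above can be sketched as follows. The tiny stop list and the crude suffix stripping are illustrative assumptions standing in for a real stop list and stemmer (the abstract names neither), and the DBN/GCOA training stage is out of scope here.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}  # tiny illustrative list

def preprocess(text):
    """Stop-word removal plus crude suffix stripping as a stand-in
    for a proper stemmer."""
    terms = []
    for w in text.lower().split():
        if w in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        terms.append(w)
    return terms

def tfidf_vectors(docs):
    """Vector space model: map each document to {term: tf-idf weight}.
    These sparse vectors would feed feature selection and the classifier."""
    tokenised = [preprocess(d) for d in docs]
    df = Counter(t for terms in tokenised for t in set(terms))
    n = len(docs)
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(terms).items()}
        for terms in tokenised
    ]
```

Feature selection would then rank these weighted terms and keep only the most relevant ones before training the DBN.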

Findings

The performance of the proposed technique is evaluated using accuracy, precision and recall, and is compared with existing techniques such as naïve Bayes, k-nearest neighbours, support vector machines, a deep convolutional neural network (DCNN) and Stochastic Gradient-CAViaR + DCNN. The proposed GCOA + DBN shows improved performance, with values of 0.959, 0.959 and 0.96 for precision, recall and accuracy, respectively.

Originality/value

This paper proposes a technique that categorizes texts from massive document collections. The findings show that the proposed GCOA-based DBN effectively classifies text documents.

Details

International Journal of Web Information Systems, vol. 16 no. 3
Type: Research Article
ISSN: 1744-0084
