Search results
Mohamed Hammami, Youssef Chahir and Liming Chen
Abstract
Along with the ever-growing Web comes a proliferation of objectionable content, such as sex, violence and racism, so efficient tools are needed for classifying and filtering undesirable web content. In this paper, we investigate this problem through WebGuard, our automatic machine-learning-based system for classifying and filtering pornographic websites. As the Internet becomes increasingly visual and multimedia-rich, as exemplified by pornographic websites, we focus our attention on skin-colour-related visual content-based analysis, used alongside textual and structural content-based analysis, to improve pornographic website filtering. While most commercial filtering products on the marketplace are based mainly on textual content-based analysis, such as detecting indicative keywords or checking manually collected black lists, the originality of our work lies in adding structural and visual content-based analysis to the classical textual content-based analysis, along with several major data-mining techniques for learning and classifying. Experimented on a testbed of 400 websites, comprising 200 adult sites and 200 non-pornographic ones, WebGuard, our web filtering engine, scored a 96.1% classification accuracy rate when only textual and structural content-based analysis was used, and a 97.4% rate when skin-colour-related visual content-based analysis was added. Further experiments on a black list of 12,311 adult websites, manually collected and classified by the French Ministry of Education, showed that WebGuard scored an 87.82% classification accuracy rate using only textual and structural content-based analysis, and 95.62% when visual content-based analysis was added. The basic framework of WebGuard can be applied to other website categorization problems that combine, as most websites do today, textual and visual content.
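The skin-colour visual feature described above can be sketched as follows. The explicit RGB rule is a commonly used heuristic standing in for WebGuard's actual visual model (an assumption, not the paper's method), and the example pixel region is invented:

```python
# Illustrative sketch of skin-colour-related visual analysis: classify each
# pixel with a simple RGB rule, then use the skin-pixel ratio as a feature.

def is_skin_pixel(r, g, b):
    """Common explicit RGB heuristic for skin-coloured pixels."""
    return (r > 95 and g > 40 and b > 20 and
            max(r, g, b) - min(r, g, b) > 15 and
            abs(r - g) > 15 and r > g and r > b)

def skin_ratio(pixels):
    """Fraction of pixels in a region classified as skin-coloured."""
    if not pixels:
        return 0.0
    return sum(is_skin_pixel(*p) for p in pixels) / len(pixels)

# Toy region: eight skin-like pixels and two blue ones.
region = [(200, 120, 90)] * 8 + [(20, 30, 200)] * 2
print(skin_ratio(region))  # 0.8
```

In a full filter, this ratio would be one visual feature combined with the textual and structural features before classification.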
Abstract
Purpose
To provide an integrated perspective to similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and point to problems with the approaches and automated classification as such.
Design/methodology/approach
A range of works dealing with automated classification of full‐text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages.
Findings
Identifies major similarities and differences between the three approaches: document pre-processing and the utilization of web-specific document characteristics are common to all approaches, while the major differences lie in the algorithms applied and in whether the vector space model and controlled vocabularies are employed. Problems of automated classification are also recognized.
Research limitations/implications
The paper does not attempt to provide an exhaustive bibliography of related resources.
Practical implications
As an integrated overview of approaches from different research communities with application examples, it is very useful for students in library and information science and computer science, as well as for practitioners. Researchers from one community have the information on how similar tasks are conducted in different communities.
Originality/value
To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.
Efthimia Mavridou, Konstantinos M. Giannoutakis, Dionysios Kehagias, Dimitrios Tzovaras and George Hassapis
Abstract
Purpose
Semantic categorization of Web services comprises a fundamental requirement for enabling more efficient and accurate search and discovery of services in the semantic Web era. However, to efficiently deal with the growing presence of Web services, more automated mechanisms are required. This paper aims to introduce an automatic Web service categorization mechanism, by exploiting various techniques that aim to increase the overall prediction accuracy.
Design/methodology/approach
The paper proposes the use of Error Correcting Output Codes on top of a Logistic Model Trees-based classifier, in conjunction with a data pre-processing technique that reduces the original feature-space dimension without affecting data integrity. The proposed technique is generalized so as to apply to all Web services with a description file. A semantic matchmaking scheme is also proposed for enabling the semantic annotation of the input and output parameters of each operation.
Findings
The proposed Web service categorization framework was tested with the OWLS-TC v4.0, as well as a synthetic data set with a systematic evaluation procedure that enables comparison with well-known approaches. After conducting exhaustive evaluation experiments, categorization efficiency in terms of accuracy, precision, recall and F-measure was measured. The presented Web service categorization framework outperformed the other benchmark techniques, which comprise different variations of it and also third-party implementations.
Originality/value
The proposed three-level categorization approach is a significant contribution to the Web service community, as it allows the automatic semantic categorization of all functional elements of Web services that are equipped with a service description file.
Abstract
Purpose
This paper aims to discuss how collaborative classification works in online music information retrieval systems and its impacts on the construction, fixation and orientation of the social uses of popular music on the internet.
Design/methodology/approach
Using a comparative method, the paper examines the logic behind music classification in Recommender Systems by studying the case of Last.fm, one of the most popular web sites of this type on the web. Data collected about users' ritual classifications are compared with the classification used by the music industry, represented by the AllMusic web site.
Findings
The paper identifies the differences between the criteria used for the collaborative classification of popular music, which is defined by users, and the traditional standards of commercial classification, used by the cultural industries, and discusses why commercial and non‐commercial classification methods vary.
Practical implications
Collaborative ritual classification reveals a shift in the demand for cultural information that may affect the way in which this demand is organized, as well as the classification criteria for works on the digital music market.
Social implications
Collective creation of a music classification in recommender systems represents a new model of cultural mediation that might change the way of building new uses, tastes and patterns of musical consumption in online environments.
Originality/value
The paper highlights the way in which the classification process might influence the behavior of the users of music information retrieval systems, and vice versa.
Haichao Dong, Siu Cheung Hui and Yulan He
Abstract
Purpose
The purpose of this research is to study the characteristics of chat messages from analysing a collection of 33,121 sample messages gathered from 1,700 sessions of conversations of 72 pairs of MSN Messenger users over a four month duration from June to September of 2005. The primary objective of chat message characterization is to understand the properties of chat messages for effective message analysis, such as message topic detection.
Design/methodology/approach
From the study on chat message characteristics, an indicative term‐based categorization approach for chat topic detection is proposed. In the proposed approach, different techniques such as sessionalisation of chat messages and extraction of features from icon texts and URLs are incorporated for message pre‐processing. Naïve Bayes, Associative Classification, and Support Vector Machine are employed as classifiers for categorizing topics from chat sessions.
Findings
The indicative term-based approach is superior to the traditional document-frequency-based approach for feature selection in chat topic categorization.
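The indicative-term idea can be sketched as follows. Here "indicativeness" is scored as the share of a term's occurrences concentrated in a single topic, which is an assumption standing in for the paper's exact selection criterion; the chat data are invented:

```python
# Hedged sketch of indicative-term feature selection for chat topic
# categorisation: keep terms whose occurrences concentrate in one topic,
# rather than selecting by raw document frequency.

from collections import Counter, defaultdict

def indicative_terms(messages_by_topic, min_share=0.8):
    """Return, per topic, the terms with >= min_share of their occurrences
    falling in that topic."""
    per_topic = {t: Counter(w for m in msgs for w in m.split())
                 for t, msgs in messages_by_topic.items()}
    total = Counter()
    for counts in per_topic.values():
        total.update(counts)
    picked = defaultdict(list)
    for topic, counts in per_topic.items():
        for term, n in counts.items():
            if n / total[term] >= min_share:
                picked[topic].append(term)
    return dict(picked)

chats = {
    "sports": ["match tonight", "great match score", "score update"],
    "food":   ["lunch menu", "great lunch", "menu update"],
}
print(indicative_terms(chats))
```

Note that frequent but topic-neutral terms such as "great" and "update" are filtered out, even though their document frequency is high.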
Originality/value
This paper studies the characteristics of chat messages and proposes an indicative term‐based categorization approach for chat topic detection.
Abstract
Purpose
The purpose of this paper is to understand the classification of musical medium, which is a critical part of music classification. It considers how musical medium is currently classified, provides a theoretical understanding of what is currently problematic, and proposes a model which rethinks the classification of medium and resolves these issues.
Design/methodology/approach
The analysis is drawn from existing classification schemes, additionally using musicological and knowledge organization literature where relevant. The paper culminates in the design of a model of musical medium.
Findings
The analysis elicits sub-facets, orders and categorizations of medium: there is a strict categorization between vocal and instrumental music, a categorization based on broad size, and important sub-facets for multiples, accompaniment and arrangement. Problematically, there is a mismatch between the definitiveness of library and information science vocal/instrumental categorization and the blurred nature of real musical works; arrangements and accompaniments are limited by other categorizations; multiple voices and groups are not accommodated. So, a model with a radical new structure is proposed which resolves these classification issues.
Research limitations/implications
The results could be used to further understanding of music classification generally, for Western art music and other types of music.
Practical implications
The resulting model could be used to improve and design new classification schemes and to improve understanding of music retrieval.
Originality/value
Deep theoretical analysis of music classification is rare, so this paper’s approach is original. Furthermore, the paper’s value lies in studying a vital area of music classification which is not currently understood, and providing explanations and solutions. The proposed model is novel in structure and concept, and its original structure could be adapted for other knotty subjects.
Abstract
Purpose
The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task.
Design/methodology/approach
Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation‐based approach was compared with the non‐segmentation‐based approach.
Findings
There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy.
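The language-modelling approach for segmentation-free classification can be sketched with character bigram models, one per class, scored by smoothed log-likelihood. This is a simplified sketch of the general technique, not the paper's exact model, and the tiny training corpus (in English, for readability) is invented:

```python
# Sketch of segmentation-free text classification via per-class character
# n-gram language models: train a character-bigram model for each class and
# assign a text to the class giving it the highest add-one-smoothed
# log-likelihood. No word segmentation is needed.

import math
from collections import Counter

def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(docs_by_class):
    """One bigram-count model per class."""
    return {c: Counter(bg for d in docs for bg in bigrams(d))
            for c, docs in docs_by_class.items()}

def classify(models, text, vocab_size=10000):
    def loglik(counts):
        n = sum(counts.values())
        return sum(math.log((counts[bg] + 1) / (n + vocab_size))
                   for bg in bigrams(text))
    return max(models, key=lambda c: loglik(models[c]))

train_docs = {
    "sport": ["the match score", "score the match"],
    "food":  ["the lunch menu", "menu the lunch"],
}
models = train(train_docs)
print(classify(models, "match score"))  # sport
```

For Chinese or Japanese, the same code applies directly to raw character strings, which is the point of the segmentation-free approach.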
Practical implications
Applying the findings to real web text classification is ongoing work.
Originality/value
The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification, web search.
Dion Hoe‐Lian Goh, Alton Chua, Chei Sian Lee and Khasfariyati Razikin
Abstract
Purpose
Social tagging systems allow users to assign keywords (tags) to useful resources, facilitating their future access by the tag creator and possibly by other users. Social tagging has both proponents and critics, and this paper aims to investigate if tags are an effective means of resource discovery.
Design/methodology/approach
The paper adopts techniques from text categorisation: webpages and their associated tags were downloaded from del.icio.us, and Support Vector Machine (SVM) classifiers were trained to determine whether the documents could be assigned to their associated tags. Two text categorisation experiments were conducted. The first used only the terms from the documents as features, while the second included tags in addition to terms in the feature set. Performance metrics used were precision, recall, accuracy and F1 score. A content analysis was also conducted to uncover characteristics of effective and ineffective tags for resource discovery.
Findings
Results from the classifiers were mixed, and the inclusion of tags as part of the feature set did not result in a statistically significant improvement (or degradation) of the performance of the SVM classifiers. This suggests that not all tags can be used for resource discovery by public users, confirming earlier work that there are many dynamic reasons for tagging documents that may not be apparent to others.
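The feature-set construction behind the two experiments can be sketched as follows. The weighting scheme, the SVM itself and the example page are all omitted or invented; only the terms-versus-terms-plus-tags distinction is shown:

```python
# Sketch of the two feature sets compared in the experiments: document terms
# only, versus document terms augmented with social tags. Tags are prefixed
# so a tag "python" stays distinct from the body term "python".

from collections import Counter

def features(doc_terms, tags=None):
    """Term-count features, optionally augmented with tag features."""
    feats = Counter(doc_terms)
    for tag in tags or []:
        feats[f"tag:{tag}"] += 1
    return feats

page_terms = ["python", "tutorial", "loops", "python"]
page_tags = ["programming", "python"]

terms_only = features(page_terms)
with_tags = features(page_terms, page_tags)
print(with_tags["tag:python"], with_tags["python"])  # 1 2
```

The mixed results reported above correspond to the finding that adding the `tag:*` features did not significantly change classifier performance.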
Originality/value
The authors extend their understanding of social classification and its utility in sharing and accessing resources. Results of this work may be used to guide development in social tagging systems as well as social tagging practices.
Olga Godlevskaja, Jos van Iwaarden and Ton van der Wiele
Abstract
Purpose
This paper aims to propose a framework that can be used for analysing services in the automotive industry.
Design/methodology/approach
Existing categorisation schemes for services are investigated and evaluated in terms of their applicability to services in the automotive industry.
Findings
Services categorisation schemes are grouped under eight service paradigms, reflecting the understandings that various authors have had of services at different times and in different contexts.
Research limitations/implications
The remarks are limited to the automotive industry.
Practical implications
The paper suggests services classification schemes, which can be effectively applied to automotive services in order to generate valuable managerial insights.
Originality/value
This paper provides an overview over multiple services categorisation schemes existing in the literature.
V. Srilakshmi, K. Anuradha and C. Shoba Bindu
Abstract
Purpose
This paper aims to model a technique that categorizes texts from huge document collections. Progress in internet technologies has increased document accessibility, and the documents available online have become countless. Text documents comprise research articles, journal papers, newspapers, technical reports and blogs. These large documents are useful and valuable for real-time applications and are used in several retrieval methods. Text classification plays a vital role in information retrieval technologies and is an active field for processing massive collections. The aim of text classification is to categorize large documents into different categories on the basis of their contents. Numerous methods exist for text-related tasks such as user profiling, sentiment analysis and spam identification, each of which is treated as a supervised learning problem and addressed with a text classifier.
Design/methodology/approach
At first, the input documents are pre-processed using stop-word removal and stemming, so that the input is made effective for feature extraction. In the feature extraction process, features are extracted using the vector space model (VSM), and feature selection is then performed to retain the most relevant features for text categorization. Once the features are selected, text categorization proceeds using the deep belief network (DBN). The DBN is trained with the proposed grasshopper crow optimization algorithm (GCOA), an integration of the grasshopper optimization algorithm (GOA) and the crow search algorithm (CSA). Moreover, a hybrid weight bounding model is devised using the proposed GCOA and range degree. Thus, the proposed GCOA + DBN is used for classifying the text documents.
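The VSM feature-extraction stage described above can be sketched with a plain TF-IDF computation. The DBN and the GCOA optimiser are beyond a short example, so only the feature stage is shown, on invented tokenised documents:

```python
# Minimal sketch of the vector space model (VSM) step: represent each
# document as a TF-IDF weight vector, where rare terms get higher weights
# than terms occurring across many documents.

import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} vector per tokenised document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [
    "deep belief network text".split(),
    "text categorization with deep learning".split(),
    "grasshopper optimization algorithm".split(),
]
vecs = tfidf(docs)
print(round(vecs[0]["belief"], 3))
```

Vectors like these would then be pruned by feature selection before being fed to the classifier.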
Findings
The performance of the proposed technique is evaluated using accuracy, precision and recall, and is compared with existing techniques such as naive Bayes, k-nearest neighbours, support vector machine, deep convolutional neural network (DCNN) and Stochastic Gradient-CAViaR + DCNN. The proposed GCOA + DBN achieves improved performance, with values of 0.959, 0.959 and 0.96 for precision, recall and accuracy, respectively.
Originality/value
This paper proposes a technique that categorizes texts from massive documents. The findings show that the proposed GCOA-based DBN effectively classifies text documents.