Search results

1–10 of over 32,000
Article
Publication date: 23 August 2019

Janani Balakumar and S. Vijayarani Mohan

Abstract

Purpose

Owing to the huge volume of documents available on the internet, text classification has become a necessary task for handling them. To achieve optimal text classification results, feature selection, an important stage, is used to curtail the dimensionality of text documents by choosing suitable features. The main purpose of this research is to classify personal computer documents based on their content.

Design/methodology/approach

This paper proposes a new feature selection algorithm based on the artificial bee colony (ABCFS) to enhance text classification accuracy. The proposed ABCFS algorithm is evaluated on real and benchmark data sets and compared with existing feature selection approaches such as information gain and the χ2 statistic. To demonstrate the efficiency of the proposed algorithm, both a support vector machine (SVM) and an improved SVM classifier are used in this paper.

Findings

The experiments were conducted on real and benchmark data sets. The real data set was collected as documents stored on a personal computer, and the benchmark data sets were drawn from the Reuters and 20 Newsgroups corpora. The results demonstrate that the proposed feature selection algorithm improves text document classification accuracy.

Originality/value

This paper proposes the new ABCFS algorithm for feature selection, evaluates its efficiency and improves the support vector machine. Here, the ABCFS algorithm is used to select features from unstructured text documents. In existing work, the artificial bee colony algorithm has been applied to feature selection on structured data, but no such text feature selection algorithm exists. The proposed algorithm classifies documents automatically based on their content.
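The bee-colony search over feature subsets that ABCFS performs can be sketched in simplified form. This is an illustrative stand-in, not the authors' algorithm: the toy fitness below rewards a known-good pair of features, where ABCFS would score a subset by classifier (e.g. SVM) accuracy.

```python
import random

def abc_feature_selection(fitness, n_features, n_bees=10, limit=5, iters=30, seed=0):
    """Artificial-bee-colony search over binary feature masks.
    fitness: mask -> float, higher is better (e.g. wrapper accuracy)."""
    rng = random.Random(seed)
    def rand_mask():
        return tuple(rng.randint(0, 1) for _ in range(n_features))
    def neighbour(mask):
        m = list(mask)
        m[rng.randrange(n_features)] ^= 1           # toggle one feature in or out
        return tuple(m)
    food = [rand_mask() for _ in range(n_bees)]     # candidate feature subsets
    trials = [0] * n_bees
    best = max(food, key=fitness)
    for _ in range(iters):
        # employed bees: one local move per food source, keep improvements
        for i in range(n_bees):
            cand = neighbour(food[i])
            if fitness(cand) > fitness(food[i]):
                food[i], trials[i] = cand, 0
            else:
                trials[i] += 1
        # onlooker bees: revisit sources with probability proportional to fitness
        weights = [max(fitness(m), 1e-9) for m in food]
        for _ in range(n_bees):
            i = rng.choices(range(n_bees), weights=weights)[0]
            cand = neighbour(food[i])
            if fitness(cand) > fitness(food[i]):
                food[i], trials[i] = cand, 0
        # scout bees: abandon sources that stopped improving
        for i in range(n_bees):
            if trials[i] > limit:
                food[i], trials[i] = rand_mask(), 0
        best = max(food + [best], key=fitness)
    return best

# Toy fitness: features 0 and 1 are "informative", the rest only add cost.
toy_fitness = lambda m: sum(m[:2]) - 0.1 * sum(m[2:])
print(abc_feature_selection(toy_fitness, n_features=6))
```

The employed/onlooker/scout phases are the standard ABC structure; only the fitness function would change for a real wrapper-based feature selector.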

Article
Publication date: 1 May 2006

Koraljka Golub

Abstract

Purpose

To provide an integrated perspective on the similarities and differences between approaches to automated classification in different research communities (machine learning, information retrieval and library science), and to point out problems with these approaches and with automated classification as such.

Design/methodology/approach

A range of works dealing with automated classification of full‐text web documents are discussed. Explorations of individual approaches are given in the following sections: special features (description, differences, evaluation), application and characteristics of web pages.

Findings

The main similarities and differences between the three approaches are identified: document pre‐processing and the use of web‐specific document characteristics are common to all of them, while the major differences lie in the algorithms applied and in whether the vector space model and controlled vocabularies are employed. Problems of automated classification are also recognized.

Research limitations/implications

The paper does not attempt to provide an exhaustive bibliography of related resources.

Practical implications

As an integrated overview of approaches from different research communities with application examples, the paper is very useful for students in library and information science and computer science, as well as for practitioners. It also gives researchers in one community information on how similar tasks are conducted in other communities.

Originality/value

To the author's knowledge, no review paper on automated text classification attempted to discuss more than one community's approach from an integrated perspective.
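One common element across the surveyed approaches is the vector space model: documents become TF-IDF-weighted term vectors compared by cosine similarity. A minimal stdlib-only sketch, not any surveyed system's implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))    # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["web", "page", "classification"],
        ["web", "search", "engine"],
        ["automatic", "page", "classification"]]
vecs = tfidf_vectors(docs)
# documents sharing more informative terms score higher
print(cosine(vecs[0], vecs[2]), cosine(vecs[0], vecs[1]))
```

Terms occurring in every document get zero IDF weight, which is the model's built-in down-weighting of uninformative words.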

Details

Journal of Documentation, vol. 62 no. 3
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 21 January 2019

Issa Alsmadi and Keng Hoon Gan

Abstract

Purpose

Rapid developments in social networks and their use in everyday life have caused an explosion in the number of short electronic documents, so the need to classify these documents based on their content has significant implications for many applications. Short-text classification is an essential step in applications such as spam filtering, sentiment analysis, Twitter personalization and customer review analysis, among many others related to social networks. Reviews of short text and its applications are limited. This paper therefore aims to discuss the characteristics of short text and the challenges and difficulties it poses for classification. The paper attempts to cover all stages of a typical classification pipeline, the techniques used at each stage and the possible development trends at each stage.

Design/methodology/approach

The paper is a review of the main aspects of short-text classification and is structured according to the stages of the classification task.

Findings

This paper discusses related issues and approaches to these problems. Further research could address the remaining challenges of short texts to avoid poor classification accuracy. Low performance can be mitigated with optimized solutions, such as genetic algorithms, which are powerful at enhancing the quality of selected features. Soft computing approaches such as fuzzy logic also make short-text classification a promising area of research.

Originality/value

Using a powerful short-text classification method significantly affects many applications in terms of efficiency enhancement. Current solutions still have low performance, implying the need for improvement. This paper discusses related issues and approaches to these problems.
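As an illustration of the classification stage that such short-text pipelines end in, here is a minimal multinomial naive Bayes classifier over whitespace tokens. It is a generic baseline sketch, not any method from the review:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes with Laplace smoothing."""
    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for t, y in zip(texts, labels):
            self.word_counts[y].update(t.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        n = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for y, cnt in self.class_counts.items():
            lp = math.log(cnt / n)                    # class prior
            total = sum(self.word_counts[y].values()) + len(self.vocab)
            for w in tokens:
                lp += math.log((self.word_counts[y][w] + 1) / total)  # smoothed likelihood
            if lp > best_lp:
                best, best_lp = y, lp
        return best

nb = NaiveBayes().fit(
    ["free prize now", "win money free", "meeting at noon", "project review meeting"],
    ["spam", "spam", "ham", "ham"])
print(nb.predict("free money"))
```

For genuinely short texts, the sparseness this review discusses is visible here: most tokens are unseen, so smoothing dominates and accuracy drops, which motivates the feature-enhancement techniques the paper surveys.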

Details

International Journal of Web Information Systems, vol. 15 no. 2
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 1 February 1986

CLARE BEGHTOL

Abstract

A strong definition of aboutness and a theory of its role in information retrieval systems have not been developed. Such a definition and theory may be extracted from the work of T. A. van Dijk. This paper discusses some of the implications of van Dijk's work for bibliographic classification theory. Two kinds of intertextuality are identified: that between documents classified in the same class of the same classification system; and that between the classification system as a text in its own right and the documents that are classified by it. Consideration of the two kinds of intertextuality leads to an investigation of the linguistic/cognitive processes that have been called the ‘translation’ of a document topic into a classificatory language. A descriptive model of the cognitive process of classifying documents is presented. The general design of an empirical study to test this model is suggested, and some problems of implementing such a study are briefly identified. It is concluded that further investigation of the relationships between text linguistics and classification theory and practice might reveal other fruitful intersections between the two fields.

Details

Journal of Documentation, vol. 42 no. 2
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 7 August 2009

Arash Joorabchi and Abdulhussain E. Mahdi

Abstract

Purpose

With the significant growth in electronic education materials such as syllabus documents and lecture notes, available on the internet and intranets, there is a need for robust central repositories of such materials to allow both educators and learners to conveniently share, search and access them. The purpose of this paper is to report on the work to develop a national repository for course syllabi in Ireland.

Design/methodology/approach

The paper describes a prototype syllabus repository system for higher education in Ireland, developed using a number of information extraction and document classification techniques, including a new, fully unsupervised document classification method that uses a web search engine to automatically collect a training set for the classification algorithm.

Findings

Preliminary experimental results for evaluating the performance of the system and its various units, particularly the information extractor and the classifier, are presented and discussed.

Originality/value

In this paper, three major obstacles associated with creating a large‐scale syllabus repository are identified, and a comprehensive review of published research addressing these problems is provided. Two different types of syllabus documents are identified, and a rule‐based information extraction system capable of extracting structured information from unstructured syllabus documents is described. Finally, the importance of classifying resources in a syllabus digital library is highlighted, a number of standard education classification schemes are introduced, and the unsupervised automated document classification system, which classifies syllabus documents based on an extended version of the International Standard Classification of Education, is described.

Details

The Electronic Library, vol. 27 no. 4
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 24 February 2021

Yen-Liang Chen, Li-Chen Cheng and Yi-Jun Zhang

Abstract

Purpose

A necessary preprocessing step in document classification is to label some documents so that a classifier can be built with which the remaining documents can be classified. Because documents differ in length and complexity, the cost of labeling each document differs. The purpose of this paper is to consider how to select a subset of documents for labeling under a limited budget, so that the total labeling cost does not exceed the budget while the resulting classifier achieves the best possible classification results.

Design/methodology/approach

In this paper, a framework is proposed for selecting the instances to label; it integrates two clustering algorithms and two centroid selection methods. From the selected and labeled instances, five different classifiers were constructed, whose good classification accuracy demonstrates the quality of the selected instances.

Findings

Experimental results show that this method can establish a training data set containing the most suitable data while respecting the cost constraints. Because the data set considers both "data representativeness" and "data selection cost," the training data labeled by experts can effectively establish a classifier with high accuracy.

Originality/value

No previous research has considered how to establish a training set with a cost limit when each document has a distinct labeling cost. This paper is the first attempt to resolve this issue.
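The general shape of clustering-plus-centroid selection under a labeling budget can be sketched as follows. Every detail here (plain k-means, one representative per cluster, a greedy budget filter, point-keyed costs) is an illustrative simplification, not the paper's specific algorithms:

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: dist(p, centroids[j]))].append(p)
        centroids = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return centroids, clusters

def select_within_budget(points, costs, k, budget):
    """Cluster the unlabeled pool, take the point nearest each centroid as
    that cluster's representative, then keep representatives greedily while
    their labeling cost still fits the budget."""
    centroids, clusters = kmeans(points, k)
    reps = [min(cl, key=lambda p: dist(p, c))
            for c, cl in zip(centroids, clusters) if cl]
    chosen, spent = [], 0.0
    for p in reps:
        if spent + costs[p] <= budget:
            chosen.append(p)
            spent += costs[p]
    return chosen

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
costs = {p: 1.0 for p in pts}
print(select_within_budget(pts, costs, k=2, budget=2.0))
```

The cluster representatives capture "data representativeness", while the budget filter captures "data selection cost", the two criteria the abstract names.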

Details

The Electronic Library, vol. 39 no. 1
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 20 November 2009

Alexander Mehler and Ulli Waltinger

Abstract

Purpose

The purpose of this paper is to present a topic classification model using the Dewey Decimal Classification (DDC) as the target scheme. This is done by exploring metadata provided by the Open Archives Initiative (OAI) to derive document snippets as minimal document representations, the aim being to reduce the effort of document processing in digital libraries. Further, the paper performs feature selection and extension by means of social ontologies and related web‐based lexical resources, to provide reliable topic‐related classifications while circumventing the problem of data sparseness. Finally, the paper evaluates the model by means of two language‐specific corpora. The paper bridges digital libraries, on the one hand, and computational linguistics, on the other; the aim is to make computational linguistic methods accessible for providing thematic classifications in digital libraries based on closed topic models such as the DDC.

Design/methodology/approach

The approach draws on text classification, text technology, computational linguistics, computational semantics and social semantics.

Findings

It is shown that SVM‐based classifiers perform best by exploring certain selections of OAI document metadata.

Research limitations/implications

The findings show that it is necessary to further develop SVM‐based DDC‐classifiers by using larger training sets possibly for more than two languages in order to get better F‐measure values.

Originality/value

Algorithmic and formal‐mathematical information is provided on how to build DDC‐classifiers for digital libraries.

Details

Library Hi Tech, vol. 27 no. 4
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 3 December 2018

Cong-Phuoc Phan, Hong-Quang Nguyen and Tan-Tai Nguyen

Abstract

Purpose

Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industry. To maximally exploit their potential, searching these patent documents has become an increasingly important topic. Although much research has processed large collections, few studies have attempted to integrate both patent classifications and specifications when analyzing user queries. Consequently, queries are often insufficiently analyzed, limiting the accuracy of search results. This paper aims to address this limitation by exploiting semantic relationships between patent contents and their classification.

Design/methodology/approach

The contributions are fourfold. First, the authors enhance similarity measurement between two short sentences, making it 20 per cent more accurate. Second, the Graph-embedded Tree ontology is enriched by integrating both patent documents and the classification scheme. Third, the ontology does not rely on rule-based methods or text matching; instead, a heuristic meaning comparison is applied to extract semantic relationships between concepts. Finally, the patent search approach uses the ontology effectively, with results sorted by their most common order.

Findings

In an experiment searching 600 patent documents in the field of logistics, the approach yields a 15 per cent improvement in F-measure compared with traditional approaches.

Research limitations/implications

The research still requires improvement: the terms and phrases extracted as nouns and noun phrases are sometimes not meaningful, which may limit accuracy. The large collection of extracted relationships could also be further optimized for conciseness. In addition, parallel processing such as MapReduce could be used to improve search processing performance.

Practical implications

The experimental results could be used by scientists and technologists to search for novel, non-obvious technologies in patents.

Social implications

High-quality patent search results will reduce patent infringement.

Originality/value

The proposed ontology is semantically enriched by integrating both patent documents and their classification. This ontology facilitates the analysis of the user queries for enhancing the accuracy of the patent search results.
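For context, a common baseline against which enhanced short-sentence similarity measures are evaluated is token-set Jaccard overlap. The sketch below is that baseline, not the authors' improved measure:

```python
def jaccard(s1, s2):
    """Baseline similarity between two short sentences: overlap of their
    word sets divided by the size of their union."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Two of three words shared out of four distinct words -> 0.5
print(jaccard("patent retrieval system", "patent search system"))
```

The weakness that motivates semantic measures is visible here: "retrieval" and "search" contribute nothing despite being near-synonyms, which is exactly what a meaning-comparison approach tries to fix.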

Details

International Journal of Web Information Systems, vol. 15 no. 3
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 1 April 1974

KAREN SPARCK JONES

Abstract

This article reviews the state of the art in automatic indexing, that is, automatic techniques for analysing and characterising documents, for manipulating their descriptions in searching, and for generating the index language used for these purposes. It concentrates on the literature from 1968 to 1973. Section I defines the topic and its context. Sections II and III consider work in syntax and semantics respectively in detail. Section IV comments on ‘indirect’ indexing. Section V briefly surveys operating mechanized systems. In Section VI major experiments in automatic indexing are reviewed, and Section VII attempts an overall conclusion on the current state of automatic indexing techniques.

Details

Journal of Documentation, vol. 30 no. 4
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 18 November 2013

Arash Joorabchi and Abdulhussain E. Mahdi

Abstract

Purpose

This paper aims to report on the design and development of a new approach for automatic classification and subject indexing of research documents in scientific digital libraries and repositories (DLR) according to library controlled vocabularies such as DDC and FAST.

Design/methodology/approach

The proposed concept matching-based approach (CMA) detects key Wikipedia concepts occurring in a document and searches the OPACs of conventional libraries via querying the WorldCat database to retrieve a set of MARC records which share one or more of the detected key concepts. Then the semantic similarity of each retrieved MARC record to the document is measured and, using an inference algorithm, the DDC classes and FAST subjects of those MARC records which have the highest similarity to the document are assigned to it.

Findings

The performance of the proposed method, in terms of the accuracy of the DDC classes and FAST subjects automatically assigned to a set of research documents, is evaluated using the standard information retrieval measures of precision, recall and F1. The authors demonstrate the superior accuracy of the proposed approach in comparison to a similar system currently deployed in a large-scale scientific search engine.

Originality/value

The proposed approach enables the development of a new type of subject classification system for DLR, and addresses some of the problems similar systems suffer from, such as the problem of imbalanced training data encountered by machine learning-based systems, and the problem of word-sense ambiguity encountered by string matching-based systems.
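The inference step, assigning a document the labels of the retrieved records most similar to it, can be sketched with toy data. The record format and overlap-based scoring below are hypothetical stand-ins for MARC records and the paper's semantic similarity measure:

```python
from collections import defaultdict

def infer_labels(doc_concepts, records, top_n=3):
    """Score each catalogue record by the concepts it shares with the
    document, then let the top_n closest records vote on the label.
    records: list of (concept_set, labels) pairs."""
    scored = sorted(records, key=lambda r: len(doc_concepts & r[0]),
                    reverse=True)[:top_n]
    votes = defaultdict(float)
    for concepts, labels in scored:
        overlap = len(doc_concepts & concepts)
        for lab in labels:
            votes[lab] += overlap          # weight each vote by similarity
    return max(votes, key=votes.get) if votes else None

doc = {"machine learning", "text classification"}
records = [({"machine learning", "text classification"}, ["006.31"]),
           ({"library science", "cataloguing"}, ["025.3"]),
           ({"text classification"}, ["006.31"])]
print(infer_labels(doc, records))
```

Because votes are weighted by similarity rather than simple counts, a single very close record can outvote several weak matches, which mirrors the "highest similarity" criterion the abstract describes.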
