A feature selection approach for automatic e-book classification based on discourse segmentation

Jiunn-Liang Guo (R.O.C. (Taiwan) Air Force Academy)
Hei-Chia Wang (Institute of Information Management, National Cheng Kung University)
Ming-Way Lai (Institute of Information Management, National Cheng Kung University)

Program: electronic library and information systems

ISSN: 0033-0337

Publication date: 2 February 2015

Abstract

Purpose

The purpose of this paper is to develop a novel feature selection approach for automatic text classification of large digital documents, specifically e-books in online library systems. The main idea is to automatically identify discourse features in order to improve the feature selection process, rather than focussing on the size of the corpus.

Design/methodology/approach

The proposed framework automatically identifies the discourse segments within e-books, captures the discourse subtopics that are cohesively expressed in those segments, and treats these subtopics as informative and prominent features. The selected set of features is then used to train a support vector machine that performs the e-book classification task.
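
As an illustration only, the segmentation and subtopic-capture steps might be sketched as follows. The abstract does not name the segmentation algorithm, so this sketch assumes a simplified TextTiling-style approach: the sentence windows on either side of each gap are compared by cosine similarity, and a gap whose similarity falls below a fixed threshold is treated as a subtopic boundary. The function names, the stopword list, the window size and the threshold are all hypothetical, and the support vector machine training step is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "on", "for", "with", "that", "this", "it", "as", "by"}


def tokenize(sentence):
    """Return lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Split sentences into segments at gaps with low lexical cohesion.

    For each gap, the `window` sentences before it are compared with the
    `window` sentences after it. A similarity below `threshold` marks a
    boundary. (Assumption: a fixed threshold stands in for the depth
    scores used by full TextTiling.)
    """
    bags = [Counter(tokenize(s)) for s in sentences]
    boundaries = []
    for gap in range(1, len(sentences)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        if cosine(left, right) < threshold:
            boundaries.append(gap)
    segments, start = [], 0
    for b in boundaries + [len(sentences)]:
        segments.append(sentences[start:b])
        start = b
    return segments


def subtopic_features(segments, top_k=3):
    """Take the most frequent terms of each segment as its subtopic features."""
    features = []
    for seg in segments:
        counts = Counter(t for s in seg for t in tokenize(s))
        features.append([t for t, _ in counts.most_common(top_k)])
    return features


sentences = [
    "Neural networks learn weights from training data.",
    "Training a network adjusts the weights by gradient descent.",
    "Libraries catalogue books using subject headings.",
    "A catalogue record lists the subject headings of each book.",
]
segments = segment(sentences)
print(len(segments), subtopic_features(segments))
```

In this toy input, the two halves share no vocabulary, so the only boundary falls between the second and third sentences. In a full pipeline, the per-segment features would be pooled into the document's feature vector before classifier training.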

Findings

The evaluation of the proposed framework shows that identifying discourse segments and capturing subtopic features lead to better classification performance than two conventional feature selection techniques: TF-IDF and mutual information. It also demonstrates that discourse features play an important role among textual features, especially for large documents such as e-books.
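
For readers unfamiliar with the two baselines, the following is a minimal sketch of how TF-IDF weights and a term-class mutual information score are commonly computed. It uses the textbook definitions (raw term frequency times log(N/df), and mutual information over the binary events "document contains the term" and "document belongs to the class"); the exact variants used in the paper are not stated in the abstract, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf * idf} dict per document, with idf = ln(N / df)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]


def mutual_information(docs, labels, term, cls):
    """Mutual information (in bits) between term presence and class membership."""
    n = len(docs)
    # Count documents in each (term present?, in class?) cell.
    cells = Counter((term in doc, label == cls)
                    for doc, label in zip(docs, labels))
    mi = 0.0
    for (has_term, in_cls), n_tc in cells.items():
        n_t = sum(v for (t, _), v in cells.items() if t == has_term)
        n_c = sum(v for (_, c), v in cells.items() if c == in_cls)
        mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi


# Toy corpus of tokenized documents (hypothetical data).
docs = [
    ["svm", "kernel", "margin"],
    ["svm", "margin", "data"],
    ["novel", "plot", "data"],
    ["novel", "character", "plot"],
]
labels = ["cs", "cs", "fiction", "fiction"]

print(tfidf(docs)[0])
print(mutual_information(docs, labels, "svm", "cs"))
print(mutual_information(docs, labels, "data", "cs"))
```

In this toy corpus, "svm" separates the two classes perfectly and receives the maximum score of one bit, whereas "data" occurs equally often in both classes and scores zero. Both baselines score terms from corpus-wide counts alone, without regard to where in a document a term occurs.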

Research limitations/implications

Automatically extracted subtopic features cannot be entered directly into the feature selection (FS) process; they must first be filtered by a controlled threshold.
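
The effect of such a threshold can be illustrated with a short sketch. The abstract does not say which quantity is thresholded, so the per-feature scores, the feature names and the candidate threshold values below are purely hypothetical.

```python
def filter_by_threshold(scores, threshold):
    """Keep only the features whose score is at least `threshold`."""
    return {f: s for f, s in scores.items() if s >= threshold}


def sweep(scores, thresholds):
    """Report how many features survive at each candidate threshold."""
    return {t: len(filter_by_threshold(scores, t)) for t in thresholds}


# Hypothetical subtopic-feature scores (not taken from the paper).
scores = {"gradient": 0.9, "catalogue": 0.6, "chapter": 0.2, "page": 0.05}
print(sweep(scores, [0.1, 0.5, 0.8]))
```

A sweep of this kind shows the trade-off involved: a low threshold admits uninformative features, while a high one discards subtopics that may matter for classification.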

Practical implications

The proposed technique demonstrates the promise of using discourse analysis, as opposed to conventional techniques, to enhance the classification of large digital documents such as e-books.

Originality/value

A new feature selection (FS) technique is proposed that can inspect the narrative structure of large documents; it is new to the text classification domain. A further contribution is that the evaluation results provide additional evidence for considering discourse information in future text analysis. The proposed system can be integrated into other library management systems.

Acknowledgements

This research is based on work supported (in part) by project NSC 101-2410-H-006-011-MY2 from the National Science Council, Taiwan. The authors also sincerely thank the anonymous reviewers and editors, whose valuable comments significantly improved the quality of this paper.

Citation

Guo, J.-L., Wang, H.-C. and Lai, M.-W. (2015), "A feature selection approach for automatic e-book classification based on discourse segmentation", Program: electronic library and information systems, Vol. 49 No. 1, pp. 2-22. https://doi.org/10.1108/PROG-12-2012-0071

Publisher: Emerald Group Publishing Limited

Copyright © 2015, Emerald Group Publishing Limited
