A paper-text perspective: Studies on the influence of feature granularity for Chinese short-text-classification in the Big Data era

Hao Wang (School of Information Management, Nanjing University, Nanjing, China)
Sanhong Deng (School of Information Management, Nanjing University, Nanjing, China)

The Electronic Library

ISSN: 0264-0473

Article publication date: 7 August 2017

Abstract

Purpose

In the era of Big Data, network digital resources are growing rapidly, and short-text resources such as tweets, comments and messages are especially vigorous. This study aims to compare the categories discriminative capacity (CDC) of Chinese language fragments of different granularities and to explore and verify the feasibility, rationality and effectiveness of low-granularity features, such as Chinese characters, in Chinese short-text classification (CSTC).

Design/methodology/approach

This study takes the discipline classification of journal articles from CSSCI as a simulation environment. After sorting out the distribution rules of classification features at various granularities, including keywords, terms and characters, the classification results obtained with the SVM algorithm are comprehensively compared and evaluated from three angles: using the same experimental samples, testing before and after feature optimization, and introducing external data.
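To make the notion of feature granularity concrete, the following is a minimal sketch, not the authors' pipeline. It counts features at two granularities for a made-up title string: single characters, and overlapping character bigrams used here only as a rough stand-in for segmented terms (an assumption, since the abstract does not say how terms were extracted). The function names and the sample string are hypothetical, and the SVM step is not shown.

```python
from collections import Counter


def char_features(text: str) -> Counter:
    """Count each non-whitespace character (the lowest granularity)."""
    return Counter(ch for ch in text if not ch.isspace())


def ngram_features(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams.

    Assumption: n-grams only approximate terms; a real system would
    more likely use a word segmenter or the articles' own keywords.
    """
    chars = [ch for ch in text if not ch.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


# Hypothetical sample title, not taken from the study's data.
title = "信息管理与信息系统"
print(char_features(title))
print(ngram_features(title))
```

Character features form a small, closed vocabulary that covers almost any text, while coarser features are more specific but sparser, which is the trade-off the study evaluates.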

Findings

The granularity of a classification feature has an important impact on CSTC. In general, the larger the granularity, the better the classification result, and vice versa. However, a low-granularity feature is also feasible: its CDC can be improved by reasonable weight setting and can even exceed that of a high-granularity feature when classification precision, computational complexity and text coverage are considered together.

Originality/value

This is the first study to propose that Chinese characters are more suitable as descriptive features in CSTC than terms and keywords and to demonstrate that the CDC of Chinese character features could be strengthened by combining frequency and position in the weight.
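One way to picture a weight that mixes frequency and position is sketched below. This is an illustration under stated assumptions, not the weighting scheme of the paper: "position" is taken to mean the field a character appears in (title or abstract), and the field multipliers in `FIELD_WEIGHTS` are invented values.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical multipliers: a character in the title counts three times
# as much as the same character in the abstract.
FIELD_WEIGHTS: Dict[str, float] = {"title": 3.0, "abstract": 1.0}


def weighted_char_features(fields: Dict[str, str]) -> Dict[str, float]:
    """Sum, per character, its occurrences scaled by the field weight."""
    weights: Dict[str, float] = defaultdict(float)
    for field, text in fields.items():
        field_weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields count once
        for ch in text:
            if not ch.isspace():
                weights[ch] += field_weight
    return dict(weights)


# Hypothetical record, not taken from the study's data.
doc = {"title": "文本分类", "abstract": "本文研究短文本分类"}
features = weighted_char_features(doc)
print(features["文"])  # one title occurrence (3.0) plus two abstract occurrences (2.0)
```

The resulting dictionary could be turned into a feature vector for a classifier such as an SVM; that step is omitted here.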

Acknowledgements

This work was supported by the Jiangsu Province Natural Science Foundation project “Study on Chinese Ontology Learning Oriented Patent Forewarning” (No. BK20130587) and the Major Program of the National Social Science Foundation of China “Study on Rapid Response Information System of Emergency Decision for Unexpected Events” (No. 13&ZD174).

Citation

Wang, H. and Deng, S. (2017), "A paper-text perspective: Studies on the influence of feature granularity for Chinese short-text-classification in the Big Data era", The Electronic Library, Vol. 35 No. 4, pp. 689-708. https://doi.org/10.1108/EL-09-2016-0192

Publisher

Emerald Publishing Limited

Copyright © 2017, Emerald Publishing Limited