In the era of Big Data, online digital resources are growing rapidly, with short-text resources such as tweets, comments and messages growing especially vigorously. This study aims to compare the category discriminative capacity (CDC) of Chinese language fragments at different granularities and to explore and verify the feasibility, rationality and effectiveness of low-granularity features, such as Chinese characters, in Chinese short-text classification (CSTC).
This study uses the discipline classification of journal articles from CSSCI as a simulation environment. Having established the distribution patterns of classification features at various granularities, including keywords, terms and characters, it comprehensively compares and evaluates the classification performance obtained with the SVM algorithm from three angles: using the same experimental samples, testing before and after feature optimization, and introducing external data.
The granularity of a classification feature has an important impact on CSTC. In general, the larger the granularity, the better the classification result, and vice versa. However, a low-granularity feature is also feasible, and its CDC can be improved by setting weights appropriately; it can even outperform a high-granularity feature when classification precision, computational complexity and text coverage are considered together.
This is the first study to propose that Chinese characters are more suitable than terms and keywords as descriptive features in CSTC, and to demonstrate that the CDC of Chinese character features can be strengthened by combining frequency and position in the feature weight.
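As a rough illustration of what a character-level feature weighted by both frequency and position might look like, the following minimal Python sketch may help. It is not the authors' method: the abstract does not give the weighting formula, so the field names (`title`, `body`), the multipliers in `POSITION_WEIGHT` and the helper `char_weights` are all hypothetical assumptions chosen only to make the idea concrete.

```python
from collections import Counter

# Hypothetical position multipliers (assumption): the source does not
# specify how position enters the weight, nor any numeric values.
POSITION_WEIGHT = {"title": 3.0, "body": 1.0}


def is_cjk(ch):
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def char_weights(fields):
    """Weight each Chinese character by its frequency, scaled per field.

    `fields` maps a field name (e.g. "title") to its text. Every occurrence
    of a character adds that field's multiplier to the character's weight,
    so both how often and where a character appears affect the result.
    """
    weights = Counter()
    for field, text in fields.items():
        multiplier = POSITION_WEIGHT.get(field, 1.0)
        for ch in text:
            if is_cjk(ch):
                weights[ch] += multiplier
    return dict(weights)


# A toy short text with a title and a body.
doc = {"title": "文本分类", "body": "短文本分类特征"}
weights = char_weights(doc)
```

In this toy document, a character that occurs in both the title and the body receives a larger weight than one that occurs only in the body. The resulting dictionary could then be turned into a feature vector for a classifier such as an SVM.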
This work was supported by the Jiangsu Province Natural Science Foundation project “Study on Chinese Ontology Learning Oriented Patent Forewarning” (No. BK20130587) and the Major Program of the National Social Science Foundation of China “Study on Rapid Response Information System of Emergency Decision for Unexpected Events” (No. 13&ZD174).
Wang, H. and Deng, S. (2017), "A paper-text perspective: Studies on the influence of feature granularity for Chinese short-text-classification in the Big Data era", The Electronic Library, Vol. 35 No. 4, pp. 689-708. https://doi.org/10.1108/EL-09-2016-0192
Emerald Publishing Limited
Copyright © 2017, Emerald Publishing Limited