Using pretraining and text mining methods to automatically extract the chemical scientific data
Data Technologies and Applications
ISSN: 2514-9288
Article publication date: 22 October 2021
Issue publication date: 15 March 2022
Abstract
Purpose
In computational chemistry, the chemical bond energy (pKa) is essential, but most pKa-related data are buried in scientific papers, and only a small portion has been extracted manually by domain experts. Data that remain unextracted cannot support in-depth and innovative scientific data analysis. To address this problem, this study uses natural language processing methods to extract pKa-related scientific data from chemical papers.
Design/methodology/approach
A previous Bert-CRF model was combined with dictionaries and rules to handle the large number of unknown words in professional vocabulary. Building on that model, the authors propose an end-to-end Bert-CRF model whose input consists of domain wordpiece tokens constructed with text mining methods. The authors use standard high-frequency string extraction techniques to construct domain wordpiece tokens for specific domains. In the subsequent deep learning work, these domain features are added to the input.
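As a rough illustration of high-frequency string extraction, the sketch below counts character n-grams across a word list and keeps those that recur. It is not the authors' implementation: the function name `frequent_substrings`, its thresholds and the toy word list are assumptions chosen only to show the idea.

```python
from collections import Counter


def frequent_substrings(words, min_len=3, max_len=6, min_count=2):
    """Count character n-grams across words and keep the frequent ones.

    Hypothetical helper: the length bounds and count threshold are
    illustrative assumptions, not values taken from the paper.
    """
    counts = Counter()
    for word in words:
        w = word.lower()
        for n in range(min_len, max_len + 1):
            # Slide a window of width n over the word.
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


# Toy chemical vocabulary (assumed for the example).
corpus_words = ["benzoic", "benzene", "benzyl", "phenol", "phenyl"]
candidates = frequent_substrings(corpus_words, min_len=4, max_len=4, min_count=2)
print(candidates)
```

On this toy list only the 4-character strings "benz" and "phen" recur, so they would be the candidate domain wordpiece tokens. A real pipeline would run over a full corpus and would likely filter candidates further.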
Findings
The experiments show that the end-to-end Bert-CRF model achieves relatively good results and can be easily transferred to other domains. It reduces the requirements for experts because the domain wordpiece tokenization rules are constructed with automatic high-frequency wordpiece token extraction techniques, and the resulting domain features are then input to the Bert model.
Originality/value
By decomposing many unknown words into domain feature-based wordpiece tokens, the authors resolve the problem of a large professional vocabulary and achieve a better extraction result than the baseline model. The end-to-end model explores low-cost migration for entity and relation extraction in professional fields, reducing the requirements for experts.
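The effect of adding domain wordpiece tokens can be sketched with a greedy longest-match-first tokenizer of the kind used for WordPiece vocabularies. This is a minimal sketch under stated assumptions, not the authors' code: the function `wordpiece_tokenize`, the `##` continuation prefix convention and both toy vocabularies are introduced here for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces.

    Non-initial pieces carry a "##" prefix. If any position cannot be
    matched, the whole word maps to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabularies (assumed): a general one, and the same one extended
# with two hypothetical domain wordpiece tokens.
base_vocab = {"ben", "##z", "##o", "##i", "##c", "acid"}
domain_vocab = base_vocab | {"benz", "##oic"}

general_split = wordpiece_tokenize("benzoic", base_vocab)
domain_split = wordpiece_tokenize("benzoic", domain_vocab)
print(general_split)
print(domain_split)
```

With the general vocabulary the word breaks into five short fragments, while the extended vocabulary yields two chemically meaningful pieces, which is the kind of decomposition the abstract describes.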
Acknowledgements
The authors would like to acknowledge the support of the Center of Basic Molecular Science at Tsinghua University and the National Science Library of the Chinese Academy of Sciences. The authors thank Huizhou Liu, Li Qian, Jinpei Cheng, Jin-Dong Yang and Sanzhong Luo for their insightful suggestions and discussions. This research was supported by the Special Foundation of Science and Technology Resources Survey (No. 2018FY201202).
Citation
Pang, N., Qian, L., Lyu, W. and Yang, J.-D. (2022), "Using pretraining and text mining methods to automatically extract the chemical scientific data", Data Technologies and Applications, Vol. 56 No. 2, pp. 205-222. https://doi.org/10.1108/DTA-11-2020-0284
Publisher
Emerald Publishing Limited
Copyright © 2021, Emerald Publishing Limited