Using pretraining and text mining methods to automatically extract the chemical scientific data

Na Pang (Department of Information Management, Peking University, Beijing, China)
Li Qian (National Science Library Chinese Academy of Sciences, Beijing, China) (Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing, China)
Weimin Lyu (Department of Computer Science, Stony Brook University, Stony Brook, New York, USA)
Jin-Dong Yang (Department of Chemistry, Tsinghua University, Beijing, China)

Data Technologies and Applications

ISSN: 2514-9288

Article publication date: 22 October 2021

Issue publication date: 15 March 2022

Abstract

Purpose

In computational chemistry, the chemical bond energy (pKa) is essential, but most pKa-related data are buried in scientific papers, and only a small fraction has been extracted manually by domain experts. This loss of scientific data hinders in-depth and innovative data analysis. To address the problem, this study uses natural language processing methods to extract pKa-related scientific data from chemical papers.

Design/methodology/approach

A previous Bert-CRF model relied on dictionaries and rules to handle the large number of unknown words in the professional vocabulary. Building on that model, the authors propose an end-to-end Bert-CRF model whose input includes domain wordpiece tokens constructed with text mining methods. Standard high-frequency string extraction techniques are used to construct the wordpiece tokens for a specific domain, and the resulting domain features are added to the input in the subsequent deep learning stage.
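The idea of building domain wordpiece tokens from high-frequency strings can be illustrated with a minimal sketch. The abstract does not specify the extraction algorithm, so everything here is an assumption made for illustration: the character n-gram counting, the length and frequency thresholds, the greedy longest-match split with a `##` continuation prefix, and the function names `frequent_substrings` and `wordpiece_split` are hypothetical and are not the authors' implementation.

```python
from collections import Counter


def frequent_substrings(terms, min_len=3, max_len=6, min_count=2):
    """Count character n-grams across all terms and keep the frequent ones.

    Hypothetical stand-in for "high-frequency string extraction";
    the thresholds are illustrative, not taken from the paper.
    """
    counts = Counter()
    for term in terms:
        for n in range(min_len, max_len + 1):
            for i in range(len(term) - n + 1):
                counts[term[i:i + n]] += 1
    return {s for s, c in counts.items() if c >= min_count}


def wordpiece_split(word, vocab):
    """Greedy longest-match-first split of a word into vocabulary pieces.

    Falls back to single characters when no vocabulary entry matches.
    Non-initial pieces get the "##" continuation prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        piece = word[start:end]
        pieces.append(piece if start == 0 else "##" + piece)
        start = end
    return pieces


# Toy domain vocabulary built from four chemical names (illustrative data only).
terms = ["methylbenzene", "ethylbenzene", "methylamine", "ethylamine"]
vocab = frequent_substrings(terms, min_len=4, max_len=7, min_count=2)
pieces = wordpiece_split("methylbenzene", vocab)
```

With this toy input, `pieces` is `["methyl", "##benzene"]`: a term that a general-purpose vocabulary might treat as unknown is decomposed into recurring domain fragments, which is the kind of input the model described above would receive.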

Findings

The experiments show that the end-to-end Bert-CRF model achieves relatively good results and can be transferred easily to other domains. It reduces the need for expert input because the domain wordpiece tokenization rules are constructed by automatic extraction of high-frequency wordpiece tokens, and the resulting domain features are then fed to the Bert model.

Originality/value

By decomposing many unknown words into domain feature-based wordpiece tokens, the authors address the problem of a large professional vocabulary and achieve better extraction results than the baseline model. The end-to-end model explores low-cost migration of entity and relation extraction to professional fields, reducing the need for expert input.

Acknowledgements

The authors would like to thank the Center of Basic Molecular Science at Tsinghua University and the National Science Library of the Chinese Academy of Sciences for their support. The authors thank Huizhou Liu, Li Qian, Jinpei Cheng, Jin-Dong Yang and Sanzhong Luo for their insightful suggestions and discussions. This research was supported by the Special Foundation of Science and Technology Resources Survey (No. 2018FY201202).

Citation

Pang, N., Qian, L., Lyu, W. and Yang, J.-D. (2022), "Using pretraining and text mining methods to automatically extract the chemical scientific data", Data Technologies and Applications, Vol. 56 No. 2, pp. 205-222. https://doi.org/10.1108/DTA-11-2020-0284

Publisher

Emerald Publishing Limited

Copyright © 2021, Emerald Publishing Limited
