A dictionary‐based approach to normalizing gene names in one domain of knowledge from the biomedical literature

Carmen Galvez (Department of Information Science, Communication and Documentation Faculty, University of Granada, Granada, Spain)

Félix de Moya‐Anegón (SCImago Research Group (CSIC), Institute of Public Goods and Policies (IPP), Madrid, Spain)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 13 January 2012

Abstract

Purpose

Gene term variation is an obstacle for text‐mining applications that support literature‐based knowledge discovery in biomedicine. The purpose of this paper is to propose a technique for normalizing gene names in biomedical literature.

Design/methodology/approach

Under this proposal, each normalized form is a unique gene symbol, defined as the official symbol or normalized name. The unification method involves five stages: collection of gene terms, using the resources provided by the Entrez Gene database; encoding of gene‐naming terms in a table or binary matrix; design of a parametrized finite‐state graph (P‐FSG); automatic generation of a dictionary; and matching based on dictionary look‐up to transform gene mentions into the corresponding unified form.
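The last two stages (dictionary generation and look‐up matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy table `GENE_TABLE`, the helper names `variants`, `build_dictionary` and `normalize`, and the simple hyphen/space variant generator (which stands in for the P‐FSG) are all assumptions made for illustration, and the aliases are examples rather than an extract from Entrez Gene.

```python
import re

# Hypothetical toy table: official symbol -> gene-naming terms (illustrative only).
GENE_TABLE = {
    "TP53": ["p53", "tumor protein p53"],
    "BRCA1": ["BRCA-1", "breast cancer 1"],
}

def variants(term):
    """Return lowercase orthographic variants of a term.

    A simple stand-in for the P-FSG: separators between tokens are
    rewritten as a hyphen, a space, or nothing.
    """
    forms = {term}
    for sep in ("-", " ", ""):
        forms.add(re.sub(r"[-\s]+", sep, term))
    return {form.lower() for form in forms}

def build_dictionary(table):
    """Map every generated variant to its official symbol."""
    lookup = {}
    for symbol, aliases in table.items():
        for term in [symbol, *aliases]:
            for variant in variants(term):
                lookup[variant] = symbol
    return lookup

def normalize(text, lookup):
    """Replace each dictionary match in the text with its official symbol."""
    # Longest keys first, so multiword names win over the shorter names they contain.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(key) for key in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)

LOOKUP = build_dictionary(GENE_TABLE)
print(normalize("Mutations in p53 and BRCA 1 were found.", LOOKUP))
```

Because matching is purely lexical, any dictionary entry that coincides with an ordinary English word or abbreviation would also be rewritten, which is the kind of ambiguity that limits precision in a look‐up approach.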

Findings

The findings show that the approach yields high recall. Precision is only moderately high, mainly because gene‐naming terms are ambiguous with ordinary English words and abbreviations.
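For reference, the two measures follow the standard information‐retrieval definitions. In the sketch below, reading $TP$ as gene mentions mapped to the correct symbol, $FP$ as strings wrongly mapped to a symbol, and $FN$ as gene mentions left unmapped is an assumption about how the counts apply to this task, not a description of the paper's evaluation protocol.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```

Under this reading, an English word that coincides with a gene symbol adds to $FP$ and lowers precision without affecting recall.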

Research limitations/implications

The major limitation of this study is that biomedical abstracts were analyzed instead of full‐text documents. Limiting the scope of application to biomedical abstracts in a well‐defined domain considerably reduces the number of under‐normalization and over‐normalization errors.

Practical implications

The system can be used for practical tasks in biomedical literature mining. Normalized gene terms can serve as input to literature‐based gene clustering algorithms that identify hidden gene‐to‐disease, gene‐to‐gene and gene‐to‐literature relationships.

Originality/value

Few systems for gene term variation handling have been developed to date. The technique described performs gene name normalization by dictionary look‐up.

Citation

Galvez, C. and de Moya‐Anegón, F. (2012), "A dictionary‐based approach to normalizing gene names in one domain of knowledge from the biomedical literature", Journal of Documentation, Vol. 68 No. 1, pp. 5-30. https://doi.org/10.1108/00220411211200301

Publisher

Emerald Group Publishing Limited
