Scientific abstracts reproduce only part of the information and the complexity of argumentation in a scientific article. The purpose of this paper provides a first analysis of the similarity between the text of scientific abstracts and the body of articles, using sentences as the basic textual unit. It contributes to the understanding of the structure of abstracts.
Using sentence-based similarity metrics, the authors quantify the phenomenon of text re-use in abstracts and examine the positions of the sentences that are similar to sentences in abstracts in the introduction, methods, results and discussion structure, using a corpus of over 85,000 research articles published in the seven Public Library of Science journals.
The authors provide evidence that 84 percent of abstract have at least one sentence in common with the body of the paper. Studying the distributions of sentences in the body of the articles that are re-used in abstracts, the authors show that there exists a strong relation between the rhetorical structure of articles and the zones that authors re-use when writing abstracts, with sentences mainly coming from the beginning of the introduction and the end of the conclusion.
Scientific abstracts contain what is considered by the author(s) as information that best describe documents’ content. This is a first study that examines the relation between the contents of abstracts and the rhetorical structure of scientific articles. The work might provide new insight for improving automatic abstracting tools as well as information retrieval approaches, in which text organization and structure are important features.
The authors thank Benoit Macaluso of the Observatoire des sciences et des technologies (OST-UQAM) for harvesting and providing the PLOS dataset.
Emerald Group Publishing Limited
Copyright © 2016, Emerald Group Publishing Limited