Search results

1 – 4 of 4
Article
Publication date: 7 June 2021

Marco Humbel, Julianne Nyhan, Andreas Vlachidis, Kim Sloan and Alexandra Ortolja-Baird

Abstract

Purpose

By mapping out the capabilities, challenges and limitations of named-entity recognition (NER), this article aims to synthesise the state of the art of NER in the context of the early modern research field and to inform discussions about the kind of resources, methods and directions that may be pursued to enrich the application of the technique going forward.

Design/methodology/approach

Through an extensive literature review, this article maps out the current capabilities, challenges and limitations of NER and establishes the state of the art of the technique in the context of the early modern, digitally augmented research field. It also presents a new case study of NER research undertaken by Enlightenment Architectures: Sir Hans Sloane's Catalogues of his Collections (2016–2021), a Leverhulme funded research project and collaboration between the British Museum and University College London, with contributing expertise from the British Library and the Natural History Museum.

Findings

Currently, it is not possible to benchmark the capabilities of NER as applied to documents of the early modern period. The authors also draw attention to the situated nature of authority files, and current conceptualisations of NER, leading them to the conclusion that more robust reporting and critical analysis of NER approaches and findings are required.

Research limitations/implications

This article examines NER as applied to early modern textual sources, which are mostly studied by Humanists. As addressed in this article, detailed reporting of NER processes and outcomes is not necessarily valued by the disciplines of the Humanities, with the result that it can be difficult to locate relevant data and metrics in project outputs. The authors have tried to mitigate this by contacting projects discussed in this paper directly, to further verify the details they report here.

Practical implications

The authors suggest that a forum is needed where tools are evaluated according to community standards. Within the wider NER community, the MUC and CoNLL corpora are used for such experimental set-ups and are accompanied by conference series, which may serve as a useful model. The ultimate nature of such a forum must be discussed with the whole research community of the early modern domain.

Social implications

NER is an algorithmic intervention that transforms data according to certain rules, patterns or training data, and it ultimately affects how the results are interpreted. The creation, use and promotion of algorithmic technologies like NER is not a neutral process, and neither is their output. This paper calls for a more critical understanding of the role and impact of NER on early modern documents and research, and for attention to some of the data- and human-centric aspects of NER routines that are currently overlooked.

Originality/value

This article presents a state of the art snapshot of NER, its applications and potential, in the context of early modern research. It also seeks to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of NER going forward. It draws attention to the situated nature of authority files, and current conceptualisations of NER, and concludes that more robust reporting of NER approaches and findings is urgently required. The Appendix sets out a comprehensive summary of digital tools and resources surveyed in this article.

Details

Journal of Documentation, vol. 77 no. 6
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 7 April 2015

Andreas Vlachidis and Douglas Tudhope

Abstract

Purpose

The purpose of this paper is to present the role and contribution of natural language processing techniques, in particular negation detection and word sense disambiguation in the process of Semantic Annotation of Archaeological Grey Literature. Archaeological reports contain a great deal of information that conveys facts and findings in different ways. This kind of information is highly relevant to the research and analysis of archaeological evidence but at the same time can be a hindrance for the accurate indexing of documents with respect to positive assertions.

Design/methodology/approach

The paper presents a method for adapting the biomedicine-oriented negation algorithm NegEx to the context of archaeology and discusses the evaluation results of the new modified negation detection module. A particular form of polysemy, which is inflicted by the definition of ontology classes and concerning the semantics of small finds in archaeology, is addressed by a domain-specific word-sense disambiguation module.
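NegEx-style algorithms work by scanning a window of tokens around a target term for negation cues. The project's adapted cue lists are not given in the abstract, so the Python sketch below uses invented archaeological examples purely to illustrate the general pattern:

```python
import re

# Illustrative pre-negation cues; the project's adapted, archaeology-specific
# term lists are not reproduced in the abstract, so these are assumptions.
PRE_NEGATION_CUES = ["no", "not", "absence of", "no evidence of", "devoid of"]

def is_negated(sentence: str, term: str, window: int = 5) -> bool:
    """Return True if `term` occurs within `window` tokens after a negation cue."""
    tokens = sentence.lower().split()
    term_tokens = term.lower().split()
    for i in range(len(tokens) - len(term_tokens) + 1):
        if tokens[i:i + len(term_tokens)] == term_tokens:
            # Scan the preceding token window for any cue phrase.
            preceding = " ".join(tokens[max(0, i - window):i])
            for cue in PRE_NEGATION_CUES:
                if re.search(r"\b" + re.escape(cue) + r"\b", preceding):
                    return True
    return False

print(is_negated("No evidence of Roman pottery was recovered", "pottery"))  # True
print(is_negated("The trench contained Roman pottery", "pottery"))          # False
```

The real NegEx also distinguishes pre- and post-negation cues and pseudo-negations; this sketch shows only the windowed cue-matching idea.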

Findings

The performance of the negation detection module is compared against a “Gold Standard” that consists of 300 manually annotated pages of archaeological excavation and evaluation reports. The evaluation results are encouraging, delivering overall 89 per cent precision, 80 per cent recall and 83 per cent F-measure scores. The paper addresses limitations and future improvements of the current work and highlights the need for ontological modelling to accommodate negative assertions.
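The reported scores follow the standard precision/recall/F-measure scheme, which can be computed from gold and predicted annotation sets as follows (toy data, not the paper's corpus):

```python
def evaluate(gold: set, predicted: set) -> dict:
    """Precision, recall and F1 over sets of (doc_id, span, label) annotations."""
    tp = len(gold & predicted)  # exact matches count as true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: 4 gold annotations, 5 predicted, 4 of them correct.
gold = {("d1", (0, 5), "FIND"), ("d1", (10, 15), "FIND"),
        ("d2", (3, 9), "FIND"), ("d2", (20, 26), "FIND")}
pred = gold | {("d2", (30, 34), "FIND")}
print(evaluate(gold, pred))  # precision 0.8, recall 1.0, f1 ≈ 0.889
```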

Originality/value

The discussed NLP modules contribute to the aims of the OPTIMA pipeline delivering an innovative application of such methods in the context of archaeological reports for the semantic annotation of archaeological grey literature with respect to the CIDOC-CRM ontology.

Article
Publication date: 8 July 2010

Andreas Vlachidis, Ceri Binding, Douglas Tudhope and Keith May

Abstract

Purpose

This paper sets out to discuss the use of information extraction (IE), a natural language‐processing (NLP) technique to assist “rich” semantic indexing of diverse archaeological text resources. The focus of the research is to direct a semantic‐aware “rich” indexing of diverse natural language resources with properties capable of satisfying information retrieval from online publications and datasets associated with the Semantic Technologies for Archaeological Resources (STAR) project.

Design/methodology/approach

The paper proposes use of the English Heritage extension (CRM‐EH) of the standard core ontology in cultural heritage, CIDOC CRM, and exploitation of domain thesauri resources for driving and enhancing an Ontology‐Oriented Information Extraction process. The process of semantic indexing is based on a rule‐based Information Extraction technique, which is facilitated by the General Architecture of Text Engineering (GATE) toolkit and expressed by Java Annotation Pattern Engine (JAPE) rules.
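The pipeline itself expresses its rules in JAPE and runs inside GATE; as a rough illustration only, the gazetteer-plus-pattern idea behind such rules can be sketched in Python. The term lists and the combined-annotation label below are invented for the example; the project draws them from domain thesauri and the CRM-EH ontology:

```python
# Assumed miniature gazetteers; in the actual work these come from
# English Heritage domain thesauri, not hard-coded lists.
PHYSICAL_OBJECTS = {"coin", "pottery", "brooch"}
TIME_APPELLATIONS = {"roman", "medieval", "saxon"}

def annotate(text: str) -> list:
    """Tag gazetteer matches with CIDOC CRM classes, then apply a
    JAPE-like pattern: TimeAppellation immediately followed by
    PhysicalObject yields a combined annotation (label is illustrative)."""
    tokens = text.lower().split()
    annotations = []
    for i, tok in enumerate(tokens):
        if tok in TIME_APPELLATIONS:
            annotations.append((i, "E49.Time_Appellation", tok))
        if tok in PHYSICAL_OBJECTS:
            annotations.append((i, "E19.Physical_Object", tok))
    # Pattern rule over the token-level annotations.
    for i, tok in enumerate(tokens[:-1]):
        if tok in TIME_APPELLATIONS and tokens[i + 1] in PHYSICAL_OBJECTS:
            annotations.append((i, "ContextFind", f"{tok} {tokens[i + 1]}"))
    return annotations

print(annotate("A Roman coin was recovered from the ditch"))
```

In GATE, the gazetteer lookups and the adjacency pattern would each be a JAPE rule over annotation types rather than raw tokens.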

Findings

Initial results suggest that the combination of information extraction with knowledge resources and standard conceptual models is capable of supporting semantic‐aware term indexing. Additional efforts are required for further exploitation of the technique and adoption of formal evaluation methods for assessing the performance of the method in measurable terms.

Originality/value

The value of the paper lies in the semantic indexing of 535 unpublished online documents often referred to as “Grey Literature”, from the Archaeological Data Service OASIS corpus (Online AccesS to the Index of archaeological investigationS), with respect to the CRM ontological concepts E49.Time Appellation and E19.Physical Object.

Details

Aslib Proceedings, vol. 62 no. 4/5
Type: Research Article
ISSN: 0001-253X

Article
Publication date: 26 March 2021

Ceri Binding, Claudio Gnoli and Douglas Tudhope

Abstract

Purpose

The Integrative Levels Classification (ILC) is a comprehensive “freely faceted” knowledge organization system not previously expressed as SKOS (Simple Knowledge Organization System). This paper reports and reflects on work converting the ILC to SKOS representation.

Design/methodology/approach

The design of the ILC representation and the various steps in the conversion to SKOS are described and located within the context of previous work considering the representation of complex classification schemes in SKOS. Various issues and trade-offs emerging from the conversion are discussed. The conversion implementation employed the STELETO transformation tool.

Findings

The ILC conversion captures some of the ILC facet structure by a limited extension beyond the SKOS standard. SPARQL examples illustrate how this extension could be used to create faceted, compound descriptors when indexing or cataloguing. Basic query patterns are provided that might underpin search systems. Possible routes for reducing complexity are discussed.
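One way to picture the extension is to model each facet indicator as a sub-property of skos:related and then query through that sub-property hierarchy, as the SPARQL examples in the paper do. The sketch below substitutes plain Python tuples for an RDF store, with an invented namespace and property IRI, purely to show the query pattern:

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
ILC = "http://example.org/ilc/"  # assumed namespace, for illustration only

# Facet relationships modelled as sub-properties of skos:related,
# following the extension described above (the property IRI is invented).
triples = [
    (ILC + "hasDisciplinaryFacet", "rdfs:subPropertyOf", SKOS + "related"),
    (ILC + "mammals", ILC + "hasDisciplinaryFacet", ILC + "zoology"),
    (ILC + "mammals", SKOS + "broader", ILC + "vertebrates"),
]

def related_via_subproperties(subject: str) -> list:
    """Emulate a SPARQL query that follows any sub-property of skos:related."""
    subprops = {s for (s, p, o) in triples
                if p == "rdfs:subPropertyOf" and o == SKOS + "related"}
    return [(p, o) for (s, p, o) in triples if s == subject and p in subprops]

print(related_via_subproperties(ILC + "mammals"))
```

With a real triple store, the same query is a SPARQL property path over `rdfs:subPropertyOf` against the skos:related sub-property hierarchy.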

Originality/value

Complex classification schemes, such as the ILC, have features which are not straightforward to represent in SKOS and which extend beyond the functionality of the SKOS standard. The ILC's facet indicators are modelled as rdf:Property sub-hierarchies that accompany the SKOS RDF statements. The ILC's top-level fundamental facet relationships are modelled by extensions of the associative relationship – specialised sub-properties of skos:related. An approach for representing faceted compound descriptions in ILC and other faceted classification schemes is proposed.

Details

Journal of Documentation, vol. 77 no. 4
Type: Research Article
ISSN: 0022-0418
