Search results

1 – 10 of over 9000
Article
Publication date: 10 June 2014

Ping Bao and Suoling Zhu

Abstract

Purpose

The purpose of this paper is to present a system for recognition of location names in ancient books written in languages, such as Chinese, in which proper names are not signaled by an initial capital letter.

Design/methodology/approach

Rule-based and statistical methods were combined to develop a set of rules for identification of product-related location names in the local chronicles of Guangdong. A name recognition system, with functions of document management, information extraction and storage, rule management, location name recognition, and inquiry and statistics, was developed using Microsoft's .NET framework, SQL Server 2005, ADO.NET and XML. The system was evaluated with precision ratio, recall ratio and the comprehensive index, F.
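The evaluation measures named above combine in the standard way; a minimal sketch, with illustrative counts rather than figures from the paper:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and the balanced F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Illustrative counts only; the paper reports an overall F of 71.8 per cent.
p, r, f = precision_recall_f(700, 250, 300)
```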

Findings

The system was quite successful at recognizing product-related location names (F was 71.8 percent), demonstrating the potential for application of automatic named entity recognition techniques in digital collation of ancient books such as local chronicles.

Research limitations/implications

Results suffered from limitations in initial digitization of the text. Statistical methods, such as the hidden Markov model, should be combined with an extended set of recognition rules to improve recognition scores and system efficiency.

Practical implications

Electronic access to local chronicles by location name saves time for chorographers and provides researchers with new opportunities.

Social implications

Named entity recognition brings previously isolated ancient documents together in a knowledge base of scholarly and cultural value.

Originality/value

Automatic name recognition can be implemented in information extraction from ancient books in languages other than English. The system described here can also be adapted to modern texts and other named entities.

Article
Publication date: 23 August 2013

Ivo Lašek and Peter Vojtáš

Abstract

Purpose

The purpose of this paper is to focus on the problem of named entity disambiguation. The paper disambiguates named entities at a very detailed level: each entity is assigned the concrete identifier of the corresponding Wikipedia article describing it.

Design/methodology/approach

For such a fine‐grained disambiguation a correct representation of the context is crucial. The authors compare various context representations: bag of words representation, linguistic representation and structured co‐occurrence representation. Models for each representation are described and evaluated. They also investigate the possibilities of multilingual named entity disambiguation.
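The simplest of these models, the bag-of-words representation, can be pictured as scoring candidate Wikipedia articles by cosine similarity over word counts; the candidate texts below are invented illustrations, not the authors' implementation:

```python
from collections import Counter
import math

def bag_of_words(text):
    """Unordered word counts as a context representation."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(mention_context, candidates):
    """Pick the candidate article whose text best matches the mention context."""
    ctx = bag_of_words(mention_context)
    return max(candidates, key=lambda name: cosine(ctx, bag_of_words(candidates[name])))

# Toy example: "Java" the island vs the programming language.
candidates = {
    "Java_(island)": "Java is an island of Indonesia with volcanoes and rice fields",
    "Java_(language)": "Java is a programming language for software development",
}
best = disambiguate("wrote the software in Java programming code", candidates)
# best == "Java_(language)"
```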

Findings

Based on this evaluation, the structured co‐occurrence representation provides the best disambiguation results. The method also proved successful on languages other than English.

Research limitations/implications

Despite its good results, the structured co‐occurrence context representation has several limitations. It trades precision for recall, which might not be desirable in some use cases. It also cannot disambiguate two different types of entities mentioned under the same name in the same text. These limitations can be overcome by combination with the other described methods.

Practical implications

The authors provide a ready‐made web service that can be plugged directly into existing applications via a REST interface.

Originality/value

The paper proposes a new approach to named entity disambiguation exploiting various context representation models (bag of words, linguistic and structural representation). The authors constructed a comprehensive dataset based on all English Wikipedia articles for named entity disambiguation. They evaluated and compared the individual context representation models on this dataset. They evaluate the support of multiple languages.

Details

International Journal of Web Information Systems, vol. 9 no. 3
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 1 June 2015

Quang-Minh Nguyen and Tuan-Dung Cao

Abstract

Purpose

The purpose of this paper is to propose an automatic method to generate semantic annotations of football transfers in the news. Current automatic news integration systems on the Web constantly face the challenge of diverse, heterogeneous sources. Syntax-based approaches to information representation and storage have certain limitations for searching, sorting, organizing and linking news appropriately. Semantic representation models are promising candidates for solving these problems.

Design/methodology/approach

The authors' approach leverages Semantic Web technologies to improve the detection of hidden annotations in the news. The paper proposes an automatic method to generate semantic annotations based on named entity recognition and rule-based information extraction. The authors built a domain ontology and a knowledge base, integrated with the knowledge and information management (KIM) platform, to implement the former task (named entity recognition). The semantic extraction rules are constructed from defined language models and the developed ontology.
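The rule-based extraction step can be pictured with a single toy pattern; the rule and the example sentence below are invented illustrations, not the authors' actual language models:

```python
import re

# One toy pattern: "<Player> moves/transfers to <Club>" yields a semantic triple.
RULE = re.compile(
    r"(?P<player>[A-Z]\w+(?: [A-Z]\w+)*) (?:moves|transfers) to "
    r"(?P<club>[A-Z]\w+(?: [A-Z]\w+)*)"
)

def extract_transfer(sentence):
    """Return a (player, relation, club) triple, or None if no rule fires."""
    m = RULE.search(sentence)
    return (m.group("player"), "transfersTo", m.group("club")) if m else None

triple = extract_transfer("Luis Suarez moves to Barcelona after the deadline")
# triple == ("Luis Suarez", "transfersTo", "Barcelona")
```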

Findings

The proposed method is implemented as part of BKAnnotation, a prototype for generating semantic annotations of sport news, itself a component of the semantics-based news integration system BKSport. The generated semantic annotations are used to improve news searching, sorting and association. Experiments on news data from the SkySport channel (2014) showed positive results: precision is over 80 per cent both with and without integration of the pronoun recognition method, and integrating that method additionally increases recall by around 10 per cent.

Originality/value

This is one of the initial proposals in automatic creation of semantic data about news, football news in particular and sport news in general. The combination of ontology, knowledge base and patterns of language model allows detection of not only entities with corresponding types but also semantic triples. At the same time, the authors propose a pronoun recognition method using extraction rules to improve the relation recognition process.

Details

International Journal of Pervasive Computing and Communications, vol. 11 no. 2
Type: Research Article
ISSN: 1742-7371

Keywords

Article
Publication date: 14 May 2019

Ahsan Mahmood, Hikmat Ullah Khan, Zahoor Ur Rehman, Khalid Iqbal and Ch. Muhmmad Shahzad Faisal

Abstract

Purpose

The purpose of this research study is to extract and identify named entities from Hadith literature. Named entity recognition (NER) refers to the identification of named entities in computer-readable text, annotated with categorization tags for information extraction. NER is an active research area in information management and information retrieval systems. NER serves as a baseline for machines to understand the context of a given content and helps in knowledge extraction. Although NER is considered a solved task in major languages such as English, in languages such as Urdu it remains challenging. Moreover, NER depends on the language and domain of study; thus, it is gaining the attention of researchers in different domains.

Design/methodology/approach

This paper proposes a knowledge extraction framework using finite-state transducers (FSTs) – KEFST – to extract named entities. KEFST consists of five steps: content extraction, tokenization, part-of-speech tagging, multi-word detection and NER. An extensive empirical analysis using a corpus of the Urdu translation of Sahih Al-Bukhari, a widely known hadith book, reveals that the proposed method recognizes entities effectively.
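A finite-state transducer for NER can be sketched as a transition table over token classes; the states, honorific list and English transliterations below are invented for illustration and far simpler than KEFST's transducers:

```python
# (state, token_class) -> (next_state, emitted_tag); unmatched pairs reset to START.
TRANSITIONS = {
    ("START", "HONORIFIC"): ("IN_NAME", "B-PER"),
    ("IN_NAME", "NAME"): ("IN_NAME", "I-PER"),
}
HONORIFICS = {"imam", "hazrat"}  # illustrative only

def classify(token):
    if token.lower() in HONORIFICS:
        return "HONORIFIC"
    return "NAME" if token[:1].isupper() else "OTHER"

def transduce(tokens):
    """Run tokens through the transducer, emitting one tag per token."""
    state, tags = "START", []
    for tok in tokens:
        state, tag = TRANSITIONS.get((state, classify(tok)), ("START", "O"))
        tags.append(tag)
    return tags

tags = transduce(["narrated", "by", "Imam", "Bukhari", "that"])
# tags == ["O", "O", "B-PER", "I-PER", "O"]
```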

Findings

Strong performance in terms of F-measure, precision and recall validates that the proposed model outperforms existing NER methods in the relevant literature.

Originality/value

This research is novel in that no previous work has used FSTs to extract named entities from Urdu text, and none has addressed NER for Urdu hadith data.

Details

The Electronic Library, vol. 37 no. 2
Type: Research Article
ISSN: 0264-0473

Keywords

Article
Publication date: 26 July 2021

Pengcheng Li, Qikai Liu, Qikai Cheng and Wei Lu

Abstract

Purpose

This paper aims to identify data set entities in scientific literature. To address the poor recognition caused by a lack of training corpora in existing studies, a distantly supervised learning approach is proposed to identify data set entities automatically in large-scale scientific literature in an open domain.

Design/methodology/approach

First, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus for supervised learning. Second, a bidirectional encoder representations from transformers (BERT)-based neural model is applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, are introduced to enhance the model's generalisability and improve the recognition of data set entities.
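The two augmentation techniques can be sketched on token sequences; the sentence and entity names below are illustrative, and the paper applies these ideas during BERT-based model training rather than as standalone utilities:

```python
import random

def entity_replacement(tokens, span, alternatives, rng=random):
    """Swap the labelled data-set mention for another known data-set name."""
    start, end = span
    return tokens[:start] + rng.choice(alternatives).split() + tokens[end:]

def entity_masking(tokens, span, mask="[MASK]"):
    """Mask the mention so the model must rely on context, not memorised names."""
    start, end = span
    return tokens[:start] + [mask] * (end - start) + tokens[end:]

sent = "We evaluate our model on the CoNLL-2003 corpus".split()
masked = entity_masking(sent, (6, 7))
# masked[6] == "[MASK]"
```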

Findings

In the absence of training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition.

Originality/value

This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors' knowledge, this is the first attempt to apply distant supervision to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address a problem inherent in distantly supervised learning methods, which existing research has mostly ignored. The experimental results demonstrate that the approach effectively improves the recognition of data set entities, especially long-tailed ones.

Article
Publication date: 7 June 2021

Marco Humbel, Julianne Nyhan, Andreas Vlachidis, Kim Sloan and Alexandra Ortolja-Baird

Abstract

Purpose

By mapping out the capabilities, challenges and limitations of named-entity recognition (NER), this article aims to synthesise the state of the art of NER in the context of the early modern research field and to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of the technique going forward.

Design/methodology/approach

Through an extensive literature review, this article maps out the current capabilities, challenges and limitations of NER and establishes the state of the art of the technique in the context of the early modern, digitally augmented research field. It also presents a new case study of NER research undertaken by Enlightenment Architectures: Sir Hans Sloane's Catalogues of his Collections (2016–2021), a Leverhulme funded research project and collaboration between the British Museum and University College London, with contributing expertise from the British Library and the Natural History Museum.

Findings

Currently, it is not possible to benchmark the capabilities of NER as applied to documents of the early modern period. The authors also draw attention to the situated nature of authority files, and current conceptualisations of NER, leading them to the conclusion that more robust reporting and critical analysis of NER approaches and findings is required.

Research limitations/implications

This article examines NER as applied to early modern textual sources, which are mostly studied by Humanists. As addressed in this article, detailed reporting of NER processes and outcomes is not necessarily valued by the disciplines of the Humanities, with the result that it can be difficult to locate relevant data and metrics in project outputs. The authors have tried to mitigate this by contacting projects discussed in this paper directly, to further verify the details they report here.

Practical implications

The authors suggest that a forum is needed where tools are evaluated according to community standards. Within the wider NER community, the MUC and CoNLL corpora are used for such experimental set-ups and are accompanied by a conference series; these may be seen as a useful model. The ultimate nature of such a forum must be discussed with the whole research community of the early modern domain.

Social implications

NER is an algorithmic intervention that transforms data according to certain rules, patterns or training data, and ultimately affects how the authors interpret the results. The creation, use and promotion of algorithmic technologies like NER is not a neutral process, and neither is their output. This paper calls for a more critical understanding of the role and impact of NER on early modern documents and research, and for attention to some of the currently overlooked data- and human-centric aspects of NER routines.

Originality/value

This article presents a state-of-the-art snapshot of NER, its applications and potential, in the context of early modern research. It also seeks to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of NER going forward. It draws attention to the situated nature of authority files and current conceptualisations of NER, and concludes that more robust reporting of NER approaches and findings is urgently required. The Appendix sets out a comprehensive summary of the digital tools and resources surveyed in this article.

Details

Journal of Documentation, vol. 77 no. 6
Type: Research Article
ISSN: 0022-0418

Keywords

Article
Publication date: 20 April 2015

Abubakar Roko, Shyamala Doraisamy, Azrul Hazri Jantan and Azreen Azman

Abstract

Purpose

The purpose of this paper is to propose and evaluate XKQSS, a query structuring method that shifts the task of generating structured queries from the user to the search engine while retaining the simple keyword search interface. A more effective way to search an XML database is to use structured queries. However, expressing queries in a query language proves difficult for most users, since it requires learning the language and knowing the underlying data schema. On the other hand, the success of Web search engines has made many users familiar with keyword search, so they prefer a keyword search interface for XML data.

Design/methodology/approach

Existing query structuring approaches require users to provide structural hints in their input keyword queries, even though their interface is keyword-based. Other problems with existing systems include their failure to take keyword query ambiguities into consideration during query structuring and to select the generated structured query that best represents a given keyword query. To address these problems, this study allows users to submit a schema-independent keyword query, uses named entity recognition (NER) to categorize query keywords and resolve query ambiguities, and computes semantic information for a node from its data content. Algorithms are proposed that find user search intentions and convert them into a set of ranked structured queries.
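The keyword-categorisation idea can be pictured as dictionary lookups feeding a naive field mapping; the categories, vocabularies and field names below are invented, and XKQSS's actual guidelines and ranking are far richer:

```python
# Toy NER-style lookup tables; entries are illustrative only.
CATEGORIES = {
    "PERSON": {"codd", "ullman"},
    "TOPIC": {"normalization", "indexing"},
}
FIELD = {"PERSON": "author", "TOPIC": "title", "VALUE": "any"}

def categorise(keyword):
    for label, vocab in CATEGORIES.items():
        if keyword.lower() in vocab:
            return label
    return "VALUE"

def structured_query(keywords):
    """One naive candidate: map each keyword to a field predicate."""
    return [(FIELD[categorise(k)], k) for k in keywords]

q = structured_query(["Codd", "normalization"])
# q == [("author", "Codd"), ("title", "normalization")]
```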

Findings

Experiments with Sigmod and IMDB datasets were conducted to evaluate the effectiveness of the method. The experimental results show that XKQSS is about 20 per cent more effective than XReal, a state-of-the-art system for XML retrieval, in terms of return node identification.

Originality/value

Existing systems do not take keyword query ambiguities into account. XKQSS includes two NER-based guidelines that help resolve these ambiguities before converting the submitted query. It also includes a ranking function that scores each generated query using both semantic information and data statistics, as opposed to the statistics-only approach of existing systems.

Details

International Journal of Web Information Systems, vol. 11 no. 1
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 1 October 2005

Hamish Cunningham, Kalina Bontcheva and Yaoyong Li

Downloads
2583

Abstract

Purpose

Seeks to explore the gap that exists between knowledge management (KM) systems and the natural language materials that form almost all corporate data stores.

Design/methodology/approach

A conceptual discussion and approach are taken using recent scientific results in the fields of the semantic web and ontology‐based information extraction.

Findings

Provides a high‐level introduction to information extraction (IE) and descriptions of application scenarios for KM tools that exploit IE, a form of natural language analysis to link semantic web models with documents. The paper presents some examples of ontology‐based IE systems, one of which, KIM, is under development in the SEKT Project. KIM offers IE‐based facilities for metadata creation, storage and conceptual search. The system can be used by diverse applications for annotating and querying documents.

Originality/value

Focuses on technologies and facilities that will become an important part of next‐generation KM applications.

Details

Journal of Knowledge Management, vol. 9 no. 5
Type: Research Article
ISSN: 1367-3270

Keywords

Article
Publication date: 14 May 2018

Anne Chardonnens, Ettore Rizza, Mathias Coeckelbergs and Seth van Hooland

Abstract

Purpose

Advanced usage of web analytics tools makes it possible to capture the content of user queries. Despite their relevant nature, the manual analysis of large volumes of user queries is problematic. The purpose of this paper is to address the problem of named entity recognition in digital library user queries.

Design/methodology/approach

The paper presents a large-scale case study conducted at the Royal Library of Belgium in its online historical newspapers platform BelgicaPress. The object of the study is a data set of 83,854 queries resulting from 29,812 visits over a 12-month period. By making use of information extraction methods, knowledge bases (KBs) and various authority files, this paper presents the possibilities and limits to identify what percentage of end users are looking for person and place names.
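The matching step can be sketched as lookups of query tokens and bigrams in authority files; the names below are invented examples, not entries from the Royal Library's actual files:

```python
PERSONS = {"leopold ii", "adolphe max"}  # invented authority-file entries
PLACES = {"brussels", "antwerp"}

def tag_query(query):
    """Return (entity, type) pairs found in a single user query."""
    tokens = query.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    hits = []
    for g in grams:
        if g in PERSONS:
            hits.append((g, "person"))
        elif g in PLACES:
            hits.append((g, "place"))
    return hits

hits = tag_query("Leopold II Brussels")
# hits == [("brussels", "place"), ("leopold ii", "person")]
```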

Findings

Based on a quantitative assessment, the method can successfully identify the majority of person and place names from user queries. Due to the specific character of user queries and the nature of the KBs used, a limited amount of queries remained too ambiguous to be treated in an automated manner.

Originality/value

This paper demonstrates in an empirical manner how user queries can be extracted from a web analytics tool and how named entities can then be mapped with KBs and authority files, in order to facilitate automated analysis of their content. Methods and tools used are generalisable and can be reused by other collection holders.

Details

Journal of Documentation, vol. 74 no. 5
Type: Research Article
ISSN: 0022-0418

Keywords

Article
Publication date: 19 April 2013

Silvio Moreira, David S. Batista, Paula Carvalho, Francisco M. Couto and Mario J. Silva

Abstract

Purpose

POWER is an ontology of political processes and entities. It is designed for tracking politicians, political organizations and elections in both mainstream and social media. The aim of this paper is to propose a data model to describe political agents and their relations over time.

Design/methodology/approach

The authors propose a data model to describe political agents (politicians, political institutions and political associations) and their relations over time. The model is formalized as an ontology using the RDF format, and the population is performed in two steps. First, a bootstrap process loads data collected from authoritative sources. Then, the ontology is enriched with alternative media names extracted from the web.
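The time-qualified relations can be pictured as validity intervals attached to plain triples; the identifiers and dates below are invented, and the actual POWER ontology is expressed in RDF:

```python
from datetime import date

# (subject, predicate, object, valid_from, valid_to); None = still valid.
FACTS = [
    ("ex:PoliticianX", "power:memberOf", "ex:PartyY", date(1990, 1, 1), None),
    ("ex:PoliticianX", "power:holdsOffice", "ex:Minister",
     date(2005, 1, 1), date(2011, 1, 1)),
]

def relations_at(facts, when):
    """Triples whose validity interval covers the given date."""
    return [f[:3] for f in facts
            if f[3] <= when and (f[4] is None or when <= f[4])]

held_2008 = relations_at(FACTS, date(2008, 1, 1))
# both relations held in 2008
```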

Findings

The ontology is published as a public resource following linked data guidelines and semantic web standards, and can be accessed via a SPARQL endpoint.

Originality/value

The authors have developed an ontology for the political domain tailored to aid in the tasks of named entity recognition and resolution. It represents the complexity and dynamic nature of relations between political agents (politicians, political associations and political institutions) over time.

Details

Program, vol. 47 no. 2
Type: Research Article
ISSN: 0033-0337

Keywords
