Search results

1 – 10 of over 10000
Article
Publication date: 10 June 2014

Ping Bao and Suoling Zhu

Abstract

Purpose

The purpose of this paper is to present a system for recognition of location names in ancient books written in languages, such as Chinese, in which proper names are not signaled by an initial capital letter.

Design/methodology/approach

Rule-based and statistical methods were combined to develop a set of rules for identification of product-related location names in the local chronicles of Guangdong. A name recognition system, with functions of document management, information extraction and storage, rule management, location name recognition, and inquiry and statistics, was developed using Microsoft's .NET framework, SQL Server 2005, ADO.NET and XML. The system was evaluated with precision ratio, recall ratio and the comprehensive index, F.
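
The evaluation measures named above can be illustrated with a minimal sketch. It assumes the "comprehensive index, F" is the usual balanced F-measure (the harmonic mean of precision and recall); the abstract does not state the weighting. The function name and the toy location names are hypothetical and are not taken from the Guangdong chronicles.

```python
def precision_recall_f(predicted: set, gold: set, beta: float = 1.0):
    """Return (precision, recall, F_beta) for predicted vs. gold entity sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical toy data: (document id, location name) pairs.
gold = {("d1", "广州"), ("d1", "南海"), ("d2", "新会"), ("d2", "番禺")}
predicted = {("d1", "广州"), ("d2", "新会"), ("d2", "东莞")}

p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

With beta left at 1.0, precision and recall are weighted equally; a different beta would favor one over the other.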

Findings

The system was quite successful at recognizing product-related location names (F was 71.8 percent), demonstrating the potential for application of automatic named entity recognition techniques in digital collation of ancient books such as local chronicles.

Research limitations/implications

Results suffered from limitations in initial digitization of the text. Statistical methods, such as the hidden Markov model, should be combined with an extended set of recognition rules to improve recognition scores and system efficiency.

Practical implications

Electronic access to local chronicles by location name saves time for chorographers and provides researchers with new opportunities.

Social implications

Named entity recognition brings previously isolated ancient documents together in a knowledge base of scholarly and cultural value.

Originality/value

Automatic name recognition can be implemented in information extraction from ancient books in languages other than English. The system described here can also be adapted to modern texts and other named entities.

Article
Publication date: 23 August 2013

Ivo Lašek and Peter Vojtáš

Abstract

Purpose

The purpose of this paper is to focus on the problem of named entity disambiguation. The paper disambiguates named entities on a very detailed level. To each entity is assigned a concrete identifier of a corresponding Wikipedia article describing the entity.

Design/methodology/approach

For such fine‐grained disambiguation, a correct representation of the context is crucial. The authors compare various context representations: bag of words representation, linguistic representation and structured co‐occurrence representation. Models for each representation are described and evaluated. The authors also investigate the possibilities of multilingual named entity disambiguation.
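
As an illustration of the simplest of these representations, the sketch below ranks candidate entities by the cosine similarity between the bag of words around a mention and the bag of words of each candidate's description. It is a generic bag-of-words baseline under stated assumptions, not the authors' model: the candidate identifiers and description texts are hypothetical stand-ins for Wikipedia articles, and the function names are invented for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context: str, candidates: dict) -> str:
    """Return the candidate id whose description best matches the context."""
    ctx = bag_of_words(context)
    return max(candidates, key=lambda cid: cosine(ctx, bag_of_words(candidates[cid])))


# Hypothetical candidate descriptions (not real article text).
candidates = {
    "Paris": "capital city of France on the river Seine",
    "Paris_Hilton": "American media personality and businesswoman",
}
best = disambiguate("She flew to Paris, the capital of France.", candidates)
print(best)
```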

Findings

Based on this evaluation, the structured co‐occurrence representation provides the best disambiguation results. The method also proved applicable to languages other than English.

Research limitations/implications

Despite its good results, the structured co‐occurrence context representation has several limitations. It trades precision for recall, which might not be desirable in some use cases. It also cannot disambiguate two entities of different types that are mentioned under the same name in the same text. These limitations can be overcome by combining it with the other methods described.

Practical implications

The authors provide a ready‐made web service that can be plugged directly into existing applications through a REST interface.

Originality/value

The paper proposes a new approach to named entity disambiguation exploiting various context representation models (bag of words, linguistic and structural representation). The authors constructed a comprehensive dataset based on all English Wikipedia articles for named entity disambiguation. They evaluated and compared the individual context representation models on this dataset and assessed the support of multiple languages.

Details

International Journal of Web Information Systems, vol. 9 no. 3
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 1 June 2015

Quang-Minh Nguyen and Tuan-Dung Cao

Abstract

Purpose

The purpose of this paper is to propose an automatic method to generate semantic annotations of football transfers in the news. Current automatic news integration systems on the Web constantly face the challenge of diverse, heterogeneous sources. Syntax-based approaches to information representation and storage have limitations in searching, sorting, organizing and linking news appropriately. Semantic representation models promise to be the key to solving these problems.

Design/methodology/approach

The authors' approach leverages Semantic Web technologies to improve the detection of hidden annotations in the news. The paper proposes an automatic method to generate semantic annotations based on named entity recognition and rule-based information extraction. The authors have built a domain ontology and knowledge base integrated with the knowledge and information management (KIM) platform to implement the former task (named entity recognition). The semantic extraction rules are constructed based on defined language models and the developed ontology.
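
The combination of entity lookup and extraction rules can be sketched roughly as follows. This is an illustrative toy under loud assumptions: the player and club lists are a hypothetical stand-in for the ontology and knowledge base, the `transferTo` predicate and the single regular-expression rule are invented for the example, and none of it reflects the actual KIM platform or the authors' rule set.

```python
import re

# Hypothetical mini knowledge base standing in for the ontology lookup.
PLAYERS = ["John Smith", "Carlos Silva"]
CLUBS = ["Northtown FC", "Riverside United"]


def alternation(names):
    """Build a regex alternation, longest names first so they win over prefixes."""
    return "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))


# One hypothetical extraction rule: <Player> <transfer verb> <Club>.
TRANSFER_RULE = re.compile(
    rf"(?P<player>{alternation(PLAYERS)})\s+"
    rf"(?:joined|signed for|moved to)\s+"
    rf"(?P<club>{alternation(CLUBS)})"
)


def extract_triples(sentence: str):
    """Return (subject, predicate, object) triples matched by the rule."""
    return [
        (m.group("player"), "transferTo", m.group("club"))
        for m in TRANSFER_RULE.finditer(sentence)
    ]


triples = extract_triples(
    "John Smith joined Northtown FC, while Carlos Silva signed for Riverside United."
)
print(triples)
```

Each match yields a subject, predicate and object, which is the shape of a semantic triple; a fuller system would draw the entity lists and predicates from the ontology rather than from hard-coded constants.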

Findings

The proposed method is implemented as part of BKAnnotation, a prototype that generates semantic annotations for sport news. This component is part of BKSport, a Semantic Web-based sport integration system. The generated semantic annotations are used to improve news searching, sorting and association features. Experiments on news data from the SkySport channel (2014) showed positive results. Precision exceeds 80 per cent both with and without the pronoun recognition method. In particular, integrating the pronoun recognition method helps increase the recall value to around 10 per cent.

Originality/value

This is one of the initial proposals in automatic creation of semantic data about news, football news in particular and sport news in general. The combination of ontology, knowledge base and patterns of language model allows detection of not only entities with corresponding types but also semantic triples. At the same time, the authors propose a pronoun recognition method using extraction rules to improve the relation recognition process.

Details

International Journal of Pervasive Computing and Communications, vol. 11 no. 2
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 14 May 2019

Ahsan Mahmood, Hikmat Ullah Khan, Zahoor Ur Rehman, Khalid Iqbal and Ch. Muhmmad Shahzad Faisal

Abstract

Purpose

The purpose of this research study is to extract and identify named entities from Hadith literature. Named entity recognition (NER) refers to the identification of named entities in computer-readable text and their annotation with categorization tags for information extraction. NER is an active research area in information management and information retrieval systems. NER serves as a baseline for machines to understand the context of a given content and helps in knowledge extraction. Although NER is considered a solved task in major languages such as English, it is still a challenging task in languages such as Urdu. Moreover, NER depends on the language and domain of study; thus, it is gaining the attention of researchers in different domains.

Design/methodology/approach

This paper proposes a knowledge extraction framework using finite-state transducers (FSTs) – KEFST – to extract the named entities. KEFST consists of five steps: content extraction, tokenization, part of speech tagging, multi-word detection and NER. An extensive empirical analysis using the data corpus of Urdu translation of Sahih Al-Bukhari, a widely known hadith book, reveals that the proposed method effectively recognizes the entities to obtain better results.
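
To make the later steps of such a pipeline concrete, the sketch below covers tokenization, multi-word detection and entity tagging with a greedy longest-match lexicon lookup. It is a simplified stand-in, not a finite-state transducer and not KEFST itself: part-of-speech tagging is omitted, and the lexicon entries, tags and romanized example sentence are hypothetical.

```python
# Hypothetical lexicon: tuples of lower-cased tokens mapped to an entity tag.
LEXICON = {
    ("abu", "hurairah"): "PERSON",
    ("ibn", "abbas"): "PERSON",
    ("madinah",): "LOCATION",
}
MAX_LEN = max(len(key) for key in LEXICON)


def tokenize(text: str):
    """Whitespace tokenization; a real system would need script-aware rules."""
    return text.lower().split()


def tag_entities(tokens):
    """Greedy longest-match lookup so multi-word names are kept together."""
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                entities.append((" ".join(key), LEXICON[key]))
                i += n  # skip past the matched multi-word unit
                break
        else:
            i += 1  # no entry starts at this token
    return entities


entities = tag_entities(tokenize("Abu Hurairah narrated this in Madinah"))
print(entities)
```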

Findings

The significant performance in terms of f-measure, precision and recall validates that the proposed model outperforms the existing methods for NER in the relevant literature.

Originality/value

This research is novel in that no previous work has proposed extracting named entities with FSTs for the Urdu language, and no previous work has addressed NER for Urdu hadith data.

Details

The Electronic Library, vol. 37 no. 2
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 26 July 2021

Pengcheng Li, Qikai Liu, Qikai Cheng and Wei Lu

Abstract

Purpose

This paper aims to identify data set entities in scientific literature. To address poor recognition caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain.

Design/methodology/approach

Firstly, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus for supervised learning. Secondly, a Bidirectional Encoder Representations from Transformers (BERT)-based neural model is applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, are introduced to enhance the model generalisability and improve the recognition of data set entities.
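
The two augmentation techniques can be sketched on a tokenized sentence as follows. The abstract does not give implementation details, so this is a guess at the general idea: replacement swaps the labelled entity for another dictionary entry, and masking substitutes one mask token per entity token. The function names, the tiny dictionary and the example sentence are all hypothetical.

```python
import random


def entity_replacement(tokens, span, dictionary, rng):
    """Swap the entity at span=(start, end) for a random dictionary entry."""
    start, end = span
    new_entity = rng.choice(dictionary).split()
    new_tokens = tokens[:start] + new_entity + tokens[end:]
    return new_tokens, (start, start + len(new_entity))


def entity_masking(tokens, span, mask_token="[MASK]"):
    """Hide the entity tokens so the model must rely on the context."""
    start, end = span
    return tokens[:start] + [mask_token] * (end - start) + tokens[end:], span


tokens = "We evaluate on the Penn Treebank corpus".split()
span = (4, 6)  # token indices covering "Penn Treebank", end exclusive
rng = random.Random(0)  # seeded for repeatability

replaced, new_span = entity_replacement(tokens, span, ["ImageNet", "CoNLL 2003"], rng)
masked, _ = entity_masking(tokens, span)
print(replaced, new_span)
print(masked)
```

Replacement changes the entity surface form while keeping the context, and masking does the opposite, which is one plausible reason the two are used together.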

Findings

In the absence of training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition.

Originality/value

This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors’ knowledge, this is the first attempt to apply distant supervised learning to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address the problem inherent in distant supervised learning methods, which the existing research has mostly ignored. The experimental results demonstrate that the approach effectively improves the recognition of data set entities, especially long-tailed data set entities.

Article
Publication date: 31 January 2023

Mrinalini Luthra, Konstantin Todorov, Charles Jeurgens and Giovanni Colavizza

Abstract

Purpose

This paper aims to expand the scope and mitigate the biases of extant archival indexes.

Design/methodology/approach

The authors use automatic entity recognition on the archives of the Dutch East India Company to extract mentions of underrepresented people.

Findings

The authors release an annotated corpus and baselines for a shared task and show that the proposed goal is feasible.

Originality/value

Colonial archives are increasingly a focus of attention for historians and the public; broadening access to them is a pressing need for archives.

Details

Journal of Documentation, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 7 June 2021

Marco Humbel, Julianne Nyhan, Andreas Vlachidis, Kim Sloan and Alexandra Ortolja-Baird

Abstract

Purpose

By mapping out the capabilities, challenges and limitations of named-entity recognition (NER), this article aims to synthesise the state of the art of NER in the context of the early modern research field and to inform discussions about the kind of resources, methods and directions that may be pursued to enrich the application of the technique going forward.

Design/methodology/approach

Through an extensive literature review, this article maps out the current capabilities, challenges and limitations of NER and establishes the state of the art of the technique in the context of the early modern, digitally augmented research field. It also presents a new case study of NER research undertaken by Enlightenment Architectures: Sir Hans Sloane's Catalogues of his Collections (2016–2021), a Leverhulme funded research project and collaboration between the British Museum and University College London, with contributing expertise from the British Library and the Natural History Museum.

Findings

Currently, it is not possible to benchmark the capabilities of NER as applied to documents of the early modern period. The authors also draw attention to the situated nature of authority files, and current conceptualisations of NER, leading them to the conclusion that more robust reporting and critical analysis of NER approaches and findings is required.

Research limitations/implications

This article examines NER as applied to early modern textual sources, which are mostly studied by Humanists. As addressed in this article, detailed reporting of NER processes and outcomes is not necessarily valued by the disciplines of the Humanities, with the result that it can be difficult to locate relevant data and metrics in project outputs. The authors have tried to mitigate this by contacting projects discussed in this paper directly, to further verify the details they report here.

Practical implications

The authors suggest that a forum is needed where tools are evaluated according to community standards. Within the wider NER community, the MUC and CoNLL corpora are used for such experimental set-ups and are accompanied by a conference series, and may be seen as a useful model for this. The ultimate nature of such a forum must be discussed with the whole research community of the early modern domain.

Social implications

NER is an algorithmic intervention that transforms data according to certain rules, patterns or training data and ultimately affects how the authors interpret the results. The creation, use and promotion of algorithmic technologies like NER is not a neutral process, and neither is their output. This paper calls for a more critical understanding of the role and impact of NER on early modern documents and research, and for focalization of some of the data- and human-centric aspects of NER routines that are currently overlooked.

Originality/value

This article presents a state-of-the-art snapshot of NER, its applications and potential, in the context of early modern research. It also seeks to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of NER going forward. It draws attention to the situated nature of authority files, and current conceptualisations of NER, and concludes that more robust reporting of NER approaches and findings is urgently required. The Appendix sets out a comprehensive summary of digital tools and resources surveyed in this article.

Details

Journal of Documentation, vol. 77 no. 6
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 4 July 2023

Maojian Chen, Xiong Luo, Hailun Shen, Ziyang Huang, Qiaojuan Peng and Yuqi Yuan

Abstract

Purpose

This study aims to introduce an innovative approach that uses a decoder with multiple layers to accurately identify Chinese nested entities across various nesting depths. To reduce the need for human intervention, an advanced optimization algorithm is used to fine-tune the decoder based on the depth of nested entities present in the data set. With this approach, this study achieves remarkable performance in recognizing Chinese nested entities.

Design/methodology/approach

This study provides a framework for Chinese nested named entity recognition (NER) based on sequence labeling methods. Similar to existing approaches, the framework uses an advanced pre-training model as the backbone to extract semantic features from the text. Then a decoder comprising multiple conditional random field (CRF) algorithms is used to learn the associations between granularity labels. To minimize the need for manual intervention, the Jaya algorithm is used to optimize the number of CRF layers. Experimental results validate the effectiveness of the proposed approach, demonstrating its superior performance on both Chinese nested NER and flat NER tasks.
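
One way to see why the number of decoder layers depends on nesting depth is to encode nested entities as one BIO label sequence per layer. The sketch below covers only this data encoding, under the assumption that outer entities go to the first layer and inner ones to deeper layers; it does not implement the CRF decoder, the pre-training backbone or the Jaya optimization, and the layering convention and function name are hypothetical.

```python
def layered_bio(num_tokens, spans):
    """Encode nested spans as BIO sequences, one list per nesting layer.

    spans: list of (start, end, label) with end exclusive.
    Outer (longer) entities are placed first, so they land in earlier layers.
    """
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    layers = []
    for start, end, label in ordered:
        for layer in layers:
            # Reuse the first layer that is still free over this span.
            if all(tag == "O" for tag in layer[start:end]):
                break
        else:
            layer = ["O"] * num_tokens
            layers.append(layer)
        layer[start] = f"B-{label}"
        for i in range(start + 1, end):
            layer[i] = f"I-{label}"
    return layers


# Character-level example: the organisation 北京大学 contains the location 北京.
chars = list("北京大学")
layers = layered_bio(len(chars), [(0, 2, "LOC"), (0, 4, "ORG")])
for depth, tags in enumerate(layers):
    print(depth, list(zip(chars, tags)))
```

In this encoding the number of layers equals the deepest nesting found in the data, which is the quantity the abstract says is tuned automatically rather than set by hand.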

Findings

The experimental findings illustrate that the proposed methodology can achieve a remarkable 4.32% advancement in nested NER performance on the People’s Daily corpus compared to existing models.

Originality/value

This study explores a Chinese NER methodology based on the sequence labeling paradigm for recognizing sophisticated Chinese nested entities with remarkable accuracy.

Details

International Journal of Web Information Systems, vol. 19 no. 1
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 20 April 2015

Abubakar Roko, Shyamala Doraisamy, Azrul Hazri Jantan and Azreen Azman

Abstract

Purpose

The purpose of this paper is to propose and evaluate XKQSS, a query structuring method that relegates the task of generating structured queries from a user to a search engine while retaining the simple keyword search query interface. A more effective way to search an XML database is to use structured queries. However, using query languages to express queries proves difficult for most users, since it requires learning a query language and knowing the underlying data schema. On the other hand, the success of Web search engines has made many users familiar with keyword search and, therefore, they prefer to use a keyword search query interface to search XML data.

Design/methodology/approach

Existing query structuring approaches require users to provide structural hints in their input keyword queries even though their interface is keyword-based. Existing systems also fail to take keyword query ambiguities into account during query structuring, and they lack a way to select the generated structured query that best represents a given keyword query. To address these problems, this study allows users to submit a schema-independent keyword query, uses named entity recognition (NER) to categorize query keywords and resolve query ambiguities, and computes semantic information for a node from its data content. Algorithms are proposed that find user search intentions and convert them into a set of ranked structured queries.
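
The general idea of turning a plain keyword query into a structured one can be sketched as below. This is an illustration only and not the XKQSS algorithm: a small gazetteer stands in for NER, the movie schema and element names are hypothetical, and a single XPath expression is produced where the abstract describes a ranked set of structured queries.

```python
# Hypothetical gazetteer mapping known values to XML element names.
GAZETTEER = {
    "tom hanks": "actor",
    "steven spielberg": "director",
    "1998": "year",
}


def categorize(keywords):
    """Assign each keyword an element name, or '*' when its category is unknown."""
    return [(GAZETTEER.get(k.lower(), "*"), k) for k in keywords]


def to_xpath(keywords, root="movie"):
    """Build one XPath query with a predicate per categorized keyword."""
    predicates = "".join(
        f'[.//{tag}[contains(., "{value}")]]' for tag, value in categorize(keywords)
    )
    return f"//{root}{predicates}"


query = to_xpath(["Tom Hanks", "1998"])
print(query)
```

Categorizing the keywords first narrows each one to a specific element, which is how ambiguity is reduced; an uncategorized keyword falls back to the `*` wildcard and matches any descendant element.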

Findings

Experiments with the Sigmod and IMDB datasets were conducted to evaluate the effectiveness of the method. The experimental results show that XKQSS is about 20 per cent more effective in return node identification than XReal, a state-of-the-art system for XML retrieval.

Originality/value

Existing systems do not take keyword query ambiguities into account. XKQSS consists of two guidelines based on NER that help to resolve these ambiguities before converting the submitted query. It also includes a ranking function that computes a score for each generated query using both semantic information and data statistics, as opposed to the data-statistics-only approach used by existing systems.

Details

International Journal of Web Information Systems, vol. 11 no. 1
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 14 November 2023

Shaodan Sun, Jun Deng and Xugong Qin

Abstract

Purpose

This paper aims to improve the retrieval and use of historical newspapers by applying semantic organization at the level of fine-grained knowledge elements. It seeks to unlock the latent value of newspaper contents and to offer methodological guidance for research in the humanities.

Design/methodology/approach

Following the semantic organization process and the knowledge element concept, this study proposes a holistic framework with four stages: knowledge element description, extraction, association and application. First, a semantic description model for knowledge elements is devised. Next, deep learning techniques are used for entity recognition and relationship extraction, identifying entities within the historical newspaper contents and capturing the relationships among them. Finally, an online platform based on Flask is developed to enable the recognition of entities and relationships within historical newspapers.

Findings

This article uses the Shengjing Times·Changchun Compilation as the dataset for describing, extracting, associating and applying newspaper contents. For knowledge element extraction, BERT + BS consistently outperforms Bi-LSTM, CRF++ and even BERT in terms of Recall and F1 scores, making it a favorable choice for entity recognition in this context. Particularly noteworthy is the Bi-LSTM-Pro model, which achieves the highest scores across all metrics, notably an exceptional F1 score in knowledge element relationship recognition.

Originality/value

Historical newspapers transcend their status as mere artifacts, evolving into invaluable reservoirs safeguarding the societal and historical memory. Through semantic organization from a fine-grained knowledge element perspective, it can facilitate semantic retrieval, semantic association, information visualization and knowledge discovery services for historical newspapers. In practice, it can empower researchers to unearth profound insights within the historical and cultural context, broadening the landscape of digital humanities research and practical applications.

Details

Aslib Journal of Information Management, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2050-3806
