Search results

21 – 30 of over 10,000
Article
Publication date: 3 April 2020

Abdelhalim Saadi and Hacene Belhadef

The purpose of this paper is to present a system based on deep neural networks to extract particular entities from natural language text, given the massive amount of textual…

Abstract

Purpose

The purpose of this paper is to present a system based on deep neural networks to extract particular entities from natural language text, given the massive amount of textual information now available electronically. The sheer volume of electronic text data makes finding or extracting relevant information from it difficult.

Design/methodology/approach

This study presents an original system to extract Arabic named entities by combining a deep neural network-based part-of-speech tagger with a neural network-based named entity extractor. First, the system extracts the grammatical class of each word with high precision, based on the word's context; this module serves as the disambiguation step. A second module then extracts the named entities.
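
The two-module design described above can be illustrated with a minimal sketch; the dimensions, tag-set sizes and the way POS predictions are fed to the entity module are all assumptions for demonstration, not the authors' configuration.

```python
# Minimal sketch of the two-module design: a BiLSTM POS tagger whose
# predictions feed a BiLSTM named entity extractor. All dimensions,
# vocabulary and tag-set sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStageTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128,
                 n_pos_tags=20, n_ner_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Module 1: context-sensitive POS tagging (the disambiguation step).
        self.pos_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                                batch_first=True)
        self.pos_out = nn.Linear(2 * hidden, n_pos_tags)
        # Module 2: NER over word embeddings enriched with POS predictions.
        self.ner_lstm = nn.LSTM(emb_dim + n_pos_tags, hidden,
                                bidirectional=True, batch_first=True)
        self.ner_out = nn.Linear(2 * hidden, n_ner_tags)

    def forward(self, token_ids):
        emb = self.embed(token_ids)                       # (B, T, emb_dim)
        pos_feats, _ = self.pos_lstm(emb)
        pos_logits = self.pos_out(pos_feats)              # (B, T, n_pos_tags)
        # Feed soft POS distributions into the entity extractor.
        ner_in = torch.cat([emb, pos_logits.softmax(-1)], dim=-1)
        ner_feats, _ = self.ner_lstm(ner_in)
        return pos_logits, self.ner_out(ner_feats)        # per-token logits

pos_logits, ner_logits = TwoStageTagger()(torch.randint(0, 10000, (2, 12)))
```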

Findings

Using deep neural networks in natural language processing requires tuning many hyperparameters, which is time-consuming. To deal with this problem, statistical methods such as the Taguchi method can be applied, as sketched below. In this study, the system was successfully applied to Arabic named entity recognition, reporting an accuracy of 96.81 per cent, which improves on state-of-the-art results.
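
The Taguchi idea is to evaluate only an orthogonal fraction of the full hyperparameter grid rather than every combination. A minimal sketch, assuming three hyperparameters at three levels each; the names, levels and scoring function are placeholders, not the authors' setup.

```python
# Taguchi-style tuning sketch: evaluate the 9 rows of an L9 orthogonal
# array instead of the full 3^3 = 27 grid. Hyperparameter names, levels
# and train_and_score() are illustrative placeholders.
levels = {
    "learning_rate": [1e-3, 1e-2, 1e-1],
    "hidden_units":  [64, 128, 256],
    "dropout":       [0.1, 0.3, 0.5],
}

# Three columns of the standard L9(3^4) array: every pair of columns
# contains each pair of levels exactly once.
L9 = [(0, 0, 0), (0, 1, 1), (0, 2, 2),
      (1, 0, 1), (1, 1, 2), (1, 2, 0),
      (2, 0, 2), (2, 1, 0), (2, 2, 1)]

def train_and_score(cfg):
    # Placeholder: train the tagger with cfg, return validation accuracy.
    return -abs(cfg["learning_rate"] - 1e-2)

names = list(levels)
trials = [{n: levels[n][i] for n, i in zip(names, row)} for row in L9]
best = max(trials, key=train_and_score)
print("Best of 9 (not 27) runs:", best)
```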

Research limitations/implications

The system is designed and trained for the Arabic language, but the architecture can be used for other languages.

Practical implications

Information extraction systems are developed for different applications, such as analysing newspaper articles and databases for commercial, political and social objectives. Information extraction systems can also be built on top of an information retrieval (IR) system, which eliminates irrelevant documents and paragraphs.

Originality/value

The proposed system can be regarded as the first attempt to use two deep neural networks in tandem to increase accuracy. It can also be built on top of an IR system that eliminates irrelevant documents and paragraphs, reducing the number of documents from which the relevant information must be extracted by the information extraction system.

Details

Smart and Sustainable Built Environment, vol. 9 no. 4
Type: Research Article
ISSN: 2046-6099


Article
Publication date: 20 May 2020

Tim Hutchinson

This study aims to provide an overview of recent efforts relating to natural language processing (NLP) and machine learning applied to archival processing, particularly appraisal…


Abstract

Purpose

This study aims to provide an overview of recent efforts relating to natural language processing (NLP) and machine learning applied to archival processing, particularly appraisal and sensitivity reviews, and propose functional requirements and workflow considerations for transitioning from experimental to operational use of these tools.

Design/methodology/approach

The paper has four main sections: 1) a short overview of the NLP and machine learning concepts referenced in the paper; 2) a review of the literature on NLP and machine learning applied to archival processes; 3) an overview of, and commentary on, key existing and developing tools that use NLP or machine learning techniques for archives; and 4) a discussion, informed by this review and analysis, of functional requirements and workflow considerations for NLP and machine learning tools for archival processing.

Findings

Applications for processing e-mail have received the most attention so far, although most initiatives have been experimental or project based. It now seems feasible to branch out to develop more generalized tools for born-digital, unstructured records. Effective NLP and machine learning tools for archival processing should be usable, interoperable, flexible, iterative and configurable.

Originality/value

Most implementations of NLP for archives have been experimental or project based. The main exception that has moved into production is ePADD, which includes robust NLP features through its named entity recognition module. This paper takes a broader view, assessing the prospects and possible directions for integrating NLP tools and techniques into archival workflows.

Article
Publication date: 12 September 2023

Wenjing Wu, Caifeng Wen, Qi Yuan, Qiulan Chen and Yunzhong Cao

Learning from safety accidents and sharing safety knowledge have become an important part of accident prevention and of improving construction safety management. Because unstructured…

Abstract

Purpose

Learning from safety accidents and sharing safety knowledge have become an important part of accident prevention and of improving construction safety management. Because unstructured data in the construction industry are difficult to reuse, the knowledge they contain is hard to apply directly to safety analysis. The purpose of this paper is to explore the construction of a construction safety knowledge representation model and a safety accident knowledge graph through deep learning methods, to extract construction safety knowledge entities with a BERT-BiLSTM-CRF model and to propose a data–knowledge–services data management model.

Design/methodology/approach

The ontology model for knowledge representation of construction safety accidents is constructed by integrating entity relations and logic evolution. Then, a database of safety incidents in the architecture, engineering and construction (AEC) industry is established from the collected construction safety incident reports and related dispute cases. The construction method for the construction safety accident knowledge graph is studied, and the precision of the BERT-BiLSTM-CRF algorithm in information extraction is verified through comparative experiments. Finally, a safety accident report is used as an example to construct the AEC domain construction safety accident knowledge graph (AEC-KG), which provides a visual knowledge query service and verifies the operability of the knowledge management approach.
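
A minimal sketch of a BERT-BiLSTM-CRF tagger of the kind named above, using the Hugging Face transformers library and the pytorch-crf package; the checkpoint name, hidden size and label count are illustrative assumptions, not the authors' configuration.

```python
# Sketch of a BERT-BiLSTM-CRF sequence tagger. Requires the
# `transformers` and `pytorch-crf` packages; checkpoint, hidden size
# and label count below are illustrative assumptions.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_labels=9, hidden=256,
                 checkpoint="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * hidden, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        # Contextual subword vectors -> BiLSTM -> per-token emission scores.
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(self.lstm(h)[0])
        mask = attention_mask.bool()
        if labels is not None:          # training: negative log-likelihood
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: tag paths
```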

Findings

The experimental results show that the combined BERT-BiLSTM-CRF algorithm has a precision of 84.52%, a recall of 92.35%, and an F1 value of 88.26% in named entity recognition from the AEC domain database. The construction safety knowledge representation model and safety incident knowledge graph realize knowledge visualization.
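
As a quick consistency check, these figures satisfy the standard harmonic-mean definition of F1:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 84.52 \times 92.35}{84.52 + 92.35}
    \approx 88.26\%
```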

Originality/value

The proposed framework provides a new knowledge management approach to improve practitioners' safety management and also enriches the application scenarios of knowledge graphs. On the one hand, it innovatively proposes a data application and knowledge management method for safety accident reports that integrates entity relationships and evolution logic. On the other hand, a legal adjudication dimension is innovatively added to the knowledge graph in the construction safety field as the basis for post-incident disposal measures, which provides a reference for safety managers' decision-making.

Details

Engineering, Construction and Architectural Management, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0969-9988


Article
Publication date: 1 March 1998

Robert Gaizauskas and Yorick Wilks

In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a pre‐specified…


Abstract

In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a pre‐specified set of entities, relations or events from natural language texts and to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960s and 70s till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.

Details

Journal of Documentation, vol. 54 no. 1
Type: Research Article
ISSN: 0022-0418


Article
Publication date: 2 May 2023

Giovanna Aracri, Antonietta Folino and Stefano Silvestri

The purpose of this paper is to propose a methodology for the enrichment and tailoring of a knowledge organization system (KOS), in order to support the information extraction…

Abstract

Purpose

The purpose of this paper is to propose a methodology for the enrichment and tailoring of a knowledge organization system (KOS), in order to support the information extraction (IE) task for the analysis of documents in the tourism domain. In particular, the KOS is used to develop a named entity recognition (NER) system.

Design/methodology/approach

A method to improve and customize an available thesaurus by leveraging documents related to tourism in Italy is first presented. The resulting thesaurus is then used to create an annotated NER corpus, exploiting distant supervision, deep learning and light human supervision.
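
The distant supervision step can be illustrated with a minimal sketch: thesaurus terms are projected onto tokens as BIO labels by longest-match lookup. The mini-thesaurus and entity types below are illustrative assumptions, not the authors' resource.

```python
# Sketch of thesaurus-driven distant supervision: project known terms
# onto tokens as BIO labels via longest-match lookup to bootstrap a NER
# training corpus. The mini-thesaurus and entity types are illustrative.
thesaurus = {
    ("amalfi", "coast"): "DESTINATION",
    ("boutique", "hotel"): "ACCOMMODATION",
    ("colosseum",): "ATTRACTION",
}
max_len = max(len(term) for term in thesaurus)

def bio_annotate(tokens):
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span first, case-insensitively.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(w.lower() for w in tokens[i:i + n])
            if span in thesaurus:
                etype = thesaurus[span]
                labels[i:i + n] = [f"B-{etype}"] + [f"I-{etype}"] * (n - 1)
                i += n
                break
        else:
            i += 1
    return labels

tokens = "We stayed in a boutique hotel near the Colosseum".split()
print(list(zip(tokens, bio_annotate(tokens))))
```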

Findings

The study shows that a customized KOS can effectively support IE tasks when applied to documents of the same domains and types used for its construction. It is also very useful for supporting and easing the annotation task: the proposed methodology allows a corpus to be annotated with a fraction of the effort required for manual annotation.

Originality/value

The paper explores an alternative use of a KOS, proposing an innovative NER corpus annotation methodology. Moreover, the KOS and the annotated NER data set will be made publicly available.

Details

Journal of Documentation, vol. 79 no. 6
Type: Research Article
ISSN: 0022-0418


Article
Publication date: 28 October 2020

Ivana Tanasijević and Gordana Pavlović-Lažetić

The purpose of this paper is to provide a methodology for automatic annotation of a multimedia collection of intangible cultural heritage mostly in the form of interviews…

Abstract

Purpose

The purpose of this paper is to provide a methodology for automatic annotation of a multimedia collection of intangible cultural heritage mostly in the form of interviews. Assigned annotations provide a way to search the collection.

Design/methodology/approach

Annotation is based on automatic extraction of metadata and is conducted by named entity and topic extraction from textual descriptions with a rule-based approach supported by vocabulary resources, a compiled domain-specific classification scheme and domain-oriented corpus analysis.
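
The rule-based extraction described above can be sketched as a few transducer-like patterns plus a small topic vocabulary; the rules, vocabulary and sample sentence are illustrative assumptions, not the project's actual resources.

```python
# Sketch of the rule-based idea: transducer-like patterns (regexes) for
# named entities plus a small vocabulary mapping words to topics. The
# rules, vocabulary and sample sentence are illustrative assumptions.
import re

entity_rules = [
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Ms)\.? [A-Z][a-z]+\b")),
    ("YEAR",   re.compile(r"\b(?:1[89]|20)\d{2}\b")),
]
topic_vocabulary = {"wedding": "customs", "dance": "folklore",
                    "song": "folklore"}

def annotate(text):
    entities = [(label, m.group()) for label, rx in entity_rules
                for m in rx.finditer(text)]
    topics = sorted({topic_vocabulary[w]
                     for w in re.findall(r"\w+", text.lower())
                     if w in topic_vocabulary})
    return {"entities": entities, "topics": topics}

print(annotate("Mrs. Jovanka described a wedding dance recorded in 1967."))
```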

Findings

The proposed methodology for automatic annotation of a collection of intangible cultural heritage, applied to the cultural heritage of the Balkans, achieves very good results as measured by F measure: 0.87 for named entity annotation and 0.90 for topic annotation. The overall methodology encapsulates domain-specific and language-specific knowledge in collections of finite-state transducers and allows further improvements.

Originality/value

Although cultural heritage plays a significant role in the development of the identity of a group or an individual, it is a specific domain that has not yet been fully explored for many languages. The proposed methodology can be used to incorporate natural language processing techniques into digital libraries of cultural heritage.

Details

The Electronic Library, vol. 38 no. 5/6
Type: Research Article
ISSN: 0264-0473


Article
Publication date: 26 August 2022

Satanu Ghosh and Kun Lu

The purpose of this paper is to present a preliminary work on extracting band gap information of materials from academic papers. With increasing demand for renewable energy, band…

Abstract

Purpose

The purpose of this paper is to present a preliminary work on extracting band gap information of materials from academic papers. With increasing demand for renewable energy, band gap information will help material scientists design and implement novel photovoltaic (PV) cells.

Design/methodology/approach

The authors collected 1.44 million titles and abstracts of scholarly articles related to materials science, and then filtered the collection to 11,939 articles that potentially contain relevant information about materials and their band gap values. ChemDataExtractor was extended to extract information about PV materials and their band gap information. Evaluation was performed on randomly sampled information records of 415 papers.
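
To illustrate the kind of pattern such a system must capture, here is a generic regex sketch; it is not ChemDataExtractor's actual grammar, and the sample sentence and the nearest-formula heuristic are assumptions made for demonstration.

```python
# Generic illustration of the band gap extraction task (not
# ChemDataExtractor's actual grammar); patterns, heuristic and sample
# sentence are assumptions made for demonstration.
import re

SENTENCE = ("The perovskite CH3NH3PbI3 exhibits a direct band gap "
            "of 1.55 eV, suitable for photovoltaic absorbers.")

FORMULA = r"(?:[A-Z][a-z]?\d*)+"                      # crude chemical formula
VALUE = r"band gap\s+of\s+(?P<gap>\d+(?:\.\d+)?)\s*eV"

def extract_band_gap(sentence):
    gap = re.search(VALUE, sentence, flags=re.IGNORECASE)
    if not gap:
        return None
    # Crude interdependency resolution: take the formula-like token
    # closest before the band gap mention.
    candidates = [m.group() for m in re.finditer(FORMULA, sentence)
                  if m.end() <= gap.start() and len(m.group()) > 2]
    material = candidates[-1] if candidates else None
    return material, float(gap.group("gap")), "eV"

print(extract_band_gap(SENTENCE))   # ('CH3NH3PbI3', 1.55, 'eV')
```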

Findings

The findings of this study show that the current system correctly extracts information for 51.32% of articles, with partially correct extraction for 36.62% and incorrect extraction for 12.04%. The authors have also identified errors belonging to three main categories: chemical entity identification, band gap information and interdependency resolution. Future work will focus on addressing these errors to improve the performance of the system.

Originality/value

The authors did not find any literature to date on band gap information extraction from academic text using automated methods. This work is unique and original. Band gap information is of importance to materials scientists in applications such as solar cells, light emitting diodes and laser diodes.

Details

Aslib Journal of Information Management, vol. 75 no. 3
Type: Research Article
ISSN: 2050-3806


Article
Publication date: 11 May 2022

Chih-Ming Chen, Tek-Soon Ling, Chung Chang, Chih-Fan Hsu and Chia-Pei Lim

The digital humanities research platform for biographies of Malaysia personalities (DHRP-BMP) was collaboratively developed in this study by the Research Center for Chinese Cultural Subjectivity in…

Abstract

Purpose

The digital humanities research platform for biographies of Malaysia personalities (DHRP-BMP) was collaboratively developed in this study by the Research Center for Chinese Cultural Subjectivity in Taiwan, the Federation of Heng Ann Association Malaysia and the Malaysian Chinese Research Center of Universiti Malaya. Using The Biographies of Malaysia Henghua Personalities as its main archival source, DHRP-BMP adopted Omeka S, a next-generation web publishing platform for institutions interested in connecting digital cultural heritage collections with other online resources, as its base system, and combined close reading and distant reading functions as the foundation of its digital humanities tools.

Design/methodology/approach

The results of the first-stage development are introduced in this study, and a qualitative case study is provided to describe the research process of a humanities scholar who used DHRP-BMP to discover the character relationships and contexts hidden in The Biographies of Malaysia Henghua Personalities.

Findings

Close reading provided by DHRP-BMP supported humanities scholars in comprehending full-text contents through a user-friendly reading interface, while distant reading assisted them in interpreting texts from a more macro perspective through text analysis. Functions such as keyword search, geographic information and social network analysis helped scholars grasp the character relationships and geographic distribution in the personality biographies, thus accelerating their text interpretation and uncovering hidden context.
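
The social network function rests on co-occurrence of recognized names. A minimal sketch with networkx, assuming a CNER model has already supplied the names per biography entry; the names below are invented for illustration.

```python
# Sketch of the distant-reading idea: build a character relationship
# graph from co-occurrence of names within the same biography entry.
# The names are invented; in the platform a CNER model supplies them.
from itertools import combinations
import networkx as nx

entries = [
    ["Lim Ah Kow", "Tan Boon Seng"],
    ["Lim Ah Kow", "Wong Mei Ling", "Tan Boon Seng"],
    ["Wong Mei Ling", "Lim Ah Kow"],
]

G = nx.Graph()
for names in entries:
    for a, b in combinations(sorted(set(names)), 2):
        weight = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)   # weight = co-occurrence count

# Rank the strongest relationships for the visualization layer.
for a, b, d in sorted(G.edges(data=True), key=lambda e: -e[2]["weight"]):
    print(f"{a} -- {b}: {d['weight']}")
```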

Originality/value

To date, digital humanities research platforms have lacked a real-time character relationship analysis tool that automatically generates visualized character relationship graphs, based on Chinese named entity recognition (CNER) and character relationship identification technologies, to assist humanities scholars in interpreting character relationships. This study therefore presents DHRP-BMP, which automatically identifies characters' names and relationships from personality biographies and provides a user-friendly visualization interface of those relationships, so that humanities scholars can explore complicated character relationships in the analyzed texts more efficiently and accurately.

Article
Publication date: 3 January 2024

Abba Suganda Girsang and Bima Krisna Noveta

The purpose of this study is to map the locations of natural disasters by extracting Twitter data. The tweet text is processed with named entity…

Abstract

Purpose

The purpose of this study is to map the locations of natural disasters by extracting Twitter data. The tweet text is processed with named entity recognition (NER) using a six-class location hierarchy for Indonesia. Each tweet is then classified into one of eight natural disaster classes using a support vector machine (SVM). Overall, the system is able to classify tweets and map the locations their content refers to.

Design/methodology/approach

This research builds a model to map the geolocation of tweet data using NER with six location classes based on Indonesia's regional hierarchy. The tweets are then classified into eight natural disaster classes using the SVM, as sketched below.
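
A minimal sketch of the SVM classification step with scikit-learn; the example tweets, labels and TF-IDF features are illustrative assumptions, and only four of the eight disaster classes are shown.

```python
# Sketch of the tweet classification step: TF-IDF features and a linear
# SVM. Example tweets and labels are illustrative assumptions; only
# four of the eight disaster classes are shown.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [
    "Flood waters rising in Jakarta after heavy rain",
    "Strong earthquake shakes Lombok, buildings damaged",
    "Forest fire spreading near Palangkaraya, smoke everywhere",
    "Mount Merapi eruption sends ash over Yogyakarta",
]
labels = ["flood", "earthquake", "forest fire", "volcanic eruption"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(tweets, labels)
print(clf.predict(["Flash flood reported in Bandung after heavy rain"]))
```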

Findings

Experimental results demonstrate that the proposed NER scheme, with six classes based on Indonesia's regional levels, is able to map disaster locations from Twitter data. The results also show good geocoding performance in terms of match rate, match score and match type. Moreover, with the SVM, the study can classify the collected tweets into eight classes of natural disasters specific to the Indonesian region.

Research limitations/implications

This study is implemented only for the Indonesian region.

Originality/value

(a) NER with six classes is used to create a location classification model with StanfordNER and ArcGIS tools. The six location classes reflect Indonesia's large area and its many regional levels: province, district/city, sub-district, village, road and place name. (b) The SVM is used to classify natural disasters into eight types: floods, earthquakes, landslides, tsunamis, hurricanes, forest fires, droughts and volcanic eruptions.

Details

International Journal of Intelligent Computing and Cybernetics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1756-378X


Article
Publication date: 25 October 2022

Victor Diogho Heuer de Carvalho and Ana Paula Cabral Seixas Costa

This article presents two Brazilian Portuguese corpora collected from different media concerning public security issues in a specific location. The primary motivation is…

Abstract

Purpose

This article presents two Brazilian Portuguese corpora collected from different media concerning public security issues in a specific location. The primary motivation is to support analyses so that security authorities can make appropriate decisions about their actions.

Design/methodology/approach

The corpora were obtained through web scraping of a newspaper's website and of tweets from a Brazilian metropolitan region. Natural language processing was applied, covering text cleaning, lemmatization, summarization, part-of-speech and dependency parsing, named entity recognition and topic modeling.
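
Most of these steps are available off the shelf; a minimal sketch with spaCy's Portuguese model follows. The sample sentence is invented, and summarization and topic modeling would require additional components (e.g. gensim for LDA).

```python
# Sketch of the per-document NLP steps with spaCy's Portuguese model
# (install with: python -m spacy download pt_core_news_sm). The sample
# sentence is invented for illustration.
import spacy

nlp = spacy.load("pt_core_news_sm")
# "The Military Police arrested two suspects in downtown Recife."
doc = nlp("A Polícia Militar prendeu dois suspeitos no centro do Recife.")

# Lemmas, part-of-speech tags and dependency relations per token.
for tok in doc:
    print(tok.text, tok.lemma_, tok.pos_, tok.dep_)

# Named entity recognition.
print([(ent.text, ent.label_) for ent in doc.ents])
```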

Findings

Several results were obtained with this methodology, including an example of automated summarization, dependency parses, the most common topics in each corpus, and the forty named entities and most common slogans extracted, with those linked to public security highlighted.

Research limitations/implications

From a research perspective, several critical tasks related to the applied methodology were identified: treating the noise that comes from scraping news from source websites; handling textual elements common in social network posts, such as abbreviations, emojis/emoticons and spelling errors; treating subjectivity to remove the noise of irony and sarcasm; and verifying that news items genuinely concern the target domain. All these tasks aim to improve the process so that interested authorities can perform accurate analyses.

Practical implications

The corpora dedicated to the public security domain enable several analyses, such as mining public opinion on security actions in a given location; understanding criminals' behaviors reported in the news or on social networks and building timelines of their activities; detecting movements that may damage public property or endanger people's welfare through texts from social networks; and tracing the history and repercussions of police actions by crossing news with records on social networks, among many other possibilities.

Originality/value

The corpora reported in this text represent one of the first initiatives to create Portuguese-language text collections dedicated to Brazil's specific public security domain.

Details

Library Hi Tech, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0737-8831

