Search results

1 – 10 of 259
Article
Publication date: 2 September 2019

Jelena Andonovski, Branislava Šandrih and Olivera Kitanović

Abstract

Purpose

This paper aims to describe the structure of an aligned Serbian-German literary corpus (SrpNemKor) contained in a digital library Bibliša. The goal of the research was to create a benchmark Serbian-German annotated corpus searchable with various query expansions.

Design/methodology/approach

The presented research is particularly focused on the enhancement of bilingual search queries in a full-text search of aligned SrpNemKor collection. The enhancement is based on using existing lexical resources such as Serbian morphological electronic dictionaries and the bilingual lexical database Termi.
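As a rough illustration of this kind of enhancement, the sketch below expands a single Serbian term with inflected forms from a toy morphological dictionary and with German equivalents from a toy bilingual table; all entries are invented stand-ins, not Bibliša's or Termi's actual data or code.

```python
# Toy morphological e-dictionary: lemma -> inflected forms (hypothetical).
MORPH_DICT = {"kuća": ["kuća", "kuće", "kući", "kućom", "kućama"]}

# Toy bilingual table in the spirit of Termi: Serbian lemma -> German terms.
TERMI = {"kuća": ["Haus", "Häuser"]}

def expand_query(term):
    """Expand a Serbian term with inflected forms and German equivalents."""
    expansions = {term}
    expansions.update(MORPH_DICT.get(term, []))  # morphological expansion
    expansions.update(TERMI.get(term, []))       # bilingual expansion
    return expansions

print(sorted(expand_query("kuća")))
# ['Haus', 'Häuser', 'kuća', 'kućama', 'kuće', 'kući', 'kućom']
```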

Findings

For the purpose of this research, the lexical database Termi is enriched with a bilingual list of German-Serbian translated pairs of lexical units. The list of correct translation pairs was extracted from SrpNemKor, evaluated and integrated into Termi. Also, Serbian morphological e-dictionaries are updated with new entries extracted from the Serbian part of the corpus.

Originality/value

Bilingual search of SrpNemKor is available within Bibliša's user-friendly platform. The enriched Termi database enables semantic enhancement and refinement of users' search queries based on synonyms in both Serbian and German. Serbian morphological e-dictionaries support the morphological expansion of Serbian search queries, enabling the analysis of concepts and concept structures by identifying the terms assigned to a concept and by establishing relations between Serbian and German terms. Together, these features make Bibliša a valuable Web tool that can support research on and analysis of SrpNemKor.

Details

The Electronic Library, vol. 37 no. 4
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 March 1998

Robert Gaizauskas and Yorick Wilks

Abstract

In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a pre‐specified set of entities, relations or events from natural language texts and to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960s and 70s till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.
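To make the notion of a template concrete, here is a minimal sketch in the spirit of the classic MUC management-succession task: a pre-specified set of slots filled from text. The regular expression and class are illustrative only, not a description of any system reviewed in the paper.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SuccessionTemplate:
    """Toy MUC-style template: pre-specified slots filled from text."""
    person: Optional[str] = None
    position: Optional[str] = None
    company: Optional[str] = None

def extract(text):
    template = SuccessionTemplate()
    match = re.search(r"(\w+ \w+) was named (CEO|CFO|president) of (\w+)", text)
    if match:
        template.person, template.position, template.company = match.groups()
    return template

print(extract("Jane Smith was named CEO of Acme."))
# SuccessionTemplate(person='Jane Smith', position='CEO', company='Acme')
```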

Details

Journal of Documentation, vol. 54 no. 1
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 August 2005

Carmen Galvez, Félix de Moya‐Anegón and Víctor H. Solana

Abstract

Purpose

To propose a categorization of the different conflation procedures into the two basic approaches, non-linguistic and linguistic techniques, and to justify the application of normalization methods within the framework of linguistic techniques.

Design/methodology/approach

Presents a range of term conflation methods that can be used in information retrieval. The uniterm and multiterm variants can be considered equivalent units for the purposes of automatic indexing. Stemming algorithms, segmentation rules, association measures and clustering techniques are well-evaluated non-linguistic methods, and experiments with these techniques show a wide variety of results. Alternatively, lemmatisation and the use of syntactic pattern-matching, through equivalence relations represented in finite-state transducers (FSTs), are emerging methods for the recognition and standardization of terms.
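The contrast between the two families can be seen in a few lines of code. The sketch below uses NLTK's Porter stemmer and WordNet lemmatizer as convenient stand-ins for the surveyed techniques; the paper itself is not tied to NLTK.

```python
# May require: import nltk; nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word, pos in [("studies", "v"), ("studying", "v"), ("better", "a")]:
    print(word,
          "| stem:", stemmer.stem(word),                    # non-linguistic
          "| lemma:", lemmatizer.lemmatize(word, pos=pos))  # linguistic
# The stemmer truncates both "studies" and "studying" to "studi",
# while the lemmatizer returns the dictionary forms "study" and "good".
```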

Findings

The survey attempts to point out the positive and negative effects of the linguistic approach and its potential as a term conflation method.

Originality/value

Outlines the importance of FSTs for the normalization of term variants.

Details

Journal of Documentation, vol. 61 no. 4
Type: Research Article
ISSN: 0022-0418

Open Access
Article
Publication date: 19 July 2022

Shreyesh Doppalapudi, Tingyan Wang and Robin Qiu

Abstract

Purpose

Clinical notes typically contain medical jargon and specialized words and phrases that are complicated and technical for most people, which is one of the most challenging obstacles to the dissemination of health information from healthcare providers to consumers. The authors aim to investigate how to leverage machine learning techniques to transform clinical notes of interest into understandable expressions.

Design/methodology/approach

The authors propose a natural language processing pipeline that is capable of extracting relevant information from long unstructured clinical notes and simplifying lexicons by replacing medical jargon and technical terms. In particular, the authors develop an unsupervised keyword-matching method to extract relevant information from clinical notes. To automatically evaluate the completeness of the extracted information, the authors perform a multi-label classification task on the relevant texts. To simplify lexicons in the relevant text, the authors identify complex words using a sequence labeler and leverage transformer models to generate candidate words for substitution. The authors validate the proposed pipeline using 58,167 discharge summaries from critical care services.
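The candidate-generation step can be sketched with a generic masked language model. The example below uses the Hugging Face fill-mask pipeline with bert-base-uncased as a stand-in; the authors' actual models, clinical vocabularies and ranking logic are not reproduced here.

```python
from transformers import pipeline

# bert-base-uncased is a stand-in model; its mask token is [MASK].
fill = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The patient had [MASK] after the operation."
# Rank candidate substitutions for a flagged complex word (e.g. "emesis").
for candidate in fill(sentence, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```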

Findings

The results show that the proposed pipeline can identify relevant information with high completeness and simplify complex expressions in clinical notes so that the converted notes have a high level of readability but a low degree of meaning change.

Social implications

The proposed pipeline can help healthcare consumers better understand their medical information and thereby strengthen communication between healthcare providers and consumers for better care.

Originality/value

An innovative pipeline approach is developed to address the health literacy problem confronted by healthcare providers and consumers in the ongoing digital transformation process in the healthcare industry.

Article
Publication date: 15 August 2016

Ioana Barbantan, Mihaela Porumb, Camelia Lemnaru and Rodica Potolea

Abstract

Purpose

Improving healthcare services by developing assistive technologies includes both the health aid devices and the analysis of the data they collect. The acquired data, modeled as a knowledge base, give more insight into each patient's health status and needs. Therefore, the ultimate goal of a healthcare system is to obtain recommendations from an assistive decision support system built on such a knowledge base, benefiting patients, physicians and the healthcare industry. This paper aims to define the knowledge flow for a medical assistive decision support system by structuring raw medical data and leveraging the knowledge contained in the data, proposing solutions for efficient data search, medical investigation or diagnosis, and medication prediction and relationship identification.

Design/methodology/approach

The solution this paper proposes for implementing a medical assistive decision support system can analyze any type of unstructured medical document. Documents are processed by applying natural language processing (NLP) tasks followed by semantic analysis, leading to medical concept identification and thus imposing a structure on the input documents. The structured information is filtered and classified so that custom decisions regarding patients' health status can be made. The current research focuses on identifying the relationships between medical concepts as defined by the REMed (Relation Extraction from Medical documents) solution, which aims at finding the patterns that lead to the classification of concept pairs into concept-to-concept relations.

Findings

This paper proposes the REMed solution, expressed as a multi-class classification problem tackled using a support vector machine classifier. Experimentally, the most appropriate setup for the multi-class classification problem was determined to be a combination of lexical, context, syntactic and grammatical features, as each feature category is good at representing particular relations, but not all of them. The best results obtained are an F1-measure of 74.9 per cent, which is 1.4 per cent better than the results reported by similar systems.
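As a schematic of this setup, the sketch below trains a linear SVM over a handful of toy concept-pair instances described by mixed feature groups. The feature names, values and labels are invented placeholders for the paper's lexical, context, syntactic and grammatical features.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy concept-pair instances with mixed feature groups (invented values).
X = [
    {"lex:verb_between": "improves", "ctx:token_dist": 3, "syn:path": "VP"},
    {"lex:verb_between": "worsens",  "ctx:token_dist": 2, "syn:path": "VP"},
    {"lex:verb_between": "treats",   "ctx:token_dist": 4, "syn:path": "NP"},
]
y = ["TrIP", "TrWP", "TrAP"]  # i2b2-style treatment-problem relation labels

clf = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict([{"lex:verb_between": "improves",
                    "ctx:token_dist": 3, "syn:path": "VP"}]))  # ['TrIP']
```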

Research limitations/implications

The difficulty of discriminating between TrIP and TrAP relations revolves around the hierarchical relationship between the two classes, as TrIP is a particular type (an instance) of TrAP. The intuition behind this behavior was that the classifier cannot discern the correct relations because of the bias toward the majority classes. The analysis was conducted using only sentences from electronic health records that contain at least two medical concepts. This limitation was imposed by the availability of annotated data with reported results, as relations were defined at the sentence level.

Originality/value

The originality of the proposed solution lies in the methodology to extract valuable information from medical records via semantic searches; concept-to-concept relation identification; and recommendations for diagnosis, treatment and further investigations. The REMed solution introduces a learning-based approach for the automatic discovery of relations between medical concepts. The authors propose an original list of features: three lexical, six context, four grammatical and four syntactic. The similarity feature introduced in this paper has a significant influence on the classification and, to the best of the authors' knowledge, has not been used as a feature in similar solutions.

Details

International Journal of Web Information Systems, vol. 12 no. 3
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 12 November 2018

Aleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović and Božo Kolonja

Abstract

Purpose

This paper aims to develop a system that enables efficient management and exploitation of electronic documentation related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing.

Design/methodology/approach

The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications for different user profiles and use cases.

Findings

The use of the system is illustrated by examples demonstrating keyword search supported by Web query expansion services; search based on regular expressions; corpus search based on local grammars, followed by extraction of information based on this search; and, finally, search with lexical masks using domain and semantic markers.
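A minimal sketch of the first of these services, keyword search backed by morphological query expansion, might look as follows; the dictionary entries and documents are hypothetical, and the real system's regular expressions and local grammars are far richer.

```python
import re

# Inflected forms a morphological dictionary might supply (hypothetical).
EXPANSIONS = {"rudnik": ["rudnik", "rudnika", "rudniku", "rudnicima"]}

def search(corpus, lemma):
    """Return documents containing any inflected form of the lemma."""
    forms = EXPANSIONS.get(lemma, [lemma])
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, forms)) + r")\b",
                         re.IGNORECASE)
    return [doc for doc in corpus if pattern.search(doc)]

docs = ["Projekat novog rudnika je odobren.", "Izveštaj o transportu uglja."]
print(search(docs, "rudnik"))  # matches the genitive form "rudnika"
```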

Originality/value

The presented system is the first software solution for applying human language technology to the management of documentation in the mining engineering domain, but it is also applicable to other engineering and non-engineering domains. The system is independent of the type of alphabet (Cyrillic or Latin), which makes it applicable to other languages of the Balkan region related to Serbian, and its support for morphological dictionaries can be applied to most morphologically complex languages, such as the Slavic languages. Significant search improvements and the efficiency of IE rest on semantic networks and terminology dictionaries, with the support of local grammars.

Details

The Electronic Library, vol. 36 no. 6
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 9 February 2015

Bhaskar Sinha, Somnath Chandra and Megha Garg

Abstract

Purpose

The purpose of this explorative research study is to focus on the implementation of semantic Web technology on the agriculture domain of e-governance data. The study contributes to an understanding of the problems and difficulties in implementations involving the unstructured and unformatted datasets of a multilingual, local-language-based electronic dictionary (IndoWordNet).

Design/methodology/approach

The approach proceeds from a conceptual, logical design to the realization of agriculture-based terms and terminology extracted from the linked multilingual IndoWordNet, while maintaining support for the World Wide Web Consortium (W3C) semantic Web standards, to generate an ontology and uniform Unicode-structured datasets.

Findings

The findings reveal partial support for the extraction of terms, relations and concepts when linking to IndoWordNet, yielding SynSets, lexical relations of words and relations among them. This supported the generation of an ontology, hierarchical modeling and the creation of structured metadata datasets.
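The step from synsets to ontology triples can be sketched generically. The code below uses Princeton WordNet via NLTK as a stand-in, since IndoWordNet is accessed through its own APIs; the predicate names are illustrative, not W3C vocabulary.

```python
# May require: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def synset_triples(word):
    """Yield (subject, predicate, object) triples from a word's synsets,
    using hypernymy as the 'is-a' class hierarchy relation."""
    for syn in wn.synsets(word):
        for lemma in syn.lemma_names():
            yield (lemma, "inSynset", syn.name())
        for hyper in syn.hypernyms():
            yield (syn.name(), "subClassOf", hyper.name())

for triple in list(synset_triples("wheat"))[:6]:
    print(triple)
```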

Research limitations/implications

IndoWordNet has limitations, as it is not a fully revised version owing to India's diverse cultural base, and a new version is yet to be released. As mentioned in Section 5 of the paper, the implications of these ideas and experiments should encourage further exploration and better applications using such wordnets.

Practical implications

Language developer tools and frameworks were used to process tagged, annotated raw data and obtain intermediate results, which serve as a source for the generation of the ontology and dynamic metadata.

Social implications

The results are expected to be applicable to other e-governance applications, enabling better use of such applications in social and government departments.

Originality/value

The authors worked through experiments on raw information-source datasets, obtaining satisfactory results such as SynSets, sense counts, semantic and lexical relations, class-concept hierarchies and other related output, which helped in developing an ontology of the domain of interest and, hence, in creating dynamic metadata that can be used globally to support various applications.

Details

Journal of Knowledge Management, vol. 19 no. 1
Type: Research Article
ISSN: 1367-3270

Article
Publication date: 1 March 2006

Fidelia Ibekwe‐SanJuan

Abstract

Purpose

To propose a comprehensive and semi‐automatic method for constructing or updating knowledge organization tools such as thesauri.

Design/methodology/approach

The paper proposes a comprehensive methodology for thesaurus construction and maintenance, combining shallow NLP with a clustering algorithm and an information visualization interface. The resulting system, TermWatch, extracts terms from a text collection, mines semantic relations between them using complementary linguistic approaches and clusters terms by these semantic relations. The clusters are mapped onto a 2D space using an integrated visualization tool.
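The clustering stage can be sketched as grouping terms by the semantic relations they share. In the toy example below, the binary relation matrix is an invented stand-in for TermWatch's linguistically mined links, and scikit-learn's agglomerative clustering stands in for the system's adapted algorithm.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

terms = ["text mining", "text analysis", "ontology", "thesaurus"]
# Rows: terms; columns: binary indicators of mined semantic relations
# (synonymy links, generic/specific links, ...) -- invented values.
relations = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(relations)
for term, label in zip(terms, labels):
    print(label, term)  # terms sharing relations land in the same cluster
```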

Findings

The clusters formed exhibit the different relations necessary to populate a thesaurus or ontology: synonymy, generic/specific and relatedness. The clusters represent, for a given term, its closest neighbours in terms of semantic relations.

Practical implications

This could change the way in which information professionals (librarians and documentalists) undertake knowledge organization tasks. TermWatch can be useful both as a starting point for grasping the conceptual organization of knowledge in a huge text collection without having to read the texts, and as a suggestive tool for populating the different hierarchies of a thesaurus or an ontology, because its clusters are based on semantic relations.

Originality/value

The value lies in several points: the combined use of linguistic relations with an adapted clustering algorithm that is scalable and can handle sparse data; a comprehensive approach to semantic relation acquisition, whereas existing studies often use only one or two approaches; and domain knowledge maps that represent an added advantage over existing approaches to automatic thesaurus construction, in that clusters are formed using semantic relations between domain terms. Thus, while offering a meaningful synthesis of the information contained in the original corpus through clustering, the results can be used for knowledge organization tasks (thesaurus building and ontology population). The system also constitutes a platform for performing several knowledge-oriented tasks, such as science and technology watch, text mining and query refinement.

Details

Journal of Documentation, vol. 62 no. 2
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 17 May 2022

Qiucheng Liu

Abstract

To analyze the text complexity of Chinese and foreign academic English writings, an artificial neural network (ANN) under deep learning (DL) is applied to the study of text complexity. First, the research status and existing problems of text complexity are introduced from a DL perspective. Second, the text complexity of Chinese and foreign academic English writings is analyzed using the Back Propagation Neural Network (BPNN) algorithm, and a BPNN syntactic complexity evaluation system is established. Third, MATLAB R2013b is used for simulation analysis of the model; the proposed BPNN algorithm is compared with other classical algorithms, and the weight of each index and the model's training performance are further analyzed with statistical methods. Finally, the L2 Syntactic Complexity Analyzer (L2SCA) is used to calculate the syntactic complexity of the two corpora, and the Mann–Whitney U test is used to compare the syntactic complexity of Chinese English learners and native English speakers. The experimental results show that, compared with a shallow neural network, the deep neural network algorithm has more hidden layers, richer features and better feature-extraction performance. The BPNN algorithm performs well during training, with actual output values very close to the expected values; analysis of the test-sample error shows that the evaluation error of the BPNN algorithm is less than 1.8%, indicating high accuracy. However, there are significant differences in syntactic complexity among students with different levels of English writing proficiency, and some measures cannot effectively reflect the types and characteristics of written language or may even relate negatively to writing quality. The research also finds that syntactic complexity measures are sensitive to writing proficiency. Therefore, the BPNN algorithm can effectively analyze the text complexity of academic English writing, and the results provide a reference for improving the evaluation of text complexity in academic paper writing.
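The final comparison step can be sketched directly with SciPy's Mann–Whitney U test; the score lists below are invented examples of per-text syntactic-complexity indices of the kind L2SCA produces.

```python
from scipy.stats import mannwhitneyu

# Invented per-text syntactic-complexity scores (e.g. mean length of clause).
learner_scores = [1.32, 1.41, 1.28, 1.35, 1.30]
native_scores = [1.52, 1.47, 1.60, 1.55, 1.49]

stat, p = mannwhitneyu(learner_scores, native_scores, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")  # a small p-value suggests the groups differ
```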


Details

Library Hi Tech, vol. 41 no. 5
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 1 May 2006

Carmen Galvez and Félix de Moya‐Anegón

Abstract

Purpose

To evaluate the accuracy of conflation methods based on finite‐state transducers (FSTs).

Design/methodology/approach

Incorrectly lemmatized and stemmed forms may lead to the retrieval of inappropriate documents. Experimental studies to date have focused on retrieval performance, but very few on conflation performance. The process of normalization we used involved a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus for merging term variants in canonical lemmatized forms. Conflation performance was evaluated in terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual retrieval. The results were compared with those obtained using a Spanish version of the Porter algorithm.
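The accuracy-and-coverage adaptation of precision and recall can be sketched as follows; the Spanish word–lemma pairs are invented examples, not the study's test corpus.

```python
# Gold lemmas for variant forms versus the forms the conflator produced.
gold = {"niños": "niño", "cantaba": "cantar", "mesas": "mesa"}
produced = {"niños": "niño", "cantaba": "cantaba"}  # one wrong, one unanalyzed

correct = sum(1 for w, lemma in produced.items() if gold.get(w) == lemma)
accuracy = correct / len(produced)    # precision-like: correct / attempted
coverage = len(produced) / len(gold)  # recall-like: attempted / all variants
print(f"accuracy = {accuracy:.2f}, coverage = {coverage:.2f}")  # 0.50, 0.67
```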

Findings

The conclusion is that the main strength of lemmatization is its accuracy, whereas its main limitation is the underanalysis of variant forms.

Originality/value

The report outlines the potential of transducers in their application to normalization processes.

Details

Journal of Documentation, vol. 62 no. 3
Type: Research Article
ISSN: 0022-0418
