Search results

1 – 10 of 175
Article

Jelena Andonovski, Branislava Šandrih and Olivera Kitanović

Abstract

Purpose

This paper aims to describe the structure of an aligned Serbian-German literary corpus (SrpNemKor) contained in a digital library Bibliša. The goal of the research was to create a benchmark Serbian-German annotated corpus searchable with various query expansions.

Design/methodology/approach

The presented research is particularly focused on the enhancement of bilingual search queries in a full-text search of aligned SrpNemKor collection. The enhancement is based on using existing lexical resources such as Serbian morphological electronic dictionaries and the bilingual lexical database Termi.

Findings

For the purpose of this research, the lexical database Termi is enriched with a bilingual list of German-Serbian translated pairs of lexical units. The list of correct translation pairs was extracted from SrpNemKor, evaluated and integrated into Termi. Also, Serbian morphological e-dictionaries are updated with new entries extracted from the Serbian part of the corpus.

Originality/value

A bilingual search of SrpNemKor in Bibliša is available within the user-friendly platform. The enriched database Termi enables semantic enhancement and refinement of a user's search query based on synonyms in both Serbian and German. Serbian morphological e-dictionaries facilitate the morphological expansion of search queries in Serbian, thereby enabling the analysis of concepts and concept structures by identifying terms assigned to a concept and by establishing relations between terms in Serbian and German. This makes Bibliša a valuable Web tool that can support research and analysis of SrpNemKor.
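
The query expansion described above can be illustrated with a small sketch: an inflected-form lookup standing in for the Serbian morphological e-dictionaries, and a translation table standing in for Termi. All entries are invented toy data, not taken from Bibliša.

```python
# Sketch of morphological and bilingual query expansion, assuming a toy
# fragment of a Serbian morphological e-dictionary; real entries differ.
MORPH_DICT = {
    # lemma -> inflected forms (illustrative subset of a noun paradigm)
    "knjiga": ["knjiga", "knjige", "knjizi", "knjigu", "knjigom", "knjigama"],
    "reka": ["reka", "reke", "reci", "reku", "rekom", "rekama"],
}

SYNONYMS = {
    # bilingual pairs as a stand-in for the Termi lexical database
    "knjiga": ["Buch"],
}

def expand_query(term):
    """Expand a lemma into all inflected forms plus translated synonyms."""
    expanded = set(MORPH_DICT.get(term, [term]))
    expanded.update(SYNONYMS.get(term, []))
    return expanded

print(sorted(expand_query("knjiga")))
```

A full-text search engine would then match any of the expanded forms instead of the literal query string.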

Details

The Electronic Library, vol. 37 no. 4
Type: Research Article
ISSN: 0264-0473

Article

Robert Gaizauskas and Yorick Wilks

Abstract

In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a pre‐specified set of entities, relations or events from natural language texts and to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960s and 70s till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.
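
The template-filling task the abstract describes can be sketched minimally: a pre-specified pattern extracts entities and their relation from free text and records them in a structured "template". The pattern, fields and example sentence are illustrative, not drawn from the paper.

```python
import re

# Minimal sketch of template-style information extraction: pull a
# pre-specified relation (person, role, company) out of natural language
# text and record it in a structured template dict.
PATTERN = re.compile(
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) was named (?P<role>\w+) of (?P<company>[A-Z]\w+)"
)

def extract_templates(text):
    """Return one filled template per pattern match."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

text = "Ana Petrov was named CEO of Acme. Jon Smith was named CTO of Initech."
for tpl in extract_templates(text):
    print(tpl)
```

Real IE systems replace the single regular expression with cascades of finite-state patterns or learned extractors, but the input/output contract (text in, filled templates out) is the same.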

Details

Journal of Documentation, vol. 54 no. 1
Type: Research Article
ISSN: 0022-0418

Article

Carmen Galvez, Félix de Moya‐Anegón and Víctor H. Solana

Abstract

Purpose

To propose a categorization of the different conflation procedures into the two basic approaches, non-linguistic and linguistic techniques, and to justify the application of normalization methods within the framework of linguistic techniques.

Design/methodology/approach

Presents a range of term conflation methods that can be used in information retrieval. The uniterm and multiterm variants can be considered equivalent units for the purposes of automatic indexing. Stemming algorithms, segmentation rules, association measures and clustering techniques are well-evaluated non-linguistic methods, and experiments with these techniques show a wide variety of results. Alternatively, lemmatisation and the use of syntactic pattern-matching, through equivalence relations represented in finite-state transducers (FSTs), are emerging methods for the recognition and standardization of terms.
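
The contrast between the two families of conflation methods can be sketched in a few lines: a crude suffix-stripping stemmer stands in for the non-linguistic techniques, and a dictionary-backed lemmatizer for the linguistic ones. Both tables are toy illustrations, not the survey's actual resources.

```python
# Non-linguistic conflation: strip known suffixes, no dictionary needed.
SUFFIXES = ["ations", "ation", "ing", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# Linguistic conflation: a lemma dictionary handles irregular forms that
# suffix stripping cannot reach, falling back to the stemmer otherwise.
LEMMAS = {"went": "go", "better": "good", "indices": "index"}

def lemmatize(word):
    return LEMMAS.get(word, stem(word))

print(stem("normalizations"))  # suffix stripping only
print(lemmatize("went"))       # dictionary lookup handles irregulars
```

The stemmer conflates regular variants cheaply but produces non-words; the lemmatizer returns canonical forms at the cost of maintaining a lexicon, which is the trade-off the survey examines.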

Findings

The survey attempts to point out the positive and negative effects of the linguistic approach and its potential as a term conflation method.

Originality/value

Outlines the importance of FSTs for the normalization of term variants.

Details

Journal of Documentation, vol. 61 no. 4
Type: Research Article
ISSN: 0022-0418

Article

Ioana Barbantan, Mihaela Porumb, Camelia Lemnaru and Rodica Potolea

Abstract

Purpose

Improving healthcare services by developing assistive technologies includes both the health aid devices and the analysis of the data collected by them. The acquired data, modeled as a knowledge base, give more insight into each patient's health status and needs. Therefore, the ultimate goal of a healthcare system is obtaining recommendations provided by an assistive decision support system using such a knowledge base, benefiting the patients, the physicians and the healthcare industry. This paper aims to define the knowledge flow for a medical assistive decision support system by structuring raw medical data and leveraging the knowledge contained in the data, proposing solutions for efficient data search, medical investigation or diagnosis, and medication prediction and relationship identification.

Design/methodology/approach

The solution this paper proposes for implementing a medical assistive decision support system can analyze any type of unstructured medical documents which are processed by applying Natural Language Processing (NLP) tasks followed by semantic analysis, leading to the medical concept identification, thus imposing a structure on the input documents. The structured information is filtered and classified such that custom decisions regarding patients’ health status can be made. The current research focuses on identifying the relationships between medical concepts as defined by the REMed (Relation Extraction from Medical documents) solution that aims at finding the patterns that lead to the classification of concept pairs into concept-to-concept relations.

Findings

This paper proposes the REMed solution, expressed as a multi-class classification problem tackled using the support vector machine classifier. Experimentally, the most appropriate setup for the multi-class classification problem was determined to be a combination of lexical, context, syntactic and grammatical features, as each feature category is good at representing particular relations, but not all. The best results obtained are expressed as an F1-measure of 74.9 per cent, which is 1.4 per cent better than the results reported by similar systems.
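
The feature combination the classifier consumes can be sketched as a function that turns one sentence and a pair of concepts into a feature dict; the concrete features below are invented stand-ins for the paper's lexical, context, syntactic and grammatical categories, not the published feature set.

```python
# Sketch of REMed-style feature extraction for a concept pair found in
# one sentence. A real system would feed vectors like this to an SVM.
def pair_features(sentence, c1, c2):
    tokens = sentence.split()
    i, j = tokens.index(c1), tokens.index(c2)
    lo, hi = min(i, j), max(i, j)
    between = tokens[lo + 1 : hi]
    return {
        "lex_c1": c1.lower(),              # lexical features
        "lex_c2": c2.lower(),
        "ctx_between": " ".join(between),  # context features
        "ctx_distance": hi - lo - 1,
        "syn_c1_first": i < j,             # crude word-order stand-in
    }

feats = pair_features("aspirin improves headache", "aspirin", "headache")
print(feats)
```

Each feature category captures a different signal (surface form, intervening words, order), which is why the paper finds that only their combination covers all relation types.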

Research limitations/implications

The difficulty in discriminating between TrIP and TrAP relations revolves around the hierarchical relationship between the two classes, as TrIP is a particular type (an instance) of TrAP. The intuition behind this behavior was that the classifier cannot discern the correct relations because of the bias toward the majority classes. The analysis was conducted by using only sentences from electronic health records that contain at least two medical concepts. This limitation was introduced by the availability of the annotated data with reported results, as relations were defined at sentence level.

Originality/value

The originality of the proposed solution lies in the methodology to extract valuable information from the medical records via semantic searches; concept-to-concept relation identification; and recommendations for diagnosis, treatment and further investigations. The REMed solution introduces a learning-based approach for the automatic discovery of relations between medical concepts. We propose an original list of features: lexical – 3, context – 6, grammatical – 4 and syntactic – 4. The similarity feature introduced in this paper has a significant influence on the classification, and, to the best of the authors’ knowledge, it has not been used as feature in similar solutions.

Details

International Journal of Web Information Systems, vol. 12 no. 3
Type: Research Article
ISSN: 1744-0084

Article

Aleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović and Božo Kolonja

Abstract

Purpose

This paper aims to develop a system, which would enable efficient management and exploitation of documentation in electronic form, related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing.

Design/methodology/approach

The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases.

Findings

The use of the system is illustrated by examples demonstrating keyword search supported by Web query expansion services, search based on regular expressions, corpus search based on local grammars, followed by extraction of information based on this search and finally, search with lexical masks using domain and semantic markers.
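
The regular-expression search mentioned above can be sketched as a concordance function over a corpus: every hit of a pattern is returned with a short left and right context. The corpus lines are invented examples, not from the mining documentation.

```python
import re

# Sketch of regex-based corpus search with concordance output.
CORPUS = [
    "The excavator operates in the open pit mine.",
    "Pit slope stability was assessed last year.",
    "Conveyor belts transport ore from the pit.",
]

def concordance(pattern, corpus=CORPUS, window=2):
    """Return (left context, keyword, right context) for each match."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for line in corpus:
        words = line.split()
        for k, w in enumerate(words):
            if rx.fullmatch(w.strip(".,")):
                left = " ".join(words[max(0, k - window):k])
                right = " ".join(words[k + 1:k + 1 + window])
                hits.append((left, w.strip(".,"), right))
    return hits

for left, kw, right in concordance(r"pits?"):
    print(f"{left} [{kw}] {right}")
```

In the described system such patterns are combined with morphological dictionaries and local grammars, so a single query can match all inflected forms of a domain term.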

Originality/value

The presented system is the first software solution for implementation of human language technology in management of documentation from the mining engineering domain, but it is also applicable to other engineering and non-engineering domains. The system is independent of the type of alphabet (Cyrillic and Latin), which makes it applicable to other languages of the Balkan region related to Serbian, and its support for morphological dictionaries can be applied in most morphologically complex languages, such as Slavic languages. Significant search improvements and the efficiency of IE are based on semantic networks and terminology dictionaries, with the support of local grammars.

Details

The Electronic Library, vol. 36 no. 6
Type: Research Article
ISSN: 0264-0473

Article

Bhaskar Sinha, Somnath Chandra and Megha Garg

Abstract

Purpose

The purpose of this explorative research study is to focus on the implementation of semantic Web technology on the agriculture domain of e-governance data. The study contributes to an understanding of problems and difficulties in implementations of unstructured and unformatted unique datasets of a multilingual local language-based electronic dictionary (IndoWordNet).

Design/methodology/approach

The approach proceeds from a conceptual and logical design to the realization of agriculture-based terms and terminology extracted from the linked multilingual IndoWordNet, while maintaining the support and specification of the World Wide Web Consortium (W3C) standard of semantic Web technology to generate ontology and uniform Unicode structured datasets.

Findings

The findings reveal partial support for the extraction of terms, relations and concepts when linking to IndoWordNet, resulting in SynSets, lexical relations between words and relations among the SynSets themselves. This helped in the generation of ontology, hierarchical modeling and creation of structured metadata datasets.
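
The kind of synset and relation data the study extracts can be sketched with a toy structure: synsets carry lemmas and a hypernym link, from which a concept hierarchy is read off. The entries and structure are invented stand-ins, not IndoWordNet's real data model or API.

```python
# Toy wordnet fragment: synset -> lemmas and one hypernym link.
SYNSETS = {
    "wheat": {"lemmas": ["wheat"], "hypernym": "cereal"},
    "cereal": {"lemmas": ["cereal", "grain"], "hypernym": "plant"},
}

def hypernym_chain(word, synsets=SYNSETS):
    """Walk hypernym links upward to recover a concept hierarchy path."""
    chain = []
    while word in synsets:
        chain.append(word)
        word = synsets[word].get("hypernym")
    return chain

print(hypernym_chain("wheat"))
```

Chains like this are what turn flat lexical entries into the hierarchical model the ontology generation step needs.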

Research limitations/implications

IndoWordNet has limitations, as it is not a fully revised version, owing to the diversified cultural base in India, and the new version is yet to be released. As mentioned in Section 5, implications of these ideas and experiments will have a good impact in doing more exploration and building better applications using such a wordnet.

Practical implications

Language developer tools and frameworks have been used to process tagged, annotated raw data and obtain intermediate results, which serve as a source for the generation of ontology and dynamic metadata.

Social implications

The results are expected to be applied to other e-governance applications, enabling better use of applications in social and government departments.

Originality/value

The authors have worked out experimental facts and raw information source datasets, revealing satisfactory results such as SynSets, sense count, semantic and lexical relations, class concept hierarchies and other related output, which helped in developing an ontology of the domain of interest and, hence, in creating dynamic metadata that can be used globally to support various applications.

Details

Journal of Knowledge Management, vol. 19 no. 1
Type: Research Article
ISSN: 1367-3270

Article

Fidelia Ibekwe‐SanJuan

Abstract

Purpose

To propose a comprehensive and semi‐automatic method for constructing or updating knowledge organization tools such as thesauri.

Design/methodology/approach

The paper proposes a comprehensive methodology for thesaurus construction and maintenance, combining shallow NLP with a clustering algorithm and an information visualization interface. The resulting system, TermWatch, extracts terms from a text collection, mines semantic relations between them using complementary linguistic approaches and clusters terms using these semantic relations. The clusters are mapped onto a 2D map using an integrated visualization tool.
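
The idea of clustering terms by the semantic relations that link them can be sketched with a much simpler device than TermWatch's adapted clustering algorithm: treat relations as edges and take connected components as clusters. The relations and terms below are invented, and connected components are only a crude stand-in for the paper's method.

```python
# Cluster terms linked by semantic relations via connected components,
# using a minimal union-find over the relation edges.
RELATIONS = [
    ("information retrieval", "document retrieval"),  # synonymy
    ("document retrieval", "text retrieval"),
    ("ontology", "thesaurus"),                        # relatedness
]

def cluster(relations):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in relations:
        parent[find(a)] = find(b)
    groups = {}
    for term in parent:
        groups.setdefault(find(term), set()).add(term)
    return list(groups.values())

for c in cluster(RELATIONS):
    print(sorted(c))
```

Each resulting cluster gathers, for a given term, its closest neighbours in terms of semantic relations, which is the property the Findings section describes.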

Findings

The clusters formed exhibit the different relations necessary to populate a thesaurus or ontology: synonymy, generic/specific and relatedness. The clusters represent, for a given term, its closest neighbours in terms of semantic relations.

Practical implications

This could change the way in which information professionals (librarians and documentalists) undertake knowledge organization tasks. TermWatch can be useful both as a starting point for grasping the conceptual organization of knowledge in a huge text collection without having to read the texts, and as a suggestive tool for populating the different hierarchies of a thesaurus or an ontology, because its clusters are based on semantic relations.

Originality/value

The value lies in several points: the combined use of linguistic relations with an adapted clustering algorithm, which is scalable and can handle sparse data. The paper proposes a comprehensive approach to semantic relation acquisition, whereas existing studies often use only one or two approaches. The domain knowledge maps produced by the system represent an added advantage over existing approaches to automatic thesaurus construction, in that clusters are formed using semantic relations between domain terms. Thus, while offering a meaningful synthesis of the information contained in the original corpus through clustering, the results can be used for knowledge organization tasks (thesaurus building and ontology population). The system also constitutes a platform for performing several knowledge-oriented tasks, such as science and technology watch, text mining and query refinement.

Details

Journal of Documentation, vol. 62 no. 2
Type: Research Article
ISSN: 0022-0418

Article

Carmen Galvez and Félix de Moya‐Anegón

Abstract

Purpose

To evaluate the accuracy of conflation methods based on finite‐state transducers (FSTs).

Design/methodology/approach

Incorrectly lemmatized and stemmed forms may lead to the retrieval of inappropriate documents. Experimental studies to date have focused on retrieval performance, but very few on conflation performance. The process of normalization we used involved a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus for merging term variants in canonical lemmatized forms. Conflation performance was evaluated in terms of an adaptation of recall and precision measures, based on accuracy and coverage, not actual retrieval. The results were compared with those obtained using a Spanish version of the Porter algorithm.
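
The adapted recall/precision evaluation described above can be sketched as a pair of measures over a gold standard: coverage (how many variants get analyzed at all) and accuracy (how many analyzed variants receive the correct canonical form). This is one plausible reading of the paper's adaptation, with invented toy Spanish data, not the actual test corpus.

```python
# Evaluate a conflation function against gold variant -> lemma pairs.
GOLD = {"niños": "niño", "cantaba": "cantar", "mejores": "mejor"}

def evaluate(conflate, gold=GOLD):
    """Return (coverage, accuracy) of a conflation function.

    coverage: fraction of gold variants the function analyzes at all;
    accuracy: fraction of analyzed variants with the correct lemma.
    """
    analyzed = {w: conflate(w) for w in gold if conflate(w) is not None}
    correct = sum(1 for w, lemma in analyzed.items() if lemma == gold[w])
    coverage = len(analyzed) / len(gold)
    accuracy = correct / len(analyzed) if analyzed else 0.0
    return coverage, accuracy

# A toy lemmatizer that knows two of the three gold forms, one wrongly.
TOY = {"niños": "niño", "cantaba": "canto"}
print(evaluate(TOY.get))
```

The split between the two numbers mirrors the paper's conclusion: a lemmatizer can be highly accurate on the forms it analyzes while still underanalyzing variants it does not cover.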

Findings

The conclusion is that the main strength of lemmatization is its accuracy, whereas its main limitation is the underanalysis of variant forms.

Originality/value

The report outlines the potential of transducers in their application to normalization processes.

Details

Journal of Documentation, vol. 62 no. 3
Type: Research Article
ISSN: 0022-0418

Article

Alexander Mehler and Ulli Waltinger

Abstract

Purpose

The purpose of this paper is to present a topic classification model using the Dewey Decimal Classification (DDC) as the target scheme. This is to be done by exploring metadata as provided by the Open Archives Initiative (OAI) to derive document snippets as minimal document representations. The reason is to reduce the effort of document processing in digital libraries. Further, the paper seeks to perform feature selection and extension by means of social ontologies and related web‐based lexical resources. This is done to provide reliable topic‐related classifications while circumventing the problem of data sparseness. Finally, the paper aims to evaluate the model by means of two language‐specific corpora. The paper bridges digital libraries, on the one hand, and computational linguistics, on the other. The aim is to make accessible computational linguistic methods to provide thematic classifications in digital libraries based on closed topic models such as the DDC.
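
The snippet derivation the abstract describes can be sketched as a projection of an OAI record onto a few metadata fields. The field names follow Dublin Core as commonly exposed via OAI-PMH; the record itself is invented, and the paper's actual field selection may differ.

```python
# Derive a minimal "document snippet" from OAI-style metadata instead of
# processing the full text.
record = {
    "dc:title": "Neural approaches to library classification",
    "dc:subject": ["machine learning", "libraries"],
    "dc:description": "We study automatic DDC assignment ...",
}

def snippet(rec, fields=("dc:title", "dc:subject")):
    """Concatenate selected metadata fields into one classification input."""
    parts = []
    for f in fields:
        value = rec.get(f, "")
        parts.append(" ".join(value) if isinstance(value, list) else value)
    return " ".join(p for p in parts if p)

print(snippet(record))
```

Such snippets, optionally extended with terms from social ontologies and web-based lexical resources, are what the SVM classifiers are trained on.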

Design/methodology/approach

The approach takes the form of text classification, drawing on text technology, computational linguistics, computational semantics and social semantics.

Findings

It is shown that SVM‐based classifiers perform best by exploring certain selections of OAI document metadata.

Research limitations/implications

The findings show that it is necessary to further develop SVM‐based DDC‐classifiers by using larger training sets possibly for more than two languages in order to get better F‐measure values.

Originality/value

Algorithmic and formal‐mathematical information is provided on how to build DDC‐classifiers for digital libraries.

Details

Library Hi Tech, vol. 27 no. 4
Type: Research Article
ISSN: 0737-8831

Article

Vesna Pajić, Staša Vujičić Stanković, Ranka Stanković and Miloš Pajić

Abstract

Purpose

A hybrid approach is presented, which combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts.

Design/methodology/approach

The method is designed to be domain and language independent, focusing on languages with rich morphology. Here, it is used for extracting multiword terms from texts in Serbian, belonging to the agricultural engineering domain, as a use case. Predefined syntactic structures were used for multiword terms. For each structure, a finite state transducer was developed, which recognizes text sequences having that structure and outputs the sequence in a normalized form, so that different inflectional forms of the same multiword term can be counted properly. Term candidates were further filtered by their frequencies and evaluated by two domain experts.
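
The pipeline described above can be sketched in simplified form: match a predefined syntactic structure (here Adj+Noun over toy POS-tagged text), normalize each hit to a canonical form so inflected variants count together, then filter candidates by frequency. The finite-state transducers are replaced here by simple tag matching and a lookup table; the tags, entries and threshold are illustrative, not the paper's resources.

```python
from collections import Counter

# Pattern-based multiword term candidate extraction with normalization
# and frequency filtering, on a toy Serbian fragment.
TAGGED = [
    ("poljoprivredne", "ADJ"), ("mašine", "NOUN"), ("rade", "VERB"),
    ("poljoprivrednih", "ADJ"), ("mašina", "NOUN"),
]
NORMALIZE = {
    # inflected Adj+Noun sequence -> canonical (nominative singular) form
    ("poljoprivredne", "mašine"): "poljoprivredna mašina",
    ("poljoprivrednih", "mašina"): "poljoprivredna mašina",
}

def extract_candidates(tagged, min_freq=2):
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) == ("ADJ", "NOUN"):
            counts[NORMALIZE.get((w1, w2), f"{w1} {w2}")] += 1
    return [term for term, n in counts.items() if n >= min_freq]

print(extract_candidates(TAGGED))
```

Because both inflected occurrences normalize to the same canonical form, the candidate passes the frequency filter, which is exactly why the paper normalizes before counting.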

Findings

By using language resources, such as electronic dictionaries and grammars, 928 multiword terms were extracted out of 1,523 multiword terms that were recognized as candidates from a corpus having 42,260 different simple word forms; 870 of these were new, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich the dictionary.

Originality/value

The paper presents methodology that can significantly contribute to the development of terminology lexicons in different areas. In this particular use case, some important agricultural engineering concepts were extracted from the text, but this approach could be used for other domains and languages as well.

Details

The Electronic Library, vol. 36 no. 3
Type: Research Article
ISSN: 0264-0473
