Search results
1 – 10 of 48

Mohamed Boudchiche and Azzeddine Mazroui
Abstract
In this paper, we present a hybrid morphological disambiguation system for Arabic that identifies the stem, lemma and root of each word in a given sentence. Following an out-of-context analysis performed by the morphological analyser Alkhalil Morpho Sys, the system first identifies all the potential tags of each word in the sentence. A disambiguation phase is then carried out to choose, for each word, the right solution among those obtained during the first phase. This problem is solved by casting the disambiguation task as a surface optimization problem over spline functions. Tests demonstrate the value of this approach and show that its performance surpasses the state of the art.
Abdelhalim Saadi and Hacene Belhadef
Abstract
Purpose
The purpose of this paper is to present a system based on deep neural networks to extract particular entities from natural language text, given the massive amount of textual information that is now available electronically. The sheer volume of electronic text data makes finding or extracting relevant information from it difficult.
Design/methodology/approach
This study presents an original system to extract Arabic named entities by combining a deep neural network-based part-of-speech tagger and a neural network-based named entity extractor. First, the system extracts the grammatical class of each word with high precision, depending on the word's context; this module serves as the disambiguation step. A second module is then used to extract the named entities.
Findings
Using deep neural networks in natural language processing requires tuning many hyperparameters, which is a time-consuming process. To deal with this problem, statistical methods such as the Taguchi method are often applied. In this study, the system is successfully applied to Arabic named entity recognition, where an accuracy of 96.81 per cent was reported, which is better than the state-of-the-art results.
Research limitations/implications
The system is designed and trained for the Arabic language, but the architecture can be used for other languages.
Practical implications
Information extraction systems are developed for different applications, such as analysing newspaper articles and databases for commercial, political and social objectives. Information extraction systems can also be built over an information retrieval (IR) system, which eliminates irrelevant documents and paragraphs.
Originality/value
The proposed system can be regarded as the first attempt to use two deep neural networks in tandem to increase accuracy. It can also be built over an IR system, which eliminates irrelevant documents and paragraphs and thus reduces the number of documents from which the authors wish to extract relevant information.
Souheila Ben Guirat, Ibrahim Bounhas and Yahya Slimani
Abstract
Purpose
The semantic relations between Arabic word representations were recognized and widely studied in theoretical linguistics many centuries ago. Nonetheless, most previous research in automatic information retrieval (IR) has focused on stem- or root-based indexing, while lemmas and patterns remain under-exploited. The authors believe that each of the four morphological levels encapsulates part of the meaning of words. The purpose, therefore, is to aggregate these levels using more sophisticated approaches to reach the optimal combination that enhances IR.
Design/methodology/approach
The authors first compare state-of-the-art Arabic natural language processing (NLP) tools in IR. This allows them to select the most accurate tool at each representation level, i.e. to develop four basic IR systems. The authors then compare two rank aggregation approaches that combine the results of these systems: the first is based on linear combination, while the second exploits classification-based meta-search.
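A minimal sketch of the first aggregation approach, a weighted linear combination of the document scores returned by the basic systems. The run names, scores and equal weights below are illustrative assumptions, not the authors' data:

```python
def linear_combination(runs, weights):
    """Aggregate per-document scores from several IR systems
    (e.g. root-, stem-, lemma- and pattern-based indexing)
    by a weighted sum, then rank documents by the result."""
    combined = {}
    for run, weight in zip(runs, weights):
        for doc_id, score in run.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical (normalized) scores from two of the basic systems:
stem_run = {"d1": 0.9, "d2": 0.4}
lemma_run = {"d1": 0.2, "d2": 0.8, "d3": 0.5}
ranking = linear_combination([stem_run, lemma_run], [0.5, 0.5])  # → ["d2", "d1", "d3"]
```

The classification-based meta-search replaces these fixed weights with a learned model that predicts, from the per-system scores, whether each document is relevant.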
Findings
Combining different word representation levels consistently and significantly enhances IR results. The proposed classification-based approach outperforms linear combination and all the basic systems.
Research limitations/implications
The work rests on a standard comparative experimental study that assesses several NLP tools and combining approaches on different test collections and IR models. It may therefore help future research to choose the most suitable tools and to develop more sophisticated methods for handling the complexity of the Arabic language.
Originality/value
The originality of the idea is to consider the richness of Arabic an exploitable characteristic rather than a challenging limitation. The authors thus combine four different morphological levels for the first time in Arabic IR, an approach that substantially outperforms previous research results.
Peer review
The peer review history for this article is available at: https://publons.com/publon/10.1108/OIR-11-2020-0515
Carmen Galvez and Félix de Moya‐Anegón
Abstract
Purpose
To evaluate the accuracy of conflation methods based on finite‐state transducers (FSTs).
Design/methodology/approach
Incorrectly lemmatized and stemmed forms may lead to the retrieval of inappropriate documents. Experimental studies to date have focused on retrieval performance, but very few on conflation performance. The normalization process we used involved a linguistic toolbox that allowed us to construct, through graphic interfaces, electronic dictionaries represented internally by FSTs. The lexical resources developed were applied to a Spanish test corpus to merge term variants into canonical lemmatized forms. Conflation performance was evaluated in terms of an adaptation of recall and precision measures based on accuracy and coverage, not actual retrieval. The results were compared with those obtained using a Spanish version of the Porter algorithm.
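One way to read the adapted measures: precision as the accuracy of the forms the system does analyse, and recall as its coverage of all forms it should analyse. A minimal sketch under those assumed definitions (the Spanish word pairs are illustrative, not drawn from the test corpus):

```python
def conflation_metrics(gold, predicted):
    """gold and predicted map each variant form to its canonical
    (lemmatized) form; None marks a form the system left unanalysed."""
    analysed = {v for v in predicted if predicted[v] is not None}
    correct = {v for v in analysed if gold.get(v) == predicted[v]}
    precision = len(correct) / len(analysed) if analysed else 0.0  # accuracy
    recall = len(correct) / len(gold) if gold else 0.0             # coverage
    return precision, recall

gold = {"niñas": "niña", "cantaba": "cantar", "mesas": "mesa"}
pred = {"niñas": "niña", "cantaba": "cantar", "mesas": None}
precision, recall = conflation_metrics(gold, pred)  # 1.0 and 2/3
```

Under these definitions, a lemmatizer that analyses fewer forms but analyses them correctly scores high precision and low recall, matching the paper's conclusion about accuracy versus underanalysis.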
Findings
The conclusion is that the main strength of lemmatization is its accuracy, whereas its main limitation is the underanalysis of variant forms.
Originality/value
The report outlines the potential of transducers in their application to normalization processes.
Ankie Visschedijk and Forbes Gibb
Abstract
This article reviews some of the more unconventional text retrieval systems, emphasising those which have been commercialised. These sophisticated systems improve on conventional retrieval by using either innovative software or hardware to increase retrieval speed or functionality, precision or recall. The software systems reviewed are: AIDA, CLARIT, Metamorph, SIMPR, STATUS/IQ, TCS, TINA and TOPIC. The hardware systems reviewed are: CAFS‐ISP, the Connection Machine, GESCAN, HSTS, MPP, TEXTRACT, TRW‐FDF and URSA.
Mahmoud Al-Ayyoub, Ahmed Alwajeeh and Ismail Hmeidi
Abstract
Purpose
The authorship authentication (AA) problem is concerned with correctly attributing a text document to its corresponding author. Historically, this problem has been the subject of various studies built on the intuitive idea that each author has a unique style that can be captured using stylometric features (SF). Another approach to this problem, known as the bag-of-words (BOW) approach, uses keyword occurrences/frequencies in each document to identify its author. Unlike the first, this approach is more language-independent. This paper aims to study and compare both approaches, focusing on the Arabic language, which remains largely understudied despite its importance.
Design/methodology/approach
Being a supervised learning problem, the authors start by collecting a very large data set of Arabic documents to be used for training and testing purposes. For the SF approach, they compute hundreds of SF, whereas, for the BOW approach, the popular term frequency-inverse document frequency technique is used. Both approaches are compared under various settings.
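A hedged sketch of the TF-IDF weighting at the core of the BOW approach (toy English documents for brevity; the study applies this to Arabic text with a classifier on top):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term's in-document frequency by the log inverse of
    how many documents contain it; rare, distinctive terms score higher."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return [
        {term: count * math.log(n / df[term])
         for term, count in Counter(doc.split()).items()}
        for doc in docs
    ]

docs = ["the cat sat", "the dog sat", "a dog barked"]
vecs = tfidf_vectors(docs)
# "the" occurs in 2 of 3 documents, so it is down-weighted
# relative to "cat", which occurs in only one
```

For authorship, each candidate author's training documents yield such vectors, and a test document is attributed to the author whose model best matches its vector.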
Findings
The results show that the SF approach, which is much cheaper to train, can generate more accurate results under most settings.
Practical implications
Efficiently solving the AA problem brings numerous advantages across academia and industry, including literature, security, forensics, and electronic markets and trading. Another practical implication of this work is the public release of its sources. Specifically, some of the SF can be very useful for other problems, such as sentiment analysis.
Originality/value
This is the first study of its kind to compare the SF and BOW approaches for authorship analysis of Arabic articles. Moreover, many of the computed SF are novel, while other features are inspired by the literature. As SF are language-dependent and most existing papers focus on English, extra effort must be invested to adapt such features to Arabic text.
Carmen Galvez, Félix de Moya‐Anegón and Víctor H. Solana
Abstract
Purpose
To propose a categorization of the different conflation procedures into the two basic approaches, non‐linguistic and linguistic techniques, and to justify the application of normalization methods within the framework of linguistic techniques.
Design/methodology/approach
Presents a range of term conflation methods that can be used in information retrieval. The uniterm and multiterm variants can be considered equivalent units for the purposes of automatic indexing. Stemming algorithms, segmentation rules, association measures and clustering techniques are well-evaluated non‐linguistic methods, and experiments with these techniques show a wide variety of results. Alternatively, lemmatisation and syntactic pattern matching, through equivalence relations represented in finite‐state transducers (FSTs), are emerging methods for the recognition and standardization of terms.
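The contrast is easiest to see on the non-linguistic side. A toy suffix-stripping stemmer, purely illustrative (real algorithms such as Porter's add measure conditions and rewrite rules; the suffix list here is an assumption):

```python
def naive_stem(word, suffixes=("aciones", "ación", "es", "s")):
    """Strip the longest matching suffix, but only if enough of a stem
    remains; this conflates e.g. 'mesas' and 'mesa' to one index term."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

stem = naive_stem("mesas")  # → "mesa"
```

A linguistic FST, by contrast, encodes the equivalence relation between variant and canonical form explicitly, so a variant is mapped to its lemma only when the dictionary licenses that mapping.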
Findings
The survey attempts to point out the positive and negative effects of the linguistic approach and its potential as a term conflation method.
Originality/value
Outlines the importance of FSTs for the normalization of term variants.
Abstract
The linguistic and computational complexities of machine translation are not always apparent to all users or potential purchasers of systems. As a consequence, they are sometimes unable to distinguish between the failings of particular systems and the problems which the best system would have. In this article I shall attempt to outline the difficulties encountered by computers in translating from one natural language into another. This is an introductory paper for those unfamiliar with what computers can and cannot achieve in this field.
Abstract
The recent report for the Commission of the European Communities on current multilingual activities in the field of scientific and technical information and the 1977 conference on the same theme both included substantial sections on operational and experimental machine translation systems, and in its Plan of action the Commission announced its intention to introduce an operational machine translation system into its departments and to support research projects on machine translation. This revival of interest in machine translation may well have surprised many who have tended in recent years to dismiss it as one of the ‘great failures’ of scientific research. What has changed? What grounds are there now for optimism about machine translation? Or is it still a ‘utopian dream’? The aim of this review is to give a general picture of present activities which may help readers to reach their own conclusions. After a sketch of the historical background and general aims (section I), it describes operational and experimental machine translation systems of recent years (section II), continues with descriptions of interactive (man‐machine) systems and machine‐assisted translation (section III), and concludes with a general survey of present problems and future possibilities (section IV).
Abstract
In the 1980s the dominant framework of MT was essentially ‘rule‐based’, e.g. the linguistics‐based approaches of Ariane, METAL, Eurotra, etc.; or the knowledge‐based approaches at Carnegie Mellon University and elsewhere. New approaches of the 1990s are based on large text corpora, the alignment of bilingual texts, the use of statistical methods and the use of parallel corpora for ‘example‐based’ translation. The problems of building large monolingual and bilingual lexical databases and of generating good quality output have come to the fore. In the past most systems were intended to be general‐purpose; now most are designed for specialized applications, e.g. restricted to controlled languages, to a sublanguage or to a specific domain, to a particular organization or to a particular user‐type. In addition, the field is widening with research under way on speech translation, on systems for monolingual users not knowing target languages, on systems for multilingual generation directly from structured databases, and in general for uses other than those traditionally associated with translation services.