Search results
Ramzi A. Haraty and Rouba Nasrallah
Abstract
Purpose
The purpose of this paper is to propose a new model to enhance the auto-indexing of Arabic texts. The model extracts new relevant words by using data mining rules to relate the words chosen by previous classical methods to new words.
Design/methodology/approach
The proposed model uses an association rule algorithm to extract frequent sets of related items, and thereby the relationships between words in the texts to be indexed and words from texts that belong to the same category. The extracted word associations are represented as sets of words that frequently appear together.
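The idea of frequent word sets and association rules can be illustrated with a minimal sketch. This is not the authors' algorithm: it only counts co-occurring word pairs and computes rule confidence. The function names, the `min_support` value and the English toy tokens (standing in for Arabic stems) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_word_pairs(documents, min_support=0.5):
    """Return word pairs whose document-level support is at least min_support."""
    pair_counts = Counter()
    for words in documents:
        # Count each pair once per document, in a canonical (sorted) order.
        for pair in combinations(sorted(set(words)), 2):
            pair_counts[pair] += 1
    n = len(documents)
    return {pair: count / n for pair, count in pair_counts.items()
            if count / n >= min_support}


def confidence(documents, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    with_antecedent = [set(d) for d in documents if antecedent in d]
    if not with_antecedent:
        return 0.0
    hits = sum(consequent in d for d in with_antecedent)
    return hits / len(with_antecedent)


# Toy corpus (hypothetical): each document is a list of already-stemmed tokens.
docs = [
    ["bank", "loan", "interest"],
    ["bank", "loan", "market"],
    ["bank", "market"],
    ["loan", "interest"],
]
pairs = frequent_word_pairs(docs, min_support=0.5)
rule_conf = confidence(docs, "interest", "loan")
```

In this toy corpus every document containing "interest" also contains "loan", so a document indexed under "interest" could also be indexed under "loan".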
Findings
The proposed methodology shows significant enhancement in terms of accuracy, efficiency and reliability when compared to previous works.
Research limitations/implications
The stemming algorithm can be further enhanced. Arabic has many grammatical rules, and the more of these rules are integrated into the stemming algorithm, the better the stemming will be. The stop-list can also be enhanced by adding more words that should not be taken into consideration in the indexing mechanism. Numbers should be added to the list as well, and a thesaurus system should be used because it links different phrases or words with the same meaning to each other, which improves the indexing mechanism. The authors also invite researchers to add more pre-requisite texts to obtain better results.
Originality/value
In this paper, the authors present a full text-based auto-indexing method for Arabic text documents. The auto-indexing method extracts new relevant words by using data mining rules, which has not been investigated before. The method uses an association rule mining algorithm to extract frequent sets of related items, and thereby the relationships between words in the texts to be indexed and words from texts that belong to the same category. The benefits of the method are demonstrated using empirical work involving several Arabic texts.
Abdelhalim Saadi and Hacene Belhadef
Abstract
Purpose
The purpose of this paper is to present a system based on deep neural networks to extract particular entities from natural language text, knowing that a massive amount of textual information is electronically available at present. Notably, such a large amount of electronic text data makes it very difficult to find or extract relevant information.
Design/methodology/approach
This study presents an original system to extract Arabic named entities by combining a deep neural network-based part-of-speech tagger and a neural network-based named entity extractor. Firstly, the system extracts the grammatical classes of the words with high precision, depending on the context of each word. This module plays the role of the disambiguation process. Then, a second module is used to extract the named entities.
Findings
Using deep neural networks in natural language processing requires tuning many hyperparameters, which is a time-consuming process. To deal with this problem, statistical methods such as the Taguchi method are much in demand. In this study, the system is successfully applied to Arabic named entity recognition, where an accuracy of 96.81 per cent was reported, which is better than the state-of-the-art results.
Research limitations/implications
The system is designed and trained for the Arabic language, but the architecture can be used for other languages.
Practical implications
Information extraction systems are developed for different applications, such as analysing newspaper articles and databases for commercial, political and social objectives. Information extraction systems can also be built over an information retrieval (IR) system. The IR system eliminates irrelevant documents and paragraphs.
Originality/value
The proposed system can be regarded as the first attempt to use double deep neural networks to increase the accuracy. It can also be built over an IR system. The IR system eliminates irrelevant documents and paragraphs. This process reduces the large number of documents from which the authors wish to extract the relevant information using an information extraction system.
Mahmoud Al-Ayyoub, Ahmed Alwajeeh and Ismail Hmeidi
Abstract
Purpose
The authorship authentication (AA) problem is concerned with correctly attributing a text document to its corresponding author. Historically, this problem has been the subject of various studies built on the intuitive idea that each author has a unique style that can be captured using stylometric features (SF). Another approach to this problem, known as the bag-of-words (BOW) approach, uses keyword occurrences/frequencies in each document to identify its author. Unlike the first one, this approach is more language-independent. This paper aims to study and compare both approaches, focusing on the Arabic language, which is still largely understudied despite its importance.
Design/methodology/approach
As this is a supervised learning problem, the authors start by collecting a very large data set of Arabic documents to be used for training and testing purposes. For the SF approach, they compute hundreds of SF, whereas, for the BOW approach, the popular term frequency-inverse document frequency technique is used. Both approaches are compared under various settings.
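For readers unfamiliar with term frequency-inverse document frequency, the following is a minimal sketch of one common variant (length-normalised term frequency multiplied by an unsmoothed logarithmic inverse document frequency). The abstract does not state which variant the authors used, so the weighting, the function name `tf_idf` and the toy tokens are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per tokenised document."""
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for words in documents:
        df.update(set(words))
    vectors = []
    for words in documents:
        counts = Counter(words)
        total = len(words)
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus (hypothetical tokens).
corpus = [["a", "b", "a"], ["a", "c"]]
weights = tf_idf(corpus)
```

A term that occurs in every document, such as "a" here, receives a weight of zero, while terms that are specific to one document receive positive weights.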
Findings
The results show that the SF approach, which is much cheaper to train, can generate more accurate results under most settings.
Practical implications
Efficiently solving the AA problem offers numerous advantages in different fields of academia and industry, including literature, security, forensics, and electronic markets and trading. Another practical implication of this work is the public release of its sources. Specifically, some of the SF can be very useful for other problems such as sentiment analysis.
Originality/value
This is the first study of its kind to compare the SF and BOW approaches for authorship analysis of Arabic articles. Moreover, many of the computed SF are novel, while other features are inspired by the literature. As SF are language-dependent and most existing papers focus on English, extra effort must be invested to adapt such features to Arabic text.
Abstract
Developing software for processing bibliographic materials in the Arabic language is a relatively recent development. When libraries in parts of the Middle East, where Arabic is the main language, started automating their collections, most library systems did not provide for the use of Arabic script and this capability had to be developed. Automated library systems started to emerge (like Minisis, ALEPH, Dobis/Libis, TinLib, OLIB) to fill the gap for non‐Roman scripts. This article describes the stages the American University of Beirut Libraries went through in converting their Arabic materials for use in the OLIB7 library management system. A description of the background of the library is given along with the details of the romanisation process, the conversion process, the software and hardware chosen, the testing of the database, problems encountered, output and the handling of authority records.
Mansoor Alghamdi and William Teahan
Abstract
Purpose
The aim of this paper is to experimentally evaluate the effectiveness of the state-of-the-art printed Arabic text recognition systems to determine open areas for future improvements. In addition, this paper proposes a standard protocol with a set of metrics for measuring the effectiveness of Arabic optical character recognition (OCR) systems to assist researchers in comparing different Arabic OCR approaches.
Design/methodology/approach
This paper describes an experiment to automatically evaluate four well-known Arabic OCR systems using a set of performance metrics. The evaluation experiment is conducted on a publicly available printed Arabic dataset comprising 240 text images with a variety of resolution levels, font types, font styles and font sizes.
Findings
The experimental results show that the field of character recognition for printed Arabic still requires further research to reach an efficient text recognition method for Arabic script.
Originality/value
To the best of the authors’ knowledge, this is the first work that provides a comprehensive automated evaluation of Arabic OCR systems with respect to the characteristics of Arabic script and, in addition, proposes an evaluation methodology that can be used as a benchmark by researchers and therefore will contribute significantly to the enhancement of the field of Arabic script recognition.
Abstract
Arabic script is the most recent addition to the scripts available on the Research Libraries Information Network (RLIN). Bibliographic control and retrieval using the authentic writing system are available for titles in Arabic, Persian (Farsi), Urdu, Ottoman Turkish, and other languages written with Arabic script. RLIN is the world's largest bibliographic database for Middle Eastern language material. This paper is a comprehensive description of the Arabic script features of RLIN. It covers Arabic character sets and RLIN's character repertoire for Arabic script; how Arabic characters are input and stored in the RLIN database; the equipment needed for Arabic script support; the indexing, retrieval, and presentation of records containing Arabic script; the inclusion of non‐Roman data in USMARC bibliographic records; and statistics on the RLIN databases. Sidebars explain features of Arabic writing. The discussion of data storage and presentation of text is relevant to any computer application that involves Arabic script.
AbdulMalik Al‐Salman, Mohamed Alkanhal, Yousef AlOhali, Hazem Al‐Rashed and Bander Al‐Sulami
Abstract
Purpose
The purpose of this paper is to describe the development of a system called Mubser to translate Arabic and English Braille into normal text. The system can automatically detect the source language and the Braille grade.
Design/methodology/approach
The Mubser system was designed under the MS‐Windows environment and implemented using Visual C# 2.0 with an Arabic interface. The system uses the concept of a rule file to translate supported languages from Braille to text. The rule file is based on the XML format. The identification of the source language and grade is based on a statistical approach.
Findings
From the literature review, the authors found that most research and products do not support bilingual translation from Braille to text in either contracted or un‐contracted Braille. The Mubser system is a robust system that fills that gap. It helps both visually impaired and sighted people, especially native Arabic speakers, to translate from Braille to text.
Research limitations/implications
Mubser is being implemented and tested by the authors for both Arabic and English languages. The tests performed so far have shown excellent results. In the future, it is planned to integrate the system with an optical Braille recognition system, enhance the system to accept new languages, support maths and scientific symbols, and add spell checkers.
Practical implications
There is a desperate need for such a system to translate Braille into normal text. This system helps both sighted and blind people to communicate better.
Originality/value
This paper presents a novel system for converting Braille codes (Arabic and English) into normal text.
Ismail Hmeidi, Mahmoud Al-Ayyoub, Nizar A. Mahyoub and Mohammed A. Shehab
Abstract
Purpose
Multi-label Text Classification (MTC) is one of the most recent research trends in the data mining and information retrieval domains for many reasons, such as the rapid growth of online data and the increasing tendency of internet users to be more comfortable with assigning multiple labels/tags to describe documents, emails, posts, etc. The dimensionality of labels makes MTC more difficult and challenging than traditional single-labeled text classification (TC). Because MTC is a natural extension of TC, several ways have been proposed to benefit from the rich TC literature through what are called problem transformation (PT) methods. Basically, PT methods transform the multi-label data into single-label data that is suitable for traditional single-label classification algorithms. Another approach is to design novel classification algorithms customized for MTC. Over the past decade, several works have appeared on both approaches, focusing mainly on the English language. This work aims to present an elaborate study of MTC of Arabic articles.
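As an illustration of problem transformation, the sketch below applies binary relevance, one widely used PT method that turns a multi-label data set into one binary (single-label) data set per label. The abstract does not say which PT methods the authors evaluated; the function name and the toy data are assumptions.

```python
def binary_relevance(samples, labels):
    """Split multi-label samples into one binary data set per label.

    samples: list of (text, set_of_labels) pairs.
    Returns {label: [(text, 0 or 1), ...]}.
    """
    return {
        label: [(text, int(label in tags)) for text, tags in samples]
        for label in labels
    }


# Toy multi-label data (hypothetical).
data = [
    ("doc1", {"sports", "politics"}),
    ("doc2", {"sports"}),
    ("doc3", set()),
]
per_label = binary_relevance(data, ["sports", "politics"])
```

Each resulting data set can then be handed to any traditional single-label classifier, and the per-label predictions are combined to form the final label set.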
Design/methodology/approach
This paper presents a novel lexicon-based method for MTC, where the keywords that are most associated with each label are extracted from the training data along with a threshold that can later be used to determine whether each test document belongs to a certain label.
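A lexicon-plus-threshold classifier of this general kind could be sketched as follows. The abstract does not describe how keywords are scored or how the threshold is chosen, so everything specific here is an assumption: keywords are ranked by raw frequency, `top_k` and `threshold` are hypothetical parameters, and the tokens are toy English stand-ins.

```python
from collections import Counter


def build_lexicons(training, top_k=2):
    """Build {label: set of the top_k most frequent tokens} from training data.

    training: list of (tokens, set_of_labels) pairs.
    """
    counts = {}
    for tokens, labels in training:
        for label in labels:
            counts.setdefault(label, Counter()).update(tokens)
    return {label: {word for word, _ in counter.most_common(top_k)}
            for label, counter in counts.items()}


def predict(tokens, lexicons, threshold=0.3):
    """Assign every label whose lexicon covers at least `threshold` of the tokens."""
    if not tokens:
        return set()
    predicted = set()
    for label, lexicon in lexicons.items():
        score = sum(token in lexicon for token in tokens) / len(tokens)
        if score >= threshold:
            predicted.add(label)
    return predicted


# Toy training set (hypothetical).
train = [
    (["goal", "match", "goal"], {"sports"}),
    (["vote", "election", "vote", "election"], {"politics"}),
    (["goal", "vote", "match", "election"], {"sports", "politics"}),
]
lexicons = build_lexicons(train, top_k=2)
both = predict(["goal", "match", "vote", "election"], lexicons)
only_sports = predict(["goal", "goal", "match"], lexicons)
```

Because each label is tested independently against its own lexicon, a document can receive several labels, one label or none.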
Findings
The experiments show that the presented approach outperforms the currently available approaches. Specifically, the best accuracy obtained from existing approaches is only 18 per cent, whereas the presented lexicon-based approach can reach an accuracy of 31 per cent.
Originality/value
Although there exist some tools that can be customized to address the MTC problem for Arabic text, their accuracies are very low when applied to Arabic articles. This paper presents a novel method for MTC. The experiments show that the presented approach outperforms the currently available approaches.
Ali Selamat and Choon‐Ching Ng
Abstract
Purpose
With the rapid emergence and explosion of the internet and the trend of globalization, a tremendous number of textual documents written in different languages are electronically accessible online from the World Wide Web. Efficiently and effectively managing these documents written in different languages is important to organizations and individuals. Therefore, the purpose of this paper is to propose letter frequency neural networks to enhance the performance of language identification.
Design/methodology/approach
Initially, the paper analyzes the feasibility of using a windowing algorithm to find the best method for selecting features for language identification of Arabic script documents using backpropagation neural networks. Previously, it had been found that the sliding window and non‐sliding window algorithms used as feature selection methods in the experiments did not yield good results. Therefore, this paper proposes language identification of Arabic script documents based on letter frequency using a backpropagation neural network, with datasets of Arabic, Persian, Urdu and Pashto documents, all of which are Arabic script languages.
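The letter frequency feature can be sketched as a vector of relative frequencies over a fixed alphabet, which would then serve as the input to a classifier such as a backpropagation neural network (not shown here). The three-letter alphabet and the sample word are toy assumptions; the authors' actual feature set is not given in the abstract.

```python
from collections import Counter

# Toy alphabet (assumption): ALEF, BEH and PEH. PEH (U+067E) is used in
# Persian, Urdu and Pashto but not in standard Arabic, so its frequency
# helps to separate these languages.
ALPHABET = ["\u0627", "\u0628", "\u067e"]


def letter_frequencies(text, alphabet=ALPHABET):
    """Return the relative frequency of each alphabet letter in text."""
    letters = [ch for ch in text if ch in alphabet]
    counts = Counter(letters)
    total = len(letters)
    return [counts[ch] / total if total else 0.0 for ch in alphabet]


# The word BEH-ALEF-BEH contains one ALEF, two BEH and no PEH.
features = letter_frequencies("\u0628\u0627\u0628")
```

Characters outside the chosen alphabet, such as spaces and punctuation, are ignored, so the features always sum to one for any text that contains at least one alphabet letter.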
Findings
The experiments have shown that the average root mean squared error of Arabic script document language identification based on the letter frequency feature selection algorithm is lower than that of the windowing algorithm.
Originality/value
This paper highlights the fact that using neural networks with proper feature selection methods will increase the performance of language identification.
Suliman A. Alsuhibany, Muna Almushyti, Noorah Alghasham and Fatimah Alkhudhayr
Abstract
Purpose
Nowadays, there is a high demand for online services and applications. However, it is a challenge to keep these applications secure by applying methods other than traditional approaches such as passwords and usernames. Keystroke dynamics is one of the alternative authentication methods that provides a high level of security, and the keyboard used plays an important role in its recognition accuracy. To guarantee the robustness of a system in different practical situations, there is a need to examine how much the performance of the system is affected by changing the keyboard layout. This paper aims to investigate the impact of using different keyboards on the recognition accuracy for Arabic free-text typing.
Design/methodology/approach
To evaluate how much the performance of the system is affected by changing the keyboard layout, an experimental study is conducted using two different keyboards: a Mac keyboard and an HP keyboard.
Findings
With the Mac keyboard, the false rejection rate (FRR) was 0.20, whilst the false acceptance rate (FAR) was 0.44. These values changed with the HP keyboard, where the FRR was 0.08 and the FAR was 0.60.
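The two error rates follow their standard definitions: FRR is the share of genuine attempts that are rejected, and FAR is the share of impostor attempts that are accepted. The sketch below computes both from lists of accept/reject decisions; the function name and the decision lists are invented for illustration and are not the study's data.

```python
def error_rates(genuine_decisions, impostor_decisions):
    """Compute (FRR, FAR) from accept/reject decisions.

    Each list holds one boolean per authentication attempt,
    where True means the system accepted the attempt.
    """
    frr = genuine_decisions.count(False) / len(genuine_decisions)
    far = impostor_decisions.count(True) / len(impostor_decisions)
    return frr, far


# Invented decisions: 1 of 10 genuine attempts rejected,
# 3 of 10 impostor attempts accepted.
genuine = [True] * 9 + [False]
impostor = [True] * 3 + [False] * 7
frr, far = error_rates(genuine, impostor)
```

Lowering one rate typically raises the other, which is consistent with the reported results, where the HP keyboard gave a lower FRR but a higher FAR than the Mac keyboard.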
Research limitations/implications
The number of participants in the experiment was limited; the authors had been targeting many more participants.
Originality/value
These results show for the first time the impact of the keyboard on the system's recognition accuracy when typing Arabic free-text.