Search results

1 – 3 of 3
Article
Publication date: 18 April 2017

Mahmoud Al-Ayyoub, Ahmed Alwajeeh and Ismail Hmeidi

The authorship authentication (AA) problem is concerned with correctly attributing a text document to its corresponding author. Historically, this problem has been the focus of…

Abstract

Purpose

The authorship authentication (AA) problem is concerned with correctly attributing a text document to its corresponding author. Historically, this problem has been the focus of various studies focusing on the intuitive idea that each author has a unique style that can be captured using stylometric features (SF). Another approach to this problem, known as the bag-of-words (BOW) approach, uses keywords occurrences/frequencies in each document to identify its author. Unlike the first one, this approach is more language-independent. This paper aims to study and compare both approaches focusing on the Arabic language which is still largely understudied despite its importance.

Design/methodology/approach

Being a supervised learning problem, the authors start by collecting a very large data set of Arabic documents to be used for training and testing purposes. For the SF approach, they compute hundreds of SF, whereas, for the BOW approach, the popular term frequency-inverse document frequency technique is used. Both approaches are compared under various settings.

Findings

The results show that the SF approach, which is much cheaper to train, can generate more accurate results under most settings.

Practical implications

Numerous advantages of efficiently solving the AA problem are obtained in different fields of academia as well as the industry including literature, security, forensics, electronic markets and trading, etc. Another practical implication of this work is the public release of its sources. Specifically, some of the SF can be very useful for other problems such as sentiment analysis.

Originality/value

This is the first study of its kind to compare the SF and BOW approaches for authorship analysis of Arabic articles. Moreover, many of the computed SF are novel, while other features are inspired by the literature. As SF are language-dependent and most existing papers focus on English, extra effort must be invested to adapt such features to Arabic text.

Details

International Journal of Web Information Systems, vol. 13 no. 1
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 4 October 2018

Maha Al-Yahya

In the context of information retrieval, text genre is as important as its content, and knowledge of the text genre enhances the search engine features by providing customized…

Abstract

Purpose

In the context of information retrieval, text genre is as important as its content, and knowledge of the text genre enhances the search engine features by providing customized retrieval. The purpose of this study is to explore and evaluate the use of stylometric analysis, a quantitative analysis for the linguistics features of text, to support the task of automated text genre detection for Classical Arabic text.

Design/methodology/approach

Unsupervised clustering and supervised classification were applied on the King Saud University Corpus of Classical Arabic texts (KSUCCA) using the most frequent words in the corpus (MFWs) as stylometric features. Four popular distance measures established in stylometric research are evaluated for the genre detection task.

Findings

The results of the experiments show that stylometry-based genre clustering and classification align well with human-defined genre. The evidence suggests that genre style signals exist for Classical Arabic and can be used to support the task of automated genre detection.

Originality/value

This work targets the task of genre detection in Classical Arabic text using stylometric features, an approach that has only been previously applied to Arabic authorship attribution. The study also provides a comparison of four distance measures used in stylomtreic analysis on the KSUCCA, a corpus with over 50 million words of Classical Arabic using clustering and classification.

Details

The Electronic Library, vol. 36 no. 5
Type: Research Article
ISSN: 0264-0473

Keywords

Article
Publication date: 7 November 2016

Ismail Hmeidi, Mahmoud Al-Ayyoub, Nizar A. Mahyoub and Mohammed A. Shehab

Multi-label Text Classification (MTC) is one of the most recent research trends in data mining and information retrieval domains because of many reasons such as the rapid growth…

Abstract

Purpose

Multi-label Text Classification (MTC) is one of the most recent research trends in data mining and information retrieval domains because of many reasons such as the rapid growth of online data and the increasing tendency of internet users to be more comfortable with assigning multiple labels/tags to describe documents, emails, posts, etc. The dimensionality of labels makes MTC more difficult and challenging compared with traditional single-labeled text classification (TC). Because it is a natural extension of TC, several ways are proposed to benefit from the rich literature of TC through what is called problem transformation (PT) methods. Basically, PT methods transform the multi-label data into a single-label one that is suitable for traditional single-label classification algorithms. Another approach is to design novel classification algorithms customized for MTC. Over the past decade, several works have appeared on both approaches focusing mainly on the English language. This work aims to present an elaborate study of MTC of Arabic articles.

Design/methodology/approach

This paper presents a novel lexicon-based method for MTC, where the keywords that are most associated with each label are extracted from the training data along with a threshold that can later be used to determine whether each test document belongs to a certain label.

Findings

The experiments show that the presented approach outperforms the currently available approaches. Specifically, the results of our experiments show that the best accuracy obtained from existing approaches is only 18 per cent, whereas the accuracy of the presented lexicon-based approach can reach an accuracy level of 31 per cent.

Originality/value

Although there exist some tools that can be customized to address the MTC problem for Arabic text, their accuracies are very low when applied to Arabic articles. This paper presents a novel method for MTC. The experiments show that the presented approach outperforms the currently available approaches.

Details

International Journal of Web Information Systems, vol. 12 no. 4
Type: Research Article
ISSN: 1744-0084

Keywords

1 – 3 of 3