RDF graph mining for cluster-based theme identification

Siham Eddamiri (Department of Mathematics and Computer Science, University Moulay Ismail, ENSAM, Meknes, Morocco)
Asmaa Benghabrit (LMAID Laboratory, Universite Mohammed V de Rabat Ecole Mohammadia d'Ingenieurs, Rabat, Morocco)
Elmoukhtar Zemmouri (Department of Mathematics and Computer Science, University Moulay Ismail, ENSAM, Meknes, Morocco)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 28 April 2020

Issue publication date: 3 June 2020

Abstract

Purpose

The purpose of this paper is to present a generic pipeline for Resource Description Framework (RDF) graph mining and to provide a comprehensive review of each step in the knowledge discovery from data process. The authors also investigate different approaches and combinations for extracting feature vectors from RDF graphs, which are then used for clustering and theme identification.

Design/methodology/approach

The proposed methodology comprises four steps. First, the authors generate several graph substructures (Walks, Set of Walks, Walks with backward and Set of Walks with backward). Second, the authors build neural language models to extract numerical vectors from the generated sequences by using word embedding techniques (Word2Vec and Doc2Vec) combined with term frequency-inverse document frequency (TF-IDF). Third, the authors use the well-known K-means algorithm to cluster the RDF graph. Finally, the authors extract the most relevant rdf:type from the grouped vertices to describe the semantics of each theme and generate its label.
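
As a rough illustration of the first and last steps, the following Python sketch enumerates walks over a toy triple set and labels a group of vertices with its most frequent rdf:type. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy triples, the function names (`build_adjacency`, `walks_from`, `cluster_label`), the `^` prefix for reversed predicates and the reading of "with backward" as "edges may also be followed from object to subject" are all assumptions made here. The embedding and K-means steps are omitted.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Toy triples (hypothetical data, not taken from AIFB, BGS or Conference).
TRIPLES = [
    ("alice", "rdf:type", "Person"),
    ("bob", "rdf:type", "Person"),
    ("alice", "authorOf", "paper1"),
    ("bob", "authorOf", "paper1"),
    ("paper1", "rdf:type", "Publication"),
]


def build_adjacency(triples, backward=False):
    """Map each vertex to its outgoing (predicate, neighbour) pairs.

    Assumption: with backward=True every edge can also be traversed from
    object to subject; the reversed predicate is marked with a '^' prefix.
    """
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
        if backward:
            adj[o].append(("^" + p, s))
    return adj


def walks_from(vertex, adj, depth):
    """Enumerate every walk of 1..depth edges that starts at `vertex`.

    A walk is a list alternating vertices and predicates, which can be
    treated as a token sequence by a language model.
    """
    walks = []
    frontier = [[vertex]]
    for _ in range(depth):
        next_frontier = []
        for walk in frontier:
            for pred, nxt in adj.get(walk[-1], []):
                next_frontier.append(walk + [pred, nxt])
        walks.extend(next_frontier)
        frontier = next_frontier
    return walks


def cluster_label(cluster, triples):
    """Return the most frequent rdf:type among a cluster's vertices, or None."""
    members = set(cluster)
    types = Counter(o for s, p, o in triples if p == RDF_TYPE and s in members)
    return types.most_common(1)[0][0] if types else None


forward_walks = walks_from("alice", build_adjacency(TRIPLES), depth=2)
backward_walks = walks_from("alice", build_adjacency(TRIPLES, backward=True), depth=2)
label = cluster_label(["alice", "bob"], TRIPLES)
```

In this toy graph, allowing backward traversal lets a walk from `alice` reach `bob` through their shared publication, which a forward-only walk cannot do. The resulting token sequences are the kind of input that Word2Vec or Doc2Vec models consume.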

Findings

The experimental evaluation on state-of-the-art data sets (AIFB, BGS and Conference) shows that the combination of Set of Walks with backward with TF-IDF and Doc2Vec techniques gives excellent results. In fact, the clustering results reach more than 97% and 90% in terms of purity and F-measure, respectively. Concerning the theme identification, the results show that by using the same combination, the purity and F-measure criteria reach more than 90% for all the considered data sets.
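
For readers unfamiliar with the purity criterion, the sketch below computes it under its common definition: each cluster is credited with the size of its majority class, and the total is divided by the number of items. The abstract does not give the exact formulation used in the paper, so this definition, the function name `purity` and the toy labels are assumptions.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty clustering")

    # Count the class labels that occur inside each cluster.
    by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1

    # Sum the majority-class sizes, then normalise by the number of items.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical example: five vertices, two clusters, two true classes.
score = purity([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
```

Here cluster 0 contributes two correctly grouped items and cluster 1 contributes two, so four of the five items agree with their cluster's majority class.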

Originality/value

The originality of this paper lies in two aspects: first, a new machine learning pipeline for RDF data is presented; second, an efficient process to identify and extract relevant graph substructures from an RDF graph is proposed. The proposed techniques were combined with different neural language models to improve the accuracy and relevance of the obtained feature vectors that will be fed to the clustering mechanism.

Citation

Eddamiri, S., Benghabrit, A. and Zemmouri, E. (2020), "RDF graph mining for cluster-based theme identification", International Journal of Web Information Systems, Vol. 16 No. 2, pp. 223-247. https://doi.org/10.1108/IJWIS-10-2019-0048

Publisher

Emerald Publishing Limited

Copyright © 2020, Emerald Publishing Limited
