Search results

1 – 10 of 459
Article
Publication date: 3 February 2020

Nikola Nikolić, Olivera Grljević and Aleksandar Kovačević

Student recruitment and retention are important issues for all higher education institutions. Constant monitoring of student satisfaction levels is therefore crucial…

Abstract

Purpose

Student recruitment and retention are important issues for all higher education institutions. Constant monitoring of student satisfaction levels is therefore crucial. Traditionally, students voice their opinions through official surveys organized by the universities. In addition, nowadays, social media and review websites such as “Rate my professors” are rich sources of opinions that should not be ignored. Automated mining of students’ opinions can be realized via aspect-based sentiment analysis (ABSA). ABSA is a sub-discipline of natural language processing (NLP) that focuses on the identification of sentiments (negative, neutral, positive) and aspects (sentiment targets) in a sentence. The purpose of this paper is to introduce a system for ABSA of free-text reviews expressed in student opinion surveys in the Serbian language. Sentiment analysis was carried out at the finest level of text granularity – the level of the sentence segment (phrase and clause).

Design/methodology/approach

The presented system relies on NLP techniques, machine learning models, rules and dictionaries. The corpora collected and annotated for system development and evaluation comprise students’ reviews of teaching staff at the Faculty of Technical Sciences, University of Novi Sad, Serbia, and a corpus of publicly available reviews from the Serbian equivalent of the “Rate my professors” website.
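
As a rough illustration of the dictionary-and-rule component, the sketch below tags segment-level (aspect, polarity) pairs with toy English lexicons; the actual system additionally uses trained machine learning models and Serbian-language resources, and all names here are illustrative assumptions.

```python
import re

# Toy lexicons; the real system uses Serbian dictionaries and trained models.
ASPECTS = {"professor": "teaching staff", "lectures": "lectures",
           "exam": "examination"}
SENTIMENT = {"excellent": 1, "clear": 1, "boring": -1, "unfair": -1}
NEGATORS = {"not", "never"}

def absa(review):
    """Split a review into clause-level segments and tag (aspect, polarity)."""
    results = []
    for segment in re.split(r"[,;.]| but ", review.lower()):
        words = segment.split()
        score = sum(SENTIMENT.get(w, 0) for w in words)
        if NEGATORS & set(words):          # crude negation-flip rule
            score = -score
        for w in words:
            if w in ASPECTS:
                polarity = "pos" if score > 0 else "neg" if score < 0 else "neu"
                results.append((ASPECTS[w], polarity))
    return results

print(absa("The professor is excellent, but lectures are boring"))
# -> [('teaching staff', 'pos'), ('lectures', 'neg')]
```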

Findings

The research results indicate that positive sentiment can be identified with an F-measure of 0.83, while negative sentiment can be detected with an F-measure of 0.94. The F-measure for aspects ranges between 0.49 and 0.89, depending on their frequency in the corpus. Furthermore, the authors have concluded that the quality of ABSA depends on the source of the reviews (official student surveys vs review websites).
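
For reference, the F-measure quoted here and throughout these abstracts is, assuming the standard convention, the balanced harmonic mean of precision (P) and recall (R), computed from true positives (TP), false positives (FP) and false negatives (FN):

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R}
```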

Practical implications

The system for ABSA presented in this paper could improve the quality of service provided by the Serbian higher education institutions through a more effective search and summary of students’ opinions. For example, a particular educational institution could very easily find out which aspects of their service the students are not satisfied with and to which aspects of their service more attention should be directed.

Originality/value

To the best of the authors’ knowledge, this is the first study of ABSA carried out at the level of the sentence segment for the Serbian language. The methodology and findings presented in this paper provide a much-needed basis for further work on sentiment analysis for the Serbian language, which is under-resourced and under-researched in this area.

Article
Publication date: 11 November 2014

Shuhei Yamamoto and Tetsuji Satoh

This paper aims to propose a multi-label method that estimates appropriate aspects for unknown tweets using a two-phase estimation method. Many Twitter users share…

Abstract

Purpose

This paper aims to propose a multi-label method that estimates appropriate aspects for unknown tweets using a two-phase estimation method. Many Twitter users share daily events and opinions, and beneficial comments are posted on such real-life aspects as eating, traffic and weather. Posts such as “The train is not coming” are categorized under the Traffic aspect, while tweets such as “The train is delayed by heavy rain” are categorized under both the Traffic and Weather aspects.

Design/methodology/approach

The proposed method consists of two phases. In the first, many topics are extracted from a sea of tweets using Latent Dirichlet Allocation (LDA). In the second, associations between the many topics and the fewer aspects are built using a small set of labeled tweets. The aspect scores for tweets are calculated from these associations based on the extracted terms, and appropriate aspects are labeled for unknown tweets by averaging the aspect scores.
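
A minimal sketch of this two-phase idea, using scikit-learn's LDA on a bag-of-words model; the corpus, the topic count and the way topic-aspect associations are normalized are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["the train is not coming", "the train is delayed by heavy rain"]
aspect_names = ["Traffic", "Weather"]
labels = np.array([[1, 0],               # multi-label aspect annotations
                   [1, 1]])              # for the small labeled set

# Phase 1: unsupervised topic extraction with LDA.
vec = CountVectorizer()
X = vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
topic_dist = lda.fit_transform(X)        # tweet x topic probabilities

# Phase 2: topic-aspect associations from the labeled tweets.
assoc = topic_dist.T @ labels            # topic x aspect association mass
assoc = assoc / assoc.sum(axis=0, keepdims=True)

def aspect_scores(text):
    """Score an unknown tweet against every aspect via its topic mixture."""
    dist = lda.transform(vec.transform([text]))
    return (dist @ assoc)[0]             # averaged topic-aspect evidence

print(dict(zip(aspect_names, aspect_scores("heavy rain on my commute"))))
```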

Findings

Experimental evaluations using a large number of actual tweets demonstrate the high efficiency of the proposed multi-label classification method. It is confirmed that aspects with a high F-measure are strongly associated with topics of high relevance, while aspects with a low F-measure are associated with topics that are connected to many other aspects.

Originality/value

The proposed method features two-phase semi-supervised learning. Many topics are extracted using an unsupervised learning model called LDA. Associations among many topics and fewer aspects are built using labeled tweets.

Details

International Journal of Web Information Systems, vol. 10 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 27 November 2020

Chaoqun Wang, Zhongyi Hu, Raymond Chiong, Yukun Bao and Jiang Wu

The aim of this study is to propose an efficient rule extraction and integration approach for identifying phishing websites. The proposed approach can elucidate patterns…

Abstract

Purpose

The aim of this study is to propose an efficient rule extraction and integration approach for identifying phishing websites. The proposed approach can elucidate patterns of phishing websites and identify them accurately.

Design/methodology/approach

Hyperlink indicators along with URL-based features are used to build the identification model. In the proposed approach, very simple rules are first extracted based on individual features to provide meaningful and easy-to-understand rules. Then, the F-measure score is used to select high-quality rules for identifying phishing websites. To construct a reliable and promising phishing website identification model, the selected rules are integrated using a simple neural network model.
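
A hedged sketch of the described pipeline: one simple rule per feature, rule selection by F-measure, and integration of the selected rules' outputs with a small neural network. The 0.5 selection threshold, the toy data and the network size are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier

# Toy data: binary hyperlink/URL indicators; 1 = phishing, 0 = legitimate.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = (X[:, 0] | X[:, 3]).astype(int)          # synthetic ground truth

# Step 1: one very simple rule per feature ("indicator fires -> phishing").
rule_preds = [(X[:, j] > 0).astype(int) for j in range(X.shape[1])]

# Step 2: keep only high-quality rules, judged by their F-measure.
selected = [j for j, p in enumerate(rule_preds) if f1_score(y, p) > 0.5]

# Step 3: integrate the selected rules with a simple neural network.
R = np.column_stack([rule_preds[j] for j in selected])
clf = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
clf.fit(R, y)
print("selected rule features:", selected, "train accuracy:", clf.score(R, y))
```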

Findings

Experiments conducted using self-collected and benchmark data sets show that the proposed approach outperforms 16 commonly used classifiers (including seven non–rule-based and four rule-based classifiers as well as five deep learning models) in terms of interpretability and identification performance.

Originality/value

Investigating patterns of phishing websites based on hyperlink indicators using the efficient rule-based approach is innovative. It is not only helpful for identifying phishing websites, but also beneficial for extracting simple and understandable rules.

Details

The Electronic Library, vol. 38 no. 5/6
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 27 September 2011

Aleksandar Kovačević, Dragan Ivanović, Branko Milosavljević, Zora Konjović and Dušan Surla

The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the…

Abstract

Purpose

The aim of this paper is to develop a system for automatic extraction of metadata from scientific papers in PDF format for the information system for monitoring the scientific research activity of the University of Novi Sad (CRIS UNS).

Design/methodology/approach

The system is based on machine learning and performs automatic extraction and classification of metadata into eight pre-defined categories. The extraction task is realised as a classification process: each row of text is represented by a vector comprising different features (formatting, position, characteristics related to the words, etc.). Experiments were performed with standard classification models; both a single classifier covering all eight categories and eight individual classifiers were tested. Classifiers were evaluated using five-fold cross-validation on a manually annotated corpus comprising 100 scientific papers in PDF format, collected from various conferences, journals and authors' personal web pages.
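
A minimal sketch of this row-classification setup: a toy feature vector per line of text and one binary SVM (shown for a hypothetical "email" category) evaluated with five-fold cross-validation. The category names and the features are placeholders, not the authors' feature set.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Assumed category names for illustration only.
CATEGORIES = ["title", "authors", "affiliation", "address",
              "email", "abstract", "keywords", "publication_note"]

def line_features(text, row_index):
    """Toy stand-in for the formatting/position/word features per text row."""
    words = text.split()
    return [row_index, len(words),
            sum(w[:1].isupper() for w in words), text.count("@")]

rows = [("A Study of Metadata Extraction", 0), ("jane.doe@example.org", 7)] * 50
X = np.array([line_features(t, i) for t, i in rows])

# One binary classifier per category; shown here for "email" only.
y_email = np.array([1 if "@" in t else 0 for t, _ in rows])
f1 = cross_val_score(SVC(kernel="linear"), X, y_email, cv=5, scoring="f1")
print("email classifier, 5-fold mean F1:", f1.mean())
```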

Findings

Based on the performance obtained in the classification experiments, eight separate support vector machine (SVM) models (each of which recognises its corresponding category) were chosen. All eight models were found to have good performance: the F-measure was over 85 per cent for almost all of the classifiers and over 90 per cent for most of them.

Research limitations/implications

Automatically extracted metadata cannot be entered directly into CRIS UNS; it requires review by the curators.

Practical implications

The proposed system for automatic metadata extraction using SVM models was integrated into the software system CRIS UNS. Metadata extraction was tested on the publications of researchers from the Department of Mathematics and Informatics of the Faculty of Sciences in Novi Sad. Analysis of the metadata extracted from these publications showed that the performance of the system on previously unseen data is in accordance with that obtained by cross-validation of the eight separate SVM classifiers. This system will help in the process of synchronising metadata from CRIS UNS with other institutional repositories.

Originality/value

The paper documents the development of a fully automated system for metadata extraction from scientific papers. The system is based on the SVM classifier and open-source tools, and is capable of extracting eight types of metadata from scientific articles in any format that can be converted to PDF. Although developed as part of CRIS UNS, the proposed system can be integrated into other CRIS systems, as well as institutional repositories and library management systems.

Article
Publication date: 2 July 2020

N. Venkata Sailaja, L. Padmasree and N. Mangathayaru

Text mining has been used for various knowledge discovery applications, and thus much research has been devoted to it. The latest trend in…

Abstract

Purpose

Text mining has been used for various knowledge discovery applications, and thus much research has been devoted to it. The latest trend in text mining research is the adoption of incremental learning, as it is economical when dealing with large volumes of information.

Design/methodology/approach

The primary intention of this research is to design and develop a technique for incremental text categorization using an optimized Support Vector Neural Network (SVNN). The proposed technique involves four major steps: pre-processing, feature extraction, feature selection and classification. Initially, the data is pre-processed through stop-word removal and stemming. Then, feature extraction is performed by extracting semantic word-based features and Term Frequency-Inverse Document Frequency (TF-IDF) features. From the extracted features, the important ones are selected using the Bhattacharyya distance measure and provided as input to the proposed classifier. The proposed classifier performs incremental learning using the SVNN, wherein the weights are bounded within a limit using rough set theory. Moreover, for the optimal selection of weights in the SVNN, the Moth Search (MS) algorithm is used. Thus, the proposed classifier, named Rough set MS-SVNN, performs text categorization for the incremental data given as input.
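
As a concrete illustration of one step, the sketch below ranks TF-IDF features by the Bhattacharyya distance between their class-conditional distributions under a Gaussian assumption; the SVNN itself, the rough-set weight bounding and the Moth Search optimization are not reproduced here, and the toy corpus is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks rally on earnings", "team wins the championship game",
        "market falls as rates rise", "coach praises the winning team"]
y = np.array([0, 1, 0, 1])                   # toy class labels

X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

def bhattacharyya(f0, f1, eps=1e-9):
    """Bhattacharyya distance between two 1-D samples (Gaussian assumption)."""
    m0, m1 = f0.mean(), f1.mean()
    v0, v1 = f0.var() + eps, f1.var() + eps
    return (0.25 * np.log(0.25 * (v0 / v1 + v1 / v0 + 2))
            + 0.25 * (m0 - m1) ** 2 / (v0 + v1))

scores = np.array([bhattacharyya(X[y == 0, j], X[y == 1, j])
                   for j in range(X.shape[1])])
keep = scores.argsort()[::-1][:5]            # most class-separating terms
print("selected feature indices:", keep)
```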

Findings

For the experimentation, the 20 Newsgroups dataset and the Reuters dataset were used. Simulation results indicate that the proposed Rough set-based MS-SVNN achieved a precision of 0.7743, a recall of 0.7774 and an F-measure of 0.7745.

Originality/value

In this paper, an online incremental learner is developed for text categorization. The text categorization is done by developing the Rough set MS-SVNN classifier, which classifies incoming texts based on the boundary condition evaluated by rough set theory and the optimal weights from the MS algorithm. The proposed online text categorization scheme has the basic steps of pre-processing, feature extraction, feature selection and classification. Pre-processing is carried out to identify the unique words in the dataset, and features such as semantic word-based features and TF-IDF are obtained from the keyword set. Feature selection is done by setting a minimum Bhattacharyya distance threshold, and the selected features are provided to the proposed Rough set MS-SVNN for classification.

Details

Data Technologies and Applications, vol. 54 no. 5
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 2 April 2019

Hei Chia Wang, Yu Hung Chiang and Yi Feng Sun

This paper aims to improve a sentiment analysis (SA) system to help users (i.e. customers or hotel managers) understand hotel evaluations. There are three main purposes in…

Abstract

Purpose

This paper aims to improve a sentiment analysis (SA) system to help users (i.e. customers or hotel managers) understand hotel evaluations. There are three main purposes in this paper: designing an unsupervised method for extracting online Chinese features and opinion pairs, distinguishing different intensities of polarity in opinion words and examining the changes in polarity in the time series.

Design/methodology/approach

In this paper, a review analysis system is proposed to automatically capture the feature-level opinions of other tourists presented in review documents. In the system, a feature-level SA is designed to determine the polarity of these features. Moreover, an unsupervised method using a part-of-speech pattern clarification query and multi-lexicon SA is adopted to summarize all Chinese reviews.
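
A hedged sketch of the feature-opinion pair extraction idea: a single noun-then-adjective part-of-speech pattern plus a small polarity lexicon, using the jieba tagger. The lexicon entries, the window size and the single pattern are illustrative assumptions, not the authors' full pattern set or multi-lexicon design.

```python
import jieba.posseg as pseg   # pip install jieba

# Toy polarity lexicon: opinion word -> intensity-weighted score.
LEXICON = {"干净": 1.0, "脏": -1.0, "好": 0.8, "差": -0.8}

def extract_pairs(sentence, window=2):
    """Pair each noun with the nearest following lexicon adjective."""
    tokens = [(t.word, t.flag) for t in pseg.cut(sentence)]
    pairs = []
    for i, (word, flag) in enumerate(tokens):
        if flag.startswith("n"):                        # candidate feature
            for w, f in tokens[i + 1:i + 1 + window]:
                if f.startswith("a") and w in LEXICON:  # opinion word
                    pairs.append((word, w, LEXICON[w]))
                    break
    return pairs

# "The room is very clean" -> [('房间', '干净', 1.0)], tags permitting.
print(extract_pairs("房间很干净"))
```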

Findings

The authors expect this method to help travellers search for what they want and make decisions more efficiently. The experimental results show the F-measure of the proposed method to be 0.628, outperforming the methods used in previous studies.

Originality/value

The study is useful for travellers who want to quickly retrieve and summarize helpful information from the pool of messy hotel reviews. Meanwhile, the system will assist hotel managers to comprehensively understand service qualities with which guests are satisfied or dissatisfied.

Details

The Electronic Library, vol. 37 no. 1
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 19 November 2021

Samir Al-Janabi and Ryszard Janicki

Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in…

Abstract

Purpose

Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violations of business rules. However, because of the huge amount of data, manual cleaning alone is infeasible; methods are required to detect and repair dirty data automatically. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.

Design/methodology/approach

A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.
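
A minimal sketch of what a conditional functional dependency and its violation check might look like as input to such an algorithm; the tuple format and the example constraint are assumptions for illustration, not the paper's repair algorithm.

```python
# CFD: tuples matching the pattern must take the pattern's RHS value.
cfd = {
    "rhs": "city",
    "pattern": {"country": "UK", "area_code": "020", "city": "London"},
}

rows = [
    {"country": "UK", "area_code": "020", "city": "London"},
    {"country": "UK", "area_code": "020", "city": "Londn"},      # dirty value
    {"country": "NL", "area_code": "020", "city": "Amsterdam"},  # no match
]

def violations(rows, cfd):
    """Return rows matching the CFD's pattern but breaking its RHS value."""
    rhs, pattern = cfd["rhs"], cfd["pattern"]
    return [r for r in rows
            if all(r[a] == v for a, v in pattern.items() if a != rhs)
            and r[rhs] != pattern[rhs]]

print(violations(rows, cfd))   # -> the 'Londn' row, a candidate for repair
```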

Findings

This new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.

Originality/value

Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.

Details

Data Technologies and Applications, vol. 56 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 5 January 2021

Gogineni Krishna Chaitanya and Krovi Raja Sekhar

The existing authentication procedures (PIN, pattern, password) are not very secure. Therefore, a gait pattern authentication scheme is introduced to verify the…

Abstract

Purpose

The existing authentication procedures (PIN, pattern, password) are not very secure. Therefore, a gait pattern authentication scheme is introduced to verify the device owner. The current research proposes a running Gaussian grey wolf boosting (RGGWB) model to recognize the owner.

Design/methodology/approach

Biometric systems play an important role in smartphones by securing the confidential data stored in them. Moreover, authentication schemes such as passwords and patterns are widely used in smartphones.

Findings

To validate the research model, the gaits of unauthorized users were trained and tested alongside the owner's gait. If the gait matches, the smartphone unlocks automatically; otherwise, access is rejected.

Originality/value

Finally, the effectiveness of the proposed model is demonstrated by its better accuracy and lower error rate.

Details

International Journal of Intelligent Unmanned Systems, vol. 10 no. 1
Type: Research Article
ISSN: 2049-6427

Article
Publication date: 20 November 2017

Xiangbin Yan, Yumei Li and Weiguo Fan

Getting high-quality data by removing the noisy data from the user-generated content (UGC) is the first step toward data mining and effective decision-making based on…

Abstract

Purpose

Getting high-quality data by removing the noisy data from the user-generated content (UGC) is the first step toward data mining and effective decision-making based on ubiquitous and unstructured social media data. This paper aims to design a framework for removing noisy data from UGC.

Design/methodology/approach

In this paper, the authors consider a classification-based framework to remove the noise from the unstructured UGC in a social media community. They treat messages that are not relevant to the topic of concern as noise and apply a text classification-based approach to remove it. They introduce a domain lexicon to help distinguish the concerned topic from noise and compare the performance of several classification algorithms combined with different feature selection methods.
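
A hedged sketch of such a classification-based noise filter, using scikit-learn's mutual information scorer as a stand-in for information gain, combined with a linear SVM; the toy messages and the relevant-vs-noise labels are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

messages = ["earnings beat estimates, stock up", "buy followers cheap today",
            "dividend announced for third quarter",
            "nice weather this morning"] * 25
labels = [1, 0, 1, 0] * 25      # 1 = topic-relevant, 0 = noise

clf = make_pipeline(
    CountVectorizer(),
    SelectKBest(mutual_info_classif, k=10),  # information-gain-style selection
    LinearSVC(),
)
clf.fit(messages, labels)
print(clf.predict(["stock up after strong earnings"]))   # -> [1]
```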

Findings

Experimental results based on a Chinese stock forum show that 84.9 per cent of the noise data in the UGC could be removed with little loss of valuable information. The support vector machine classifier combined with the information gain feature selection model is the best choice for this system. It was also found that message length affects system performance, with longer messages achieving better classification performance.

Originality/value

The proposed method could be used for preprocessing in text mining and for new knowledge discovery from big data.

Details

Information Discovery and Delivery, vol. 45 no. 4
Type: Research Article
ISSN: 2398-6247

Article
Publication date: 28 April 2020

Siham Eddamiri, Asmaa Benghabrit and Elmoukhtar Zemmouri

The purpose of this paper is to present a generic pipeline for Resource Description Framework (RDF) graph mining to provide a comprehensive review of each step in the…

Abstract

Purpose

The purpose of this paper is to present a generic pipeline for Resource Description Framework (RDF) graph mining to provide a comprehensive review of each step in the knowledge discovery from data process. The authors also investigate different approaches and combinations to extract feature vectors from RDF graphs to apply the clustering and theme identification tasks.

Design/methodology/approach

The proposed methodology comprises four steps. First, the authors generate several graph substructures (Walks, Set of Walks, Walks with backward and Set of Walks with backward). Second, the authors build neural language models to extract numerical vectors of the generated sequences by using word embedding techniques (Word2Vec and Doc2Vec) combined with term frequency-inverse document frequency (TF-IDF). Third, the authors use the well-known K-means algorithm to cluster the RDF graph. Finally, the authors extract the most relevant rdf:type from the grouped vertices to describe the semantics of each theme by generating the labels.
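
A minimal sketch of the middle steps of this pipeline: random walks over a toy RDF-style graph serve as "sentences" for a Word2Vec model, and the resulting entity vectors are clustered with K-means. The triples, walk depth and model sizes are assumptions; the backward walks, Doc2Vec and TF-IDF variants are not reproduced.

```python
import random
from gensim.models import Word2Vec        # pip install gensim
from sklearn.cluster import KMeans

triples = [("alice", "worksAt", "AIFB"), ("bob", "worksAt", "AIFB"),
           ("AIFB", "locatedIn", "Karlsruhe"), ("paper1", "authoredBy", "alice")]
graph = {}
for s, p, o in triples:
    graph.setdefault(s, []).append((p, o))

def walk(entity, depth=2):
    """One random walk: entity -> predicate -> object -> ..."""
    seq, node = [entity], entity
    for _ in range(depth):
        if node not in graph:
            break
        p, o = random.choice(graph[node])
        seq += [p, o]
        node = o
    return seq

entities = list(graph)
walks = [walk(e) for e in entities for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=2, min_count=1, epochs=50)

vectors = [model.wv[e] for e in entities]
print(dict(zip(entities, KMeans(n_clusters=2, n_init=10).fit_predict(vectors))))
```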

Findings

The experimental evaluation on state-of-the-art data sets (AIFB, BGS and Conference) shows that the combination of Set of Walks with backward, TF-IDF and Doc2vec techniques gives excellent results. In fact, the clustering results reach more than 97% and 90% in terms of purity and F-measure, respectively. Concerning theme identification, the results show that, using the same combination, the purity and F-measure criteria reach more than 90% for all the considered data sets.

Originality/value

The originality of this paper lies in two aspects: first, a new machine learning pipeline for RDF data is presented; second, an efficient process to identify and extract relevant graph substructures from an RDF graph is proposed. The proposed techniques were combined with different neural language models to improve the accuracy and relevance of the obtained feature vectors that will be fed to the clustering mechanism.

Details

International Journal of Web Information Systems, vol. 16 no. 2
Type: Research Article
ISSN: 1744-0084
