Search results

1 – 10 of 325
Article
Publication date: 29 March 2019

Julian Risch and Ralf Krestel

Abstract

Purpose

Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. To examine the novelty of an application, it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for automation is patent-specific language use, such as special vocabulary and phrases.

Design/methodology/approach

To account for this language use, the authors present domain-specific pre-trained word embeddings for the patent domain. The authors train the model on a very large data set of more than 5 million patents and evaluate it on the task of patent classification. To this end, the authors propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings.

Findings

Experiments on a standardized evaluation data set show that the approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, the authors further investigate the model’s strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. The imbalanced training data and underrepresented classes are the most difficult remaining challenge.

Originality/value

The proposed approach fulfills the need for domain-specific word embeddings for downstream tasks in the patent domain, such as patent classification or patent analysis.

Details

Data Technologies and Applications, vol. 53 no. 1
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 3 November 2020

Femi Emmanuel Ayo, Olusegun Folorunso, Friday Thomas Ibharalu and Idowu Ademola Osinuga

Abstract

Purpose

Hate speech is an expression of intense hatred. Twitter has become a popular analytical tool for the prediction and monitoring of abusive behaviors. Hate speech detection with social media data has received special research attention in recent studies; hence the need to design a generic metadata architecture and an efficient feature extraction technique to enhance hate speech detection.

Design/methodology/approach

This study proposes hybrid embeddings enhanced with a topic inference method and an improved cuckoo search neural network for hate speech detection in Twitter data. The proposed method uses a hybrid embeddings technique that includes Term Frequency-Inverse Document Frequency (TF-IDF) for word-level feature extraction and Long Short-Term Memory (LSTM), a variant of the recurrent neural network architecture, for sentence-level feature extraction. The features extracted from the hybrid embeddings then serve as input to the improved cuckoo search neural network, which predicts whether a tweet is hate speech, offensive language or neither.
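The word-level TF-IDF step can be illustrated with a minimal sketch. It assumes the plain weighting tf × log(N/df) over whitespace-tokenized text; the abstract does not state which TF-IDF variant the authors use, and the function name `tfidf` and the example tweets are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (term count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, tokenized on whitespace.
tweets = [
    "you are all wonderful people".split(),
    "you are all terrible people".split(),
    "what a wonderful day".split(),
]
features = tfidf(tweets)
```

In this toy corpus, a term that occurs in only one tweet (such as "terrible") receives a higher weight than a term shared by two tweets (such as "you"), which is the property that makes TF-IDF useful as a word-level feature.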

Findings

The proposed method showed better results when tested on the collected Twitter datasets compared to other related methods. In order to validate the performances of the proposed method, t-test and post hoc multiple comparisons were used to compare the significance and means of the proposed method with other related methods for hate speech detection. Furthermore, Paired Sample t-Test was also conducted to validate the performances of the proposed method with other related methods.

Research limitations/implications

Finally, the evaluation results showed that the proposed method outperforms other related methods with mean F1-score of 91.3.

Originality/value

The main novelty of this study is the use of an automatic topic spotting measure based on a naïve Bayes model to improve feature representation.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 13 no. 4
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 9 February 2022

Pradeep Kumar and Gaurav Sarin

Abstract

Purpose

Sarcasm is a sentiment in which human beings convey messages with the opposite meanings to hurt someone emotionally or condemn something in a witty manner. The difference between the text's literal and its intended meaning makes it tough to identify. Mostly, researchers and practitioners only consider explicit information for text classification; however, considering implicit with explicit information will enhance the classifier's accuracy. Several sarcasm detection studies focus on syntactic, lexical or pragmatic features that are uttered using words, emoticons and exclamation marks. Discrete models, which are utilized by many existing works, require manual features that are costly to uncover.

Design/methodology/approach

In this research, word embeddings used for feature extraction are combined with context-aware language models to provide automatic feature engineering capabilities as well as superior classification performance compared to baseline models. The performance of the proposed models is demonstrated on three benchmark datasets using several evaluation metrics, namely misclassification rate, receiver operating characteristic (ROC) curve and area under the curve (AUC).
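Two of the evaluation metrics named above can be computed as in the following sketch. It assumes binary labels (1 = sarcastic, 0 = not sarcastic) and uses the pairwise-ranking definition of AUC; the function names and the toy scores are hypothetical and are not taken from the paper.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for five texts.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
# Hard labels obtained with an assumed decision threshold of 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

error = misclassification_rate(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The misclassification rate depends on the chosen decision threshold, whereas AUC is computed from the raw scores and summarizes the ROC curve across all thresholds.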

Findings

Experimental results suggest that FastText word embedding technique with BERT language model gives higher accuracy and helps to identify the sarcastic textual element correctly.

Originality/value

Sarcasm detection is a sub-task of sentiment analysis. To help in appropriate data-driven decision-making, the sentiment of the text that gets reversed due to sarcasm needs to be detected properly. In online social environments, it is critical for businesses and individuals to detect the correct sentiment polarity. This will aid in the right selling and buying of products and/or services, leading to higher sales and better market share for businesses, and meeting the quality requirements of customers.

Details

Online Information Review, vol. 46 no. 7
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 28 May 2021

Subbaraju Pericherla and E. Ilavarasan

Abstract

Purpose

Nowadays people are connected by social media like Facebook, Instagram, Twitter, YouTube and many more. Bullies take advantage of these social networks to share their comments. Cyberbullying is a typical kind of harassment in which aggressive or abusive comments are made to hurt netizens. Social media is one of the areas where bullying happens extensively. Hence, it is necessary to develop an efficient and autonomous cyberbullying detection technique.

Design/methodology/approach

In this paper, the authors proposed a transformer network-based word embeddings approach for cyberbullying detection. RoBERTa is used to generate word embeddings and Light Gradient Boosting Machine is used as a classifier.

Findings

The proposed approach outperforms machine learning algorithms such as logistic regression and support vector machine, and deep learning models such as word-level convolutional neural networks (word CNN) and character convolutional neural networks with short cuts (char CNNS), in terms of precision, recall and F1-score.
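The three reported metrics can be computed from true and predicted labels as in this minimal sketch. It assumes a binary task in which label 1 marks a bullying comment; the function name and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators by returning 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = bullying comment, 0 = benign comment.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high, which matters when bullying comments are a minority class.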

Originality/value

One limitation of traditional word embedding methods is that they are context-independent. In this work, only text data are utilized to identify cyberbullying. This work can be extended to predict cyberbullying activities in multimedia environments such as image, audio and video.

Details

International Journal of Intelligent Unmanned Systems, vol. 12 no. 1
Type: Research Article
ISSN: 2049-6427

Article
Publication date: 28 March 2023

Antonijo Marijić and Marina Bagić Babac

Abstract

Purpose

Genre classification of songs based on lyrics is a challenging task even for humans; however, state-of-the-art natural language processing has recently offered advanced solutions to this task. The purpose of this study is to advance the understanding and application of natural language processing and deep learning in the domain of music genre classification, while also contributing to the broader themes of global knowledge and communication, and sustainable preservation of cultural heritage.

Design/methodology/approach

The main contribution of this study is the development and evaluation of various machine and deep learning models for song genre classification. Additionally, we investigated the effect of different word embeddings, including Global Vectors for Word Representation (GloVe) and Word2Vec, on the classification performance. The tested models range from benchmarks such as logistic regression, support vector machine and random forest, to more complex neural network architectures and transformer-based models, such as recurrent neural network, long short-term memory, bidirectional long short-term memory and bidirectional encoder representations from transformers (BERT).

Findings

The authors conducted experiments on both English and multilingual data sets for genre classification. The results show that the BERT model achieved the best accuracy on the English data set, whereas cross-lingual language model pretraining based on RoBERTa (XLM-RoBERTa) performed the best on the multilingual data set. This study found that songs in the metal genre were the most accurately labeled, as their text style and topics were the most distinct from other genres. On the contrary, songs from the pop and rock genres were more challenging to differentiate. This study also compared the impact of different word embeddings on the classification task and found that models with GloVe word embeddings outperformed Word2Vec and the learning embedding layer.

Originality/value

This study presents the implementation, testing and comparison of various machine and deep learning models for genre classification. The results demonstrate that transformer models, including BERT, robustly optimized BERT pretraining approach, distilled bidirectional encoder representations from transformers, bidirectional and auto-regressive transformers and XLM-RoBERTa, outperformed other models.

Details

Global Knowledge, Memory and Communication, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2514-9342

Book part
Publication date: 18 April 2022

Rodolphe Durand and Paul Gouvard

Abstract

Extant research presents firms’ purpose as a consensual and positive attribute. This paper introduces an alternative perspective, which sees firms’ purposefulness as defined in relation to specific audiences. A firm’s purposefulness to a focal audience can be either positive or negative. Audiences find firms with which they share a common prioritization of issues more purposeful in absolute terms. Audiences find firms with which they share a common understanding of issues positively purposeful. Conversely, audiences find firms with an opposite understanding of issues negatively purposeful. Audiences harness specific resources to support firms they find positively purposeful and to oppose firms they find negatively purposeful. This paper introduces topic modeling and word embeddings as two techniques to operationalize this audience-based approach to purposefulness.

Details

Advances in Cultural Entrepreneurship
Type: Book
ISBN: 978-1-80262-207-2

Article
Publication date: 2 September 2019

Guellil Imane, Darwish Kareem and Azouaou Faical

Abstract

Purpose

This paper aims to propose an approach to automatically annotate a large corpus in Arabic dialect. This corpus is used to analyse the sentiments of Arabic users on social media. It focuses on the Algerian dialect, which is a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million speakers, few studies address automated processing in general, and sentiment analysis in particular, for Algerian.

Design/methodology/approach

The approach is based on the construction and use of a sentiment lexicon to automatically annotate a large corpus of Algerian text extracted from Facebook. This approach allows the size of the training corpus to be increased significantly without resorting to manual annotation. The annotated corpus is then vectorized using document embeddings (doc2vec), an extension of word embeddings (word2vec). For sentiment classification, the authors used different classifiers such as support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).
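The idea of lexicon-based automatic annotation can be sketched as follows. This is an illustration under stated assumptions and not the authors' procedure: the tiny English lexicon, the averaging rule and the function name `annotate` are hypothetical, and the default of 0.6 is simply applied here as a cut-off on the average polarity, since the abstract does not say how the authors' threshold is defined.

```python
# Hypothetical toy lexicon mapping words to polarity scores in [-1, 1].
LEXICON = {
    "good": 1.0, "great": 1.0, "love": 1.0,
    "bad": -1.0, "awful": -1.0, "hate": -1.0,
}

def annotate(message, lexicon=LEXICON, threshold=0.6):
    """Label a message from its lexicon hits, or return None if not confident.

    Assumed rule: average the polarity of all matched words and keep the
    message only if the absolute average reaches the threshold.
    """
    tokens = message.lower().split()
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return None  # No lexicon word found, so the message stays unlabeled.
    score = sum(hits) / len(hits)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return None  # Mixed sentiment: excluded from the training set.

messages = [
    "I love this great phone",
    "awful service I hate it",
    "good screen but bad battery",
    "the weather today",
]
# Keep only the messages that received a confident label.
training_set = [(m, annotate(m)) for m in messages if annotate(m) is not None]
```

A stricter threshold keeps fewer but cleaner training examples, so it trades corpus size against label noise.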

Findings

The results suggest that NB and SVM classifiers generally led to the best results and MLP generally had the worst results. Further, the threshold that the authors use in selecting messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using PV-DBOW led to slightly higher results than using PV-DM. Combining PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier with F1 up to 86.9 per cent.

Originality/value

The principal originality of this paper is to determine the right parameters for automatically annotating an Algerian dialect corpus. This annotation is based on a sentiment lexicon that was also constructed automatically.

Details

International Journal of Web Information Systems, vol. 15 no. 5
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 10 December 2018

Luciano Barbosa

Abstract

Purpose

Matching instances of the same entity, a task known as entity resolution, is a key step in the process of data integration. This paper aims to propose a deep learning network that learns different representations of Web entities for entity resolution.

Design/methodology/approach

To match Web entities, the proposed network learns the following representations of entities: embeddings, which are vector representations of the words in the entities in a low-dimensional space; convolutional vectors from a convolutional layer, which capture short-distance patterns in word sequences in the entities; and bag-of-word vectors, created by a bow layer that learns weights for words in the vocabulary based on the task at hand. Given a pair of entities, the similarity between their learned representations is used as a feature to a binary classifier that identifies a possible match. In addition to those features, the classifier also uses a modification of inverse document frequency for pairs, which identifies discriminative words in pairs of entities.
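The step of turning per-representation similarities into classifier features can be sketched as below. It assumes cosine similarity and fixed example vectors; the abstract does not name the similarity function, the representations in the paper are learned by the network rather than given, and the names `cosine` and `pair_features` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pair_features(entity_a, entity_b):
    """One similarity per representation type, ordered by representation name."""
    return [cosine(entity_a[name], entity_b[name]) for name in sorted(entity_a)]

# Hypothetical two-dimensional representations of two Web entities.
a = {"embedding": [1.0, 0.0], "conv": [1.0, 1.0], "bow": [0.0, 2.0]}
b = {"embedding": [1.0, 0.0], "conv": [1.0, -1.0], "bow": [0.0, 1.0]}

# Feature order follows the sorted names: bow, conv, embedding.
features = pair_features(a, b)
```

The resulting feature vector would then be passed to a binary classifier that decides whether the pair is a match; that classifier is omitted here.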

Findings

The proposed approach was evaluated on two commercial and two academic entity resolution benchmarking data sets. The results show that the proposed strategy outperforms previous approaches on the commercial data sets, which are more challenging, and achieves results similar to those of its competitors on the academic data sets.

Originality/value

No previous work has used a single deep learning framework to learn different representations of Web entities for entity resolution.

Details

International Journal of Web Information Systems, vol. 15 no. 3
Type: Research Article
ISSN: 1744-0084

Open Access
Article
Publication date: 10 November 2023

Sue-Ting Chang and Jia-Jhou Wu

Abstract

Purpose

The study aims to propose an instrument for measuring product-centeredness (i.e. the extent to which comment content is related to a product) using word embedding techniques as well as explore its determinants.

Design/methodology/approach

The study collected branded posts from 205 Instagram influencers and empirically examined how four factors (i.e. authenticity, vividness, coolness and influencer–product congruence) influence the content of the comments on branded posts.

Findings

Post authenticity and congruence are shown to have positive effects on product-centeredness. The interaction between coolness and authenticity is also significant. The number of comments or likes on branded posts is not correlated with product-centeredness.

Originality/value

In social media influencer marketing, volume-based metrics such as the numbers of likes and comments have been researched and applied extensively. However, content-based metrics are urgently needed, as fans may ignore brands and focus on influencers. The proposed instrument for assessing comment content enables marketers to construct content-based metrics. Additionally, the authors' findings enhance the understanding of social media users' engagement behaviors.

Details

Industrial Management & Data Systems, vol. 124 no. 1
Type: Research Article
ISSN: 0263-5577

Article
Publication date: 27 April 2022

Romina Sharifpour, Mingfang Wu and Xiuzhen Zhang

Abstract

Purpose

With an explosion of datasets available on the Web, dataset search has gained attention as an emerging research domain. Understanding users' dataset behaviour is imperative for providing effective data discovery services. In this paper, the authors present a study on users' dataset search behaviour through the analysis of search logs from a research data discovery portal.

Design/methodology/approach

Using query- and session-based features, the authors apply cluster analysis to discover distinct user profiles with different search behaviours. One behavioural construct of particular interest is users' expertise, which the authors estimate by computing the semantic similarity between users' search queries and the titles of metadata records in the displayed search results.

Findings

The findings revealed six distinct classes of user behaviour for dataset search, namely Expert Research, Expert Search, Expert Explore, Novice Research, Novice Search and Novice Explore.

Research limitations/implications

The user profiles are derived based on analysis of the search log of the research data catalogue in this study. Further research is needed to generalise the user profiles to other dataset search settings. Future research can take on a confirmatory approach to verify these user groups and establish a deeper understanding of their information needs.

Practical implications

The findings in this paper have implications for designing search systems that tailor search results matching the diverse information needs of different user groups.

Originality/value

We propose for the first time a taxonomy of users for dataset search based on their domain expertise and search behaviour.

Details

Journal of Documentation, vol. 79 no. 1
Type: Research Article
ISSN: 0022-0418
