Search results
1 – 10 of 325
Abstract
Purpose
Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. To examine the novelty of an application, it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for automation is patent-specific language use, such as special vocabulary and phrases.
Design/methodology/approach
To account for this language use, the authors present domain-specific pre-trained word embeddings for the patent domain. The authors train the model on a very large data set of more than 5m patents and evaluate it at the task of patent classification. To this end, the authors propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings.
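The core building block named above, the gated recurrent unit, can be sketched in a few lines. This is an illustrative NumPy toy with hypothetical dimensions, not the authors' trained model:

```python
import numpy as np

def gru_step(x, h_prev, params):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(x @ Wz + h_prev @ Uz)               # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde         # interpolate old/new state

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 4                   # toy sizes; real models are far larger
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_emb, d_hid), (d_hid, d_hid)] * 3]
h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_emb)):  # five "word embeddings" of a document
    h = gru_step(x, h, params)
# h is the document representation a classification layer would consume
```

In the paper's setting, `x` would be the domain-specific pre-trained patent embedding of each word, and the final state `h` would feed a classification layer over patent classes.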
Findings
Experiments on a standardized evaluation data set show that the approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, the authors further investigate the model's strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. The imbalanced training data and underrepresented classes remain the most difficult challenge.
Originality/value
The proposed approach fulfills the need for domain-specific word embeddings for downstream tasks in the patent domain, such as patent classification or patent analysis.
Femi Emmanuel Ayo, Olusegun Folorunso, Friday Thomas Ibharalu and Idowu Ademola Osinuga
Abstract
Purpose
Hate speech is an expression of intense hatred. Twitter has become a popular analytical tool for the prediction and monitoring of abusive behaviors. Hate speech detection with social media data has witnessed special research attention in recent studies, hence, the need to design a generic metadata architecture and efficient feature extraction technique to enhance hate speech detection.
Design/methodology/approach
This study proposes a hybrid embedding technique enhanced with a topic inference method and an improved cuckoo search neural network for hate speech detection in Twitter data. The proposed method uses a hybrid embedding technique that includes Term Frequency-Inverse Document Frequency (TF-IDF) for word-level feature extraction and Long Short-Term Memory (LSTM), a variant of the recurrent neural network architecture, for sentence-level feature extraction. The extracted features from the hybrid embeddings then serve as input to the improved cuckoo search neural network for the prediction of a tweet as hate speech, offensive language or neither.
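The hybrid-feature idea can be sketched as a concatenation of a word-level view and a sentence-level view per tweet. Here the LSTM encoder is replaced by a hypothetical placeholder (averaged random per-word vectors), so this shows only the shape of the pipeline, not the authors' setup:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["you are awful", "have a nice day", "awful people everywhere"]

# Word-level features: TF-IDF over the corpus vocabulary.
tfidf = TfidfVectorizer()
word_feats = tfidf.fit_transform(tweets).toarray()

# Sentence-level features: the paper uses an LSTM; as an illustrative
# stand-in we average random 16-dimensional per-word vectors.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=16) for t in tweets for w in t.split()}
sent_feats = np.stack([np.mean([vocab[w] for w in t.split()], axis=0)
                       for t in tweets])

# Hybrid embedding: concatenate both views for each tweet.
hybrid = np.hstack([word_feats, sent_feats])
```

Each row of `hybrid` would then be fed to the downstream classifier (in the paper, the improved cuckoo search neural network).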
Findings
The proposed method showed better results when tested on the collected Twitter datasets compared to other related methods. To validate its performance, a paired-sample t-test and post hoc multiple comparisons were conducted to compare the significance and means of the proposed method against other related methods for hate speech detection.
Research limitations/implications
Finally, the evaluation results showed that the proposed method outperforms other related methods with mean F1-score of 91.3.
Originality/value
The main novelty of this study is the use of an automatic topic spotting measure based on naïve Bayes model to improve features representation.
Pradeep Kumar and Gaurav Sarin
Abstract
Purpose
Sarcasm is a sentiment in which human beings convey messages with the opposite meanings to hurt someone emotionally or condemn something in a witty manner. The difference between the text's literal and its intended meaning makes it tough to identify. Mostly, researchers and practitioners only consider explicit information for text classification; however, considering implicit with explicit information will enhance the classifier's accuracy. Several sarcasm detection studies focus on syntactic, lexical or pragmatic features that are uttered using words, emoticons and exclamation marks. Discrete models, which are utilized by many existing works, require manual features that are costly to uncover.
Design/methodology/approach
In this research, word embeddings used for feature extraction are combined with context-aware language models to provide automatic feature engineering capabilities as well as superior classification performance compared to baseline models. Performance of the proposed models is shown on three benchmark datasets over different evaluation metrics, namely misclassification rate, receiver operating characteristic (ROC) curve and area under the curve (AUC).
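The two evaluation metrics named above are straightforward to compute. A toy illustration with made-up labels and scores (1 = sarcastic, 0 = literal), not the paper's data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0])                # gold labels
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6])   # predicted probabilities

y_pred = (y_score >= 0.5).astype(int)
misclassification_rate = np.mean(y_pred != y_true)   # 1 - accuracy
auc = roc_auc_score(y_true, y_score)                 # area under the ROC curve
```

Misclassification rate depends on the 0.5 threshold, whereas AUC summarizes ranking quality over all thresholds, which is why papers often report both.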
Findings
Experimental results suggest that FastText word embedding technique with BERT language model gives higher accuracy and helps to identify the sarcastic textual element correctly.
Originality/value
Sarcasm detection is a sub-task of sentiment analysis. To help in appropriate data-driven decision-making, the sentiment of the text that gets reversed due to sarcasm needs to be detected properly. In online social environments, it is critical for businesses and individuals to detect the correct sentiment polarity. This will aid in the right selling and buying of products and/or services, leading to higher sales and better market share for businesses, and meeting the quality requirements of customers.
Subbaraju Pericherla and E. Ilavarasan
Abstract
Purpose
Nowadays people are connected by social media such as Facebook, Instagram, Twitter and YouTube. Bullies take advantage of these social networks to share their comments. Cyberbullying is a typical kind of harassment in which aggressive and abusive comments are made to hurt netizens. Social media is one of the areas where bullying happens extensively. Hence, it is necessary to develop an efficient and autonomous cyberbullying detection technique.
Design/methodology/approach
In this paper, the authors proposed a transformer network-based word embeddings approach for cyberbullying detection. RoBERTa is used to generate word embeddings and Light Gradient Boosting Machine is used as a classifier.
Findings
The proposed approach outperforms machine learning algorithms such as logistic regression and support vector machine, and deep learning models such as word-level convolutional neural networks (word CNN) and character-level convolutional neural networks with shortcuts (char CNNS), in terms of precision, recall and F1-score.
Originality/value
One limitation of traditional word embedding methods is that they are context-independent. In this work, only text data are utilized to identify cyberbullying. This work can be extended to predict cyberbullying activities in multimedia environments such as image, audio and video.
Antonijo Marijić and Marina Bagić Babac
Abstract
Purpose
Genre classification of songs based on lyrics is a challenging task even for humans; however, state-of-the-art natural language processing has recently offered advanced solutions to this task. The purpose of this study is to advance the understanding and application of natural language processing and deep learning in the domain of music genre classification, while also contributing to the broader themes of global knowledge and communication, and sustainable preservation of cultural heritage.
Design/methodology/approach
The main contribution of this study is the development and evaluation of various machine and deep learning models for song genre classification. Additionally, we investigated the effect of different word embeddings, including Global Vectors for Word Representation (GloVe) and Word2Vec, on the classification performance. The tested models range from benchmarks such as logistic regression, support vector machine and random forest, to more complex neural network architectures and transformer-based models, such as recurrent neural network, long short-term memory, bidirectional long short-term memory and bidirectional encoder representations from transformers (BERT).
Findings
The authors conducted experiments on both English and multilingual data sets for genre classification. The results show that the BERT model achieved the best accuracy on the English data set, whereas cross-lingual language model pretraining based on RoBERTa (XLM-RoBERTa) performed the best on the multilingual data set. This study found that songs in the metal genre were the most accurately labeled, as their text style and topics were the most distinct from other genres. On the contrary, songs from the pop and rock genres were more challenging to differentiate. This study also compared the impact of different word embeddings on the classification task and found that models with GloVe word embeddings outperformed Word2Vec and the learning embedding layer.
Originality/value
This study presents the implementation, testing and comparison of various machine and deep learning models for genre classification. The results demonstrate that transformer models, including BERT, robustly optimized BERT pretraining approach, distilled bidirectional encoder representations from transformers, bidirectional and auto-regressive transformers and XLM-RoBERTa, outperformed other models.
Rodolphe Durand and Paul Gouvard
Abstract
Extant research presents firms’ purpose as a consensual and positive attribute. This paper introduces an alternative perspective, which sees firms’ purposefulness as defined in relation to specific audiences. A firm’s purposefulness to a focal audience can be either positive or negative. Audiences find firms with which they share a common prioritization of issues more purposeful in absolute terms. Audiences find firms with which they share a common understanding of issues positively purposeful. Conversely, audiences find firms with an opposite understanding of issues negatively purposeful. Audiences harness specific resources to support firms they find positively purposeful and to oppose firms they find negatively purposeful. This paper introduces topic modeling and word embeddings as two techniques to operationalize this audience-based approach to purposefulness.
Guellil Imane, Darwish Kareem and Azouaou Faical
Abstract
Purpose
This paper aims to propose an approach to automatically annotate a large corpus in Arabic dialect. This corpus is used to analyse the sentiments of Arabic users on social media. It focuses on the Algerian dialect, a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million speakers, few studies address automated processing in general, and sentiment analysis in particular, for Algerian.
Design/methodology/approach
The approach is based on the construction and use of a sentiment lexicon to automatically annotate a large corpus of Algerian text extracted from Facebook. This approach allows the size of the training corpus to be significantly increased without resorting to manual annotation. The annotated corpus is then vectorized using document embeddings (doc2vec), an extension of word embeddings (word2vec). For sentiment classification, the authors used different classifiers such as support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).
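Lexicon-based automatic annotation with a confidence threshold can be sketched as follows. The lexicon and messages here are illustrative English stand-ins (the paper's lexicon is for Algerian dialect and itself built automatically); a message is kept for the training corpus only when its average lexicon score clears the threshold:

```python
# Hypothetical toy lexicon: word -> polarity score in [-1, 1].
lexicon = {"good": 1.0, "great": 1.0, "bad": -1.0, "awful": -1.0, "ok": 0.2}

def annotate(message, threshold=0.6):
    """Return 'pos'/'neg' if the lexicon score is confident enough,
    else None (the message is discarded from the training corpus)."""
    scores = [lexicon[w] for w in message.lower().split() if w in lexicon]
    if not scores:
        return None
    avg = sum(scores) / len(scores)
    if abs(avg) < threshold:
        return None
    return "pos" if avg > 0 else "neg"

labels = [annotate(m) for m in ["great day", "awful bad service", "it is ok"]]
# -> ["pos", "neg", None]
```

The threshold of 0.6 used as the default here mirrors the value the findings below report as producing the best recall/precision trade-off.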
Findings
The results suggest that NB and SVM classifiers generally led to the best results and MLP generally had the worst results. Further, the threshold that the authors use in selecting messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using PV-DBOW led to slightly higher results than using PV-DM. Combining PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier with F1 up to 86.9 per cent.
Originality/value
The principal originality of this paper is to determine the right parameters for automatically annotating an Algerian dialect corpus. This annotation is based on a sentiment lexicon that was also constructed automatically.
Abstract
Purpose
Matching instances of the same entity, a task known as entity resolution, is a key step in the process of data integration. This paper aims to propose a deep learning network that learns different representations of Web entities for entity resolution.
Design/methodology/approach
To match Web entities, the proposed network learns the following representations of entities: embeddings, which are vector representations of the words in the entities in a low-dimensional space; convolutional vectors from a convolutional layer, which capture short-distance patterns in word sequences in the entities; and bag-of-words vectors, created by a bag-of-words layer that learns weights for words in the vocabulary based on the task at hand. Given a pair of entities, the similarity between their learned representations is used as a feature to a binary classifier that identifies a possible match. In addition to those features, the classifier also uses a modification of inverse document frequency for pairs, which identifies discriminative words in pairs of entities.
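The similarity feature for the match classifier is typically a cosine similarity between the two learned vectors. A minimal sketch with hypothetical vectors standing in for learned entity representations:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two learned entity representations;
    used as one input feature for the match/non-match classifier."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative vectors for two Web entities (e.g. two product titles).
e1 = np.array([0.2, 0.9, 0.1])
e2 = np.array([0.25, 0.85, 0.05])
match_feature = cosine_sim(e1, e2)   # close to 1.0 for near-duplicates
```

In the paper's architecture one such similarity would be computed per learned representation (embedding, convolutional and bag-of-words) and the resulting features stacked for the binary classifier.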
Findings
The proposed approach was evaluated on two commercial and two academic entity resolution benchmarking data sets. The results show that the proposed strategy outperforms previous approaches on the commercial data sets, which are more challenging, and achieves results similar to its competitors on the academic data sets.
Originality/value
No previous work has used a single deep learning framework to learn different representations of Web entities for entity resolution.
Sue-Ting Chang and Jia-Jhou Wu
Abstract
Purpose
The study aims to propose an instrument for measuring product-centeredness (i.e. the extent to which comment content is related to a product) using word embedding techniques as well as explore its determinants.
Design/methodology/approach
The study collected branded posts from 205 Instagram influencers and empirically examined how four factors (i.e. authenticity, vividness, coolness and influencer–product congruence) influence the content of the comments on branded posts.
Findings
Post authenticity and congruence are shown to have positive effects on product-centeredness. The interaction between coolness and authenticity is also significant. The number of comments or likes on branded posts is not correlated with product-centeredness.
Originality/value
In social media influencer marketing, volume-based metrics such as the numbers of likes and comments have been researched and applied extensively. However, content-based metrics are urgently needed, as fans may ignore brands and focus on influencers. The proposed instrument for assessing comment content enables marketers to construct content-based metrics. Additionally, the authors' findings enhance the understanding of social media users' engagement behaviors.
Romina Sharifpour, Mingfang Wu and Xiuzhen Zhang
Abstract
Purpose
With an explosion of datasets available on the Web, dataset search has gained attention as an emerging research domain. Understanding users' dataset search behaviour is imperative for providing effective data discovery services. In this paper, the authors present a study of users' dataset search behaviour through the analysis of search logs from a research data discovery portal.
Design/methodology/approach
Using query- and session-based features, the authors apply cluster analysis to discover distinct user profiles with different search behaviours. One behavioural construct of particular interest is users' expertise, which the authors estimate by computing the semantic similarity between users' search queries and the titles of metadata records in the displayed search results.
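The clustering step can be sketched with k-means over per-session feature vectors. The feature names and the synthetic sessions below are hypothetical stand-ins, not the study's log data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Each row is one session: [n_queries, mean_query_length, expertise_score]
sessions = np.vstack([
    rng.normal([2, 3, 0.2], 0.1, size=(20, 3)),   # novice-like sessions
    rng.normal([8, 6, 0.8], 0.1, size=(20, 3)),   # expert-like sessions
])
# Cluster sessions into behavioural profiles (the study finds six;
# two clusters suffice for this toy illustration).
profiles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(sessions)
```

Each resulting cluster would then be interpreted by inspecting its feature centroids, which is how labels such as "Expert Search" or "Novice Explore" are assigned.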
Findings
The findings revealed six distinct classes of user behaviour for dataset search, namely: Expert Research, Expert Search, Expert Explore, Novice Research, Novice Search and Novice Explore.
Research limitations/implications
The user profiles are derived based on analysis of the search log of the research data catalogue in this study. Further research is needed to generalise the user profiles to other dataset search settings. Future research can take on a confirmatory approach to verify these user groups and establish a deeper understanding of their information needs.
Practical implications
The findings in this paper have implications for designing search systems that tailor search results matching the diverse information needs of different user groups.
Originality/value
We propose for the first time a taxonomy of users for dataset search based on their domain expertise and search behaviour.