Search results

1 – 4 of 4
Open Access
Article
Publication date: 17 July 2020

Mukesh Kumar and Palak Rehan

Abstract

Social media networks such as Twitter, Facebook and WhatsApp are among the most commonly used media for sharing news and opinions and for staying in touch with peers. Messages on Twitter are limited to 140 characters, which has led users to create their own novel syntax in tweets to express more in fewer words. The free writing style, URLs, markup syntax, inappropriate punctuation, ungrammatical structures and abbreviations make it harder to mine useful information from tweets. For each tweet, we can get an explicit time stamp, the name of the user, the social network the user belongs to, or even the GPS coordinates if the tweet is created with a GPS-enabled mobile device. With these features, Twitter is by nature a good resource for detecting and analyzing real-time events happening around the world. By using the speed and coverage of Twitter, we can detect events, each a sequence of important keywords being talked about, in a timely manner; this can be used in applications such as natural calamity relief support, earthquake relief support, product launches and suspicious activity detection. Keyword detection from Twitter can be seen as a two-step process: detection of keywords in raw text form (words as posted by the users) and keyword normalization (reforming the users' unstructured words into complete, meaningful English words). In this paper, a keyword detection technique based upon a graph, spanning tree and PageRank algorithm is proposed. A text normalization technique based upon a hybrid approach using Levenshtein distance, the demetaphone algorithm and dictionary mapping is proposed to work upon the unstructured keywords produced by the proposed keyword detector. The proposed normalization technique is validated using the standard lexnorm 1.2 dataset. The proposed system is used to detect keywords from Twitter text being posted in real time. The detected and normalized keywords are further validated against search engine results at a later time for detection of events.
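
To make the normalization step more concrete, here is a minimal sketch of two of the ingredients the abstract names: Levenshtein distance and dictionary mapping. It is not the authors' implementation. The phonetic matching step is omitted, and the `SLANG` table, the `VOCAB` list, the `normalize` function and the `max_distance` threshold are all hypothetical choices made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical lookup table and vocabulary (assumptions, not from the paper).
SLANG = {"u": "you", "gr8": "great", "2moro": "tomorrow"}
VOCAB = ["earthquake", "tomorrow", "relief", "great", "launch"]


def normalize(token: str, max_distance: int = 2) -> str:
    """Map a noisy token to a vocabulary word, or return it unchanged."""
    token = token.lower()
    if token in SLANG:          # step 1: direct dictionary mapping
        return SLANG[token]
    if token in VOCAB:          # already a known word
        return token
    # step 2: nearest vocabulary word by edit distance
    best = min(VOCAB, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token


if __name__ == "__main__":
    for raw in ["gr8", "earthquak", "relif", "xyzzy"]:
        print(raw, "->", normalize(raw))
```

The distance threshold keeps tokens with no close vocabulary match unchanged, rather than forcing them onto an unrelated word.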

Details

Applied Computing and Informatics, vol. 17 no. 2
Type: Research Article
ISSN: 2634-1964

Open Access
Article
Publication date: 13 October 2022

Linzi Wang, Qiudan Li, Jingjun David Xu and Minjie Yuan

Abstract

Purpose

Mining user-concerned actionable and interpretable hot topics will help management departments fully grasp the latest events and make timely decisions. Existing topic models primarily integrate word embedding and matrix decomposition, which generate only keyword-based hot topics with weak interpretability, making it difficult to meet the specific needs of users. Mining phrase-based hot topics with syntactic dependency structure has been proven to model structure information effectively. A key challenge lies in the effective integration of the above information into the hot topic mining process.

Design/methodology/approach

This paper proposes a nonnegative matrix factorization (NMF)-based hot topic mining method, the semantics syntax-assisted hot topic model (SSAHM), which combines semantic association and syntactic dependency structure. First, a semantic–syntactic component association matrix is constructed. Then, the matrix is incorporated as a constraint into the block coordinate descent (BCD)-based matrix decomposition process. Finally, a hot topic information-driven phrase extraction algorithm is applied to describe hot topics.
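
For readers unfamiliar with NMF, the sketch below shows the basic factorization that such a method builds on: a nonnegative matrix V is approximated by the product of two nonnegative matrices W and H. It is not SSAHM. It uses the standard multiplicative update rules in place of the BCD solver described above, and it omits the semantic–syntactic constraint entirely. The function names and the toy matrix are assumptions for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate v (n x m) as w (n x rank) times h (rank x m), all nonnegative.

    Uses plain multiplicative updates, not the constrained BCD of the paper.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of v - w h."""
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


if __name__ == "__main__":
    # Toy document-term matrix with two obvious "topics" (made-up data).
    V = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
    W, H = nmf(V, rank=2)
    print("reconstruction error:", frobenius_error(V, W, H))
```

In a topic-mining setting, the rows of H can be read as topics over terms and the rows of W as topic weights per document; a constraint matrix like the one described above would enter as an extra term in the objective.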

Findings

The efficacy of the developed model is demonstrated on two real-world datasets, and the effects of dependency structure information on different topics are compared. The qualitative examples further explain the application of the method in real scenarios.

Originality/value

Most prior research focuses on keyword-based hot topics. This study advances the literature by mining phrase-based hot topics with syntactic dependency structure, which can effectively analyze the semantics. Developing a syntactic dependency structure that combines word order and part-of-speech (POS) is a step forward, as word order and POS are utilized only separately in the prior literature. Ignoring this synergy may miss important information, such as grammatical structure coherence and logical relations between syntactic components.

Details

Journal of Electronic Business & Digital Economics, vol. 1 no. 1/2
Type: Research Article
ISSN: 2754-4214

Open Access
Article
Publication date: 8 December 2020

Matjaž Kragelj and Mirjana Kljajić Borštnar

Abstract

Purpose

The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.

Design/methodology/approach

The general research approach is inherent to design science research, in which the problem of UDC assignment of the old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was used for classification of old texts on a corpus of 200,000 items. Human experts evaluated the performance of the model.

Findings

Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.

Research limitations/implications

The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.

Practical implications

The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.

Social implications

The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These contribute to making knowledge more widely available and useable.

Originality/value

These findings contribute to the field of automated classification of bibliographical information with the usage of full texts, especially in cases in which the texts are old, unstructured and in which archaic language and vocabulary are used.

Details

Journal of Documentation, vol. 77 no. 3
Type: Research Article
ISSN: 0022-0418

Open Access
Article
Publication date: 11 October 2023

Bachriah Fatwa Dhini, Abba Suganda Girsang, Unggul Utan Sufandi and Heny Kurniawati

Abstract

Purpose

The authors constructed an automatic essay scoring (AES) model for a discussion forum and compared its results with scores given by human evaluators. This research proposes essay scoring based on two parameters, semantic similarity and keyword similarity, using a pre-trained SentenceTransformers model that can construct the highest vector embedding. The two parameters are combined to improve the accuracy of the model.

Design/methodology/approach

The development of the model in the study is divided into seven stages: (1) data collection, (2) data pre-processing, (3) selection of a pre-trained SentenceTransformers model, (4) semantic similarity (sentence pair), (5) keyword similarity, (6) final score calculation and (7) model evaluation.
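
Stages (4) to (6) can be sketched as follows. This is an illustration under stated assumptions, not the authors' scoring formula: the abstract does not say how the two similarities are weighted, so the equal weighting, the 0 to 100 scale and the function names are hypothetical. The embedding vectors are passed in as plain lists of numbers; in practice they would come from a sentence-embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_similarity(response_keywords, rubric_keywords):
    """Fraction of rubric keywords that also appear in the response."""
    rubric = {k.lower() for k in rubric_keywords}
    if not rubric:
        return 0.0
    found = {k.lower() for k in response_keywords} & rubric
    return len(found) / len(rubric)


def final_score(semantic, keyword, semantic_weight=0.5, max_score=100):
    """Weighted combination of the two similarities (weights are assumed)."""
    combined = semantic_weight * semantic + (1 - semantic_weight) * keyword
    return max_score * combined


if __name__ == "__main__":
    # Made-up embeddings standing in for model output.
    answer_vec, reference_vec = [0.2, 0.7, 0.1], [0.25, 0.6, 0.2]
    sem = cosine_similarity(answer_vec, reference_vec)
    key = keyword_similarity(["cell", "nucleus"], ["cell", "membrane"])
    print(round(final_score(sem, key), 1))
```

Keeping the two similarities as separate functions makes it easy to evaluate each parameter alone before combining them, which mirrors the comparison reported in the findings.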

Findings

The multilingual paraphrase-multilingual-MiniLM-L12-v2 and distilbert-base-multilingual-cased-v1 models achieved the highest scores in a comparison of 11 pre-trained multilingual SentenceTransformers models on Indonesian data (Dhini and Girsang, 2023). Both multilingual models were adopted in this study. The two parameters are combined by comparing the keywords extracted from the responses with the rubric keywords. Based on the experimental results, the proposed combination can increase the evaluation results by 0.2.

Originality/value

This study uses discussion forum data from the general biology course in online learning at the open university for the 2020.2 and 2021.2 semesters. Forum discussions are still rated manually. In this study, the authors created a model that automatically calculates the score of discussion forum posts, which are essays, based on the lecturer's answers and rubrics.

Details

Asian Association of Open Universities Journal, vol. 18 no. 3
Type: Research Article
ISSN: 1858-3431
