Search results

1 – 10 of 435
Article
Publication date: 26 June 2020

Jamal Al Qundus, Adrian Paschke, Shivam Gupta, Ahmad M. Alzouby and Malik Yousef

The purpose of this paper is to explore to what extent the quality of social media short texts without extensions can be investigated and what the predictors, if any, are of such…

Abstract

Purpose

The purpose of this paper is to explore to what extent the quality of social media short texts without extensions can be investigated, and what the predictors, if any, are of such short texts that lead readers to trust their content.

Design/methodology/approach

The paper applies a trust model to classify data collections, based on metadata, into four classes: Very Trusted, Trusted, Untrusted and Very Untrusted. These data are collected from the online communities Genius and Stack Overflow. In order to evaluate short texts in terms of their trust levels, the authors conducted two investigations: (1) a natural language processing (NLP) approach to extract relevant features (i.e. part-of-speech tags and various readability indexes), for which the authors report relatively good performance; and (2) a machine learning technique, more precisely a random forest (RF) classifier using a bag-of-words (BoW) model.
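
As a rough illustration of the second investigation, a random forest over bag-of-words features can be sketched with scikit-learn as below. The toy texts and their trust labels are invented placeholders, not the authors' Genius or Stack Overflow data.

```python
# A minimal sketch, assuming invented toy data: RF classifier over
# bag-of-words features, mirroring investigation (2) in outline only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["detailed answer with working example and references",
         "click this link for free stuff",
         "short but accurate annotation of the lyric",
         "asdf qwerty random characters",
         "well sourced explanation with citations",
         "buy cheap followers now",
         "helpful hint, lacks sources",
         "nonsense filler text here"]
labels = ["Very Trusted", "Very Untrusted", "Trusted", "Untrusted",
          "Very Trusted", "Very Untrusted", "Trusted", "Untrusted"]

X = CountVectorizer().fit_transform(texts)  # bag-of-words features
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```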

Findings

The investigation of the RF classifier using BoW shows promising intermediate results (62% accuracy on average across both online communities) in identifying the short-text quality that leads to trust.

Practical implications

As social media becomes an increasingly attractive source of information, mostly provided in the form of short texts, businesses (e.g. search engines for smart data) can filter content without having to apply complex approaches and continue to work with information that is considered more trustworthy.

Originality/value

Short-text classification with regard to a criterion (e.g. quality, readability) is usually aided by an external source or by metadata. This enhancement either changes the original text, if additional text from an external source is appended, or it requires text metadata that is not always available. The originality of this study lies in investigating the quality of short text (i.e. social media text) without extending or modifying it using external sources, since such modification alters the text and distorts the results of the investigation.

Details

Journal of Enterprise Information Management, vol. 33 no. 6
Type: Research Article
ISSN: 1741-0398

Keywords

Article
Publication date: 29 April 2021

Heng-Yang Lu, Yi Zhang and Yuntao Du

Topic models have been widely applied to discover important information from vast amounts of unstructured data. Traditional long-text topic models such as Latent Dirichlet…

Abstract

Purpose

Topic models have been widely applied to discover important information from vast amounts of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, which mostly come from the Web, and they also suffer from a readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model, the Sense Unit based Phrase Topic Model (SenU-PTM), that addresses both the sparsity and readability problems.

Design/methodology/approach

SenU-PTM is a novel phrase-based short-text topic model built on a two-phase framework. The first phase introduces a phrase-generation algorithm that exploits word embeddings, aiming to generate phrases from the original corpus. The second phase introduces the new concept of a sense unit, a set of semantically similar tokens, for modeling topics with the token vectors generated in the first phase. Finally, SenU-PTM infers topics based on these two phases.
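
The SenU-PTM algorithms themselves are not reproduced here, but the sketch below illustrates the underlying idea of a sense unit: clustering token embeddings so that semantically similar tokens fall into the same group. The toy corpus and all hyperparameters are arbitrary assumptions.

```python
# Illustration only: approximate "sense units" by clustering word vectors.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

corpus = [["apple", "banana", "fruit"], ["car", "truck", "vehicle"],
          ["apple", "fruit", "juice"], ["car", "vehicle", "road"]]

model = Word2Vec(corpus, vector_size=32, window=2, min_count=1, seed=1)
tokens = model.wv.index_to_key
vectors = np.array([model.wv[t] for t in tokens])

# Each cluster is a set of (with real data, semantically similar) tokens,
# loosely analogous to the paper's sense units; on a corpus this tiny the
# embeddings are near-random, so treat the output as structural only.
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(vectors)
for c in range(2):
    print(c, [t for t, l in zip(tokens, labels) if l == c])
```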

Findings

Experimental results on two real-world and publicly available datasets show the effectiveness of SenU-PTM from the perspectives of topical quality and document characterization. It reveals that modeling topics on sense units can solve the sparsity of short texts and improve the readability of topics at the same time.

Originality/value

The originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.

Details

Data Technologies and Applications, vol. 55 no. 5
Type: Research Article
ISSN: 2514-9288

Keywords

Article
Publication date: 27 October 2020

Deepak Trehan and Rajat Sharma

The purpose of this paper is to test the relevance of the information quality (IQ) framework in understanding the quality of advertisements (ads) posted by ordinary consumers.

Abstract

Purpose

The purpose of this paper is to test the relevance of the information quality (IQ) framework in understanding the quality of advertisements (ads) posted by ordinary consumers.

Design/methodology/approach

The main objective of this study is to assess the quality of ads posted on customer-to-customer (C2C) social commerce platforms through an IQ framework. The authors deployed innovative text mining techniques to generate features from the IQ framework and then used a machine learning (ML) algorithm to classify ads into three categories: high quality, medium quality and low quality.
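
The authors' actual feature set is not given in the abstract, but the general pattern can be sketched as below: derive simple text features as crude proxies for IQ dimensions, then train a three-class model. The features, ads and labels are all hypothetical stand-ins.

```python
# Illustrative only: toy proxies for IQ dimensions, not the paper's features.
import numpy as np
from sklearn.linear_model import LogisticRegression

ads = ["iPhone 8, 64GB, light scratches, bill and box included, price 12k",
       "phone for sale",
       "Selling sofa set, 3+2, teak frame, 2 years old, minor wear"]
labels = ["high", "low", "medium"]

def features(text):
    words = text.split()
    return [len(words),                        # amount of information
            np.mean([len(w) for w in words]),  # crude specificity proxy
            text.count(",")]                   # structure/completeness cue

X = np.array([features(a) for a in ads])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([features("MacBook Pro 2019, 16GB RAM, charger included")]))
```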

Findings

The results show that not all dimensions of IQ framework are important to assess quality of ads posted on the platforms. Potential buyers on these platforms look for appropriate amount of information, which is objective, concise and complete, to make a potential purchase decision.

Research limitations/implications

As the research focuses on specific product categories, it lacks generalisability. Therefore, it needs to be tested for other product categories.

Practical implications

The paper includes recommendation for C2C marketplaces on how to increase quality of ads posted by consumers on the platform.

Originality/value

This study has focused on the user-generated content posted by ordinary consumers on C2C commerce platforms to sell used goods. Though the model was developed on ads posted on C2C platforms, it can also be applied by brands, as it provides insight into the latent dimensions that consumers look for in an ad on social commerce platforms.

Details

Online Information Review, vol. 45 no. 1
Type: Research Article
ISSN: 1468-4527

Keywords

Article
Publication date: 21 January 2019

Issa Alsmadi and Keng Hoon Gan

Rapid developments in social networks and their use in everyday life have caused an explosion in the number of short electronic documents. Thus, the need to classify this type…

Abstract

Purpose

Rapid developments in social networks and their use in everyday life have caused an explosion in the number of short electronic documents. The need to classify these documents into relevant classes according to their textual content therefore has significant implications for many applications. Short-text classification is an essential step in applications such as spam filtering, sentiment analysis, Twitter personalization, customer review analysis and many other applications related to social networks. Reviews of short text and its applications are limited. Thus, this paper aims to discuss the characteristics of short text and the challenges and difficulties of classifying it. The paper attempts to introduce all stages of the classification process, the techniques used in each stage and the possible development trends in each stage.

Design/methodology/approach

The paper is a review of the main aspects of short-text classification and is structured according to the stages of the classification task.

Findings

This paper discusses these issues and approaches to addressing them. Further research could be conducted to address the challenges of short texts and avoid poor classification accuracy. Low performance can be mitigated by optimisation techniques such as genetic algorithms, which are powerful at enhancing the quality of selected features (a sketch follows below), while soft-computing solutions such as fuzzy logic make short-text problems a promising area of research.
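
As a sketch of the genetic-algorithm route to feature selection mentioned above, the toy loop below evolves bitstring feature masks scored by cross-validated accuracy. The dataset, classifier and GA settings are arbitrary choices for illustration, not drawn from the paper.

```python
# A minimal GA feature-selection sketch under assumed toy settings.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

def fitness(mask):
    # cross-validated accuracy of the feature subset the bitstring selects
    if not mask.any():
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

pop = rng.random((12, X.shape[1])) < 0.5       # random bitstring population
for _ in range(10):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[scores.argsort()[-6:]]       # keep the fittest half
    cuts = rng.integers(1, X.shape[1], 6)
    children = np.array([np.concatenate((parents[i][:c],
                                         parents[(i + 1) % 6][c:]))
                         for i, c in enumerate(cuts)])  # one-point crossover
    children ^= rng.random(children.shape) < 0.02       # bit-flip mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```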

Originality/value

Using a powerful short-text classification method significantly improves the efficiency of many applications. Current solutions still perform poorly, implying the need for improvement. This paper discusses the related issues and approaches to these problems.

Details

International Journal of Web Information Systems, vol. 15 no. 2
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 14 October 2021

Roman Egger and Joanne Yu

Intrigued by the methodological challenges emerging from text complexity, the purpose of this study is to evaluate the effectiveness of different topic modelling algorithms based…

Abstract

Purpose

Intrigued by the methodological challenges emerging from text complexity, the purpose of this study is to evaluate the effectiveness of different topic modelling algorithms based on Instagram textual data.

Design/methodology/approach

By taking Instagram posts captioned with #darktourism as the study context, this research applies latent Dirichlet allocation (LDA), correlation explanation (CorEx), and non-negative matrix factorisation (NMF) to uncover tourist experiences.
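
For context, a minimal comparison of two of the three algorithms with scikit-learn might look like the sketch below; CorEx is distributed separately (e.g. the corextopic package) and is omitted here. The captions are invented stand-ins for the #darktourism posts.

```python
# A minimal LDA vs NMF sketch on invented captions, assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

captions = ["visiting chernobyl exclusion zone today",
            "auschwitz memorial a sobering experience",
            "ghost town pripyat urban exploration",
            "hiroshima peace memorial museum visit"]

def top_words(model, feature_names, n=3):
    # highest-weighted words per topic component
    return [[feature_names[i] for i in comp.argsort()[-n:]]
            for comp in model.components_]

bow = CountVectorizer(stop_words="english")
X_bow = bow.fit_transform(captions)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_bow)
print("LDA:", top_words(lda, bow.get_feature_names_out()))

tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(captions)
nmf = NMF(n_components=2, random_state=0).fit(X_tfidf)
print("NMF:", top_words(nmf, tfidf.get_feature_names_out()))
```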

Findings

CorEx outperforms LDA and NMF by classifying emerging dark sites and activities into 17 distinct topics. The results of LDA appear homogeneous and overlapping, whereas the extracted topics of NMF are not specific enough to gain deep insights.

Originality/value

This study assesses different topic modelling algorithms for knowledge extraction in the highly heterogeneous tourism industry. The findings unfold the complexity of analysing short-text social media data and strengthen the use of CorEx in analysing Instagram content.

Article
Publication date: 25 May 2012

Ingrid Utne, Lars Thuestad, Kaare Finbak and Tom Anders Thorstensen

The purpose of this paper is to present an approach for measuring the ability of oil and gas production plants to utilize shutdowns opportunistically for maintenance.

Abstract

Purpose

The purpose of this paper is to present an approach for measuring the ability of oil and gas production plants to utilize shutdowns opportunistically for maintenance.

Design/methodology/approach

Key performance indicators have been developed from case studies of two offshore oil and gas installations on the Norwegian Continental Shelf. The key performance indicators measure the quality of the work preparations and the ability to utilize shutdowns opportunistically. Shutdowns may provide opportunities for executing maintenance, but it is hardly possible to undertake any maintenance work requiring a shutdown if the organization is not well prepared and the work is not well planned.

Findings

The results from testing the indicators on two oil and gas installations show that several of the indicators are relevant for determining the quality of preparations, whereas more effort needs to be put into gathering data applicable for monitoring the actual utilization of the shutdowns.

Research limitations/implications

Production losses due to turnarounds and unforeseen shutdowns in oil and gas operations are significant, and the improvement potential is large. The indicators may assist maintenance managers in planning and improving the plant's utilization of shutdowns and may contribute to substantial cost savings.

Originality/value

The approach in the paper adds important knowledge on how to actually measure the quality of maintenance work planning and execution.

Article
Publication date: 20 December 2007

Isak Taksa, Sarah Zelikovitz and Amanda Spink

The work presented in this paper aims to provide an approach to classifying web logs by personal properties of users.

Abstract

Purpose

The work presented in this paper aims to provide an approach to classifying web logs by personal properties of users.

Design/methodology/approach

The authors describe an iterative system that begins with a small set of manually labeled terms, which are used to label queries from the log. A set of background knowledge related to these labeled queries is acquired by combining web search results on these queries. This background set is used to obtain many terms that are related to the classification task. The system then ranks each of the related terms, choosing those that most fit the personal properties of the users. These terms are then used to begin the next iteration.
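
A heavily simplified sketch of this iterative loop is given below: seed terms label queries, terms from the associated background text are counted and ranked, and the top-ranked terms join the seed set for the next round. The real system builds its background set from web search results; that step is stubbed out here as a hypothetical dictionary.

```python
# Illustrative bootstrapping loop only; data and background set are invented.
from collections import Counter

queries = ["pokemon cards trade", "retirement pension advice",
           "pokemon game cheats", "pension fund rates"]
background = {"pokemon": "game cards anime kids trade",
              "pension": "retirement savings fund rates"}

seeds = {"pokemon"}                      # manually labeled seed terms
for _ in range(2):                       # a couple of iterations
    labeled = [q for q in queries if seeds & set(q.split())]
    counts = Counter(w for q in labeled
                     for s in seeds if s in q
                     for w in background.get(s, "").split())
    # promote the most frequent related terms to the seed set
    seeds |= {w for w, _ in counts.most_common(2)}
print("expanded seed terms:", seeds)
```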

Findings

The authors identify the difficulties of classifying web logs by approaching the problem from a machine learning perspective. By applying the approach developed, the authors are able to show that many queries in a large query log can be classified.

Research limitations/implications

Testing results in this type of classification work is difficult, as the true personal properties of web users are unknown. Evaluating the classification results by comparing classified queries to well-known age-related sites is a direction currently being explored.

Practical implications

This research is background work that can be incorporated in search engines or other web‐based applications, to help marketing companies and advertisers.

Originality/value

This research enhances the current state of knowledge in short‐text classification and query log learning.

Details

International Journal of Web Information Systems, vol. 3 no. 4
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 22 October 2019

Ming Li, Lisheng Chen and Yingcheng Xu

A large number of questions are posted on community question answering (CQA) websites every day. Providing a set of core questions will ease the question overload problem. These…

Abstract

Purpose

A large number of questions are posted on community question answering (CQA) websites every day. Providing a set of core questions will ease the question overload problem. These core questions should cover the main content of the original question set, exhibit low redundancy and follow a distribution consistent with the original question set. The paper aims to discuss these issues.

Design/methodology/approach

In this paper, a method named QueExt for extracting core questions is proposed. First, questions are modeled using a biterm topic model. Then, these questions are clustered based on particle swarm optimization (PSO). Given the clustering results, the number of core questions to be extracted from each cluster can be determined. Afterwards, a multi-objective PSO algorithm is proposed to extract the core questions. Both PSO algorithms are integrated with operators from genetic algorithms to avoid local optima.
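
As a rough illustration of PSO-based clustering (without the paper's K-means++ initialization, genetic operators or multi-objective formulation), the sketch below encodes candidate centroids as particle positions and minimizes total distance to the nearest centroid on synthetic data; all settings are assumptions.

```python
# Illustrative PSO clustering sketch only, not the QueExt algorithm.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0, 1, (60, 2))
data[:30] += 4                           # two loose blobs
k, n_particles, dims = 2, 10, 2

def cost(centroids):
    # total distance of every point to its nearest candidate centroid
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1).sum()

pos = rng.uniform(data.min(), data.max(), (n_particles, k, dims))
vel = np.zeros_like(pos)
pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
gbest = pbest[pbest_cost.argmin()]

for _ in range(50):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    # standard velocity update: inertia + cognitive + social terms
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    costs = np.array([cost(p) for p in pos])
    better = costs < pbest_cost
    pbest[better], pbest_cost[better] = pos[better], costs[better]
    gbest = pbest[pbest_cost.argmin()]

print("best centroids:\n", gbest)
```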

Findings

Extensive experiments on real data collected from the well-known CQA website Zhihu have been conducted, and the experimental results demonstrate superior performance over benchmark methods.

Research limitations/implications

The proposed method provides new insight into, and enriches research on, information overload in CQA. It performs better than other methods at extracting core short-text documents and thus provides a better way to extract core data. The use of PSO for selecting core questions is novel and expands research on the application of the PSO model. The study also contributes to research on PSO-based clustering: with the integration of K-means++, the key parameter, the number of clusters, is optimized.

Originality/value

A novel core-question extraction method for CQA is proposed, providing an efficient way to alleviate question overload. The PSO model is extended and newly applied to selecting core questions, and it is integrated with the K-means++ method to optimize the number of clusters, the key parameter in PSO-based text clustering. This provides a new way to cluster texts.

Details

Data Technologies and Applications, vol. 53 no. 4
Type: Research Article
ISSN: 2514-9288

Keywords

Article
Publication date: 1 May 1993

Edmond Lassalle

The use of an information retrieval (IR) system would be easier if natural language processing were applied. There are essentially two different ways to use NLP techniques: as a…

Abstract

The use of an information retrieval (IR) system would be easier if natural language processing were applied. There are essentially two different ways to use NLP techniques: as a user interface coupled with a factual database, or as an integrated part of a system which deals with a textual database. In this paper, two approaches are presented, that of MGS, a commercialized system in use in France Télécom, and that of Telmi, a France Télécom research system. Telmi is an information retrieval system designed for use with medium sized databases of short text. The characteristics of the system include fine‐grained NLP, an open domain and large scale knowledge base, automated indexing based on conceptual representation of texts, and reusability of the NLP tools. The knowledge base is (semi) automatically extracted from a monolingual machine‐readable dictionary (MRD). Telmi is integrated into a production‐scale prototype which implements a Minitel Information Service (IS) for the use of the general public. France Télécom Minitel(i) and its problems are described, along with the solutions Telmi offers. The paper then goes on to describe how France Télécom intends to reuse, in a continuation of the present project, the Telmi tools in a multilingual system, particularly in (semi)automatic data acquisition from multilingual MRDs.

Details

Aslib Proceedings, vol. 45 no. 5
Type: Research Article
ISSN: 0001-253X

Article
Publication date: 3 December 2018

Cong-Phuoc Phan, Hong-Quang Nguyen and Tan-Tai Nguyen

Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industry. To maximally exploit their…

Abstract

Purpose

Large collections of patent documents disclosing novel, non-obvious technologies are publicly available and beneficial to academia and industry. To maximally exploit their potential, searching these patent documents has become an increasingly important topic. Although much research has processed large collections, few studies have attempted to integrate both patent classifications and specifications when analyzing user queries. Consequently, queries are often insufficiently analyzed, limiting the accuracy of search results. This paper aims to address that limitation by exploiting semantic relationships between patent contents and their classification.

Design/methodology/approach

The contributions are fourfold. First, the authors enhance the similarity measurement between two short sentences, making it 20 per cent more accurate. Second, the Graph-embedded Tree ontology is enriched by integrating both the patent documents and the classification scheme. Third, the ontology does not rely on rule-based methods or text matching; instead, a heuristic meaning comparison is applied to extract semantic relationships between concepts. Finally, the patent search approach uses the ontology effectively, with the results sorted by their most common order.
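
The enhanced similarity measure itself is not spelled out in the abstract; as a point of reference, the common TF-IDF cosine baseline that such work typically improves on looks like the sketch below, with two invented patent-style sentences.

```python
# Baseline only, not the authors' enhanced measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "a method for transporting refrigerated goods"
s2 = "an apparatus to ship chilled cargo"
X = TfidfVectorizer().fit_transform([s1, s2])
# These paraphrases share no tokens, so the lexical baseline scores them 0.0,
# which is exactly the weakness that semantic similarity measures address.
print("similarity:", cosine_similarity(X[0], X[1])[0, 0])
```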

Findings

An experiment searching 600 patent documents in the field of logistics yields a 15 per cent improvement in F-measure compared with traditional approaches.

Research limitations/implications

The research, however, still requires improvement: the extracted noun and noun-phrase terms sometimes make little sense and thus might not yield high accuracy. The large collection of extracted relationships could be further optimized for conciseness. In addition, parallel processing such as MapReduce could be used to improve search performance.

Practical implications

The experimental results could be used by scientists and technologists to search for novel, non-obvious technologies in patents.

Social implications

High-quality patent search results will reduce patent infringement.

Originality/value

The proposed ontology is semantically enriched by integrating both patent documents and their classification. This ontology facilitates the analysis of the user queries for enhancing the accuracy of the patent search results.

Details

International Journal of Web Information Systems, vol. 15 no. 3
Type: Research Article
ISSN: 1744-0084

Keywords

1 – 10 of 435