Search results

1–10 of over 17,000
Article
Publication date: 14 August 2017

Sudeep Thepade, Rik Das and Saurav Ghosh


Abstract

Purpose

Current practices in data classification and retrieval have experienced a surge in the use of multimedia content. Identifying desired information in huge image databases involves increasing complexity in designing an efficient feature extraction process. Conventional approaches to image classification based on text annotation face assorted limitations owing to erroneous interpretation of vocabulary and the huge time consumption of manual annotation. Content-based image recognition has emerged as an alternative to combat these limitations. However, exploring the rich feature content of an image with a single technique is less likely to extract meaningful signatures than multi-technique feature extraction. Therefore, the purpose of this paper is to explore the possibilities of enhanced content-based image recognition by fusing classification decisions obtained using diverse feature extraction techniques.

Design/methodology/approach

Three novel techniques of feature extraction have been introduced in this paper and have been tested with four different classifiers individually. The four classifiers used for performance testing were the K nearest neighbor (KNN) classifier, the RIDOR classifier, an artificial neural network classifier and a support vector machine classifier. Thereafter, classification decisions obtained using the KNN classifier for the different feature extraction techniques have been integrated by Z-score normalization and feature scaling to create a fusion-based framework of image recognition. It has been followed by the introduction of a fusion-based retrieval model to validate the retrieval performance with classified queries. Earlier works on content-based image identification have adopted a fusion-based approach. However, to the best of the authors' knowledge, fusion-based query classification has been addressed for the first time as a precursor of retrieval in this work.
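The decision-fusion step described above can be sketched as follows: per-technique classification scores are Z-score normalized and summed before the final class decision is taken. This is a minimal illustrative sketch under assumed inputs, not the authors' implementation; the score matrices and the function name are invented for the example.

```python
import numpy as np

def zscore_fuse(score_matrices):
    """Fuse per-class score matrices produced with different feature
    extraction techniques: Z-score normalize each matrix, then sum
    and take the arg-max class per sample (hypothetical scheme)."""
    fused = np.zeros_like(score_matrices[0], dtype=float)
    for scores in score_matrices:
        fused += (scores - scores.mean()) / scores.std()
    return fused.argmax(axis=1)

# Scores for 3 query images over 2 classes from two hypothetical techniques
a = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
b = np.array([[0.7, 0.3], [0.1, 0.9], [0.3, 0.7]])
labels = zscore_fuse([a, b])  # fused classification decision per image
```

Normalizing before summing keeps one technique's larger score range from dominating the fused decision.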

Findings

The proposed fusion techniques have successfully outclassed the state-of-the-art techniques in classification and retrieval performance. Four public data sets, namely the Wang, Oliva and Torralba (OT-scene), Corel and Caltech data sets, comprising 22,615 images in total, are used for evaluation.

Originality/value

To the best of the authors’ knowledge, fusion-based query classification has been addressed for the first time as a precursor of retrieval in this work. The novel idea of exploring rich image features by fusion of multiple feature extraction techniques has also encouraged further research on dimensionality reduction of feature vectors for enhanced classification results.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 10 no. 3
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 3 August 2021

Chuanming Yu, Haodong Xue, Manyi Wang and Lu An


Abstract

Purpose

Owing to the uneven distribution of annotated corpus among different languages, it is necessary to bridge the gap between low resource languages and high resource languages. From the perspective of entity relation extraction, this paper aims to extend the knowledge acquisition task from a single language context to a cross-lingual context, and to improve the relation extraction performance for low resource languages.

Design/methodology/approach

This paper proposes a cross-lingual adversarial relation extraction (CLARE) framework, which decomposes cross-lingual relation extraction into parallel corpus acquisition and adversarial adaptation relation extraction. Based on the proposed framework, this paper conducts extensive experiments in two tasks, i.e. the English-to-Chinese and the English-to-Arabic cross-lingual entity relation extraction.

Findings

The Macro-F1 values of the optimal models in the two tasks are 0.8801 and 0.7899, respectively, indicating that the proposed CLARE framework can significantly improve entity relation extraction for low resource languages. The experimental results suggest that the proposed framework can effectively transfer the corpus as well as the annotated tags from English to Chinese and Arabic. This study reveals that the proposed approach is less labour intensive and more effective in cross-lingual entity relation extraction than the manual method, and that it generalizes well across different languages.

Originality/value

The research results are of great significance for improving the performance of the cross-lingual knowledge acquisition. The cross-lingual transfer may greatly reduce the time and cost of the manual construction of the multi-lingual corpus. It sheds light on the knowledge acquisition and organization from the unstructured text in the era of big data.

Details

The Electronic Library, vol. 39 no. 3
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 27 July 2022

Svetlozar Nestorov, Dinko Bačić, Nenad Jukić and Mary Malliaris


Abstract

Purpose

The purpose of this paper is to propose an extensible framework for extracting data set usage from research articles.

Design/methodology/approach

The framework uses a training set of manually labeled examples to identify word features surrounding data set usage references. Using the word features and general entity identifiers, candidate data sets are extracted and scored separately at the sentence and document levels. Finally, the extracted data set references can be verified by the authors using a web-based verification module.
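A rough sketch of the sentence- and document-level scoring described above might look like the following; the feature words and the aggregation rule are invented for illustration, not taken from the paper.

```python
def sentence_score(sentence, feature_words):
    """Score a candidate-bearing sentence by the fraction of its tokens
    that are known context features around data set mentions."""
    tokens = sentence.lower().split()
    return sum(t in feature_words for t in tokens) / len(tokens)

def document_score(sentence_scores):
    """Aggregate sentence-level scores into a document-level score;
    here simply the best sentence's score."""
    return max(sentence_scores) if sentence_scores else 0.0

# Hypothetical word features learned from manually labeled examples
features = {"dataset", "corpus", "collected"}
s = sentence_score("We collected the corpus from Twitter", features)
doc = document_score([0.1, s])
```

Scoring at both levels lets a document with one strongly feature-rich sentence outrank a document with many weak mentions.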

Findings

This paper successfully addresses a significant gap in entity extraction literature by focusing on data set extraction. In the process, this paper: identified an entity-extraction scenario with specific characteristics that enable a multiphase approach, including a feasible author-verification step; defined the search space for word feature identification; defined scoring functions for sentences and documents; and designed a simple web-based author verification step. The framework is successfully tested on 178 articles authored by researchers from a large research organization.

Originality/value

Whereas previous approaches focused on completely automated large-scale entity recognition from text snippets, the proposed framework is designed for longer, high-quality texts, such as research publications. The framework includes a verification module that lets the authors of the research publications validate the discovered entities on request. This module shares some similarities with general crowdsourcing approaches, but the target scenario increases the likelihood of meaningful author participation.

Article
Publication date: 19 June 2019

Prafulla Bafna, Dhanya Pramod, Shailaja Shrwaikar and Atiya Hassan


Abstract

Purpose

Document management is growing in importance in proportion to the growth of unstructured data, and its applications are expanding from process benchmarking to customer relationship management and beyond. The purpose of this paper is to improve two important components of document management, keyword extraction and document clustering, through knowledge extraction by updating the phrase document matrix. The objective is to manage documents by extending the phrase document matrix and achieve refined clusters. The study achieves consistent cluster quality in spite of the increasing size of the data set. The domain independence of the proposed method is tested and compared with other methods.

Design/methodology/approach

In this paper, a synset-based phrase document matrix construction method is proposed in which semantically similar phrases are grouped to mitigate the curse of dimensionality. A large collection of documents typically includes some documents closely related to the topic of interest, known as model documents, as well as documents that deviate from the topic; these non-relevant documents may affect cluster quality. The first step in knowledge extraction from unstructured textual data is converting it into a structured form, either as a term frequency-inverse document frequency (TF-IDF) matrix or as a phrase document matrix. Once in structured form, a range of mining algorithms, from classification to clustering, can be applied.
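The conversion into a TF-IDF matrix mentioned above can be sketched as follows. This is the textbook term-document construction, shown for orientation only; it is not the paper's synset-based phrase variant, and the toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a document-term TF-IDF matrix from tokenized documents.
    Returns (vocab, rows), where rows[i][j] weights term vocab[j] in doc i."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append([tf[t] / len(d) * math.log(n / df[t]) for t in vocab])
    return vocab, rows

vocab, rows = tfidf_matrix([["phrase", "matrix"], ["matrix", "cluster"]])
```

Terms appearing in every document (here "matrix") get a zero IDF weight, which is what pushes clustering toward discriminative vocabulary.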

Findings

In the enhanced approach, the model documents are used to extract key phrases with synset groups, whereas the other documents participate in the construction of the feature matrix. It gives a better feature vector representation and improved cluster quality.

Research limitations/implications

Various applications that require managing of unstructured documents can use this approach by specifically incorporating the domain knowledge with a thesaurus.

Practical implications

Experiment pertaining to the academic domain is presented that categorizes research papers according to the context and topic, and this will help academicians to organize and build knowledge in a better way. The grouping and feature extraction for resume data can facilitate the candidate selection process.

Social implications

Applications such as knowledge management, clustering of search engine results and various recommender systems (hotel recommenders, task recommenders and so on) will benefit from this study. Hence, the study contributes to improving document management in business domains or areas of interest to users from various strata of society.

Originality/value

The study proposed an improvement to document management approach that can be applied in various domains. The efficacy of the proposed approach and its enhancement is validated on three different data sets of well-articulated documents from data sets such as biography, resume and research papers. These results can be used for benchmarking further work carried out in these areas.

Details

Benchmarking: An International Journal, vol. 26 no. 6
Type: Research Article
ISSN: 1463-5771

Article
Publication date: 6 November 2017

Ngurah Agus Sanjaya Er, Mouhamadou Lamine Ba, Talel Abdessalem and Stéphane Bressan


Abstract

Purpose

This paper aims to focus on the design of algorithms and techniques for effective set expansion. A tool that finds and extracts candidate sets of tuples from the World Wide Web was designed and implemented. For instance, when a user provides <Indonesia, Jakarta, Indonesian Rupiah>, <China, Beijing, Yuan Renminbi>, <Canada, Ottawa, Canadian Dollar> as seeds, the system returns tuples composed of countries with their corresponding capital cities and currency names, constructed from content extracted from the retrieved Web pages.

Design/methodology/approach

The seeds are used to query a search engine and to retrieve relevant Web pages. The seeds are also used to infer wrappers from the retrieved pages. The wrappers, in turn, are used to extract candidates. The Web pages, wrappers, seeds and candidates, as well as their relationships, are vertices and edges of a heterogeneous graph. Several options for ranking candidates from PageRank to truth finding algorithms were evaluated and compared. Remarkably, all vertices are ranked, thus providing an integrated approach to not only answer direct set expansion questions but also find the most relevant pages to expand a given set of seeds.
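The idea of ranking every vertex in the heterogeneous graph can be illustrated with a plain PageRank power iteration. The tiny graph below is a hypothetical stand-in (one page, wrapper, seed and candidate); the truth finding algorithms the paper also evaluates are not shown.

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [neighbours]}.
    Pages, wrappers, seeds and candidates are all ranked as ordinary vertices."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in adj}
        for v, nbrs in adj.items():
            for m in nbrs:  # each node shares its rank among its neighbours
                new[m] += damping * rank[v] / len(nbrs)
        rank = new
    return rank

adj = {"page": ["wrapper", "candidate"], "wrapper": ["candidate"],
       "candidate": ["page"], "seed": ["page"]}
ranks = pagerank(adj)
```

Because all vertex types live in one graph, the same ranking answers both "which candidates to keep" and "which pages best support further expansion".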

Findings

The experimental results show that leveraging the truth finding algorithm can indeed improve the level of confidence in the extracted candidates and the sources.

Originality/value

Current approaches to set expansion mostly support the expansion of sets of atomic data. This idea is extended here to sets of tuples, extracting relation instances from the Web given a handful of tuple seeds. A truth finding algorithm is also incorporated into the approach, and it is shown to improve the confidence level in the ranking of both candidates and sources in set-of-tuples expansion.

Details

International Journal of Web Information Systems, vol. 13 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 21 January 2019

Issa Alsmadi and Keng Hoon Gan


Abstract

Purpose

Rapid developments in social networks and their use in everyday life have caused an explosion in the amount of short electronic documents, so the need to classify this type of document based on its content has significant implications for many applications. Classifying these documents into relevant classes according to their text content matters for many practical reasons: short-text classification is an essential step in applications such as spam filtering, sentiment analysis, Twitter personalization, customer reviews and many other applications related to social networks. Reviews of short text and its applications are limited. Thus, this paper aims to discuss the characteristics of short text and the challenges and difficulties of its classification. The paper attempts to introduce all stages of a principled classification pipeline, the techniques used in each stage and the possible development trends in each stage.

Design/methodology/approach

The paper is a review of the main aspects of short-text classification and is structured according to the stages of the classification task.

Findings

This paper discusses related issues and approaches to these problems. Further research could be conducted to address the challenges of short texts and avoid poor classification accuracy. Low performance can be addressed with optimized solutions, such as genetic algorithms, which are powerful in enhancing the quality of selected features. Soft computing solutions, such as fuzzy logic, also make short-text problems a promising area of research.

Originality/value

Using a powerful short-text classification method significantly affects many applications in terms of efficiency enhancement. Current solutions still have low performance, implying the need for improvement. This paper discusses related issues and approaches to these problems.

Details

International Journal of Web Information Systems, vol. 15 no. 2
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 26 July 2021

Pengcheng Li, Qikai Liu, Qikai Cheng and Wei Lu


Abstract

Purpose

This paper aims to identify data set entities in scientific literature. To address poor recognition caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain.

Design/methodology/approach

Firstly, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus to apply supervised learning. Secondly, a bidirectional encoder representation from transformers (BERT)-based neural model was applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, were introduced to enhance the model generalisability and improve the recognition of data set entities.
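The two augmentation techniques might look roughly like this on a token-labelled training example; the tag scheme, placeholder token, data set names and masking probability are illustrative assumptions, not details from the paper.

```python
import random

def augment(tokens, tags, dataset_names, p_mask=0.5, rng=None):
    """Augment a NER training example: each token tagged DATASET is
    either masked with a generic placeholder (entity masking) or
    swapped for another known data set name (entity replacement)."""
    rng = rng or random.Random(0)
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "DATASET":
            out.append(tok)
        elif rng.random() < p_mask:
            out.append("[DATASET]")                # entity masking
        else:
            out.append(rng.choice(dataset_names))  # entity replacement
    return out

toks, tags = ["trained", "on", "ImageNet"], ["O", "O", "DATASET"]
masked = augment(toks, tags, ["CIFAR-10"], p_mask=1.0)
replaced = augment(toks, tags, ["CIFAR-10"], p_mask=0.0)
```

Both variants keep the surrounding context intact, which is what pushes the model to rely on context rather than memorized entity names, helping with long-tailed entities.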

Findings

In the absence of training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition.

Originality/value

This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors’ knowledge, this is the first attempt to apply distant learning to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address the problem inherent in distant supervised learning methods, which the existing research has mostly ignored. The experimental results demonstrate that our approach effectively improves the recognition of data set entities, especially long-tailed data set entities.

Article
Publication date: 9 January 2024

Bülent Doğan, Yavuz Selim Balcioglu and Meral Elçi


Abstract

Purpose

This study aims to elucidate the dynamics of social media discourse during global health events, specifically investigating how users across different platforms perceive, react to and engage with information concerning such crises.

Design/methodology/approach

A mixed-method approach was employed, combining both quantitative and qualitative data collection. Initially, thematic analysis was applied to a data set of social media posts across four major platforms over a 12-month period. This was followed by sentiment analysis to discern the predominant emotions embedded within these communications. Statistical tools were used to validate findings, ensuring robustness in the results.

Findings

The results showcased discernible thematic and emotional disparities across platforms. While some platforms leaned toward factual information dissemination, others were rife with user sentiments, anecdotes and personal experiences. Overall, a global sense of concern was evident, but the ways in which this concern manifested varied significantly between platforms.

Research limitations/implications

The primary limitation is the potential non-representativeness of the sample, as only four major social media platforms were considered. Future studies might expand the scope to include emerging platforms or non-English language platforms. Additionally, the rapidly evolving nature of social media discourse implies that findings might be time-bound, necessitating periodic follow-up studies.

Practical implications

Understanding the nature of discourse on various platforms can guide health organizations, policymakers and communicators in tailoring their messages. Recognizing where factual information is required, versus where sentiment and personal stories resonate, can enhance the efficacy of public health communication strategies.

Social implications

The study underscores the societal reliance on social media for information during crises. Recognizing the different ways in which communities engage with, and are influenced by, platform-specific discourse can help in fostering a more informed and empathetic society, better equipped to handle global challenges.

Originality/value

This research is among the first to offer a comprehensive, cross-platform analysis of social media discourse during a global health event. By comparing user engagement across platforms, it provides unique insights into the multifaceted nature of public sentiment and information dissemination during crises.

Details

Kybernetes, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 9 October 2017

Jin Zhang, Ming Ren, Xian Xiao and Jilong Zhang


Abstract

Purpose

The purpose of this paper is to find a representative subset of large-scale online reviews for consumers. The subset is significantly smaller in size, yet covers the majority of the information in the original reviews and contains little redundant information.

Design/methodology/approach

A heuristic approach named RewSel is proposed to successively select representatives until the number of representatives meets the requirement. To reveal the advantages of the approach, extensive data experiments and a user study are conducted on real data.
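A greedy selection in the spirit of RewSel can be sketched with relative entropy between the word distribution of the chosen subset and that of the whole collection. The scoring below is a simplification assumed for illustration (unigram distributions only, no sentiment component), not the published RewSel algorithm.

```python
import math
from collections import Counter

def distribution(reviews):
    """Unigram distribution over the tokens of a list of reviews."""
    counts = Counter(t for r in reviews for t in r)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl(p, q, eps=1e-9):
    """Relative entropy D(p || q) over p's support, smoothed with eps."""
    return sum(pv * math.log((pv + eps) / (q.get(w, 0.0) + eps))
               for w, pv in p.items())

def select_representatives(reviews, k):
    """Greedily add the review whose inclusion brings the subset's
    word distribution closest to the full collection's distribution."""
    target = distribution(reviews)
    chosen, remaining = [], list(range(len(reviews)))
    for _ in range(min(k, len(reviews))):
        best = min(remaining, key=lambda i: kl(
            target, distribution([reviews[j] for j in chosen + [i]])))
        chosen.append(best)
        remaining.remove(best)
    return chosen

reviews = [["good", "battery"], ["good", "battery"], ["bad", "screen"]]
picked = select_representatives(reviews, 2)
```

On this toy collection the second pick skips the duplicate review, illustrating how a coverage objective naturally suppresses redundancy.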

Findings

The proposed approach outperforms the benchmarks in terms of coverage and redundancy. In the user study, people showed a preference for the representative subsets provided by RewSel. The proposed approach also has good scalability and is well suited to big data applications.

Research limitations/implications

The paper contributes to the literature of review selection, by proposing a heuristic approach which achieves both high coverage and low redundancy. This study can be applied as the basis for conducting further analysis of large-scale online reviews.

Practical implications

The proposed approach offers a novel way to select a representative subset of online reviews to facilitate consumer decision making. It can also enhance the existing information retrieval system to provide representative information to users rather than a large amount of results.

Originality/value

The proposed approach finds the representative subset by adopting the concept of relative entropy and sentiment analysis methods. Compared with state-of-the-art approaches, it offers a more effective and efficient way for users to handle a large amount of online information.

Details

Online Information Review, vol. 41 no. 6
Type: Research Article
ISSN: 1468-4527

Article
Publication date: 22 December 2023

Ali Ahmed Albinali, Russell Lock and Iain Phillips


Abstract

Purpose

This study aims to examine the challenges that hinder small- and medium-sized enterprises (SMEs) from using open data (OD). The research gaps identified are then used to propose a next-generation OD platform (ODP+).

Design/methodology/approach

This study proposes a more effective platform for SMEs called ODP+. A proof of concept was implemented by using modern techniques and technologies, with a pilot conducted among selected SMEs and government employees to test the approach’s viability.

Findings

The findings show that current OD platforms, both generally and in Gulf Cooperation Council (GCC) countries, face several difficulties, including data sets that are complex to understand and whose potential for reuse is hard to determine. The artefacts that have been developed demonstrate how big data analytics can mitigate the identified challenges.

Research limitations/implications

This paper discusses several challenges that must be addressed to ensure that OD is accessible, helpful and of high quality in the future when planning and implementing OD initiatives.

Practical implications

The proposed ODP+ integrates social network data, SME data sets and government databases. It will give SMEs a platform for combining data from government agencies, third parties and social networks to carry out complex analytical scenarios or build the needed application using artificial intelligence.

Social implications

The findings promote the potential future utilisation of OD and suggest ways to give users access to knowledge and features.

Originality/value

To the best of the authors' knowledge, no study provides extensive research on OD in Qatar or the GCC. Further, the proposed ODP+ is a new platform that allows SMEs to run natural language data analytics queries.

Details

Transforming Government: People, Process and Policy, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1750-6166
