Search results

1 – 10 of over 39000
Article
Publication date: 4 September 2009

M.R. Davarpanah, M. Sanji and M. Aramideh

Abstract

Purpose

The purpose of this article is to present an aggregated methodology for the construction of a stop word list in the Farsi language and to generate a generic Farsi stop word list.

Design/methodology/approach

The stop word list is extracted from four sources: syntactic classes, domain-dependent knowledge, corpus statistics and expert judgment. Some of the main challenges that arise in Farsi automatic text processing are outlined as well.
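
To make the corpus-statistics step concrete, the sketch below (a minimal illustration, not the authors' implementation) ranks words by document frequency, keeps the most frequent ones as candidates, and merges them with lists from the other sources, retaining only expert-approved entries; the threshold, whitespace tokenization and expert-approval step are illustrative assumptions.

```python
# Minimal sketch: corpus-statistics stop word candidates merged with lists
# from other sources. Thresholds and tokenization are illustrative only.
from collections import Counter

def frequency_candidates(documents, top_n=200):
    """Return the top_n words by document frequency as stop word candidates."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.lower().split()))
    return {word for word, _ in doc_freq.most_common(top_n)}

def aggregate_stop_words(corpus_candidates, syntactic_list, domain_list, expert_approved):
    """Combine candidates from all sources; keep only expert-approved words."""
    merged = corpus_candidates | set(syntactic_list) | set(domain_list)
    return sorted(word for word in merged if word in expert_approved)
```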

Findings

Results from the techniques are aggregated and a general Farsi stop word list containing 927 words is generated.

Practical implications

The created stop word list can affect the efficiency and effectiveness of the retrieval and indexing process in a Farsi information retrieval system; moreover, it can play an important role in Farsi text segmentation.

Originality/value

Our stop word extraction algorithm is a promising technique; it could be applied to other languages that have ambiguities in automatic text segmentation.

Details

Library Hi Tech, vol. 27 no. 3
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 1 December 2001

D.T. Tomov

Abstract

A semantic analysis of the “Weekly Subject Index Stop Word List” of Current Contents of the Institute for Scientific Information (ISI), as well as of the full‐stop word and semi‐stop word lists of the Permuterm Subject Index of the Science Citation Index, was carried out. Selected terms from the first issues for 1997, 1999 and 2000 of the CCODAb/Life Sciences, from the first issues for 1997 and 2000 of CCOD Proceedings, as well as from the SCI CDE for 1997 and January‐June of 2000 were screened. True full‐stop and semi‐stop words commonly occur in the dictionaries of these databases, which proves that there is an abundance of meaningless terms in titles and abstracts. On the other hand, many synonyms and antonyms are absent from these lists. Proper list enlarging could contribute to more effective preparation of both printed reference publications and large databases, thus ensuring more economical information retrieval by practical users and scientometricians. The necessity of an improved, semantically oriented policy in preparing the lists of full‐stop words and semi‐stop words used in modern databases worldwide is emphasised. Journal editors should encourage authors to reduce stop word usage in article titles and keyword sets.

Details

Journal of Documentation, vol. 57 no. 6
Type: Research Article
ISSN: 0022-0418

Book part
Publication date: 18 January 2023

Shane W. Reid, Aaron F. McKenny and Jeremy C. Short

Abstract

A growing body of research outlines how to best facilitate and ensure methodological rigor when using dictionary-based computerized text analyses (DBCTA) in organizational research. However, these best practices are currently scattered across several methodological and empirical manuscripts, making it difficult for scholars new to the technique to implement DBCTA in their own research. To better equip researchers looking to leverage this technique, this methodological report consolidates current best practices for applying DBCTA into a single, practical guide. In doing so, we provide direction regarding how to make key design decisions and identify valuable resources to help researchers from the beginning of the research process through final publication. Consequently, we advance DBCTA methods research by providing a one-stop reference for novices and experts alike concerning current best practices and available resources.
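
As a minimal illustration of the core DBCTA operation, the sketch below counts how often words from a researcher-defined dictionary occur in each document and normalizes by document length; the toy dictionary and the normalization choice are assumptions for illustration, not prescriptions from the guide.

```python
# Minimal DBCTA sketch: share of tokens in each document that match a
# researcher-defined dictionary. Dictionary and normalization are toy choices.
import re

def dbcta_scores(documents, dictionary):
    """Return, per document, the share of tokens that match the dictionary."""
    dictionary = {w.lower() for w in dictionary}
    scores = []
    for text in documents:
        tokens = re.findall(r"[a-z']+", text.lower())
        hits = sum(1 for tok in tokens if tok in dictionary)
        scores.append(hits / len(tokens) if tokens else 0.0)
    return scores

# Example: a toy "optimism" dictionary applied to two short texts.
print(dbcta_scores(["We expect strong growth", "Results were disappointing"],
                   {"strong", "growth", "improve"}))
```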

Details

Automated Information Retrieval: Theory and Methods
Type: Book
ISBN: 978-0-12266-170-9

Article
Publication date: 20 June 2018

Ramzi A. Haraty and Rouba Nasrallah

Abstract

Purpose

The purpose of this paper is to propose a new model to enhance auto-indexing of Arabic texts. The model extracts new relevant words by relating those chosen by previous classical methods to new words using data mining rules.

Design/methodology/approach

The proposed model uses an association rule algorithm for extracting frequent sets of related items, in order to extract relationships between words in the texts to be indexed and words from texts that belong to the same category. The extracted word associations are represented as sets of words that frequently appear together.
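
The sketch below illustrates the frequent-set idea in a simplified, pairwise form: it counts word pairs that co-occur in documents of the same category and keeps pairs whose support exceeds a threshold. It is a stand-in for a full association rule algorithm such as Apriori, and the support threshold is an illustrative assumption.

```python
# Simplified frequent-set mining: word pairs co-occurring in documents of the
# same category, kept if their support exceeds a (toy) threshold.
from collections import Counter
from itertools import combinations

def frequent_word_pairs(documents, min_support=0.2):
    """Return word pairs appearing together in at least min_support of documents."""
    pair_counts = Counter()
    for text in documents:
        words = sorted(set(text.lower().split()))
        pair_counts.update(combinations(words, 2))
    threshold = min_support * len(documents)
    return {pair: count for pair, count in pair_counts.items() if count >= threshold}
```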

Findings

The proposed methodology shows significant enhancement in terms of accuracy, efficiency and reliability when compared to previous works.

Research limitations/implications

The stemming algorithm can be further enhanced. The Arabic language has many grammatical rules, and the more of these rules are integrated into the stemming algorithm, the better the stemming will be. The stop-list can also be enhanced by adding more words that should not be taken into consideration in the indexing mechanism; numbers should be added to the list as well. Using a thesaurus system, which links different phrases or words with the same meaning to each other, would further improve the indexing mechanism. The authors also invite researchers to add more prerequisite texts to obtain better results.

Originality/value

In this paper, the authors present a full text-based auto-indexing method for Arabic text documents. The auto-indexing method extracts new relevant words by using data mining rules, which has not been investigated before. The method uses an association rule mining algorithm for extracting frequent sets containing related items to extract relationships between words in the texts to be indexed with words from texts that belong to the same category. The benefits of the method are demonstrated using empirical work involving several Arabic texts.

Details

Library Hi Tech, vol. 37 no. 1
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 7 August 2017

Wen Zeng, Changqing Yao and Hui Li

Abstract

Purpose

Science and technology policy plays an important role in promoting economic and social development in China. At present, research on science and technology policy is mainly focused on basic theories and some quantitative studies; analyses of the content of massive numbers of science and technology policies are relatively scarce. This paper makes use of semantic technologies to extract and analyze the relatively important information from massive science and technology policies. The purpose of this paper is to enable users to obtain valuable information from massive science and technology policies quickly and effectively. The key methods and study results are presented in the paper; they can provide references for further study and application in China.

Design/methodology/approach

The paper presents an analysis model and method for science and technology policy in China. Terms and sentences are the important information units in science and technology policy. The study adopted natural language processing techniques to analyze the linguistic characteristics of terms and combined them with statistical analyses to extract terms from Chinese science and technology policy. The authors then designed an algorithm to calculate and analyze the important sentences in Chinese science and technology policies. The experiments were run on a Java test platform.
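
Since the abstract does not specify the sentence-importance algorithm, the sketch below shows one plausible reading: weight each extracted term by the number of sentences it appears in and score each sentence by the summed weights of the terms it contains; this heuristic is an assumption, not the authors' method.

```python
# Assumed sentence-importance heuristic: term weight = number of sentences
# containing the term; sentence score = sum of weights of terms it contains.
from collections import Counter

def rank_sentences(sentences, extracted_terms, top_k=3):
    """Return the top_k sentences ranked by summed term weights."""
    term_weight = Counter()
    for sentence in sentences:
        term_weight.update(t for t in extracted_terms if t in sentence)
    scored = [(sum(term_weight[t] for t in extracted_terms if t in sentence), sentence)
              for sentence in sentences]
    return [sentence for _, sentence in sorted(scored, reverse=True)[:top_k]]
```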

Findings

This paper puts forward an analysis model and method for science and technology policy in China. The study reached the following conclusions. Term extraction for science and technology policy: the paper analyzed the characteristics of terms in Chinese science and technology policy and designed a term extraction method suitable for such policy texts. Calculation of important sentences for science and technology policy: the paper designed an algorithm that calculates the importance of sentences in order to obtain valuable information from massive science and technology policies.

Research limitations/implications

The methods have some defects to be improved or solved in the future; for example, the precision of the algorithm needs to be improved. The significance of this paper is to propose and use the analysis model to process Chinese science and technology policy, thereby providing an auxiliary tool to help policy beneficiaries. Enterprises and individuals can extract and mine information from massive science and technology policies more effectively and find the target policy.

Practical implications

To verify the effectiveness of the method, the paper selected real policies about new energy vehicles as experimental data and, at the same time, added uncorrelated policies. The proposed analysis model of science and technology policy was used to calculate and identify the relatively important sentences. The results showed that the proposed method obtains better performance, which verifies the validity of the method. The model and method have been applied to an actual retrieval system.

Social implications

The proposed model and method in the paper have been applied to an actual retrieval system for users.

Originality/value

The paper proposes a new analysis model and method for analyzing science and technology policies in China. The presented model and method are a new attempt and, according to the experimental results, this exploration and study are valuable. In addition, the idea and method provide a good starting point for improving information services for massive science and technology policies in China.

Details

The Electronic Library, vol. 35 no. 4
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 18 March 2021

Pandiaraj A., Sundar C. and Pavalarajan S.

Abstract

Purpose

Recent developments in sentiment analysis have resulted in a significant growth in the volume of studies, especially on more subjective text types, namely, product or movie reviews. The key difference between these texts and news articles is that their target is defined and unique across the text. Hence, reviewing newspaper articles involves three subtasks: correctly spotting the target, separating the good and bad content of the reviews on the concerned target and evaluating the different opinions provided in a detailed manner. Having defined these tasks, this paper aims to implement a new sentiment analysis model for article reviews from newspapers.

Design/methodology/approach

Here, tweets about various newspaper articles are taken, and the sentiment analysis process is carried out through pre-processing, semantic word extraction, feature extraction and classification. Initially, the pre-processing phase is performed, in which steps such as stop word removal, stemming and blank space removal are carried out, producing the keywords that indicate positive, negative or neutral sentiment. Further, semantic (similar) words are extracted from the available dictionary by matching the keywords. Next, feature extraction is done for the extracted keywords and semantic words using holoentropy to attain information statistics, which results in the attainment of maximally related information. Two categories of holoentropy features are extracted: joint holoentropy and cross holoentropy. The extracted features of all keywords are finally subjected to a hybrid classifier, which merges the beneficial concepts of a neural network (NN) and a deep belief network (DBN). To improve the performance of sentiment classification, a modified rider optimization algorithm (ROA), called the new steering updated ROA (NSU-ROA), is introduced into the NN and DBN for weight updating. Hence, the average of both improved classifiers effectively provides the classified sentiment as positive, negative or neutral from the reviews of newspaper articles.
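
A minimal sketch of the pre-processing phase described above (stop word removal and stemming), using NLTK as an illustrative toolkit; the paper does not name the libraries or language resources it relies on, so these are assumptions.

```python
# Illustrative pre-processing: lowercase, strip extra whitespace, drop stop
# words and stem the remaining tokens. NLTK is an assumed toolkit choice.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

def preprocess(tweet):
    """Return stemmed, stop-word-free tokens from one tweet."""
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    tokens = tweet.lower().split()
    return [stemmer.stem(tok) for tok in tokens if tok not in stop_words]

print(preprocess("The new budget was widely criticised by opposition leaders"))
```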

Findings

Three data sets were considered for experimentation. The results show that the developed NSU-ROA + DBN + NN attained high accuracy, which was 2.6% higher than particle swarm optimization, 3% higher than FireFly, 3.8% higher than grey wolf optimization, 5.5% higher than the whale optimization algorithm and 3.2% higher than the ROA-based DBN + NN on data set 1. The classification analysis shows that the accuracy of the proposed NSU-ROA + DBN + NN was 3.4% higher than DBN + NN, 25% higher than DBN, 28.5% higher than NN and 32.3% higher than a support vector machine on data set 2. Thus, the effective performance of the proposed NSU-ROA + DBN + NN on sentiment analysis of newspaper articles has been demonstrated.

Originality/value

This paper adopts the latest optimization algorithm, called NSU-ROA, to effectively recognize the sentiments of newspaper articles with the NN and DBN. This is the first work that uses NSU-ROA-based optimization for accurate identification of sentiments from newspaper articles.

Details

Kybernetes, vol. 51 no. 1
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 16 September 2021

Sireesha Jasti

Abstract

Purpose

The internet has undergone tremendous change with the advancement of new technologies. This change has led internet users to post comments regarding services and products. Sentiment classification is the process of analyzing these reviews to help users decide whether or not to purchase a product.

Design/methodology/approach

A rider feedback artificial tree optimization-enabled deep recurrent neural network (RFATO-enabled deep RNN) is developed for the effective classification of sentiments into various grades. The proposed RFATO algorithm is modeled by integrating the feedback artificial tree (FAT) algorithm into the rider optimization algorithm (ROA), and it is used for training the deep RNN classifier for the classification of sentiments in the review data. Pre-processing is performed by stemming and stop word removal to remove redundancy for smoother processing of the data. Features including sentiwordnet-based features, a variant of term frequency-inverse document frequency (TF-IDF) features and spam word-based features are extracted from the review data to form the feature vector. Feature fusion is performed based on the entropy of the extracted features. The metrics employed for evaluation of the proposed RFATO algorithm are accuracy, sensitivity and specificity.
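
The sketch below shows the TF-IDF component of such a feature vector using scikit-learn's standard TfidfVectorizer; the paper uses a variant of TF-IDF alongside sentiwordnet- and spam word-based features, so this is only an approximation of one part of the full feature set, and the toy reviews are invented.

```python
# Standard TF-IDF features as a stand-in for the paper's TF-IDF variant.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "great product, works exactly as described",
    "terrible quality, broke after two days",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(reviews)  # shape: (n_reviews, n_terms)
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())
```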

Findings

By using the proposed RFATO algorithm, the evaluation metrics such as accuracy, sensitivity and specificity are maximized when compared to the existing algorithms.

Originality/value

The proposed RFATO algorithm is modeled by integrating the FAT algorithm into the ROA, and it is used for training the deep RNN classifier for the classification of sentiments in the review data. Pre-processing is performed by stemming and stop word removal to remove redundancy for smoother processing of the data. Features including sentiwordnet-based features, a variant of TF-IDF features and spam word-based features are extracted from the review data to form the feature vector. Feature fusion is performed based on the entropy of the extracted features.

Details

International Journal of Web Information Systems, vol. 17 no. 6
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 20 November 2017

Xiangbin Yan, Yumei Li and Weiguo Fan

Abstract

Purpose

Getting high-quality data by removing noisy data from user-generated content (UGC) is the first step toward data mining and effective decision-making based on ubiquitous and unstructured social media data. This paper aims to design a framework for removing noisy data from UGC.

Design/methodology/approach

In this paper, the authors consider a classification-based framework to remove noise from unstructured UGC in a social media community. They treat noise as messages that are not relevant to the topic of concern and apply a text classification-based approach to remove it. They introduce a domain lexicon to help distinguish the concerned topic from noise and compare the performance of several classification algorithms combined with different feature selection methods.

Findings

Experimental results based on a Chinese stock forum show that 84.9 per cent of all the noise data in the UGC could be removed with little loss of valuable information. The support vector machine classifier combined with the information gain feature selection model is the best choice for this system. It was also found that message length affects system performance, with longer messages achieving better classification performance.
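
The best-performing combination reported above can be sketched roughly as follows, with scikit-learn's mutual_info_classif standing in for information gain and a linear SVM as the classifier; the toy messages, labels and the number of selected features are illustrative assumptions, not the authors' data or settings.

```python
# Rough sketch of noise filtering: bag-of-words features, an information-gain-
# style selector (mutual information here), and a linear SVM classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

messages = ["stock price target raised", "free lottery click here",
            "earnings beat estimates", "win a prize now"]
labels = [1, 0, 1, 0]  # 1 = topic-relevant, 0 = noise

noise_filter = make_pipeline(
    CountVectorizer(),
    SelectKBest(mutual_info_classif, k=5),
    LinearSVC(),
)
noise_filter.fit(messages, labels)
print(noise_filter.predict(["click to win free prize"]))
```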

Originality/value

The proposed method could be used for preprocessing in text mining and for new knowledge discovery from big data.

Details

Information Discovery and Delivery, vol. 45 no. 4
Type: Research Article
ISSN: 2398-6247

Article
Publication date: 16 August 2021

Nael Alqtati, Jonathan A.J. Wilson and Varuna De Silva

Abstract

Purpose

This paper aims to equip professionals and researchers in the fields of advertising, branding, public relations, marketing communications, social media analytics and marketing with a simple, effective and dynamic means of evaluating consumer behavioural sentiments and engagement through Arabic language and script, in vivo.

Design/methodology/approach

Using quantitative and qualitative situational linguistic analyses of Classical Arabic, found in Quranic and religious texts; Modern Standard Arabic, which is commonly used in formal Arabic channels; and dialectical Arabic, which varies hugely from one Arabic country to another, this study analyses rich marketing and consumer messages (tweets) as a basis for developing an Arabic-language social media methodological tool.

Findings

Despite the popularity of Arabic language communication on social media platforms across geographies, currently, comprehensive language processing toolkits for analysing Arabic social media conversations have limitations and require further development. Furthermore, due to its unique morphology, developing text understanding capabilities specific to the Arabic language poses challenges.

Practical implications

This study demonstrates the application and effectiveness of the proposed methodology on a random sample of Twitter data from Arabic-speaking regions. Furthermore, as Arabic is the language of Islam, the study is of particular importance to Islamic and Muslim geographies, markets and marketing.

Social implications

The findings suggest that the proposed methodology has a wider potential beyond the data set and health-care sector analysed, and therefore, can be applied to further markets, social media platforms and consumer segments.

Originality/value

To remedy these gaps, this study presents a new methodology and analytical approach to investigating Arabic language social media conversations, which brings together a multidisciplinary knowledge of technology, data science and marketing communications.
