Search results

1 – 10 of 401
Article
Publication date: 1 March 1998

Alexander M. Robertson and Peter Willett

Abstract

This paper provides an introduction to the use of n‐grams in textual information systems, where an n‐gram is a string of n, usually adjacent, characters extracted from a section of continuous text. Applications that can be implemented efficiently and effectively using sets of n‐grams include spelling error detection and correction, query expansion, information retrieval with serial, inverted and signature files, dictionary look‐up, text compression, and language identification.
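
To make the definition concrete, here is a minimal sketch (not from the paper) of extracting the overlapping character n-grams on which all of these applications rest:

```python
# Extract every string of n adjacent characters from a section of text,
# per the definition of an n-gram given in the abstract above.
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all overlapping n-character substrings of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Trigrams are robust to small errors: a misspelling still shares most
# trigrams with the correct word, which supports spelling correction.
print(char_ngrams("retrieval", 3))
# ['ret', 'etr', 'tri', 'rie', 'iev', 'eva', 'val']
```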

Details

Journal of Documentation, vol. 54 no. 1
Type: Research Article
ISSN: 0022-0418

Book

Details

Power Laws in the Information Production Process: Lotkaian Informetrics
Type: Book
ISBN: 978-0-12088-753-8

Article
Publication date: 17 September 2021

Deborah Yvonne Nagel, Stephan Fuhrmann and Thomas W. Guenther

Abstract

Purpose

The usefulness of risk disclosures (RDs) in supporting equity investors’ investment decisions is much debated. As prior research criticizes the extensive aggregation of risk information in existing empirical work, this paper attempts to identify disaggregated risk information associated with cumulative abnormal stock returns (CARs).

Design/methodology/approach

The sample consists of 2,558 RDs of companies listed in the S&P 500 index. The RDs were filed within 10 K filings between 2011 and 2017. First, this study automatically extracted 35,685 key phrases that occurred in a maximum of 1.5% of the RDs. Second, this study performed stepwise regressions of these key phrases and identified 67 (78) key phrases that show positive (negative) associations with CARs.
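
A hypothetical sketch of this two-step design follows; the 1.5% document-frequency cutoff is from the abstract, while the data layout, function names and greedy selection loop are illustrative assumptions, not the authors' code:

```python
# Step 1: keep only key phrases occurring in at most 1.5% of the RDs.
# Step 2: forward stepwise regression of CARs on phrase indicator columns.
import numpy as np
from sklearn.linear_model import LinearRegression

def rare_phrases(doc_phrases: list[set[str]], max_df: float = 0.015) -> list[str]:
    """Keep phrases appearing in at most max_df of all documents."""
    counts: dict[str, int] = {}
    for phrases in doc_phrases:
        for p in phrases:
            counts[p] = counts.get(p, 0) + 1
    return [p for p, c in counts.items() if c / len(doc_phrases) <= max_df]

def forward_stepwise(X: np.ndarray, y: np.ndarray, k: int) -> list[int]:
    """Greedily add the phrase column that most improves in-sample R^2."""
    selected: list[int] = []
    for _ in range(k):
        best_j, best_r2 = -1, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        selected.append(best_j)
    return selected
```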

Findings

The paper finds that investors seem to place the most value on the more common key phrases just below the 1.5% rarity threshold and on business-related key phrases from RDs. Furthermore, investors seem to perceive key phrases that contain words indicating uncertainty (impacts) as a negative (positive) rather than a positive (negative) signal.

Research limitations/implications

The research approach faces limitations mainly due to the selection of the included key phrases, the focus on CARs and the methodological choice of the stepwise regression analysis.

Originality/value

The study reveals the potential for companies to increase the information value of their RDs for equity investors by providing tailored information within RDs instead of universal phrases. In addition, the research indicates that the tailored RDs encouraged by the SEC contain relevant information for investors. Furthermore, the results may guide the attention of equity investors to relevant text passages whose deeper analysis might be useful with regard to investors’ capital market decisions.

Article
Publication date: 22 August 2008

Majed Sanan, Mahmoud Rammal and Khaldoun Zreik

Abstract

Purpose

Classification of Arabic documents has recently become a real problem for juridical centers. In this case, some of the Lebanese official journal documents are already classified, and the center has to classify new documents based on them. This paper aims to study and explain the useful application of a supervised learning method to Arabic texts, using N‐grams as the indexing method (n = 3).

Design/methodology/approach

The Lebanese official journal documents are categorized into several classes. Assuming that the class(es) of some documents (called learning texts) are known, the candidate words of each class can be determined by segmenting those documents.

Findings

Results showed that N‐gram text classification using the cosine coefficient measure outperforms classification using Dice’s measure and the TF*ICF weight. It is thus the best of the three measures, but still insufficient: the N‐gram method alone is inadequate for the classification of Arabic documents, and a new approach, such as a distributional or symbolic one, will be needed to increase effectiveness.
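
For reference, a minimal sketch of the two similarity measures compared above, applied to trigram profiles (n = 3); the exact weighting the study uses may differ:

```python
from collections import Counter
import math

def trigrams(text: str) -> Counter:
    """Count the character trigrams of a text (works for Arabic strings)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine coefficient between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def dice(a: Counter, b: Counter) -> float:
    """Dice's measure over the sets of trigrams present in each text."""
    if not a and not b:
        return 0.0
    return 2 * len(a.keys() & b.keys()) / (len(a) + len(b))
```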

Originality/value

The results could be used to improve Arabic document classification, including in software. This work evaluated a number of similarity measures for the classification of Arabic documents, using the Lebanese parliament documents, and especially the Lebanese official journal documents, as the Arabic test corpus.

Details

Interactive Technology and Smart Education, vol. 5 no. 3
Type: Research Article
ISSN: 1741-5659

Article
Publication date: 29 July 2021

Aarathi S. and Vasundra S.

Abstract

Purpose

Pervasive analytics play a prominent role in the computer-aided prediction of non-communicable diseases. Early-stage arrhythmia detection helps prevent sudden death owing to heart failure or stroke. The scope of an arrhythmia can be identified from an electrocardiogram (ECG) report.

Design/methodology/approach

The ECG report has been used extensively by clinical experts; however, diagnostic accuracy depends on clinical experience. For computer-aided heart disease prediction methods, both accuracy and sensitivity metrics play a significant part. Hence, existing research contributions have optimized machine-learning approaches for computer-aided methods that perform predictive analysis for arrhythmia detection.

Findings

In this context, this paper develops a regression heuristic based on tridimensional optimal features of ECG reports to perform pervasive analytics for computer-aided arrhythmia prediction. Empirical results indicate that the proposed model is more effective than existing and contemporary approaches.

Originality/value

The originality lies in the regression heuristic itself, which derives tridimensional optimal features from ECG reports for arrhythmia prediction and empirically outperforms existing and contemporary approaches.

Details

International Journal of Pervasive Computing and Communications, vol. 20 no. 1
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 1 May 2007

Fuchun Peng and Xiangji Huang

Abstract

Purpose

The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task.

Design/methodology/approach

Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation‐based approach was compared with the non‐segmentation‐based approach.
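
As a rough illustration of the language-modeling approach (simplified here to a bag-of-n-grams likelihood with add-one smoothing; the paper's models may be conditional n-gram models), a classifier that needs no word segmentation might look like this:

```python
import math
from collections import Counter

class NgramLMClassifier:
    """Assign a document to the class whose character n-gram model
    gives it the highest smoothed log-likelihood."""

    def __init__(self, n: int = 3):
        self.n = n
        self.models: dict[str, Counter] = {}
        self.totals: dict[str, int] = {}

    def fit(self, docs: list[str], labels: list[str]) -> None:
        for doc, label in zip(docs, labels):
            model = self.models.setdefault(label, Counter())
            model.update(doc[i:i + self.n] for i in range(len(doc) - self.n + 1))
        self.totals = {c: sum(m.values()) for c, m in self.models.items()}

    def predict(self, doc: str) -> str:
        grams = [doc[i:i + self.n] for i in range(len(doc) - self.n + 1)]
        def loglik(c: str) -> float:
            m, t = self.models[c], self.totals[c]
            v = len(m)  # vocabulary size for add-one smoothing
            return sum(math.log((m[g] + 1) / (t + v + 1)) for g in grams)
        return max(self.models, key=loglik)
```

Because the features are character n-grams, the same classifier applies unchanged to Chinese or Japanese text, where no word boundaries are marked.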

Findings

There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy.

Practical implications

Applying the findings to real web text classification is ongoing work.

Originality/value

The paper is highly relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.

Details

Journal of Documentation, vol. 63 no. 3
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 13 September 2019

Collins Udanor and Chinatu C. Anyanwu

Abstract

Purpose

Hate speech in recent times has become a troubling development. It has different meanings to different people in different cultures. The anonymity and ubiquity of social media provide a breeding ground for hate speech and make combating it seem like a lost battle. However, what may constitute hate speech in a culturally or religiously neutral society may not be perceived as such in a polarized multi-cultural and multi-religious society like Nigeria. Defining hate speech, therefore, may be contextual. Hate speech in Nigeria may be perceived along ethnic, religious and political boundaries. The purpose of this paper is to check for the presence of hate speech on social media platforms like Twitter, and to determine to what degree it is present. It also intends to find out what monitoring mechanisms social media platforms like Facebook and Twitter have put in place to combat hate speech. Lexalytics is a term coined by the authors from the words lexical analytics, for the purpose of opinion mining unstructured texts like tweets.

Design/methodology/approach

This research developed a Python software tool called the polarized opinions sentiment analyzer (POSA), adopting an ego social network analytics technique in which an individual’s behavior is mined and described. POSA uses a customized Python N-gram dictionary of local, context-based terms that may be considered hate terms. It then applied the Twitter API to stream tweets from popular and trending Nigerian Twitter handles in politics, ethnicity, religion, social activism, racism, etc., and filtered the tweets against the custom dictionary using unsupervised classification of the texts as either positive or negative sentiments. The outcome is visualized using tables, pie charts and word clouds. A similar implementation was also carried out using R-Studio code; both results were compared, and a t-test was applied to determine whether there was a significant difference between them. The research methodology can be classified as both qualitative and quantitative: qualitative in terms of data classification, and quantitative in terms of being able to identify the results as either negative or positive from the computation of text to vectors.
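
POSA itself is not public; the following sketch only illustrates the dictionary-filtering step described above, with a placeholder term list and a simple match rule standing in for the authors' custom N-gram dictionary and classification logic:

```python
# Placeholder dictionary of local context-based terms (hypothetical).
HATE_TERMS = {"example_slur", "example insult"}

def word_ngrams(tokens: list[str], n: int) -> set[str]:
    """Build the set of word n-grams of a tokenized tweet."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def classify_tweet(text: str) -> str:
    """Label a tweet negative if any unigram or bigram hits the dictionary."""
    tokens = text.lower().split()
    candidates = set(tokens) | word_ngrams(tokens, 2)
    return "negative" if candidates & HATE_TERMS else "positive"
```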

Findings

The findings from two sets of experiments on POSA and R are as follows. In the first experiment, the POSA software found that the Twitter handles analyzed contained between 33 and 55 percent hate content, while the R results show hate content ranging from 38 to 62 percent. Performing a t-test on both positive and negative scores for POSA and R-Studio reveals p-values of 0.389 and 0.289, respectively, at an α value of 0.05, implying that there is no significant difference between the results from POSA and R. From the second experiment, performed on 11 local handles with 1,207 tweets, the authors deduce the following: the percentage of hate content classified by POSA is 40 percent, while the percentage classified by R is 51 percent; the accuracy of hate speech classification predicted by POSA is 87 percent (86 percent for free speech); and the accuracy predicted by R is 65 percent (74 percent for free speech). This study reveals that neither Twitter nor Facebook has an automated monitoring system for hate speech, and no benchmark is set to decide the level of hate content allowed in a text. The monitoring is instead done by humans, whose assessment is usually subjective and sometimes inconsistent.
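
The significance check reported above can be reproduced with an independent-samples t-test, e.g. via scipy; the scores below are placeholders, not the study's data:

```python
from scipy import stats

# Hypothetical per-handle hate-content fractions from the two tools.
posa_scores = [0.33, 0.41, 0.55, 0.40]
r_scores = [0.38, 0.47, 0.62, 0.51]

t_stat, p_value = stats.ttest_ind(posa_scores, r_scores)
# With p > 0.05, as in the study, the difference is not significant.
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```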

Research limitations/implications

This study establishes the fact that hate speech is on the increase on social media. It also shows that hate mongers can actually be pinned down by the contents of their messages. The POSA system can be used as a plug-in by Twitter to detect and stop hate speech on its platform. The study was limited to public Twitter handles only. N-grams are effective features for word-sense disambiguation, but when using N-grams the feature vector can take on enormous proportions, in turn increasing the sparsity of the feature vectors.

Practical implications

The findings of this study show that if urgent measures are not taken to combat hate speech, there could be dire consequences, especially in highly polarized societies that are always heated up along religious and ethnic sentiments. On a daily basis, tempers flare on social media over comments made by participants. This study has also demonstrated that it is possible to implement a technology that can track and terminate hate speech on a micro-blog like Twitter. This can also be extended to other social media platforms.

Social implications

This study will help to promote a more positive society, ensuring social media is positively utilized to the benefit of mankind.

Originality/value

The findings can be used by social media companies to monitor user behaviors, and pin hate crimes to specific persons. Governments and law enforcement bodies can also use the POSA application to track down hate peddlers.

Details

Data Technologies and Applications, vol. 53 no. 4
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 20 November 2009

Maria Soledad Pera and Yiu‐Kai Ng

Abstract

Purpose

The web provides its users with abundant information. Unfortunately, when a web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents that are ranked among legitimate ones. The mixed results downgrade the performance of search engines and frustrate users, who are required to filter out useless information. To improve the quality of web searches, the number of spam documents on the web must be reduced, if they cannot be eradicated entirely. This paper aims to present a novel approach for identifying spam web documents, which have mismatched titles and bodies and/or a low percentage of hidden content in the markup data structure.

Design/methodology/approach

The paper shows that by considering the degree of similarity among the words in the title and body of a web document D, which is computed using their word‐correlation factors; using the percentage of hidden content in the markup data structure within D; and/or considering the bigram or trigram phrase‐similarity values of D, it is possible to determine whether D is spam with high accuracy.
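
One of these signals can be sketched as follows, with exact bigram matching standing in for the pre‐computed word‐correlation factors the paper relies on (a simplifying assumption):

```python
def word_bigrams(text: str) -> set[str]:
    """Set of adjacent word pairs in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}

def title_body_similarity(title: str, body: str) -> float:
    """Fraction of the title's bigrams that also occur in the body;
    a very low value suggests a mismatched, possibly spam, title."""
    t = word_bigrams(title)
    return len(t & word_bigrams(body)) / len(t) if t else 0.0
```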

Findings

By considering the content and markup of web documents, this paper develops a spam‐detection tool that is: reliable, since it can accurately detect 84.5 percent of spam/legitimate web documents; and computationally inexpensive, since the word‐correlation factors used for content analysis are pre‐computed.

Research limitations/implications

Since the bigram‐correlation values employed in the spam‐detection approach are computed using the unigram‐correlation factors, this imposes additional computational time during the spam‐detection process and could generate a higher number of misclassified spam web documents.

Originality/value

The paper verifies that the spam‐detection approach outperforms existing anti‐spam methods by at least 3 percent in terms of F‐measure.

Details

International Journal of Web Information Systems, vol. 5 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 21 October 2023

Alex Rudniy, Olena Rudna and Arim Park

Abstract

Purpose

This paper seeks to demonstrate the value of using social media to capture fashion trends, including the popularity of specific features of clothing, in order to improve the speed and accuracy of supply chain response in the era of fast fashion.

Design/methodology/approach

This study examines the role that text mining can play to improve trend recognition in the fashion industry. Researchers used n-gram analysis to design a social media trend detection tool referred to here as the Twitter Trend Tool (3Ts). This tool was applied to a Twitter dataset to identify trends whose validity was then checked against Google Trends.
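
In the spirit of the 3Ts (whose code is not given here), trend detection from n-gram counts might be sketched as follows; the per-day data layout and the 2x-spike rule are illustrative assumptions, not the authors' method:

```python
from collections import Counter

def daily_bigram_counts(tweets_by_day: dict[str, list[str]]) -> dict[str, Counter]:
    """Count word bigrams (e.g. apparel features like 'puff sleeve') per day."""
    counts: dict[str, Counter] = {}
    for day, tweets in tweets_by_day.items():
        c: Counter = Counter()
        for t in tweets:
            w = t.lower().split()
            c.update(" ".join(w[i:i + 2]) for i in range(len(w) - 1))
        counts[day] = c
    return counts

def trending(counts: dict[str, Counter], today: str, factor: float = 2.0) -> list[str]:
    """Flag bigrams whose count today exceeds factor x their historical mean."""
    history = [c for d, c in counts.items() if d != today]
    baseline: Counter = Counter()
    for c in history:
        baseline.update(c)
    n = max(len(history), 1)
    return [g for g, v in counts[today].items() if v > factor * baseline[g] / n]
```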

Findings

The results suggest that Twitter data are trend representative and can be used to identify the apparel features that are most in demand in near real time.

Originality/value

The 3Ts tool introduced in this research contributes to the field of fashion analytics by offering a novel method for employing big data from social media to identify consumer preferences in fashion elements, analyzing those preferences to improve demand planning.

Practical implications

The 3Ts improves forecasting models and helps inform marketing campaigns in the apparel retail industry, especially in fast fashion.

Details

Journal of Fashion Marketing and Management: An International Journal, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1361-2026

Article
Publication date: 12 April 2022

Mengjuan Zha, Changping Hu and Yu Shi

Abstract

Purpose

A sentiment lexicon is an essential resource for the sentiment analysis of user reviews. So far, there is still a lack of a large-scale, high-accuracy domain sentiment lexicon for Chinese book reviews. This paper aims to construct a large-scale sentiment lexicon based on ultrashort reviews of Chinese books.

Design/methodology/approach

First, large-scale ultrashort reviews of Chinese books, whose length is no more than six Chinese characters, are collected and preprocessed as candidate sentiment words. Second, non-sentiment words are filtered out through certain rules, such as part of speech rules, context rules, feature word rules and user behaviour rules. Third, the relative frequency is used to select and judge the polarity of sentiment words. Finally, the performance of the sentiment lexicon is evaluated through experiments.
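
The relative-frequency step (the third above) can be sketched as follows; the 2x dominance margin is an illustrative assumption, not the authors' threshold:

```python
def polarity(word: str, pos_reviews: list[str], neg_reviews: list[str]) -> str:
    """Assign polarity by comparing a candidate word's relative frequency
    in positively vs negatively rated ultrashort reviews."""
    pos_rf = sum(word in r for r in pos_reviews) / max(len(pos_reviews), 1)
    neg_rf = sum(word in r for r in neg_reviews) / max(len(neg_reviews), 1)
    if pos_rf > 2 * neg_rf:
        return "positive"
    if neg_rf > 2 * pos_rf:
        return "negative"
    return "neutral"
```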

Findings

This paper proposes a method of sentiment lexicon construction based on ultrashort reviews and successfully builds a lexicon of nearly 40,000 words for Chinese books based on Douban book reviews.

Originality/value

Compared with the idea of constructing a sentiment lexicon based on a small number of reviews, the proposed method can give full play to the advantages of data scale to build a corpus. Moreover, different from the computer segmentation method, this method helps to avoid the problems caused by immature segmentation technology and an imperfect N-gram language model.

Details

The Electronic Library, vol. 40 no. 3
Type: Research Article
ISSN: 0264-0473
