Search results

1 – 10 of 902
Article
Publication date: 30 March 2012

Marcelo Mendoza

Abstract

Purpose

Automatic text categorization has applications in several domains, for example e‐mail spam detection, sexual content filtering, directory maintenance, and focused crawling, among others. Most information retrieval systems contain several components which use text categorization methods. One of the first text categorization methods was designed using a naïve Bayes representation of the text. Since then, a number of variations of naïve Bayes have been discussed. The purpose of this paper is to evaluate naïve Bayes approaches to text categorization and to introduce new competitive extensions to previous approaches.

Design/methodology/approach

The paper focuses on introducing a new Bayesian text categorization method based on an extension of the naïve Bayes approach. Some modifications to document representations are introduced based on the well‐known BM25 text information retrieval method. The performance of the method is compared to several extensions of naïve Bayes using benchmark datasets designed for this purpose. The method is compared also to training‐based methods such as support vector machines and logistic regression.
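As a rough illustration of the BM25 idea mentioned above, the sketch below computes the standard Okapi BM25 term-frequency component, which saturates raw counts and normalises for document length. The abstract does not give the paper's actual formulation or say how the weights enter the naïve Bayes model, so this is not the author's method: the function names `bm25_tf` and `bm25_weights` are hypothetical, and `k1=1.2`, `b=0.75` are conventional defaults assumed here.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard Okapi BM25 term-frequency component (assumed form).

    tf          -- raw count of the term in the document
    doc_len     -- number of tokens in the document
    avg_doc_len -- average document length in the collection
    k1, b       -- conventional defaults, not taken from the paper
    """
    # Longer-than-average documents get a larger normaliser, so each
    # occurrence counts for less.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)


def bm25_weights(doc_tokens, avg_doc_len, k1=1.2, b=0.75):
    """Map each distinct token of one document to its BM25 TF weight."""
    counts = {}
    for token in doc_tokens:
        counts[token] = counts.get(token, 0) + 1
    doc_len = len(doc_tokens)
    return {t: bm25_tf(c, doc_len, avg_doc_len, k1, b)
            for t, c in counts.items()}


# Hypothetical three-token document in a collection of average length 3.
weights = bm25_weights(["spam", "spam", "offer"], avg_doc_len=3.0)
```

The weight grows with the term count but never exceeds `k1 + 1`, which is the saturation property that distinguishes BM25 from raw term frequency.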

Findings

The proposed text categorizer outperforms state‐of‐the‐art methods without introducing new computational costs. It also achieves results very similar to those of more complex methods based on criterion function optimization, such as support vector machines or logistic regression.

Practical implications

The proposed method scales well regarding the size of the collection involved. The presented results demonstrate the efficiency and effectiveness of the approach.

Originality/value

The paper introduces a novel naïve Bayes text categorization approach based on the well‐known BM25 information retrieval model, which offers a set of good properties for this problem.

Details

International Journal of Web Information Systems, vol. 8 no. 1
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 6 February 2023

Francina Malan and Johannes Lodewyk Jooste

Abstract

Purpose

The purpose of this paper is to compare the effectiveness of the various text mining techniques that can be used to classify maintenance work-order records into their respective failure modes, focussing on the choice of algorithm and preprocessing transforms. Three algorithms are evaluated, namely Bernoulli Naïve Bayes, multinomial Naïve Bayes and support vector machines.

Design/methodology/approach

The paper has both a theoretical and an experimental component. In the literature review, the various algorithms and preprocessing techniques used in text classification are considered from three perspectives: the domain-specific maintenance literature, the broader short-form literature and the general text classification literature. The experimental component consists of a 5 × 2 nested cross-validation with an inner optimisation loop performed using a randomised search procedure.

Findings

From the literature review, the aspects most affected by short document length are identified as the feature representation scheme, higher-order n-grams, document length normalisation, stemming, stop-word removal and algorithm selection. However, from the experimental analysis, the selection of preprocessing transforms seemed more dependent on the particular algorithm than on short document length. Multinomial Naïve Bayes performs marginally better than the other algorithms, but overall, the performances of the optimised models are comparable.

Originality/value

This work highlights the importance of model optimisation, including the selection of preprocessing transforms. Not only did the optimisation improve the performance of all the algorithms substantially, but it also affected model comparisons, with multinomial Naïve Bayes going from the worst to the best performing algorithm.

Details

Journal of Quality in Maintenance Engineering, vol. 29 no. 3
Type: Research Article
ISSN: 1355-2511

Article
Publication date: 10 May 2022

Arghya Ray, Pradip Kumar Bala, Nripendra P. Rana and Yogesh K. Dwivedi

Abstract

Purpose

The widespread acceptance of various social platforms has increased the number of users posting about services based on their experiences of those services. Finding out the intended ratings of social media (SM) posts is important for both organizations and prospective users since these posts can help in capturing users’ perspectives. However, unlike on merchant websites, SM posts related to the service experience cannot be rated unless the rating is explicitly mentioned in the comments. Additionally, predicting ratings can also help to build a database using recent comments for testing recommender algorithms in various scenarios.

Design/methodology/approach

In this study, the authors have predicted the ratings of SM posts using linear (Naïve Bayes, max-entropy) and non-linear (k-nearest neighbor, k-NN) classifiers utilizing combinations of different features, sentiment scores and emotion scores.

Findings

Overall, the results of this study reveal that the non-linear classifier (k-NN) performed better than the linear classifiers (Naïve Bayes, max-entropy). Results also show an improvement in performance where the classifier was combined with sentiment and emotion scores. Introduction of the feature “factors of importance” or “the latent factors” also shows an improvement in classifier performance.

Originality/value

This study provides a new avenue of predicting ratings of SM feeds by the use of machine learning algorithms along with a combination of different features like emotional aspects and latent factors.

Details

Aslib Journal of Information Management, vol. 74 no. 6
Type: Research Article
ISSN: 2050-3806

Article
Publication date: 15 August 2016

Shuhei Yamamoto, Kei Wakabayashi, Noriko Kando and Tetsuji Satoh

Abstract

Purpose

Many Twitter users post tweets that are related to their particular interests. Users can also collect information by following other users. One approach clarifies user interests by tagging labels based on the users. A user tagging method is important to discover candidate users with similar interests. This paper aims to propose a new user tagging method using the posting time series data of the number of tweets.

Design/methodology/approach

The hypothesis concerns the relationship between a user’s interests and the posting times of tweets: users post more tweets about their interests when related events occur than at other times. The authors treat hashtags as tags labeling users and observe their occurrence counts at each timestamp. The authors extract burst timestamps using Kleinberg’s burst enumeration algorithm and estimate the burst levels. The burst levels are then treated as term frequencies in documents, and scores are calculated using standard methods such as cosine similarity, Naïve Bayes and term frequency-inverse document frequency (TF-IDF).
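To make the scoring step more concrete, here is a minimal sketch of cosine similarity between two burst-level vectors. It assumes, purely for illustration, that a user and a hashtag are each represented by one burst level per timestamp; the abstract does not specify the exact vector construction, and the example values are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors.

    Returns 0.0 when either vector is all zeros, so a user or hashtag
    with no bursts never matches anything.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical burst levels over four timestamps.
user_bursts = [0, 2, 3, 0]
hashtag_bursts = [0, 1, 2, 0]
score = cosine_similarity(user_bursts, hashtag_bursts)
```

Because cosine similarity ignores vector magnitude, a user who tweets rarely but at the same moments as a hashtag bursts can still score highly against it.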

Findings

The experimental evaluations demonstrate the high efficiency of the tagging method. Naïve Bayes and cosine similarity are particularly suitable for the user tagging and tag score calculation tasks, respectively. Users whose hashtags were appropriately estimated by the proposed methods had a higher maximum number of tweets than other users.

Originality/value

Many approaches estimate user interest based on the terms in tweets and apply graph theory to structures such as following networks. The authors propose a new estimation method that uses the time series data of the number of tweets. Estimating user interest from time series data has two merits: it does not depend on language, and it can reduce calculation costs compared with the above-mentioned approaches because fewer features are needed.

Details

International Journal of Web Information Systems, vol. 12 no. 3
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 3 November 2020

Femi Emmanuel Ayo, Olusegun Folorunso, Friday Thomas Ibharalu and Idowu Ademola Osinuga

Abstract

Purpose

Hate speech is an expression of intense hatred. Twitter has become a popular analytical tool for the prediction and monitoring of abusive behaviors. Hate speech detection with social media data has witnessed special research attention in recent studies, hence, the need to design a generic metadata architecture and efficient feature extraction technique to enhance hate speech detection.

Design/methodology/approach

This study proposes hybrid embeddings enhanced with a topic inference method and an improved cuckoo search neural network for hate speech detection in Twitter data. The proposed method uses a hybrid embeddings technique that includes Term Frequency-Inverse Document Frequency (TF-IDF) for word-level feature extraction and Long Short Term Memory (LSTM), a variant of the recurrent neural network architecture, for sentence-level feature extraction. The features extracted from the hybrid embeddings then serve as input to the improved cuckoo search neural network, which predicts whether a tweet is hate speech, offensive language or neither.

Findings

The proposed method showed better results when tested on the collected Twitter datasets compared to other related methods. In order to validate the performances of the proposed method, t-test and post hoc multiple comparisons were used to compare the significance and means of the proposed method with other related methods for hate speech detection. Furthermore, Paired Sample t-Test was also conducted to validate the performances of the proposed method with other related methods.

Research limitations/implications

Finally, the evaluation results showed that the proposed method outperforms other related methods with mean F1-score of 91.3.

Originality/value

The main novelty of this study is the use of an automatic topic spotting measure based on a naïve Bayes model to improve feature representation.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 13 no. 4
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 28 December 2020

Arpita Gupta, Saloni Priyani and Ramadoss Balakrishnan

Abstract

Purpose

In this study, the authors have used the customer reviews of books and movies in natural language for the purpose of sentiment analysis and reputation generation on the reviews. Most of the existing work has performed sentiment analysis and reputation generation on the reviews by using single classification models and considered other attributes for reputation generation.

Design/methodology/approach

The authors have taken review, helpfulness and rating into consideration. In this paper, the authors have performed sentiment analysis for extracting the probability of the review belonging to a class, which is further used for generating the sentiment score and reputation of the review. The authors have used pre-trained BERT fine-tuned for sentiment analysis on movie and book reviews separately.

Findings

In this study, the authors have also combined the three models (BERT, Naïve Bayes and SVM) for more accurate sentiment classification and reputation generation, and the combination has outperformed the best BERT model in this study. They have achieved a best accuracy of 91.2% for the movie review data set and 89.4% for the book review data set, which is better than the existing state-of-the-art methods. They have used the transfer learning concept in deep learning, in which knowledge gained from one problem is applied to a similar problem.

Originality/value

The authors have proposed a novel model based on a combination of three classification models, which has outperformed the existing state-of-the-art methods. To the best of the authors’ knowledge, there is no existing model which combines three models for sentiment score calculation and reputation generation for the book review data set.

Details

World Journal of Engineering, vol. 18 no. 4
Type: Research Article
ISSN: 1708-5284

Article
Publication date: 30 November 2022

Dhanya M. and Sanjana S.

Abstract

Purpose

The purpose of this paper is to understand the customer sentiment towards telemedicine apps and also to apply machine learning algorithms to analyse the sentiments in the adoption during the COVID-19 pandemic.

Design/methodology/approach

Text mining that uses natural language processing to extract insights from unstructured text is used to find out the customer sentiment towards the telemedicine apps during the COVID-19 pandemic. Machine learning algorithms like support vector machine (SVM) and Naïve Bayes classifier are used for classification, and their sensitivity and specificity are found using a confusion matrix.
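For readers unfamiliar with the two metrics named above, the following sketch derives sensitivity and specificity from the four cells of a binary confusion matrix. The function name and the counts are illustrative assumptions and are not results from the paper.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from binary confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of actual positives found
    specificity = TN / (TN + FP): share of actual negatives found
    A metric is NaN when its denominator is zero.
    """
    positives = tp + fn
    negatives = tn + fp
    sensitivity = tp / positives if positives else float("nan")
    specificity = tn / negatives if negatives else float("nan")
    return sensitivity, specificity


# Hypothetical counts for a positive-vs-negative sentiment classifier.
sens, spec = sensitivity_specificity(tp=40, fn=10, fp=5, tn=45)
```

Reporting both values matters when the classes are imbalanced, because overall accuracy can stay high while one of the two drops sharply.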

Findings

The paper explores the customer sentiment towards telemedicine apps and their adoption during the COVID-19 pandemic. Text mining that uses natural language processing to extract insights from unstructured text is used to find out the customer sentiment towards the telemedicine apps during the COVID-19 pandemic. Machine learning algorithms like SVM and Naïve Bayes classifier are used for classification, and their sensitivity and specificity are found using a confusion matrix. The customers who used telemedicine apps have positive sentiment as well as negative sentiment towards the telemedicine apps. Some of the customers have concerns about the medicines delivered, their delivery time, the quality of service and other technical difficulties. Even a small percentage of doctors feel uncomfortable in online consultation through the application.

Originality/value

The primary value of this paper lies in providing an overview of the customers’ approach towards the telemedicine apps, especially during the COVID-19 pandemic.

Details

Journal of Science and Technology Policy Management, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2053-4620

Article
Publication date: 16 August 2021

Nur Azreen Zulkefly, Norjihan Abdul Ghani, Christie Pei-Yee Chin, Suraya Hamid and Nor Aniza Abdullah

Abstract

Purpose

Predicting the impact of social entrepreneurship is crucial as it can help social entrepreneurs to determine the achievement of their social mission and performance. However, there is a lack of existing social entrepreneurship models to predict social enterprises' social impacts. This paper aims to propose the social impact prediction model for social entrepreneurs using a data analytic approach.

Design/methodology/approach

This study implemented an experimental method using three different algorithms: naive Bayes, k-nearest neighbor and J48 decision tree algorithms to develop and test the social impact prediction model.

Findings

The accuracy of the developed social impact prediction model rests on the list of identified social impact prediction variables, which have been evaluated by social entrepreneurship experts. Across the three algorithms implemented for the model, the results showed that naive Bayes is the best-performing classifier in terms of social impact prediction accuracy.

Research limitations/implications

Although there are three categories of social entrepreneurship impact, this research only focuses on social impact. There will be a bright future of social entrepreneurship if the research can focus on all three social entrepreneurship categories. Future research in this area could look beyond these three categories of social entrepreneurship, so the prediction of social impact will be broader. The prospective researcher also can look beyond the difference and similarities of economic, social impacts and environmental impacts and study the overall perspective on those impacts.

Originality/value

This paper fulfills the need for the Malaysian social entrepreneurship blueprint to design the social impact in social entrepreneurship. No existing prediction model can be used to predict social impact in Malaysia. This study also contributes to social entrepreneurship research, as the newly identified social impact prediction variables can be used to predict social impact in social entrepreneurship in the future, which may lead to significant prediction performance.

Details

Internet Research, vol. 32 no. 2
Type: Research Article
ISSN: 1066-2243

Open Access
Article
Publication date: 3 July 2017

Rahila Umer, Teo Susnjak, Anuradha Mathrani and Suriadi Suriadi

Abstract

Purpose

The purpose of this paper is to propose a process mining approach to help in making early predictions to improve students’ learning experience in massive open online courses (MOOCs). It investigates the impact of various machine learning techniques in combination with process mining features to measure effectiveness of these techniques.

Design/methodology/approach

Students’ data (e.g. assessment grades, demographic information) and weekly interaction data based on event logs (e.g. video lecture interaction, solution submission time, time spent weekly) have guided this design. This study evaluates four machine learning classification techniques used in the literature (logistic regression (LR), Naïve Bayes (NB), random forest (RF) and K-nearest neighbor) to monitor the weekly progression of students’ performance and to predict their overall performance outcome. Two data sets have been used: one with traditional features and a second with features obtained from process conformance testing.

Findings

The results show that the techniques used in the study are able to make predictions on the performance of students. Overall accuracy (F1-score, area under curve) of machine learning techniques can be improved by integrating process mining features with standard features. Specifically, the LR and NB classifiers outperform other techniques in a statistically significant way.

Practical implications

Although MOOCs provide a platform for learning in a highly scalable and flexible manner, they are prone to early dropout and low completion rates. This study outlines a data-driven approach to improve students’ learning experience and decrease the dropout rate.

Social implications

Early predictions based on individual’s participation can help educators provide support to students who are struggling in the course.

Originality/value

This study outlines the innovative use of process mining techniques in education data mining to help educators gather data-driven insight on student performances in the enrolled courses.

Details

Journal of Research in Innovative Teaching & Learning, vol. 10 no. 2
Type: Research Article
ISSN: 2397-7604

Open Access
Article
Publication date: 12 June 2017

Aida Krichene

Abstract

Purpose

Loan default risk or credit risk evaluation is important to financial institutions which provide loans to businesses and individuals. Loans carry the risk of being defaulted. To understand the risk levels of credit users (corporations and individuals), credit providers (bankers) normally collect vast amounts of information on borrowers. Statistical predictive analytic techniques can be used to analyse or to determine the risk levels involved in loans. This paper aims to address the question of default prediction of short-term loans for a Tunisian commercial bank.

Design/methodology/approach

The authors have used a database of 924 files of credits granted to industrial Tunisian companies by a commercial bank in the years 2003, 2004, 2005 and 2006. The naive Bayesian classifier algorithm was used, and the results show that the correct classification rate is of the order of 63.85 per cent. The default probability is explained by the variables measuring working capital, leverage, solvency, profitability and cash flow indicators.

Findings

The results of the validation test show that the correct classification rate is of the order of 58.66 per cent; nevertheless, the type I and type II errors remain relatively high at 42.42 and 40.47 per cent, respectively. A receiver operating characteristic curve is plotted to evaluate the performance of the model. The result shows that the area under the curve criterion is of the order of 69 per cent.

Originality/value

The paper highlights the fact that the Tunisian central bank obliged all commercial banks to conduct a survey study to collect qualitative data for better credit notation of the borrowers.

Keywords

ROC curve, Risk assessment, Default risk, Banking sector, Bayesian classifier algorithm.

Details

Journal of Economics, Finance and Administrative Science, vol. 22 no. 42
Type: Research Article
ISSN: 2077-1886
