Search results
1 – 10 of 963
Abstract
Purpose
Automatic text categorization has applications in several domains, for example e‐mail spam detection, sexual content filtering, directory maintenance, and focused crawling. Most information retrieval systems contain several components which use text categorization methods. One of the first text categorization methods was designed using a naïve Bayes representation of the text. Since then, a number of variations of naïve Bayes have been discussed. The purpose of this paper is to evaluate naïve Bayes approaches to text categorization and to introduce new competitive extensions to previous approaches.
Design/methodology/approach
The paper focuses on introducing a new Bayesian text categorization method based on an extension of the naïve Bayes approach. Some modifications to document representations are introduced based on the well‐known BM25 text information retrieval method. The performance of the method is compared to several extensions of naïve Bayes using benchmark datasets designed for this purpose. The method is compared also to training‐based methods such as support vector machines and logistic regression.
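As a rough illustration of the BM25-style document representation mentioned above, the sketch below applies the textbook BM25 term-frequency saturation to raw counts. It is not necessarily the modification used in the paper; the function name `bm25_tf` and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturated, length-normalised term frequency (standard BM25 form)."""
    if tf <= 0:
        return 0.0
    # Longer-than-average documents get a larger normaliser, damping their counts.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)


# Raw counts grow linearly, while the BM25 weights flatten out below k1 + 1.
weights = [bm25_tf(tf, doc_len=100, avg_doc_len=100) for tf in (1, 2, 5, 50)]
```

A naïve Bayes variant could, under this assumption, accumulate these saturated weights in place of raw term counts when estimating class-conditional term statistics.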
Findings
The proposed text categorizer outperforms state‐of‐the‐art methods without introducing new computational costs. It also achieves performance very similar to that of more complex methods based on criterion function optimization, such as support vector machines or logistic regression.
Practical implications
The proposed method scales well regarding the size of the collection involved. The presented results demonstrate the efficiency and effectiveness of the approach.
Originality/value
The paper introduces a novel naïve Bayes text categorization approach based on the well‐known BM25 information retrieval model, which offers a set of good properties for this problem.
Francina Malan and Johannes Lodewyk Jooste
Abstract
Purpose
The purpose of this paper is to compare the effectiveness of the various text mining techniques that can be used to classify maintenance work-order records into their respective failure modes, focussing on the choice of algorithm and preprocessing transforms. Three algorithms are evaluated, namely Bernoulli Naïve Bayes, multinomial Naïve Bayes and support vector machines.
Design/methodology/approach
The paper has both a theoretical and an experimental component. In the literature review, the various algorithms and preprocessing techniques used in text classification are considered from three perspectives: the domain-specific maintenance literature, the broader short-form literature and the general text classification literature. The experimental component consists of a 5 × 2 nested cross-validation with an inner optimisation loop performed using a randomised search procedure.
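The outer loop of a 5 × 2 cross-validation can be sketched as follows. This is a minimal illustration using only the standard library, not the authors' experimental code; the function name `five_by_two_splits` and the fixed seed are assumptions, and the inner randomised search is only indicated by a comment.

```python
import random


def five_by_two_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for 5 repetitions of 2-fold CV."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(5):
        rng.shuffle(indices)
        half = n_samples // 2
        fold_a, fold_b = indices[:half], indices[half:]
        # Each half serves once as the training set and once as the test set.
        # In a nested design, hyperparameters would be tuned here using
        # only the training indices (for example by randomised search).
        yield sorted(fold_a), sorted(fold_b)
        yield sorted(fold_b), sorted(fold_a)


splits = list(five_by_two_splits(10))
```

Each of the ten resulting splits keeps the test half untouched during tuning, which is what prevents the optimisation step from leaking into the performance estimate.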
Findings
From the literature review, the aspects most affected by short document length are identified as the feature representation scheme, higher-order n-grams, document length normalisation, stemming, stop-word removal and algorithm selection. However, from the experimental analysis, the selection of preprocessing transforms seemed more dependent on the particular algorithm than on short document length. Multinomial Naïve Bayes performs marginally better than the other algorithms, but overall, the performances of the optimised models are comparable.
Originality/value
This work highlights the importance of model optimisation, including the selection of preprocessing transforms. Not only did the optimisation improve the performance of all the algorithms substantially, but it also affected model comparisons, with multinomial Naïve Bayes going from the worst- to the best-performing algorithm.
Arghya Ray, Pradip Kumar Bala, Nripendra P. Rana and Yogesh K. Dwivedi
Abstract
Purpose
The widespread acceptance of various social platforms has increased the number of users posting about services based on their experiences of those services. Finding out the intended ratings of social media (SM) posts is important for both organizations and prospective users since these posts can help in capturing users’ perspectives. However, unlike reviews on merchant websites, SM posts related to the service experience carry no rating unless one is explicitly mentioned in the comments. Additionally, predicting ratings can also help to build a database using recent comments for testing recommender algorithms in various scenarios.
Design/methodology/approach
In this study, the authors have predicted the ratings of SM posts using linear (Naïve Bayes, max-entropy) and non-linear (k-nearest neighbor, k-NN) classifiers utilizing combinations of different features, sentiment scores and emotion scores.
Findings
Overall, the results of this study reveal that the non-linear classifier (k-NN) performed better than the linear classifiers (Naïve Bayes, max-entropy). Results also show an improvement in performance where the classifier was combined with sentiment and emotion scores. Introduction of the feature “factors of importance” or “the latent factors” also shows an improvement in classifier performance.
Originality/value
This study provides a new avenue of predicting ratings of SM feeds by the use of machine learning algorithms along with a combination of different features like emotional aspects and latent factors.
Shuhei Yamamoto, Kei Wakabayashi, Noriko Kando and Tetsuji Satoh
Abstract
Purpose
Many Twitter users post tweets that are related to their particular interests. Users can also collect information by following other users. One approach clarifies user interests by tagging labels based on the users. A user tagging method is important to discover candidate users with similar interests. This paper aims to propose a new user tagging method using the posting time series data of the number of tweets.
Design/methodology/approach
The authors’ hypothesis focuses on the relationship between a user’s interests and the posting times of tweets: because users have interests, they will post more tweets when related events occur than at other times. The authors assume that hashtags are tags labeled to users and observe their occurrence counts at each timestamp. The authors extract burst timestamps using Kleinberg’s burst enumeration algorithm and estimate the burst levels. The authors treat the burst levels as term frequencies in documents and calculate the score using typical methods such as cosine similarity, Naïve Bayes and term frequency-inverse document frequency (TF-IDF).
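To make the scoring step concrete, the sketch below computes cosine similarity between two sparse vectors whose entries are burst levels per timestamp, used in the same way as term frequencies. The burst values and the variable names are hypothetical, and burst extraction itself (Kleinberg's algorithm) is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical burst levels per timestamp, used like term frequencies.
user_bursts = {"t1": 2, "t2": 0, "t3": 1}
hashtag_bursts = {"t1": 2, "t3": 1}
unrelated_bursts = {"t2": 3}

same = cosine_similarity(user_bursts, hashtag_bursts)
different = cosine_similarity(user_bursts, unrelated_bursts)
```

A user whose bursts coincide with a hashtag's bursts scores close to 1, while a user with no overlapping burst timestamps scores 0.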
Findings
From the sophisticated experimental evaluations, the authors demonstrate the high efficiency of the tagging method. Naïve Bayes and cosine similarity are particularly suitable for the user tagging and tag score calculation tasks, respectively. Some users whose hashtags were appropriately estimated by the proposed methods had a higher maximum number of tweets than other users.
Originality/value
Many approaches estimate user interest based on the terms in tweets and apply graph theory to structures such as following networks. The authors propose a new estimation method that uses the time series data of the number of tweets. Estimating user interest from time series data does not depend on language and can decrease calculation costs compared with the above-mentioned approaches because fewer features are used.
Femi Emmanuel Ayo, Olusegun Folorunso, Friday Thomas Ibharalu and Idowu Ademola Osinuga
Abstract
Purpose
Hate speech is an expression of intense hatred. Twitter has become a popular analytical tool for the prediction and monitoring of abusive behaviors. Hate speech detection with social media data has received special research attention in recent studies; hence the need to design a generic metadata architecture and an efficient feature extraction technique to enhance hate speech detection.
Design/methodology/approach
This study proposes hybrid embeddings enhanced with a topic inference method and an improved cuckoo search neural network for hate speech detection in Twitter data. The proposed method uses a hybrid embeddings technique that includes Term Frequency-Inverse Document Frequency (TF-IDF) for word-level feature extraction and Long Short-Term Memory (LSTM), a variant of the recurrent neural network architecture, for sentence-level feature extraction. The extracted features from the hybrid embeddings then serve as input into the improved cuckoo search neural network for the prediction of a tweet as hate speech, offensive language or neither.
Findings
The proposed method showed better results when tested on the collected Twitter datasets compared to other related methods. In order to validate the performances of the proposed method, t-test and post hoc multiple comparisons were used to compare the significance and means of the proposed method with other related methods for hate speech detection. Furthermore, Paired Sample t-Test was also conducted to validate the performances of the proposed method with other related methods.
Research limitations/implications
Finally, the evaluation results showed that the proposed method outperforms other related methods with a mean F1-score of 91.3.
Originality/value
The main novelty of this study is the use of an automatic topic spotting measure based on a naïve Bayes model to improve feature representation.
Arpita Gupta, Saloni Priyani and Ramadoss Balakrishnan
Abstract
Purpose
In this study, the authors have used the customer reviews of books and movies in natural language for the purpose of sentiment analysis and reputation generation on the reviews. Most of the existing work has performed sentiment analysis and reputation generation on the reviews by using single classification models and considered other attributes for reputation generation.
Design/methodology/approach
The authors have taken review, helpfulness and rating into consideration. In this paper, the authors have performed sentiment analysis for extracting the probability of the review belonging to a class, which is further used for generating the sentiment score and reputation of the review. The authors have used pre-trained BERT fine-tuned for sentiment analysis on movie and book reviews separately.
Findings
In this study, the authors have also combined the three models (BERT, Naïve Bayes and SVM) for more accurate sentiment classification and reputation generation; the combination has outperformed the best BERT model in this study. They have achieved the best accuracy of 91.2% for the movie review data set and 89.4% for the book review data set, which is better than the existing state-of-the-art methods. They have used the transfer learning concept in deep learning, in which knowledge gained from one problem is applied to a similar problem.
Originality/value
The authors have proposed a novel model based on a combination of three classification models, which has outperformed the existing state-of-the-art methods. To the best of the authors’ knowledge, there is no existing model which combines three models for sentiment score calculation and reputation generation for the book review data set.
Dhanya M. and Sanjana S.
Abstract
Purpose
The purpose of this paper is to understand the customer sentiment towards telemedicine apps and also to apply machine learning algorithms to analyse the sentiments in the adoption during the COVID-19 pandemic.
Design/methodology/approach
Text mining that uses natural language processing to extract insights from unstructured text is used to find out the customer sentiment towards the telemedicine apps during the COVID-19 pandemic. Machine learning algorithms like support vector machine (SVM) and Naïve Bayes classifier are used for classification, and their sensitivity and specificity are found using a confusion matrix.
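Sensitivity and specificity follow directly from the four cells of a binary confusion matrix. The sketch below shows the calculation; the function name and the counts are hypothetical and are not taken from the paper.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return sensitivity, specificity


# Hypothetical counts, not taken from the paper.
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=30, fp=20)
```

Reporting both values matters for sentiment data with unbalanced classes, where overall accuracy alone can hide poor performance on the minority class.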
Findings
The paper explores the customer sentiment towards telemedicine apps and their adoption during the COVID-19 pandemic. The customers who used telemedicine apps have both positive and negative sentiment towards them. Some of the customers have concerns about the medicines delivered, their delivery time, the quality of service and other technical difficulties. A small percentage of doctors also feel uncomfortable with online consultation through the application.
Originality/value
The primary value of this paper lies in providing an overview of the customers’ approach towards the telemedicine apps, especially during the COVID-19 pandemic.
Nur Azreen Zulkefly, Norjihan Abdul Ghani, Christie Pei-Yee Chin, Suraya Hamid and Nor Aniza Abdullah
Abstract
Purpose
Predicting the impact of social entrepreneurship is crucial as it can help social entrepreneurs to determine the achievement of their social mission and performance. However, there is a lack of existing social entrepreneurship models to predict social enterprises' social impacts. This paper aims to propose the social impact prediction model for social entrepreneurs using a data analytic approach.
Design/methodology/approach
This study implemented an experimental method using three different algorithms: naive Bayes, k-nearest neighbor and J48 decision tree algorithms to develop and test the social impact prediction model.
Findings
The accuracy of the developed social impact prediction model rests on the list of identified social impact prediction variables, which have been evaluated by social entrepreneurship experts. Across the three algorithms, the results showed that naive Bayes is the best-performing classifier in terms of social impact prediction accuracy.
Research limitations/implications
Although there are three categories of social entrepreneurship impact, this research only focuses on social impact. There will be a bright future for social entrepreneurship if research can focus on all three social entrepreneurship categories. Future research in this area could look beyond these three categories of social entrepreneurship, so that the prediction of social impact will be broader. Prospective researchers can also look beyond the differences and similarities of economic, social and environmental impacts and study the overall perspective on those impacts.
Originality/value
This paper fulfills the need for the Malaysian social entrepreneurship blueprint to design the social impact in social entrepreneurship. No existing prediction model can be used to predict social impact in Malaysia. This study also contributes to social entrepreneurship research, as the new social impact prediction variables found can be used in predicting social impact in social entrepreneurship in the future, which may improve prediction performance.
Rahila Umer, Teo Susnjak, Anuradha Mathrani and Suriadi Suriadi
Abstract
Purpose
The purpose of this paper is to propose a process mining approach to help in making early predictions to improve students’ learning experience in massive open online courses (MOOCs). It investigates the impact of various machine learning techniques in combination with process mining features to measure effectiveness of these techniques.
Design/methodology/approach
Students’ data (e.g. assessment grades, demographic information) and weekly interaction data based on event logs (e.g. video lecture interaction, solution submission time, time spent weekly) have guided this design. This study evaluates four machine learning classification techniques used in the literature (logistic regression (LR), Naïve Bayes (NB), random forest (RF) and K-nearest neighbor) to monitor weekly progression of students’ performance and to predict their overall performance outcome. Two data sets have been used: one with traditional features and a second with features obtained from process conformance testing.
Findings
The results show that the techniques used in the study are able to make predictions on the performance of students. Overall accuracy (F1-score, area under curve) of machine learning techniques can be improved by integrating process mining features with standard features. Specifically, the LR and NB classifiers outperform the other techniques in a statistically significant way.
Practical implications
Although MOOCs provide a platform for learning in highly scalable and flexible manner, they are prone to early dropout and low completion rate. This study outlines a data-driven approach to improve students’ learning experience and decrease the dropout rate.
Social implications
Early predictions based on individuals’ participation can help educators provide support to students who are struggling in the course.
Originality/value
This study outlines the innovative use of process mining techniques in education data mining to help educators gather data-driven insight on student performances in the enrolled courses.
Abstract
Purpose
Loan default risk or credit risk evaluation is important to financial institutions which provide loans to businesses and individuals. Loans carry the risk of being defaulted. To understand the risk levels of credit users (corporations and individuals), credit providers (bankers) normally collect vast amounts of information on borrowers. Statistical predictive analytic techniques can be used to analyse or to determine the risk levels involved in loans. This paper aims to address the question of default prediction of short-term loans for a Tunisian commercial bank.
Design/methodology/approach
The authors have used a database of 924 files of credits granted to industrial Tunisian companies by a commercial bank in the years 2003, 2004, 2005 and 2006. The naive Bayesian classifier algorithm was used, and the results show that the good classification rate is of the order of 63.85 per cent. The default probability is explained by the variables measuring working capital, leverage, solvency, profitability and cash flow indicators.
Findings
The results of the validation test show that the good classification rate is of the order of 58.66 per cent; nevertheless, the error types I and II remain relatively high at 42.42 and 40.47 per cent, respectively. A receiver operating characteristic curve is plotted to evaluate the performance of the model. The result shows that the area under the curve criterion is of the order of 69 per cent.
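The area under the ROC curve can be read as the probability that a randomly chosen defaulter receives a higher predicted default score than a randomly chosen non-defaulter. The sketch below computes it by direct pairwise comparison; the function name `auc` and the scores are hypothetical, not the bank's data.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical default probabilities for defaulters vs. non-defaulters.
value = auc([0.9, 0.6, 0.4], [0.5, 0.3, 0.4])
```

On this reading, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation of defaulters from non-defaulters.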
Originality/value
The paper highlights the fact that the Tunisian central bank obliged all commercial banks to conduct a survey study to collect qualitative data for better credit rating of the borrowers.
Keywords
ROC curve, Risk assessment, Default risk, Banking sector, Bayesian classifier algorithm.
Article type
Research article