Search results

1 – 10 of over 11000
Article
Publication date: 6 February 2017

Aytug Onan

Abstract

Purpose

The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification can be an essential task for many purposes in information retrieval, such as document organization, text filtering and sentiment analysis. Ensemble learning has been extensively studied to construct efficient text classification schemes with higher predictive performance and generalization ability. The purpose of this paper is to provide diversity among the classification algorithms of the ensemble, which is a key issue in ensemble design.

Design/methodology/approach

An ensemble scheme based on hybrid supervised clustering is presented for text classification. In the presented scheme, supervised hybrid clustering, which is based on the cuckoo search algorithm and k-means, is introduced to partition the data samples of each class into clusters so that training subsets with higher diversity can be provided. Each classifier is trained on the diversified training subsets, and the predictions of the individual classifiers are combined by the majority voting rule. The predictive performance of the proposed classifier ensemble is compared to conventional classification algorithms (such as Naïve Bayes, logistic regression, support vector machines and the C4.5 algorithm) and ensemble learning methods (such as AdaBoost, bagging and random subspace) using 11 text benchmarks.
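The subset construction and voting steps can be illustrated with a minimal sketch. It assumes the per-class clusters have already been computed (the cuckoo search and k-means clustering itself is not reproduced here), and it assumes one plausible way of forming subsets: combining one cluster from every class. The function names `build_subsets`, `majority_vote` and `ensemble_predict` are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import product


def build_subsets(clusters_by_class):
    """Form training subsets by combining one cluster from every class.

    clusters_by_class maps a class label to a list of clusters, where each
    cluster is a list of samples. This pairing rule is an assumption.
    """
    labels = sorted(clusters_by_class)
    subsets = []
    for combo in product(*(clusters_by_class[label] for label in labels)):
        subset = [(x, label) for label, cluster in zip(labels, combo) for x in cluster]
        subsets.append(subset)
    return subsets


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(classifiers, sample):
    """Combine the predictions of callables that map a sample to a label."""
    return majority_vote([clf(sample) for clf in classifiers])
```

Each subset would then be used to train one base classifier, and `ensemble_predict` applies the majority voting rule to their outputs.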

Findings

The experimental results indicate that the presented classifier ensemble outperforms the conventional classification algorithms and ensemble learning methods for text classification.

Originality/value

The presented ensemble scheme is the first to use supervised clustering to obtain a diverse ensemble for text classification.

Details

Kybernetes, vol. 46 no. 2
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 6 March 2007

Sven Sandow and Xuelong Zhou

Abstract

Purpose

Investors often rely on probabilistic models that were learned from small historical labeled datasets. The purpose of this article is to propose a new method for data‐efficient model learning.

Design/methodology/approach

The proposed method, which is an extension of the standard minimum relative entropy (MRE) approach and has a clear financial interpretation, belongs to the class of semi‐supervised algorithms, which can learn from data that are only partially labeled with values of the variable of interest.
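For orientation, the quantity at the heart of any MRE approach is the relative entropy (Kullback-Leibler divergence) between a candidate model and a prior. The sketch below only computes that divergence for discrete distributions; it does not implement the authors' semi-supervised extension, and the function name `relative_entropy` is an illustrative choice.

```python
import math


def relative_entropy(p, q):
    """Relative entropy D(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    p and q are sequences of probabilities over the same outcomes.
    Terms with p_i == 0 contribute nothing; p_i > 0 with q_i == 0 gives infinity.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total
```

A model identical to the prior has zero relative entropy, and the value grows as the model moves away from the prior.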

Findings

This study tests the method on an artificial dataset and uses it to learn a model for recovery of defaulted debt. In both cases, the resulting models perform better than the standard MRE model, when the number of labeled data is small.

Originality/value

The method can be applied to financial problems where labeled data are sparse but unlabeled data are readily available.

Details

The Journal of Risk Finance, vol. 8 no. 2
Type: Research Article
ISSN: 1526-5943

Article
Publication date: 3 January 2018

Lei La, Shuyan Cao and Liangjuan Qin

Abstract

Purpose

As a foundational issue of social mining, sentiment classification suffers from a lack of labeled data. To enhance classification accuracy with few labeled data, many semi-supervised algorithms have been proposed. These algorithms improve classification performance when labeled data are insufficient. However, many semi-supervised methods struggle to ensure precision and efficiency at the same time. This paper aims to present a novel method for using unlabeled data in a more accurate and more efficient way.

Design/methodology/approach

First, the authors designed a boosting-based method for unlabeled data selection. The improved boosting-based method can choose unlabeled data that follow the same distribution as the labeled data. The authors then proposed a novel strategy that combines weak classifiers into strong classifiers in a more rational way. Finally, a semi-supervised sentiment classification algorithm is given.
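As background for the boosting step, the sketch below shows one round of standard discrete AdaBoost: the weak classifier's vote weight is derived from its weighted error, and the sample weights are then rescaled. This is the textbook rule, not the authors' improved weighting mechanism, which the abstract does not specify. The function name `adaboost_round` is hypothetical.

```python
import math


def adaboost_round(weights, correct):
    """Run one round of standard discrete AdaBoost.

    weights: current sample weights, assumed to sum to 1.
    correct: booleans, True where the weak classifier was right.
    Returns (alpha, new_weights), with new_weights normalized to sum to 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correctly classified samples, grow the rest.
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]
```

After the update, the misclassified samples together carry half of the total weight, which forces the next weak classifier to focus on them.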

Findings

Experimental results demonstrate that the novel algorithm can achieve high accuracy with low time consumption. It is helpful for building high-performance social network-related applications.

Research limitations/implications

The novel method still needs a small labeled data set for semi-supervised learning. Future work may extend it to an unsupervised method.

Practical implications

The method can be used in text mining, image classification, audio processing and other fields related to unstructured data mining. It addresses the problem of insufficient labeled data and achieves high precision with less computation time.

Social implications

Sentiment mining has wide applications in public opinion management, public security, market analysis, social network and related fields. Sentiment classification is the basis of sentiment mining.

Originality/value

To the best of the authors' knowledge, this is the first time transfer learning has been introduced to AdaBoost for semi-supervised learning. Moreover, the improved AdaBoost uses a totally new mechanism for weighting.

Details

Kybernetes, vol. 47 no. 3
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 13 March 2017

Samira Khodabandehlou and Mahmoud Zivari Rahman

Abstract

Purpose

This paper aims to provide a predictive framework of customer churn through six stages for accurate prediction and preventing customer churn in the field of business.

Design/methodology/approach

The six stages are as follows: first, collection of customer behavioral data and preparation of the data; second, the formation of derived variables and selection of influential variables, using a method of discriminant analysis; third, selection of training and testing data and reviewing their proportion; fourth, the development of prediction models using simple, bagging and boosting versions of supervised machine learning; fifth, comparison of churn prediction models based on different versions of machine-learning methods and selected variables; and sixth, providing appropriate strategies based on the proposed model.

Findings

According to the results, five variables (the number of items, reception of returned items, the discount, the distribution time and the prize), alongside the recency, frequency and monetary (RFM) variables, were chosen as the best predictor variables (RFMITSDP). The proposed model, with an accuracy of 97.92 per cent, performed much better in churn prediction than RFM. Among the supervised machine learning methods, artificial neural network (ANN) had the highest accuracy and decision trees (DT) the lowest. The results show the substantial superiority of boosting versions in prediction compared with simple and bagging models.
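The RFM baseline mentioned above can be computed directly from a transaction log. The following is a minimal sketch assuming each transaction is a `(customer_id, purchase_date, amount)` tuple; that record layout and the function name `rfm` are illustrative assumptions, and the five additional variables from the study are not modeled.

```python
from datetime import date


def rfm(transactions, today):
    """Compute recency, frequency and monetary value per customer.

    transactions: iterable of (customer_id, purchase_date, amount).
    Returns {customer_id: (days_since_last_purchase, purchase_count, total_spent)}.
    """
    summary = {}
    for customer, day, amount in transactions:
        last, count, total = summary.get(customer, (None, 0, 0.0))
        if last is None or day > last:
            last = day
        summary[customer] = (last, count + 1, total + amount)
    return {
        customer: ((today - last).days, count, total)
        for customer, (last, count, total) in summary.items()
    }
```

For example, a customer who bought twice, most recently 28 days before `today`, for a total of 50.0 is summarized as `(28, 2, 50.0)`.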

Research limitations/implications

The available data covered only two years. The research data came from a single grocery store, so the findings may not apply to other industries; the results should therefore be generalized to other business centers with caution.

Practical implications

Business owners must enforce a clear rule for awarding a prize for a certain number of purchased items. The prize can be something other than the purchased item. Business owners must accept items returned by customers for any reason, and the conditions and deadline for accepting returned items must be clearly communicated to customers. Store owners must offer a discount for a certain amount of purchase from the store, using an exponential rule that increases the discount as the purchase amount grows, to encourage customers to buy more. The managers of large stores must deliver ordered items quickly, using new, well-equipped vehicles and a skilled and friendly workforce. It is recommended that the types of services, the rules for prizes, the discount, the rules for accepting returned items and the method of distributing items be displayed in the store for all customers to see. The special services and reward rules of the store must be communicated to customers through new media such as social networks. To predict customer behaviors from the data, future researchers should use the boosting method because it increases the efficiency and accuracy of prediction. For predicting customer behaviors, particularly churn status, the ANN method is recommended. To extract and select the important variables influencing customer behaviors, discriminant analysis can be used; it is a very accurate and powerful method for predicting customer classes.

Originality/value

The current study tries to fill this gap by considering five basic and important variables besides RFM in stores, i.e. prize, discount, accepting returns, delay in distribution and the number of items, so that business owners can understand the role that services such as prizes, discounts, distribution and accepting returns play in retaining customers and preventing them from churning. Another innovation of the current study is the comparison of machine-learning methods with their boosting and bagging versions, especially since previous studies do not consider the bagging method. A further reason for the study is the conflicting results regarding which machine-learning method predicts customer behaviors, including churn, most accurately. For example, some studies introduce ANN (Huang et al., 2010; Hung and Wang, 2004; Keramati et al., 2014; Runge et al., 2014), some introduce support vector machine (Guo-en and Wei-dong, 2008; Vafeiadis et al., 2015; Yu et al., 2011) and some introduce DT (Freund and Schapire, 1996; Qureshi et al., 2013; Umayaparvathi and Iyakutti, 2012) as the best predictor, which leaves users of these results unsure about the best prediction method. The current study identifies the best prediction method specifically in the field of store businesses for researchers and owners. Moreover, another innovation of the current study is the use of discriminant analysis for selecting and filtering the variables that are important and effective in predicting churners and non-churners, which is not used in previous studies. The current study is therefore unique in the variables used, the method of comparing their accuracy and the method of selecting effective variables.

Details

Journal of Systems and Information Technology, vol. 19 no. 1/2
Type: Research Article
ISSN: 1328-7265

Article
Publication date: 30 April 2020

Nasim Eslamirad, Soheil Malekpour Kolbadinejad, Mohammadjavad Mahdavinejad and Mohammad Mehranrad

Abstract

Purpose

This research aims to introduce a new methodology that integrates urban design strategies with the supervised machine learning (SML) method. It applies energy engineering modeling (the evaluating phase) to existing green sidewalks and statistical energy modeling (the predicting phase) to new ones, in order to offer algorithms that help find the optimum morphology of green sidewalks, with high outdoor thermal comfort and fewer errors in the results.

Design/methodology/approach

The study's main tool is SML, which predicts the future based on the past. The machine learning is implemented in Python. As in the majority of similar studies, the study consists of two main parts: engineering energy modeling and statistical energy modeling. First, some of the 2268 models are randomly selected, simulated and subjected to sensitivity analysis in ENVI-met. The ENVI-met output, namely the thermal comfort measure predicted mean vote (PMV), together with weather items, then serves as the input to Python. Finally, the resulting data set is processed by SML to reach the final reliable predicted output.

Findings

The SML process allows the study to determine the thermal comfort of current models and other similar sidewalks. The results are evaluated by both the PMV mathematical model and SML error evaluation functions, and they confirm that the average error is about 1%. The method is therefore reliable enough to apply in a variety of similar fields. The findings can support sustainable architecture strategies at building and urban scales, helping to determine, monitor and control energy-based behaviors (thermal comfort, heating, cooling, lighting and ventilation) in the operational phase of existing systems (elements in buildings and constructions) and in the planning and design phase of future built cases, throughout their life spans.
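The abstract does not name the error evaluation functions it used. As an illustration only, the sketch below implements two common choices for regression outputs such as PMV: mean absolute error and root mean squared error. The function names are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

Both functions take the simulated values as `actual` and the model output as `predicted`.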

Research limitations/implications

The limitations of the study relate to the study variables and alternatives, which have a notable impact on the findings. Furthermore, more trustworthy input data yield more accurate output. The modeling and simulation processes are therefore the most significant part of the research for reaching exact results in the final step.

Practical implications

The findings of the study can be helpful in urban design strategies. With outdoor thermal comfort estimated by the machine learning method, urban and landscape designers, policymakers and architects can estimate how their designs perform in terms of air quality and urban health, and can be confident of meeting design goals for thermal comfort in the urban atmosphere.

Social implications

By 2030, cities are expected to be the living space for about three out of five people. Because green infrastructure moderates city climates, a relationship between green spaces and inhabitants' thermal comfort can be deduced. Although improving outdoor thermal comfort through design methods and applications is not a new subject, applying machine learning to predict results offers a new insight into more effective design strategies and into preparing comfortable urban environments. The study's social contribution therefore lies in learning from previous projects and developing more efficient strategies to make cities more comfortable and healthy places to live, with more efficient models and more efficient use of money and time.

Originality/value

The study's results are expected to apply not only in Tehran but also in other climate zones, as a pattern for eco-city design strategies. Although similar studies have been done in other fields, the concept of this study is a new vision in urban studies.

Details

Smart and Sustainable Built Environment, vol. 9 no. 4
Type: Research Article
ISSN: 2046-6099

Article
Publication date: 26 July 2021

Pengcheng Li, Qikai Liu, Qikai Cheng and Wei Lu

Abstract

Purpose

This paper aims to identify data set entities in scientific literature. To address poor recognition caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain.

Design/methodology/approach

Firstly, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus to apply supervised learning. Secondly, a bidirectional encoder representation from transformers (BERT)-based neural model was applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, were introduced to enhance the model generalisability and improve the recognition of data set entities.
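The two augmentation techniques can be sketched on token sequences with BIO tags. The sketch below is an illustration under stated assumptions: the tag names `B-DATASET` and `I-DATASET`, the `[MASK]` placeholder and the function names are hypothetical, and the abstract does not say whether masking is applied per token or per entity.

```python
import random


def entity_spans(tags):
    """Return (start, end) index pairs of B-/I- tagged spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif not tag.startswith("I") and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def replace_entities(tokens, tags, dictionary, rng):
    """Entity replacement: swap each entity for a random dictionary entry."""
    out_tokens, out_tags, pos = [], [], 0
    for start, end in entity_spans(tags):
        out_tokens += tokens[pos:start]
        out_tags += tags[pos:start]
        new_entity = rng.choice(dictionary).split()
        out_tokens += new_entity
        out_tags += ["B-DATASET"] + ["I-DATASET"] * (len(new_entity) - 1)
        pos = end
    return out_tokens + tokens[pos:], out_tags + tags[pos:]


def mask_entities(tokens, tags, mask="[MASK]"):
    """Entity masking: hide entity tokens so the model must rely on context."""
    return [mask if tag != "O" else tok for tok, tag in zip(tokens, tags)], list(tags)
```

Replacement exposes the model to more entity names in the same context, while masking pushes it to recognize entities from the surrounding words alone.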

Findings

In the absence of training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition.

Originality/value

This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors’ knowledge, this is the first attempt to apply distant learning to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address the problem inherent in distant supervised learning methods, which the existing research has mostly ignored. The experimental results demonstrate that our approach effectively improves the recognition of data set entities, especially long-tailed data set entities.

Article
Publication date: 4 June 2021

Miao Tian, Ying Cui, Haixia Long and Junxia Li

Abstract

Purpose

In novelty detection, the autoencoder-based image reconstruction strategy is one of the mainstream solutions. The basic idea is that once the autoencoder is trained on normal data, it has a low reconstruction error on normal data. However, when faced with complex natural images, conventional pixel-level reconstruction becomes poor and does not show promising results. This paper aims to provide a new method for improving the performance of autoencoder-based novelty detection.

Design/methodology/approach

To solve the problem that conventional pixel-level reconstruction cannot effectively extract the global semantic information of the image, a novel model that combines an attention mechanism with a self-supervised learning method is proposed. First, an auxiliary task, reconstructing a rotated image, is set to enable the network to learn global semantic feature information. Then, the channel attention mechanism is introduced to perform adaptive feature refinement on the intermediate feature map to optimize the correspondingly passed feature map.
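The rotation-based auxiliary task can be illustrated by how its training pairs might be generated. The sketch below assumes the network receives a rotated copy of the image as input and must reconstruct the upright original; the abstract does not confirm this pairing, and the function names are hypothetical. Images are plain 2-D lists of pixel values, and the autoencoder and channel attention module are not shown.

```python
def rotate90(image):
    """Rotate a 2-D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_pairs(image):
    """Build (input, target) pairs for rotations of 0, 90, 180 and 270 degrees.

    The input is the rotated image and the target is always the original,
    which is an assumption about the auxiliary task.
    """
    pairs = []
    current = [list(row) for row in image]
    for _ in range(4):
        pairs.append((current, image))
        current = rotate90(current)
    return pairs
```

Because undoing a rotation requires knowing what the object is and which way is up, a task of this kind encourages features that capture global structure and not only local pixel statistics.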

Findings

Experimental results on three public data sets show that the proposed method has promising performance for novelty detection.

Originality/value

This study explores the ability of self-supervised learning methods and attention mechanism to extract features on a single class of images. In this way, the performance of novelty detection can be improved.

Details

Industrial Robot: the international journal of robotics research and application, vol. 48 no. 5
Type: Research Article
ISSN: 0143-991X

Article
Publication date: 16 August 2023

Fanshu Zhao, Jin Cui, Mei Yuan and Juanru Zhao

Abstract

Purpose

The purpose of this paper is to present a weakly supervised learning method to perform health evaluation and predict the remaining useful life (RUL) of rolling bearings.

Design/methodology/approach

Based on the principle that bearing health degrades as service time increases, a weakly labeled dataset of qualitative pairwise comparisons of bearing health is extracted from the original time-series monitoring data of the bearing. A quantitative evaluation model for the bearing health indicator (HI) is obtained by training a carefully designed neural network structure on the qualitative comparison data between different health statuses. The remaining useful life is then predicted using the bearing health evaluation model and the degradation tolerance threshold. To validate the feasibility, efficiency and superiority of the proposed method, comparison experiments are designed and carried out on a widely used bearing dataset.
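Two steps of this pipeline lend themselves to a short sketch: building weak pairwise labels from sample order, and turning a health indicator series into an RUL estimate with a threshold. The sketch assumes that an earlier sample is always at least as healthy as a later one, that the HI decreases as health degrades, and that the RUL is found by linear extrapolation. The extrapolation rule and the function names are illustrative assumptions; the neural network itself is not shown.

```python
def comparison_pairs(num_samples, min_gap=1):
    """Return (earlier, later) index pairs; the earlier sample is assumed healthier."""
    return [
        (i, j)
        for i in range(num_samples)
        for j in range(i + min_gap, num_samples)
    ]


def remaining_useful_life(health, threshold, step=1.0):
    """Estimate the time until the health indicator falls to the threshold.

    health: HI values at equally spaced times, higher meaning healthier.
    Uses the average slope over the whole series (an assumption).
    """
    if len(health) < 2:
        raise ValueError("need at least two health values")
    if health[-1] <= threshold:
        return 0.0
    slope = (health[-1] - health[0]) / ((len(health) - 1) * step)
    if slope >= 0:
        return float("inf")  # no observed degradation to extrapolate
    return (threshold - health[-1]) / slope
```

A larger `min_gap` keeps only pairs that are far apart in time, where the ordering assumption is more likely to hold.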

Findings

The method achieves the transformation of bearing health from qualitative comparison to quantitative evaluation via a learning algorithm, which is promising in industrial equipment health evaluation and prediction.

Originality/value

The method achieves the transformation of bearing health from qualitative comparison to quantitative evaluation via a learning algorithm, which is promising in industrial equipment health evaluation and prediction.

Details

Engineering Computations, vol. 40 no. 7/8
Type: Research Article
ISSN: 0264-4401

Article
Publication date: 9 October 2019

Francisco Villarroel Ordenes and Shunyuan Zhang

Abstract

Purpose

The purpose of this paper is to describe and position the state-of-the-art of text and image mining methods in business research. By providing a detailed conceptual and technical review of both methods, it aims to increase their utilization in service research.

Design/methodology/approach

In the first stage, the authors review business literature in marketing, operations and management concerning the use of text and image mining methods. In the second stage, the authors identify and analyze empirical papers that used text and image mining methods in service journals and premier business journals. Finally, avenues for further research in services are provided.

Findings

The manuscript identifies seven text mining methods and describes the approaches, processes, techniques and algorithms involved in their implementation. Four of these methods are positioned similarly for image mining. There are 39 papers using text mining in service research, with a focus on measuring consumer sentiment, experiences and service quality. Because image mining has not been used in service journals, the authors review its application in marketing and management and suggest ideas for further research in services.

Research limitations/implications

This manuscript focuses on the different methods and their implementation in service research, but it does not offer a complete review of business literature using text and image mining methods.

Practical implications

The results have a number of implications for the discipline that are presented and discussed. The authors provide research directions using text and image mining methods in service priority areas such as artificial intelligence, frontline employees, transformative consumer research and customer experience.

Originality/value

The manuscript provides an introduction to text and image mining methods to service researchers and practitioners interested in the analysis of unstructured data. This paper provides several suggestions concerning the use of new sources of data (e.g. customer reviews, social media images, employee reviews and emails), measurement of new constructs (beyond sentiment and valence) and the use of more recent methods (e.g. deep learning).

Details

Journal of Service Management, vol. 30 no. 5
Type: Research Article
ISSN: 1757-5818

Article
Publication date: 29 November 2021

Ziming Zeng, Tingting Li, Shouqiang Sun, Jingjing Sun and Jie Yin

Abstract

Purpose

Twitter fake accounts refer to bot accounts created by third-party organizations to influence public opinion, spread commercial propaganda or impersonate others. The effective identification of bot accounts helps the public accurately judge disseminated information. However, in actual fake account identification, it is expensive and inefficient to manually label Twitter accounts, and the labeled data are usually unbalanced in classes. To this end, the authors propose a novel framework to solve these problems.

Design/methodology/approach

In the proposed framework, the authors introduce the concept of semi-supervised self-training learning and apply it to the real Twitter account data set from Kaggle. Specifically, the authors first train the classifier on the initial small amount of labeled account data, then use the trained classifier to automatically label large-scale unlabeled account data. Next, they iteratively select high-confidence instances from the unlabeled data to expand the labeled data. Finally, an expanded Twitter account training set is obtained. It is worth mentioning that the resampling technique is integrated into the self-training process, and the data classes are balanced at the initial stage of the self-training iteration.
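The self-training loop described above can be sketched with a deliberately simple base classifier. The sketch assumes a single numeric feature per account and a nearest-centroid classifier whose confidence is based on relative distance; these choices, the 0.8 threshold and the function names are illustrative assumptions, and the resampling step is omitted.

```python
def fit_centroids(labeled):
    """Compute the mean feature value per label from (x, label) pairs."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict_with_confidence(centroids, x):
    """Return (label, confidence) using the two nearest centroids."""
    ranked = sorted((abs(x - c), y) for y, c in centroids.items())
    (d1, label), (d2, _) = ranked[0], ranked[1]
    confidence = 1.0 if d1 + d2 == 0 else 1.0 - d1 / (d1 + d2)
    return label, confidence


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels until none remain."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        confident, rest = [], []
        for x in pool:
            label, conf = predict_with_confidence(centroids, x)
            if conf >= threshold:
                confident.append((x, label))
            else:
                rest.append(x)
        if not confident:
            break
        labeled += confident
        pool = rest
    return labeled, pool
```

In each round the classifier is refit on the expanded labeled set, so instances that were ambiguous early on may become confident later; instances that never reach the threshold stay unlabeled.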

Findings

The proposed framework effectively improves labeling efficiency and reduces the influence of class imbalance. It shows excellent identification results on six different base classifiers, especially for the initial small-scale labeled Twitter accounts.

Originality/value

This paper provides novel insights into identifying Twitter fake accounts. First, the authors take the lead in introducing a self-training method to automatically label Twitter accounts in a semi-supervised setting. Second, the resampling technique is integrated into the self-training process to effectively reduce the influence of class imbalance on the identification effect.

Details

Data Technologies and Applications, vol. 56 no. 3
Type: Research Article
ISSN: 2514-9288
