Search results
1 – 10 of 417
Abstract
Purpose
The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification is an essential task for many purposes in information retrieval, such as document organization, text filtering and sentiment analysis. Ensemble learning has been extensively studied to construct efficient text classification schemes with higher predictive performance and generalization ability. The purpose of this paper is to provide diversity among the classification algorithms of the ensemble, which is a key issue in ensemble design.
Design/methodology/approach
An ensemble scheme based on hybrid supervised clustering is presented for text classification. In the presented scheme, supervised hybrid clustering, which is based on cuckoo search algorithm and k-means, is introduced to partition the data samples of each class into clusters so that training subsets with higher diversities can be provided. Each classifier is trained on the diversified training subsets and the predictions of individual classifiers are combined by the majority voting rule. The predictive performance of the proposed classifier ensemble is compared to conventional classification algorithms (such as Naïve Bayes, logistic regression, support vector machines and C4.5 algorithm) and ensemble learning methods (such as AdaBoost, bagging and random subspace) using 11 text benchmarks.
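The scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: plain k-means stands in for the cuckoo-search/k-means hybrid, decision trees stand in for the member classifiers, and the rule "subset m takes cluster m of each class" is one simple assumed way to form the diversified training subsets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy two-class data standing in for TF-IDF text features.
X = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(4.0, 1.0, (60, 5))])
y = np.array([0] * 60 + [1] * 60)

K = 3  # clusters per class == number of ensemble members (assumed)
cluster_of = np.empty_like(y)
for c in (0, 1):
    idx = np.where(y == c)[0]
    # Supervised clustering: each class is partitioned separately.
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X[idx])
    cluster_of[idx] = km.labels_

# Subset m = cluster m of every class, so each member sees a different
# region of each class; this is one simple way to inject diversity.
members = []
for m in range(K):
    sel = cluster_of == m
    members.append(DecisionTreeClassifier(random_state=m).fit(X[sel], y[sel]))

def vote(X_new):
    # Combine the individual predictions by the majority voting rule.
    votes = np.stack([clf.predict(X_new) for clf in members])
    return (votes.sum(axis=0) * 2 > len(members)).astype(int)

print(vote(np.array([[0.0] * 5, [4.0] * 5])))  # expect [0 1]
```

On this well-separated toy data every member agrees, but on real text features the per-class clusters differ enough that the majority vote corrects individual members' mistakes.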
Findings
The experimental results indicate that the presented classifier ensemble outperforms the conventional classification algorithms and ensemble learning methods for text classification.
Originality/value
The presented ensemble scheme is the first to use supervised clustering to obtain a diverse ensemble for text classification.
Jianhua Zhang, Liangchen Li, Fredrick Ahenkora Boamah, Dandan Wen, Jiake Li and Dandan Guo
Abstract
Purpose
Traditional case-adaptation methods have poor accuracy, low efficiency and limited applicability, which cannot meet the needs of knowledge users. To address the shortcomings of the existing research in the industry, this paper proposes a case-adaptation optimization algorithm to support the effective application of tacit knowledge resources.
Design/methodology/approach
The attribute simplification algorithm based on the forward search strategy in the neighborhood decision information system is implemented to realize the vertical dimensionality reduction of the case base, and the fuzzy C-mean (FCM) clustering algorithm based on the simulated annealing genetic algorithm (SAGA) is implemented to compress the case base horizontally with multiple decision classes. Then, the subspace K-nearest neighbors (KNN) algorithm is used to induce the decision rules for the set of adapted cases to complete the optimization of the adaptation model.
Findings
The findings suggest that the rapid enrichment of data, information and tacit knowledge in the field of practice has led to low efficiency and low utilization in knowledge dissemination, and that this algorithm can effectively alleviate users' “knowledge disorientation” in the era of the knowledge economy.
Practical implications
This study provides a model with case knowledge that meets users’ needs, thereby effectively improving the application of the tacit knowledge in the explicit case base and the problem-solving efficiency of knowledge users.
Social implications
The adaptation model can serve as a stable and efficient prediction model for the effects of the plans of many logistics and e-commerce enterprises.
Originality/value
This study designs a multi-decision class case-adaptation optimization study based on forward attribute selection strategy-neighborhood rough sets (FASS-NRS) and simulated annealing genetic algorithm-fuzzy C-means (SAGA-FCM) for tacit knowledgeable exogenous cases. By effectively organizing and adjusting tacit knowledge resources, knowledge service organizations can maintain their competitive advantages. The algorithm models established in this study develop theoretical directions for a multi-decision class case-adaptation optimization study of tacit knowledge.
Andreas Pick and Matthijs Carpay
Abstract
This chapter investigates the performance of different dimension reduction approaches for large vector autoregressions in multi-step ahead forecasts. The authors consider factor-augmented VAR models using principal components and partial least squares, random subset regression, random projection, random compression, and estimation via LASSO and Bayesian VAR. The authors compare the accuracy of iterated and direct multi-step point and density forecasts. The comparison is based on macroeconomic and financial variables from the FRED-MD database. The findings suggest that random subspace methods and LASSO estimation deliver the most precise forecasts.
Laouni Djafri, Djamel Amar Bensaber and Reda Adjoudj
Abstract
Purpose
This paper aims to solve the problems of big data analytics for prediction, namely volume, veracity and velocity, by improving the prediction result to an acceptable level in the shortest possible time.
Design/methodology/approach
This paper is divided into two parts. The first improves the prediction result; two ideas are proposed in this part: a double-pruning enhanced random forest algorithm and the extraction of a shared learning base via stratified random sampling, which yields a learning base representative of all the original data. The second part designs a distributed architecture, supported by new technology solutions, that works coherently and efficiently with the sampling strategy under the supervision of the Map-Reduce algorithm.
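The stratified random sampling step, which is what makes the shared learning base representative of the original class mix, can be sketched as follows; the class counts and the 10% sampling fraction are illustrative assumptions, not values from the paper.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
# Toy labelled data set with an imbalanced class mix.
y = np.array([0] * 700 + [1] * 250 + [2] * 50)
rng.shuffle(y)

def stratified_sample(y, frac, rng):
    """Draw `frac` of each class so the sample keeps the class proportions."""
    keep = []
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        n = max(1, round(frac * len(idx)))  # at least one sample per class
        keep.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(keep)

shared = stratified_sample(y, 0.10, rng)
print(Counter(y[shared].tolist()))  # expect counts 70 / 25 / 5 for classes 0 / 1 / 2
```

Because each class is sampled separately at the same rate, the shared base preserves the 70/25/5 class proportions of the original data, which is the representativeness property the abstract relies on.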
Findings
The representative learning base obtained by integrating two learning bases, the partial base and the shared base, presents an excellent representation of the original data set and gives very good results for Big Data predictive analytics. Furthermore, these results were supported by the improved random forest supervised learning method, which played a key role in this context.
Originality/value
This work concerns all companies, especially those that hold large amounts of information and want to mine it to improve their knowledge of the customer and optimize their campaigns.
Showmitra Kumar Sarkar, Swapan Talukdar, Atiqur Rahman, Shahfahad and Sujit Kumar Roy
Abstract
Purpose
The present study aims to construct ensemble machine learning (EML) algorithms for groundwater potentiality mapping (GPM) in the Teesta River basin of Bangladesh, including random forest (RF) and random subspace (RSS).
Design/methodology/approach
The RF and RSS models have been implemented to integrate 14 selected groundwater condition parameters with groundwater inventories for generating GPMs. The GPMs were then validated using the empirical and binormal receiver operating characteristic (ROC) curves.
Findings
The very high (831–1,200 km²) and high (521–680 km²) groundwater potential areas were predicted using the EML algorithms. Based on the ROC's area under the curve (AUC), the RSS model (AUC = 0.892) outperformed the RF model.
Originality/value
Two new EML models have been constructed for GPM. These findings will aid in proposing sustainable water resource management plans.
Abstract
Purpose
The purpose of this paper is to introduce a new hybrid method for reducing dimensionality of high dimensional data.
Design/methodology/approach
The literature on dimensionality reduction (DR) includes research efforts that combine random projection (RP) and singular value decomposition (SVD) so as to derive the benefits of both methods. However, SVD is well known for its computational complexity. Clustering under the notion of concept decomposition has been shown to be less computationally complex than SVD and useful for DR. The method proposed in this paper combines RP and fuzzy k‐means clustering (FKM) to reduce the dimensionality of the data.
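A minimal sketch of such an RP plus FKM pipeline follows, under assumptions: a plain Gaussian random projection stands in for whatever RP scheme the authors use, the fuzzy k-means updates are the textbook fuzzy C-means ones with fuzzifier m = 2, and the k membership columns are taken as the final reduced representation in the spirit of concept decomposition.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 1000))  # high-dimensional data (e.g. term vectors)

# Step 1: Gaussian random projection to r dimensions.
r = 50
R = rng.normal(size=(1000, r)) / np.sqrt(r)  # scaling roughly preserves norms
Xp = X @ R                                   # shape (200, r)

# Step 2: fuzzy k-means on the projected data (textbook FCM, m = 2).
k, n_iter, eps = 4, 20, 1e-9
C = Xp[rng.choice(len(Xp), size=k, replace=False)]  # initial centroids
for _ in range(n_iter):
    # Distances from every point to every centroid, kept strictly positive.
    d = np.linalg.norm(Xp[:, None, :] - C[None, :, :], axis=2) + eps
    inv = 1.0 / d ** 2
    U = inv / inv.sum(axis=1, keepdims=True)  # memberships; rows sum to 1
    W = U ** 2                                # fuzzifier m = 2
    C = (W.T @ Xp) / W.sum(axis=0)[:, None]   # weighted centroid update

reduced = U  # each point described by k cluster memberships
print(Xp.shape, reduced.shape)
```

The two-stage reduction (1000 → 50 → 4 here) is the point: the cheap random projection absorbs most of the dimensionality, and the fuzzy clustering replaces the SVD step at lower cost.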
Findings
The proposed RP‐FKM is computationally less complex than SVD and RP‐SVD. On image data, the proposed RP‐FKM produced less distortion than RP. The proposed RP‐FKM provides better text retrieval results than conventional RP and performs similarly to RP‐SVD. For the text retrieval task, the superiority of SVD over the other DR methods noted here is in good agreement with the analysis reported by Moravec.
Originality/value
The hybrid method proposed in this paper, combining RP and FKM, is new. Experimental results indicate that the proposed method is useful for reducing dimensionality of high‐dimensional data such as images, text, etc.
Serkan Altuntas, Türkay Dereli and Zülfiye Erdoğan
Abstract
Purpose
This study aims to propose a service quality evaluation model for health-care services.
Design/methodology/approach
In this study, a service quality evaluation model is proposed based on the service quality measurement (SERVQUAL) scale and machine learning algorithms. Primarily, items that affect the quality of service are determined based on the SERVQUAL scale. Subsequently, a service quality assessment model is generated to manage the resources allocated to improvement activities efficiently. Following this phase, a sample classification model is constructed using machine learning algorithms.
Findings
The proposed evaluation model addresses the following questions: What are the potential impact levels of the service quality dimensions on service quality in practice? How should the service quality dimensions be prioritized? Which dimensions of service quality should be improved first? A real-life case study in a public hospital is carried out to show how the proposed model works. The results obtained from the case study show that the proposed model can be applied easily in practice. A remarkably large service gap is also found in the public hospital where the case study was conducted, regarding the general physical conditions and food services.
Originality/value
The primary contribution of this study is threefold. The proposed evaluation model determines the impact levels of the service quality dimensions on service quality in practice; it prioritizes the service quality dimensions in terms of their significance; and it identifies which service quality dimensions should be improved first.
Abstract
Purpose
Most prior attempts at real estate valuation have focused on the use of metadata such as size and property age, neglecting the fact that the building workmanship in the construction of a house is also a key factor in estimating house prices. Building workmanship, such as exterior walls and floor tiling, corresponds to the visual attributes of a house, and such attributes are difficult to capture and evaluate efficiently with classical models like regression analysis. A deep learning approach is taken in the valuation process to utilize this visual information.
Design/methodology/approach
The authors propose a two-input neural network comprising a multilayer perceptron and a convolutional neural network that can utilize both metadata and the visual information from images of the front view of the house.
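A toy forward pass of such a two-input network is sketched below in plain NumPy; the single 3×3 filter with global average pooling, the layer sizes and the metadata fields are all hypothetical stand-ins for the authors' actual architecture, shown only to make the two-branch-then-concatenate structure concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

# --- Convolutional branch: one 3x3 filter + global average pooling ---
def conv_branch(img, kernel):
    h, w = img.shape
    kh, kw = kernel.shape
    out = np.array([[np.sum(img[i:i + kh, j:j + kw] * kernel)
                     for j in range(w - kw + 1)]
                    for i in range(h - kh + 1)])
    return np.array([relu(out).mean()])  # one pooled visual feature

# --- MLP branch: one dense layer over the metadata ---
def mlp_branch(meta, W, b):
    return relu(meta @ W + b)

img = rng.normal(size=(8, 8))   # front-view photo (toy 8x8 grayscale)
meta = rng.normal(size=3)       # e.g. size, age, floor (hypothetical fields)

kernel = rng.normal(size=(3, 3))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=5), 0.0  # head over the concatenated features

# Concatenate the visual feature with the metadata features, then regress.
features = np.concatenate([conv_branch(img, kernel), mlp_branch(meta, W1, b1)])
price = features @ W2 + b2  # scalar price estimate (weights untrained)
print(features.shape)
```

In a real implementation the two branches would be a full CNN and MLP trained jointly; the essential idea is only that both feature vectors meet in one concatenation before the regression head.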
Findings
The authors applied the two-input neural network to Guri City in Gyeonggi Province, South Korea, as a case study and found that the accuracy of house price estimations can be improved by employing image information along with metadata.
Originality/value
Few studies have considered the impact of building workmanship in the valuation process. The authors show that using both photographs and metadata enhances the accuracy of house price estimation.
Abstract
Purpose
Financial statement fraud (FSF) committed by companies implies the current status of the companies may not be healthy. As such, it is important to detect FSF, since such companies tend to conceal bad information, which causes a great loss to various stakeholders. Thus, the objective of the paper is to propose a novel approach to building a classification model to identify FSF, which shows high classification performance and from which human-readable rules are extracted to explain why a company is likely to commit FSF.
Design/methodology/approach
Having prepared multiple sub-datasets to cope with the class imbalance problem, we build a set of decision trees for each sub-dataset; prune the set into a model for that sub-dataset by removing every tree whose accuracy is below the average accuracy of all trees in the set; and then select, among these models, the one that shows the best accuracy. We call the resulting model MRF (Modified Random Forest). Given a new instance, we extract rules from the MRF model to explain whether the company corresponding to the instance is likely to commit FSF.
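The MRF construction just described can be sketched as follows. This is a hedged illustration, not the authors' code: the balanced sub-sampling, the toy data, the validation split used to score trees, and the tree settings (`max_features=3`, which diversifies otherwise deterministic trees) are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
# Toy imbalanced data: many "non-fraud" (0) firms vs few "fraud" (1) firms.
X0 = rng.normal(0.0, 1.0, (300, 6))
X1 = rng.normal(2.0, 1.0, (60, 6))

def make_subsets(n_subsets=5):
    # Class-imbalance handling (assumed): every sub-dataset keeps all fraud
    # cases and a random equal-size slice of the non-fraud cases.
    for _ in range(n_subsets):
        pick = rng.choice(len(X0), size=len(X1), replace=False)
        X = np.vstack([X0[pick], X1])
        y = np.array([0] * len(X1) + [1] * len(X1))
        yield X, y

models = []
for X, y in make_subsets():
    Xtr, Xva, ytr, yva = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    trees = [DecisionTreeClassifier(max_features=3, random_state=s).fit(Xtr, ytr)
             for s in range(7)]
    accs = [t.score(Xva, yva) for t in trees]
    # Prune: keep only trees at or above the set's average accuracy.
    kept = [t for t, a in zip(trees, accs) if a >= np.mean(accs)]
    models.append((np.mean([t.score(Xva, yva) for t in kept]), kept))

# Select the best-scoring pruned forest across sub-datasets: the "MRF".
best_acc, best_forest = max(models, key=lambda p: p[0])
print(len(best_forest), round(best_acc, 3))
```

Rule extraction (omitted here) would then walk the root-to-leaf paths of the kept trees to produce the human-readable explanations the abstract mentions.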
Findings
Experimental results show that the MRF classifier outperformed the benchmark models. The results also revealed that all the variables related to profit belong to the set of the most important indicators of FSF, and that two new variables related to gross profit, which had not been examined in previous studies on FSF, were identified.
Originality/value
This study proposed a method of building a classification model which shows the outstanding performance and provides decision rules that can be used to explain the classification results. In addition, a new way to resolve the class imbalance problem was suggested in this paper.