Search results

1 – 10 of 20
Open Access
Article
Publication date: 25 July 2022

Fung Yuen Chin, Kong Hoong Lem and Khye Mun Wong

The number of features in handwritten digit data is often very large due to the different aspects of personal handwriting, leading to high-dimensional data. Therefore, the…

Abstract

Purpose

The number of features in handwritten digit data is often very large due to the different aspects of personal handwriting, leading to high-dimensional data. The employment of a feature selection algorithm therefore becomes crucial for successful classification modeling, because the inclusion of irrelevant or redundant features can mislead the modeling algorithms, resulting in overfitting and a decrease in efficiency.

Design/methodology/approach

The minimum redundancy and maximum relevance (mRMR) and recursive feature elimination (RFE) algorithms are two frequently used feature selection methods. While mRMR is capable of identifying a subset of features that are highly relevant to the targeted classification variable, it still tends to capture redundant features along the way. RFE, on the other hand, can effectively eliminate the less important features and exclude redundant ones, but is limited by the fact that the features it selects are not ranked by importance.
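The mRMR criterion described here can be sketched as a greedy search that scores each candidate feature by its relevance to the class minus its mean redundancy with the already-selected features. Below is a minimal illustrative sketch for discrete-valued features; the `mutual_info` and `mrmr` helpers are our own generic implementations, not code from the article:

```python
import numpy as np

def mutual_info(a, b):
    """Mutual information (in nats) between two discrete vectors,
    estimated from the empirical joint distribution."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    for i, j in zip(a_idx, b_idx):
        joint[i, j] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mrmr(X, y, k):
    """Greedy mRMR: at each step pick the feature maximising
    relevance I(f; y) minus mean redundancy with selected features."""
    n_features = X.shape[1]
    relevance = [mutual_info(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

In a hybrid pipeline of the kind the abstract describes, a subset chosen this way would then be passed to an SVM-based RFE stage for final ranking.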

Findings

The hybrid method was exemplified in binary classifications between digits “4” and “9” and between digits “6” and “8” from a multiple features dataset. The results showed that the hybrid mRMR + support vector machine recursive feature elimination (SVMRFE) outperforms both the standalone support vector machine (SVM) and mRMR.

Originality/value

In view of the respective strengths and deficiencies of mRMR and RFE, this study combined the two methods and used an SVM as the underlying classifier, anticipating that mRMR would make an excellent complement to SVMRFE.

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2634-1964


Article
Publication date: 4 December 2017

Fuzan Chen, Harris Wu, Runliang Dou and Minqiang Li

The purpose of this paper is to build a compact and accurate classifier for high-dimensional classification.

Abstract

Purpose

The purpose of this paper is to build a compact and accurate classifier for high-dimensional classification.

Design/methodology/approach

A classification approach based on class-dependent feature subspace (CFS) is proposed. CFS is a class-dependent integration of a support vector machine (SVM) classifier and associated discriminative features. For each class, our genetic algorithm (GA)-based approach evolves the best subset of discriminative features and SVM classifier simultaneously. To guarantee convergence and efficiency, the authors customize the GA in terms of encoding strategy, fitness evaluation, and genetic operators.
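The GA-based evolution of feature subsets can be illustrated with a simplified sketch: binary masks encode feature subsets, fitness combines classifier accuracy with a small parsimony penalty, and truncation selection with one-point crossover and bit-flip mutation evolves the population. The encoding, fitness and operators here are generic stand-ins for the customized ones in the paper, and a nearest-centroid classifier replaces the SVM so the sketch stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

def centroid_accuracy(X, y, mask):
    """Fitness helper: training accuracy of a nearest-centroid
    classifier restricted to the masked features."""
    if not mask.any():
        return 0.0
    Xm = X[:, mask]
    classes = np.unique(y)
    centroids = np.stack([Xm[y == c].mean(axis=0) for c in classes])
    d = ((Xm[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[d.argmin(axis=1)]
    return float((pred == y).mean())

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.1):
    """Evolve a binary feature mask; fitness rewards accuracy and
    penalises subset size to favour compact solutions."""
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5
    def fitness(ind):
        return centroid_accuracy(X, y, ind) - 0.01 * ind.sum()
    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop])
        order = np.argsort(fit)[::-1]
        parents = pop[order[: pop_size // 2]]       # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                 # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut           # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    fit = np.array([fitness(ind) for ind in pop])
    return pop[fit.argmax()]
```

A class-dependent variant in the spirit of CFS would run this loop once per class, yielding a separate mask and classifier for each.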

Findings

Experimental studies demonstrated that the proposed CFS-based approach is superior to other state-of-the-art classification algorithms on UCI data sets in terms of both concise interpretation and predictive power for high-dimensional data.

Research limitations/implications

UCI data sets rather than real industrial data are used to evaluate the proposed approach. In addition, only single-label classification is addressed in the study.

Practical implications

The proposed method not only constructs an accurate classification model but also obtains a compact combination of discriminative features. It helps business decision-makers gain a concise understanding of high-dimensional data.

Originality/value

The authors propose a compact and effective classification approach for high-dimensional data. Instead of using the same feature subset for all classes, the proposed CFS-based approach obtains the optimal subset of discriminative features and an SVM classifier for each class. The proposed approach enhances both interpretability and predictive power for high-dimensional data.

Details

Industrial Management & Data Systems, vol. 117 no. 10
Type: Research Article
ISSN: 0263-5577


Article
Publication date: 11 October 2019

Ahsan Mahmood and Hikmat Ullah Khan

The purpose of this paper is to apply state-of-the-art machine learning techniques for assessing the quality of the restaurants using restaurant inspection data. The machine…

Abstract

Purpose

The purpose of this paper is to apply state-of-the-art machine learning techniques for assessing the quality of restaurants using restaurant inspection data. Machine learning techniques are applied to solve real-world problems in all spheres of life. Health and food departments pay regular visits to restaurants for inspection and mark the condition of the restaurant on the basis of the inspection. These inspections consider many factors that determine the condition of the restaurants and make it possible for the authorities to classify the restaurants.

Design/methodology/approach

In this paper, standard machine learning techniques (support vector machines, naïve Bayes and random forest classifiers) are applied to classify the critical level of the restaurants on the basis of features identified during the inspection. The importance of different inspection factors is determined via feature selection using the minimum-redundancy-maximum-relevance (mRMR) and learning vector quantization (LVQ) feature importance methods.

Findings

The experiments are accomplished on the real-world New York City restaurant inspection data set that contains diverse inspection features. The results show that the nonlinear support vector machine achieves better accuracy than other techniques. Moreover, this research study investigates the importance of different factors of restaurant inspection and finds that inspection score and grade are significant features. The performance of the classifiers is measured by using the standard performance evaluation measures of accuracy, sensitivity and specificity.

Originality/value

This research uses a real-world data set of restaurant inspection that has, to the best of the authors’ knowledge, never been used previously by researchers. The findings are helpful in identifying the best restaurants and help finding the factors that are considered important in restaurant inspection. The results are also important in identifying possible biases in restaurant inspections by the authorities.

Details

The Electronic Library, vol. 37 no. 6
Type: Research Article
ISSN: 0264-0473


Article
Publication date: 25 January 2022

Tobias Mueller, Alexander Segin, Christoph Weigand and Robert H. Schmitt

In the determination of the measurement uncertainty, the GUM procedure requires the building of a measurement model that establishes a functional relationship between the…

Abstract

Purpose

In the determination of the measurement uncertainty, the GUM procedure requires the building of a measurement model that establishes a functional relationship between the measurand and all influencing quantities. Since the effort of modelling, as well as of quantifying the measurement uncertainties, depends on the number of influencing quantities considered, the aim of this study is to determine relevant influencing quantities and to remove irrelevant ones from the dataset.

Design/methodology/approach

In this work, it was investigated whether the effort of modelling for the determination of measurement uncertainty can be reduced by the use of feature selection (FS) methods. For this purpose, 9 different FS methods were tested on 16 artificial test datasets, whose properties (number of data points, number of features, complexity, features with low influence and redundant features) were varied via a design of experiments.

Findings

Based on a success metric, the stability, universality and complexity of the method, two FS methods could be identified that reliably identify relevant and irrelevant influencing quantities for a measurement model.

Originality/value

For the first time, FS methods were applied to datasets with properties of classical measurement processes. The simulation-based results serve as a basis for further research in the field of FS for measurement models. The identified algorithms will be applied to real measurement processes in the future.

Details

International Journal of Quality & Reliability Management, vol. 40 no. 3
Type: Research Article
ISSN: 0265-671X


Open Access
Article
Publication date: 11 August 2020

Hongfang Zhou, Xiqian Wang and Yao Zhang

Feature selection is an essential step in data mining. The core of it is to analyze and quantize the relevancy and redundancy between the features and the classes. In CFR feature…


Abstract

Feature selection is an essential step in data mining. Its core is to analyze and quantify the relevancy and redundancy between features and classes. The CFR feature selection method rarely considers which feature to choose when two or more features have the same value under the evaluation criterion. To address this problem, the standard deviation is employed to adjust the importance between relevancy and redundancy. Based on this idea, a novel feature selection method named Feature Selection Based on Weighted Conditional Mutual Information (WCFR) is introduced. Experimental results on ten datasets show that the proposed method achieves higher classification accuracy.
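The conditional mutual information underlying such weighted criteria can be estimated directly from empirical counts. The helper below is a generic sketch of that quantity, I(X; Y | Z) for discrete data, and is not the authors' WCFR scoring function:

```python
import numpy as np
from collections import Counter

def cond_mutual_info(x, y, z):
    """Empirical conditional mutual information I(X;Y|Z) in nats,
    computed from joint counts of discrete sequences."""
    n = len(x)
    pxyz = Counter(zip(x, y, z))
    pxz = Counter(zip(x, z))
    pyz = Counter(zip(y, z))
    pz = Counter(z)
    total = 0.0
    for (xi, yi, zi), c in pxyz.items():
        p = c / n
        # p(x,y,z) * p(z) / (p(x,z) * p(y,z)) inside the log
        total += p * np.log(p * (pz[zi] / n)
                            / ((pxz[(xi, zi)] / n) * (pyz[(yi, zi)] / n)))
    return total
```

A WCFR-style criterion would weight such terms, with the standard deviation of the scores acting as a tie-breaker between equally ranked features.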

Details

Applied Computing and Informatics, vol. 20 no. 1/2
Type: Research Article
ISSN: 2634-1964


Article
Publication date: 16 March 2021

P. Padmavathy, S. Pakkir Mohideen and Zameer Gulzar

The purpose of this paper is to initially perform Senti-WordNet (SWN)- and pointwise mutual information (PMI)-based polarity computation and polarity updating. When the SWN…

Abstract

Purpose

The purpose of this paper is to initially perform Senti-WordNet (SWN)- and pointwise mutual information (PMI)-based polarity computation and polarity updating. When the SWN and PMI polarities mismatch, the vote flipping algorithm (VFA) is employed.

Design/methodology/approach

Recently, research on sentiment analysis (SA) has increased massively in domains like social media (SM), healthcare, hotels, cars and product data. However, there is no approach for analyzing the positive or negative orientation of every single aspect in a document (a tweet, a review, a piece of news, among others). For SA as well as polarity classification, several researchers have used SWN as a lexical resource. Nevertheless, these lexicons show lower performance for sentiment classification (SC) than domain-specific lexicons (DSL). Likewise, in some scenarios, the same term is used differently in domain and general knowledge lexicons. Across different domains, most words have one sentiment class in SWN, while their occurrence in the annotated data set signifies a strong inclination towards the other sentiment class. Hence, this paper chiefly concentrates on the drawbacks of adapting a domain-dependent sentiment lexicon (DDSL) from a collection of labeled user reviews and a domain-independent lexicon (DIL), proposing a framework centered on information theory that can predict the correct polarity of words (positive, neutral and negative). The proposed work initially performs SWN- and PMI-based polarity computation and polarity updating. When the SWN and PMI polarities mismatch, the vote flipping algorithm (VFA) is employed. Finally, the predicted polarity is input to the mtf-idf-based SVM-NN classifier for the SC of reviews. The outcomes are examined and contrasted with other existing techniques to verify that the proposed work predicts the class of the reviews more effectively for different datasets.
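The PMI-based polarity computation mentioned here can be illustrated with a small corpus-level sketch: a word's semantic orientation is the difference between its PMI with the positive class and its PMI with the negative class. The function below is a generic illustration with add-one smoothing, not the authors' exact formulation:

```python
import math
from collections import Counter

def pmi_polarity(docs, labels):
    """Semantic-orientation score per word: PMI(word, pos) - PMI(word, neg).
    A positive score means the word leans positive in this corpus.
    Uses add-one smoothing so unseen (word, class) pairs stay finite."""
    word_class = Counter()
    word_count = Counter()
    class_count = Counter(labels)
    for doc, lab in zip(docs, labels):
        for w in set(doc.split()):
            word_class[(w, lab)] += 1
            word_count[w] += 1
    n = len(docs)
    scores = {}
    for w in word_count:
        p_w = word_count[w] / n
        so = 0.0
        for lab, sign in (("pos", 1), ("neg", -1)):
            p_wc = (word_class[(w, lab)] + 1) / (n + 2)   # smoothed joint
            p_c = class_count[lab] / n
            so += sign * math.log(p_wc / (p_w * p_c))
        scores[w] = so
    return scores
```

In a framework like the one described, such PMI scores would be compared against SWN polarities, with a vote-flipping step resolving mismatches.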

Findings

There is no approach for analyzing the positive or negative orientation of every single aspect in a document (a tweet, a review, a piece of news, among others). For SA as well as polarity classification, several researchers have used SWN as a lexical resource. Nevertheless, these lexicons show lower performance for sentiment classification (SC) than domain-specific lexicons (DSL). Likewise, in some scenarios, the same term is used differently in domain and general knowledge lexicons. Across different domains, most words have one sentiment class in SWN, while their occurrence in the annotated data set signifies a strong inclination towards the other sentiment class.

Originality/value

The proposed work initially performs SWN- and PMI-based polarity computation and polarity updating. When the SWN and PMI polarities mismatch, the vote flipping algorithm (VFA) is employed.

Article
Publication date: 20 August 2019

Sandhya N., Philip Samuel and Mariamma Chacko

Telecommunication has a decisive role in the development of technology in the current era. The number of mobile users with multiple SIM cards is increasing every second. Hence…

Abstract

Purpose

Telecommunication has a decisive role in the development of technology in the current era. The number of mobile users with multiple SIM cards is increasing every second. Hence, telecommunication is a significant area in which big data technologies are needed. Competition among the telecommunication companies is high due to customer churn. Customer retention in telecom companies is one of the major problems. The paper aims to discuss this issue.

Design/methodology/approach

The authors recommend an Intersection-Randomized Algorithm (IRA) using MapReduce functions to avoid data duplication in the mobile user call data of telecommunication service providers. The authors use the agent-based model (ABM) to predict the complex mobile user behaviour to prevent customer churn with a particular telecommunication service provider.
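The idea of removing duplicate call records with MapReduce-style functions can be sketched generically: the map phase keys each record by its identifying fields, and the reduce phase keeps one record per key. This is a plain single-process illustration of the map/reduce pattern, not the authors' Intersection-Randomized Algorithm, and the record fields (`caller`, `callee`, `ts`) are assumed for the example:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit each call record keyed by its identifying fields."""
    for rec in records:
        yield (rec["caller"], rec["callee"], rec["ts"]), rec

def reduce_phase(pairs):
    """Reduce: group records by key and keep one per key,
    discarding duplicates that arrive from multiple SIMs or files."""
    buckets = defaultdict(list)
    for key, rec in pairs:
        buckets[key].append(rec)
    return [recs[0] for recs in buckets.values()]

def deduplicate(records):
    return reduce_phase(map_phase(records))
```

In a real deployment these two phases would run as distributed MapReduce jobs over the full call data set before the agent-based churn model is applied.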

Findings

The agent-based model increases the prediction accuracy due to the dynamic nature of agents. ABM suggests rules based on mobile user variable features using multiple agents.

Research limitations/implications

The authors have not considered the microscopic behaviour of the customer churn based on complex user behaviour.

Practical implications

This paper shows the effectiveness of the IRA along with the agent-based model to predict the mobile user churn behaviour. The advantages of the proposed model are as follows: the user churn prediction system is straightforward, cost-effective, flexible and distributed, with good business profit.

Originality/value

This paper shows the customer churn prediction of complex human behaviour in an effective and flexible manner in a distributed environment, using the Intersection-Randomized MapReduce Algorithm together with an agent-based model.

Details

Data Technologies and Applications, vol. 53 no. 3
Type: Research Article
ISSN: 2514-9288


Article
Publication date: 2 March 2022

Francisco Elânio Bezerra, Flavio Grassi, Cleber Gustavo Dias and Fabio Henrique Pereira

This paper aims to propose an approach based upon the principal component analysis (PCA) to define a contribution rate for each variable and then select the main variables as…

Abstract

Purpose

This paper aims to propose an approach based upon principal component analysis (PCA) to define a contribution rate for each variable and then select the main variables as inputs to a neural network for energy load forecasting in the southeastern region of Brazil.

Design/methodology/approach

The proposed approach defines the contribution rate of each variable as a weighted sum of the inner products between the variable and each principal component. The contribution rate is then used to select the most important features among 27 variables and 6,815 electricity data records for a multilayer perceptron backpropagation prediction model. Several tests are performed, starting from the most significant variable as input and adding the next most significant variable, and so on, to predict the energy load (GWh). The Kaiser–Meyer–Olkin and Bartlett sphericity tests were used to verify the overall consistency of the data for factor analysis.
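The contribution-rate idea can be sketched as follows: after standardising the variables, each variable's rate is taken as an explained-variance-weighted sum of the absolute inner products between that variable and the principal components. The specific weighting (explained variance ratio) and the final normalisation below are our assumptions, not necessarily the paper's exact choices:

```python
import numpy as np

def contribution_rates(X):
    """Per-variable contribution rate: explained-variance-weighted sum
    of |loading| of each standardised variable on each principal component,
    normalised so the rates sum to 1."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    cov = np.cov(Xs, rowvar=False)                 # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # sort PCs by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    weights = eigvals / eigvals.sum()              # explained variance ratio
    rates = (np.abs(eigvecs) * weights).sum(axis=1)
    return rates / rates.sum()
```

Variables would then be ranked by rate and fed incrementally into the prediction model, as the abstract describes.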

Findings

Although energy load forecasting is an area for which databases with tens or hundreds of variables are available, the approach selected only six variables, which together contribute more than 85% to the model. While the plant variables plus energy exchange contribute only 14.14%, the stored energy variable alone has a contribution rate of 26.31% and is fundamental to the prediction accuracy.

Originality/value

Besides improving the forecasting accuracy and providing a faster predictor, the proposed PCA-based approach for calculating the contribution rate of input variables provides a better understanding of the underlying process that generated the data, which is fundamental to the Brazilian reality due to its accentuated climatic and economic variations.

Details

International Journal of Energy Sector Management, vol. 16 no. 6
Type: Research Article
ISSN: 1750-6220


Open Access
Article
Publication date: 28 July 2020

Noura AlNuaimi, Mohammad Mehedy Masud, Mohamed Adel Serhani and Nazar Zaki

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time…


Abstract

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time. However, storing and processing large and varied datasets (known as big data) in real time is challenging. In machine learning, streaming feature selection has always been considered a superior technique for selecting the relevant subset of features from highly dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to features that arrive consecutively over time: the total number of features is not known in advance, while the number of instances is fixed. Many scholars in the field have proposed streaming-feature-selection algorithms in attempts to find a proper solution to this problem. This paper presents an exhaustive and methodological introduction to these techniques. The study reviews the traditional feature-selection algorithms and then scrutinizes the current algorithms that use streaming feature selection, determining their strengths and weaknesses. The survey also sheds light on the ongoing challenges in big-data research.
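A streaming feature selector in this sense processes features one at a time as they arrive, with no access to future features. The toy filter below illustrates the idea with simple correlation thresholds for relevance (to the target) and redundancy (to already-kept features); the algorithms surveyed in the literature use more principled criteria, so this is only a structural sketch:

```python
import numpy as np

def streaming_select(feature_stream, y, rel_thresh=0.3, red_thresh=0.9):
    """Online filter: as each (index, values) feature arrives, keep it if
    |corr(feature, y)| >= rel_thresh and it is not a near-duplicate
    (|corr| >= red_thresh) of any feature kept so far."""
    kept = []  # list of (index, values) pairs retained so far
    for idx, f in feature_stream:
        if abs(np.corrcoef(f, y)[0, 1]) < rel_thresh:
            continue  # irrelevant: discard immediately
        if any(abs(np.corrcoef(f, g)[0, 1]) >= red_thresh for _, g in kept):
            continue  # redundant with an already-kept feature
        kept.append((idx, f))
    return [idx for idx, _ in kept]
```

Because each decision uses only the features seen so far, memory stays proportional to the kept subset rather than the full (unknown) feature count.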

Details

Applied Computing and Informatics, vol. 18 no. 1/2
Type: Research Article
ISSN: 2634-1964


Article
Publication date: 8 December 2022

Jonathan S. Greipel, Regina M. Frank, Meike Huber, Ansgar Steland and Robert H. Schmitt

To ensure product quality within a manufacturing process, inspection processes are indispensable. One task of inspection planning is the selection of inspection characteristics…

Abstract

Purpose

To ensure product quality within a manufacturing process, inspection processes are indispensable. One task of inspection planning is the selection of inspection characteristics. For optimization of costs and benefits, key characteristics can be defined by which the product quality can be checked with sufficient accuracy. The manual selection of key characteristics requires substantial planning effort and becomes uneconomic if many product variants prevail. This paper, therefore, aims to show a method for the efficient determination of key characteristics.

Design/methodology/approach

The authors present a novel Algorithm for the Selection of Key Characteristics (ASKC) based on an auto-encoder and a risk analysis. Given historical measurement data and tolerances, the algorithm clusters characteristics with redundant information and selects key characteristics based on a risk assessment. The authors compare ASKC with the algorithm Principal Feature Analysis (PFA) using artificial and historical measurement data.
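The redundancy-clustering step can be imitated with a much simpler stand-in: greedily group characteristics whose pairwise correlation exceeds a threshold, then keep the highest-risk member of each group as its key characteristic. This sketch replaces the paper's auto-encoder with plain correlation and assumes a precomputed per-characteristic risk score; it is not the ASKC algorithm itself:

```python
import numpy as np

def select_key_characteristics(X, risk, corr_thresh=0.9):
    """Greedy stand-in for redundancy clustering: characteristics whose
    absolute pairwise correlation exceeds corr_thresh are grouped, and
    the highest-risk member of each group is kept as a key characteristic."""
    n = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(n))
    keys = []
    while unassigned:
        seed = unassigned.pop(0)
        cluster = [seed] + [j for j in unassigned if corr[seed, j] >= corr_thresh]
        unassigned = [j for j in unassigned if j not in cluster]
        keys.append(max(cluster, key=lambda j: risk[j]))
    return sorted(keys)
```

Inspecting only the returned characteristics then covers each redundancy group while prioritising the riskiest measurement in it.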

Findings

The authors find that ASKC delivers superior results to PFA. Findings show that the algorithms enable the cost-efficient selection of key characteristics while maintaining the informative value of the inspection concerning quality.

Originality/value

This paper fills an identified gap in inspection planning by providing a method, ASKC, for the efficient selection of key characteristics.

Details

International Journal of Quality & Reliability Management, vol. 40 no. 7
Type: Research Article
ISSN: 0265-671X

