Search results

1 – 10 of over 6000
Article
Publication date: 6 January 2022

Deepti Sisodia and Dilip Singh Sisodia

The problem of choosing the utmost useful features from hundreds of features from time-series user click data arises in online advertising toward fraudulent publisher's…

Abstract

Purpose

The problem of choosing the utmost useful features from hundreds of features from time-series user click data arises in online advertising toward fraudulent publisher's classification. Selecting feature subsets is a key issue in such classification tasks. Practically, the use of filter approaches is common; however, they neglect the correlations amid features. Conversely, wrapper approaches could not be applied due to their complexities. Moreover, in particular, existing feature selection methods could not handle such data, which is one of the major causes of instability of feature selection.

Design/methodology/approach

To overcome such issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing the publisher's fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where we enumerated an accumulated evaluation of relevant feature subset to search for an optimal feature subset using effective machine learning (ML) models.

Findings

Empirical results prove enhanced classification performance with proposed features in average precision, recall, f1-score and AUC in publisher identification and classification.

Originality/value

The FDAS is evaluated on FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics, first, considering original features, second, with relevant feature subsets selected by feature selection (FS) methods, third, with optimal feature subset obtained by the proposed approach. ANOVA significance test is conducted to demonstrate significant differences between independent features.

Details

Data Technologies and Applications, vol. 56 no. 4
Type: Research Article
ISSN: 2514-9288

Keywords

Article
Publication date: 23 September 2020

Z.F. Zhang, Wei Liu, Egon Ostrosi, Yongjie Tian and Jianping Yi

During the production process of steel strip, some defects may appear on the surface, that is, traditional manual inspection could not meet the requirements of low-cost and…

Abstract

Purpose

During the production process of steel strip, some defects may appear on the surface, that is, traditional manual inspection could not meet the requirements of low-cost and high-efficiency production. The purpose of this paper is to propose a method of feature selection based on filter methods combined with hidden Bayesian classifier for improving the efficiency of defect recognition and reduce the complexity of calculation. The method can select the optimal hybrid model for realizing the accurate classification of steel strip surface defects.

Design/methodology/approach

A large image feature set was initially obtained based on the discrete wavelet transform feature extraction method. Three feature selection methods (including correlation-based feature selection, consistency subset evaluator [CSE] and information gain) were then used to optimize the feature space. Parameters for the feature selection methods were based on the classification accuracy results of hidden Naive Bayes (HNB) algorithm. The selected feature subset was then applied to the traditional NB classifier and leading extended NB classifiers.

Findings

The experimental results demonstrated that the HNB model combined with feature selection approaches has better classification performance than other models of defect recognition. Among the results of this study, the proposed hybrid model of CSE + HNB is the most robust and effective and of highest classification accuracy in identifying the optimal subset of the surface defect database.

Originality/value

The main contribution of this paper is the development of a hybrid model combining feature selection and multi-class classification algorithms for steel strip surface inspection. The proposed hybrid model is primarily robust and effective for steel strip surface inspection.

Details

Engineering Computations, vol. 38 no. 4
Type: Research Article
ISSN: 0264-4401

Keywords

Open Access
Article
Publication date: 28 July 2020

Noura AlNuaimi, Mohammad Mehedy Masud, Mohamed Adel Serhani and Nazar Zaki

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time…

3576

Abstract

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time. However, storing and processing large and varied datasets (known as big data) is challenging to do in real time. In machine learning, streaming feature selection has always been considered a superior technique for selecting the relevant subset features from highly dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to the features that arrive consecutively over time; despite a lack of exact figure on the number of features, numbers of instances are well-established. Many scholars in the field have proposed streaming-feature-selection algorithms in attempts to find the proper solution to this problem. This paper presents an exhaustive and methodological introduction of these techniques. This study provides a review of the traditional feature-selection algorithms and then scrutinizes the current algorithms that use streaming feature selection to determine their strengths and weaknesses. The survey also sheds light on the ongoing challenges in big-data research.

Details

Applied Computing and Informatics, vol. 18 no. 1/2
Type: Research Article
ISSN: 2634-1964

Keywords

Article
Publication date: 28 February 2019

Gabrijela Dimic, Dejan Rancic, Nemanja Macek, Petar Spalevic and Vida Drasute

This paper aims to deal with the previously unknown prediction accuracy of students’ activity pattern in a blended learning environment.

Abstract

Purpose

This paper aims to deal with the previously unknown prediction accuracy of students’ activity pattern in a blended learning environment.

Design/methodology/approach

To extract the most relevant activity feature subset, different feature-selection methods were applied. For different cardinality subsets, classification models were used in the comparison.

Findings

Experimental evaluation oppose the hypothesis that feature vector dimensionality reduction leads to prediction accuracy increasing.

Research limitations/implications

Improving prediction accuracy in a described learning environment was based on applying synthetic minority oversampling technique, which had affected results on correlation-based feature-selection method.

Originality/value

The major contribution of the research is the proposed methodology for selecting the optimal low-cardinal subset of students’ activities and significant prediction accuracy improvement in a blended learning environment.

Details

Information Discovery and Delivery, vol. 47 no. 2
Type: Research Article
ISSN: 2398-6247

Keywords

Article
Publication date: 8 June 2010

Pablo A.D. Castro and Fernando J. Von Zuben

The purpose of this paper is to apply a multi‐objective Bayesian artificial immune system (MOBAIS) to feature selection in classification problems aiming at minimizing both the…

Abstract

Purpose

The purpose of this paper is to apply a multi‐objective Bayesian artificial immune system (MOBAIS) to feature selection in classification problems aiming at minimizing both the classification error and cardinality of the subset of features. The algorithm is able to perform a multimodal search maintaining population diversity and controlling automatically the population size according to the problem. In addition, it is capable of identifying and preserving building blocks (partial components of the whole solution) effectively.

Design/methodology/approach

The algorithm evolves candidate subsets of features by replacing the traditional mutation operator in immune‐inspired algorithms with a probabilistic model which represents the probability distribution of the promising solutions found so far. Then, the probabilistic model is used to generate new individuals. A Bayesian network is adopted as the probabilistic model due to its capability of capturing expressive interactions among the variables of the problem. In order to evaluate the proposal, it was applied to ten datasets and the results compared with those generated by state‐of‐the‐art algorithms.

Findings

The experiments demonstrate the effectiveness of the multi‐objective approach to feature selection. The algorithm found parsimonious subsets of features and the classifiers produced a significant improvement in the accuracy. In addition, the maintenance of building blocks avoids the disruption of partial solutions, leading to a quick convergence.

Originality/value

The originality of this paper relies on the proposal of a novel algorithm to multi‐objective feature selection.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 3 no. 2
Type: Research Article
ISSN: 1756-378X

Keywords

Article
Publication date: 17 January 2022

Syed Haroon Abdul Gafoor and Padma Theagarajan

Conventional diagnostic techniques, on the other hand, may be prone to subjectivity since they depend on assessment of motions that are often subtle to individual eyes and hence…

126

Abstract

Purpose

Conventional diagnostic techniques, on the other hand, may be prone to subjectivity since they depend on assessment of motions that are often subtle to individual eyes and hence hard to classify, potentially resulting in misdiagnosis. Meanwhile, early nonmotor signs of Parkinson’s disease (PD) can be mild and may be due to variety of other conditions. As a result, these signs are usually ignored, making early PD diagnosis difficult. Machine learning approaches for PD classification and healthy controls or individuals with similar medical symptoms have been introduced to solve these problems and to enhance the diagnostic and assessment processes of PD (like, movement disorders or other Parkinsonian syndromes).

Design/methodology/approach

Medical observations and evaluation of medical symptoms, including characterization of a wide range of motor indications, are commonly used to diagnose PD. The quantity of the data being processed has grown in the last five years; feature selection has become a prerequisite before any classification. This study introduces a feature selection method based on the score-based artificial fish swarm algorithm (SAFSA) to overcome this issue.

Findings

This study adds to the accuracy of PD identification by reducing the amount of chosen vocal features while to use the most recent and largest publicly accessible database. Feature subset selection in PD detection techniques starts by eliminating features that are not relevant or redundant. According to a few objective functions, features subset chosen should provide the best performance.

Research limitations/implications

In many situations, this is an Nondeterministic Polynomial Time (NP-Hard) issue. This method enhances the PD detection rate by selecting the most essential features from the database. To begin, the data set's dimensionality is reduced using Singular Value Decomposition dimensionality technique. Next, Biogeography-Based Optimization (BBO) for feature selection; the weight value is a vital parameter for finding the best features in PD classification.

Originality/value

PD classification is done by using ensemble learning classification approaches such as hybrid classifier of fuzzy K-nearest neighbor, kernel support vector machines, fuzzy convolutional neural network and random forest. The suggested classifiers are trained using data from UCI ML repository, and their results are verified using leave-one-person-out cross validation. The measures employed to assess the classifier efficiency include accuracy, F-measure, Matthews correlation coefficient.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 15 no. 4
Type: Research Article
ISSN: 1756-378X

Keywords

Article
Publication date: 12 June 2020

Sandeepkumar Hegde and Monica R. Mundada

According to the World Health Organization, by 2025, the contribution of chronic disease is expected to rise by 73% compared to all deaths and it is considered as global burden of…

Abstract

Purpose

According to the World Health Organization, by 2025, the contribution of chronic disease is expected to rise by 73% compared to all deaths and it is considered as global burden of disease with a rate of 60%. These diseases persist for a longer duration of time, which are almost incurable and can only be controlled. Cardiovascular disease, chronic kidney disease (CKD) and diabetes mellitus are considered as three major chronic diseases that will increase the risk among the adults, as they get older. CKD is considered a major disease among all these chronic diseases, which will increase the risk among the adults as they get older. Overall 10% of the population of the world is affected by CKD and it is likely to double in the year 2030. The paper aims to propose novel feature selection approach in combination with the machine-learning algorithm which can early predict the chronic disease with utmost accuracy. Hence, a novel feature selection adaptive probabilistic divergence-based feature selection (APDFS) algorithm is proposed in combination with the hyper-parameterized logistic regression model (HLRM) for the early prediction of chronic disease.

Design/methodology/approach

A novel feature selection APDFS algorithm is proposed which explicitly handles the feature associated with the class label by relevance and redundancy analysis. The algorithm applies the statistical divergence-based information theory to identify the relationship between the distant features of the chronic disease data set. The data set required to experiment is obtained from several medical labs and hospitals in India. The HLRM is used as a machine-learning classifier. The predictive ability of the framework is compared with the various algorithm and also with the various chronic disease data set. The experimental result illustrates that the proposed framework is efficient and achieved competitive results compared to the existing work in most of the cases.

Findings

The performance of the proposed framework is validated by using the metric such as recall, precision, F1 measure and ROC. The predictive performance of the proposed framework is analyzed by passing the data set belongs to various chronic disease such as CKD, diabetes and heart disease. The diagnostic ability of the proposed approach is demonstrated by comparing its result with existing algorithms. The experimental figures illustrated that the proposed framework performed exceptionally well in prior prediction of CKD disease with an accuracy of 91.6.

Originality/value

The capability of the machine learning algorithms depends on feature selection (FS) algorithms in identifying the relevant traits from the data set, which impact the predictive result. It is considered as a process of choosing the relevant features from the data set by removing redundant and irrelevant features. Although there are many approaches that have been already proposed toward this objective, they are computationally complex because of the strategy of following a one-step scheme in selecting the features. In this paper, a novel feature selection APDFS algorithm is proposed which explicitly handles the feature associated with the class label by relevance and redundancy analysis. The proposed algorithm handles the process of feature selection in two separate indices. Hence, the computational complexity of the algorithm is reduced to O(nk+1). The algorithm applies the statistical divergence-based information theory to identify the relationship between the distant features of the chronic disease data set. The data set required to experiment is obtained from several medical labs and hospitals of karkala taluk ,India. The HLRM is used as a machine learning classifier. The predictive ability of the framework is compared with the various algorithm and also with the various chronic disease data set. The experimental result illustrates that the proposed framework is efficient and achieved competitive results are compared to the existing work in most of the cases.

Details

International Journal of Pervasive Computing and Communications, vol. 17 no. 1
Type: Research Article
ISSN: 1742-7371

Keywords

Article
Publication date: 4 May 2021

Nor Hamizah Miswan, Chee Seng Chan and Chong Guan Ng

This paper develops a robust hospital readmission prediction framework by combining the feature selection algorithm and machine learning (ML) classifiers. The improved feature

Abstract

Purpose

This paper develops a robust hospital readmission prediction framework by combining the feature selection algorithm and machine learning (ML) classifiers. The improved feature selection is proposed by considering the uncertainty in patient's attributes that leads to the output variable.

Design/methodology/approach

First, data preprocessing is conducted which includes how raw data is managed. Second, the impactful features are selected through feature selection process. It started with calculating the relational grade of each patient towards readmission using grey relational analysis (GRA) and the grade is used as the target values for feature selection. Then, the influenced features are selected using the Least Absolute Shrinkage and Selection Operator (LASSO) method. This proposed method is termed as Grey-LASSO feature selection. The final task is the readmission prediction using ML classifiers.

Findings

The proposed method offered good performances with a minimum feature subset up to 54–65% discarded features. Multi-Layer Perceptron with Grey-LASSO gave the best performance.

Research limitations/implications

The performance of Grey-LASSO is justified in two readmission datasets. Further research is required to examine the generalisability to other datasets.

Originality/value

In designing the feature selection algorithm, the selection on influenced input variables was based on the integration of GRA and LASSO. Specifically, GRA is a part of the grey system theory, which was employed to analyse the relation between systems under uncertain conditions. The LASSO approach was adopted due to its ability for sparse data representation.

Details

Grey Systems: Theory and Application, vol. 11 no. 4
Type: Research Article
ISSN: 2043-9377

Keywords

Book part
Publication date: 30 September 2020

B. G. Deepa and S. Senthil

Breast cancer (BC) is one of the leading cancer in the world, BC risk has been there for women of the middle age also, it is the malignant tumor. However, identifying BC in the…

Abstract

Breast cancer (BC) is one of the leading cancer in the world, BC risk has been there for women of the middle age also, it is the malignant tumor. However, identifying BC in the early stage will save most of the women’s life. As there is an advancement in the technology research used Machine Learning (ML) algorithm Random Forest for ranking the feature, Support Vector Machine (SVM), and Naïve Bayes (NB) supervised classifiers for selection of best optimized features and prediction of BC accuracy. The estimation of prediction accuracy has been done by using the dataset Wisconsin Breast Cancer Data from University of California Irvine (UCI) ML repository. To perform all these operation, Anaconda one of the open source distribution of Python has been used. The proposed work resulted in extemporize improvement in the NB and SVM classifier accuracy. The performance evaluation of the proposed model is estimated by using classification accuracy, confusion matrix, mean, standard deviation, variance, and root mean-squared error.

The experimental results shows that 70-30 data split will result in best accuracy. SVM acts as a feature optimizer of 12 best features with the result of 97.66% accuracy and improvement of 1.17% after feature reduction. NB results with feature optimizer 17 of best features with the result of 96.49% accuracy and improvement of 1.17% after feature reduction.

The study shows that proposal model works very effectively as compare to the existing models with respect to accuracy measures.

Details

Big Data Analytics and Intelligence: A Perspective for Health Care
Type: Book
ISBN: 978-1-83909-099-8

Keywords

Article
Publication date: 4 December 2017

Fuzan Chen, Harris Wu, Runliang Dou and Minqiang Li

The purpose of this paper is to build a compact and accurate classifier for high-dimensional classification.

Abstract

Purpose

The purpose of this paper is to build a compact and accurate classifier for high-dimensional classification.

Design/methodology/approach

A classification approach based on class-dependent feature subspace (CFS) is proposed. CFS is a class-dependent integration of a support vector machine (SVM) classifier and associated discriminative features. For each class, our genetic algorithm (GA)-based approach evolves the best subset of discriminative features and SVM classifier simultaneously. To guarantee convergence and efficiency, the authors customize the GA in terms of encoding strategy, fitness evaluation, and genetic operators.

Findings

Experimental studies demonstrated that the proposed CFS-based approach is superior to other state-of-the-art classification algorithms on UCI data sets in terms of both concise interpretation and predictive power for high-dimensional data.

Research limitations/implications

UCI data sets rather than real industrial data are used to evaluate the proposed approach. In addition, only single-label classification is addressed in the study.

Practical implications

The proposed method not only constructs an accurate classification model but also obtains a compact combination of discriminative features. It is helpful for business makers to get a concise understanding of the high-dimensional data.

Originality/value

The authors propose a compact and effective classification approach for high-dimensional data. Instead of the same feature subset for all the classes, the proposed CFS-based approach obtains the optimal subset of discriminative feature and SVM classifier for each class. The proposed approach enhances both interpretability and predictive power for high-dimensional data.

Details

Industrial Management & Data Systems, vol. 117 no. 10
Type: Research Article
ISSN: 0263-5577

Keywords

1 – 10 of over 6000