Search results

1 – 10 of over 1000
Article
Publication date: 23 August 2022

Kamlesh Kumar Pandey and Diwakar Shukla


Abstract

Purpose

The K-means (KM) clustering algorithm is highly sensitive to the selection of initial centroids, since the initial centroids determine computational effectiveness, efficiency and susceptibility to local optima. Numerous initialization strategies attempt to overcome these problems through random or deterministic selection of initial centroids. Random initialization suffers from local optima and, in the worst case, poor clustering performance, while deterministic initialization incurs a high computational cost. Big data clustering aims to reduce computation costs and improve cluster efficiency. The objective of this study is to obtain better initial centroids for big data clustering on business management data without using random or deterministic initialization, so as to avoid local optima and improve clustering efficiency and effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.

Design/methodology/approach

This study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm addresses the KM clustering problem using the probability density of each data point. The NDPDKM algorithm first identifies the data points with the most probable density by using the mean and standard deviation of the dataset through the normal probability density. Thereafter, NDPDKM determines the K initial centroids by using sorting and linear systematic sampling heuristics.
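The initialization steps described above (normal density scoring, sorting, linear systematic sampling) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the per-feature independence assumption (summing log-densities across features) and the choice of the sampling start offset are all hypothetical.

```python
import math
from statistics import mean, pstdev


def normal_log_pdf(x, mu, sigma):
    """Log of the normal probability density N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def density_sorted_init(points, k):
    """Pick k initial centroids by density scoring, sorting and systematic sampling.

    Assumption (not from the abstract): features are scored independently and
    their log-densities are summed to give one score per data point.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    dims = len(points[0])
    mus = [mean(p[d] for p in points) for d in range(dims)]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    sigmas = [pstdev([p[d] for p in points]) or 1.0 for d in range(dims)]
    scores = [
        sum(normal_log_pdf(p[d], mus[d], sigmas[d]) for d in range(dims))
        for p in points
    ]
    # Sort point indices from most to least probable density.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Linear systematic sampling: one pick from the middle of each of k strata.
    step = n // k
    start = step // 2
    return [points[order[start + j * step]] for j in range(k)]
```

The procedure uses no random numbers, so repeated calls on the same data return the same centroids; the returned points would then be passed to an ordinary K-means loop as its starting centroids.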

Findings

The performance of the proposed algorithm is compared with the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms using the Davies Bouldin score, Silhouette coefficient, SD Validity, S_Dbw Validity, number of iterations and CPU time as validation indices on eight real business datasets. The experimental evaluation demonstrates that the NDPDKM algorithm reduces iterations, local optima and computing costs, and improves cluster performance, effectiveness and efficiency with stable convergence compared to the other algorithms. The NDPDKM algorithm reduces the average computing time by up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and the average number of iterations by up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74%, relative to the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms, respectively.

Originality/value

The KM algorithm is the most widely used partitional clustering approach in data mining techniques that extract hidden knowledge, patterns and trends for decision-making strategies in business data. Business analytics is one of the applications of big data clustering where KM clustering is useful for the various subcategories of business analytics such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching and other decision-making strategies in business.

Book part
Publication date: 1 September 2021

Ronald Klimberg and Samuel Ratick


Abstract

In a previous chapter (Klimberg, Ratick, & Smith, 2018), we introduced a novel approach in which cluster centroids were used as input data for the predictor variables of a multiple linear regression (MLR) used to forecast fleet maintenance costs. We applied this approach to a real data set and significantly improved the predictive accuracy of the MLR model. In this chapter, we develop a methodology for adjusting moving average forecasts of the future values of fleet service occurrences by interpolating those forecast values using their relative distances from cluster centroids. We illustrate and evaluate the efficacy of this approach with our previously used data set on fleet maintenance.

Details

Advances in Business and Management Forecasting
Type: Book
ISBN: 978-1-83982-091-5


Article
Publication date: 9 May 2016

Chao-Lung Yang and Thi Phuong Quyen Nguyen


Abstract

Purpose

Class-based storage has been studied extensively and proved to be an efficient storage policy. However, few studies have addressed how to cluster stored items for class-based storage. The purpose of this paper is to develop a constrained clustering method integrated with principal component analysis (PCA) to meet the need for clustering stored items while taking practical storage constraints into account.

Design/methodology/approach

To account for item characteristics and the associated storage restrictions, must-link and cannot-link constraints were constructed to meet the storage requirements. The cube-per-order index (COI), which has been used for location assignment in class-based warehouses, was analyzed by PCA. The proposed constrained clustering method uses the principal component loadings as item sub-group features to identify the COI distribution of item sub-groups. The clustering results are then used to allocate storage through a heuristic assignment model based on COI.
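Two ingredients named above can be made concrete with a small sketch. It assumes the usual definition of COI as an item's required storage space divided by its order frequency per period; the function names and data layout are hypothetical, and the PCA and assignment heuristic from the paper are not reproduced here.

```python
def cube_per_order_index(storage_volume, orders_per_period):
    """COI under the usual definition: required storage space per order."""
    if orders_per_period <= 0:
        raise ValueError("orders_per_period must be positive")
    return storage_volume / orders_per_period


def violates_constraints(assignment, must_link, cannot_link):
    """Check a cluster assignment against pairwise constraints.

    assignment  -- dict mapping item id to cluster label
    must_link   -- pairs of items that have to share a cluster
    cannot_link -- pairs of items that have to be kept apart
    """
    for a, b in must_link:
        if assignment[a] != assignment[b]:
            return True
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:
            return True
    return False
```

Items with a low COI are conventionally placed closest to the input/output point, and a constrained clustering method would reject or repair any assignment for which `violates_constraints` returns `True`.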

Findings

The clustering results showed that the proposed method provided better compactness among item clusters. The simulation results also showed that the new location assignment produced by the proposed method improved retrieval efficiency by 33 percent.

Practical implications

When the number of items in a warehouse is very large, manually identifying storage constraints becomes impractical. The developed method can be readily applied to the problem regardless of the size of the data.

Originality/value

The case study demonstrated an example of a practical location assignment problem with constraints. This paper also sheds light on developing a data clustering method that can be applied directly to practical data analysis problems.

Details

Industrial Management & Data Systems, vol. 116 no. 4
Type: Research Article
ISSN: 0263-5577


Article
Publication date: 4 May 2012

Amine Jaafar, Bruno Sareni and Xavier Roboam


Abstract

Purpose

A wide range of applications require classifying or grouping data into a set of categories or clusters. The most popular clustering techniques for this purpose are K‐means clustering and hierarchical clustering. However, both of these methods require the number of clusters to be set a priori. The purpose of this paper is to present a clustering method based on a niching genetic algorithm to overcome this problem.

Design/methodology/approach

The proposed approach aims to find the best compromise between maximizing the inter‐cluster distance and minimizing the intra‐cluster distance by optimizing the silhouette index. It can investigate multiple cluster configurations in parallel without requiring any assumption about the number of clusters.
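The silhouette index used as the optimization objective can be computed as in the sketch below. It follows the standard definition (for each point, a is the mean distance to its own cluster, b is the lowest mean distance to any other cluster, and the score is (b - a) / max(a, b)); the function name and the use of Euclidean distance are assumptions, and the niching genetic algorithm itself is not shown.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

Because the score can be evaluated for any labelling, a search procedure can compare candidate partitions with different numbers of clusters on the same scale, which is what makes it usable as a fitness function.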

Findings

The effectiveness of the proposed approach is demonstrated on 2D benchmarks with non‐overlapping and overlapping clusters.

Originality/value

The proposed approach is also applied to the clustering analysis of railway driving profiles in the context of hybrid supply design. Such a method can help designers to identify different system configurations in compliance with the corresponding clusters: it may guide suppliers towards “market segmentation”, not only fulfilling economic constraints but also technical design objectives.

Details

COMPEL - The international journal for computation and mathematics in electrical and electronic engineering, vol. 31 no. 3
Type: Research Article
ISSN: 0332-1649


Article
Publication date: 14 March 2016

Gebeyehu Belay Gebremeskel, Chai Yi, Zhongshi He and Dawit Haile


Abstract

Purpose

Among the growing number of data mining (DM) techniques, outlier detection has gained importance in many applications and has attracted much attention in recent times. In the past, outlier detection research in safety care could be viewed as searching for needles in a haystack. However, outliers are not always erroneous. Therefore, the purpose of this paper is to investigate the role of outliers in healthcare services in general and in patient safety care in particular.

Design/methodology/approach

The approach is a combined DM technique (clustering and nearest neighbor) for outlier detection, which provides a clear understanding and meaningful insights for visualizing data behavior in healthcare safety. The outcomes, or the implicit knowledge, are essential to a proper clinical decision-making process. The method is semantically important, and its novel treatment of patients' events and situations is shown to play a significant role in patient care safety and medication.
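The nearest neighbor half of such a combined technique can be illustrated with a generic distance-based outlier score. This is a common textbook scheme, not the authors' method: the function name, the choice of k and the use of the mean distance to the k nearest neighbors are all assumptions made for the example.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    Larger scores indicate more isolated points, i.e. outlier candidates.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be at least 1 and smaller than the number of points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores
```

In a combined pipeline, scores like these could be computed within each cluster so that a record is judged against similar records rather than against the whole dataset.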

Findings

The paper discusses a novel, integrated methodology that can be applied to the analysis of different biological data. It combines DM techniques to optimize performance in the field of health and medical science. The integrated outlier detection method can be extended to search for valuable information and implicit knowledge based on selected patient factors. On this basis, outliers are detected as clusters and point events, and novel ideas are proposed to strengthen clinical services with regard to customer satisfaction. The work can also serve as a baseline for further strategic development and research in healthcare.

Research limitations/implications

This paper mainly focused on outlier detection. Outlier isolation, which is essential for investigating why outliers occur and for communicating how to mitigate them, was not addressed. Therefore, the research can be extended to the hierarchy of patient problems.

Originality/value

DM is a dynamic and successful gateway for discovering useful knowledge to enhance healthcare performance and patient safety. Outlier detection based on clinical data is a basic task in achieving healthcare strategy. Therefore, in this paper, the authors focused on combined DM techniques for a deep analysis of clinical data, which support an optimal level of clinical decision-making. Proper clinical decisions can be reached through attribute selection, which is important for identifying the influential factors or parameters of healthcare services. Using integrated clustering and nearest neighbor techniques therefore gives a more acceptable search for outliers in such complex data, which could be fundamental to further analysis of healthcare and patient safety situations.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 9 no. 1
Type: Research Article
ISSN: 1756-378X


Article
Publication date: 28 March 2008

Stefan Janson, Daniel Merkle and Martin Middendorf


Abstract

Purpose

The purpose of this paper is to present an approach for the decentralization of swarm intelligence algorithms that run on computing systems with autonomous components that are connected by a network. The approach is applied to a particle swarm optimization (PSO) algorithm with multiple sub‐swarms. PSO is a nature inspired metaheuristic where a swarm of particles searches for an optimum of a function. A multiple sub‐swarms PSO can be used for example in applications where more than one optimum has to be found.
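For readers unfamiliar with PSO, a minimal single-swarm, centralized version is sketched below. It is not the decentralized multi-sub-swarm algorithm of the paper; the function name, the parameter values (inertia `w`, acceleration coefficients `c1` and `c2`) and the position clamping are illustrative assumptions.

```python
import random


def pso_minimize(func, dim, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO: returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # best position seen by each particle
    pbest_val = [func(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp the coordinate to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = func(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

A call such as `pso_minimize(lambda x: sum(v * v for v in x), dim=2, bounds=(-5.0, 5.0))` searches for the minimum of the sphere function. A multi-sub-swarm variant would keep one swarm best per sub-swarm so that several optima can be tracked at once.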

Design/methodology/approach

In the studied scenario the particles of the PSO algorithm correspond to data packets that are sent through the network of the computing system. Each data packet contains among other information the position of the corresponding particle in the search space and its sub‐swarm number. In the proposed decentralized PSO algorithm the application specific tasks, i.e. the function evaluations, are done by the autonomous components of the system. The more general tasks, like the dynamic clustering of data packets, are done by the routers of the network.

Findings

Simulation experiments show that the decentralized PSO algorithm can successfully find a set of minimum values for the test functions used. They also show that the PSO algorithm works well for different types of networks, such as scale‐free and ring‐like networks.

Originality/value

The proposed decentralization approach is interesting for the design of optimization algorithms that can run on computing systems that use principles of self‐organization and have no central control.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 1 no. 1
Type: Research Article
ISSN: 1756-378X


Article
Publication date: 23 March 2021

Hendri Murfi


Abstract

Purpose

The aim of this research is to develop an eigenspace-based fuzzy c-means method for scalable topic detection.

Design/methodology/approach

The eigenspace-based fuzzy c-means (EFCM) method combines representation learning and clustering. The textual data are transformed into a lower-dimensional eigenspace using truncated singular value decomposition. Fuzzy c-means is performed on the eigenspace to identify the centroid of each cluster. The topics are obtained by transforming the centroids back into the nonnegative subspace of the original space. In this paper, we extend the EFCM method for scalability by using two approaches, i.e. single-pass and online. We call the developed topic detection methods oEFCM and spEFCM.
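The clustering step can be illustrated with the two standard fuzzy c-means update rules, applied to points that are assumed to be already projected into the lower-dimensional space. The sketch omits the truncated SVD and the single-pass and online extensions; the function names and the fuzzifier default `m=2.0` are assumptions.

```python
import math


def fcm_memberships(points, centroids, m=2.0):
    """Standard FCM membership update: one row of memberships per point."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [math.dist(x, c) for c in centroids]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centroids: split membership among them.
            zeros = [d == 0.0 for d in dists]
            row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
        else:
            row = [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]
        memberships.append(row)
    return memberships


def fcm_centroids(points, memberships, m=2.0):
    """Standard FCM centroid update: mean of the points weighted by membership**m."""
    n_clusters = len(memberships[0])
    dim = len(points[0])
    centroids = []
    for i in range(n_clusters):
        weights = [row[i] ** m for row in memberships]
        total = sum(weights)
        centroids.append(tuple(
            sum(wt * x[d] for wt, x in zip(weights, points)) / total
            for d in range(dim)
        ))
    return centroids
```

Alternating the two functions until the centroids stop moving gives the basic batch algorithm. Unlike K-means, every point keeps a graded membership in every cluster, and each row of memberships sums to one.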

Findings

Our simulation shows that both oEFCM and spEFCM methods provide faster running times than EFCM for data sets that do not fit in memory. However, there is a decrease in the average coherence score. For both data sets that fit and do not fit into memory, the oEFCM method provides a tradeoff between running time and coherence score, which is better than spEFCM.

Originality/value

This research produces a scalable topic detection method. Besides scalability, the developed method also provides a faster running time for data sets that fit in memory.

Details

Data Technologies and Applications, vol. 55 no. 4
Type: Research Article
ISSN: 2514-9288


Details

Self-Learning and Adaptive Algorithms for Business Applications
Type: Book
ISBN: 978-1-83867-174-7

Article
Publication date: 28 October 2014

Hongkang Lin


Abstract

Purpose

The clustering/classification method proposed in this study, designated as the PFV-index method, provides the means to solve the following problems for a data set characterized by imprecision and uncertainty: first, discretizing the continuous values of all the individual attributes within a data set; second, evaluating the optimality of the discretization results; third, determining the optimal number of clusters per attribute; and fourth, improving the classification accuracy (CA) of data sets characterized by uncertainty. The paper aims to discuss these issues.

Design/methodology/approach

The proposed PFV-index method for solving the clustering/classification problem combines a particle swarm optimization algorithm, the fuzzy C-means method, variable precision rough sets theory, and a new cluster validity index function.

Findings

The method clusters the values of the individual attributes within the data set and achieves both the optimal number of clusters and the optimal CA.

Originality/value

The validity of the proposed approach is investigated by comparing the classification results obtained for UCI data sets with those obtained by the supervised classification methods BPNN and decision trees.

Details

Engineering Computations, vol. 31 no. 8
Type: Research Article
ISSN: 0264-4401


Article
Publication date: 19 April 2013

Barileé B. Baridam and M. Montaz Ali


Abstract

Purpose

The K‐means clustering algorithm has been intensely researched owing to its simplicity of implementation and usefulness in the clustering task. However, its performance has also been criticized, in particular for requiring the value of K before the actual clustering task. It is evident from previous research that providing the number of clusters a priori does not in any way assist in the production of good quality clusters. The authors' investigations in this paper also confirm this finding. The purpose of this paper is to investigate further the usefulness of K‐means clustering for high and multi‐dimensional data by applying it to biological sequence data.

Design/methodology/approach

The authors suggest a scheme which maps the high dimensional data into low dimensions, then show that the K‐means algorithm with a pre‐processor produces good quality, compact and well‐separated clusters of the biological data mapped in low dimensions. For the purpose of clustering, a character‐to‐numeric conversion was conducted to transform the nucleic/amino acid symbols to numeric values.
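A character‐to‐numeric conversion of the kind described can be as simple as a lookup table. The abstract does not give the mapping the authors used, so the integer codes below, the constant name and the function name are purely hypothetical.

```python
# Hypothetical nucleotide codes; the abstract does not state the real mapping.
NUCLEOTIDE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode_sequence(sequence, codes=NUCLEOTIDE_CODES):
    """Convert a symbol sequence into a list of numeric values."""
    encoded = []
    for symbol in sequence.upper():
        if symbol not in codes:
            raise ValueError(f"unknown symbol: {symbol!r}")
        encoded.append(codes[symbol])
    return encoded
```

The resulting numeric vectors would still need to be mapped to a common low dimension before K‐means can compare sequences of different lengths; that step is not shown here.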

Findings

A preprocessing technique has been suggested.

Originality/value

Conceptually this is a new paper with new results.
