Search results

1 – 10 of over 7000
Open Access
Article
Publication date: 10 August 2022

Jie Ma, Zhiyuan Hao and Mo Hu

Abstract

Purpose

The density peak clustering algorithm (DP) is proposed to identify cluster centers by two parameters, i.e. ρ value (local density) and δ value (the distance between a point and another point with a higher ρ value). According to the center-identifying principle of the DP, the potential cluster centers should have a higher ρ value and a higher δ value than other points. However, this principle may limit the DP from identifying some categories with multi-centers or the centers in lower-density regions. In addition, the improper assignment strategy of the DP could cause a wrong assignment result for the non-center points. This paper aims to address the aforementioned issues and improve the clustering performance of the DP.
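
The two quantities named above can be sketched as follows. This is a minimal illustration of the baseline DP scores only, not of TMsDP. It assumes a simple cutoff kernel for ρ (count of neighbours closer than a cutoff distance) and the usual convention that the highest-density point takes its largest distance as δ. The function name `density_peak_scores` and the parameter `d_c` are illustrative names, not taken from the paper.

```python
import math

def density_peak_scores(points, d_c):
    """Return (rho, delta) for each point, assuming a cutoff kernel."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho: number of other points closer than the cutoff distance d_c
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # distances to all points with strictly higher local density
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # a point with no denser neighbour takes its largest distance instead
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Toy data (hypothetical): a dense group, a smaller group and an isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
rho, delta = density_peak_scores(points, d_c=0.5)
```

Points with both a large ρ and a large δ are the candidate centers. The isolated point has a fairly large δ but ρ = 0, which is why both values are considered together.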

Design/methodology/approach

First, to identify as many potential cluster centers as possible, the authors construct a point-domain by introducing the pinhole imaging strategy to extend the searching range of the potential cluster centers. Second, they design different novel calculation methods for calculating the domain distance, point-domain density and domain similarity. Third, they adopt domain similarity to achieve the domain merging process and optimize the final clustering results.

Findings

The experimental results on analyzing 12 synthetic data sets and 12 real-world data sets show that two-stage density peak clustering based on multi-strategy optimization (TMsDP) outperforms the DP and other state-of-the-art algorithms.

Originality/value

The authors propose a novel DP-based clustering method, i.e. TMsDP, and transform the relationship between points into that between domains to ultimately further optimize the clustering performance of the DP.

Details

Data Technologies and Applications, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 18 June 2021

Shuai Luo, Hongwei Liu and Ershi Qi

Abstract

Purpose

The purpose of this paper is to recognize and label the faults in wind turbines with a new density-based clustering algorithm, named contour density scanning clustering (CDSC) algorithm.

Design/methodology/approach

The algorithm includes four components: (1) computation of neighborhood density, (2) selection of core and noise data, (3) scanning core data and (4) updating clusters. The proposed algorithm considers the relationship between neighborhood data points according to a contour density scanning strategy.

Findings

The first experiment is conducted with artificial data to validate that the proposed CDSC algorithm is suitable for handling data points with arbitrary shapes. The second experiment, with industrial gearbox vibration data, is carried out to demonstrate the time complexity and accuracy of the proposed CDSC algorithm in comparison with other conventional clustering algorithms, including k-means, density-based spatial clustering of applications with noise, density peak clustering, neighborhood grid clustering, support vector clustering, random forest, core fusion-based density peak clustering, AdaBoost and extreme gradient boosting. The third experiment is conducted with an industrial bearing vibration data set to highlight that the CDSC algorithm can automatically track the emerging fault patterns of bearings in wind turbines over time.

Originality/value

Data points with different densities are clustered using three strategies: direct density reachability, density reachability and density connectivity. A contour density scanning strategy is proposed to determine whether the data points with the same density belong to one cluster. The proposed CDSC algorithm achieves automatic clustering, which means that the trends of the fault pattern can be tracked.

Details

Data Technologies and Applications, vol. 55 no. 5
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 22 February 2024

Yumeng Feng, Weisong Mu, Yue Li, Tianqi Liu and Jianying Feng

Abstract

Purpose

For a better understanding of the preferences and differences of young consumers in emerging wine markets, this study aims to propose a clustering method to segment the super-new generation wine consumers based on their sensitivity to wine brand, origin and price and then conduct user profiles for segmented consumer groups from the perspectives of demographic attributes, eating habits and wine sensory attribute preferences.

Design/methodology/approach

We first proposed a consumer clustering perspective based on sensitivity to wine brand, origin and price, and then developed an adaptive density peak and label propagation layer-by-layer (ADPLP) clustering algorithm to segment consumers. The algorithm addresses two issues of the traditional DPC (DPeak clustering algorithm): wrong selection of centers and inaccurate classification of the remaining sample points. Then, we built a consumer profile system from the perspectives of demographic attributes, eating habits and wine sensory attribute preferences for segmented consumer groups.

Findings

In this study, 10 typical public datasets and 6 basic test algorithms are used to evaluate the proposed method, and the results showed that the ADPLP algorithm was optimal or suboptimal on 10 datasets with accuracy above 0.78. The average improvement in accuracy over the base DPC algorithm is 0.184. As an outcome of the wine consumer profiles, sensitive consumers prefer wines with medium prices of 100–400 CNY and more personalized brands and origins, while casual consumers are fond of popular brands, popular origins and low prices within 50 CNY. The wine sensory attributes preferred by super-new generation consumers are red, semi-dry, semi-sweet, still, fresh tasting, fruity, floral and low acid.

Practical implications

Young Chinese consumers are the main driver of wine consumption in the future. This paper provides a tool for decision-makers and marketers to identify the preferences of young consumers quickly, which is meaningful and helpful for wine marketing.

Originality/value

In this study, the ADPLP algorithm was introduced for the first time. Subsequently, the user profile label system was constructed for segmented consumers to highlight their characteristics and demand partiality from three aspects: demographic characteristics, consumers' eating habits and consumers' preferences for wine attributes. Moreover, the ADPLP algorithm can be considered for user profiles on other alcoholic products.

Details

Kybernetes, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 23 August 2022

Kamlesh Kumar Pandey and Diwakar Shukla

Abstract

Purpose

The K-means (KM) clustering algorithm is extremely sensitive to the selection of initial centroids, since the initial centroids of clusters determine computational effectiveness, efficiency and local optima issues. Numerous initialization strategies aim to overcome these problems through the random or deterministic selection of initial centroids. The random initialization strategy suffers from local optimization issues with the worst clustering performance, while the deterministic initialization strategy incurs high computational cost. Big data clustering aims to reduce computation costs and improve cluster efficiency. The objective of this study is to achieve a better initial centroid for big data clustering on business management data, without using random or deterministic initialization, that avoids local optima and improves clustering efficiency and effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.

Design/methodology/approach

This study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm resolves the KM clustering problem by probability density of each data point. The NDPDKM algorithm first identifies the most probable density data points by using the mean and standard deviation of the datasets through normal probability density. Thereafter, the NDPDKM determines K initial centroid by using sorting and linear systematic sampling heuristics.
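
The initialization idea described above might look roughly like the sketch below. It is not the authors' NDPDKM implementation. It assumes that each point is scored by the product of per-feature normal densities (features treated as independent), that points are sorted by this score, and that K centroids are taken at evenly spaced positions in the sorted order. The function name `ndpd_style_initial_centroids` is hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def ndpd_style_initial_centroids(data, k):
    """Pick k initial centroids by density score, sorting and systematic sampling."""
    columns = list(zip(*data))
    # one normal distribution per feature, from its mean and standard deviation
    # (every feature must vary, otherwise sigma is 0 and pdf() is undefined)
    dists = [NormalDist(mean(col), pstdev(col)) for col in columns]

    def score(row):
        # assumed scoring rule: product of per-feature normal densities
        s = 1.0
        for x, nd in zip(row, dists):
            s *= nd.pdf(x)
        return s

    ranked = sorted(data, key=score)
    step = len(ranked) // k
    # linear systematic sampling: the middle element of each of k equal strides
    return [ranked[i * step + step // 2] for i in range(k)]

# Toy one-dimensional data (hypothetical)
data = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
centroids = ndpd_style_initial_centroids(data, k=3)
```

Because no random draw is involved, repeated runs on the same data return the same starting centroids, which is the property the abstract contrasts with random initialization.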

Findings

The performance of the proposed algorithm is compared with KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms through Davies Bouldin score, Silhouette coefficient, SD Validity, S_Dbw Validity, Number of Iterations and CPU time validation indices on eight real business datasets. The experimental evaluation demonstrates that the NDPDKM algorithm reduces iterations, local optima, computing costs, and improves cluster performance, effectiveness, efficiency with stable convergence as compared to other algorithms. The NDPDKM algorithm minimizes the average computing time up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and reduces the average iterations up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74% with reference to KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms.

Originality/value

The KM algorithm is the most widely used partitional clustering approach in data mining techniques that extract hidden knowledge, patterns and trends for decision-making strategies in business data. Business analytics is one of the applications of big data clustering where KM clustering is useful for the various subcategories of business analytics such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching and other decision-making strategies in business.

Open Access
Article
Publication date: 5 September 2016

Qingyuan Wu, Changchen Zhan, Fu Lee Wang, Siyang Wang and Zeping Tang

Abstract

Purpose

The quick growth of web-based and mobile e-learning applications such as massive open online courses has created a large volume of online learning resources. Confronting such a large amount of learning data, it is important to develop effective clustering approaches for user group modeling and intelligent tutoring. The paper aims to discuss these issues.

Design/methodology/approach

In this paper, a minimum spanning tree based approach is proposed for clustering of online learning resources. The novel clustering approach has two main stages, namely, an elimination stage and a construction stage. During the elimination stage, the Euclidean distance is adopted as the metric to measure the density of learning resources. Resources with very low densities are identified as outliers and therefore removed. During the construction stage, a minimum spanning tree is built by initializing the centroids according to the degree of freedom of the resources. Online learning resources are subsequently partitioned into clusters by exploiting the structure of the minimum spanning tree.
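
To illustrate how the structure of a minimum spanning tree can yield a partition, here is a generic sketch. It is not the paper's method: the outlier elimination stage and the centroid initialization by degree of freedom are omitted. It assumes the common heuristic of building the tree with Prim's algorithm and then dropping the k − 1 longest edges. The function name `mst_clusters` is illustrative.

```python
import math

def mst_clusters(points, k):
    """Cluster points by cutting the k-1 longest edges of the Euclidean MST."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    # Prim's algorithm on the complete Euclidean graph
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    # keep the n-k shortest tree edges, i.e. drop the k-1 longest ones
    edges.sort()
    kept = edges[: n - k]
    # union-find over the kept edges gives the connected components
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for _, a, b in kept:
        root[find(a)] = find(b)
    return [find(i) for i in range(n)]

# Toy data (hypothetical): two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = mst_clusters(points, k=2)
```

The returned labels are arbitrary component identifiers; only equality between labels is meaningful.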

Findings

Conventional clustering algorithms have a number of shortcomings, so they cannot handle online learning resources effectively. On the one hand, extant partitional clustering methods use a randomly assigned centroid for each cluster, which usually causes ineffective clustering results. On the other hand, classical density-based clustering methods are very computationally expensive and time-consuming. Experimental results indicate that the proposed algorithm outperforms the traditional clustering algorithms for online learning resources.

Originality/value

The effectiveness of the proposed algorithms has been validated by using several data sets. Moreover, the proposed clustering algorithm has great potential in e-learning applications. It has been demonstrated how the novel technique can be integrated in various e-learning systems. For example, the clustering technique can classify learners into groups so that homogeneous grouping can improve the effectiveness of learning. Moreover, clustering of online learning resources is valuable to decision making in terms of tutorial strategies and instructional design for intelligent tutoring. Lastly, a number of directions for future research have been identified in the study.

Details

Asian Association of Open Universities Journal, vol. 11 no. 2
Type: Research Article
ISSN: 1858-3431

Article
Publication date: 1 November 2021

Jingwei Guo, Ji Zhang, Yongxiang Zhang, Peijuan Xu, Lutian Li, Zhongqi Xie and Qinglin Li

Abstract

Purpose

Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used density-based clustering algorithm, while it cannot be directly applied to the railway investment risk assessment. To overcome the shortcomings of calculation method and parameter limits of DBSCAN, this paper proposes a new algorithm called Improved Multiple Density-based Spatial clustering of Applications with Noise (IM-DBSCAN) based on the DBSCAN and rough set theory.
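
For reference, the baseline DBSCAN procedure whose parameters the paper seeks to tune can be sketched as below. This is the standard algorithm in its simplest quadratic-time form, not IM-DBSCAN or AP-DBSCAN. The parameter names `eps` and `min_pts` follow common usage and are not taken from the paper.

```python
import math

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one label per point, -1 marks noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nb)
    return labels

# Toy data (hypothetical): two groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The result depends heavily on `eps` and `min_pts`, which is the parameter sensitivity that the abstract refers to.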

Design/methodology/approach

First, the authors develop an improved affinity propagation (AP) algorithm, which is then combined with the DBSCAN (hereinafter referred to as AP-DBSCAN for short) to improve the parameter setting and efficiency of the DBSCAN. Second, the IM-DBSCAN algorithm, which consists of the AP-DBSCAN and a modified rough set, is designed to investigate the railway investment risk. Finally, the IM-DBSCAN algorithm is tested on the China–Laos railway's investment risk assessment, and its performance is compared with other related algorithms.

Findings

The IM-DBSCAN algorithm is implemented on the China–Laos railway's investment risk assessment and compared with other related algorithms. The clustering results validate that the AP-DBSCAN algorithm is feasible and efficient in terms of clustering accuracy and operating time. In addition, the experimental results also indicate that the IM-DBSCAN algorithm can be used as an effective method for the prospective risk assessment in railway investment.

Originality/value

This study proposes the IM-DBSCAN algorithm, which consists of the AP-DBSCAN and a modified rough set, to study the railway investment risk. Unlike existing clustering algorithms, AP-DBSCAN puts forward a density calculation method to simplify the process of optimizing DBSCAN parameters. Instead of using the Euclidean distance approach, the cutoff distance method is introduced to improve the similarity measure for optimizing the parameters. The developed AP-DBSCAN is used to classify the China–Laos railway's investment risk indicators more accurately. Combined with a modified rough set, the IM-DBSCAN algorithm is proposed to analyze the railway investment risk assessment. The contributions of this study can be summarized as follows: (1) Based on AP and DBSCAN, an integrated methodology, AP-DBSCAN, which improves the parameter setting and efficiency, is proposed to classify railway risk indicators. (2) As AP-DBSCAN is a risk classification model rather than a risk calculation model, an IM-DBSCAN algorithm that consists of the AP-DBSCAN and a modified rough set is proposed to assess the railway investment risk. (3) Taking the China–Laos railway as a real-life case study, the effectiveness and superiority of the proposed IM-DBSCAN algorithm are verified through a set of experiments compared with other state-of-the-art algorithms.

Details

Data Technologies and Applications, vol. 56 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 2 November 2015

Desh Deepak Sharma and S.N. Singh

Abstract

Purpose

This paper aims to detect abnormal energy uses which relate to undetected consumption, thefts, measurement errors, etc. The detection of irregular power consumption, with variation in irregularities, helps the electric utilities in planning and making strategies to transfer reliable and efficient electricity from generators to the end-users. Abnormal peak load demand is a kind of aberration that needs to be detected.

Design/methodology/approach

This paper proposes a Density-Based Micro Spatial Clustering of Applications with Noise (DBMSCAN) clustering algorithm, which is implemented for identification of ranked irregular electricity consumption and occurrence of peak and valley loads. In the proposed algorithm, two parameters, α and β, are introduced, and, on tuning of these parameters, after setting of global parameters, a varied number of micro-clusters and ranked irregular consumptions, respectively, are obtained. An approach is incorporated with the introduction of a new term Irregularity Variance in the suggested algorithm to find variation in the irregular consumptions according to anomalous behaviors.

Findings

No set of global parameters in DBSCAN is found for clustering the load pattern data of a practical system. The proposed DBMSCAN approach finds clustering results and ranked irregular consumption such as different types of abnormal peak demands, sudden change in the demand, nearly zero demand, etc. with computational ease without any iterative control method.

Originality/value

The DBMSCAN can be applied on any data set to find ranked outliers. It is an unsupervised approach of clustering technique to find the clustering results and ranked irregular consumptions while focusing on the analysis of and variations in anomalous behaviors in electricity consumption.

Details

International Journal of Energy Sector Management, vol. 9 no. 4
Type: Research Article
ISSN: 1750-6220

Article
Publication date: 3 November 2022

Reza Edris Abadi, Mohammad Javad Ershadi and Seyed Taghi Akhavan Niaki

Abstract

Purpose

The overall goal of the data mining process is to extract information from an extensive data set and make it understandable for further use. When working with large volumes of unstructured data in research information systems, it is necessary to divide the information into logical groupings after examining their quality before attempting to analyze it. On the other hand, data quality results are valuable resources for defining quality excellence programs of any information system. Hence, the purpose of this study is to discover and extract knowledge to evaluate and improve data quality in research information systems.

Design/methodology/approach

Clustering in data analysis and exploiting the outputs allows practitioners to gain an in-depth and extensive look at their information to form some logical structures based on what they have found. In this study, data extracted from an information system are used in the first stage. Then, the data quality results are classified into an organized structure based on data quality dimension standards. Next, clustering algorithms (K-Means), density-based clustering (density-based spatial clustering of applications with noise [DBSCAN]) and hierarchical clustering (balanced iterative reducing and clustering using hierarchies [BIRCH]) are applied to compare and find the most appropriate clustering algorithms in the research information system.

Findings

This paper showed that quality control results of an information system could be categorized through well-known data quality dimensions, including precision, accuracy, completeness, consistency, reputation and timeliness. Furthermore, among different well-known clustering approaches, the BIRCH algorithm of hierarchical clustering methods performs better in data clustering and gives the highest silhouette coefficient value. Next in line is the DBSCAN method, which performs better than the K-Means method.
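
The silhouette coefficient used above to rank the algorithms can be computed as in the following sketch. It uses the standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's score is (b − a) / max(a, b), with 0 for singleton clusters. Euclidean distance and at least two clusters are assumed. The function name `silhouette_score` is an illustrative choice.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: score is 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (hypothetical): the same points under a good and a poor labelling
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])
poor = silhouette_score(points, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.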

Research limitations/implications

In the data quality assessment process, the discrepancies identified and the lack of proper classification for inconsistent data have led to unstructured reports, which make the statistical analysis of qualitative metadata problems difficult and thus make it impossible to root out the observed errors. Therefore, in this study, the evaluation results of data quality have been categorized into various data quality dimensions, based on which multiple analyses have been performed in the form of data mining methods.

Originality/value

Although several pieces of research have been conducted to assess data quality results of research information systems, knowledge extraction from obtained data quality scores is a crucial work that has rarely been studied in the literature. Besides, clustering in data quality analysis and exploiting the outputs allows practitioners to gain an in-depth and extensive look at their information to form some logical structures based on what they have found.

Details

Information Discovery and Delivery, vol. 51 no. 4
Type: Research Article
ISSN: 2398-6247

Article
Publication date: 8 May 2023

Saad Ahmed Al-Saad, Rana N. Jawarneh and Areej Shabib Aloudat

Abstract

Purpose

To test the applicability of the user-generated content (UGC) derived from social travel network sites for online reputation management, the purpose of this study is to analyze the spatial clustering of the reputable hotels (based on the TripAdvisor Best-Value indicator) and reputable outdoor seating restaurants (based on ranking indicator).

Design/methodology/approach

This study used data mining techniques to obtain the UGC from TripAdvisor. The Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm was used for robust cluster analysis.

Findings

The findings of this study revealed that best value (BV) hotels and reputable outdoor seating restaurants are most likely to be located in and around the central districts of the urban tourist destinations where population and economic activities are denser. BV hotels' spatiotemporal cluster analysis formed clusters of different sizes, densities and shape patterns.

Research limitations/implications

This study showed that reputable hotels and restaurants (H&Rs) are concentrated within districts near historic city centers. This should be an impetus for applied research on urban investment environments.

Practical implications

The findings would be rational guidance for entrepreneurs and potential investors on the most attractive tourism investment environments.

Originality/value

There has been a lack of studies focusing on analyzing the spatial clustering of the H&Rs using UGC. Therefore, to the best of the authors’ knowledge, this study is the first to map and analyze the spatiotemporal clustering patterns of reputable hotels (TripAdvisor BV indicator) and restaurants (ranking indicator). As such, this study makes a significant methodological contribution to urban tourism research by showing pattern change in H&Rs clustering using data mining and the HDBSCAN algorithm.

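Research purpose

To test the applicability of user-generated content (UGC) from social travel network sites (STNS) to online reputation management (ORM), this study analyzes the spatial clustering of reputable hotels (based on the TripAdvisor Best-Value indicator) and reputable outdoor seating (ODS) restaurants (based on the ranking indicator).

Research design/methodology/approach

The study used data mining techniques to obtain UGC from TripAdvisor. The hierarchical density-based spatial clustering method based on the HDBSCAN algorithm was used for robust cluster analysis.

Research findings

The findings show that best value (BV) hotels and reputable ODS restaurants are most likely to be located in and around the central districts of urban tourist destinations, where population and economic activity are denser. The spatiotemporal cluster analysis of BV hotels formed clusters of different sizes, densities and shape patterns.

Research originality

The existing literature still lacks studies that analyze the spatial clustering of hotels and restaurants (H&Rs) using UGC. This study is therefore the first to map and analyze the spatiotemporal clustering patterns of reputable hotels (TripAdvisor BV indicator) and restaurants (ranking indicator). As such, it makes an important methodological contribution to urban tourism research by using data mining and the HDBSCAN algorithm to show pattern change in H&Rs clustering.

Theoretical implications

This study shows that reputable H&Rs are concentrated in districts near historic city centers. This should be an impetus for applied research on urban investment environments.

Practical implications

The findings will provide rational guidance for entrepreneurs and potential investors on the most attractive tourism investment environments.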

Article
Publication date: 30 April 2021

Faruk Bulut, Melike Bektaş and Abdullah Yavuz

Abstract

Purpose

In this study, supervision and control of the possible problems among people over a large area with a limited number of drone cameras and security staff is established.

Design/methodology/approach

These drones, namely unmanned aerial vehicles (UAVs) will be adaptively and automatically distributed over the crowds to control and track the communities by the proposed system. Since crowds are mobile, the design of the drone clusters will be simultaneously re-organized according to densities and distributions of people. An adaptive and dynamic distribution and routing mechanism of UAV fleets for crowds is implemented to control a specific given region. The nine popular clustering algorithms have been used and tested in the presented mechanism to gain better performance.

Findings

Compared with any single clustering method, the aggregated model achieved better clustering performance over five different test cases of human crowd distributions.

Originality/value

This study has three basic components. The first one is to divide the human crowds into clusters. The second one is to determine an optimum route of UAVs over clusters. The last one is to direct the most appropriate security personnel to the events that occurred.

Details

International Journal of Intelligent Unmanned Systems, vol. 12 no. 1
Type: Research Article
ISSN: 2049-6427
