TMsDP: two-stage density peak clustering based on multi-strategy optimization

Purpose – The density peak clustering algorithm (DP) is proposed to identify cluster centers by two parameters, i.e. ρ value (local density) and δ value (the distance between a point and another point with a higher ρ value). According to the center-identifying principle of the DP, the potential cluster centers should have a higher ρ value and a higher δ value than other points. However, this principle may limit the DP from identifying some categories with multi-centers or the centers in lower-density regions. In addition, the improper assignment strategy of the DP could cause a wrong assignment result for the non-center points. This paper aims to address the aforementioned issues and improve the clustering performance of the DP.
Design/methodology/approach – First, to identify as many potential cluster centers as possible, the authors construct a point-domain by introducing the pinhole imaging strategy to extend the searching range of the potential cluster centers. Second, they design different novel calculation methods for calculating the domain distance, point-domain density and domain similarity. Third, they adopt domain similarity to achieve the domain merging process and optimize the final clustering results.
Findings – The experimental results on analyzing 12 synthetic data sets and 12 real-world data sets show that two-stage density peak clustering based on multi-strategy optimization (TMsDP) outperforms the DP and other state-of-the-art algorithms.
Originality/value – The authors propose a novel DP-based clustering method, i.e. TMsDP, and transform the relationship between points into that between domains to ultimately further optimize the clustering performance of the DP.


Introduction
As a powerful machine learning method in the data mining field, the clustering strategy has a broad research prospect in effectively identifying the internal structure of data samples, such as mining spatiotemporal co-location events in trajectory data sets (Ansari et al., 2021), conducting customer segmentation and detecting CT scan images (Singh and Bose, 2021). In addition, as an important branch of clustering algorithms, density-based clustering has attracted the attention of a large number of researchers. Various density-based clustering methods have been proposed and widely utilized in different fields to date, such as fault recognition in wind turbines with a density-based clustering algorithm (Luo et al., 2021) and risk assessment on railway investment with an improved density-based approach (Guo et al., 2021). In 2014, the density peak clustering algorithm (DP) was proposed in Science (Rodriguez and Laio, 2014). Since then, the DP has been studied and applied by a large number of investigators in various fields, such as text clustering (Jo, 2020), medical analysis (Medeghri and Sabeur, 2021) and image recognition (He et al., 2021). Specifically, the original DP has three significant parameters, i.e. the d_c value (cutoff distance), the ρ value (local density) and the δ value (the distance between a point and another point with a higher ρ value), and one important principle, i.e. the cluster centers should have a higher ρ value and a higher δ value than other points (Abbas et al., 2021; Wang et al., 2021). Although the DP has better clustering performance than other traditional density-based clustering algorithms, it still has a critical limitation, i.e. a higher ρ value and a higher δ value do not always accurately reflect whether a point is a cluster center.
To give a concrete example, two different situations are discussed in this paper; Figures 1 and 2 show situation 1 and situation 2, respectively. For situation 1, Figure 1(a) clearly shows that the data set flame should have two different categories, and the two potential cluster centers both have a higher ρ value and a higher δ value than other points. Indeed, Figure 1(b) shows that the DP obtains a clustering result which is close to the natural categories. Taken together, the results in Figure 1 seem to demonstrate that the aforementioned principle about the ρ value, the δ value and the cluster centers is reasonable. However, situation 2 illustrates that the principle does not always hold. As shown in Figure 2, the DP obtains only inferior clustering results when analyzing the data sets D1 and compound, which is not consistent with the principle mentioned above.
Obviously, the DP could identify only two potential cluster centers for data set D1 (which has three different natural categories), while the six clusters it identifies for the data set compound (which has six different natural categories) are wrong. The difference between situation 1 and situation 2 reflects the following deficiencies of the DP: (1) the DP could not detect the accurate density peak points when analyzing data samples with multi-density or variable density; (2) the DP struggles to accurately identify data samples with more than one cluster center and (3) the drawback of the original density calculation method and the improper assignment of the non-central points ultimately affect the overall clustering performance.
Figure 1. The selection of cluster center points (they are the data points in the oblong) and the clustering result of flame
To address the aforementioned issues, the authors develop an enhanced DP-based clustering method, i.e. two-stage density peak clustering based on multi-strategy optimization (TMsDP), to further optimize the clustering performance of the DP. The main contributions and innovations of the TMsDP are as follows:
(1) Point-domain is constructed by introducing the pinhole imaging strategy to confirm the search scope of potential centers. The point-domain improves the clustering efficiency by transforming the relationship between points into that between domains.
(2) Point-domain density is determined to measure the distribution of points in a point-domain, while the domain distance is calculated by introducing the Hausdorff distance to improve the clustering accuracy.
(3) Domain similarity is proposed to achieve the domain merging process. In a data space, the higher the domain similarity between two point-domains, the more likely they are to be merged with each other.

The details of TMsDP are discussed in this study. Specifically, Section 2 presents a brief introduction of the DP, Section 3 describes the specific technical details of the proposed TMsDP, Section 4 analyzes the experimental results with different data sets to verify the clustering performance of the TMsDP and Section 5 summarizes this study by discussing the results and future areas for potential investigations.

Density peak clustering
2.1 Preparation
In the original DP (Rodriguez and Laio, 2014), d_c is a manually set parameter, which denotes the appropriate position in an ascending distance sequence, and the definition processes are shown as follows (assuming Sample = {s_1, s_2, s_3, …, s_n}):
where N indicates the manually input value and dis(s_i, s_j) represents the distance between the point s_i and the point s_j. Rodriguez and Laio (2014) define ρ_i as the number of points in a circle with the point s_i as the center and the d_c value as the radius, and the process is shown as follows:
where the function χ(o) is equal to 1 or 0. If the variable o is greater than 0, χ(o) is equal to 0. Otherwise, χ(o) is equal to 1. In addition, the calculation process of the δ value is shown as follows:
2.2 Related work
Based on the aforementioned contents, it is clear that the δ value and the ρ value are limited by the threshold parameter, i.e. the d_c value, and utilizing different d_c values could even provide completely different clustering results when analyzing the same data set (Hou et al., 2020; Lu et al., 2020; Jangra and Toshniwal, 2020; Flores and Garza, 2020; Zhu et al., 2020). To address the threshold parameter selection issue, Xu et al. (2020) proposed a robust DP with density-sensitive similarity to find accurate cluster centers automatically and reduce the effect of the d_c value selection on clustering results. D'Errico et al. (2021) provided a feasible approach for solving the classification problem of data with different shapes and distributions in order to avoid the drawback of the d_c value. Ding et al. (2018) developed an automatic DP based on a generalized extreme value distribution. At the same time, the assignment strategy of non-cluster center points often affects the final clustering results. To address the assignment issues, Jiang et al. (2019) introduced logistic distribution theory and K-nearest neighbor (kNN) theory into DP. Xu et al. (2021) designed a novel sparse search strategy to measure the similarity between the nearest neighbors of each point. Yu et al. (2021) proposed a three-way density peak clustering method based on evidence theory. Seyedi et al. (2019) utilized a graph-based label propagation to assign labels to remaining points and proposed the dynamic graph-based label propagation for density peak clustering. Apart from the d_c value selection issue and the non-center point assignment issue, it is challenging to identify the potential centers in low-density regions and to analyze data with varying density distributions using the DP. For solving these issues, Yan et al. [...]
To optimize the performance of the original DP, the authors delineate a novel DP-based clustering method in this paper. In the novel method, they propose four main strategies, i.e. point-domain, point-domain density, domain distance and domain similarity. The framework of TMsDP is shown in Figure 3.
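As a recap of Section 2.1, the ρ and δ values can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it assumes Euclidean distance, treats d_c as already chosen, and for the point with the highest ρ sets δ to its largest distance to any other point, following the convention of Rodriguez and Laio (2014). The function name `dp_rho_delta` is hypothetical.

```python
import numpy as np


def dp_rho_delta(points, d_c):
    """Return (rho, delta) for each point using the cutoff kernel.

    Assumptions (not taken from the paper's lost equations):
    Euclidean distance, and delta of the densest point(s) set to the
    largest distance from that point.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Pairwise distances dis(s_i, s_j).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # chi(o) = 1 unless o > 0, so a neighbor counts when dis - d_c <= 0.
    # Subtract 1 so that a point does not count itself.
    rho = (dist <= d_c).sum(axis=1) - 1

    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        if higher.any():
            # Distance to the nearest point with a higher rho.
            delta[i] = dist[i, higher].min()
        else:
            # No denser point exists: use the largest distance.
            delta[i] = dist[i].max()
    return rho, delta


rho, delta = dp_rho_delta([[0, 0], [0, 1], [1, 0], [10, 10]], d_c=1.5)
print(rho, delta)
```

Points that combine a large `rho` with a large `delta` are the candidates that the DP principle would select as cluster centers.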
3. The proposed clustering method
3.1 Point-domain strategy based on pinhole imaging theory
In order to explore the potential cluster centers in low-density regions, the proposed TMsDP constructs the point-domain by introducing the pinhole imaging theory. Pinhole imaging is a physics phenomenon where light from a source passes through a pinhole and forms an inverted image on a screen (Long et al., 2021). Inspired by the related literature (Long et al., 2021; Lu et al., 2018), this paper introduces the pinhole imaging theory into the search strategy of potential cluster centers, which can help the TMsDP to expand the range of center exploration. Assume that the point S_i(x_{s_i}, y_{s_i}) is a potential cluster center in Sample and other potential cluster centers, like the point S_k(x_{s_k}, y_{s_k}) and the point S_j(x_{s_j}, y_{s_j}), may also exist in the same point-domain. To construct a point-domain for the point S_i, the preliminary definitions shown in Figure 4 are needed (this paper mainly utilizes two-dimensional (2D) data as the examples to explain the following preliminary definitions).
Figure caption: The whole process of constructing point-domains by utilizing the pinhole imaging strategy
Definition 1. (upper bound for searching of the first dimension data). As a rule of thumb, if the point S_i(x_{s_i}, y_{s_i}) is a cluster center, the ρ value of other potential cluster centers should be close to ρ_{s_i}. Therefore, this study should determine a searching range to explore these potential cluster centers. The first exploration concept is the search upper set (SUS); the SUS is a point-set where the points have a higher first dimension data value and a higher ρ value than the point S_i and they are nearly closest to the point S_i. Based on the SUS, the calculation processes of upper bound for searching of the first dimension data are shown as follows:
Definition 2. (lower bound for searching of the first dimension data). To maximize the odds of finding more potential cluster centers, this study should consider a situation where some potential centers may exist in a region with a slightly lower ρ value than the point S_i. Therefore, the second exploration concept is the search lower set (SLS); the SLS is also a point-set where the points have a lower first dimension data value than the point S_i, and the ρ values of these points are much closer to ρ_{s_i}. Based on the abovementioned contents, the calculation processes of lower bound for searching of the first dimension data are shown as follows:
Definition 3. (basis point in the first dimension). In this paper, the basis point in the first dimension denotes a middle value between upper bound for searching of the first dimension data and lower bound for searching of the first dimension data. The definition is shown as follows:
Definition 4. (upper bound for searching of the second dimension data). The process of upper bound for searching of the second dimension data is similar to Definition 1, and the definitions are shown as follows:
Definition 5. (lower bound for searching of the second dimension data). The process of lower bound for searching of the second dimension data is similar to Definition 2, and the definitions are shown as follows:
Definition 6. (basis point in the second dimension). The definition of the basis point in the second dimension is similar to Definition 3, and it is shown as follows:
The example shown in Figure 4 is a point-domain of the point S_i. In the point-domain S_i, the authors set the x-axis value of receiving screen (first dimension) to x_r, the y-axis value of receiving screen (first dimension) to y_r, the x-axis value of receiving screen (second dimension) to x′_r and the y-axis value of receiving screen (second dimension) to y′_r. Based on the triangular similarity theory, the relationships between four searching bounds and two basis points are shown as follows:
where the control thresholds ψ and ξ could be set manually for different clustering demands. According to formula (16) and formula (17), the side values of the point-domain can be obtained as follows:
3.2 Domain merging strategy based on point-domain similarity
Although the TMsDP transforms the relationships between points into that between point-domains, it is still a density-based clustering method. Therefore, how to perform the density analysis on point-domains is the focus of this section. This paper defines the point-domain density as follows:
Definition 7. (point-domain density). In this paper, the point-domain density denotes the number of points per unit area of a point-domain (the definition emphasizes the distribution of data points, which has statistical significance).
According to the aforementioned contents, the authors assume a set D = {D_1, D_2, D_3, …, D_n}, where n indicates the number of point-domains and D indicates a domain-set which includes all point-domains, and the calculation process of point-domain density is shown as follows (applying the function amount(θ) to calculate the number of data points in a point-domain):
The point-domain density shows the inner characteristic of a point-domain; moreover, the authors consider the outer characteristics between point-domains. Therefore, this paper constructs a novel distance definition, i.e. domain distance.
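Definition 7 (points per unit area) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the point-domain is taken to be an axis-aligned rectangle given by hypothetical `lower` and `upper` corners, and the variable `amount` mirrors the role of the paper's amount(θ) function; the paper's exact formula is not reproduced here.

```python
import numpy as np


def point_domain_density(points, lower, upper):
    """Number of points per unit area of an axis-aligned point-domain.

    `lower` and `upper` are the opposite corners of the domain
    (hypothetical parameters chosen for this sketch).
    """
    points = np.asarray(points, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    area = float(np.prod(upper - lower))
    if area <= 0:
        raise ValueError("the point-domain must have a positive area")

    # amount: how many data points fall inside the domain (borders included).
    inside = np.all((points >= lower) & (points <= upper), axis=1)
    amount = int(inside.sum())
    return amount / area


density = point_domain_density(
    [[0, 0], [1, 1], [2, 2], [5, 5]], lower=[0, 0], upper=[2, 2]
)
print(density)
```

In the paper's experiments the side length and side width of a point-domain are set equal, in which case `upper - lower` has identical components.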
Definition 8. (domain distance). Inspired by the literature (Vavpetic and Zagar, 2021; Ryu and Kamata, 2021; Nie et al., 2021), the authors adopt the Hausdorff distance to calculate the domain distance between point-domains. Assume that a point-domain D1 = {d1_1, d1_2, d1_3, …, d1_i} and the other point-domain D2 = {d2_1, d2_2, d2_3, …, d2_j}, where d1_i and d2_j denote the two different points and i and j denote the serial number of data points in D1 and D2, respectively. The calculation process of the domain distance is shown as follows:
For calculating the domain distance, the authors still need to consider two additional situations: (1) is there an intersection part between the two point-domains? (2) are the points in these two point-domains uniformly distributed? The authors take the data set spiral as an example to describe these two situations, and the results are shown in Figures 5 and 6. As shown in Figure 5(a), point-domain 1 and point-domain 2 have no intersection part, which means that the point-domain similarity could take the domain distance as the only calculation criterion. But in Figure 5(b), point-domain 1 and point-domain 3 have an intersection part and there is also an intersection part between point-domain 2 and point-domain 3. When there exist intersection parts between point-domains, the calculation of point-domain similarity needs to take into account the intersection part, and it is shown as follows:
As shown in Figure 6(a), point-domain 1 has some independent sparse points in the red circle region, and it can be clearly seen that these sparse points deviate from the overall distribution trend of the points in the point-domain. Therefore, the calculation process of point-domain similarity should be performed on the points in the overall distribution trend rather than on the sparse points. In Figure 6(b), point-domain 2 does not have sparse points and the overall distribution trend of points is relatively stable.
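The Hausdorff distance underlying Definition 8 can be sketched as follows. The sketch implements the classical symmetric Hausdorff distance with Euclidean point distances; whether the paper uses exactly this form or a variant cannot be confirmed from the text, so it should be read as an assumption. The function name `hausdorff_distance` is hypothetical.

```python
import numpy as np


def hausdorff_distance(d1, d2):
    """Classical symmetric Hausdorff distance between two point sets."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)

    # dist[a, b] is the Euclidean distance between d1[a] and d2[b].
    diff = d1[:, None, :] - d2[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Directed distances: the worst-case nearest-neighbor distance
    # from D1 to D2 and from D2 to D1.
    forward = dist.min(axis=1).max()
    backward = dist.min(axis=0).max()
    return float(max(forward, backward))


print(hausdorff_distance([[0, 0], [1, 0]], [[0, 0], [4, 0]]))
```

Because the distance matrix has one entry per pair of points, the cost grows with the product of the two domain sizes, which is in line with the O(n^2) estimate given for the domain distance in the time complexity analysis.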
Inspired by the literature (Yarinezhad and Hashemi, 2019), the authors propose a strategy to identify the sparse points in this paper. In manifold data sets, it is easy to see which points are sparse points. However, for other types of data sets, the sparse points cannot be identified visually.
Assume that a point-domain could be divided into two regions with equal areas and the density values of these two regions are set to ρ_1 and ρ_2, respectively. In addition, the authors assume that ρ_1 is greater than ρ_2, that the density value of the whole point-domain is ρ and that the difference between ρ_1 and ρ_2 is compared with the value of 0.8ρ. If the difference between ρ_1 and ρ_2 is greater than 0.8ρ, the points in the region with the smaller density are identified as the sparse points.
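The sparse-point rule above can be sketched in Python. The paper does not say how the point-domain is divided into two equal-area regions, so this sketch assumes an axis-aligned rectangular domain split at the midpoint of its first dimension, and it assumes that all given points lie inside the domain. The function name and the `ratio` parameter (defaulting to the 0.8 from the text) are hypothetical.

```python
import numpy as np


def sparse_point_mask(points, lower, upper, ratio=0.8):
    """Boolean mask marking the points judged to be sparse.

    Assumption: the domain [lower, upper] is split into two equal-area
    halves at the midpoint of the first dimension.
    """
    points = np.asarray(points, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    area = float(np.prod(upper - lower))
    half_area = area / 2.0
    mid = (lower[0] + upper[0]) / 2.0

    left = points[:, 0] < mid
    right = ~left

    rho_left = left.sum() / half_area
    rho_right = right.sum() / half_area
    rho = len(points) / area  # density of the whole point-domain

    # Points in the less dense half are sparse when the density gap
    # exceeds ratio * rho.
    if abs(rho_left - rho_right) > ratio * rho:
        return left if rho_left < rho_right else right
    return np.zeros(len(points), dtype=bool)


pts = [[x, y] for x in (0.5, 1.0, 1.5) for y in (0.5, 1.0, 1.5)] + [[3.5, 1.0]]
print(sparse_point_mask(pts, lower=[0, 0], upper=[4, 2]))
```

In this example the left half holds nine points and the right half one, so the density gap (2.0) exceeds 0.8ρ (1.0) and only the isolated point on the right is flagged.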
According to the rule of thumb, if there are more intersection parts between two subjects and these two subjects are much nearer, the two subjects are more likely to merge into one. Therefore, the TMsDP adopts the domain distance and the intersection part between two point-domains to calculate the domain similarity. The calculation formula is shown as follows:
where sim denotes the domain similarity, γ denotes a random parameter with a range of values in (0, 1) and s denotes the adjustment operator which aims to make the value of domain similarity in (0, 1). Considering that the distance value between point-domains with intersection must be smaller than that between point-domains without intersection, the larger the distance value between point-domains is, the smaller the similarity is. Therefore, this paper adds the adjustment operator o(θ) and the adjustment parameter γ to ensure that the domain similarity between point-domains without intersection is less than that between point-domains with intersection. Figure 7 shows the merging situation of two point-domains, which takes the data set 2circles as an example. In fact, these strategies and methods proposed in this paper increase the impact of the parameters on the clustering result. Apart from the original parameter d_c, the TMsDP adds the parameters τ and ω to determine the exploration range of the potential cluster centers, adds the parameters ψ and ξ to determine the size of the point-domain and the value of the domain density and adds the parameters o(θ) and γ to determine the domain similarity. Actually, the most significant parameter in the TMsDP is the side value of different point-domains, and the parameters mentioned above are finally utilized to calculate the side value. The side value of point-domains will be shown in the following specific experimental results (in the following experiments, the authors set the side length and side width to equal values in a point-domain).
The overall procedures of the TMsDP are shown in Algorithm 1 (Table I).

Time complexity analysis
For the TMsDP, the time complexity analysis is considered from the following aspects: (1) the time complexity of the point-domain is close to O(n); (2) the time complexity of the calculation about the domain distance is close to O(n^2) and (3) the time complexity of the domain similarity is close to O(n^2). Thus, the time complexity of the TMsDP is close to O(n^2 + n^2 + n), which is close to that of the original DP (the time complexity of the DP clustering is O(n^2)).
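The claim that the combined cost stays of the same order as the DP follows from a one-line bound, written out here as a worked step (the constant 3 is only one convenient choice):

```latex
n^2 + n^2 + n \;=\; 2n^2 + n \;\le\; 3n^2 \quad (n \ge 1),
\qquad\text{hence}\qquad
O(n^2 + n^2 + n) \;=\; O(n^2).
```

The asymptotic order is therefore unchanged, while the leading constant roughly doubles relative to a single O(n^2) pass.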

Experimental results and analysis
To illustrate the performance of the proposed method, this section selects 12 synthetic data sets and 12 real-world data sets as the experiment samples [1]. The 12 synthetic data sets include 2circles, compound, twocirclesnoise2, spiral, pathbase, jain, flame, D1, D2, DS5, skewed and unbalance. The 12 real-world data sets include thyroid, breast, glass, liver, heart, seeds, zoo, wine, vote, iris, dna and msplice. The specific characteristics of these experiment data sets are shown in Table II. In addition, to further demonstrate the clustering performance of the proposed method, the TMsDP is compared with DP (Rodriguez and Laio, 2014), density peaks clustering based on logistic distribution and gravitation (DPC-LG), DBSCAN (Ester et al., 1996), Affinity Propagation Algorithm (AP) (Frey and Dueck, 2007) and K-means (Jain, 2010). This paper takes the Rand index (RI, the range of values is from −1.0 to 1.0), F-measure (FM, the range of values is from −1.0 to 1.0), Jaccard index (JI, the range of values is from 0 to 1.0) and normalized mutual information (NMI, the range of values is from −1.0 to 1.0) as the evaluation criteria to measure the clustering performance.
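To make the evaluation criteria concrete, the following Python sketch computes three of them from pair counts. The paper does not spell out its formulas, so the definitions used here are assumptions: the plain (unadjusted) Rand index, the pair-counting Jaccard index and the pair-counting F1 score as the F-measure. The function name `pair_counting_scores` is hypothetical, and NMI is omitted because it is not a pair-counting measure.

```python
from itertools import combinations


def pair_counting_scores(true_labels, pred_labels):
    """Return RI, JI and FM computed over all pairs of points."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both partitions
        elif same_pred:
            fp += 1  # grouped together only by the clustering
        elif same_true:
            fn += 1  # grouped together only in the ground truth
        else:
            tn += 1  # separated in both partitions

    total = tp + fp + fn + tn
    positives = tp + fp + fn
    return {
        "RI": (tp + tn) / total if total else 1.0,
        "JI": tp / positives if positives else 1.0,
        "FM": 2 * tp / (2 * tp + fp + fn) if positives else 1.0,
    }


print(pair_counting_scores([0, 0, 1, 1], [0, 0, 1, 2]))
```

Pair counting makes the scores independent of how cluster labels are numbered, which is why a predicted label of 2 in the example is not penalized by itself.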

According to the visualization of the clustering results, this study finds that the proposed method, DBSCAN, DP and DPC-LG can obtain more accurate clustering results when analyzing some manifold data sets (such as jain and spiral). However, when analyzing some data sets with multiple centers (such as 2circles, compound and twocirclesnoise2) and the data sets with unbalanced and skewed size (such as unbalance and skewed), only the proposed TMsDP can obtain accurate clustering results among the six algorithms in the comparison experiments. Meanwhile, when analyzing some data sets with varying sizes (such as D1 and DS5) and some data sets with irregular shapes (such as flame and DS5), the TMsDP still obtains more accurate clustering results than the other five comparison algorithms. In order to compare the clustering performance of these six methods more precisely, Tables III and IV present the evaluation index values of different algorithms with different parameter value settings, which demonstrate that the TMsDP outperforms the other compared algorithms. According to Tables V and VI, the TMsDP obtains larger values in almost all the four evaluation metrics than the other five comparison algorithms when analyzing 12 real-world data sets. Of course, considering the diversity of data structural characteristics, the TMsDP does not obtain the best values in all evaluation metrics on all test data sets. Nevertheless, according to the available comparison results, the better clustering performance of TMsDP is still evident.

Robustness analysis
In this experiment, the authors select the data sets seeds and liver with different degrees of noise to evaluate the robustness of the compared algorithms. The authors generate different amounts of random data points as noise in the value space of the original data set. The results are shown in Figure 20. As shown in Figure 20, with the increasing proportion of noise, the average FM value of each algorithm decreases. However, the average FM value of the TMsDP drops at the lowest rate, while that of AP drops at the highest rate. Due to the small sample size of the data sets, the average FM values of TMsDP, DP, DPC-LG, DBSCAN and K-means are almost identical when the noise level rises from 1.0 per cent to 10.0 per cent. Therefore, the TMsDP retains higher accuracy in each case and illustrates higher robustness than the compared algorithms.
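The noise-injection step described above can be sketched as follows. The paper only states that random points are generated in the value space of the original data set, so the specifics here are assumptions: the noise is drawn uniformly within the per-dimension minimum and maximum of the data, and the noise level is expressed as a fraction of the original sample size. The function name `add_uniform_noise` and the `seed` parameter are hypothetical.

```python
import numpy as np


def add_uniform_noise(data, noise_ratio, seed=0):
    """Append uniformly distributed noise points to a data set.

    Returns the augmented array and the number of noise points added.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)

    n_noise = int(round(noise_ratio * len(data)))

    # Value space of the original data: per-dimension bounding box.
    low = data.min(axis=0)
    high = data.max(axis=0)
    noise = rng.uniform(low, high, size=(n_noise, data.shape[1]))

    return np.vstack([data, noise]), n_noise


original = np.arange(200, dtype=float).reshape(100, 2)
noisy, n_noise = add_uniform_noise(original, noise_ratio=0.10)
print(noisy.shape, n_noise)
```

Repeating this for several ratios (for example from 0.01 to 0.10) and averaging the FM value over repeated runs would produce a robustness curve of the kind discussed for Figure 20.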

Running time analysis
In this section, the authors compare the running time of TMsDP with DPC-LG and DP on the 24 data sets, which fall into five different categories, i.e. (1) the synthetic 2D data sets with the data volume being less than 1,000, (2) the synthetic 2D data sets with the data volume being greater than or equal to 1,000, (3) the real-world data sets with the range of dimensions being from 2 to 10 and the data volume being less than 1,000, (4) the real-world data sets with the dimensions being greater than 10 and the data volume being less than 1,000 and (5) the real-world data sets with the dimensions being greater than 150 and the data volume being greater than or equal to 2,000. The reported value for each of these three algorithms is the average running time over 30 runs. The overall running speed is slow when dealing with higher dimensional data sets due to the limited running environment (Intel Core i5, 2.40 GHz, 8 GB RAM and MATLAB 2014a); therefore, when running some data sets with a large sample size and high dimensions, the overall running time of these three comparison algorithms is relatively long. In addition, because the TMsDP is an improved algorithm based on the DP, three DP-based algorithms (TMsDP, DPC-LG and DP) are selected for comparison. The running time result is shown in Table VII. As shown in Table VII, the running time of the TMsDP is about twice as long as that of the DP. This is consistent with Section 3.3, where the time complexity of the TMsDP is close to O(n^2 + n^2 + n), which is close to that of the DP (the time complexity of the DP is O(n^2)). Therefore, the actual running time of the TMsDP is not more than twice as long as that of the traditional DP.
Figure 12. The clustering result for data set compound
Figure 13. The clustering result for data set pathbase
4.5 Overall performance review
In this paper, 12 synthetic data sets and 12 real-world data sets are utilized as experimental samples to demonstrate the clustering performance of the proposed method.
According to the clustering results of the 12 synthetic data sets, it can be seen that the proposed TMsDP shows better clustering performance than the others on the manifold data sets, such as spiral and jain. Moreover, on the multiple center data sets, such as 2circles, compound and twocirclesnoise2, and the data sets with an unbalanced and skewed size, such as unbalance and skewed, the original DP struggles to find potential centers in low-density regions, while the TMsDP method adopts point-domains to explore more potential cluster centers for achieving better clustering results. In addition, on the irregularly shaped data sets, such as DS5, flame and spiral, and the data sets with varying sizes, such as D1 and DS5, the TMsDP still obtains more accurate clustering results than the other compared algorithms.
Figure 16. The clustering result for data set DS5
Figure 17. The clustering result for data set flame
According to the clustering results of 12 real-world data sets, it is clearly shown that the TMsDP method obtains better values in almost all evaluation metrics than the other mentioned algorithms. In summary, the TMsDP improves the clustering performance compared with the original DP and expands the theoretical prospects of the density-based algorithms.

Conclusion
To address the deficiencies of DP (i.e. failing to identify the cluster centers in low-density regions and struggling to analyze a category with multi-centers), this paper proposes the TMsDP. The TMsDP makes three significant contributions: (1) constructing point-domain by introducing the pinhole imaging strategy to expand the search range for finding potential cluster centers; (2) proposing the novel methods to calculate point-domain density, domain distance and domain similarity and (3) finishing the clustering process based on domain similarity. The experimental results on 12 synthetic data sets and 12 real-world data sets illustrate that the TMsDP shows significantly improved clustering performance compared with the original DP and the other algorithms experimentally compared in the paper.
Although the proposed method shows better clustering performance, it adds six additional parameters, which could have more impact on the clustering results. Therefore, the authors divide the future research plan into two aspects. In the theoretical aspect, the first part is to explore an improved calculation method of side values for reducing the number of parameters while preserving the clustering performance of the TMsDP; the second part is to update the calculation strategies of point-domain similarity and domain distance to accelerate the algorithm and the third part is to redesign a novel search mechanism and structure to automatically explore the potential cluster centers. In the application aspect, the authors plan to extend the application fields of the TMsDP. Data from real-world problems often have structures that are different from those of the experimental data sets mentioned above. Most of these data have diverse characteristics, including multiple clustering centers, clustering centers in the low-density region, unbalanced density distribution and unbalanced sample size distribution. For example, the text data, the consumption data of consumers, the stock data, the financial data and the image data all have complex data features. Therefore, this study could apply the TMsDP to solve some related real-world problems, such as the topic identification of the online public opinion (mainly performing the text clustering), the customer segmentation for some enterprises (mainly performing the clustering analysis on the consumption data of consumers) and the problems of the facial image segmentation and detecting the CT scan images (mainly performing the image recognition). In addition, this study could also combine the TMsDP with some swarm intelligence optimization algorithms to solve the optimization problems in the real world.