A K-means clustering method with feature learning for unbalanced vehicle fault diagnosis

Purpose – Vehicle fault diagnosis is a key factor in ensuring the safe and efficient operation of the railway system. Due to the numerous vehicle categories and different fault mechanisms, there is an unbalanced fault category problem. Most current methods for this problem have complex algorithm structures, low efficiency and require prior knowledge. This study aims to propose a new method with a simple structure that does not require any prior knowledge to achieve a fast diagnosis of unbalanced vehicle faults.
Design/methodology/approach – This study proposes a novel K-means with feature learning, the feature learning K-means-improved cluster-centers selection (FKM-ICS) method, which includes the ICS and the FKM. Specifically, this study defines a cluster center approximation to select the initialized cluster centers in the ICS. In the FKM, this study uses an improved term frequency-inverse document frequency to measure and adjust the feature word weights in each cluster, retains the top t feature words with the highest weights in each cluster and performs the clustering process again. With the FKM-ICS method, clustering performance for unbalanced vehicle fault diagnosis can be significantly enhanced.
Findings – This study finds that the FKM-ICS achieves a fast diagnosis of vehicle faults on the vehicle fault text (VFT) data set, derived from a railway station's fault records in 2017. The experimental results on VFT indicate that the proposed method outperforms several state-of-the-art methods.
Originality/value – This is the first effort to address this vehicle fault diagnostic problem and the proposed method performs effectively and efficiently. The ICS enables the FKM-ICS method to exclude the effect of outliers and overcomes the disadvantage that the fault text data contain a certain amount of noisy data, which effectively enhances the method's stability. The FKM enhances the distribution of feature words that discriminate between different fault categories and reduces the number of feature words to make the FKM-ICS method faster.


Introduction
With the rapid increase in the number of railway transport vehicles, faults occur frequently, which dramatically reduces the overall efficiency of vehicle operation (Yang et al., 2018). Therefore, identifying the main factors leading to vehicle faults is necessary, so that subsequent repair work can be carried out well and corresponding procedures can be formulated. According to the research (Li et al., 2017), the main causes of vehicle faults are human, machine, environmental and management factors. These four factors are the major causes of vehicle faults and even of passenger safety problems. However, the resulting data have unbalanced fault categories: the vast majority of vehicle faults are caused by the same factors, while only a small number are caused by the remaining factors. In the age of intelligent railways, using text mining and other machine learning algorithms to achieve intelligent diagnosis of unbalanced vehicle fault text (VFT) data is an urgent technical need (Li et al., 2021).

Related work
Recently, vehicle fault diagnosis has attracted a lot of attention, the topic being to perform fault diagnosis and find fault-causing factors with machine learning methods. Zheng et al. (2007) implemented a text classification model based on rough set and fuzzy clustering theory to improve the adaptive performance and classification capability of a text classification system for vehicle faults, without addressing the unbalanced fault category problem. Gao et al. (2020) proposed a K-fold cross-validation + stacking classification model to implement a single-level classification model for the multi-level classification of signaling equipment faults and to solve the sample unbalance problem; however, it is not very efficient in practice. Hu and Meng (2020) applied the Hadoop platform and adopted a rolling stock fault diagnosis method based on an adaptive synthetic gradient boosting decision tree algorithm to compensate for the defects brought to fault diagnosis by the imbalance of the data distribution; however, the cost of data annotation is too high, and strongly supervised information such as full-truth labels is frequently hard to obtain. Zhao and Tang (2009) built a fault diagnosis expert system for an unmanned vehicle based on back propagation neural networks; however, the unbalanced fault category problem was not involved. Zhang et al. (2020) proposed an expert system for fault diagnosis of station signal control equipment, which was susceptible to outliers and less robust in practical application.

Contributions
The feature learning K-means-improved cluster-centers selection (FKM-ICS) method is proposed in this paper to address the fault imbalance problem in vehicle fault diagnosis. Data unbalance in vehicle fault diagnosis means that the vast majority of vehicle faults are caused by the same factors (Yang et al., 2018), while only a small number of faults are caused by the remaining factors. To solve this unbalance problem, the ICS eliminates the influence of isolated points and improves the stability of the method, while the FKM separates different fault categories by enhancing the discrimination among different feature word clusters. In the ICS, owing to the random setting of the initial cluster centers in the traditional K-means algorithm, the cluster results are unstable and sensitive to outliers when faced with high-dimensional and sparse VFT data (Zhao and Xu, 2015). Therefore, we propose a method for initial cluster center selection based on density and distance, inspired by the density peak method. In particular, the method combines the sample density (SD) with the distance between the samples and the selected cluster centers (CD) (Wei et al., 2020) to define a cluster center approximation w. The initial cluster centers are selected by w to obtain the initial input parameters of the K-means algorithm. This process excludes the effect of outliers, effectively remedies the poor noise immunity of the traditional K-means and improves the stability of the algorithm.
In the FKM, the traditional term frequency-inverse document frequency (TF-IDF) method (Qin and Li, 2013) treats the text set as a whole and does not consider the unbalanced fault word distribution among clusters in the VFT data set, so some obscure words are often mistaken for fault keywords. Therefore, we propose to use the probability ratio of occurrence of feature words among clusters instead of the frequency ratio to solve this problem. The improved TF-IDF feature learning method fully considers the influence of the fault word distribution on the unbalanced fault categories; it is not only simpler to implement but also achieves better clustering performance.
The major contributions of this paper are as follows: We propose a novel FKM-ICS method to solve unbalanced faults in vehicle fault diagnosis, which can effectively handle the clustering of high-dimensional, sparse and unbalanced vehicle fault text data.
The FKM generates different feature word representations for faults caused by different factors, which not only increases the distinction between fault categories but also greatly reduces the number of feature words, making subsequent vehicle fault diagnosis simpler and more efficient.
The ICS eliminates the effect of outliers through the selection of the initial cluster centers and makes subsequent vehicle fault diagnosis more efficient and stable.
The remaining part of the paper is organized as follows: Section 2 revisits the traditional K-means algorithm. Section 3 introduces the FKM-ICS method. The experiments on the VFT data set are presented in Section 4. We conclude this paper in Section 5.

Revisit: the K-means algorithm
The basic idea of the traditional K-means algorithm (MacQueen, 1967) is as follows: after inputting the number of clusters K, first select K samples randomly from the data set X = {x_1, x_2, ..., x_n} as the initial cluster centers, then calculate the distance from each sample to the K initial cluster centers, assign each sample to a cluster according to the minimum-distance principle to form K clusters and then compute the mean of each cluster to obtain the new cluster centers. This process is repeated until the cluster centers stop changing or the set number of iterations is reached. The objective function is as follows:

J = Σ_{i=1}^{K} Σ_{x_j ∈ c_i} Σ_{p=1}^{m} |x_jp − u_ip|,

where K is the number of clusters, n is the number of samples, m is the feature dimension, u_i = {u_i1, u_i2, ..., u_im} is the i-th cluster center and x_j = {x_j1, x_j2, ..., x_jm} is a sample assigned to cluster c_i. Given a data set and this objective, the K-means algorithm partitions the samples into the K clusters that minimize J. The cluster center u_i is the mean value of all samples belonging to the cluster, that is:

u_i = (1/|c_i|) Σ_{x_j ∈ c_i} x_j,

where |c_i| is the number of samples in the i-th cluster.
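As a reference point, the baseline K-means described above (random initial centers, minimum-distance assignment, mean update) might be sketched in Python as follows; the function names are ours and the Manhattan distance follows the paper's convention:

```python
import random

def manhattan(a, b):
    # Manhattan distance between two feature vectors
    return sum(abs(p - q) for p, q in zip(a, b))

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means with randomly chosen initial centers.

    X is a list of equal-length feature vectors; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(X, K)]
    labels = None
    for _ in range(max_iter):
        # assign each sample to its nearest center (minimum-distance principle)
        new_labels = [min(range(K), key=lambda k: manhattan(x, centers[k]))
                      for x in X]
        if new_labels == labels:
            break  # centers stopped changing: converged
        labels = new_labels
        # recompute each center as the mean of its cluster members
        for k in range(K):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

On two well-separated groups the algorithm recovers the expected partition, although, as the next section discusses, the random initialization makes it unstable on noisy, high-dimensional text data.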
3. The feature learning K-means based on the improved cluster-centers selection method

3.1 The proposed feature learning K-means-improved cluster-centers selection method
To improve the clustering accuracy of the traditional K-means algorithm, we propose the FKM-ICS method to achieve a fast diagnosis of vehicle faults, which includes the ICS method and the FKM method. Specifically, in the ICS we first define a cluster center approximation based on the SD and the distance between the samples and the selected cluster centers. In the FKM, the improved TF-IDF is used to measure the importance of feature words for each cluster and to adjust the weights of feature words in different clusters; the first t feature words of each cluster with the highest weights are kept and the clustering process is performed again. With the improved FKM-ICS, clustering performance for unbalanced vehicle fault diagnosis can be significantly enhanced. We find that the FKM-ICS achieves a fast diagnosis of vehicle faults on the VFT data set.
The experimental results on the VFT data set show that the method proposed in this paper outperforms several state-of-the-art methods.

3.2 The ICS: an improved cluster-centers selection method
The K-means algorithm is sensitive to the initial cluster centers, and different initial cluster centers lead to different cluster results (Wang and Shi, 2017). When faced with high-dimensional and sparse locomotive fault text data, because the initial cluster centers are set randomly, the cluster result is unstable, easily falls into a local optimum and is easily affected by noise points.
Based on the idea of density peaks and previous research experience, a cluster center selection method based on density and distance is proposed. The method combines the SD and the distance between clusters to define a cluster center approximation w, which is used to select the initial cluster centers and obtain the initial input parameters of the K-means algorithm. This process eliminates the influence of isolated points, effectively remedies the poor noise immunity of the classical K-means algorithm and its tendency to fall into a local optimum and improves the stability of the algorithm.
Definition 1: The average sample distance MeanDist(X) of data set X:

MeanDist(X) = 2/(n(n − 1)) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} Σ_{p=1}^{m} |x_ip − x_jp|,

where Σ_p |x_ip − x_jp| is the Manhattan distance between samples x_i and x_j; all distances referred to later in the paper are Manhattan distances.

Definition 2: The density ρ(i) of sample point x_i:

ρ(i) = Σ_{j≠i} f(MeanDist(X) − d(x_i, x_j)), where f(z) = 1 if z > 0 and f(z) = 0 otherwise.

A visual explanation of the defined SD is shown in Figure 1.
From Definition 2, ρ(i) is the number of sample points contained within a circle with sample point x_i as the center and MeanDist(X) as the radius.
Definition 3: The distance d(i) between sample x_i and the selected clusters, which represents the average distance from x_i to the centers C of all identified clusters (Huang et al., 2019). The mathematical expression is as follows:

d(i) = (1/|C|) Σ_{u ∈ C} d(x_i, u),

where |C| is the number of selected clusters.

Definition 4: The cluster center approximation w_i for x_i is calculated as follows:

w_i = ρ(i) × d(i).

From Definition 4, it can be seen that the larger ρ(i) (the more samples around x_i) and the larger d(i) (the larger the distance to the selected clusters), the higher the cluster center approximation. We summarize the above initial cluster-center selection process in Algorithm 1.

Figure 1. Density of sample points
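Definitions 1 to 4 translate directly into code. The following Python sketch (helper names are our own, not the paper's) computes the average sample distance, the sample density and the cluster center approximation w_i:

```python
def manhattan(a, b):
    # Manhattan distance used throughout the paper
    return sum(abs(p - q) for p, q in zip(a, b))

def mean_dist(X):
    # Definition 1: average pairwise Manhattan distance over the data set
    n = len(X)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return 2.0 * sum(manhattan(X[i], X[j]) for i, j in pairs) / (n * (n - 1))

def density(X, i, radius):
    # Definition 2: samples strictly inside a ball of the given radius around X[i]
    return sum(1 for j in range(len(X)) if j != i and manhattan(X[i], X[j]) < radius)

def approximation(X, i, centers, radius):
    # Definitions 3-4: w_i = density(i) * average distance to the selected centers
    d_i = sum(manhattan(X[i], c) for c in centers) / len(centers)
    return density(X, i, radius) * d_i
```

A point that is both dense (many close neighbours) and far from the already selected centers receives a large w_i, which is exactly the property the ICS exploits.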
Algorithm 1 The ICS method.
Input: A data set X = {x_1, x_2, ..., x_n} and the number of clusters K.
Output: The initial cluster centers U = {u_1, u_2, ..., u_K}.
Method:
1: compute the density of all samples in the data set;
2: choose the sample with the highest density in X as the first initial cluster center u_1, add it to the set of cluster centers U = {u_1} and remove all points in X that are less than MeanDist(X) away from u_1;
3: select the sample x_i with the largest w_i as the next initial cluster center, note it as u_2, add u_2 to the set U = {u_1, u_2} and, as in step 2, remove all samples in X that are less than MeanDist(X) away from u_2;
4: repeat step 3 until K initial cluster centers are selected.
The first cluster center is chosen as the densest point, which effectively avoids the influence of isolated samples and noise. The selection of the later cluster centers relies on the value of w_i: from the definition of w_i in equation (8), the later selected cluster centers have both a high density and a large distance from the already selected clusters. This not only eliminates the possibility, present in the traditional K-means algorithm's random initialization, of selecting noise or isolated samples as initial cluster centers, but is also consistent with the density peaks clustering algorithm: a cluster center is surrounded by neighbors of lower local density and lies at a relatively large distance from other points of high local density.
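Putting Definitions 1 to 4 together, the selection procedure of Algorithm 1 might be sketched as follows; this is our reading of the steps, not the authors' code:

```python
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def ics(X, K):
    """Initial cluster-center selection (a sketch of Algorithm 1)."""
    n = len(X)
    # MeanDist(X): average pairwise Manhattan distance (Definition 1)
    radius = 2.0 * sum(manhattan(X[i], X[j]) for i in range(n)
                       for j in range(i + 1, n)) / (n * (n - 1))
    # sample densities (Definition 2)
    dens = [sum(1 for j in range(n) if j != i and manhattan(X[i], X[j]) < radius)
            for i in range(n)]
    # step 2: the densest sample becomes the first center; prune its neighbourhood
    first = max(range(n), key=lambda i: dens[i])
    centers = [X[first]]
    pool = [i for i in range(n) if manhattan(X[i], X[first]) >= radius]
    # steps 3-4: repeatedly take the sample with the largest approximation w_i
    while len(centers) < K and pool:
        nxt = max(pool, key=lambda i: dens[i] *
                  sum(manhattan(X[i], c) for c in centers) / len(centers))
        centers.append(X[nxt])
        pool = [i for i in pool if manhattan(X[i], X[nxt]) >= radius]
    return centers
```

Because each selected center empties its MeanDist(X) neighbourhood from the candidate pool, isolated outliers far from any dense region cannot dominate the selection.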

3.3 The FKM: a feature learning K-means method
A K-means clustering algorithm based on improved TF-IDF feature learning is proposed in this paper. It introduces the importance of the sample features to the cluster into the objective function (Zhou et al., 2018) and fully considers the influence of the sample feature distribution on the unbalanced data, so that a better clustering effect can be achieved with a simple implementation.

3.3.1 The traditional term frequency-inverse document frequency method
The TF-IDF is a statistical method used to evaluate the importance of a feature word. It is based on the idea that the weight of a feature word in a text depends on its frequency in the text and on the number of texts that contain it. The TF is the occurrence frequency of feature word t in text x and the IDF is related to the ratio of the number of all texts in the data set to the number of texts containing the feature word t.
The common calculation is as follows:

TF = m / M,

where m denotes the number of occurrences of the feature word t in text x and M denotes the total number of feature words in text x.

IDF = log(N / (n + 1)),

where N is the total number of texts in the data set and n is the number of texts containing the feature word t; the 1 is added to avoid a zero denominator (that is, the case where no text contains t).
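A direct transcription of these two formulas, assuming each text is represented as a list of tokens:

```python
import math

def tf_idf(term, doc, docs):
    """Traditional TF-IDF of a term for one document within a corpus."""
    # TF = m / M: occurrences of the term in this text over the text's length
    tf = doc.count(term) / len(doc)
    # IDF = log(N / (n + 1)); the +1 guards against a zero denominator
    n_containing = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / (n_containing + 1))
    return tf * idf
```

Note how a word appearing in every text ("fault" in the test below) is driven to a zero or negative weight by the IDF term, while a rarer word keeps a positive weight.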
The traditional text feature weighting method TF-IDF is found to have some deficiencies: the TF-IDF treats the text set as a whole and does not take into account the uneven distribution of text features among clusters. In particular, the calculation of the IDF has an obvious defect in text clustering. In equation (10), let n_1 be the number of texts containing feature word t in cluster c_1 and n_2 the number of texts containing t in the other clusters, with n_1 >> n_2. When n_1 >> n_2 and N is fixed, the value of the IDF is small. In reality, however, the feature word t occurs much more frequently in cluster c_1 than in the other clusters, so t should have a strong discriminating ability, which is the opposite of the computed result.

3.3.2 The improved term frequency-inverse document frequency method
According to the traditional TF-IDF function, the IDF of some obscure words is often high, so these obscure words are often mistaken for text keywords. This is because the IDF only considers the relationship between feature words and the number of texts in which they occur and ignores the distribution of feature words within a cluster and across clusters. (In other words, if a feature word occurs many times in only a single text within a cluster and rarely in most of the other texts of that cluster, those individual texts may be special cases within the cluster, and such a feature word is therefore not representative.) The above problem can be solved by calculating the IDF from the ratio of the probabilities of occurrence of feature words between clusters instead of the ratio of their occurrence counts.
Definition 5: By the law of large numbers, the choice of the words used by the author of a certain type of text when writing that type of text can be regarded as a random event, so P(t_i) can be used to denote the probability of occurrence of the feature word t in cluster c_i and count(t_i) the number of times t occurs in c_i. P(t_i) can then be formalized as follows:

P(t_i) = count(t_i) / N_i,

where N_i is the total number of feature word occurrences in cluster c_i.

Definition 6: The feature words of a cluster of texts should represent the feature information of those texts well. Let P(t_i) denote the probability of occurrence of the word t in cluster c_i and Q(t_i) the sum of the probabilities that t occurs in the clusters other than c_i. Then:

Q(t_i) = Σ_{j≠i} P(t_j).

The IDF can then be redefined through the ratio P(t_i)/Q(t_i). To guarantee that the argument of the logarithm is greater than 0, that the IDF is positive and that the denominator cannot be 0 (that is, the case where no text contains the word), the corrected IDF is finally obtained as:

IDF(t_i) = log(P(t_i) / (Q(t_i) + 1) + 1).

Let TF = P(t_i) denote the probability of feature word t occurring in a certain cluster of texts, consistent with the meaning of equation (12). Finally, the improved TF-IDF formula is obtained as follows:

TF-IDF(t_i) = P(t_i) × log(P(t_i) / (Q(t_i) + 1) + 1).

Compared to the traditional TF-IDF, the proposed method not only reduces the weights of obscure words and the influence of the high-dimensional sparsity of sample features on the experimental results but also preserves the relevant information of the samples more completely. In addition, it takes into account the differences in the representation of feature words among different clusters, so that the information of feature words can be extracted more effectively.
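The cluster-aware weighting can be sketched as below. The exact smoothing in the corrected IDF (adding 1 both inside the ratio and inside the logarithm) is our reading of the paper's garbled formula, so treat it as an assumption rather than a verified reproduction:

```python
import math

def improved_tf_idf(term, cluster_id, clusters):
    """Cluster-aware TF-IDF sketch (Section 3.3.2).

    clusters maps a cluster id to a list of token lists. P is the probability
    of the term in the target cluster, Q the summed probability of the term
    in all other clusters."""
    def prob(cid):
        tokens = [t for doc in clusters[cid] for t in doc]
        return tokens.count(term) / len(tokens) if tokens else 0.0
    p = prob(cluster_id)
    q = sum(prob(c) for c in clusters if c != cluster_id)
    idf = math.log(p / (q + 1) + 1)   # positive and defined even when q == 0
    return p * idf
```

A word concentrated in one cluster ("brake" below) now outscores a word spread evenly across clusters ("fault"), which is exactly the correction the section argues for.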

Final objective function of feature learning
The objective function introduces the feature word weights into the within-cluster distances:

J = Σ_{k=1}^{K} (1/|c_k|) Σ_{x_i, x_j ∈ c_k} d_w(x_i, x_j),

where |c_k| is the number of samples in cluster c_k and d_w(x_i, x_j) is the weighted Manhattan distance between x_i and x_j:

d_w(x_i, x_j) = Σ_{p=1}^{m} w_kp |x_ip − x_jp|,

where w_kp is the feature word weight for cluster c_k. The improved TF-IDF calculates the feature word weights in cluster c_k as follows:
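The weighted Manhattan distance is straightforward to implement; the brief example below also shows how per-cluster weights can change which point is nearest, which is what lets the feature weights reshape the clusters:

```python
def weighted_manhattan(x, y, w):
    # d_w(x, y): Manhattan distance with one weight per feature dimension
    return sum(wp * abs(a - b) for wp, a, b in zip(w, x, y))
```

With uniform weights this reduces to the plain Manhattan distance; with weights [3.0, 0.1] the point [1, 3] becomes closer to [1, 0] than to [0, 3], reversing the unweighted ordering (the weight vectors here are illustrative, not from the paper).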

w_kp = g(TF-IDF(t_p, c_k)), where the function g normalizes the improved TF-IDF weights and sets to zero the weights of feature words whose TF-IDF rank in cluster c_k is lower than t. We summarize the FKM-ICS in Algorithm 2.
Algorithm 2 The FKM-ICS method.
Input: A data set X = {x_1, x_2, ..., x_n}, the number of clusters K and the number of dimensions to be retained t.
Output: The global partition.
Method:
1: initialize the cluster centers with the ICS to obtain the K initialized cluster centers;
2: calculate the distance of all samples from these K cluster centers and assign each sample to the nearest cluster according to the minimum-distance principle;
3: calculate the feature word weights for each cluster according to equation (19);
4: normalize the weights and perform K-means clustering based on the new weights; go to step 5 if converged, otherwise return to step 3;
5: output the global partition.
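The main loop of Algorithm 2 can be sketched as follows. The feature-weighting step is abstracted into a caller-supplied compute_weights function standing in for the improved TF-IDF of equation (19); with uniform weights the loop reduces to plain K-means, which makes the skeleton easy to check:

```python
def manhattan_w(x, y, w):
    # weighted Manhattan distance d_w
    return sum(wp * abs(a - b) for wp, a, b in zip(w, x, y))

def fkm(X, centers, compute_weights, max_iter=50):
    """Skeleton of Algorithm 2: assign, re-weight features, repeat.

    centers come from the ICS; compute_weights(clusters) must return one
    weight vector per cluster (in the paper, normalized improved TF-IDF
    weights with all but the top-t entries zeroed)."""
    K, m = len(centers), len(X[0])
    weights = [[1.0 / m] * m for _ in range(K)]  # start from uniform weights
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(K), key=lambda k: manhattan_w(x, centers[k], weights[k]))
                      for x in X]
        if new_labels == labels:
            break  # converged: the partition stopped changing
        labels = new_labels
        clusters = [[x for x, lab in zip(X, labels) if lab == k] for k in range(K)]
        centers = [([sum(col) / len(cl) for col in zip(*cl)] if cl else centers[k])
                   for k, cl in enumerate(clusters)]
        weights = compute_weights(clusters)
    return labels, centers
```

The split between the clustering loop and the weighting function mirrors the paper's separation of the FKM from the improved TF-IDF.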

Experiments
To verify the validity of the FKM-ICS algorithm in vehicle fault analysis, this paper tests it on the VFT data set.The experimental environment configuration is shown in Table 1.

Data set
The VFT data set is derived from a railway station's VFT data for 2017, which was mainly recorded in natural language by field personnel, and contains a total of 9,263 texts. The main information in these texts is the occurrence of a fault, the description of the fault causes and the fault category. Figure 2 shows the distribution of fault categories in the VFT data set.
From Figure 2, the faults mainly belong to the relationship person and relationship devices categories, while the relationship machines and not applicable categories contain far fewer faults, with an unbalance ratio of 1:30, a typical data unbalance problem.
The structured processing of the VFT first requires segmenting the fault text. Mainstream word segmentation techniques mainly include Chinese word segmentation based on dictionary matching, on word statistical models, on word annotation and on deep learning. In this paper, we use the jieba.cut word segmentation tool with a general dictionary and a custom domain dictionary to implement the VFT segmentation (Su et al., 2006). In addition, the vector representation of the fault text data is obtained with the bag-of-words model (Zhong et al., 2018).
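For illustration, a toy bag-of-words vectorizer over pre-tokenized documents (the paper tokenizes with jieba.cut first, which we omit here; the documents below are hypothetical):

```python
def bag_of_words(docs):
    """Map pre-tokenized documents to count vectors over a shared vocabulary."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for t in doc:
            v[index[t]] += 1  # count each token occurrence
        vectors.append(v)
    return vocab, vectors
```

The resulting count vectors are the high-dimensional, sparse inputs that the FKM-ICS clusters.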
KM: KM is a popular partition-based clustering algorithm that divides n data objects into K clusters so that similarity is high within clusters and low between clusters.
sIB: sIB is a data analysis method based on information theory. It treats the extraction of data patterns as a data compression process, compressing the data objects into a "bottleneck variable" while maximally preserving the information in the data, to obtain the patterns implicit in the data objects.
SC: SC cuts the graph formed by all data objects so that the sum of edge weights between different subgraphs is low and the sum of edge weights within subgraphs is high, thereby achieving a clustering effect.
Ncut: Ncut is based on SC and cuts the graph by considering the connections between clusters relative to the density of each cluster. Generally, the clustering performance of Ncut is better than that of SC.
WKM: WKM is a feature weighting method based on KM, proposed by Zhexue Huang, which reduces the effect of noise dimensions and improves the robustness of KM.

Evaluation indicators
In this paper, the Acc and F1-score indicators are used to evaluate the validity of the cluster results. Acc is the most commonly used metric in clustering problems; it is the ratio of the number of correctly clustered predictions to the total number of predictions, with a maximum of 1 and a minimum of 0. That is:

Acc = n_correct / n_total,

where n_correct is the number of samples that were correctly clustered and n_total is the total number of samples. F1-score is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0. The formula is as follows:

F1 = 2 × precision × recall / (precision + recall),

where precision is the proportion of true samples among the detected results, reflecting the model's ability to discriminate negative samples, and recall is the proportion of true samples that are detected, reflecting the model's ability to identify positive samples. The higher the F1-score, the greater the robustness of the algorithm.
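Both indicators are simple to compute from counts; here precision and recall are derived from true-positive, false-positive and false-negative counts (a generic sketch, not the paper's evaluation script):

```python
def accuracy(n_correct, n_total):
    # Acc = n_correct / n_total
    return n_correct / n_total

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For clustering, tp/fp/fn are obtained per category after matching each cluster to its best-fitting true label.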

Experimental settings
For all baseline methods, we refer to the parameter settings of the original paper.The optimal cluster results are then searched in the parameter space.Note that we only report the best cluster results.For the proposed FKM-ICS algorithm, one parameter t needs to be set, as shown in the hyper-parameter sensitivity test experiments.The following experiments were repeated 30 times and averaged.
4.5 Experimental results
4.5.1 Comparative experiment
Table 2 shows that the FKM-ICS algorithm achieves the best results among all the compared methods on the VFT data set. The experimental results show that the FKM-ICS algorithm can realize vehicle fault diagnosis with good diagnostic performance. The algorithm improves Acc by 4.58% and F1-score by 8.01% compared to sIB, and the time cost is reduced by 12.59 s compared to KM.
4.5.2 Ablation experiments
To verify the contributions of the ICS and the FKM to the algorithm, ablation experiments were conducted on the VFT data set. In ICS+KM, the cluster centers selected by the ICS are used as the initial cluster centers for the KM algorithm. In the FKM, the initial cluster centers are K samples randomly selected from the data set. The results of the ablation experiments are shown in Table 3.
Table 3 shows the effects of the ICS and the FKM on the algorithm on the VFT data set. As shown, the ICS can effectively improve the stability of the algorithm, the FKM can significantly improve the clustering performance and the combination of both achieves a better fault diagnosis performance.
4.5.3 Hyper-parameter sensitivity test experiments
We study the sensitivity of the proposed FKM-ICS to the parameter t by varying its value. The parameter t is the number of feature words retained in the clustering process. Normally, the larger the number of retained feature words, the better the clustering performance. Figure 3 shows that the FKM-ICS obtains its best Acc of 48.94% when t is 30, and Figure 4 shows that it obtains its best F1-score of 57.65% when t is 50. For the evaluation indicators in Figures 3 and 4, as t increases, the clustering performance of the FKM-ICS on the VFT data set gradually increases until convergence, demonstrating the stability and convergence of the FKM-ICS method.
4.5.4 Convergence analysis
We empirically study the convergence property of our FKM-ICS algorithm. Figure 5 shows the objective value of the proposed method on the VFT data set. It is observed that the objective value changes monotonically with each iteration and finally converges to an optimal solution within at most eight iterations.

Conclusion
In this paper, we propose the FKM-ICS algorithm to achieve the diagnosis of unbalanced VFT data in a more economical and faster way. The ICS enables the FKM-ICS method to exclude the effect of outliers, effectively overcoming the disadvantage that the fault text data contain a certain amount of noisy data and improving the stability of the method. The FKM enhances the distribution of feature words that discriminate between different fault categories, and reducing the number of feature words makes the FKM-ICS method a faster and better clustering method for unbalanced vehicle fault diagnosis. The experiments on the unbalanced VFT data of a railway station verify the accuracy and validity of the proposed method and provide a new idea and solution for the intelligent classification of vehicle faults. In the future, we intend to expand the application of this method to other segments of the railway domain.
Figure 2. Distribution of VFT data set fault categories
Figure 3. Dimension sensitivity test of FKM-ICS algorithm on VFT data set: Acc
Figure 4. Dimension sensitivity test of FKM-ICS algorithm on VFT data set: F1-score
Figure 5. Convergence of FKM-ICS algorithm on VFT data set