An anomaly detection method to improve the intelligent level of smart articles based on multiple group correlation probability models

Purpose – The purpose of this paper is to detect abnormal data from complex, sophisticated industrial equipment fitted with sensors, quickly and accurately. With the rapid development of the Internet of Things, more and more equipment carries sensors, and complex industrial equipment in particular is fitted with a large number of them. Large amounts of monitoring data are collected rapidly to monitor the operation of the equipment, and detecting abnormal data quickly and accurately has become a challenge. Design/methodology/approach – The authors propose an approach called Multiple Group Correlation-based Anomaly Detection (MGCAD), which can detect equipment anomalies quickly and accurately. The single-point anomaly degree of the equipment and the correlation within each class of data sequences are modeled with multiple group correlation probability models (probability distribution models that support equipment anomaly detection), and equipment anomaly detection is performed on this basis. Findings – Experiments on simulation data sets based on real data show that MGCAD performs better than existing methods in processing multiple monitoring data sequences.


Introduction
As human beings enter the network era, intelligence is becoming more extensive and more complex. The individuals, enterprises, governmental agencies, smart equipment and articles in the physical space are becoming increasingly intelligent in future Web-based industrial operation systems and social operation management patterns (Chai et al., 2017). The smart article is one kind of intelligent subject in CrowdIntell Network (Wang et al., 2019). Smart articles have one or more functions and abilities to solve problems and complete tasks, and they can assist other intelligent subjects (such as people) in completing certain transactions. In CrowdIntell Network, smart articles mainly include intelligent monitoring equipment, intelligent transportation equipment, intelligent communication equipment and other intelligent auxiliary equipment.
Unlike the other three Digital-selfs in CrowdIntell Network, the Digital-self of a smart article has no mental model (Wang et al., 2019), and its intelligent level is reflected in its ability to complete tasks. For example, in CrowdIntell Network, sensors collect data and help project the real world into cyber space, and their ability to do so reflects their level of intelligence: sensors that can collect heterogeneous data may have a higher level of intelligence than sensors that can only collect homogeneous data. In the process of collecting real-world data and projecting it into cyber space, whether the sensors can actively detect abnormal data also reflects the intelligent level of the smart articles. The more intelligent the sensor is, the more abnormal data it can detect; the detection results are then fed back to the individual. Meanwhile, CrowdIntell Network needs a comprehensive, real, correct and synchronous projection, so the sensor should be able to detect anomalies in various monitoring data rapidly.
The Internet of Things technology (Atzori et al., 2010) is an important application of smart articles in CrowdIntell Network, and it has developed rapidly in recent years. The Internet of Things collects information through various sensors and monitors objects in real time. The working condition of equipment is generally monitored by a condition monitoring system, which generates multiple monitoring data sequences while working. If the equipment runs abnormally, the monitoring data collected during the fault period are typically affected. Identifying abnormal monitoring data therefore helps to identify abnormal equipment and to avoid, as far as possible, projecting incorrect data into CrowdIntell Network.
However, because both the amount and the number of types of monitoring data keep increasing, the rapid detection of anomalies in various monitoring data faces the following challenges. If a condition monitoring system monitors multiple pieces of equipment, each piece of equipment is fitted with multiple sensors. As time passes, the sensors collect a steady stream of data. It is very difficult to detect abnormal data effectively and accurately in such a large amount of data.

IJCS 3,3
Because equipment is fitted with multiple sensors, a large amount of isomerous (heterogeneous) data is generated. Among the multiple monitoring data sequences, some have similar trends. The original methods no longer suit so many types of monitoring data sequence.
Different sensors produce a large amount of isomerous monitoring data, while sensors of the same kind produce a large amount of isomorphic data. First, observe a group of monitoring data sequences from the equipment, which will be used as a point of discussion. The machine speed and pump speed from this group are displayed in Figure 4. In the normal operation of the equipment, the pump speed increases as the machine speed increases and decreases as the machine speed decreases. In the part of Figure 4 marked by the green rectangle, the pump speed and the machine speed are both in the normal range,

Intelligent level of smart articles
but while the machine speed increases significantly, the pump speed shows no obvious increase. The correlation between the pump speed and the machine speed is therefore abnormal, which indicates that the section in the green rectangle is abnormal. Conversely, even when the correlation of the isomerous data is normal, or normal to a certain extent, the equipment may still be abnormal if some isomorphic data go beyond their normal range. The correlations of data sequences compress several data sequences into a few numbers, so a transient anomaly in the data may be weakened. Figure 5 shows a set of monitoring data consisting of the machine speed, the pump speed and the lubricating oil temperature of the driving device. In the part of Figure 5 enclosed by the green rectangle, the correlation of the isomerous data is normal or normal to a certain extent, but the temperature is beyond the normal range (60°C), so this part can also be identified as an anomaly.
Figure 2 shows tendency charts of two data sequences from the equipment. The middle band of each chart is the normal range of the respective sequence. The first hydraulic pressure and the second hydraulic pressure have a similar trend, but both curves are abnormal in the rectangular part, where they go beyond the normal range. The green rectangle in Figure 2(a) is enlarged in Figure 3, where the first hydraulic pressure can be seen to go beyond the normal range (34,000-36,000 kPa). We quantify the anomaly degree of a single point according to the magnitude of the excess.
In this paper, we propose Multiple Group Correlation-based Anomaly Detection (MGCAD), which can quickly and accurately detect anomalies of equipment. In MGCAD, we first use the correlation coefficient to cluster the monitoring data sequences. If a class contains more than one kind of monitoring data sequence, we use a latent correlation vector to quantify the latent correlation of the monitoring data sequences in the class and obtain the normal range of each monitoring data sequence; if a class contains only one monitoring data sequence, we can only obtain the normal range of that sequence. Finally, we use Multiple Group Correlation Probability Models (MGCPM) to model the latent correlation vectors (lcv) and the anomaly degree of single points. Using MGCPM, we can quickly and accurately detect anomalies of the equipment. Our main contributions are as follows:
We use correlation to cluster all of the monitoring data sequences, so that the monitoring data sequences within each class are related to each other. We find the normal range of each monitoring data sequence and quantify the anomaly degree of a single point according to how far the data go above or below that range.
We use the correlation coefficient between monitoring data sequences to express the correlation between the monitoring data; the correlation of each monitoring data sequence with the other monitoring data sequences is then expressed as the sum of squares of the elements of the corresponding column vector.
We propose the Multiple Group Correlation-based Anomaly Detection method for a large number of monitoring data sequences. We model the single-point anomaly degree and the correlation of each class of data sequences using Multiple Group Correlation Probability Models, which makes the anomaly detection work very well.
We reduce the dimension of the monitoring data sequences by clustering and then detect anomalies in each class separately.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 sets out the problem and outlines our method. Section 4 introduces our approach in detail. Section 5 presents the experimental evaluation. Section 6 summarizes the paper.

Related work
To the best of our knowledge, there are various anomaly detection methods, mainly classification-based, nearest neighbor-based, clustering-based, statistical and others. They are displayed in Figure 6. Classification-based. Classification-based anomaly detection uses classification rules to divide data into normal data and abnormal data. Classification-based approaches consist of support vector machine-based approaches (Ma and Perkins, 2003) and neural network-based approaches (Schlechtingen and Santos, 2011). Statistical approaches consist of parametric (Aggarwal and Yu, 2008) and nonparametric approaches.
There are some existing methods for anomaly detection of multiple time series. Ding et al. (2014, 2016) proposed LCAD, which considers only the correlation of the data and does not use correlation for clustering. Zhang et al. (2009) proposed an abnormal trend detection method for multiple data streams, but their method is very time-consuming.

All of the aforementioned methods focus on isomorphic data without considering the correlation between monitoring data sequences. For a large number of monitoring data, it is not easy to detect anomalies. If only the correlation is considered, there may be uncorrelated sections in a large number of monitoring data sequences, as shown in Figure 1. If only the correlation of the isomerous data is considered while the isomorphic data are ignored, an anomaly may not be detected, as shown in Figure 4. In short, the existing methods cannot solve anomaly detection well for a large number of related time series.

3. Problem settings and outline of proposed approach
3.1 Collection plan of monitoring data
A large number of pieces of equipment of the same type are given, defined as $O = \{E_1, E_2, E_3, \dots, E_N\}$, where $E_N$ represents the N-th equipment. Each equipment is fitted with $K$ sensors, expressed as $S = \{S_1, S_2, S_3, \dots, S_K\}$, where $S_K$ represents the K-th sensor. Each equipment generates $K$ types of monitoring data sequence, one per sensor; for the K-th sensor of the N-th equipment, $V_N^K(T)$ represents the data collected at time $T$.

Outline of proposed approach
We have N pieces of equipment of the same type, each equipped with K sensors, and the sensors record data while the equipment works. Equipment $E_i$ produces a total of $L_i$ working cycle sequence groups, so that, with $L = \sum_{i=1}^{N} L_i$, the equipment produces a total of $L$ working cycle sequence groups.
In the data pre-processing phase, our main job is to collate and clean the data, to keep the isomerous data of a work cycle at the same dimension, and to prepare for the second phase.
In the latent correlation extraction phase, we extract correlations from the data produced by the first phase using correlation coefficients and sums of squares.
In the monitoring data sequence clustering phase, we use the correlation coefficient to cluster all the monitoring data sequences. The monitoring data sequences within each class are related to each other, and the correlation between monitoring data sequences of different classes is very small.
In the phase of setting the normal range of each data sequence, we set the normal range of each data sequence by observing the data, consulting reference information and consulting engineers.
In the multiple group correlation probability models training phase, we model the single-point anomaly degree and the correlation of each class of data sequences using multiple group correlation probability models.
In the anomaly detection phase, we use the models from the fifth phase to carry out anomaly detection.

4. MGCAD: our anomaly detection approach
4.1 Data pre-processing
Because different sensors collect data at different frequencies and the monitoring data may be dirty, we need to reconstruct the data to facilitate the extraction of latent correlations between isomerous monitoring data sequences, so that the monitoring data sequences of a work cycle keep the same dimension while preserving as much of the meaning of the original data as possible. Piecewise aggregate approximation (PAA) is a well-known dimension reconstruction technique that is widely used in data processing. As shown in Figure 7, after processing with PAA (Chakrabarti et al., 2002; Faloutsos, 2000), the original data is eventually represented by 10 values.
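In the l-th work cycle, the k-th monitoring data sequence is reconstructed with PAA, which divides the sequence into segments of equal length and replaces each segment by its mean value.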
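The PAA step can be illustrated with a minimal sketch. The function name `paa`, the parameter `n_segments` and the sample data are assumptions made for illustration, not the authors' implementation; when the sequence length is not a multiple of the segment count, this sketch simply lets segment lengths differ by one point, which is a simplification.

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise aggregate approximation: replace each segment by its mean."""
    values = np.asarray(series, dtype=float)
    if n_segments <= 0 or n_segments > values.size:
        raise ValueError("n_segments must be between 1 and the series length")
    # np.array_split accepts lengths that are not a multiple of n_segments;
    # in that case the segment sizes differ by at most one point (assumption).
    chunks = np.array_split(values, n_segments)
    return np.array([chunk.mean() for chunk in chunks])

# Hypothetical raw work cycle sequence with 20 samples, reduced to 10 values.
cycle = np.arange(20.0)
reduced = paa(cycle, 10)
```

Applying the same `n_segments` to every sensor in a work cycle gives all sequences the same dimension, which is what the later correlation step requires.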

Latent correlation extraction
First, we consider a single work cycle. A work cycle sequence group contains K work cycle sequences, and in this part we define the latent correlation between these work cycle sequences.
Let $(X, Y)$ be a two-dimensional random variable with $\mathrm{Var}(X) > 0$ and $\mathrm{Var}(Y) > 0$. Then

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}$$

is the correlation coefficient between $X$ and $Y$. If $0 < |\mathrm{Corr}(X, Y)| < 1$, $X$ and $Y$ have a certain degree of linear relationship: the closer $|\mathrm{Corr}(X, Y)|$ is to 1, the higher the degree of linearity, and the closer it is to 0, the lower the degree of linearity. The covariance alone does not have this property: if the covariance is small and the two standard deviations are also small, the ratio is not necessarily small (Fisz and Bartoszyński, 2018). In this paper, we use correlation coefficients to represent latent correlations between isomerous data sequences. For the l-th work cycle, we define a latent correlation coefficient matrix, denoted $LCCM^l$, to measure the latent correlation of the work cycle sequences:

$$LCCM^l = \begin{pmatrix} C^l_{11} & C^l_{12} & \cdots & C^l_{1K} \\ C^l_{21} & C^l_{22} & \cdots & C^l_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ C^l_{K1} & C^l_{K2} & \cdots & C^l_{KK} \end{pmatrix} \tag{4}$$

where the element $C^l_{ij}$ is the latent correlation parameter between the i-th and the j-th work cycle sequences, computed as the correlation coefficient of these two sequences. In this way, the correlation coefficient extracts the correlation between the monitoring data sequences of each work cycle.

4.3 Monitoring data sequence clustering
K sensors are installed on one equipment, and each equipment can collect K types of monitoring data sequence. We use the correlation between monitoring data sequences to cluster them (Guha et al., 1998; Kanungo et al., 2002; Cacciari et al., 2008). The monitoring data sequences within each class are related to each other, and the correlation between monitoring data sequences of different classes is very small. We compute a correlation coefficient matrix for the monitoring data sequences of each work cycle using formula (4), and thus obtain $L$ correlation coefficient matrices: $LCCM^1, LCCM^2, \dots, LCCM^L$. These $L$ matrices are then used to calculate the average correlation coefficient matrix $\overline{LCCM}$:

$$\overline{LCCM} = \frac{1}{L} \sum_{l=1}^{L} LCCM^l \tag{5}$$

We cluster the monitoring data sequences according to equation (5). The clustering results are shown in Table I.
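The correlation and clustering steps can be sketched as follows. The per-cycle matrices and their average follow the description above; the grouping rule (connected components over pairs whose absolute average correlation reaches a threshold of 0.8), the function names and the sample data are assumptions for illustration only, since the text does not state which clustering rule is applied to the average matrix.

```python
import numpy as np

def average_lccm(cycles):
    """cycles has shape (L, K, T): L work cycles, K sequences, T points each."""
    cycles = np.asarray(cycles, dtype=float)
    # np.corrcoef treats each row as one variable, so each cycle gives a K x K matrix.
    matrices = np.array([np.corrcoef(cycle) for cycle in cycles])
    return matrices.mean(axis=0)

def cluster_by_correlation(avg_matrix, threshold=0.8):
    """Assumed rule: sequences linked by |correlation| >= threshold share a class."""
    k = avg_matrix.shape[0]
    labels = [-1] * k
    current = 0
    for start in range(k):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over strongly correlated pairs
            i = stack.pop()
            for j in range(k):
                if labels[j] == -1 and abs(avg_matrix[i, j]) >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical data: two linearly related sequences and one alternating sequence.
t = np.arange(10.0)
alt = np.where(t % 2 == 0, 1.0, -1.0)
cycles = np.array([[t * (l + 1), 2 * t + l, alt] for l in range(3)])
avg = average_lccm(cycles)
labels = cluster_by_correlation(avg)
```

Here the first two sequences fall into one class and the alternating sequence forms a class of its own, mirroring the two cases (several sequences per class, or exactly one) that the method distinguishes.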

Set the normal range of each data sequence
We set the normal range of each data sequence by observing the data, consulting reference information and consulting engineers. In the modeling, single-point data can lie above or below the normal range, which helps us quantify the anomaly degree of the single point. For example, by investigating the data we know that when a concrete pump truck runs without interruption, the oil temperature does not exceed 70 degrees Celsius; otherwise, the truck should be stopped for testing. Therefore, we set the normal range of the oil temperature to at most 60 degrees Celsius.

Table I (excerpt). Clustering results: second class W3, W5, W6; last class W4.

Multiple group correlation probability models
In Section 4.3, we clustered the monitoring data sequences, so that the sequences within each class are related to each other and the correlation between sequences of different classes is very small. We model each class of data sequences separately. There are two cases after clustering: a class contains either more than one kind of monitoring data sequence or exactly one. We consider the first case first. As in Table I, the second class has three kinds of monitoring data sequence: $W_3$, $W_5$ and $W_6$. Extracting these three monitoring data sequences for each of the $L$ cycles, we obtain $L$ correlation coefficient matrices $LCCM^1_2, LCCM^2_2, LCCM^3_2, \dots, LCCM^L_2$, where

$$LCCM^l_2 = \begin{pmatrix} C^l_{33} & C^l_{35} & C^l_{36} \\ C^l_{53} & C^l_{55} & C^l_{56} \\ C^l_{63} & C^l_{65} & C^l_{66} \end{pmatrix}.$$

We take the sum of squares of the elements of each column vector of the l-th latent correlation coefficient matrix, that is,

$$\lambda^l_j = \sum_{i \in \{3, 5, 6\}} \left(C^l_{ij}\right)^2, \quad j \in \{3, 5, 6\},$$

and collect the results as $lcv^l_2 = \{\lambda^l_3, \lambda^l_5, \lambda^l_6\}$. The element $\lambda^l_i$ is defined as a latent correlation factor; it represents the correlation between the i-th work cycle sequence and every other work cycle sequence. Accordingly, $LCCM^1_2, LCCM^2_2, LCCM^3_2, \dots, LCCM^L_2$ are expressed as $lcv^1_2, lcv^2_2, lcv^3_2, \dots, lcv^L_2$. Collecting the k-th factor over all cycles gives

$$V_k = \left(\lambda^1_k, \lambda^2_k, \dots, \lambda^L_k\right)^{T}, \quad k \in \{3, 5, 6\},$$

where the vector $V_k$ represents the k-th latent correlation factor vector, which reflects the latent correlation between the k-th monitoring data sequence and each of the other monitoring data sequences in each work cycle.
We assume that the latent correlation factors $\lambda$ in the vector $V_k$ follow a Gaussian distribution. To verify this assumption, we used the IBM SPSS Statistics tool to test the distribution of our experimental data.
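As an illustration of the latent correlation vector, the following sketch computes the column-wise sum of squares of the per-cycle correlation matrix for one class. The function name, the array layout and the sample sequences are assumptions for illustration; only the sum-of-squares rule itself comes from the description above.

```python
import numpy as np

def latent_correlation_vectors(class_cycles):
    """class_cycles has shape (L, M, T): the M sequences of one class over L cycles.

    Returns an (L, M) array: row l is the latent correlation vector of cycle l,
    and column k collects the k-th latent correlation factor over all cycles.
    """
    class_cycles = np.asarray(class_cycles, dtype=float)
    lcvs = []
    for cycle in class_cycles:
        lccm = np.corrcoef(cycle)              # M x M correlation matrix of the cycle
        lcvs.append((lccm ** 2).sum(axis=0))   # sum of squares of each column
    return np.array(lcvs)

# Hypothetical class with three sequences over two work cycles.
t = np.arange(10.0)
alt = np.where(t % 2 == 0, 1.0, -1.0)
normal_cycle = [t, 2 * t + 1, 3 * t - 2]   # all three sequences move together
broken_cycle = [t, 2 * t + 1, alt]         # the third sequence loses its correlation
lcv = latent_correlation_vectors([normal_cycle, broken_cycle])
```

When all three sequences are perfectly correlated, every factor equals 3 (three squared correlations of 1); when one sequence decouples, the factors drop, which is why small factors point towards anomalies.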
We use the characteristics of the Gaussian distribution to carry out anomaly detection. We obtain three Gaussian distributions, each corresponding to a pair of parameters $(\mu, \sigma)$. Under our method, abnormal values of $\lambda^l$ lie on the left side of the distribution, and the further to the left, the more abnormal. If the calculated $\lambda^l_k$ satisfies $\lambda^l_k \le \mu_k - 3\sigma_k$, we consider the equipment to be anomalous in the l-th cycle. However, the correlation between isomerous data may be normal, or normal to a certain extent, while the equipment is still abnormal, so considering only the correlation between the isomerous data may miss anomalies. The correlations of data sequences compress several data sequences into a few numbers, so a transient anomaly in the data may be weakened. We must therefore also consider the anomaly degree of single points. We amplify the abnormal probability of $\lambda_k$ by adding an anomaly factor that modifies $\lambda_k$. In this modification, $\lambda_k$ is the k-th latent correlation factor of the test data, $W_k$ denotes the monitoring data sequence collected by the k-th sensor, $b^1_k$ is the upper limiting value and $b^2_k$ is the lower limiting value.
At this point, the model of the second class of monitoring data sequences can be set up. We then consider the second case. As shown in Table I, the last class has one monitoring data sequence: $W_4$. In this case, we cannot use correlation anomalies to detect abnormal data; we can only consider the anomaly degree of single points.
Here, $\mu_k$ is the mean value of the k-th monitoring data, $\sigma_k$ is the standard deviation of the k-th monitoring data, $W_k$ denotes the monitoring data sequence collected by the k-th sensor, $b^1_k$ is the upper limiting value and $b^2_k$ is the lower limiting value.
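The left-tail test on a latent correlation factor can be sketched as below. The function names and the training values are hypothetical, the sample standard deviation (`ddof=1`) is an assumed estimator, and the sketch covers only the three-sigma rule, not the anomaly-factor adjustment, whose formula is not reproduced here.

```python
import numpy as np

def fit_gaussian(factors):
    """Estimate (mu, sigma) for one latent correlation factor vector."""
    factors = np.asarray(factors, dtype=float)
    return factors.mean(), factors.std(ddof=1)

def is_correlation_anomaly(lam, mu, sigma):
    """Flag a factor that lies at or below mu - 3 * sigma (the left tail)."""
    return lam <= mu - 3.0 * sigma

# Hypothetical factors of one sequence, collected over normal work cycles.
train = np.array([2.90, 2.95, 3.00, 2.92, 2.98, 2.94, 2.96])
mu, sigma = fit_gaussian(train)

weak_correlation = is_correlation_anomaly(2.0, mu, sigma)    # far in the left tail
usual_correlation = is_correlation_anomaly(2.93, mu, sigma)  # within the normal spread
```

Only the left tail is tested because a loss of correlation lowers the sum of squared correlation coefficients; unusually high factors are not treated as anomalies in this sketch.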
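For a class with a single sequence, one way to quantify the single-point anomaly degree by the magnitude of the excess is sketched below. The normalization by the width of the normal range and the function name are assumptions chosen for illustration; they are not the model stated in the paper. The limits reuse the 34,000-36,000 kPa range quoted earlier for the first hydraulic pressure, and the sample readings are hypothetical.

```python
import numpy as np

def single_point_anomaly_degree(sequence, lower, upper):
    """Excess of each point beyond [lower, upper], relative to the range width.

    Points inside the normal range get degree 0 (assumed normalization).
    """
    values = np.asarray(sequence, dtype=float)
    above = np.clip(values - upper, 0.0, None)  # amount above the upper limit
    below = np.clip(lower - values, 0.0, None)  # amount below the lower limit
    return (above + below) / (upper - lower)

# Hypothetical hydraulic pressure readings in kPa.
pressure = [35000, 36500, 33000, 34000]
degrees = single_point_anomaly_degree(pressure, lower=34000, upper=36000)
```

A larger degree means the reading lies further outside the normal range, so a threshold on the degree, or its use as an amplifying factor, can be chosen per sensor.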

used for clustering.We use the LCAD method to model and detect our data sets.
The experimental results are shown in Table IV. After calculating the lcv of each cycle, we also use the Euc-KNN method (Peterson, 2009) to detect anomalies; these results are likewise in Table IV. The Euc-KNN method is a classifier based on k-nearest neighbors and Euclidean distance.
In the sixth work cycle, the oil temperature exceeds 60 degrees Celsius twice, and the machine speed increases significantly while the pump speed shows no obvious increase. In the fifth work cycle, the first hydraulic pressure and the second hydraulic pressure have multiple single-point anomalies. The experimental results show that our method is time-saving and accurate.
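A k-nearest-neighbor classifier with Euclidean distance, applied to latent correlation vectors, can be sketched as follows. The function name, the value of k, the labels (0 for normal, 1 for abnormal) and the sample vectors are assumptions for illustration and do not reproduce the experimental setup.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest in Euclidean distance."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    distances = np.linalg.norm(train_x - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical latent correlation vectors: label 0 is normal, label 1 is abnormal.
train_x = [[3.00, 3.00, 3.00], [2.90, 2.95, 2.90], [2.95, 2.90, 2.97],
           [2.00, 2.00, 1.10], [2.10, 1.90, 1.00], [1.90, 2.05, 1.20]]
train_y = [0, 0, 0, 1, 1, 1]

abnormal_cycle = knn_predict(train_x, train_y, [2.05, 2.00, 1.05])
normal_cycle = knn_predict(train_x, train_y, [2.97, 2.96, 2.99])
```

Unlike the three-sigma rule, this baseline needs labeled abnormal cycles for training and computes a distance to every training vector for each query.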

Summary
In CrowdIntell Network, smart articles such as sensors help realize a comprehensive, real, correct and synchronous projection from the real world to cyber space, and ultimately the transactions between Digital-selfs. To better understand the operating status of equipment, it is very important for equipment managers to detect abnormal equipment quickly and accurately. At the same time, the faster and more accurately sensors can detect abnormal data, the higher their intelligent level is. In this paper, we have employed MGCPM to model the latent correlation between the monitoring data sequences and the anomaly degree of single points. Extensive experimental results show that our method performs better than previous methods.

Figure 1. The group of monitoring data sequences from the equipment. The curves of Figure 1(a), Figure 1(b) and Figure 1(c) are similar; the curves of Figure 1(d) and Figure 1(e) are similar; the curve of Figure 1(f) is not similar to the other curves. Thus, monitoring data sequences can be clustered by using correlation.
Figure 2. Tendency charts of two data sequences of the equipment.
Figure 4. The upper line represents the collected machine speed; the lower line represents the collected pump speed.
Figure 6. Classification of anomaly detection methods for a single time series.
Figure 7. The raw data is represented by the dotted line; the data reconstructed by the PAA technology is indicated by solid lines (Izakian and Pedrycz, 2013).
Nearest neighbor-based. Nearest neighbor-based anomaly detection uses the distance between data points: normal data lie in dense regions, whereas abnormal data lie far from their nearest neighbors. Nearest neighbor-based approaches consist of density-based (Pokrajac et al., 2007) and distance-based approaches.
Cluster-based. Cluster-based anomaly detection (Izakian and Pedrycz, 2013) divides the data into multiple clusters; normal data belong to a cluster, whereas abnormal data do not belong to any cluster.
Statistical. Statistical approaches treat normal data as lying in the high-probability region and abnormal data as lying in the low-probability region; they are mainly parametric or nonparametric.
Altogether, the N pieces of equipment have produced a total of L working cycle sequence groups. MGCAD is mainly composed of six parts: data pre-processing, latent correlation extraction, monitoring data sequence clustering, setting the normal range of each data sequence, Multiple Group Correlation Probability Models training, and anomaly detection.