Search results
Shuai Luo, Hongwei Liu and Ershi Qi
Abstract
Purpose
The purpose of this paper is to recognize and label the faults in wind turbines with a new density-based clustering algorithm, named contour density scanning clustering (CDSC) algorithm.
Design/methodology/approach
The algorithm includes four components: (1) computation of neighborhood density, (2) selection of core and noise data, (3) scanning core data and (4) updating clusters. The proposed algorithm considers the relationship between neighborhood data points according to a contour density scanning strategy.
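The first two components listed above (neighborhood density, then selection of core and noise data) can be illustrated with a generic density-based sketch. This is not the authors' CDSC implementation: the radius `eps`, the threshold `min_pts` and both function names are assumptions borrowed from the usual DBSCAN-style formulation, and the contour scanning and cluster-update steps are not covered.

```python
from math import dist


def neighborhood_density(points, eps):
    """For each point, count how many other points lie within radius eps."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= eps)
        for i, p in enumerate(points)
    ]


def split_core_and_noise(points, eps, min_pts):
    """Label a point 'core' if its neighborhood density reaches min_pts, else 'noise'."""
    densities = neighborhood_density(points, eps)
    return ["core" if d >= min_pts else "noise" for d in densities]


# Three tightly grouped points and one isolated point (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
labels = split_core_and_noise(points, eps=0.5, min_pts=2)
```

With these illustrative values, the three grouped points are labeled core and the isolated point is labeled noise.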
Findings
The first experiment is conducted with artificial data to validate that the proposed CDSC algorithm is suitable for handling data points with arbitrary shapes. The second experiment, with industrial gearbox vibration data, compares the time complexity and accuracy of the proposed CDSC algorithm with those of other conventional clustering algorithms, including k-means, density-based spatial clustering of applications with noise, density peak clustering, neighborhood grid clustering, support vector clustering, random forest, core fusion-based density peak clustering, AdaBoost and extreme gradient boosting. The third experiment is conducted with an industrial bearing vibration data set to highlight that the CDSC algorithm can automatically track emerging bearing fault patterns in wind turbines over time.
Originality/value
Data points with different densities are clustered using three strategies: direct density reachability, density reachability and density connectivity. A contour density scanning strategy is proposed to determine whether data points with the same density belong to one cluster. The proposed CDSC algorithm achieves automatic clustering, which means that trends in the fault pattern can be tracked.
Reza Edris Abadi, Mohammad Javad Ershadi and Seyed Taghi Akhavan Niaki
Abstract
Purpose
The overall goal of the data mining process is to extract information from an extensive data set and make it understandable for further use. When working with large volumes of unstructured data in research information systems, it is necessary to divide the information into logical groupings after examining their quality before attempting to analyze it. On the other hand, data quality results are valuable resources for defining quality excellence programs of any information system. Hence, the purpose of this study is to discover and extract knowledge to evaluate and improve data quality in research information systems.
Design/methodology/approach
Clustering in data analysis and exploiting the outputs allows practitioners to gain an in-depth and extensive look at their information to form some logical structures based on what they have found. In this study, data extracted from an information system are used in the first stage. Then, the data quality results are classified into an organized structure based on data quality dimension standards. Next, K-Means clustering, density-based clustering (density-based spatial clustering of applications with noise [DBSCAN]) and hierarchical clustering (balanced iterative reducing and clustering using hierarchies [BIRCH]) are applied and compared to find the most appropriate clustering algorithm for the research information system.
Findings
This paper showed that quality control results of an information system could be categorized through well-known data quality dimensions, including precision, accuracy, completeness, consistency, reputation and timeliness. Furthermore, among different well-known clustering approaches, the BIRCH algorithm of hierarchical clustering methods performs better in data clustering and gives the highest silhouette coefficient value. Next in line is the DBSCAN method, which performs better than the K-Means method.
Research limitations/implications
In the data quality assessment process, the discrepancies identified and the lack of proper classification for inconsistent data have led to unstructured reports, which make statistical analysis of qualitative metadata problems difficult and the observed errors impossible to root out. Therefore, in this study, the evaluation results of data quality have been categorized into various data quality dimensions, based on which multiple analyses have been performed using data mining methods.
Originality/value
Although several pieces of research have been conducted to assess data quality results of research information systems, knowledge extraction from obtained data quality scores is a crucial work that has rarely been studied in the literature. Besides, clustering in data quality analysis and exploiting the outputs allows practitioners to gain an in-depth and extensive look at their information to form some logical structures based on what they have found.
Toan Van Nguyen, Minh Hoang Do and Jaewon Jo
Abstract
Purpose
Collision avoidance is considered a crucial issue in mobile robotic navigation to guarantee the safety of robots as well as their working surroundings, especially humans. Therefore, the position and velocity of obstacles appearing in the working space of the self-driving mobile robot should be observed to help the robot predict collisions and choose traversable directions. This paper aims to propose a new approach for obstacle tracking, dubbed MoDeT.
Design/methodology/approach
First, all long lines, such as walls, are extracted from the 2D-laser scan and considered as static obstacles (or mapped obstacles). Second, a density-based procedure is implemented to cluster nonwall obstacles. These clusters are then geometrically fitted as ellipses. Finally, the combination of Kalman filter and global nearest-neighbor (GNN) method is used to track obstacles’ position and velocity.
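The data association step of a global nearest-neighbor (GNN) tracker can be sketched as follows. This is a minimal illustration, not the MoDeT implementation: it assumes the Kalman filter has already produced a predicted position for each track, it uses plain Euclidean distance as the cost, and it searches all assignments by brute force, which is only practical for a handful of obstacles. The function name `gnn_assign` and the sample coordinates are hypothetical.

```python
from itertools import permutations
from math import dist


def gnn_assign(predicted, detections):
    """Assign one detection to each track so that the total distance is minimal.

    Returns a list where entry t is the index of the detection given to track t.
    Assumes there are at least as many detections as tracks.
    """
    best = min(
        permutations(range(len(detections)), len(predicted)),
        key=lambda perm: sum(
            dist(predicted[t], detections[d]) for t, d in enumerate(perm)
        ),
    )
    return list(best)


# Predicted track positions (e.g. from a Kalman prediction step) and new detections.
predicted = [(0.0, 0.0), (5.0, 5.0)]
detections = [(5.2, 4.9), (0.1, -0.1)]
assignment = gnn_assign(predicted, detections)
```

Here track 0 is matched to detection 1 and track 1 to detection 0. A production tracker would typically replace the brute-force search with an assignment solver and gate out implausible matches.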
Findings
The proposed method (MoDeT) is experimentally verified by using an autonomous mobile robot (AMR) named AMR SR300. The MoDeT is found to provide better performance in comparison with previous methods for self-driving mobile robots.
Research limitations/implications
The robot can only see a part of the object, depending on the light detection and ranging scan view. As a consequence, geometrical features of the obstacle are sometimes changed, especially when the robot is moving fast.
Practical implications
The proposed method serves navigation and path planning for the AMR.
Originality/value
(a) Proposing an extended weighted line extractor, (b) proposing a density-based obstacle detection method and (c) combining the methods in (a) and (b) with a constant-acceleration Kalman filter and GNN to obtain obstacles’ properties.
Desh Deepak Sharma and S.N. Singh
Abstract
Purpose
This paper aims to detect abnormal energy uses which relate to undetected consumption, thefts, measurement errors, etc. The detection of irregular power consumption, with variation in irregularities, helps the electric utilities in planning and making strategies to transfer reliable and efficient electricity from generators to the end-users. Abnormal peak load demand is a kind of aberration that needs to be detected.
Design/methodology/approach
This paper proposes a Density-Based Micro Spatial Clustering of Applications with Noise (DBMSCAN) clustering algorithm, which is implemented to identify ranked irregular electricity consumption and the occurrence of peak and valley loads. The proposed algorithm introduces two parameters, α and β. Once the global parameters are set, tuning α and β yields a varied number of micro-clusters and ranked irregular consumptions, respectively. A new term, Irregularity Variance, is introduced in the suggested algorithm to find variation in the irregular consumptions according to anomalous behaviors.
Findings
No set of global DBSCAN parameters is found to be suitable for clustering the load pattern data of a practical system. The proposed DBMSCAN approach finds clustering results and ranked irregular consumption, such as different types of abnormal peak demands, sudden changes in demand and nearly zero demand, with computational ease and without any iterative control method.
Originality/value
The DBMSCAN can be applied to any data set to find ranked outliers. It is an unsupervised clustering approach that finds clustering results and ranked irregular consumptions while focusing on the analysis of, and variations in, anomalous behaviors in electricity consumption.
Krista Nerinckx, Jan Vierendeels and Erik Dick
Abstract
Purpose
To present conversion of the advection upwind splitting method (AUSM+) from the conventional density‐based and coupled formulation to the pressure‐based and segregated formulation.
Design/methodology/approach
The spatial discretization is done by a finite volume method. A collocated grid cell‐center formulation is used. The pressure‐correction procedure is set up in the usual way for a compressible flow problem. The conventional Rhie‐Chow interpolation methodology for the determination of the transporting velocity, and the conventional central interpolation for the pressure at the control volume faces, are replaced by AUSM+ definitions.
Findings
The AUSM+ flux definitions are naturally well suited for use in a collocated pressure‐correction formulation. The formulation does not require extensions to these flux definitions. As a consequence, the results of a density‐based fully coupled method are identical to the results of a pressure‐based segregated formulation. The advantage of the pressure‐correction method over the density‐based method is its higher efficiency for low Mach number applications. The advantage of the AUSM+ flux definition for the transporting velocity over the conventional Rhie‐Chow interpolation is its improved accuracy in high Mach number flows. As a consequence, the combination of AUSM+ with a pressure‐correction method leads to an algorithm with improved performance for flows at all Mach numbers.
Originality/value
A new methodology, with obvious advantages, is composed by the combination of ingredients from an existing spatial discretization method (AUSM+) and an existing time stepping method (pressure‐correction).
Abstract
Purpose
The purpose of this paper is to propose a simple, fast, and effective method for detecting measurement errors in data collected with low-cost environmental sensors typically used in building monitoring, evaluation, and automation applications.
Design/methodology/approach
The method combines two unsupervised learning techniques: a distance-based anomaly detection algorithm analyzing temporal patterns in data, and a density-based algorithm comparing data across different spatially related sensors.
Findings
Results of tests using 60,000 observations of temperature and humidity collected from 20 sensors during three weeks show that the method effectively identified measurement errors and was not affected by valid unusual events. Precision, recall, and accuracy were 0.999 or higher for all cases tested.
Originality/value
The method is simple to implement, computationally inexpensive, and fast enough to be used in real-time with modest open-source microprocessors and a wide variety of environmental sensors. It is a robust and convenient approach for overcoming the hardware constraints of low-cost sensors, allowing users to improve the quality of collected data at almost no additional cost and effort.
Jie Ma, Zhiyuan Hao and Mo Hu
Abstract
Purpose
The density peak clustering algorithm (DP) is proposed to identify cluster centers by two parameters, i.e. ρ value (local density) and δ value (the distance between a point and another point with a higher ρ value). According to the center-identifying principle of the DP, the potential cluster centers should have a higher ρ value and a higher δ value than other points. However, this principle may limit the DP from identifying some categories with multi-centers or the centers in lower-density regions. In addition, the improper assignment strategy of the DP could cause a wrong assignment result for the non-center points. This paper aims to address the aforementioned issues and improve the clustering performance of the DP.
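The two quantities described above can be computed directly from pairwise distances. The sketch below covers only the baseline DP quantities, not the TMsDP extensions. It assumes a cutoff kernel for ρ (the number of other points closer than a cutoff distance `d_c`) and, for a point with no denser neighbor, sets δ to its largest distance to any other point; the function name and the sample data are hypothetical.

```python
from math import dist


def density_peak_scores(points, d_c):
    """Return (rho, delta) lists for density peak clustering.

    rho[i]   : number of other points closer than the cutoff distance d_c.
    delta[i] : distance to the nearest point with a higher rho; for a point
               with no denser point, the largest distance to any other point.
    """
    n = len(points)
    rho = [
        sum(1 for j in range(n) if j != i and dist(points[i], points[j]) < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        to_denser = [dist(points[i], points[j]) for j in range(n) if rho[j] > rho[i]]
        if to_denser:
            delta.append(min(to_denser))
        else:
            delta.append(max(dist(points[i], points[j]) for j in range(n) if j != i))
    return rho, delta


# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = density_peak_scores(points, d_c=1.5)
```

Candidate centers are the points where both ρ and δ are large. In this small example the sparser group never outranks the denser one on ρ, which mirrors the limitation with centers in lower-density regions noted in the abstract.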
Design/methodology/approach
First, to identify as many potential cluster centers as possible, the authors construct a point-domain by introducing the pinhole imaging strategy to extend the searching range of the potential cluster centers. Second, they design different novel calculation methods for calculating the domain distance, point-domain density and domain similarity. Third, they adopt domain similarity to achieve the domain merging process and optimize the final clustering results.
Findings
The experimental results on analyzing 12 synthetic data sets and 12 real-world data sets show that two-stage density peak clustering based on multi-strategy optimization (TMsDP) outperforms the DP and other state-of-the-art algorithms.
Originality/value
The authors propose a novel DP-based clustering method, i.e. TMsDP, and transform the relationship between points into that between domains to ultimately further optimize the clustering performance of the DP.
Hongqi Han, Yongsheng Yu, Lijun Wang, Xiaorui Zhai, Yaxin Ran and Jingpeng Han
Abstract
Purpose
The aim of this study is to present a novel approach based on semantic fingerprinting and a clustering algorithm called density-based spatial clustering of applications with noise (DBSCAN), which can be used to convert inventor records into 128-bit semantic fingerprints. Inventor disambiguation is a method used to discover a unique set of underlying inventors and map a set of patents to their corresponding inventors. Resolving the ambiguities between inventors is necessary to improve the quality of the patent database and to ensure accurate entity-level analysis. Most existing methods are based on machine learning and, while they often show good performance, this comes at the cost of time, computational power and storage space.
Design/methodology/approach
Using DBSCAN, the meta and textual data in inventor records are converted into 128-bit semantic fingerprints. However, rather than using a string comparison or cosine similarity to calculate the distance between pair-wise fingerprint records, a binary number comparison function was used in DBSCAN. DBSCAN then clusters the inventor records based on this distance to disambiguate inventor names.
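A distance between two binary fingerprints can be computed with integer operations alone. The abstract does not say which binary comparison function was used, so the Hamming distance below is an assumption chosen for illustration, and the fingerprint values are made up rather than taken from real inventor records.

```python
def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    # XOR leaves a 1 in every position where the two values disagree.
    return bin(fp_a ^ fp_b).count("1")


# Hypothetical 128-bit fingerprints stored as Python integers.
fp_1 = (1 << 127) | 0b1011
fp_2 = (1 << 127) | 0b0010
fp_3 = 0b0010

d_12 = hamming_distance(fp_1, fp_2)  # differ in two low bits
d_23 = hamming_distance(fp_2, fp_3)  # differ only in the top bit
```

A function like this could be passed to a clustering routine as its distance measure, which avoids string comparison or floating-point cosine similarity.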
Findings
Experiments conducted on the PatentsView campaign database of the United States Patent and Trademark Office show that this method disambiguates inventor names with recall greater than 99 per cent, in less time and with substantially smaller storage requirements.
Research limitations/implications
A better semantic fingerprint algorithm and a better distance function may improve precision. Setting of different clustering parameters for each block or other clustering algorithms will be considered to improve the accuracy of the disambiguation results even further.
Originality/value
Compared with the existing methods, the proposed method does not rely on feature selection and complex feature comparison computation. Most importantly, running time and storage requirements are drastically reduced.
Jingwei Guo, Ji Zhang, Yongxiang Zhang, Peijuan Xu, Lutian Li, Zhongqi Xie and Qinglin Li
Abstract
Purpose
Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used density-based clustering algorithm, but it cannot be directly applied to railway investment risk assessment. To overcome the shortcomings of the calculation method and parameter limits of DBSCAN, this paper proposes a new algorithm called Improved Multiple Density-based Spatial clustering of Applications with Noise (IM-DBSCAN) based on DBSCAN and rough set theory.
Design/methodology/approach
First, the authors develop an improved affinity propagation (AP) algorithm, which is then combined with the DBSCAN (hereinafter referred to as AP-DBSCAN for short) to improve the parameter setting and efficiency of the DBSCAN. Second, the IM-DBSCAN algorithm, which consists of the AP-DBSCAN and a modified rough set, is designed to investigate the railway investment risk. Finally, the IM-DBSCAN algorithm is tested on the China–Laos railway's investment risk assessment, and its performance is compared with other related algorithms.
Findings
The IM-DBSCAN algorithm is implemented on the China–Laos railway's investment risk assessment and compared with other related algorithms. The clustering results validate that the AP-DBSCAN algorithm is feasible and efficient in terms of clustering accuracy and operating time. In addition, the experimental results indicate that the IM-DBSCAN algorithm can be used as an effective method for prospective risk assessment in railway investment.
Originality/value
This study proposes the IM-DBSCAN algorithm, which consists of the AP-DBSCAN and a modified rough set, to study railway investment risk. Unlike existing clustering algorithms, AP-DBSCAN puts forward a density calculation method to simplify the process of optimizing DBSCAN parameters. Instead of using the Euclidean distance approach, the cutoff distance method is introduced to improve the similarity measure for optimizing the parameters. The developed AP-DBSCAN is used to classify the China–Laos railway's investment risk indicators more accurately. Combined with a modified rough set, the IM-DBSCAN algorithm is proposed to assess railway investment risk. The contributions of this study can be summarized as follows: (1) Based on AP and DBSCAN, an integrated methodology, AP-DBSCAN, which improves the parameter setting and efficiency, is proposed to classify railway risk indicators. (2) As AP-DBSCAN is a risk classification model rather than a risk calculation model, an IM-DBSCAN algorithm that consists of the AP-DBSCAN and a modified rough set is proposed to assess the railway investment risk. (3) Taking the China–Laos railway as a real-life case study, the effectiveness and superiority of the proposed IM-DBSCAN algorithm are verified through a set of experiments comparing it with other state-of-the-art algorithms.
Samir Al-Janabi and Ryszard Janicki
Abstract
Purpose
Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violation of business rules. However, because of the huge amount of data, manual cleaning alone is infeasible. Methods are therefore required that detect data quality issues automatically and then repair and clean the dirty data. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.
Design/methodology/approach
A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.
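What a conditional functional dependency checks can be shown with a small detection sketch. It covers only violation detection, not the density-based repair step described in the paper, and the rule, attribute names and rows are hypothetical: among rows matching a condition (here `country == "UK"`), equal values of the left-hand attributes (`zip`) must imply equal values of the right-hand attribute (`city`).

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Find left-hand-side values that map to more than one right-hand-side value.

    rows      : list of dicts, one per record.
    condition : dict of attribute -> required value; only matching rows are checked.
    lhs       : list of attribute names that should determine rhs.
    rhs       : attribute name that must be unique for each lhs value.
    """
    groups = defaultdict(set)
    for row in rows:
        if all(row.get(attr) == value for attr, value in condition.items()):
            key = tuple(row[attr] for attr in lhs)
            groups[key].add(row[rhs])
    # A violation is any lhs value associated with conflicting rhs values.
    return {key: values for key, values in groups.items() if len(values) > 1}


rows = [
    {"country": "UK", "zip": "EH1", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH1", "city": "Glasgow"},   # conflicts with the row above
    {"country": "UK", "zip": "G1", "city": "Glasgow"},
    {"country": "US", "zip": "EH1", "city": "Springfield"},  # outside the condition
]
violations = cfd_violations(rows, {"country": "UK"}, ["zip"], "city")
```

Only the `EH1` group is reported, because the US row falls outside the condition. A repair algorithm would then decide which of the conflicting values to keep.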
Findings
This new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.
Originality/value
Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.