Search results

1 – 10 of 101
Article
Publication date: 1 November 2021

Jingwei Guo, Ji Zhang, Yongxiang Zhang, Peijuan Xu, Lutian Li, Zhongqi Xie and Qinglin Li

Abstract

Purpose

Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used density-based clustering algorithm, but it cannot be directly applied to railway investment risk assessment. To overcome the shortcomings of DBSCAN's calculation method and parameter limits, this paper proposes a new algorithm called Improved Multiple Density-based Spatial clustering of Applications with Noise (IM-DBSCAN), based on DBSCAN and rough set theory.

Design/methodology/approach

First, the authors develop an improved affinity propagation (AP) algorithm, which is then combined with the DBSCAN (hereinafter referred to as AP-DBSCAN for short) to improve the parameter setting and efficiency of the DBSCAN. Second, the IM-DBSCAN algorithm, which consists of the AP-DBSCAN and a modified rough set, is designed to investigate the railway investment risk. Finally, the IM-DBSCAN algorithm is tested on the China–Laos railway's investment risk assessment, and its performance is compared with other related algorithms.

Findings

The IM-DBSCAN algorithm is implemented on the China–Laos railway's investment risk assessment and compared with other related algorithms. The clustering results validate that the AP-DBSCAN algorithm is feasible and efficient in terms of clustering accuracy and operating time. In addition, the experimental results indicate that the IM-DBSCAN algorithm can be used as an effective method for prospective risk assessment in railway investment.

Originality/value

This study proposes the IM-DBSCAN algorithm, which consists of the AP-DBSCAN and a modified rough set, to study railway investment risk. Unlike existing clustering algorithms, AP-DBSCAN puts forward a density calculation method to simplify the process of optimizing DBSCAN parameters. Instead of using the Euclidean distance approach, the cutoff distance method is introduced to improve the similarity measure for optimizing the parameters. The developed AP-DBSCAN is used to classify the China–Laos railway's investment risk indicators more accurately. Combined with a modified rough set, the IM-DBSCAN algorithm is proposed to assess the railway investment risk. The contributions of this study can be summarized as follows: (1) Based on AP and DBSCAN, an integrated methodology, AP-DBSCAN, which improves the parameter setting and efficiency, is proposed to classify railway risk indicators. (2) As AP-DBSCAN is a risk classification model rather than a risk calculation model, an IM-DBSCAN algorithm that consists of the AP-DBSCAN and a modified rough set is proposed to assess the railway investment risk. (3) Taking the China–Laos railway as a real-life case study, the effectiveness and superiority of the proposed IM-DBSCAN algorithm are verified through a set of experiments comparing it with other state-of-the-art algorithms.

Details

Data Technologies and Applications, vol. 56 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 25 February 2020

Wolfram Höpken, Marcel Müller, Matthias Fuchs and Maria Lexhagen

Abstract

Purpose

The purpose of this study is to analyse the suitability of photo-sharing platforms, such as Flickr, to extract relevant knowledge on tourists’ spatial movement and point of interest (POI) visitation behaviour and compare the most prominent clustering approaches to identify POIs in various application scenarios.

Design/methodology/approach

The study, first, extracts photo metadata from Flickr, such as upload time, location and user. Then, photo uploads are assigned to latent POIs by density-based spatial clustering of applications with noise (DBSCAN) and k-means clustering algorithms. Finally, association rule analysis (FP-growth algorithm) and sequential pattern mining (generalised sequential pattern algorithm) are used to identify tourists’ behavioural patterns.

Findings

The approach has been demonstrated for the city of Munich, extracting 13,545 photos for the year 2015. POIs, identified by DBSCAN and k-means clustering, could be meaningfully assigned to well-known POIs. By doing so, both techniques show specific advantages for different usage scenarios. Association rule analysis revealed strong rules (support: 1.0-4.6 per cent; lift: 1.4-32.1 per cent), and sequential pattern mining identified relevant frequent visitation sequences (support: 0.6-1.7 per cent).
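The support and lift figures reported above follow the standard association rule definitions. As a minimal sketch, assuming each tourist's visit is represented as a set of POI labels (the data and the function names below are hypothetical and not taken from the study), the two measures can be computed like this:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent.

    lift = support(A and C) / (support(A) * support(C)).
    Raises ZeroDivisionError if either side never occurs.
    """
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (
        support(transactions, antecedent) * support(transactions, consequent)
    )


# Illustrative data only: one set of visited POIs per tourist.
visits = [
    {"POI_A", "POI_B"},
    {"POI_A", "POI_B", "POI_C"},
    {"POI_C"},
    {"POI_A"},
]

rule_support = support(visits, {"POI_A", "POI_B"})
rule_lift = lift(visits, {"POI_A"}, {"POI_B"})
```

A lift above 1 means the two POIs are visited together more often than independence would predict. Note that lift is a ratio, not a percentage.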

Research limitations/implications

As a theoretical contribution, this study comparatively analyses the suitability of different clustering techniques for identifying POIs from photo upload data as an input to association rule analysis and sequential pattern mining, which serve as alternative but also complementary techniques for analysing tourists’ spatial behaviour.

Practical implications

From a practical perspective, the study highlights that big data sources, such as Flickr, show the potential to effectively substitute traditional data sources for analysing tourists’ spatial behaviour and movement patterns within a destination. Especially, the approach offers the advantage of being fully automatic and executable in a real-time environment.

Originality/value

The study presents an approach to identify POIs by clustering photo uploads on social media platforms and to analyse tourists’ spatial behaviour by association rule analysis and sequential pattern mining. The study gains novel insights into the suitability of different clustering techniques to identify POIs in different application scenarios.

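Abstract (translated from Chinese)

Purpose

This paper aims to analyse the suitability of the photo-sharing platform Flickr for extracting information on tourists’ spatial movement and point of interest (POI) visitation behaviour, and to compare the most prominent clustering approaches for identifying POIs in different scenarios.

Design/methodology/approach

The paper first extracts photo metadata from Flickr, such as upload time, location and user. Second, it uses the DBSCAN and k-means clustering algorithms to assign photo uploads to latent POIs. Finally, it applies association rule analysis (FP-growth algorithm) and sequential pattern mining (GSP algorithm) to identify tourists’ behavioural patterns.

Findings

The paper takes the city of Munich as its sample, extracting 13,545 photos for the year 2015. The POIs identified by DBSCAN and k-means clustering could be assigned to well-known POIs, which shows the respective advantages of the two techniques for different uses. Association rule analysis revealed strong rules (support: 1%-4.6%; lift: 1.4%-32.1%), and sequential pattern mining identified relevant frequent visitation sequences (support: 0.6%-1.7%).

Research limitations/implications

The theoretical contribution of this paper is a comparative analysis, based on photo data, of different clustering techniques for identifying POIs, and a demonstration that association rule analysis and sequential pattern mining are techniques with distinct strengths that complement each other in analysing tourists’ spatial behaviour.

Practical implications

The practical contribution of this paper is to highlight that big data sources, such as Flickr, have the potential to effectively substitute traditional data for analysing tourists’ spatial behaviour and movement patterns within a destination. In particular, the approach has the advantage of being automatic and executable in real time.

Originality/value

The paper presents an approach that identifies POIs by clustering photo uploads on social media and analyses tourists’ spatial behaviour through association rule analysis and sequential pattern mining. It offers novel insights into the suitability of different clustering techniques for identifying POIs in different application scenarios.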

Article
Publication date: 3 June 2019

Hongqi Han, Yongsheng Yu, Lijun Wang, Xiaorui Zhai, Yaxin Ran and Jingpeng Han

Abstract

Purpose

The aim of this study is to present a novel approach based on semantic fingerprinting and a clustering algorithm called density-based spatial clustering of applications with noise (DBSCAN), which can be used to convert inventor records into 128-bit semantic fingerprints. Inventor disambiguation is a method used to discover a unique set of underlying inventors and map a set of patents to their corresponding inventors. Resolving the ambiguities between inventors is necessary to improve the quality of the patent database and to ensure accurate entity-level analysis. Most existing methods are based on machine learning and, while they often show good performance, this comes at the cost of time, computational power and storage space.

Design/methodology/approach

Using DBSCAN, the meta and textual data in inventor records are converted into 128-bit semantic fingerprints. However, rather than using a string comparison or cosine similarity to calculate the distance between pair-wise fingerprint records, a binary number comparison function was used in DBSCAN. DBSCAN then clusters the inventor records based on this distance to disambiguate inventor names.
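The abstract does not specify the binary number comparison function. One plausible reading, shown here purely as an assumption, is a Hamming distance: XOR the two fingerprints and count the differing bits. The sketch below stores each 128-bit fingerprint as a Python integer; the function name is hypothetical.

```python
FINGERPRINT_BITS = 128  # fingerprint width stated in the abstract


def hamming_distance(fp_a: int, fp_b: int) -> int:
    """Number of bit positions in which two fingerprints differ.

    Assumption: fingerprints are non-negative integers of at most
    FINGERPRINT_BITS bits. XOR leaves a 1 wherever the inputs differ.
    """
    for fp in (fp_a, fp_b):
        if fp < 0 or fp >> FINGERPRINT_BITS:
            raise ValueError("fingerprint does not fit in 128 bits")
    return bin(fp_a ^ fp_b).count("1")


# Two illustrative fingerprints that differ in exactly two bits.
example = hamming_distance(0b1011, 0b0010)
```

A bitwise comparison of this kind avoids string handling and floating-point arithmetic, which is consistent with the reduced running time the abstract reports, although the authors' actual function may differ.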

Findings

Experiments conducted on the PatentsView campaign database of the United States Patent and Trademark Office show that this method disambiguates inventor names with recall greater than 99 per cent in less time and with substantially smaller storage requirement.

Research limitations/implications

A better semantic fingerprint algorithm and a better distance function may improve precision. Setting of different clustering parameters for each block or other clustering algorithms will be considered to improve the accuracy of the disambiguation results even further.

Originality/value

Compared with the existing methods, the proposed method does not rely on feature selection and complex feature comparison computation. Most importantly, running time and storage requirements are drastically reduced.

Details

The Electronic Library, vol. 37 no. 2
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 25 February 2019

Celia Hireche and Habiba Drias

Abstract

Purpose

This paper is an extended version of Hireche and Drias (2018) presented at the WORLD-CIST’18 conference. The major contribution of this work is defined in two phases. The first phase is the use of data mining technologies, especially data preprocessing tools, on instances of hard and complex problems prior to their resolution. The authors focus on clustering the instance with the aim of reducing its complexity. The second phase is to solve the instance using the knowledge acquired in the first phase together with problem-solving methods. The paper aims to discuss these issues.

Design/methodology/approach

Because different clustering techniques may offer different results for a data set, prior knowledge of the data helps to determine the adequate type of clustering to apply. The first part of this work deals with a study of the descriptive characteristics of the data in order to better understand them. The dispersion and distribution of the variables in the problem instances are explored in particular to determine the most suitable clustering technique.

Findings

Several experiments were performed on different kinds of instances and different kinds of data distribution. The obtained results show the importance and the efficiency of the proposed appropriate preprocessing approaches prior to problem solving.

Practical implications

In this paper, the proposed approach is developed for the Boolean satisfiability problem because of its well-recognised importance. The aim is to reduce complexity, which makes the problem easier to solve and, in particular, saves considerable time.

Originality/value

The state of the art in problem solving describes plenty of algorithms and solvers for hard problems that remain a challenge because of their complexity. The originality of this work lies in the investigation of appropriate preprocessing techniques to tackle and overcome this complexity prior to the resolution, which becomes easier and saves considerable time.

Details

Data Technologies and Applications, vol. 53 no. 1
Type: Research Article
ISSN: 2514-9288

Open Access
Article
Publication date: 26 April 2024

Xue Xin, Yuepeng Jiao, Yunfeng Zhang, Ming Liang and Zhanyong Yao

Abstract

Purpose

This study aims to ensure reliable analysis of dynamic responses in asphalt pavement structures. It investigates noise reduction and data mining techniques for pavement dynamic response signals.

Design/methodology/approach

The paper first conducts time-frequency analysis on pavement dynamic response signals. It also uses two common noise reduction methods, namely, low-pass filtering and wavelet decomposition reconstruction, to evaluate their effectiveness in reducing noise in these signals. Furthermore, as these signals are generated in response to vehicle loading, they contain a substantial amount of data and are prone to environmental interference, potentially resulting in outliers. Hence, it becomes crucial to extract dynamic strain response features (e.g. peaks and peak intervals) efficiently and in real time.

Findings

The study introduces an improved density-based spatial clustering of applications with noise (DBSCAN) algorithm for identifying outliers in denoised data. The results demonstrate that low-pass filtering is highly effective in reducing noise in pavement dynamic response signals within specified frequency ranges. Testing shows that the improved DBSCAN algorithm effectively identifies outliers in these signals. Furthermore, the peak detection process, using the enhanced findpeaks function, consistently achieves excellent performance in identifying peak values, even when complex multi-axle heavy-duty truck strain signals are present.

Originality/value

The authors identified a suitable frequency domain range for low-pass filtering in asphalt road dynamic response signals, revealing minimal amplitude loss and effective strain information reflection between road layers. Furthermore, the authors introduced the DBSCAN-based anomaly data detection method and enhancements to the Matlab findpeaks function, enabling the detection of anomalies in road sensor data and automated peak identification.

Details

Smart and Resilient Transportation, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2632-0487

Article
Publication date: 2 November 2015

Desh Deepak Sharma and S.N. Singh

Abstract

Purpose

This paper aims to detect abnormal energy uses which relate to undetected consumption, thefts, measurement errors, etc. The detection of irregular power consumption, with variation in irregularities, helps the electric utilities in planning and making strategies to transfer reliable and efficient electricity from generators to the end-users. Abnormal peak load demand is a kind of aberration that needs to be detected.

Design/methodology/approach

This paper proposes a Density-Based Micro Spatial Clustering of Applications with Noise (DBMSCAN) clustering algorithm, which is implemented for identification of ranked irregular electricity consumption and occurrence of peak and valley loads. In the proposed algorithm, two parameters, α and β, are introduced, and, on tuning of these parameters, after setting of global parameters, a varied number of micro-clusters and ranked irregular consumptions, respectively, are obtained. An approach is incorporated with the introduction of a new term, Irregularity Variance, in the suggested algorithm to find variation in the irregular consumptions according to anomalous behaviors.

Findings

No single set of global DBSCAN parameters is found for clustering the load pattern data of a practical system. The proposed DBMSCAN approach finds clustering results and ranked irregular consumption, such as different types of abnormal peak demands, sudden change in the demand, nearly zero demand, etc., with computational ease and without any iterative control method.

Originality/value

The DBMSCAN can be applied on any data set to find ranked outliers. It is an unsupervised approach of clustering technique to find the clustering results and ranked irregular consumptions while focusing on the analysis of and variations in anomalous behaviors in electricity consumption.

Details

International Journal of Energy Sector Management, vol. 9 no. 4
Type: Research Article
ISSN: 1750-6220

Article
Publication date: 3 April 2018

Ha Yoon Song and Dabin You

Abstract

Purpose

The purpose of this paper is to understand the urban mobility model.

Design/methodology/approach

The authors have used deep learning as the tool of analysis and taxi transportation data as the source of mobility data.

Findings

The authors have found urban mobility models of weekdays and weekends for a metropolitan city.

Research limitations/implications

There could be many sources of transportation data but the authors have used public taxi data solely.

Practical implications

With the urban mobility model proposed in this paper, other researchers and industries can improve their own services.

Social implications

The result would be a good model for urban traffic control or traffic modeling.

Originality/value

This work is an improvement of the paper published in The 15th International Conference on Advances in Mobile Computing & Multimedia (MoMM2017), prepared on the recommendation of the conference editor, Ismail Khalil, IJPCC editor-in-chief.

Details

International Journal of Pervasive Computing and Communications, vol. 14 no. 1
Type: Research Article
ISSN: 1742-7371

Article
Publication date: 18 June 2021

Shuai Luo, Hongwei Liu and Ershi Qi

Abstract

Purpose

The purpose of this paper is to recognize and label the faults in wind turbines with a new density-based clustering algorithm, named contour density scanning clustering (CDSC) algorithm.

Design/methodology/approach

The algorithm includes four components: (1) computation of neighborhood density, (2) selection of core and noise data, (3) scanning core data and (4) updating clusters. The proposed algorithm considers the relationship between neighborhood data points according to a contour density scanning strategy.
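The abstract does not give enough detail to reproduce CDSC or its contour density scanning strategy. As a point of reference only, the sketch below implements classic DBSCAN, the baseline that CDSC is compared against, to show what neighborhood density computation, core/noise selection and cluster expansion look like in their simplest form. The names `region_query`, `dbscan`, `eps` and `min_pts` are illustrative choices.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Classic DBSCAN; returns one label per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Illustrative data: two dense groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
sample_labels = dbscan(sample, eps=1.5, min_pts=3)
```

This brute-force version recomputes every neighborhood in O(n) time, so it is quadratic overall. It is meant for reading, not for industrial vibration data.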

Findings

The first experiment is conducted with artificial data to validate that the proposed CDSC algorithm is suitable for handling data points with arbitrary shapes. The second experiment, with industrial gearbox vibration data, is carried out to demonstrate the time complexity and accuracy of the proposed CDSC algorithm in comparison with other conventional algorithms, including k-means, density-based spatial clustering of applications with noise, density peaking clustering, neighborhood grid clustering, support vector clustering, random forest, core fusion-based density peak clustering, AdaBoost and extreme gradient boosting. The third experiment is conducted with an industrial bearing vibration data set to highlight that the CDSC algorithm can automatically track the emerging fault patterns of bearings in wind turbines over time.

Originality/value

Data points with different densities are clustered using three strategies: direct density reachability, density reachability and density connectivity. A contour density scanning strategy is proposed to determine whether the data points with the same density belong to one cluster. The proposed CDSC algorithm achieves automatic clustering, which means that the trends of the fault pattern can be tracked.

Details

Data Technologies and Applications, vol. 55 no. 5
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 3 November 2022

Reza Edris Abadi, Mohammad Javad Ershadi and Seyed Taghi Akhavan Niaki

Abstract

Purpose

The overall goal of the data mining process is to extract information from an extensive data set and make it understandable for further use. When working with large volumes of unstructured data in research information systems, it is necessary to divide the information into logical groupings after examining their quality before attempting to analyze it. On the other hand, data quality results are valuable resources for defining quality excellence programs of any information system. Hence, the purpose of this study is to discover and extract knowledge to evaluate and improve data quality in research information systems.

Design/methodology/approach

Clustering in data analysis and exploiting the outputs allows practitioners to gain an in-depth and extensive look at their information to form some logical structures based on what they have found. In this study, data extracted from an information system are used in the first stage. Then, the data quality results are classified into an organized structure based on data quality dimension standards. Next, clustering algorithms (K-Means), density-based clustering (density-based spatial clustering of applications with noise [DBSCAN]) and hierarchical clustering (balanced iterative reducing and clustering using hierarchies [BIRCH]) are applied to compare and find the most appropriate clustering algorithms in the research information system.

Findings

This paper showed that quality control results of an information system could be categorized through well-known data quality dimensions, including precision, accuracy, completeness, consistency, reputation and timeliness. Furthermore, among different well-known clustering approaches, the BIRCH algorithm of hierarchical clustering methods performs better in data clustering and gives the highest silhouette coefficient value. Next in line is the DBSCAN method, which performs better than the K-Means method.
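The silhouette coefficient used to rank the algorithms above has a standard definition: for each point, compare its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), and score it as (b - a) / max(a, b). The sketch below is a minimal illustration under the assumption of Euclidean distance and at least two clusters; the function name and data are hypothetical and unrelated to the study's data set.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # dist(p, p) is 0, so dividing by len(own) - 1 excludes the point itself.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Illustrative data: the same four points under a good and a poor labelling.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(pts, [0, 0, 1, 1])
poor = silhouette_score(pts, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest points sit closer to another cluster than to their own.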

Research limitations/implications

In the data quality assessment process, the discrepancies identified and the lack of proper classification for inconsistent data have led to unstructured reports, making the statistical analysis of qualitative metadata problems difficult and making it impossible to root out the observed errors. Therefore, in this study, the evaluation results of data quality have been categorized into various data quality dimensions, based on which multiple analyses have been performed in the form of data mining methods.

Originality/value

Although several pieces of research have been conducted to assess data quality results of research information systems, knowledge extraction from obtained data quality scores is a crucial work that has rarely been studied in the literature. Besides, clustering in data quality analysis and exploiting the outputs allows practitioners to gain an in-depth and extensive look at their information to form some logical structures based on what they have found.

Details

Information Discovery and Delivery, vol. 51 no. 4
Type: Research Article
ISSN: 2398-6247

Article
Publication date: 26 May 2023

Chunhua Liu, Ming Li, Peng Chen and Chaoyun Zhang

Abstract

Purpose

This study aims to solve the problems of ambiguous localization, heavy computation, poor real-time performance and limited applicability in bolt thread defect detection.

Design/methodology/approach

First, the acquired ultrasound image is used to acquire the larger area of the image, which is set as the compliant threaded area. Second, based on the determined coordinates of the center point in each selected region, the set of coordinates on the left and right sides of the bolts is acquired by the DBSCAN method with parameters eps and MinPts, which are determined by the data set dimension D and the k-distance curve. Finally, the defect detection boundary line fitting is completed using the acquired coordinate set, and the relationship between the distance from each detection point to the curve and d, which is obtained from the measurement of the standard bolt sample with known thread defect, is used to locate the bolt thread defect simultaneously.
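The abstract does not state the exact rule the authors use to read eps off the k-distance curve. As a minimal sketch of the curve itself, assuming Euclidean distance and a caller-chosen k (typically tied to MinPts), the function below returns each point's distance to its k-th nearest neighbor, sorted in descending order. The function name and sample points are hypothetical.

```python
from math import dist


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending.

    Plotting the returned list gives the k-distance curve; eps is usually
    picked near the "elbow" where the curve flattens out.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result, reverse=True)


# Illustrative data: three close points and one far away.
curve = k_distances([(0, 0), (0, 1), (0, 2), (10, 0)], k=1)
```

In this toy example the single large value at the start of the curve corresponds to the isolated point, so an eps chosen just below it would leave that point as noise.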

Findings

In this paper, the bolt thread defect detection method with ultrasonic image is proposed; meanwhile, the ultrasonic image acquisition system is designed to complete the real-time localization of bolt thread defects.

Originality/value

The detection results show that the method can effectively detect bolt thread defects and locate the bolt thread defect location with wide applicability, small calculation and good real-time performance.

Details

Anti-Corrosion Methods and Materials, vol. 70 no. 4
Type: Research Article
ISSN: 0003-5599
