Search results
1 – 10 of 134
Jingwei Guo, Ji Zhang, Yongxiang Zhang, Peijuan Xu, Lutian Li, Zhongqi Xie and Qinglin Li
Abstract
Purpose
Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used density-based clustering algorithm, but it cannot be applied directly to railway investment risk assessment. To overcome the shortcomings of DBSCAN's calculation method and parameter limits, this paper proposes a new algorithm called Improved Multiple Density-based Spatial clustering of Applications with Noise (IM-DBSCAN), based on DBSCAN and rough set theory.
Design/methodology/approach
First, the authors develop an improved affinity propagation (AP) algorithm, which is then combined with DBSCAN (hereinafter AP-DBSCAN) to improve the parameter setting and efficiency of DBSCAN. Second, the IM-DBSCAN algorithm, which consists of the AP-DBSCAN and a modified rough set, is designed to investigate railway investment risk. Finally, the IM-DBSCAN algorithm is tested on the China–Laos railway's investment risk assessment, and its performance is compared with that of other related algorithms.
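For context, standard DBSCAN is governed by two parameters: a neighborhood radius and a minimum neighbor count. The following is a minimal pure-Python sketch of plain DBSCAN only, not of AP-DBSCAN or IM-DBSCAN; the names `eps`, `min_pts` and `region_query` and the toy points are assumptions chosen for illustration.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# Toy data (hypothetical): two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

The result depends strongly on `eps` and `min_pts`: with `eps=0.05`, for instance, every point in this toy set would be labeled noise, which illustrates why parameter setting matters.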
Findings
The IM-DBSCAN algorithm is applied to the China–Laos railway's investment risk assessment and compared with other related algorithms. The clustering results validate that the AP-DBSCAN algorithm is feasible and efficient in terms of clustering accuracy and operating time. In addition, the experimental results indicate that the IM-DBSCAN algorithm can be used as an effective method for prospective risk assessment in railway investment.
Originality/value
This study proposes the IM-DBSCAN algorithm, which consists of the AP-DBSCAN and a modified rough set, to study railway investment risk. Unlike existing clustering algorithms, AP-DBSCAN puts forward a density calculation method to simplify the process of optimizing DBSCAN parameters. Instead of the Euclidean distance approach, the cutoff distance method is introduced to improve the similarity measure used for optimizing the parameters. The developed AP-DBSCAN is used to classify the China–Laos railway's investment risk indicators more accurately. Combined with a modified rough set, the IM-DBSCAN algorithm is proposed to assess railway investment risk. The contributions of this study can be summarized as follows: (1) Based on AP and DBSCAN, an integrated methodology, AP-DBSCAN, which improves parameter setting and efficiency, is proposed to classify railway risk indicators. (2) As AP-DBSCAN is a risk classification model rather than a risk calculation model, an IM-DBSCAN algorithm that consists of the AP-DBSCAN and a modified rough set is proposed to assess railway investment risk. (3) Taking the China–Laos railway as a real-life case study, the effectiveness and superiority of the proposed IM-DBSCAN algorithm are verified through a set of experiments comparing it with other state-of-the-art algorithms.
Shuai Luo, Hongwei Liu and Ershi Qi
Abstract
Purpose
The purpose of this paper is to recognize and label the faults in wind turbines with a new density-based clustering algorithm, named contour density scanning clustering (CDSC) algorithm.
Design/methodology/approach
The algorithm includes four components: (1) computation of neighborhood density, (2) selection of core and noise data, (3) scanning core data and (4) updating clusters. The proposed algorithm considers the relationship between neighborhood data points according to a contour density scanning strategy.
Findings
The first experiment is conducted with artificial data to validate that the proposed CDSC algorithm is suitable for handling data points with arbitrary shapes. The second experiment, with industrial gearbox vibration data, compares the time complexity and accuracy of the proposed CDSC algorithm with those of other conventional clustering algorithms, including k-means, density-based spatial clustering of applications with noise, density peaking clustering, neighborhood grid clustering, support vector clustering, random forest, core fusion-based density peak clustering, AdaBoost and extreme gradient boosting. The third experiment is conducted with an industrial bearing vibration data set to highlight that the CDSC algorithm can automatically track the emerging fault patterns of bearings in wind turbines over time.
Originality/value
Data points with different densities are clustered using three strategies: direct density reachability, density reachability and density connectivity. A contour density scanning strategy is proposed to determine whether data points with the same density belong to one cluster. The proposed CDSC algorithm achieves automatic clustering, which means that the trends of the fault pattern can be tracked.
Toan Van Nguyen, Minh Hoang Do and Jaewon Jo
Abstract
Purpose
Collision avoidance is considered a crucial issue in mobile robotic navigation, as it guarantees the safety of robots and of their working surroundings, especially humans. Therefore, the position and velocity of obstacles appearing in the working space of a self-driving mobile robot should be observed to help the robot predict collisions and choose traversable directions. This paper aims to propose a new approach for obstacle tracking, dubbed MoDeT.
Design/methodology/approach
First, all long lines, such as walls, are extracted from the 2D-laser scan and considered as static obstacles (or mapped obstacles). Second, a density-based procedure is implemented to cluster nonwall obstacles. These clusters are then geometrically fitted as ellipses. Finally, the combination of Kalman filter and global nearest-neighbor (GNN) method is used to track obstacles’ position and velocity.
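The data-association step can be illustrated with a small sketch. Global nearest-neighbor (GNN) assignment picks the one-to-one pairing of tracks and detections with the lowest total cost, rather than matching each track greedily. The code below is a brute-force illustration under stated assumptions: track positions are taken as already predicted (for example by a Kalman filter, which is not shown), the cost is plain Euclidean distance, and `gnn_assign` is a hypothetical helper, not the MoDeT implementation.

```python
from itertools import permutations
from math import dist

def gnn_assign(tracks, detections):
    """Return (assignment, cost): assignment[t] is the detection index for track t.

    Brute force over all one-to-one pairings; only suitable for a handful of
    obstacles. Assumes len(detections) >= len(tracks).
    """
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(detections)), len(tracks)):
        cost = sum(dist(tracks[t], detections[d]) for t, d in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best), best_cost

# Hypothetical predicted track positions and new detections (metres).
tracks = [(0.0, 0.0), (2.0, 0.0)]
detections = [(1.2, 0.0), (2.1, 0.0)]
assignment, total = gnn_assign(tracks, detections)
print(assignment, round(total, 3))
```

Practical trackers usually replace the brute-force search with an assignment solver such as the Hungarian algorithm and add gating so that implausible pairings are excluded.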
Findings
The proposed method (MoDeT) is experimentally verified by using an autonomous mobile robot (AMR) named AMR SR300. The MoDeT is found to provide better performance in comparison with previous methods for self-driving mobile robots.
Research limitations/implications
The robot can only see a part of the object, depending on the light detection and ranging scan view. As a consequence, geometrical features of the obstacle are sometimes changed, especially when the robot is moving fast.
Practical implications
The proposed method serves navigation and path planning for the AMR.
Originality/value
(a) Proposing an extended weighted line extractor, (b) proposing a density-based obstacle detection method and (c) combining the methods in (a) and (b) with a constant-acceleration Kalman filter and GNN to obtain obstacles’ properties.
Reza Edris Abadi, Mohammad Javad Ershadi and Seyed Taghi Akhavan Niaki
Abstract
Purpose
The overall goal of the data mining process is to extract information from an extensive data set and make it understandable for further use. When working with large volumes of unstructured data in research information systems, it is necessary to divide the information into logical groupings after examining their quality before attempting to analyze it. On the other hand, data quality results are valuable resources for defining quality excellence programs of any information system. Hence, the purpose of this study is to discover and extract knowledge to evaluate and improve data quality in research information systems.
Design/methodology/approach
Clustering in data analysis and exploiting the outputs allows practitioners to gain an in-depth and extensive look at their information to form some logical structures based on what they have found. In this study, data extracted from an information system are used in the first stage. Then, the data quality results are classified into an organized structure based on data quality dimension standards. Next, K-Means, density-based clustering (density-based spatial clustering of applications with noise [DBSCAN]) and hierarchical clustering (balanced iterative reducing and clustering using hierarchies [BIRCH]) are applied and compared to find the most appropriate clustering algorithm for the research information system.
Findings
This paper showed that quality control results of an information system could be categorized through well-known data quality dimensions, including precision, accuracy, completeness, consistency, reputation and timeliness. Furthermore, among different well-known clustering approaches, the BIRCH algorithm of hierarchical clustering methods performs better in data clustering and gives the highest silhouette coefficient value. Next in line is the DBSCAN method, which performs better than the K-Means method.
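The silhouette coefficient used for this comparison can be written out directly. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the score is `(b - a) / max(a, b)`. The sketch below is a generic pure-Python version on made-up points, not the study's data or code; it assumes Euclidean distance and at least two clusters.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself is included in the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across it
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters; values below 0 indicate points that sit closer to another cluster than to their own.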
Research limitations/implications
In the data quality assessment process, the discrepancies identified and the lack of proper classification for inconsistent data have led to unstructured reports, making statistical analysis of qualitative metadata problems difficult and the observed errors impossible to root out. Therefore, in this study, the evaluation results of data quality have been categorized into various data quality dimensions, based on which multiple analyses have been performed in the form of data mining methods.
Originality/value
Although several pieces of research have been conducted to assess data quality results of research information systems, knowledge extraction from obtained data quality scores is a crucial work that has rarely been studied in the literature. Besides, clustering in data quality analysis and exploiting the outputs allows practitioners to gain an in-depth and extensive look at their information to form some logical structures based on what they have found.
Desh Deepak Sharma and S.N. Singh
Abstract
Purpose
This paper aims to detect abnormal energy uses which relate to undetected consumption, thefts, measurement errors, etc. The detection of irregular power consumption, with variation in irregularities, helps the electric utilities in planning and making strategies to transfer reliable and efficient electricity from generators to the end-users. Abnormal peak load demand is a kind of aberration that needs to be detected.
Design/methodology/approach
This paper proposes a Density-Based Micro Spatial Clustering of Applications with Noise (DBMSCAN) clustering algorithm, which is implemented to identify ranked irregular electricity consumption and the occurrence of peak and valley loads. The proposed algorithm introduces two parameters, α and β; tuning them after the global parameters have been set yields a varied number of micro-clusters and ranked irregular consumptions, respectively. A new term, Irregularity Variance, is introduced in the algorithm to capture variation in the irregular consumptions according to anomalous behaviors.
Findings
No set of global DBSCAN parameters is found for clustering the load pattern data of a practical system. The proposed DBMSCAN approach finds clustering results and ranked irregular consumption, such as different types of abnormal peak demand, sudden changes in demand and nearly zero demand, with computational ease and without any iterative control method.
Originality/value
The DBMSCAN can be applied to any data set to find ranked outliers. It is an unsupervised clustering approach that finds the clustering results and ranked irregular consumptions while focusing on the analysis of, and variations in, anomalous behaviors in electricity consumption.
Emilio Pindado and Ramo Barrena
Abstract
Purpose
This paper investigates the use of Twitter for studying the social representations of different regions across the world towards new food trends.
Design/methodology/approach
A density-based clustering algorithm was applied to 7,014 tweets to identify regions of consumers sharing content about food trends. The attitude of their social representations was addressed with the sentiment analysis, and grid maps were used to explore subregional differences.
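Density-based clustering of geotagged posts needs a distance between coordinate pairs. The abstract does not say which distance was used; one common choice, assumed here purely for illustration, is the haversine great-circle distance. The function name and the mean Earth radius of 6371 km are assumptions of this sketch.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# A quarter of the equator: from (0°, 0°) to (0°, 90°E).
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1))
```

With such a function as the metric, the neighborhood radius of a density-based algorithm can be stated in kilometres rather than in degrees.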
Findings
Twitter users have a weak, positive attitude towards food trends, and significant differences were found across regions identified, which suggests that factors at the regional level such as cultural context determine users' attitude towards food innovations. The subregional analysis showed differences at the local level, which reinforces the evidence that context matters in consumers' attitude expressed in social media.
Research limitations/implications
The social media content is sensitive to spatio-temporal events. Therefore, research should take into account content, location and contextual information to understand consumers' perceptions. The methodology proposed here serves to identify consumers' regions and to characterize their attitude towards specific topics. It considers not only administrative but also cognitive boundaries in order to analyse subsequent contextual influences on consumers' social representations.
Practical implications
The approach presented allows marketers to identify regions of interest and localize consumers' attitudes towards their products using social media data, providing real-time information to contrast with their strategies in different areas and adapt them to consumers' feelings.
Originality/value
This study presents a research methodology to analyse food consumers' understanding and perceptions using not only content but also geographical information of social media data, which provides a means to extract more information than the content analysis applied in the literature.
Saad Ahmed Al-Saad, Rana N. Jawarneh and Areej Shabib Aloudat
Abstract
Purpose
To test the applicability of the user-generated content (UGC) derived from social travel network sites for online reputation management, the purpose of this study is to analyze the spatial clustering of the reputable hotels (based on the TripAdvisor Best-Value indicator) and reputable outdoor seating restaurants (based on ranking indicator).
Design/methodology/approach
This study used data mining techniques to obtain the UGC from TripAdvisor. The Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm was used for robust cluster analysis.
Findings
The findings of this study revealed that best value (BV) hotels and reputable outdoor seating restaurants are most likely to be located in and around the central districts of the urban tourist destinations where population and economic activities are denser. BV hotels' spatiotemporal cluster analysis formed clusters of different sizes, densities and shape patterns.
Research limitations/implications
This study showed that reputable hotels and restaurants (H&Rs) are concentrated within districts near historic city centers. This should be an impetus for applied research on urban investment environments.
Practical implications
The findings offer entrepreneurs and potential investors rational guidance on the most attractive tourism investment environments.
Originality/value
There has been a lack of studies focusing on analyzing the spatial clustering of the H&Rs using UGC. Therefore, to the best of the authors’ knowledge, this study is the first to map and analyze the spatiotemporal clustering patterns of reputable hotels (TripAdvisor BV indicator) and restaurants (ranking indicator). As such, this study makes a significant methodological contribution to urban tourism research by showing pattern change in H&Rs clustering using data mining and the HDBSCAN algorithm.
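Purpose
To test the applicability of user-generated content (UGC) from social travel network sites (STNS) to online reputation management (ORM), this study analyzes the spatial clustering of reputable hotels (based on the TripAdvisor Best-Value indicator) and reputable outdoor seating (ODS) restaurants (based on the ranking indicator).
Design/methodology/approach
The study used data mining techniques to obtain UGC from TripAdvisor. The hierarchical density-based spatial clustering method based on the HDBSCAN algorithm was used for robust cluster analysis.
Findings
The findings show that best value (BV) hotels and reputable ODS restaurants are most likely to be located in and around the central districts of urban tourist destinations, where population and economic activity are denser. Spatiotemporal cluster analysis of BV hotels formed clusters of different sizes, densities and shape patterns.
Originality
The literature still lacks studies that focus on analyzing the spatial clustering of hotels and restaurants (H&Rs) using UGC. This study is therefore the first to map and analyze the spatiotemporal clustering patterns of reputable hotels (TripAdvisor BV indicator) and restaurants (ranking indicator). As such, it makes an important methodological contribution to urban tourism research by using data mining and the HDBSCAN algorithm to show pattern change in H&Rs clustering.
Theoretical implications
This study shows that reputable H&Rs are concentrated in districts near historic city centers. This should be an impetus for applied research on urban investment environments.
Practical implications
The findings will offer entrepreneurs and potential investors rational guidance on the most attractive tourism investment environments.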
Qingyuan Wu, Changchen Zhan, Fu Lee Wang, Siyang Wang and Zeping Tang
Abstract
Purpose
The rapid growth of web-based and mobile e-learning applications, such as massive open online courses, has created a large volume of online learning resources. Confronted with such a large amount of learning data, it is important to develop effective clustering approaches for user group modeling and intelligent tutoring. The paper aims to discuss these issues.
Design/methodology/approach
In this paper, a minimum spanning tree-based approach is proposed for clustering online learning resources. The novel clustering approach has two main stages, namely, an elimination stage and a construction stage. During the elimination stage, the Euclidean distance is adopted as the metric for measuring the density of learning resources. Resources with very low densities are identified as outliers and therefore removed. During the construction stage, a minimum spanning tree is built by initializing the centroids according to the degree of freedom of the resources. Online learning resources are subsequently partitioned into clusters by exploiting the structure of the minimum spanning tree.
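The general idea of partitioning data through a minimum spanning tree can be sketched as follows. This is a generic textbook variant, assumed here for illustration: build the tree with Prim's algorithm on Euclidean distances, then cut the `k - 1` longest edges so that `k` connected components remain. It does not reproduce the paper's elimination stage or its centroid initialization by degree of freedom; `mst_edges`, `mst_clusters` and the toy points are hypothetical.

```python
from math import dist

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, u, v) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        w, u, v = min((dist(points[u], points[v]), u, v)
                      for u in in_tree for v in range(n) if v not in in_tree)
        edges.append((w, u, v))
        in_tree.add(v)
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and label the resulting connected components."""
    kept = sorted(mst_edges(points))[:len(points) - k]  # the n-k shortest edges
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, u, v in kept:
        parent[find(u)] = find(v)
    roots = {}
    # Number components in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

# Hypothetical toy data: two groups of three points.
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
labels = mst_clusters(points, k=2)
print(labels)
```

Removing very low-density points before building the tree, as the elimination stage describes, would keep isolated outliers from ending up as single-point clusters when the long edges are cut.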
Findings
Conventional clustering algorithms have a number of shortcomings that prevent them from handling online learning resources effectively. On the one hand, extant partitional clustering methods use a randomly assigned centroid for each cluster, which usually leads to ineffective clustering results. On the other hand, classical density-based clustering methods are very computationally expensive and time-consuming. Experimental results indicate that the proposed algorithm outperforms traditional clustering algorithms for online learning resources.
Originality/value
The effectiveness of the proposed algorithms has been validated by using several data sets. Moreover, the proposed clustering algorithm has great potential in e-learning applications. It has been demonstrated how the novel technique can be integrated in various e-learning systems. For example, the clustering technique can classify learners into groups so that homogeneous grouping can improve the effectiveness of learning. Moreover, clustering of online learning resources is valuable to decision making in terms of tutorial strategies and instructional design for intelligent tutoring. Lastly, a number of directions for future research have been identified in the study.
Wolfram Höpken, Marcel Müller, Matthias Fuchs and Maria Lexhagen
Abstract
Purpose
The purpose of this study is to analyse the suitability of photo-sharing platforms, such as Flickr, to extract relevant knowledge on tourists’ spatial movement and point of interest (POI) visitation behaviour and compare the most prominent clustering approaches to identify POIs in various application scenarios.
Design/methodology/approach
The study, first, extracts photo metadata from Flickr, such as upload time, location and user. Then, photo uploads are assigned to latent POIs by density-based spatial clustering of applications with noise (DBSCAN) and k-means clustering algorithms. Finally, association rule analysis (FP-growth algorithm) and sequential pattern mining (generalised sequential pattern algorithm) are used to identify tourists’ behavioural patterns.
Findings
The approach has been demonstrated for the city of Munich, extracting 13,545 photos for the year 2015. POIs, identified by DBSCAN and k-means clustering, could be meaningfully assigned to well-known POIs. By doing so, both techniques show specific advantages for different usage scenarios. Association rule analysis revealed strong rules (support: 1.0-4.6 per cent; lift: 1.4-32.1 per cent), and sequential pattern mining identified relevant frequent visitation sequences (support: 0.6-1.7 per cent).
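The support and lift measures quoted for the association rules can be made concrete with a small calculation. The sketch below uses the standard definitions: support is the fraction of transactions containing an item set, and lift is the joint support divided by the product of the individual supports. The visit sets and place names are made-up toy data, not results from the study.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def lift(transactions, antecedent, consequent):
    """support(A and B) / (support(A) * support(B)); above 1 means positive association."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (support(transactions, antecedent) * support(transactions, consequent))

# Hypothetical visits: each set lists the POIs one user photographed.
visits = [
    {"Marienplatz", "Frauenkirche"},
    {"Marienplatz", "Frauenkirche", "Olympiapark"},
    {"Olympiapark"},
    {"Marienplatz"},
]
pair_support = support(visits, {"Marienplatz", "Frauenkirche"})
pair_lift = lift(visits, {"Marienplatz"}, {"Frauenkirche"})
print(pair_support, round(pair_lift, 3))
```

In this toy data the pair appears in half of the visits, and the lift above 1 indicates that the two places are photographed together more often than independence would predict.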
Research limitations/implications
As a theoretical contribution, this study comparatively analyses the suitability of different clustering techniques for identifying POIs from photo upload data. The identified POIs serve as input to association rule analysis and sequential pattern mining, which are alternative but also complementary techniques for analysing tourists’ spatial behaviour.
Practical implications
From a practical perspective, the study highlights that big data sources, such as Flickr, show the potential to effectively substitute traditional data sources for analysing tourists’ spatial behaviour and movement patterns within a destination. Especially, the approach offers the advantage of being fully automatic and executable in a real-time environment.
Originality/value
The study presents an approach to identify POIs by clustering photo uploads on social media platforms and to analyse tourists’ spatial behaviour by association rule analysis and sequential pattern mining. The study gains novel insights into the suitability of different clustering techniques to identify POIs in different application scenarios.
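Abstract
Purpose
This paper analyses the suitability of the photo-sharing platform Flickr for extracting information on tourists’ spatial movement and point of interest (POI) visitation behaviour, and compares the most prominent clustering approaches to identify POIs in different scenarios.
Design/methodology/approach
The paper first extracts photo metadata from Flickr, such as upload time, location and user. Second, it uses the DBSCAN and k-means clustering algorithms to assign photo uploads to latent POIs. Finally, it applies association rule analysis (FP-growth algorithm) and sequential pattern mining (GSP algorithm) to identify tourists’ behavioural patterns.
Findings
Taking the city of Munich as the sample, the paper extracts 13,545 photos for 2015. POIs identified by DBSCAN and k-means clustering are assigned to well-known POIs. The paper thereby shows the respective advantages of the two techniques for different uses. Association rule analysis revealed strong rules (support: 1%–4.6%; lift: 1.4%–32.1%), and sequential pattern mining identified relevant frequent visitation sequences (support: 0.6%–1.7%).
Research limitations/implications
The theoretical contribution of this paper is a comparative analysis of different clustering techniques for identifying POIs from photo data, showing that association rule analysis and sequential pattern mining each have their own strengths and complement each other as techniques for analysing tourists’ spatial behaviour.
Practical implications
In practical terms, the paper highlights that big data sources such as Flickr show the potential to effectively substitute traditional data for analysing tourists’ spatial behaviour and movement patterns within a destination. In particular, the approach offers the advantage of being automatic and executable in real time.
Originality/value
The paper presents an approach that identifies POIs by clustering photo uploads on social media and analyses tourists’ spatial behaviour through association rule analysis and sequential pattern mining. It offers novel insights into the suitability of different clustering techniques for identifying POIs in different application scenarios.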
Celia Hireche and Habiba Drias
Abstract
Purpose
This paper is an extended version of Hireche and Drias (2018), presented at the WORLD-CIST’18 conference. The major contribution of this work is defined in two phases. The first is the use of data mining technologies, especially data preprocessing tools, on instances of hard and complex problems prior to their resolution. The authors focus on clustering the instance with the aim of reducing its complexity. The second phase is to solve the instance using the knowledge acquired in the first step together with problem-solving methods. The paper aims to discuss these issues.
Design/methodology/approach
Because different clustering techniques may give different results for the same data set, prior knowledge of the data helps to determine the type of clustering that should be applied. The first part of this work studies the descriptive characteristics of the data in order to understand the data better. In particular, the dispersion and distribution of the variables in the problem instances are explored to determine the most suitable clustering technique.
Findings
Several experiments were performed on different kinds of instances and different kinds of data distribution. The obtained results show the importance and the efficiency of the proposed appropriate preprocessing approaches prior to problem solving.
Practical implications
In this paper, the proposed approach is developed for the Boolean satisfiability problem because of its well-recognised importance. The aim is complexity reduction, which allows easier resolution of the problem and, in particular, a significant time saving.
Originality/value
The state of the art in problem solving describes plenty of algorithms and solvers for hard problems, which remain a challenge because of their complexity. The originality of this work lies in the investigation of appropriate preprocessing techniques to tackle and overcome this complexity prior to resolution, which becomes easier and yields a significant time saving.