Search results

1 – 10 of over 1000
Open Access
Article
Publication date: 24 June 2021

Bo Wang, Guanwei Wang, Youwei Wang, Zhengzheng Lou, Shizhe Hu and Yangdong Ye

Vehicle fault diagnosis is a key factor in ensuring the safe and efficient operation of the railway system. Due to the numerous vehicle categories and different fault mechanisms…

Abstract

Purpose

Vehicle fault diagnosis is a key factor in ensuring the safe and efficient operation of the railway system. Due to the numerous vehicle categories and different fault mechanisms, there is an unbalanced fault category problem. Most current methods for this problem have complex algorithm structures and low efficiency, and they require prior knowledge. This study aims to propose a new method with a simple structure that requires no prior knowledge and achieves fast diagnosis of unbalanced vehicle faults.

Design/methodology/approach

This study proposes a feature-learning K-means method with improved cluster-centers selection (FKM-ICS), which comprises two stages: improved cluster-centers selection (ICS) and feature-learning K-means (FKM). Specifically, in the ICS stage, this study defines a cluster-centers approximation to select the initial cluster centers. In the FKM stage, this study uses an improved term frequency-inverse document frequency to measure and adjust the feature-word weights in each cluster, retains the top τ feature words with the highest weights in each cluster and performs the clustering process again. With the FKM-ICS method, clustering performance for unbalanced vehicle fault diagnosis can be significantly enhanced.
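The FKM pass described above can be sketched in a few lines with scikit-learn; the snippet below is illustrative only, using toy fault texts and plain TF-IDF in place of the paper's improved weighting and ICS initialization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "axle bearing overheating fault", "axle bearing noise fault",
    "brake pad wear fault", "brake disc crack fault",
    "door sensor failure", "door motor failure",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# First pass (the paper would initialize centers with ICS instead).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Keep the top-tau highest-weight feature words of each cluster.
tau = 2
keep = set()
for c in range(km.n_clusters):
    weights = np.asarray(X[km.labels_ == c].sum(axis=0)).ravel()
    keep.update(int(i) for i in np.argsort(weights)[::-1][:tau])

# Second pass on the reduced vocabulary (the FKM step).
X_reduced = X[:, sorted(keep)]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
```

Shrinking the vocabulary to the most discriminative terms before re-clustering is what makes the second pass both faster and less noise-sensitive.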

Findings

This study finds that FKM-ICS achieves fast diagnosis of vehicle faults on a vehicle fault text (VFT) data set collected from a railway station in 2017. The experimental results on VFT indicate that the proposed method outperforms several state-of-the-art methods.

Originality/value

This is the first effort to address the vehicle fault diagnosis problem in this way, and the proposed method performs effectively and efficiently. The ICS enables the FKM-ICS method to exclude the effect of outliers and mitigates the noisy data contained in the fault text, which effectively enhances the method's stability. The FKM sharpens the distribution of feature words that discriminate between fault categories and reduces the number of feature words, making the FKM-ICS method faster and better at clustering for unbalanced vehicle fault diagnosis.

Details

Smart and Resilient Transportation, vol. 3 no. 2
Type: Research Article
ISSN: 2632-0487

Keywords

Article
Publication date: 6 February 2017

Aytug Onan

The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification can be an essential task for many purposes in…

Abstract

Purpose

The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification can be an essential task for many purposes in information retrieval, such as document organization, text filtering and sentiment analysis. Ensemble learning has been extensively studied to construct efficient text classification schemes with higher predictive performance and generalization ability. The purpose of this paper is to provide diversity among the classification algorithms of the ensemble, which is a key issue in ensemble design.

Design/methodology/approach

An ensemble scheme based on hybrid supervised clustering is presented for text classification. In the presented scheme, supervised hybrid clustering, which is based on cuckoo search algorithm and k-means, is introduced to partition the data samples of each class into clusters so that training subsets with higher diversities can be provided. Each classifier is trained on the diversified training subsets and the predictions of individual classifiers are combined by the majority voting rule. The predictive performance of the proposed classifier ensemble is compared to conventional classification algorithms (such as Naïve Bayes, logistic regression, support vector machines and C4.5 algorithm) and ensemble learning methods (such as AdaBoost, bagging and random subspace) using 11 text benchmarks.
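A minimal sketch of this cluster-then-ensemble idea, assuming scikit-learn and substituting plain k-means for the paper's cuckoo search hybrid; the data set, base classifier and member count are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Partition the samples of each class into sub-clusters.
n_sub = 3
subsets = []
for cls in np.unique(y):
    idx = np.where(y == cls)[0]
    sub = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit_predict(X[idx])
    for c in range(n_sub):
        subsets.append(idx[sub == c])

# Each ensemble member trains on one sub-cluster per class,
# giving diversified training subsets.
members = []
for i in range(n_sub):
    train_idx = np.concatenate([subsets[i], subsets[n_sub + i]])
    members.append(LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx]))

# Combine the individual predictions by majority vote.
votes = np.array([m.predict(X) for m in members])
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
acc = (pred == y).mean()
```

Training each member on a different region of each class is what supplies the diversity that majority voting then exploits.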

Findings

The experimental results indicate that the presented classifier ensemble outperforms the conventional classification algorithms and ensemble learning methods for text classification.

Originality/value

The presented ensemble scheme is the first to use supervised clustering to obtain a diverse ensemble for text classification.

Details

Kybernetes, vol. 46 no. 2
Type: Research Article
ISSN: 0368-492X

Keywords

Article
Publication date: 7 November 2019

Andika Rachman and R.M. Chandima Ratnayake

Corrosion loop development is an integral part of the risk-based inspection (RBI) methodology. The corrosion loop approach allows a group of piping to be analyzed simultaneously…

Abstract

Purpose

Corrosion loop development is an integral part of the risk-based inspection (RBI) methodology. The corrosion loop approach allows a group of piping to be analyzed simultaneously, thus reducing non-value-adding activities by eliminating repetitive degradation mechanism assessment for piping with similar operational and design characteristics. However, corrosion loop development requires a rigorous process that involves a considerable number of engineering man-hours. Moreover, it is knowledge-intensive work that relies on engineering judgement and intuition, causing the output to have high variability. The purpose of this paper is to reduce the time required for and the output variability of the corrosion loop development process by utilizing machine learning and the group technology method.

Design/methodology/approach

To achieve the research objectives, k-means clustering and a non-hierarchical classification model are utilized to construct an algorithm that enables automation and a more effective and efficient corrosion loop development process. A case study is provided to demonstrate the functionality and performance of the corrosion loop development algorithm on an actual piping data set.
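A hypothetical illustration of the grouping step with scikit-learn: piping segments are clustered on standardized operational/design attributes so that segments in one cluster can share a single degradation-mechanism assessment. The attribute names and values below are invented, not taken from the paper's case study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: operating temperature (deg C), pressure (bar),
# fluid corrosivity index.
piping = np.array([
    [120, 10, 0.20], [125, 11, 0.25],
    [300, 40, 0.80], [310, 42, 0.85],
    [60,  5,  0.10], [65,  6,  0.12],
])

# Standardize so no single attribute dominates the distance metric.
X = StandardScaler().fit_transform(piping)
loops = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Segments sharing a label form one candidate corrosion loop.
```

Segments with near-identical operating conditions land in the same cluster, which is exactly the grouping an engineer would otherwise derive manually.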

Findings

The results show that corrosion loops generated by the algorithm have lower variability and higher coherence than corrosion loops produced by manual work. Additionally, the utilization of the algorithm simplifies the corrosion loop development workflow, which potentially reduces the amount of time required to complete the development. The application of corrosion loop development algorithm is expected to generate a “leaner” overall RBI assessment process.

Research limitations/implications

Although the algorithm allows a part of the corrosion loop development workflow to be automated, it is still deemed necessary to incorporate the engineer's expertise, experience and intuition into the algorithm outputs in order to capture tacit knowledge and refine the insights generated by the algorithm.

Practical implications

This study shows that the advancement of Big Data analytics and artificial intelligence can promote the substitution of machines for human labor in highly complex tasks requiring high qualifications and cognitive skills, including in the inspection and maintenance management area.

Originality/value

This paper discusses a novel way of developing corrosion loops. Corrosion loop development is an integral part of the RBI methodology, yet it has received little attention among scholars in inspection- and maintenance-related subjects.

Details

Journal of Quality in Maintenance Engineering, vol. 26 no. 3
Type: Research Article
ISSN: 1355-2511

Keywords

Article
Publication date: 28 February 2023

Meltem Aksoy, Seda Yanık and Mehmet Fatih Amasyali

When a large number of project proposals are evaluated to allocate available funds, grouping them based on their similarities is beneficial. Current approaches to group proposals…

Abstract

Purpose

When a large number of project proposals are evaluated to allocate available funds, grouping them based on their similarities is beneficial. Current approaches to group proposals are primarily based on manual matching of similar topics, discipline areas and keywords declared by project applicants. When the number of proposals increases, this task becomes complex and requires excessive time. This paper aims to demonstrate how to effectively use the rich information in the titles and abstracts of Turkish project proposals to group them automatically.

Design/methodology/approach

This study proposes a model that effectively groups Turkish project proposals by combining word embedding, clustering and classification techniques. The proposed model uses FastText, BERT and term frequency/inverse document frequency (TF/IDF) word-embedding techniques to extract terms from the titles and abstracts of project proposals in Turkish. The extracted terms were grouped using both the clustering and classification techniques. Natural groups contained within the corpus were discovered using k-means, k-means++, k-medoids and agglomerative clustering algorithms. Additionally, this study employs classification approaches to predict the target class for each document in the corpus. To classify project proposals, various classifiers, including k-nearest neighbors (KNN), support vector machines (SVM), artificial neural networks (ANN), classification and regression trees (CART) and random forest (RF), are used. Empirical experiments were conducted to validate the effectiveness of the proposed method by using real data from the Istanbul Development Agency.
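The vectorize-cluster-classify pipeline above can be sketched with scikit-learn on toy English titles; plain TF/IDF stands in for the FastText and BERT embeddings, and the proposal texts and categories are invented.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

titles = [
    "solar energy microgrid design", "wind turbine energy storage",
    "urban transport route planning", "metro transport scheduling",
]
labels = ["energy", "energy", "transport", "transport"]

# Represent each proposal title as a TF/IDF vector.
vec = TfidfVectorizer()
X = vec.fit_transform(titles)

# Discover natural groups without labels (the clustering track) ...
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ... and predict predefined categories (the classification track).
clf = LinearSVC().fit(X, labels)
pred = clf.predict(vec.transform(["offshore wind energy farm"]))
```

Swapping the vectorizer for FastText or BERT embeddings changes only the representation step; the clustering and classification stages are unchanged.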

Findings

The results show that the generated word embeddings can effectively represent proposal texts as vectors, and can be used as inputs for clustering or classification algorithms. Using clustering algorithms, the document corpus is divided into five groups. In addition, the results demonstrate that the proposals can easily be categorized into predefined categories using classification algorithms. SVM-Linear achieved the highest prediction accuracy (89.2%) with the FastText word embedding method. A comparison of manual grouping with automatic classification and clustering results revealed that both classification and clustering techniques have a high success rate.

Research limitations/implications

The proposed model automatically benefits from the rich information in project proposals and significantly reduces numerous time-consuming tasks that managers must perform manually. Thus, it eliminates the drawbacks of the current manual methods and yields significantly more accurate results. In the future, additional experiments should be conducted to validate the proposed method using data from other funding organizations.

Originality/value

This study presents the application of word embedding methods to effectively use the rich information in the titles and abstracts of Turkish project proposals. Existing research focuses on the automatic grouping of proposals and uses traditional frequency-based word-embedding methods for feature extraction to represent them. Unlike previous research, this study employs two outperforming neural network-based textual feature extraction techniques to obtain terms representing the proposals: BERT as a contextual word embedding method and FastText as a static word embedding method. Moreover, to the best of our knowledge, there has been no research conducted on the grouping of project proposals in Turkish.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 16 no. 3
Type: Research Article
ISSN: 1756-378X

Keywords

Article
Publication date: 26 July 2019

Seda Yanık and Abdelrahman Elmorsy

The purpose of this paper is to generate customer clusters using self-organizing map (SOM) approach, a machine learning technique with a big data set of credit card consumptions…

Abstract

Purpose

The purpose of this paper is to generate customer clusters using the self-organizing map (SOM) approach, a machine learning technique, with a big data set of credit card consumptions. The authors aim to use the consumption patterns of the customers over a period of three months, deduced from the credit card transactions, specifically the consumption categories (e.g. food, entertainment, etc.).

Design/methodology/approach

The authors use a big data set of almost 40,000 credit card transactions to cluster customers. To deal with the size of the data set and to eliminate the required parametric assumptions, the authors use a machine learning technique, SOMs. The variables used are grouped into three sets: demographical variables, categorical consumption variables and summary consumption variables. The variables are first converted to factors using principal component analysis. The number of clusters is then specified by k-means clustering trials. Clustering with SOM is conducted once with only the demographical variables and once with all variables. Finally, the two solutions are compared and the significance of the variables is examined by analysis of variance.
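The preprocessing chain (PCA factors, then k-means trials to pick the cluster count) can be sketched as follows with scikit-learn; the data here are synthetic stand-ins for the consumption variables, and the SOM step itself is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))   # stand-in for the consumption variables

# Convert the variables to a smaller set of factors.
factors = PCA(n_components=4).fit_transform(X)

# Run k-means for a range of k and record the inertia; the "elbow"
# suggests the cluster count (the paper settles on k = 8 on the real
# transaction data).
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(factors).inertia_
    for k in range(2, 7)
}
```

The chosen k is then fed to the SOM, which does the final clustering on the factor scores.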

Findings

The appropriate number of clusters is found to be 8 using k-means clustering. The differences in categorical consumption levels between the clusters are then investigated; they are found to be insignificant, whereas the summary consumption variables, as well as the demographical variables, differ significantly between the clusters.

Originality/value

The originality of the study is to incorporate the credit card consumption variables of customers to cluster the bank customers. The authors use a big data set and apply a machine learning technique to deduce the consumption patterns underlying the clusters. Credit card transactions generate a vast amount of data from which valuable information can be deduced; in the literature, such data are mainly used to detect fraud. To the best of the authors' knowledge, this study is the first to use consumption patterns obtained from credit card transactions to cluster customers.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 12 no. 3
Type: Research Article
ISSN: 1756-378X

Keywords

Article
Publication date: 23 November 2010

Yongzheng Zhang, Evangelos Milios and Nur Zincir‐Heywood

Summarization of an entire web site with diverse content may lead to a summary heavily biased towards the site's dominant topics. The purpose of this paper is to present a novel…

Abstract

Purpose

Summarization of an entire web site with diverse content may lead to a summary heavily biased towards the site's dominant topics. The purpose of this paper is to present a novel topic‐based framework to address this problem.

Design/methodology/approach

A two‐stage framework is proposed. The first stage identifies the main topics covered in a web site via clustering and the second stage summarizes each topic separately. The proposed system is evaluated by a user study and compared with the single‐topic summarization approach.

Findings

The user study demonstrates that the clustering‐summarization approach statistically significantly outperforms the plain summarization approach in the multi‐topic web site summarization task. Text‐based clustering based on selecting features with high variance over web pages is reliable; outgoing links are useful if a rich set of cross links is available.

Research limitations/implications

More sophisticated clustering methods than those used in this study are worth investigating. The proposed method should be tested on web content that is less structured than organizational web sites, for example blogs.

Practical implications

The proposed summarization framework can be applied to the effective organization of search engine results and faceted or topical browsing of large web sites.

Originality/value

Several key components are integrated for web site summarization for the first time, including feature selection, link analysis, and key phrase and key sentence extraction. Insight into the contributions of links and content to topic‐based summarization was gained. A classification approach is used to minimize the number of parameters.

Details

International Journal of Web Information Systems, vol. 6 no. 4
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 10 August 2021

Elham Amirizadeh and Reza Boostani

The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; also, the authors show that…

Abstract

Purpose

The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; the authors also show that applying this information improves the performance of clustering and increases the speed of network training convergence.

Design/methodology/approach

In data mining, semisupervised learning is an interesting approach because good performance can be achieved with a small subset of labeled data; one reason is that data labeling is expensive, and semisupervised learning does not need all the labels. One type of semisupervised learning is constrained clustering; this type of learning does not use class labels for clustering. Instead, it uses information about some pairs of instances (side information): these instances may be in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering has been studied extensively; however, few works have focused on constrained clustering for big datasets. In this paper, the authors present a constrained clustering method for big datasets that uses a DNN. The authors inject the constraints (ML and CL) into this DNN to promote the clustering performance and call it constrained deep embedded clustering (CDEC). In this manner, an autoencoder was implemented to elicit informative low-dimensional features in the latent space, and the encoder network was then retrained using a proposed Kullback–Leibler divergence objective function, which captures the constraints in order to cluster the projected samples. The proposed CDEC was compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means on the well-known MNIST, Reuters-10k and USPS datasets, and their performance was assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC over the counterparts in terms of clustering accuracy.
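The retraining objective resembles the deep embedded clustering (DEC) formulation; a NumPy sketch of that objective is given below, with soft assignments from a Student's t kernel and a KL(p ∥ q) loss. The must-link/cannot-link injection and the autoencoder itself are omitted, and the latent codes are random stand-ins for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))                 # latent codes from the encoder
mu = Z[rng.choice(100, 4, replace=False)]     # 4 cluster centers

# Soft assignment q: Student's t kernel (one degree of freedom)
# around each center in the latent space.
d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
q = 1.0 / (1.0 + d2)
q /= q.sum(1, keepdims=True)

# Target distribution p: sharpen q and normalize by cluster frequency.
w = q ** 2 / q.sum(0)
p = w / w.sum(1, keepdims=True)

# KL(p || q): the loss minimized when retraining the encoder.
kl = (p * np.log(p / q)).sum()
```

In CDEC the same loss would additionally carry terms pulling must-link pairs toward identical assignments and pushing cannot-link pairs apart.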

Findings

First, this is the first DNN-based constrained clustering method that uses side information to improve the performance of clustering without using labels on big, high-dimensional datasets. Second, the authors define a formula to inject the side information into the DNN. Third, the proposed method improves clustering performance and network convergence speed.

Originality/value

Few works have focused on constrained clustering for big datasets; studies of DNNs for clustering with a specific loss function that simultaneously extracts features and clusters the data are also rare. The method improves the performance of big-data clustering without using labels, which is important because data labeling is expensive and time-consuming, especially for big datasets.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 4
Type: Research Article
ISSN: 1756-378X

Keywords

Article
Publication date: 17 October 2008

Rui Xu and Donald C. Wunsch

The purpose of this paper is to provide a review of the issues related to cluster analysis, one of the most important and primitive activities of human beings, and of the advances…

Abstract

Purpose

The purpose of this paper is to provide a review of the issues related to cluster analysis, one of the most important and primitive activities of human beings, and of the advances made in recent years.

Design/methodology/approach

The paper investigates the clustering algorithms rooted in machine learning, computer science, statistics, and computational intelligence.

Findings

The paper reviews the basic issues of cluster analysis and discusses the recent advances of clustering algorithms in scalability, robustness, visualization, irregular cluster shape detection, and so on.

Originality/value

The paper presents a comprehensive and systematic survey of cluster analysis and emphasizes its recent efforts in order to meet the challenges caused by the glut of complicated data from a wide variety of communities.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 1 no. 4
Type: Research Article
ISSN: 1756-378X

Keywords

Book part
Publication date: 10 April 2023

Surachai Chancharat and Arisa Phadungviang

This study groups mutual funds using k-means clustering analysis and compares the k-means clustering process with existing clustering techniques using mutual fund data for equity…

Abstract

This study groups mutual funds using k-means clustering analysis and compares the k-means clustering process with existing clustering techniques using mutual fund data for equity funds, general fixed-income funds, and balanced open-end mutual funds rated by the Association of Investment Management Companies. Data are from January 2016 to December 2020 (60 months) and include information on prices, risks, and investment policies. The sample for this study comprises 173 funds from 10 asset management companies with the highest net assets. The tool used for analysis is the k-means technique using a statistical package set for k = 3. The funds can be divided into three groups: Group 1 has 5 mutual funds (2.89%), Group 2 has 24 mutual funds (13.87%), and Group 3 has 144 mutual funds (83.24%). In Group 1, four of the five mutual funds are equity funds with a track record of beating the market, and their fund managers have good market-timing skills. Moreover, the efficiency of fund grouping using the k-means technique was compared with the existing grouping, with close results at 57.23%. This work provides a methodology for obtaining a better categorization of mutual funds using k-means clustering, allowing investors to see how mutual funds are grouped. This categorization is very useful for improving the formulation of mutual funds, with the goal of further optimizing investment.
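The grouping step amounts to k-means with k = 3 on per-fund features; a toy scikit-learn sketch follows, with random numbers standing in for the 60 months of price and risk data used in the chapter.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical per-fund features, e.g. mean return, volatility, Sharpe ratio.
funds = rng.normal(size=(173, 3))

# Standardize, then group the 173 funds into k = 3 clusters.
X = StandardScaler().fit_transform(funds)
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sizes = np.bincount(groups)   # fund count per group
```

On the real data the group sizes are highly unequal (5 / 24 / 144 funds), which k-means accommodates naturally since it imposes no balance constraint.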

Details

Comparative Analysis of Trade and Finance in Emerging Economies
Type: Book
ISBN: 978-1-80455-758-7

Keywords

Article
Publication date: 14 August 2018

Waqas Khalid and Zaza Nadja Lee Herbert-Hansen

This paper aims to investigate the application of unsupervised machine learning in the international location decision (ILD). This paper addresses the need for a fast…

Abstract

Purpose

This paper aims to investigate the application of unsupervised machine learning in the international location decision (ILD). This paper addresses the need for a fast, quantitative and dynamic location decision framework.

Design/methodology/approach

An unsupervised machine learning technique, k-means clustering, is used to carry out the analysis. In total, 24 different indicators of 94 countries, categorized into five groups, have been used in the analysis. After the clustering, the clusters have been compared and scored to select the feasible countries.
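A sketch of this framework with scikit-learn: standardize the indicators, cluster the countries, then score each cluster to shortlist candidates. The indicator values below are synthetic, and the simple mean-based scoring is an assumption for illustration, not the paper's scoring scheme.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
indicators = rng.normal(size=(94, 24))   # 24 indicators for 94 countries

# Standardize so indicators on different scales are comparable.
X = StandardScaler().fit_transform(indicators)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Score each cluster by its mean standardized indicator value
# (assuming all indicators are oriented so higher is preferable).
scores = np.array([X[clusters == c].mean() for c in range(5)])
best = int(scores.argmax())
shortlist = np.where(clusters == best)[0]   # candidate country indices
```

Adding or removing an indicator only changes the columns of the matrix, which is what makes the framework easy to adapt to a decision-maker's preferences.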

Findings

A new framework is developed based on k-means clustering that can be used in ILD. This method provides a quantitative output without personal subjectivity. Indicators can easily be added or removed based on the preferences of the decision-makers. Hence, unsupervised machine learning with k-means clustering was found to be a fast and flexible decision support framework for ILD.

Research limitations/implications

Limitations include the generality of selected indicators and clustering algorithm used. The use of other methods and parameters may lead to alternate results.

Originality/value

The framework developed through this research is intended to assist decision-makers in deciding on facility locations. The framework can be used in international and national domains. It provides a quantitative, fast and flexible way to shortlist potential locations. Other methods can then be used to decide on the specific location.

Details

Journal of Global Operations and Strategic Sourcing, vol. 11 no. 3
Type: Research Article
ISSN: 2398-5364

Keywords
