Search results

1 – 10 of over 29,000
Article
Publication date: 23 November 2010

Yongzheng Zhang, Evangelos Milios and Nur Zincir‐Heywood

Summarization of an entire web site with diverse content may lead to a summary heavily biased towards the site's dominant topics. The purpose of this paper is to present a novel…

Abstract

Purpose

Summarization of an entire web site with diverse content may lead to a summary heavily biased towards the site's dominant topics. The purpose of this paper is to present a novel topic‐based framework to address this problem.

Design/methodology/approach

A two‐stage framework is proposed. The first stage identifies the main topics covered in a web site via clustering and the second stage summarizes each topic separately. The proposed system is evaluated by a user study and compared with the single‐topic summarization approach.
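
As a rough illustration only (not the authors' system), the following Python sketch shows the two-stage idea: cluster page texts with k-means over TF-IDF features, then pick one representative sentence per cluster as a crude per-topic summary. All function names and parameters are illustrative assumptions.

# Minimal sketch of a cluster-then-summarize pipeline (illustrative only, not the authors' system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

def summarize_site(pages, n_topics=3):
    """pages: list of page texts; returns one key sentence per discovered topic."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(pages)                      # stage 1: text features and clustering
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(X)
    summaries = {}
    for topic in range(n_topics):
        # stage 2: summarize each topic separately
        sentences = [s for i, p in enumerate(pages) if labels[i] == topic
                     for s in p.split(". ") if s]
        S = vec.transform(sentences)
        centroid = np.asarray(X[labels == topic].mean(axis=0))
        scores = S @ centroid.T                       # similarity of each sentence to the topic centroid
        summaries[topic] = sentences[int(np.argmax(scores))]
    return summaries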

Findings

The user study demonstrates that the clustering‐summarization approach statistically significantly outperforms the plain summarization approach in the multi‐topic web site summarization task. Text‐based clustering based on selecting features with high variance over web pages is reliable; outgoing links are useful if a rich set of cross links is available.

Research limitations/implications

More sophisticated clustering methods than those used in this study are worth investigating. The proposed method should be tested on web content that is less structured than organizational web sites, for example blogs.

Practical implications

The proposed summarization framework can be applied to the effective organization of search engine results and faceted or topical browsing of large web sites.

Originality/value

Several key components are integrated for web site summarization for the first time, including feature selection, link analysis, key phrase extraction and key sentence extraction. Insight into the contributions of links and content to topic‐based summarization was gained. A classification approach is used to minimize the number of parameters.

Details

International Journal of Web Information Systems, vol. 6 no. 4
Type: Research Article
ISSN: 1744-0084


Article
Publication date: 13 April 2012

Melissa Burt and Chern Li Liew

The use of search engines has become increasingly common. While Google has an overwhelming majority of the market share, new and innovative search techniques are being developed…


Abstract

Purpose

The use of search engines has become increasingly common. While Google has an overwhelming majority of the market share, new and innovative search techniques are being developed. An example of these is the clustering interface used by a number of search engines, whereby results are grouped and visualised according to categories. The purpose of this paper is to examine user perceptions and experience of using clustering.

Design/methodology/approach

In total, 12 Palmerston North City Library (New Zealand) staff members and patrons were recruited and the data were gathered through both observations of a search using a clustering search engine (Carrot2 Clustering) and via semi‐structured interviews. The data were analysed according to four themes: features, look and feel, results and clusters.

Findings

The findings from this study revealed that the use of clusters can assist users in the search process in several ways. Evidence was also found to support previous research indicating the importance of labelling the clusters.

Originality/value

This exploratory research provides some insights into users' perceived cognitive load in using a clustering search engine as compared to using a list‐based search engine. The authors explored how searchers compare their overall experience of using clustering search engines to using traditional list‐based engines and the extent to which the clustering presentation influences the progression of a search. The authors also examined the extent to which searchers make use of the feedback a clustering search engine provides to refine, rephrase or redefine their initial search.

Details

Online Information Review, vol. 36 no. 2
Type: Research Article
ISSN: 1468-4527


Open Access
Article
Publication date: 14 December 2021

Mariam Elhussein and Samiha Brahimi

This paper aims to propose a novel way of using textual clustering as a feature selection method. It is applied to identify the most important keywords in the profile…

Abstract

Purpose

This paper aims to propose a novel way of using textual clustering as a feature selection method. It is applied to identify the most important keywords in the profile classification. The method is demonstrated through the problem of sick-leave promoters on Twitter.

Design/methodology/approach

Four machine learning classifiers were used on a total of 35,578 tweets posted on Twitter. The data were manually labeled into two categories: promoter and nonpromoter. Classification performance was compared when the proposed clustering feature selection approach and the standard feature selection were applied.
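
The abstract does not spell out the clustering-as-feature-selection procedure; the sketch below shows one plausible reading in Python, in which TF-IDF terms are clustered by their usage profiles and one representative keyword per cluster is kept before training a random forest. The function names, cluster count and selection rule are assumptions, not the authors' implementation.

# Sketch: term clustering as a feature selection step before classification (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
import numpy as np

def select_features_by_clustering(texts, labels, n_term_clusters=50):
    vec = TfidfVectorizer(max_features=5000)
    X = vec.fit_transform(texts)
    # Cluster the terms (columns) by their usage profiles across documents.
    term_profiles = X.T.toarray()
    term_labels = KMeans(n_clusters=n_term_clusters, n_init=10).fit_predict(term_profiles)
    # Keep one representative keyword per term cluster: the one with the highest total weight.
    totals = np.asarray(X.sum(axis=0)).ravel()
    keep = [np.where(term_labels == c)[0][np.argmax(totals[term_labels == c])]
            for c in range(n_term_clusters)]
    clf = RandomForestClassifier(n_estimators=200).fit(X[:, keep], labels)
    return clf, vec, keep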

Findings

Random forest achieved the highest accuracy, 95.91%, which is higher than that reported in comparable work. Furthermore, using clustering as a feature selection method improved the sensitivity of the model from 73.83% to 98.79%. Sensitivity (recall) is the most important measure of classifier performance when detecting promoter accounts that exhibit spam-like behavior.

Research limitations/implications

The method applied is novel; more testing on other datasets is needed before its results can be generalized.

Practical implications

The model can be used by Saudi authorities to report accounts that sell sick leaves online.

Originality/value

The research proposes a new way in which textual clustering can be used for feature selection.

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2634-1964


Article
Publication date: 10 August 2021

Elham Amirizadeh and Reza Boostani

The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; also, the authors show that…

Abstract

Purpose

The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets. The authors also show that applying this information improves clustering performance and increases the speed of network training convergence.

Design/methodology/approach

In data mining, semisupervised learning is an attractive approach because good performance can be achieved with a small subset of labeled data; one reason is that data labeling is expensive, and semisupervised learning does not need all labels. One type of semisupervised learning is constrained clustering, which does not use class labels for clustering. Instead, it uses information about some pairs of instances (side information), indicating that two instances may be in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering has been studied extensively; however, few works have focused on constrained clustering for big datasets. In this paper, the authors present a constrained clustering method for big datasets that uses a DNN. The authors inject the constraints (ML and CL) into this DNN to promote clustering performance and call it constrained deep embedded clustering (CDEC). An autoencoder is first trained to elicit informative low-dimensional features in the latent space, and the encoder network is then retrained using a proposed Kullback–Leibler divergence objective function that captures the constraints in order to cluster the projected samples. The proposed CDEC is compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means on the well-known MNIST, Reuters-10k and USPS datasets, and performance is assessed in terms of clustering accuracy. Empirical results confirm the statistical superiority of CDEC over these counterparts in terms of clustering accuracy.
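
As a hedged illustration of how such an objective can be assembled (not the authors' CDEC), the PyTorch sketch below combines DEC-style soft assignments, a Kullback–Leibler term towards a sharpened target distribution, and simple penalties that pull must-link pairs together and push cannot-link pairs apart in the latent space. Autoencoder pretraining and cluster-center initialization are omitted; all layer sizes, weights and names are assumptions.

# Illustrative constrained deep-embedded-clustering loss in PyTorch (not the authors' CDEC).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Encoder half of a (pretrained) autoencoder mapping inputs to a low-dimensional latent space."""
    def __init__(self, in_dim, latent_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, x):
        return self.net(x)

def soft_assign(z, centers, alpha=1.0):
    # Student's t kernel between embeddings and cluster centers (DEC-style soft assignment).
    d2 = torch.cdist(z, centers) ** 2
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Sharpened target distribution used by the KL clustering term.
    p = q ** 2 / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

def constrained_clustering_loss(z, q, ml_pairs, cl_pairs, lam=0.1):
    # KL(P || Q) clustering term plus must-link / cannot-link penalties (weights are assumptions).
    p = target_distribution(q).detach()
    kl = F.kl_div(q.log(), p, reduction="batchmean")
    ml = torch.stack([(z[i] - z[j]).pow(2).sum() for i, j in ml_pairs]).mean() if ml_pairs else z.new_zeros(())
    cl = torch.stack([F.relu(1.0 - (z[i] - z[j]).pow(2).sum()) for i, j in cl_pairs]).mean() if cl_pairs else z.new_zeros(())
    return kl + lam * (ml + cl)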

Findings

First, this is the first DNN-constrained clustering method that uses side information to improve clustering performance, without using labels, on big, high-dimensional datasets. Second, the authors define a formula for injecting side information into the DNN. Third, the proposed method improves clustering performance and network convergence speed.

Originality/value

Few works have focused on constrained clustering for big datasets, and studies of DNNs for clustering with a specific loss function that simultaneously extracts features and clusters the data are also rare. The method improves the performance of big data clustering without using labels, which is important because data labeling is expensive and time-consuming, especially for big datasets.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 4
Type: Research Article
ISSN: 1756-378X


Open Access
Article
Publication date: 24 June 2021

Bo Wang, Guanwei Wang, Youwei Wang, Zhengzheng Lou, Shizhe Hu and Yangdong Ye

Vehicle fault diagnosis is a key factor in ensuring the safe and efficient operation of the railway system. Due to the numerous vehicle categories and different fault mechanisms…

Abstract

Purpose

Vehicle fault diagnosis is a key factor in ensuring the safe and efficient operation of the railway system. Due to the numerous vehicle categories and different fault mechanisms, there is an unbalanced fault category problem. Most of the current methods to solve this problem have complex algorithm structures and low efficiency and require prior knowledge. This study aims to propose a new method with a simple structure that does not require any prior knowledge and achieves fast diagnosis of unbalanced vehicle faults.

Design/methodology/approach

This study proposes a novel feature learning K-means with improved cluster-centers selection (FKM-ICS) method, which comprises the ICS and the FKM components. Specifically, the study defines a cluster-centers approximation to select the initial cluster centers in the ICS. It then uses an improved term frequency-inverse document frequency to measure and adjust the feature word weights in each cluster, retaining the top τ feature words with the highest weight in each cluster and performing the clustering process again in the FKM. With the FKM-ICS method, clustering performance for unbalanced vehicle fault diagnosis can be significantly enhanced.
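
A minimal sketch of the "retain the top-weighted feature words per cluster, then re-cluster" step might look as follows in Python; the improved cluster-centers selection (ICS) step is omitted and all parameter values (e.g. the number of clusters and τ) are illustrative assumptions.

# Sketch: keep the top-weighted terms per cluster, then re-cluster (illustrative only; ICS omitted).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

def refine_and_recluster(fault_texts, n_clusters=5, tau=30):
    vec = TfidfVectorizer()
    X = vec.fit_transform(fault_texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    keep = set()
    for c in range(n_clusters):
        # Average TF-IDF weight of each term inside cluster c.
        weights = np.asarray(X[labels == c].mean(axis=0)).ravel()
        keep.update(np.argsort(weights)[::-1][:tau])   # top-tau feature words of this cluster
    X_reduced = X[:, sorted(keep)]                      # reduced feature space
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_reduced)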

Findings

This study finds that FKM-ICS can achieve fast diagnosis of vehicle faults on a vehicle fault text (VFT) data set collected from a railway station in 2017. The experimental results on VFT indicate that the proposed method outperforms several state-of-the-art methods.

Originality/value

This is the first effort to address the vehicle fault diagnosis problem, and the proposed method performs effectively and efficiently. The ICS enables the FKM-ICS method to exclude the effect of outliers and mitigates the problem that the fault text data contain a certain amount of noise, which effectively enhances the method's stability. The FKM enhances the distribution of feature words that discriminate between different fault categories and reduces the number of feature words, making the FKM-ICS method faster and a better clusterer for unbalanced vehicle fault diagnosis.

Details

Smart and Resilient Transportation, vol. 3 no. 2
Type: Research Article
ISSN: 2632-0487


Article
Publication date: 12 March 2019

Prafulla Bafna, Shailaja Shirwaikar and Dhanya Pramod

Text mining is growing in importance proportionate to the growth of unstructured data and its applications are increasing day by day from knowledge management to social media…

Abstract

Purpose

Text mining is growing in importance in proportion to the growth of unstructured data, and its applications are increasing day by day, from knowledge management to social media analysis. Mapping the skill set of a candidate to the requirements of a job profile is crucial for new recruitment as well as for internal task allocation in the organization. Automating the candidate selection process is essential to avoid the bias or subjectivity that may occur while shuffling through thousands of resumes and other informative documents. The system takes skill sets in the form of documents to build the semantic space, then takes appraisals or resumes as input and suggests the persons appropriate for a task or job position as well as employees needing additional training. The purpose of this study is to extend the term-document matrix and achieve refined clusters to produce an improved recommendation. The study also focuses on achieving consistency in cluster quality in spite of the increasing size of the data set, to address scalability issues.

Design/methodology/approach

In this study, a synset-based document matrix construction method is proposed in which semantically similar terms are grouped to reduce the curse of dimensionality. An automated task recommendation system is proposed comprising synset-based feature extraction, iterative semantic clustering and mapping based on semantic similarity.
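
A minimal sketch of the synset-based idea, assuming NLTK's WordNet interface and mapping each token to the name of its first synset before building the TF-IDF matrix, is given below; the iterative semantic clustering and similarity-based mapping of the actual system are not reproduced here.

# Sketch: map terms to a WordNet synset "head" so semantically similar words share one column.
# Assumes NLTK with the WordNet corpus downloaded: nltk.download("wordnet").
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def synset_normalize(text):
    out = []
    for tok in text.lower().split():
        syns = wn.synsets(tok)
        out.append(syns[0].name() if syns else tok)   # crude first-sense choice, e.g. "manage" -> "pull_off.v.02"
    return " ".join(out)

def cluster_resumes(resumes, n_clusters=5):
    normalized = [synset_normalize(r) for r in resumes]
    # Keep synset names like "dog.n.01" intact as single tokens.
    X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(normalized)   # synset-based TF-IDF matrix
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)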

Findings

The first step in knowledge extraction from unstructured textual data is converting it into structured form, either as a term frequency-inverse document frequency (TF-IDF) matrix or as a synset-based TF-IDF matrix. Once in structured form, a range of mining algorithms from classification to clustering can be applied. The algorithm gives a better feature vector representation and improved cluster quality. The synset-based grouping and feature extraction for resume data optimize the candidate selection process by reducing entropy and error and by improving precision and scalability.

Research limitations/implications

The productivity of any organization is enhanced by assigning tasks to employees with the right set of skills. Efficient recruitment and task allocation can not only improve productivity but also help satisfy employee aspirations and identify training requirements.

Practical implications

Industries can use the approach to support different processes related to human resource management such as promotions, recruitment and training and, thus, manage the talent pool.

Social implications

The task recommender system creates knowledge by following the steps of the knowledge management cycle and this methodology can be adopted in other similar knowledge management applications.

Originality/value

The efficacy of the proposed approach and its enhancement are validated by carrying out experiments on a benchmarked dataset of resumes. The results are compared with existing techniques and show refined clusters: absolute error is reduced by 30 per cent, precision is increased by 20 per cent and dimensionality is lowered by 60 per cent compared with the existing technique. The proposed approach also addresses scalability by producing improved recommendations for 1,000 resumes with reduced entropy.

Details

VINE Journal of Information and Knowledge Management Systems, vol. 49 no. 2
Type: Research Article
ISSN: 2059-5891


Article
Publication date: 26 July 2019

Seda Yanık and Abdelrahman Elmorsy

The purpose of this paper is to generate customer clusters using self-organizing map (SOM) approach, a machine learning technique with a big data set of credit card consumptions…

Abstract

Purpose

The purpose of this paper is to generate customer clusters using the self-organizing map (SOM) approach, a machine learning technique, with a big data set of credit card consumptions. The authors aim to use the consumption patterns of the customers over a period of three months, derived from the credit card transactions, specifically the consumption categories (e.g. food, entertainment).

Design/methodology/approach

The authors use a big data set of almost 40,000 credit card transactions to cluster customers. To deal with the size of the data set and to eliminate the required parametric assumptions, the authors use a machine learning technique, SOMs. The variables used are grouped into three sets: demographic variables, categorical consumption variables and summary consumption variables. The variables are first converted to factors using principal component analysis. The number of clusters is then specified by k-means clustering trials. Clustering with SOM is then conducted, first including only the demographic variables and then all variables. Finally, a comparison is made and the significance of the variables is examined by analysis of variance.
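
The following sketch outlines the PCA, k-means trials and SOM sequence described above, using scikit-learn and the third-party MiniSom library as a stand-in SOM implementation; variable names, grid size and iteration counts are assumptions rather than the authors' settings.

# Sketch of the PCA -> k-means trials -> SOM sequence (illustrative; MiniSom as a stand-in SOM).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from minisom import MiniSom

def cluster_customers(X, n_components=5, som_grid=(3, 3)):
    # Convert the raw variables to factors (principal components).
    Z = PCA(n_components=n_components).fit_transform(StandardScaler().fit_transform(X))
    # Trial k-means runs can guide the choice of cluster count via inertia.
    inertias = {k: KMeans(n_clusters=k, n_init=10).fit(Z).inertia_ for k in range(2, 11)}
    # Cluster with a self-organizing map and record each customer's winning map cell.
    som = MiniSom(som_grid[0], som_grid[1], n_components, sigma=1.0, learning_rate=0.5)
    som.train_random(Z, 1000)
    cells = np.array([som.winner(z) for z in Z])
    return inertias, cells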

Findings

The appropriate number of clusters is found to be eight using k-means clustering. The differences in categorical consumption levels between the clusters are then investigated; however, these are found to be insignificant, whereas the summary consumption variables, as well as the demographic variables, are found to differ significantly between the clusters.

Originality/value

The originality of the study lies in incorporating customers' credit card consumption variables to cluster bank customers. The authors use a big data set and process it with a machine learning technique to deduce the consumption patterns and generate the clusters. Credit card transactions generate a vast amount of data from which valuable information can be deduced; in the literature they are mainly used to detect fraud. To the best of the authors' knowledge, this study is the first to use consumption patterns obtained from credit card transactions to cluster customers.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 12 no. 3
Type: Research Article
ISSN: 1756-378X


Article
Publication date: 5 March 2018

Sajjad Tofighy and Seyed Mostafa Fakhrahmad

This paper aims to propose a statistical and context-aware feature reduction algorithm that improves sentiment classification accuracy. Classification of reviews with different…

Abstract

Purpose

This paper aims to propose a statistical and context-aware feature reduction algorithm that improves sentiment classification accuracy. Classification of reviews of different granularities into two classes, negative and positive polarity, is among the objectives of sentiment analysis. One of the major issues in sentiment analysis is feature engineering, as it severely affects the time complexity and accuracy of sentiment classification.

Design/methodology/approach

In this paper, a feature reduction method is proposed that uses context-based knowledge as well as synset statistical knowledge. To this end, a one-dimensional representation proposed for SentiWordNet is used to calculate statistical knowledge, namely the polarity concentration and variation tendency of each synset. Feature reduction involves two phases. In the first phase, features that satisfy both semantic and statistical similarity conditions are put in the same cluster. In the second phase, features are ranked and the lower-ranked features are eliminated. The experiments are conducted with support vector machine (SVM), naive Bayes (NB), decision tree (DT) and k-nearest neighbors (KNN) algorithms to classify vectors of unigram and bigram features into the two classes of positive and negative sentiment.
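
As an illustrative sketch of the two-phase reduction (not the authors' SentiWordNet-based algorithm), the Python below clusters unigram/bigram features by their document profiles, keeps the strongest feature per cluster, ranks and prunes the rest, and trains a linear SVM; the SentiWordNet polarity statistics are replaced here by simple frequency-based ranking.

# Sketch of a two-phase feature reduction: cluster similar features, then drop low-ranked ones.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.svm import LinearSVC
import numpy as np

def reduce_and_classify(reviews, polarities, n_feature_clusters=200, keep_ratio=0.5):
    vec = CountVectorizer(ngram_range=(1, 2), max_features=2000)   # unigram and bigram features
    X = vec.fit_transform(reviews)
    # Phase 1: group features with similar document profiles into clusters.
    groups = AgglomerativeClustering(n_clusters=n_feature_clusters).fit_predict(X.T.toarray())
    # Phase 2: rank features (here simply by frequency, as a stand-in) and keep the best per group.
    freq = np.asarray(X.sum(axis=0)).ravel()
    best = [np.where(groups == g)[0][np.argmax(freq[groups == g])]
            for g in range(n_feature_clusters)]
    ranked = sorted(best, key=lambda i: freq[i], reverse=True)
    keep = sorted(ranked[: int(len(ranked) * keep_ratio)])          # eliminate the lower-ranked features
    return LinearSVC().fit(X[:, keep], polarities)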

Findings

The results showed that the applied clustering algorithm reduces the SentiWordNet synsets to fewer than half, which in turn reduced the size of the feature vector by less than half. In addition, the accuracy of sentiment classification is improved by at least 1.5 per cent.

Originality/value

The presented feature reduction method is the first use of synset clustering for feature reduction. The proposed algorithm first aggregates similar features into clusters and then eliminates unsatisfactory clusters.

Details

Kybernetes, vol. 47 no. 5
Type: Research Article
ISSN: 0368-492X


Article
Publication date: 21 November 2008

Chun‐Nan Lin, Chih‐Fong Tsai and Jinsheng Roan

Because of the popularity of digital cameras, the number of personal photographs is increasing rapidly. In general, people manage their photos by date, subject, participants, etc…

Abstract

Purpose

Because of the popularity of digital cameras, the number of personal photographs is increasing rapidly. In general, people manage their photos by date, subject, participants, etc. for future browsing and searching. However, it is difficult and/or takes time to retrieve desired photos from a large number of photographs based on the general personal photo management strategy. In this paper the authors aim to propose a systematic solution to effectively organising and browsing personal photos.

Design/methodology/approach

In their system the authors apply the concept of content‐based image retrieval (CBIR) to automatically extract visual image features of personal photos. Then three well‐known clustering techniques – k‐means, self‐organising maps and fuzzy c‐means – are used to group personal photos. Finally, the clustering results are evaluated by human subjects in terms of retrieval effectiveness and efficiency.
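
A minimal sketch of the pipeline, assuming colour histograms as stand-in CBIR features (the paper's actual visual features are not specified here) and k-means for grouping, follows; Pillow and scikit-learn are assumed, and photo_paths is a hypothetical list of image files.

# Sketch: colour-histogram features as a stand-in for CBIR features, then k-means grouping.
# Assumes the Pillow library; photo_paths is a hypothetical list of image file paths.
from PIL import Image
import numpy as np
from sklearn.cluster import KMeans

def colour_histogram(path, bins=8):
    img = np.asarray(Image.open(path).convert("RGB").resize((128, 128)))
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return (hist / hist.sum()).ravel()                # normalised 512-dimensional feature vector

def group_photos(photo_paths, n_clusters=15):
    features = np.stack([colour_histogram(p) for p in photo_paths])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)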

Findings

Experimental results based on the dataset of 1,000 personal photos show that the k‐means clustering method outperforms self‐organising maps and fuzzy c‐means. That is, 12 subjects out of 30 preferred the clustering results of k‐means. In particular, most subjects agreed that larger numbers of clusters (e.g. 15 to 20) enabled more effective browsing of personal photos. For the efficiency evaluation, the clustering results using k‐means allowed subjects to search for relevant images in the least amount of time.

Originality/value

CBIR is applied in many areas, but very few related works focus on personal photo browsing and retrieval. This paper examines the applicability of using CBIR and clustering techniques for browsing personal photos. In addition, the evaluation based on the effectiveness and efficiency strategies ensures the reliability of the findings.

Details

Online Information Review, vol. 32 no. 6
Type: Research Article
ISSN: 1468-4527


Article
Publication date: 15 June 2023

Abena Owusu and Aparna Gupta

Although risk culture is a key determinant for an effective risk management, identifying the risk culture of a firm can be challenging due to the abstract concept of culture. This…

Abstract

Purpose

Although risk culture is a key determinant of effective risk management, identifying the risk culture of a firm can be challenging because culture is an abstract concept. This paper proposes a novel approach that uses unsupervised machine learning techniques to identify the significant features needed to assess and differentiate between different forms of risk culture.

Design/methodology/approach

To convert the unstructured text in the sample of banks' 10-K reports into structured data, a two-dimensional dictionary for text mining is built to capture risk culture characteristics and the bank's attitude towards those characteristics. A principal component analysis (PCA) reduction technique is applied to extract the significant features that define risk culture, before K-means unsupervised learning is used to cluster the reports into distinct risk culture groups.
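
A hedged sketch of the overall flow (dictionary-based scoring of each report, PCA reduction, K-means clustering) is given below; the dictionary entries and weighting rule are hypothetical placeholders, not the authors' two-dimensional dictionary.

# Sketch: score each 10-K report against a small two-dimensional dictionary, then PCA + k-means.
# The dictionary entries below are hypothetical placeholders, not the authors' word lists.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

RISK_TERMS = {"uncertainty": ["uncertain", "volatile"], "litigious": ["litigation", "lawsuit"],
              "constraining": ["restrict", "limit"]}
POSITIVE_ATTITUDE = {"mitigate", "strengthen", "improve"}

def score_report(text):
    words = text.lower().split()
    tone = sum(words.count(w) for w in POSITIVE_ATTITUDE)   # crude "attitude" signal
    row = []
    for terms in RISK_TERMS.values():
        hits = sum(words.count(t) for t in terms)
        row += [hits, hits * (1 + tone)]   # raw count and attitude-weighted count per characteristic
    return row

def cluster_reports(reports, n_clusters=3):
    X = np.array([score_report(r) for r in reports], dtype=float)
    Z = PCA(n_components=min(3, X.shape[1])).fit_transform(X)   # extract significant factors
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)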

Findings

The PCA identifies uncertainty, litigious and constraining sentiments among risk culture features to be significant in defining the risk culture of banks. Cluster analysis on the PCA factors proposes three distinct risk culture clusters: good, fair and poor. Consistent with regulatory expectations, a good or fair risk culture in banks is characterized by high profitability ratios, bank stability, lower default risk and good governance.

Originality/value

The relationship between culture and risk management can be difficult to study given that it is hard to measure culture from traditional data sources that are messy and diverse. This study offers a better understanding of risk culture using an unsupervised machine learning approach.

Details

International Journal of Managerial Finance, vol. 20 no. 2
Type: Research Article
ISSN: 1743-9132

