Search results

1 – 10 of over 22000
Article
Publication date: 23 August 2022

Kamlesh Kumar Pandey and Diwakar Shukla

The K-means (KM) clustering algorithm is highly sensitive to the selection of initial centroids since the initial centroid of clusters determines computational effectiveness…

Abstract

Purpose

The K-means (KM) clustering algorithm is highly sensitive to the selection of initial centroids, since the initial centroids determine computational effectiveness, efficiency and local optima issues. Numerous initialization strategies have been proposed to overcome these problems through the random or deterministic selection of initial centroids. The random initialization strategy suffers from local optimization issues and the worst clustering performance, while the deterministic initialization strategy incurs high computational cost. Big data clustering aims to reduce computation cost and improve clustering efficiency. The objective of this study is to obtain better initial centroids for big data clustering on business management data without random or deterministic initialization, thereby avoiding local optima and improving clustering efficiency and effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.

Design/methodology/approach

This study presents the Normal Distribution Probability Density K-means (NDPDKM) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm addresses the KM initialization problem through the probability density of each data point. It first identifies the most probable density data points by using the mean and standard deviation of the dataset through the normal probability density function. Thereafter, it determines the K initial centroids by using sorting and linear systematic sampling heuristics.
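
For concreteness, here is a minimal Python sketch of the initialization idea as the abstract describes it: score each point under a normal density fitted from the dataset mean and standard deviation, sort by density, and systematically sample K centroids. The per-feature density product and the sampling interval are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of NDPDKM-style centroid initialization; the density
# combination and sampling details are assumptions inferred from the abstract.
import numpy as np
from scipy.stats import norm

def ndpd_init(X, k):
    """Pick k initial centroids via normal-density sorting and
    linear systematic sampling over the data points."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    # Density of each point under a per-feature normal model,
    # combined into a single score per point.
    density = norm.pdf(X, loc=mu, scale=sigma).prod(axis=1)
    order = np.argsort(density)[::-1]   # most probable points first
    step = max(len(X) // k, 1)          # systematic-sampling interval
    return X[order[::step][:k]]

# Usage: centroids = ndpd_init(X, k=5), then run standard k-means from them.
```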

Findings

The performance of the proposed algorithm is compared with the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms through the Davies-Bouldin score, Silhouette coefficient, SD validity, S_Dbw validity, number of iterations and CPU time validation indices on eight real business datasets. The experimental evaluation demonstrates that the NDPDKM algorithm reduces iterations, local optima and computing costs, and improves cluster performance, effectiveness and efficiency with stable convergence compared to the other algorithms. The NDPDKM algorithm reduces the average computing time by up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and the average iterations by up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74%, with reference to the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms respectively.

Originality/value

The KM algorithm is the most widely used partitional clustering approach in data mining, extracting hidden knowledge, patterns and trends for decision-making strategies in business data. Business analytics is one of the applications of big data clustering in which KM clustering is useful for various subcategories of business analytics such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, complaint management, manufacturing analysis, productivity analysis, specialized employee and investor searching and other decision-making strategies in business.

Article
Publication date: 10 August 2021

Elham Amirizadeh and Reza Boostani

The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; also, the authors show that…

Abstract

Purpose

The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; the authors also show that applying this information improves clustering performance and increases the speed of network training convergence.

Design/methodology/approach

In data mining, semisupervised learning is an interesting approach because good performance can be achieved with a small subset of labeled data; one reason is that data labeling is expensive, and semisupervised learning does not need all labels. One type of semisupervised learning is constrained clustering, which does not use class labels for clustering. Instead, it uses information about pairs of instances (side information) that may be in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering has been studied extensively; however, few works have focused on constrained clustering for big datasets. In this paper, the authors present a constrained clustering method for big datasets that uses a DNN. The authors inject the constraints (ML and CL) into this DNN to promote clustering performance and call it constrained deep embedded clustering (CDEC). An autoencoder is implemented to elicit informative low-dimensional features in the latent space, and the encoder network is then retrained using a proposed Kullback–Leibler divergence objective function, which captures the constraints in order to cluster the projected samples. The proposed CDEC was compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means on the well-known MNIST, Reuters-10k and USPS datasets, and their performance was assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC over the counterparts in terms of clustering accuracy.
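
As a rough illustration of how ML/CL side information can be injected into a DEC-style KL objective, the following NumPy sketch computes a soft cluster assignment in the latent space and adds penalty terms for constrained pairs. The specific penalty forms and the weight lam are assumptions; the abstract does not give the authors' exact objective.

```python
# Illustrative sketch of a DEC-style loss with must-link / cannot-link terms,
# in the spirit of CDEC as summarized above (details are assumptions).
import numpy as np

def soft_assign(Z, centroids, alpha=1.0):
    """Student's t soft assignment q_ij of latent points to cluster centres."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1) / 2)
    return q / q.sum(axis=1, keepdims=True)

def target_dist(q):
    """Sharpened target distribution p_ij used by the KL term."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def cdec_loss(q, ml_pairs, cl_pairs, lam=0.1):
    p = target_dist(q)
    kl = (p * np.log(p / q)).sum()                            # clustering KL term
    ml = sum(((q[i] - q[j]) ** 2).sum() for i, j in ml_pairs)  # pull ML pairs together
    cl = sum((q[i] * q[j]).sum() for i, j in cl_pairs)         # push CL pairs apart
    return kl + lam * (ml + cl)
```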

Findings

First, this is the first DNN-based constrained clustering that uses side information to improve clustering performance without using labels in high-dimensional big datasets. Second, the authors define a formula to inject side information into the DNN. Third, the proposed method improves clustering performance and network convergence speed.

Originality/value

Few works have focused on constrained clustering for big datasets; likewise, studies of DNNs for clustering with a specific loss function that simultaneously extracts features and clusters the data are rare. The method improves the performance of big data clustering without using labels, which is important because data labeling is expensive and time-consuming, especially for big datasets.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 4
Type: Research Article
ISSN: 1756-378X

Open Access
Article
Publication date: 15 December 2023

Nicola Castellano, Roberto Del Gobbo and Lorenzo Leto

The concept of productivity is central to performance management and decision-making, although it is complex and multifaceted. This paper aims to describe a methodology based on…

Abstract

Purpose

The concept of productivity is central to performance management and decision-making, although it is complex and multifaceted. This paper aims to describe a methodology based on the use of Big Data in a cluster analysis combined with a data envelopment analysis (DEA) that provides accurate and reliable productivity measures in a large network of retailers.

Design/methodology/approach

The methodology is described using a case study of a leading kitchen furniture producer. More specifically, Big Data is used in a two-step analysis prior to the DEA to automatically cluster a large number of retailers into groups that are homogeneous in terms of structural and environmental factors and to assess a within-group level of productivity of the retailers.
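
A hedged sketch of the two-step pipeline follows: cluster retailers on structural and environmental factors, then compute DEA efficiency within each cluster. The input-oriented CCR model and all names below are illustrative assumptions; the paper's exact DEA specification is not given in this abstract.

```python
# Hypothetical cluster-then-DEA pipeline: k-means grouping followed by an
# input-oriented CCR efficiency score computed within each group.
import numpy as np
from scipy.optimize import linprog
from sklearn.cluster import KMeans

def ccr_efficiency(X_in, Y_out, o):
    """Input-oriented CCR efficiency of unit o (rows = units)."""
    n = X_in.shape[0]
    c = np.r_[1.0, np.zeros(n)]                      # minimize theta
    A_in = np.c_[-X_in[o], X_in.T]                   # sum_j lam_j x_ij <= theta x_io
    A_out = np.c_[np.zeros(Y_out.shape[1]), -Y_out.T]  # sum_j lam_j y_rj >= y_ro
    res = linprog(c, A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.r_[np.zeros(X_in.shape[1]), -Y_out[o]],
                  bounds=[(None, None)] + [(0, None)] * n)
    return res.x[0]

def cluster_then_dea(factors, inputs, outputs, k=4):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(factors)
    scores = np.empty(len(factors))
    for g in np.unique(labels):                      # DEA within each group
        idx = np.where(labels == g)[0]
        for pos, o in enumerate(idx):
            scores[o] = ccr_efficiency(inputs[idx], outputs[idx], pos)
    return labels, scores
```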

Findings

The proposed methodology helps reduce the heterogeneity among the units analysed, which is a major concern in DEA applications. The data-driven factorial and clustering technique allows for maximum within-group homogeneity and between-group heterogeneity by reducing the subjective bias and dimensionality embedded in the use of Big Data.

Practical implications

The use of Big Data in clustering applied to productivity analysis can provide managers with data-driven information about the structural and socio-economic characteristics of retailers' catchment areas, which is important in establishing potential productivity performance and optimizing resource allocation. The improved productivity indexes enable the setting of targets that are coherent with retailers' potential, which increases motivation and commitment.

Originality/value

This article proposes an innovative technique to enhance the accuracy of productivity measures through the use of Big Data clustering and DEA. To the best of the authors’ knowledge, no attempts have been made to benefit from the use of Big Data in the literature on retail store productivity.

Details

International Journal of Productivity and Performance Management, vol. 73 no. 11
Type: Research Article
ISSN: 1741-0401

Article
Publication date: 5 September 2016

Runhai Jiao, Shaolong Liu, Wu Wen and Biying Lin

The large volume of big data makes it impractical for traditional clustering algorithms, which are usually designed for the entire data set. The purpose of this paper is to focus on…

Abstract

Purpose

The large volume of big data makes it impractical for traditional clustering algorithms, which are usually designed for the entire data set. The purpose of this paper is to focus on incremental clustering, which divides data into a series of data chunks so that only a small amount of data needs to be clustered at a time. Few studies on incremental clustering address the problems of optimizing cluster center initialization for each data chunk and selecting multiple passing points for each cluster.

Design/methodology/approach

By optimizing initial cluster centers, the quality of the clustering results is improved for each data chunk, which in turn enhances the quality of the final clustering results. Moreover, by selecting multiple passing points, more accurate information is passed down to improve the final clustering results. A method addressing these two problems is proposed and applied in an algorithm based on the streaming kernel fuzzy c-means (stKFCM) algorithm.
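
The chunked-clustering idea can be sketched as follows, with plain k-means standing in for the kernel fuzzy c-means of the paper: each chunk is clustered starting from the previous chunk's centers, and the few points nearest each center are carried forward as passing points. The choice of nearest points as passing points is an assumption.

```python
# Minimal sketch of incremental clustering with passing points; k-means is a
# stand-in for stKFCM, and the passing-point rule is an assumption.
import numpy as np
from sklearn.cluster import KMeans

def incremental_cluster(chunks, k, n_pass=3):
    carried, centers = [], None
    for chunk in chunks:
        data = np.vstack([chunk] + carried) if carried else chunk
        km = (KMeans(n_clusters=k, init=centers, n_init=1)
              if centers is not None else KMeans(n_clusters=k, n_init=10))
        labels = km.fit_predict(data)
        centers = km.cluster_centers_
        carried = []
        for c in range(k):                  # passing points: the points
            members = data[labels == c]     # nearest each cluster centre
            d = ((members - centers[c]) ** 2).sum(1)
            carried.append(members[np.argsort(d)[:n_pass]])
    return centers
```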

Findings

Experimental results show that the proposed algorithm is more accurate and performs better than the stKFCM algorithm.

Originality/value

This paper addresses the problem of improving the performance of incremental clustering by optimizing cluster center initialization and selecting multiple passing points. The paper analyses the performance of the proposed scheme and demonstrates its effectiveness.

Details

Kybernetes, vol. 45 no. 8
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 3 October 2016

Philipp Max Hartmann, Mohamed Zaki, Niels Feldmann and Andy Neely

The purpose of this paper is to derive a taxonomy of business models used by start-up firms that rely on data as a key resource for business, namely data-driven business models…

Abstract

Purpose

The purpose of this paper is to derive a taxonomy of business models used by start-up firms that rely on data as a key resource for business, namely data-driven business models (DDBMs). By providing a framework to systematically analyse DDBMs, the study provides an introduction to DDBM as a field of study.

Design/methodology/approach

To develop the taxonomy of DDBMs, business model descriptions of 100 randomly chosen start-up firms were coded using a DDBM framework derived from literature, comprising six dimensions with 35 features. Subsequent application of clustering algorithms produced six different types of DDBM, validated by case studies from the study’s sample.
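
As an illustration of the taxonomy step, the sketch below clusters firms coded on binary business-model features with hierarchical clustering; the Jaccard distance, average linkage and random stand-in data are assumptions, since the abstract names the clustering step but not its settings.

```python
# Hypothetical sketch: 100 firms coded on 35 binary DDBM features, grouped
# into six types by hierarchical clustering (settings are assumptions).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

features = np.random.randint(0, 2, size=(100, 35))   # stand-in coded data
dist = pdist(features, metric="jaccard")             # binary dissimilarity
types = fcluster(linkage(dist, method="average"), t=6, criterion="maxclust")
```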

Findings

The taxonomy derived from the research consists of six different types of DDBM among start-ups. These types are characterised by a subset of six of nine clustering variables from the DDBM framework.

Practical implications

A major contribution of the paper is the designed framework, which stimulates thinking about the nature and future of DDBMs. The proposed taxonomy will help organisations to position their activities in the current DDBM landscape. Moreover, framework and taxonomy may lead to a DDBM design toolbox.

Originality/value

This paper develops a basis for understanding how start-ups build business models to capture value from data as a key resource, adding a business perspective to the discussion of big data. By offering the scientific community a specific framework of business model features and a subsequent taxonomy, the paper provides reference points and serves as a foundation for future studies of DDBMs.

Details

International Journal of Operations & Production Management, vol. 36 no. 10
Type: Research Article
ISSN: 0144-3577

Article
Publication date: 31 May 2022

Jianfang Qi, Yue Li, Haibin Jin, Jianying Feng and Weisong Mu

The purpose of this study is to propose a new consumer value segmentation method for low-dimensional dense market datasets to quickly detect and cluster the most profitable…

Abstract

Purpose

The purpose of this study is to propose a new consumer value segmentation method for low-dimensional dense market datasets to quickly detect and cluster the most profitable customers for the enterprises.

Design/methodology/approach

In this study, the comprehensive segmentation bases (CSB) with richer meanings were obtained by introducing the weighted recency-frequency-monetary (RFM) model into the common segmentation bases (SB). Further, a new market segmentation method, the CSB-MBK algorithm was proposed by integrating the CSB model and the mini-batch k-means (MBK) clustering algorithm.
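
A minimal sketch of the described pipeline, assuming illustrative RFM weights and min-max scaling: a weighted RFM score is appended to the common segmentation bases and the result is clustered with mini-batch k-means.

```python
# Hypothetical CSB-MBK sketch; the weights w and scaling are assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import minmax_scale

def csb_mbk(sb, recency, frequency, monetary, k=4, w=(0.2, 0.3, 0.5)):
    # Weighted RFM: scale each component to [0, 1], invert recency
    # (recent buyers score higher), then combine with the weights.
    rfm = np.column_stack([
        1 - minmax_scale(recency),
        minmax_scale(frequency),
        minmax_scale(monetary),
    ]) * np.asarray(w)
    csb = np.column_stack([minmax_scale(sb, axis=0), rfm])  # CSB = SB + wRFM
    return MiniBatchKMeans(n_clusters=k, n_init=10).fit_predict(csb)
```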

Findings

The results show that the proposed CSB model can reflect consumers' contributions to a market and improve clustering performance. Moreover, the proposed CSB-MBK algorithm is demonstrably superior to the SB-MBK, CSB-KMA and CSB-Chameleon algorithms with respect to the Silhouette Coefficient (SC), the Calinski-Harabasz (CH) index and the average running time, and superior to the SB-MBK, RFM-MBK and WRFM-MBK algorithms in terms of inter-market value and characteristic differentiation.

Practical implications

This paper provides a tool for decision-makers and marketers to segment a market quickly, which can help them grasp consumers' activity, loyalty, purchasing power and other characteristics in a target market in a timely manner and achieve precision marketing.

Originality/value

This study is the first to introduce the CSB-MBK algorithm for identifying valuable customers through the comprehensive consideration of the clustering quality, consumer value and segmentation speed. Moreover, the CSB-MBK algorithm can be considered for applications in other markets.

Details

Kybernetes, vol. 52 no. 10
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 1 February 2021

Narasimhulu K, Meena Abarna KT and Sivakumar B

The purpose of the paper is to study the multiple viewpoints required to access more informative similarity features among tweet documents, which is useful for…

Abstract

Purpose

The purpose of the paper is to study the multiple viewpoints required to access more informative similarity features among tweet documents, which is useful for achieving robust tweet-clustering results.

Design/methodology/approach

Let N be the number of tweet documents for topic extraction. In the initial preprocessing step, unwanted text, punctuation and other symbols are removed, and tokenization and stemming are performed. Bag-of-features representations are determined for the tweets, and the tweets are then modelled with the obtained bag-of-features during topic extraction. Approximate topic features are extracted for every tweet document, and these sets of topic features for the N documents are treated as multi-viewpoints. The key idea of the proposed work is to use multi-viewpoints in the similarity feature computation. For example, with five tweet documents (N = 5) defined in a projected space with five viewpoints v1, v2, v3, v4 and v5, the similarity features between two documents (viewpoints v1 and v2) are computed with respect to the other three viewpoints (v3, v4 and v5), unlike the single viewpoint of the traditional cosine metric.
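
The multi-viewpoint idea can be written compactly: the similarity of two documents is the average cosine measured from every other document taken as the origin, instead of the single fixed origin of the standard cosine metric. A small sketch under that reading:

```python
# Illustrative multi-viewpoint cosine similarity; this reading of the
# computation is inferred from the abstract, not the authors' exact code.
import numpy as np

def mv_similarity(D, i, j):
    """Average cosine between documents i and j seen from all other
    documents in the topic-feature matrix D (rows = documents)."""
    sims = []
    for h in range(len(D)):
        if h in (i, j):
            continue
        a, b = D[i] - D[h], D[j] - D[h]   # shift origin to viewpoint h
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return np.mean(sims)
```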

Findings

The approach is applied to healthcare problems with tweets data. Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing term frequency-inverse document frequency (TF-IDF) for unlabelled tweets.

Originality/value

Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing TF-IDF for unlabelled tweets.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 14 no. 2
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 8 February 2021

Gianluca Solazzo, Gianluca Elia and Giuseppina Passiante

This study aims to investigate the Big Social Data (BSD) paradigm, which still lacks a clear and shared definition, and causes a lack of clarity and understanding about its…

Abstract

Purpose

This study aims to investigate the Big Social Data (BSD) paradigm, which still lacks a clear and shared definition, and causes a lack of clarity and understanding about its beneficial opportunities for practitioners. In the knowledge management (KM) domain, a clear characterization of the BSD paradigm can lead to more effective and efficient KM strategies, processes and systems that leverage a huge amount of structured and unstructured data sources.

Design/methodology/approach

The study adopts a systematic literature review (SLR) methodology based on a mixed analysis approach (unsupervised machine learning and human-based) applied to 199 research articles on BSD topics extracted from Scopus and Web of Science. In particular, machine learning processing has been implemented by using topic extraction and hierarchical clustering techniques.
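
The unsupervised half of the mixed analysis might look like the sketch below: topic extraction over the article texts followed by hierarchical clustering of the topic loadings. NMF is used here as a stand-in topic extractor; the abstract names the techniques but not the implementations.

```python
# Hypothetical sketch of the machine-learning processing step: TF-IDF
# features, topic extraction, then hierarchical clustering of articles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import AgglomerativeClustering

def topics_then_clusters(abstracts, n_topics=10, n_clusters=4):
    tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
    X = tfidf.fit_transform(abstracts)
    topics = NMF(n_components=n_topics, init="nndsvd").fit_transform(X)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(topics)
    return topics, labels
```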

Findings

The paper provides a threefold contribution: a conceptualization and consensual definition of the BSD paradigm through the identification of four key conceptual pillars (i.e. sources, properties, technology and value exploitation); a characterization of the taxonomy of BSD data types that extends previous works on this topic; and a research agenda for future studies on BSD and its applications from a KM perspective.

Research limitations/implications

The main limits of the research lie in the list of articles considered for the literature review, which could be enlarged by considering further sources (in addition to Scopus and Web of Science), further languages (in addition to English) and/or further years (the review considers papers published until 2018). Research implications concern the development of a research agenda organized along five thematic issues, which can feed future research to deepen the paradigm of BSD and explore linkages with the KM field.

Practical implications

Practical implications concern the usage of the proposed definition of BSD to purposefully design applications and services based on BSD in knowledge-intensive domains to generate value for citizens, individuals, companies and territories.

Originality/value

The original contribution concerns the definition of the Big Social Data paradigm built through an SLR that combines machine learning processing and human-based processing. Moreover, the research agenda deriving from the study contributes to investigating the BSD paradigm in the wider domain of KM.

Details

Journal of Knowledge Management, vol. 25 no. 7
Type: Research Article
ISSN: 1367-3270

Open Access
Article
Publication date: 28 April 2022

Manuel Pedro Rodríguez Bolívar and Laura Alcaide Muñoz

This study aims to conduct performance and clustering analyses with the help of the Digital Government Reference Library (DGRL) v16.6 database, examining the role of emerging…

Abstract

Purpose

This study aims to conduct performance and clustering analyses with the help of the Digital Government Reference Library (DGRL) v16.6 database, examining the role of emerging technologies (ETs) in public services delivery.

Design/methodology/approach

VOSviewer and SciMAT techniques were used for clustering and mapping the use of ETs in public services delivery. Collecting documents from the DGRL v16.6 database, the paper uses text mining analysis to identify key terms and trends in e-Government research regarding ETs and public services.
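
As a simplified picture of what co-occurrence mapping tools such as VOSviewer do internally, the sketch below counts how often key terms appear together across document records and clusters the co-occurrence profiles; this is an illustrative reduction, not the tools' actual algorithms.

```python
# Illustrative term co-occurrence clustering (a simplification of the
# mapping performed by bibliometric tools; settings are assumptions).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

def term_cooccurrence_clusters(records, n_terms=50, k=4):
    counts = CountVectorizer(max_features=n_terms, binary=True,
                             stop_words="english").fit_transform(records)
    cooc = (counts.T @ counts).toarray()   # term-by-term co-occurrence matrix
    np.fill_diagonal(cooc, 0)              # ignore self co-occurrence
    return KMeans(n_clusters=k, n_init=10).fit_predict(cooc)
```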

Findings

The analysis indicates that all ETs are strongly linked to each other, except for blockchain technologies (due to their disruptive nature), which indicates that ETs can therefore be seen as cumulative knowledge. In addition, on the whole, the findings identify four stages in the evolution of ETs and their application to public services: the “electronic administration” stage, the “technological baseline” stage, the “managerial” stage and the “disruptive technological” stage.

Practical implications

The output of the present research will help to orient policymakers in the implementation and use of ETs, evaluating the influence of these technologies on public services.

Social implications

The research helps researchers to track research trends and uncover new paths on ETs and their implementation in public services.

Originality/value

Recent research has focused on the need to implement ETs to improve public services, which could help cities improve citizens’ quality of life in urban areas. This paper contributes to expanding knowledge about ETs and their implementation in public services, identifying trends and networks in research on these issues.

Details

Information Technology & People, vol. 37 no. 8
Type: Research Article
ISSN: 0959-3845

Article
Publication date: 19 October 2020

Asefeh Asemi and Andrea Ko

The present study aims to determine the infoecology of scientific articles in the field of smart manufacturing (SM). The researchers designed a general framework for the…

Abstract

Purpose

The present study aims to determine the infoecology of scientific articles in the field of smart manufacturing (SM). The researchers designed a general framework for the investigation of infoecology.

Design/methodology/approach

Qualitative and quantitative data collection methods are applied to collect data from Scopus and from experts. Bibliometric techniques, clustering and graph mining are applied to analyse the data using Scopus data analysis tools, VOSviewer and Excel.

Findings

It is concluded that researchers paid more attention to “Flow Control”, “Embedded Systems”, “IoT”, “Big Data” and “Cyber-Physical System” than to other infocenoses. Finally, a thematic model is presented based on the infoecology of SM in Scopus for future studies. As future work, designing a “research-related” metamodel for SM would be beneficial for researchers, to highlight the main future research directions.

Practical implications

The results of the present study can be applied to the following issues: (1) making decisions based on research and scientific evidence and conducting scientific research on real needs and issues in the field of SM; (2) holding workshops on infoecology to determine research priorities with the presence of experts from related industries; (3) determining the most important areas of research in order to improve the index of applied research; (4) assisting in prioritizing research in the field of SM to select a set of research and technological activities and allocate resources effectively to them; (5) helping to strengthen the relationship between research and technological activities and the economic and long-term goals of industry and society; (6) helping to prioritize SM issues in research and technology in order to target the allocation of financial and human capital, solve the main challenges and take advantage of opportunities; (7) helping to avoid fragmentation of work and providing educational infrastructure based on prioritized research needs; and (8) helping start-ups and the activities of knowledge-based companies based on research priorities in the field of SM.

Originality/value

The analysis results demonstrated that the information ecosystem of SM studies dynamically developed over time. The continuous conduction flow of scientific studies in this field brought continuous changes into the infoecology of this field.
