Search results

1 – 10 of over 33000
Article
Publication date: 9 February 2018

Arshad Ahmad, Chong Feng, Shi Ge and Abdallah Yousif

Abstract

Purpose

Software developers extensively use Stack Overflow (SO) for knowledge sharing on software development. Thus, software engineering researchers have started mining the structured/unstructured data present in certain software repositories, including the Q&A software developer community SO, with the aim of improving software development. The purpose of this paper is to show how academics and practitioners can benefit from the valuable user-generated content shared on various online social networks, specifically the Q&A community SO, for software development.

Design/methodology/approach

A comprehensive literature review was conducted, and 166 research papers on SO related to software development, published from the inception of SO until June 2016, were categorized.

Findings

Most of the studies revolve around a limited number of software development tasks; approximately 70 percent of the papers used data from millions of posts, applied basic machine learning methods, and conducted semi-automatic, quantitative investigations. Thus, future research should focus on overcoming the identified challenges and gaps.

Practical implications

The work on SO is classified into two main categories: “SO design and usage” and “SO content applications.” These categories not only give Q&A forum providers insight into the shortcomings in the design and usage of such forums but also provide ways to overcome them in future. They also enable software developers to exploit such forums for the identified under-utilized tasks of software development.

Originality/value

The study is the first of its kind to explore the work on SO about software development and makes an original contribution by presenting a comprehensive review, design/usage shortcomings of Q&A sites, and future research challenges.

Details

Data Technologies and Applications, vol. 52 no. 2
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 19 May 2020

Praveen Kumar Gopagoni and Mohan Rao S K

Abstract

Purpose

Association rule mining generates patterns and correlations from a database, which requires long scanning times, and the computational cost of generating the rules is high. Moreover, candidate rule generation in traditional association rule mining is costly in time and space, and the process is lengthy. To tackle these issues in the existing methods and to produce privacy rules, the paper proposes grid-based privacy association rule mining.

Design/methodology/approach

The primary intention of the research is to design and develop a distributed elephant herding optimization (EHO) for grid-based privacy association rule mining from the database. The proposed rule generation proceeds in two steps. In the first step, the rules are generated using the Apriori algorithm, an effective association rule mining algorithm. In general, the extraction of association rules from the input database is based on confidence and support; in the proposed model, these are replaced with probability-based confidence and holo-entropy. In the second step, the generated rules are passed to the grid-based privacy rule mining, which produces privacy-dependent rules based on a novel optimization algorithm and grid-based fitness. The novel optimization algorithm is developed by integrating the distributed concept into the EHO algorithm.
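For readers unfamiliar with the two standard measures mentioned above, the following is a minimal sketch of support and confidence for a single association rule. It illustrates only the conventional definitions, not the paper's probability-based confidence or holo-entropy, which the abstract does not specify. The function names and the toy transactions are assumptions made for illustration.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str],
               consequent: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """support(A union C) / support(A); 0.0 if the antecedent never occurs."""
    base = support(antecedent, transactions)
    if base == 0.0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


# Hypothetical toy database (not taken from the paper's data sets).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

# Rule under test: {bread} -> {milk}
rule_support = support(frozenset({"bread", "milk"}), transactions)
rule_confidence = confidence(frozenset({"bread"}), frozenset({"milk"}), transactions)
print(rule_support, rule_confidence)
```

A rule is typically kept only when both values exceed user-chosen thresholds.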

Findings

The method was evaluated on databases from the Frequent Itemset Mining Dataset Repository, namely retail, chess, T10I4D100K and T40I10D100K, to demonstrate the effectiveness of the distributed grid-based privacy association rule mining. The proposed method outperformed the existing methods by offering a higher degree of privacy and utility; moreover, the distributed nature of the association rule mining facilitates parallel processing and generates the privacy rules without much computational burden. The hiding capacity rate, the information preservation rate and the rate of false rules generated for the proposed method are 0.4468, 0.4488 and 0.0654, respectively, which is better than the existing rule mining methods.

Originality/value

Data mining is performed in a distributed manner through grids that subdivide the input data, and the rules are framed using Apriori-based association mining, a modification of the standard Apriori algorithm in which holo-entropy and probability-based confidence replace support and confidence. The mined rules do not assure privacy; hence, grid-based privacy rules are employed that utilize the adaptive elephant herding optimization (AEHO) for generating the privacy rules. The AEHO introduces an adaptive nature into the standard EHO, which renders the global optimal solution.

Details

Data Technologies and Applications, vol. 54 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 19 December 2022

Sukjin You, Soohyung Joo and Marie Katsurai

Abstract

Purpose

The purpose of this study is to explore the extent to which data mining research is associated with the library and information science (LIS) discipline. This study aims to identify data mining related subject terms and topics in representative LIS scholarly publications.

Design/methodology/approach

A large set of more than 38,000 bibliographic records was collected from a scholarly database representing the fields of LIS and data mining, respectively. A multitude of text mining techniques, such as influential term analysis and Dirichlet multinomial regression topic modeling, were applied to investigate prevailing subject terms and research topics.

Findings

The findings of this study revealed the relationship between the LIS and data mining research domains. Various data mining method terms were observed in recent LIS publications, such as machine learning, artificial intelligence and neural networks. The topic modeling result identified prevailing data mining related research topics in LIS, such as machine learning, deep learning and big data, among others. In addition, this study investigated the trends of popular topics in LIS over the recent decade.

Originality/value

This investigation is one of a few studies that empirically investigated the relationships between the LIS and data mining research domains. Multiple text mining techniques were employed to delineate the extent to which the two research domains are associated with each other, based on both term-level and topic-level analysis. Methodologically, the study identified influential terms in each domain using multiple feature selection indices. In addition, Dirichlet multinomial regression was applied to explore LIS topics in relation to data mining.

Details

Aslib Journal of Information Management, vol. 76 no. 1
Type: Research Article
ISSN: 2050-3806

Article
Publication date: 7 August 2017

Yanyan Wang and Jin Zhang

Abstract

Purpose

Data mining has been a popular research area in the past decades. Many researchers study data-mining theories, methods, applications and trends; however, there are very few studies on data-mining-related topics in social media. This paper aims to explore the topics related to data mining based on the data collected from Wikipedia.

Design/methodology/approach

In total, 402 data-mining-related articles were obtained from Wikipedia. These articles were manually classified into several categories by the coding method. Each category formed an article-term matrix. These matrices were analysed and visualized by the self-organizing map approach. Several clusters were observed in each category. Finally, the topics of these clusters were extracted by content analysis.

Findings

The articles obtained were classified into six categories: applications, foundation and concepts, methodologies, organizations, related fields and topics and technology support. Business, biology and security were the three prominent topics of the applications category. The technologies supporting data mining were software, systems, databases, programming languages and so forth. The general public was more interested in data-mining organizations than the researchers. They also focused on the applications of data mining in business more than in other fields.

Originality/value

This study will help researchers gain insight into the general public’s perceptions of data mining and discover the gap between the general public and themselves. It will assist researchers in finding new techniques and methods which will potentially provide them with new data-mining methods and research topics.

Details

The Electronic Library, vol. 35 no. 4
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 3 November 2023

Nihan Yildirim, Derya Gultekin, Cansu Hürses and Abdullah Mert Akman

Abstract

Purpose

This paper aims to use text mining methods to explore the similarities and differences between countries’ national digital transformation (DT) and Industry 4.0 (I4.0) policies. The study examines the applicability of text mining as an alternative for comprehensive clustering of national I4.0 and DT strategies, encouraging policy researchers toward data science that can offer rapid policy analysis and benchmarking.

Design/methodology/approach

With an exploratory research approach, topic modeling, principal component analysis and unsupervised machine learning algorithms (k-means and hierarchical clustering) are used for clustering national I4.0 and DT strategies. This paper uses a corpus of policy documents and related scientific publications from several countries and integrates their science and technology performance. The paper also presents the positioning of Türkiye’s I4.0 and DT national policy as a case from a developing country context.

Findings

Text mining provides meaningful clustering results on similarities and differences between countries regarding their national I4.0 and DT policies, aligned with their geographic, economic and political circumstances. Findings also shed light on the DT strategic landscape and the key themes spanning various policy dimensions. Drawing from the Turkish case, political options are discussed in the context of developing (follower) countries’ I4.0 and DT.

Practical implications

The paper reveals meaningful clustering results on similarities and differences between countries regarding their national I4.0 and DT policies, reflecting political proximities aligned with their geographic, economic and political circumstances. This can help policymakers to comparatively understand national DT and I4.0 policies and use this knowledge to reflect collaborative and competitive measures in their policies.

Originality/value

This paper provides a unique combined methodology for text mining-based policy analysis in the DT context, which has not previously been adopted. In an era where computational social science and machine learning have gained importance and adaptability in the political and social science fields, and in the technology and innovation management discipline, clustering applications showed similar and different policy patterns in a timely and unbiased manner.

Details

Journal of Science and Technology Policy Management, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2053-4620

Article
Publication date: 1 March 1999

Vijayan Sugumaran and Ranjit Bose

Abstract

There is a tremendous explosion in the amount of data that organizations generate, collect and store. Managers are beginning to recognize the value of this asset, and are increasingly relying on intelligent systems to access, analyze, summarize, and interpret information from large and multiple data sources. These systems help them make critical business decisions faster or with a greater degree of confidence. Data mining is a promising new technology that helps bring business intelligence into these systems. While there is a plethora of data mining techniques and tools available, they present inherent problems for end‐users such as complexity, required technical expertise, lack of flexibility and interoperability, etc. These problems can be mitigated by deploying software agents to assist end‐users in their problem solving endeavors. This paper presents the design and development of an intelligent software agent based data analysis and mining environment called IDM, which is utilized in decision making activities.

Details

Industrial Management & Data Systems, vol. 99 no. 2
Type: Research Article
ISSN: 0263-5577

Article
Publication date: 13 September 2019

Zirui Jia and Zengli Wang

Abstract

Purpose

Frequent itemset mining (FIM) is a basic topic in data mining. Most FIM methods build an itemset database containing all possible itemsets, and use predefined thresholds to determine whether an itemset is frequent. However, the algorithm has some deficiencies. It is better suited to discrete data than to ordinal/continuous data, which may result in computational redundancy, and some of the results are difficult to interpret. The purpose of this paper is to shed light on this gap by proposing a new data mining method.
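The threshold-based procedure described above can be sketched as a level-wise search over discrete items. This is a minimal, generic illustration of conventional FIM, assuming a relative minimum-support threshold; it is not the regression pattern method the paper proposes, and the function name and toy baskets are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every itemset whose relative support is at least `min_support`."""
    n = len(transactions)
    result: Dict[FrozenSet[str], float] = {}
    if n == 0:
        return result

    # Level 1 candidates: every distinct single item.
    items = sorted({item for t in transactions for item in t})
    current = [frozenset({i}) for i in items]
    size = 1

    while current:
        survivors = []
        for candidate in current:
            sup = sum(1 for t in transactions if candidate <= t) / n
            if sup >= min_support:
                result[candidate] = sup
                survivors.append(candidate)

        size += 1
        # Join step: unions of two survivors that contain exactly `size` items.
        joined = {a | b for a, b in combinations(survivors, 2) if len(a | b) == size}
        # Prune step: keep a candidate only if all its smaller subsets are frequent.
        current = [
            c for c in joined
            if all(frozenset(s) in result for s in combinations(c, size - 1))
        ]
    return result


# Hypothetical toy baskets.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
found = frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The containment test `candidate <= t` only makes sense for discrete items, which is one way to see why ordinal or continuous attributes need a different treatment.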

Design/methodology/approach

A regression pattern (RP) model is introduced, in which the regression model and the FIM method are combined to solve the existing problems. Using survey data from a computer technology and software professional qualification examination, the multiple linear regression model is selected to mine associations between items.

Findings

Some interesting associations are mined by the proposed algorithm, and the results show that the proposed method can be applied to ordinal/continuous data mining. The experiment with the RP model shows that, compared to FIM, computational redundancy decreased and the results contain more information.

Research limitations/implications

The proposed algorithm is designed for ordinal/continuous data and is expected to provide inspiration for data stream mining and unstructured data mining.

Practical implications

Compared to FIM, which mines associations between discrete items, the RP model can mine associations between ordinal/continuous data sets. Importantly, the RP model performs well in saving computational resources and mining meaningful associations.

Originality/value

The proposed algorithms provide a novel view for defining and mining associations.

Details

Data Technologies and Applications, vol. 54 no. 3
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 1 June 1999

Michael L. Gargano and Bel G. Raggad

Abstract

Data mining can discover information hidden within valuable data assets. Knowledge discovery, using advanced information technologies, can uncover veins of surprising, golden insights in a mountain of factual data. Data mining consists of a panoply of powerful tools which are intuitive, easy to explain, understandable, and simple to use. These advanced information technologies include artificial intelligence methods (e.g. expert systems, fuzzy logic, etc.), decision trees, rule induction methods, genetic algorithms and genetic programming, neural networks (e.g. backpropagation, associate memories, etc.), and clustering techniques. The synergy created between data warehousing and data mining allows knowledge seekers to leverage their massive data assets, thus improving the quality and effectiveness of their decisions. The growing requirements for data mining and real time analysis of information will be a driving force in the development of new data warehouse architectures and methods and, conversely, the development of new data mining methods and applications.

Details

OCLC Systems & Services: International digital library perspectives, vol. 15 no. 2
Type: Research Article
ISSN: 1065-075X

Article
Publication date: 27 March 2009

Sandra S. Liu and Jie Chen

Abstract

Purpose

This paper aims to provide an example of how to use data mining techniques to identify patient segments regarding preferences for healthcare attributes and their demographic characteristics.

Design/methodology/approach

Data were derived from a number of individuals who received in‐patient care at a health network in 2006. Data mining and conventional hierarchical clustering with average linkage and Pearson correlation procedures are employed and compared to show how each procedure best determines segmentation variables.
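As a rough illustration of the conventional procedure named above, the following sketch performs agglomerative clustering with average linkage, using one minus the Pearson correlation as the distance between two respondents' preference profiles. The abstract does not give the variables or data, so the profiles below are hypothetical ratings, and the function names are assumptions.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either profile has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def average_linkage(profiles: List[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Merge the two closest clusters until `n_clusters` remain."""
    clusters = [[i] for i in range(len(profiles))]

    def dist(i: int, j: int) -> float:
        return 1.0 - pearson(profiles[i], profiles[j])

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members.
                total = sum(dist(i, j) for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical ratings of four healthcare attributes by four patients.
profiles = [
    [5, 4, 1, 1],
    [4, 5, 2, 1],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
]
segments = average_linkage(profiles, n_clusters=2)
print(segments)
```

This naive version recomputes distances at every merge, which is acceptable for a toy example but not for a large patient data set.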

Findings

Data mining tools identified three differentiable segments by means of cluster analysis. These three clusters have significantly different demographic profiles.

Practical implications

The study reveals, when compared with traditional statistical methods, that data mining provides an efficient and effective tool for market segmentation. When there are numerous cluster variables involved, researchers and practitioners need to incorporate factor analysis for reducing variables to clearly and meaningfully understand clusters.

Originality/value

Interest in and applications of data mining are increasing in many businesses. However, this technology is seldom applied to healthcare customer experience management. The paper shows that efficient and effective application of data mining methods can aid the understanding of patient healthcare preferences.

Details

International Journal of Health Care Quality Assurance, vol. 22 no. 2
Type: Research Article
ISSN: 0952-6862

Article
Publication date: 13 December 2019

Yang Li and Xuhua Hu

Abstract

Purpose

The purpose of this paper is to solve the problem of information privacy and security of social users. Mobile internet and social networks are more and more deeply integrated into people’s daily lives. Especially under the rapid development of the Internet of Things and diversified personalized services, more and more private information of social users is exposed to the network environment, actively or unintentionally. In addition, a large amount of social network data not only brings more benefits to network application providers, but also provides motivation for malicious attackers. Therefore, in the social network environment, research on the privacy protection of user information has great theoretical and practical significance.

Design/methodology/approach

In this study, based on social network analysis and the attribute reduction idea of rough set theory, a generalized reduction concept based on multi-level rough sets was proposed from the perspectives of positive region, information entropy and knowledge granularity. Furthermore, a traversal was performed on the basis of the hierarchical compatible granularity space of the original information system, and the corresponding attribute values were coarsened. The selected test data sets were tested, and the experimental results were analyzed.
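The idea of coarsening attribute values to meet an anonymity requirement can be illustrated with a generic k-anonymity check. This is only a sketch under stated assumptions: it does not implement the paper's rough set based algorithm, and the records, the attribute names and the fixed-width age bands are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

Record = Dict[str, object]


def coarsen_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a band such as '20-29' (one generalization level)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records: List[Record],
                   quasi_identifiers: Sequence[str],
                   k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical user records.
raw = [
    {"age": 23, "city": "Leeds"},
    {"age": 27, "city": "Leeds"},
    {"age": 34, "city": "York"},
    {"age": 38, "city": "York"},
]
coarsened = [{**r, "age": coarsen_age(r["age"])} for r in raw]

raw_ok = is_k_anonymous(raw, ("age", "city"), k=2)
coarse_ok = is_k_anonymous(coarsened, ("age", "city"), k=2)
print(raw_ok, coarse_ok)
```

Coarser bands make the anonymity requirement easier to satisfy but leave less detail for classification, which is the trade-off the study's findings address.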

Findings

The results showed that the algorithm can guarantee the anonymity requirement of data publishing and improve the effect of classification modeling on anonymous data in social network environment.

Research limitations/implications

In the test and verification of privacy protection algorithm and privacy protection scheme, the efficiency of algorithm and scheme needs to be tested on a larger data scale. However, the data in this study are not enough. In the following research, more data will be used for testing and verification.

Practical implications

In the context of social networks, the hierarchical structure of data is introduced into rough set theory as domain knowledge by referring to the human granulation cognitive mechanism, and rough set modeling is studied for the complex hierarchical data of decision tables. The theoretical research results are applied to hierarchical decision rule mining and k-anonymous privacy protection data mining research, which enriches the connotation of rough set theory and has important theoretical and practical significance for further promoting the application of this theory. In addition, by combining the theory of secure multi-party computing with the theory of attribute reduction in rough sets, a privacy protection feature selection algorithm for multi-source decision tables is proposed, which solves the privacy protection problem of feature selection in a distributed environment. It provides an effective rough set feature selection method for privacy protection classification mining in a distributed environment, which has practical application value for promoting the development of privacy protection data mining.

Originality/value

In this study, the proposed algorithm and scheme can effectively protect the privacy of social network data, ensure the availability of social network graph structure and realize the need of both protection and sharing of user attributes and relational data.

Details

Library Hi Tech, vol. 40 no. 1
Type: Research Article
ISSN: 0737-8831
